Research agenda
A research program, not a single result
PWM-Bench is built so that hard questions about person-specific forecasting become measurable. The open questions below motivate the benchmark; the illustrative experiments show how the same machinery answers them.
Open questions
Q1 Evidence sufficiency
Which evidence tiers are sufficient to beat the personal routine baseline (R2) on each task family?
Q2 Decay
How quickly does person-specific skill decay as evidence ages without refresh?
Q3 Sparsity
How sparse can evidence become before skill collapses to the routine baseline?
Q4 Modality value
Which modalities (text, audio, visual, location, physiology) carry marginal predictive value, and for which tasks?
Q5 Refresh rate
How often must evidence be refreshed to maintain an accurate estimate of an evolving state?
Q6 Transition anticipation
Can any system anticipate goal-state transitions (PWM-X) before they are behaviorally obvious?
Q7 Observation vs self-report
Does passive observation add skill beyond what the participant would self-report?
Q8 Cross-domain transfer
Does skill learned in one life domain transfer to forecasting in another?
Illustrative experiments
| # | Experiment | Questions | Design | Outcome |
|---|---|---|---|---|
| A | Observation duration | Q2, Q5 | Vary the length of the observation window feeding a system and measure skill as a function of how much history it has seen. | A skill-vs-observation-duration curve per task family. |
| B | Modality ablation | Q4 | Hold the system fixed and remove one modality at a time from the evidence stream. | Marginal skill attributable to each modality, per task. |
| C | Attention forecasting | Q1, Q4 | Focus on T4 (attention allocation) across evidence tiers L0→L3. | The evidence tier at which attention forecasting first beats R2. |
| D | Goal-transition detection | Q6 | Score PWM-X forecasts around known transition points and measure lead time. | Whether transitions are anticipated, and with what lead time and calibration. |
| E | Passive vs self-report | Q7 | Compare a passive-observation system against a self-report-only system on identical questions. | The marginal skill of observation over self-report. |
| F | Evidence-refresh rate | Q2, Q5 | Throttle how frequently a system's evidence is refreshed and measure skill decay between refreshes. | A required-refresh-rate estimate to maintain a target skill level. |
| G | Personalized vs population | Q1, Q8 | Contrast person-specific models against population models on the same forecasts, including under identity permutation. | The person-specific skill that survives permutation — the quantity PWM-Bench is built to measure. |
Open questions and experiments render from data/experiments.json.
Join the program
PWM-Bench is intended as shared infrastructure. Labs can contribute systems, evidence-processing tools, privacy infrastructure, or experimental designs that extend the open questions above. If your group works on personalization, world models, calibrated forecasting, or privacy-preserving ML, the protocol is designed to be something you can build on. See Participate.