Research agenda

A research program, not a single result

PWM-Bench is built so that hard questions about person-specific forecasting become measurable. The open questions below motivate the benchmark; the illustrative experiments show how the same machinery answers them.

Open questions

Q1 Evidence sufficiency

Which evidence tiers are sufficient to beat the personal routine baseline (R2) on each task family?

Q2 Decay

How quickly does person-specific skill decay as evidence ages without refresh?

Q3 Sparsity

How sparse can evidence become before skill collapses to the routine baseline?

Q4 Modality value

Which modalities (text, audio, visual, location, physiology) carry marginal predictive value, and for which tasks?

Q5 Refresh rate

How often must evidence be refreshed to maintain an accurate estimate of an evolving state?

Q6 Transition anticipation

Can any system anticipate goal-state transitions (PWM-X) before they are behaviorally obvious?

Q7 Observation vs self-report

Does passive observation add skill beyond what the participant would self-report?

Q8 Cross-domain transfer

Does skill learned in one life domain transfer to forecasting in another?

Illustrative experiments

#	Experiment	Questions	Design	Outcome
A	Observation duration	Q2, Q5	Vary the length of the observation window feeding a system and measure skill as a function of how much history it has seen.	A skill-vs-observation-duration curve per task family.
B	Modality ablation	Q4	Hold the system fixed and remove one modality at a time from the evidence stream.	Marginal skill attributable to each modality, per task.
C	Attention forecasting	Q1, Q4	Focus on T4 (attention allocation) across evidence tiers L0→L3.	The evidence tier at which attention forecasting first beats R2.
D	Goal-transition detection	Q6	Score PWM-X forecasts around known transition points and measure lead time.	Whether transitions are anticipated, and with what lead time and calibration.
E	Passive vs self-report	Q7	Compare a passive-observation system against a self-report-only system on identical questions.	The marginal skill of observation over self-report.
F	Evidence-refresh rate	Q2, Q5	Throttle how frequently a system's evidence is refreshed and measure skill decay between refreshes.	A required-refresh-rate estimate to maintain a target skill level.
G	Personalized vs population	Q1, Q8	Contrast person-specific models against population models on the same forecasts, including under identity permutation.	The person-specific skill that survives permutation — the quantity PWM-Bench is built to measure.

Open questions and experiments render from data/experiments.json.

Join the program

PWM-Bench is intended as shared infrastructure. Labs can contribute systems, evidence-processing tools, privacy infrastructure, or experimental designs that extend the open questions above. If your group works on personalization, world models, calibrated forecasting, or privacy-preserving ML, the protocol is designed to be something you can build on. See Participate.