Research agenda

A research program, not a single result

PWM-Bench is built so that hard questions about person-specific forecasting become measurable. The open questions below motivate the benchmark; the illustrative experiments show how the same machinery answers them.

Open questions

Q1 Evidence sufficiency

Which evidence tiers are sufficient to beat the personal routine baseline (R2) on each task family?

Q2 Decay

How quickly does person-specific skill decay as evidence ages without refresh?

Q3 Sparsity

How sparse can evidence become before skill collapses to the routine baseline?

Q4 Modality value

Which modalities (text, audio, visual, location, physiology) carry marginal predictive value, and for which tasks?

Q5 Refresh rate

How often must evidence be refreshed to maintain an accurate estimate of an evolving state?

Q6 Transition anticipation

Can any system anticipate goal-state transitions (PWM-X) before they are behaviorally obvious?

Q7 Observation vs self-report

Does passive observation add skill beyond what the participant would self-report?

Q8 Cross-domain transfer

Does skill learned in one life domain transfer to forecasting in another?

Illustrative experiments

| # | Experiment | Questions | Design | Outcome |
|---|---|---|---|---|
| A | Observation duration | Q2, Q5 | Vary the length of the observation window feeding a system and measure skill as a function of how much history it has seen. | A skill-vs-observation-duration curve per task family. |
| B | Modality ablation | Q4 | Hold the system fixed and remove one modality at a time from the evidence stream. | Marginal skill attributable to each modality, per task. |
| C | Attention forecasting | Q1, Q4 | Focus on T4 (attention allocation) across evidence tiers L0→L3. | The evidence tier at which attention forecasting first beats R2. |
| D | Goal-transition detection | Q6 | Score PWM-X forecasts around known transition points and measure lead time. | Whether transitions are anticipated, and with what lead time and calibration. |
| E | Passive vs self-report | Q7 | Compare a passive-observation system against a self-report-only system on identical questions. | The marginal skill of observation over self-report. |
| F | Evidence-refresh rate | Q2, Q5 | Throttle how frequently a system's evidence is refreshed and measure skill decay between refreshes. | A required-refresh-rate estimate to maintain a target skill level. |
| G | Personalized vs population | Q1, Q8 | Contrast person-specific models against population models on the same forecasts, including under identity permutation. | The person-specific skill that survives permutation, which is the quantity PWM-Bench is built to measure. |
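
The identity permutation in Experiment G can be read as follows: score each person's forecasts against their own outcomes, then against another person's outcomes, and compare. The sketch below is one possible reading under stated assumptions, not the PWM-Bench scoring protocol. The Brier score, the function names (`brier`, `permutation_gap`), the requirement that every person answers the same aligned binary questions, and the toy data are all hypothetical choices made for illustration.

```python
import random
from statistics import mean


def brier(forecasts, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes (lower is better)."""
    return mean((p - y) ** 2 for p, y in zip(forecasts, outcomes))


def permutation_gap(forecasts_by_person, outcomes_by_person, n_permutations=200, seed=0):
    """Average Brier penalty incurred when identities are shuffled.

    Assumes every person has the same number of aligned binary questions.
    Random shuffles may leave some people matched to themselves, which
    makes the reported gap conservative.
    """
    rng = random.Random(seed)
    people = sorted(forecasts_by_person)

    # Score with identities intact.
    matched = mean(
        brier(forecasts_by_person[p], outcomes_by_person[p]) for p in people
    )

    # Score repeatedly with forecasts assigned to shuffled identities.
    permuted_scores = []
    for _ in range(n_permutations):
        shuffled = people[:]
        rng.shuffle(shuffled)
        permuted_scores.append(
            mean(
                brier(forecasts_by_person[p], outcomes_by_person[q])
                for p, q in zip(people, shuffled)
            )
        )
    return mean(permuted_scores) - matched


# Toy data (hypothetical): two people with opposite habits on four questions.
outcomes = {"a": [1, 1, 0, 1], "b": [0, 0, 1, 0]}
personal = {"a": [0.9, 0.8, 0.2, 0.9], "b": [0.1, 0.2, 0.8, 0.1]}
population = {"a": [0.5, 0.5, 0.5, 0.5], "b": [0.5, 0.5, 0.5, 0.5]}

personal_gap = permutation_gap(personal, outcomes)
population_gap = permutation_gap(population, outcomes)
```

In this toy setup a model that issues the same forecast for everyone shows no gap, because shuffling identities changes nothing it predicts, while a model that tracks individuals is penalized by the shuffle. A positive gap is therefore one candidate measure of skill that depends on knowing who the person is.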

Open questions and experiments render from data/experiments.json.

Join the program

PWM-Bench is intended as shared infrastructure. Labs can contribute systems, evidence-processing tools, privacy infrastructure, or experimental designs that extend the open questions above. If your group works on personalization, world models, calibrated forecasting, or privacy-preserving ML, the protocol is designed for you to build on. See Participate.