Questions

Frequently asked questions

Is PWM-Bench a model?

No. It is a benchmark framework — a task definition, scoring methodology, baselines, leakage controls, and governance protocol. It does not ship a model.

Is PWM-Bench a dataset?

No. Many datasets may instantiate it. PWM-Bench specifies how forecasts are made, sealed, resolved, and scored; the underlying participant data stays under participant control.

Does PWM-Bench claim to solve person understanding?

No. It proposes a way to measure progress toward it. The benchmark makes the claim falsifiable, not settled.

Why forecasting?

Understanding is latent and cannot be observed directly. Forecasting is observable: if a system understands an individual, it should predict that individual's future better than baselines built from population knowledge and from the individual's own routine.
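As an illustration of that comparison, here is a minimal sketch of a skill check against two baselines. It assumes binary outcomes and a Brier-style score, which this FAQ does not specify; the names `brier_score`, `skill`, and `beats_both` are hypothetical and not part of PWM-Bench.

```python
from statistics import mean


def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return mean((p - y) ** 2 for p, y in zip(probs, outcomes))


def skill(system, baseline, outcomes):
    """Skill relative to a baseline: 1 - score_system / score_baseline.

    Positive values mean the system beats the baseline.
    The scoring rule here is an assumption, not the benchmark's own.
    """
    base = brier_score(baseline, outcomes)
    if base == 0:
        raise ValueError("baseline is perfect; skill is undefined")
    return 1.0 - brier_score(system, outcomes) / base


def beats_both(system, population, routine, outcomes):
    """True only if the system beats the population AND the routine baseline."""
    return (skill(system, population, outcomes) > 0
            and skill(system, routine, outcomes) > 0)
```

Under these assumptions, a system that merely reproduces the population baseline has zero skill and does not pass the check.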

Why not self-report?

Self-report is valuable but retrospective and incomplete. PWM-Bench tests whether additional evidence improves forecasts that are held accountable to future outcomes, a property self-report alone cannot establish.

Why is this not just personalization?

Personalization predicts outputs. PWM-Bench tests whether a system can forecast the evolving state that generates those outputs — attention, goals, and goal-state transitions — under sealed, prospective conditions.

Are there results yet?

No. The current release is pre-pilot. Scores will be reported in PWM-Pilot. No empirical results are currently on the leaderboard.

Will raw participant data be public?

No. PWM-Bench is designed for federated execution and aggregate-only reporting. Raw data stays under participant control; only resolved outcomes and aggregate metrics leave the client.

What stops a model from just memorizing a person?

The identity-permutation test. If a system's apparent skill survives when forecasts are scored against the wrong individual, that skill was not person-specific. PWM-Bench requires skill to collapse under permutation.
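One way such a test could look, as a rough sketch rather than the benchmark's actual procedure: score each person's forecasts against their own outcomes, then against the outcomes of a different person, and compare. The Brier score, the function names, and the assumption that every person has the same number of aligned forecast slots are all illustrative choices.

```python
import random
from statistics import mean


def mean_score(forecasts, outcomes, assignment):
    """Mean Brier score when each person's forecasts are scored against
    the outcomes of assignment[person]."""
    scores = []
    for person, probs in forecasts.items():
        target = outcomes[assignment[person]]
        scores.extend((p - y) ** 2 for p, y in zip(probs, target))
    return mean(scores)


def derangement(people, rng):
    """Random reassignment in which nobody is mapped to themselves."""
    people = list(people)
    if len(people) < 2:
        raise ValueError("need at least two people to permute identities")
    while True:
        shuffled = people[:]
        rng.shuffle(shuffled)
        if all(a != b for a, b in zip(people, shuffled)):
            return dict(zip(people, shuffled))


def permutation_gap(forecasts, outcomes, n_perm=200, seed=0):
    """Return (matched, permuted, gap); lower Brier scores are better.

    A clearly positive gap means the score gets worse under the wrong
    identity, i.e. the skill was person-specific. A gap near zero means
    the forecasts would have fit anyone equally well.
    """
    rng = random.Random(seed)
    identity = {p: p for p in forecasts}
    matched = mean_score(forecasts, outcomes, identity)
    permuted = mean(
        mean_score(forecasts, outcomes, derangement(forecasts, rng))
        for _ in range(n_perm)
    )
    return matched, permuted, permuted - matched
```

In this sketch a forecaster that issues the same probabilities for everyone produces a gap of zero, which is the pattern the test is meant to expose.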

How do you prevent leakage from the future?

Forecasts are sealed and timestamped before outcomes occur, evaluation is strict walk-forward with no random cross-validation, and no system may access evidence dated after its forecast time.
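The three controls above can be sketched together in a few lines. This is an assumption-laden illustration, not the PWM-Bench protocol: the record layout, the SHA-256 digest as a seal, and the names `seal`, `verify`, `visible_evidence`, and `walk_forward` are all hypothetical. A digest only detects later edits; proving when a forecast was made would additionally need a trusted timestamp, which is out of scope here.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class SealedForecast:
    forecast_time: float  # when the forecast was made (hypothetical field)
    payload: str          # canonical JSON of the forecast
    digest: str           # SHA-256 over forecast_time and payload


def _digest(forecast_time, payload):
    return hashlib.sha256(f"{forecast_time}|{payload}".encode()).hexdigest()


def seal(forecast, forecast_time):
    """Freeze a forecast together with its timestamp."""
    payload = json.dumps(forecast, sort_keys=True)
    return SealedForecast(forecast_time, payload, _digest(forecast_time, payload))


def verify(sealed):
    """True if neither the payload nor the timestamp changed after sealing."""
    return _digest(sealed.forecast_time, sealed.payload) == sealed.digest


def visible_evidence(evidence, forecast_time):
    """Drop any evidence dated after the forecast time."""
    return [e for e in evidence if e["time"] <= forecast_time]


def walk_forward(events, forecast_times, predict):
    """Strict walk-forward: at each forecast time, predict from past events
    only, seal the forecast, then pair it with the next event that occurs
    strictly later. No shuffling and no random folds."""
    events = sorted(events, key=lambda e: e["time"])
    results = []
    for t in sorted(forecast_times):
        future = [e for e in events if e["time"] > t]
        if not future:
            break  # nothing left to resolve against
        history = visible_evidence(events, t)
        sealed = seal({"prob": predict(history)}, t)
        results.append((sealed, future[0]["outcome"]))
    return results
```

Here `predict` is any callable that maps the visible history to a probability; it never receives an event dated after the forecast time.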