The benchmark
What PWM-Bench is — and is not
PWM-Bench is a benchmark framework: a task definition, a scoring methodology, baselines, leakage controls, and a federated governance protocol. It is not a model, a dataset, an architecture, or a product.
PWM-Bench is
- a benchmark framework
- a task definition
- a scoring methodology
- a set of baselines (R1 population, R2 routine)
- a set of leakage controls
- an identity-permutation test
- a federated governance protocol
PWM-Bench is not
- a model
- a dataset
- an architecture
- a product
- a claim that person understanding is solved
Core measurement — Personal Skill
The headline metric is Personal Skill: how much better a system forecasts a specific individual than the baselines, measured in bits and subject to gates.
Measured in bits
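Personal Skill is the reduction in logarithmic loss (a proper scoring rule) relative to a baseline, expressed in bits per forecast. Zero bits means “no better than the baseline”; positive values mean the system forecasts better than the baseline, and negative values mean it forecasts worse.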
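As a rough illustration of the "bits per forecast" idea, the sketch below computes the mean logarithmic loss of a system and of a baseline on the same binary outcomes and reports the difference. The function names (`log_loss_bits`, `skill_bits`), the restriction to binary events, and the clipping constant are assumptions made for this example; the formal definitions are those in the scoring section.

```python
import math

def log_loss_bits(probs, outcomes, eps=1e-12):
    """Mean logarithmic loss in bits for binary forecasts (lower is better)."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0); eps is an assumption
        total += -math.log2(p if y == 1 else 1.0 - p)
    return total / len(probs)

def skill_bits(system_probs, baseline_probs, outcomes):
    """Baseline loss minus system loss; positive means the system beats the baseline."""
    return log_loss_bits(baseline_probs, outcomes) - log_loss_bits(system_probs, outcomes)

# Toy data (hypothetical): an uninformative baseline versus a sharper system.
outcomes = [1, 0, 1, 1]
baseline = [0.5, 0.5, 0.5, 0.5]
system = [0.8, 0.2, 0.8, 0.8]

print(skill_bits(system, baseline, outcomes))    # roughly 0.68 bits per forecast
print(skill_bits(baseline, baseline, outcomes))  # 0.0: no better than the baseline
```

Because the score is an average over forecasts, the result reads directly as bits gained per forecast relative to the chosen baseline.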
Improvement over R1 and R2
Personal Skill is reported against both the population prior (R1) and the personal routine baseline (R2). Beating R2, the person's own routine, is the bar for genuine person-specific skill.
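To make the difference between the two baselines concrete, here is a minimal sketch under simplifying assumptions: R1 is taken to be a single event rate pooled over everyone, and R2 a smoothed event rate computed from each person's own history. The function `fit_baselines`, the `(person_id, outcome)` data layout, and the smoothing parameter `alpha` are hypothetical stand-ins, not the benchmark's actual baseline definitions.

```python
from collections import defaultdict

def fit_baselines(history, alpha=1.0):
    """history: list of (person_id, outcome) pairs with outcome in {0, 1}.

    Returns (r1, r2): r1 is one pooled population rate, r2 maps each
    person_id to a Laplace-smoothed personal rate. Both are illustrative only.
    """
    r1 = sum(y for _, y in history) / len(history)
    counts = defaultdict(lambda: [0, 0])  # person_id -> [events, observations]
    for pid, y in history:
        counts[pid][0] += y
        counts[pid][1] += 1
    r2 = {pid: (k + alpha) / (n + 2 * alpha) for pid, (k, n) in counts.items()}
    return r1, r2

# Toy history (hypothetical): person "a" almost always does the thing, "b" rarely.
history = [("a", 1), ("a", 1), ("a", 1), ("b", 0), ("b", 0), ("b", 1)]
r1, r2 = fit_baselines(history)
print(r1)        # pooled rate, the same for everyone
print(r2["a"])   # personal rate for "a", higher than the pooled rate
print(r2["b"])   # personal rate for "b", lower than the pooled rate
```

In this toy setup a system that merely memorizes each person's habitual rate would beat R1 but gain nothing over R2, which is why R2 is the stricter bar.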
Calibration-gated
A system must be calibrated: when it says 70%, the event should happen about 70% of the time. Skill from miscalibrated confidence does not count.
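One common way to check the "70% means 70%" property is to bin forecasts by stated probability and compare each bin's average forecast with the observed event frequency. The sketch below uses expected calibration error for that purpose; the choice of this statistic, the ten equal-width bins, and the `max_ece` threshold are assumptions for illustration, and the benchmark's actual gate may be defined differently.

```python
def expected_calibration_error(probs, outcomes, n_bins=10):
    """Weighted mean gap between stated probability and observed frequency per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls into the last bin
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        freq = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(mean_p - freq)
    return ece

def passes_calibration_gate(probs, outcomes, max_ece=0.05):
    """Hypothetical gate: accept only if the calibration error is small enough."""
    return expected_calibration_error(probs, outcomes) <= max_ece

# Ten forecasts of 70% with seven events: well calibrated.
well = ([0.7] * 10, [1] * 7 + [0] * 3)
# Ten forecasts of 90% with five events: overconfident.
over = ([0.9] * 10, [1] * 5 + [0] * 5)

print(passes_calibration_gate(*well))  # True
print(passes_calibration_gate(*over))  # False
```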
Permutation-gated
Apparent skill must collapse when forecasts are scored against the wrong individual. Skill that survives permutation was never person-specific.
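The permutation idea can be sketched as follows: score each person's forecasts against their own outcomes, then against the outcomes of a different person, and compare the two skill values. Everything specific here is an assumption for illustration, including the pooled binary log-loss skill, the use of random derangements (permutations with no one mapped to themselves), and the names `pooled_skill` and `permutation_check`; the benchmark's identity-permutation test may differ in detail.

```python
import math
import random

def mean_log_loss_bits(probs, outcomes, eps=1e-12):
    """Mean logarithmic loss in bits for binary forecasts."""
    losses = []
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1.0 - eps)
        losses.append(-math.log2(p if y == 1 else 1.0 - p))
    return sum(losses) / len(losses)

def pooled_skill(forecasts, baselines, outcomes, assignment):
    """Skill in bits when person i's forecasts are scored against assignment[i]'s outcomes."""
    sys_p, base_p, ys = [], [], []
    for pid, target in assignment.items():
        sys_p += forecasts[pid]
        base_p += baselines[target]
        ys += outcomes[target]
    return mean_log_loss_bits(base_p, ys) - mean_log_loss_bits(sys_p, ys)

def random_derangement(ids, rng):
    """Random permutation in which nobody is assigned to themselves."""
    while True:
        shuffled = ids[:]
        rng.shuffle(shuffled)
        if all(a != b for a, b in zip(ids, shuffled)):
            return dict(zip(ids, shuffled))

def permutation_check(forecasts, baselines, outcomes, n_perm=200, seed=0):
    """Return (skill under true identities, mean skill under permuted identities)."""
    ids = sorted(forecasts)
    rng = random.Random(seed)
    true_skill = pooled_skill(forecasts, baselines, outcomes, {i: i for i in ids})
    permuted = [
        pooled_skill(forecasts, baselines, outcomes, random_derangement(ids, rng))
        for _ in range(n_perm)
    ]
    return true_skill, sum(permuted) / n_perm

# Toy data (hypothetical): three people with different behavior patterns.
outcomes = {"a": [1, 1, 0, 0], "b": [0, 0, 1, 1], "c": [1, 0, 1, 0]}
forecasts = {pid: [0.8 if y else 0.2 for y in ys] for pid, ys in outcomes.items()}
baselines = {pid: [0.5] * len(ys) for pid, ys in outcomes.items()}

true_skill, permuted_skill = permutation_check(forecasts, baselines, outcomes)
print(true_skill > 0, permuted_skill <= 0)  # True True: skill collapses under permutation
```

This toy example assumes every person has the same number of forecasts so that lists can be swapped directly; a real protocol would need a rule for aligning forecasts across individuals.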
The three requirements apply together: only skill that is measured with a proper score and clears both the calibration gate and the permutation gate is reported as Personal Skill. See the scoring section for the formal definitions.
Benchmark logic
Understanding is latent; forecasting is observable. A system that truly models an individual's evolving state should be able to anticipate what that individual does next — not in the aggregate, and not in a way that any well-tuned routine model would also achieve, but specifically, and under sealed prospective conditions. That is the capability PWM-Bench is built to detect, and the reason it pairs every score with calibration and permutation gates.