The benchmark

What PWM-Bench is — and is not

PWM-Bench is a benchmark framework: a task definition, a scoring methodology, baselines, leakage controls, and a federated governance protocol. It is not a model, a dataset, an architecture, or a product.

PWM-Bench is

a benchmark framework
a task definition
a scoring methodology
a set of baselines (R1 population, R2 routine)
a set of leakage controls
an identity-permutation test
a federated governance protocol

PWM-Bench is not

a model
a dataset
an architecture
a product
a claim that person understanding is solved

Core measurement — Personal Skill

The headline metric is Personal Skill: how much better a system forecasts a specific individual than the baselines do, measured in bits and subject to the calibration and permutation gates.

unit

Measured in bits

Personal Skill is the reduction in the proper (logarithmic) score relative to a baseline, expressed in bits per forecast. Zero bits means “no better than the baseline.”
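As a rough illustration of the unit, the sketch below computes skill as the mean reduction in base-2 logarithmic loss relative to a baseline. It assumes binary events; the function names and the toy numbers are hypothetical and are not taken from the PWM-Bench specification.

```python
import math

def log_loss_bits(p, outcome):
    """Base-2 log loss of forecast probability p for a binary outcome (0 or 1)."""
    prob_of_outcome = p if outcome == 1 else 1.0 - p
    return -math.log2(prob_of_outcome)

def skill_bits(system_probs, baseline_probs, outcomes):
    """Mean reduction in log loss (bits per forecast) relative to the baseline."""
    diffs = [
        log_loss_bits(b, y) - log_loss_bits(s, y)
        for s, b, y in zip(system_probs, baseline_probs, outcomes)
    ]
    return sum(diffs) / len(diffs)

# Invented example data, for illustration only.
outcomes = [1, 0, 1, 1]
baseline = [0.5, 0.5, 0.5, 0.5]   # an uninformative baseline: 1 bit of loss per forecast
system = [0.8, 0.2, 0.8, 0.8]     # assigns 0.8 to what actually happened each time

print(f"{skill_bits(system, baseline, outcomes):.3f} bits per forecast")
```

Scoring the baseline against itself gives exactly zero bits, which matches the "no better than the baseline" reading.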

baselines

Improvement over R1 and R2

Reported against both the population prior (R1) and the personal routine baseline (R2). Beating R2 — the person's own routine — is the bar for genuine person-specific skill.
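To see why R2 is the harder bar, consider a system that has only learned a person's routine. The sketch below uses invented numbers and assumes, purely for illustration, that R1 is a flat population prior and R2 is the person's own historical frequency; neither choice is prescribed by this section.

```python
import math

def mean_log_loss_bits(probs, outcomes):
    """Average base-2 log loss over binary forecasts."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        total += -math.log2(p if y == 1 else 1.0 - p)
    return total / len(outcomes)

# Hypothetical person who does the activity on 3 of 4 days.
outcomes = [1, 1, 1, 0]
r1_population = [0.5] * 4   # assumed population prior (R1)
r2_routine = [0.75] * 4     # assumed personal routine baseline (R2)
system = [0.75] * 4         # a system that merely reproduces the routine

system_loss = mean_log_loss_bits(system, outcomes)
skill_vs_r1 = mean_log_loss_bits(r1_population, outcomes) - system_loss
skill_vs_r2 = mean_log_loss_bits(r2_routine, outcomes) - system_loss
print(f"vs R1: {skill_vs_r1:.3f} bits, vs R2: {skill_vs_r2:.3f} bits")
```

This system shows positive skill against R1 but exactly zero against R2: it knows the routine and nothing more.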

gate

Calibration-gated

A system must be calibrated: when it says 70%, the event should happen about 70% of the time. Skill from miscalibrated confidence does not count.
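One common way to check this property is a binned reliability comparison. The sketch below computes an expected calibration error and compares it to a tolerance. Both the metric and the threshold are stand-ins chosen for illustration; this section does not state which calibration test or cutoff PWM-Bench actually uses.

```python
def expected_calibration_error(probs, outcomes, n_bins=10):
    """Weighted mean gap between stated probability and observed frequency, per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[index].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        hit_rate = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_p - hit_rate)
    return ece

# Hypothetical forecasts: ten at 70%, of which seven came true.
probs = [0.7] * 10
outcomes = [1] * 7 + [0] * 3
TOLERANCE = 0.05  # assumed threshold, not taken from PWM-Bench

ece = expected_calibration_error(probs, outcomes)
print("passes gate:", ece <= TOLERANCE)
```

With the same outcomes, forecasts of 90% would produce a gap of about 0.2 and fail this illustrative gate.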

gate

Permutation-gated

Apparent skill must collapse when forecasts are scored against the wrong individual. Skill that survives permutation was never person-specific.
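The idea can be sketched by scoring each person's forecasts twice: once against their own outcomes and once against someone else's. The example below makes several assumptions for illustration: binary events, forecast slots aligned across people, a shared flat baseline of 0.5, and a single cyclic shift of identities where a real test would presumably use many random permutations. All names and data are hypothetical.

```python
import math

def mean_log_loss_bits(probs, outcomes):
    """Average base-2 log loss over binary forecasts."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        total += -math.log2(p if y == 1 else 1.0 - p)
    return total / len(outcomes)

def mean_skill_bits(forecasts, outcomes, assignment, baseline_p=0.5):
    """Mean skill over people, scoring each person's forecasts against
    the outcomes of assignment[person]."""
    skills = []
    for person, probs in forecasts.items():
        target = outcomes[assignment[person]]
        baseline_loss = mean_log_loss_bits([baseline_p] * len(target), target)
        skills.append(baseline_loss - mean_log_loss_bits(probs, target))
    return sum(skills) / len(skills)

# Invented forecasts and outcomes for three people, four aligned slots each.
forecasts = {
    "A": [0.8, 0.8, 0.2, 0.2],
    "B": [0.2, 0.2, 0.8, 0.8],
    "C": [0.8, 0.2, 0.8, 0.2],
}
outcomes = {
    "A": [1, 1, 0, 0],
    "B": [0, 0, 1, 1],
    "C": [1, 0, 1, 0],
}

people = list(forecasts)
identity = {p: p for p in people}
# Cyclic shift: a simple derangement, so nobody is scored against themselves.
shifted = {p: people[(i + 1) % len(people)] for i, p in enumerate(people)}

matched = mean_skill_bits(forecasts, outcomes, identity)
permuted = mean_skill_bits(forecasts, outcomes, shifted)
print(f"matched: {matched:.3f} bits, permuted: {permuted:.3f} bits")
```

In this toy data the matched skill is positive and the permuted skill is negative, which is the collapse the gate looks for. Skill that stayed positive after the shift would indicate population-level knowledge, not person-specific knowledge.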

Only skill that is measured with a proper score and passes both the calibration gate and the permutation gate is reported as Personal Skill. See the scoring section for the formal definitions.

Benchmark logic

PWM-Bench does not measure whether a model sounds like a person. It measures whether the model can forecast the person's future better than population knowledge and personal routine.

Understanding is latent; forecasting is observable. A system that truly models an individual's evolving state should be able to anticipate what that individual does next — not in the aggregate, and not in a way that any well-tuned routine model would also achieve, but specifically, and under sealed prospective conditions. That is the capability PWM-Bench is built to detect, and the reason it pairs every score with calibration and permutation gates.

Where to go next

Tasks →

Evidence ladder →

Protocol →