The benchmark

What PWM-Bench is — and is not

PWM-Bench is a benchmark framework: a task definition, a scoring methodology, baselines, leakage controls, and a federated governance protocol. It is not a model, a dataset, an architecture, or a product.

PWM-Bench is

  • a benchmark framework
  • a task definition
  • a scoring methodology
  • a set of baselines (R1 population, R2 routine)
  • a set of leakage controls
  • an identity-permutation test
  • a federated governance protocol

PWM-Bench is not

  • a model
  • a dataset
  • an architecture
  • a product
  • a claim that person understanding is solved

Core measurement — Personal Skill

The headline metric is Personal Skill: how much better a system forecasts a specific individual than the baselines do, measured in bits and subject to the calibration and permutation gates.

unit

Measured in bits

Personal Skill is the reduction in logarithmic loss (a proper scoring rule) relative to a baseline, expressed in bits per forecast. Zero bits means “no better than the baseline”; a negative value means the system forecasts worse than the baseline.
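As a minimal sketch of the quantity described here, the snippet below computes a skill value in bits for binary forecasts as baseline log loss minus system log loss. The function names (`log_loss_bits`, `skill_bits`) and the restriction to binary events are assumptions made for illustration; the scoring section holds the formal definition.

```python
import math

def log_loss_bits(probs, outcomes):
    """Mean logarithmic loss in bits for binary forecasts.

    probs[i] is the forecast probability that event i happens;
    outcomes[i] is 1 if it happened, 0 otherwise.
    """
    eps = 1e-12  # clamp to avoid log(0) on overconfident forecasts
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1.0 - eps)
        total += -math.log2(p if y else 1.0 - p)
    return total / len(probs)

def skill_bits(system_probs, baseline_probs, outcomes):
    """Bits per forecast gained over the baseline (0 means no better)."""
    return log_loss_bits(baseline_probs, outcomes) - log_loss_bits(system_probs, outcomes)

# Toy example: a coin-flip baseline against a sharper, correct system.
outcomes = [1, 0]
baseline = [0.5, 0.5]
system = [0.75, 0.25]
gain = skill_bits(system, baseline, outcomes)  # positive: system beats baseline
```

Using base-2 logarithms is what makes the unit bits; a system that copies the baseline's probabilities scores exactly zero.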

baselines

Improvement over R1 and R2

Personal Skill is reported against both the population prior (R1) and the personal routine baseline (R2). Beating R2, the person's own routine, is the bar for genuine person-specific skill.
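The two comparisons can be sketched with toy data. The construction below is an assumption, not the benchmark's specification: R1 is approximated by smoothed outcome frequencies across a population, R2 by the same estimate over one person's own history. The category names, histories, and the system forecast are all invented for illustration.

```python
import math
from collections import Counter

CATEGORIES = ["gym", "home", "work"]  # hypothetical outcome set

def frequency_forecast(history, categories=CATEGORIES, alpha=1.0):
    """Laplace-smoothed categorical distribution estimated from past labels."""
    counts = Counter(history)
    total = len(history) + alpha * len(categories)
    return {c: (counts[c] + alpha) / total for c in categories}

def mean_log_loss_bits(forecast, outcomes):
    """Average of -log2 p(outcome) under a fixed categorical forecast."""
    return sum(-math.log2(forecast[o]) for o in outcomes) / len(outcomes)

# Toy data (invented for illustration only).
population_history = ["work"] * 6 + ["home"] * 3 + ["gym"] * 1
person_history = ["gym"] * 6 + ["work"] * 2
future_outcomes = ["gym", "gym", "work"]

r1 = frequency_forecast(population_history)  # stand-in for the population prior
r2 = frequency_forecast(person_history)      # stand-in for the personal routine
system = {"gym": 0.70, "home": 0.05, "work": 0.25}  # hypothetical system forecast

system_loss = mean_log_loss_bits(system, future_outcomes)
skill_vs_r1 = mean_log_loss_bits(r1, future_outcomes) - system_loss
skill_vs_r2 = mean_log_loss_bits(r2, future_outcomes) - system_loss
```

In this toy case the margin over R2 is much smaller than the margin over R1, which illustrates why the routine baseline is the harder bar: much of what looks like skill against the population is already captured by the person's own habits.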

gate

Calibration-gated

A system must be calibrated: when it says 70%, the event should happen about 70% of the time. Skill from miscalibrated confidence does not count.
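One common way to check this property is a binned reliability table and the expected calibration error (ECE). The sketch below is illustrative only: the choice of ECE, the ten equal-width bins, and the `max_ece=0.05` threshold are assumptions, not the gate defined in the scoring section.

```python
def reliability_table(probs, outcomes, n_bins=10):
    """Group forecasts into equal-width probability bins.

    Returns (mean forecast, observed frequency, count) for each non-empty bin.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    table = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            freq = sum(y for _, y in members) / len(members)
            table.append((mean_p, freq, len(members)))
    return table

def expected_calibration_error(probs, outcomes, n_bins=10):
    """Count-weighted average gap between stated and observed frequency."""
    n = len(probs)
    return sum(count / n * abs(mean_p - freq)
               for mean_p, freq, count in reliability_table(probs, outcomes, n_bins))

def passes_calibration_gate(probs, outcomes, max_ece=0.05):
    """Hypothetical gate: the threshold here is an assumed value."""
    return expected_calibration_error(probs, outcomes) <= max_ece

# Ten forecasts of 70% where 7 events happen: calibrated.
calibrated = passes_calibration_gate([0.7] * 10, [1] * 7 + [0] * 3)
# Ten forecasts of 95% where only 7 events happen: overconfident.
overconfident = passes_calibration_gate([0.95] * 10, [1] * 7 + [0] * 3)
```

With real data the bin count and threshold matter, and sparse bins make the estimate noisy, so a production gate would likely need a minimum sample size per bin.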

gate

Permutation-gated

Apparent skill must collapse when forecasts are scored against the wrong individual. Skill that survives permutation was never person-specific.
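A minimal sketch of such a check follows, assuming binary forecasts over the same time slots for every person so that one person's forecasts can be scored against another person's outcomes. The function names, the use of derangements (permutations with no one matched to themselves), and the toy data are assumptions for illustration, not the benchmark's procedure.

```python
import math
import random

def mean_log_loss_bits(probs, outcomes):
    """Mean binary log loss in bits."""
    eps = 1e-12
    losses = []
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1.0 - eps)
        losses.append(-math.log2(p if y else 1.0 - p))
    return sum(losses) / len(losses)

def random_derangement(items, rng):
    """Shuffle until no item stays in its original position (needs >= 2 items)."""
    while True:
        shuffled = list(items)
        rng.shuffle(shuffled)
        if all(a != b for a, b in zip(items, shuffled)):
            return shuffled

def identity_permutation_test(forecasts, outcomes, n_perm=200, seed=0):
    """Compare loss under true identities with loss under shuffled identities."""
    rng = random.Random(seed)
    people = sorted(forecasts)

    def pooled_loss(targets):
        # Score each person's forecasts against the outcomes of `targets[i]`.
        per_person = [mean_log_loss_bits(forecasts[p], outcomes[t])
                      for p, t in zip(people, targets)]
        return sum(per_person) / len(per_person)

    matched = pooled_loss(people)
    permuted = [pooled_loss(random_derangement(people, rng)) for _ in range(n_perm)]
    collapse_bits = sum(permuted) / n_perm - matched  # how much worse when mismatched
    p_value = (1 + sum(loss <= matched for loss in permuted)) / (1 + n_perm)
    return {"matched_loss": matched, "collapse_bits": collapse_bits, "p_value": p_value}

# Toy data: three people with distinct patterns, forecasts tailored to each.
outcomes = {"a": [1, 1, 1, 0], "b": [0, 0, 0, 1], "c": [1, 0, 1, 0]}
forecasts = {"a": [0.9, 0.9, 0.9, 0.1], "b": [0.1, 0.1, 0.1, 0.9], "c": [0.9, 0.1, 0.9, 0.1]}
result = identity_permutation_test(forecasts, outcomes)
```

A large positive `collapse_bits` with a small `p_value` indicates the forecasts depend on who is being forecast. If the loss barely changes under permutation, the system is exploiting population-level structure rather than anything person-specific.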

Personal Skill is reported only when all three conditions hold: the score comes from a proper scoring rule, the forecasts pass the calibration gate, and the skill collapses under identity permutation. See the scoring section for the formal definitions.

Benchmark logic

PWM-Bench does not measure whether a model sounds like a person. It measures whether the model can forecast the person's future better than population knowledge and personal routine.

Understanding is latent; forecasting is observable. A system that truly models an individual's evolving state should be able to anticipate what that individual does next — not in the aggregate, and not in a way that any well-tuned routine model would also achieve, but specifically, and under sealed prospective conditions. That is the capability PWM-Bench is built to detect, and the reason it pairs every score with calibration and permutation gates.

Where to go next