Protocol · PWM-0.1
The protocol makes PWM-Bench executable: the unit of evaluation, the forecast and outcome schemas, the scoring rules and gates, and the integrity controls that keep evaluation honest and prospective.
A. Forecast unit
A benchmark instance is a tuple:
A system observes only E≤t — evidence dated at or before the forecast time t — and must emit a calibrated probability distribution over the answer space A. The instance is sealed at t and resolved at r.
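The visibility rule E≤t can be sketched as a timestamp filter. This is a minimal illustration, not part of the protocol: the evidence record layout (a dict with `id` and `timestamp` fields) and the helper name `visible_evidence` are assumptions made for the example.

```python
from datetime import datetime

def visible_evidence(evidence, forecast_time):
    """Return only the items dated at or before forecast_time (E<=t)."""
    t = datetime.fromisoformat(forecast_time)
    return [e for e in evidence if datetime.fromisoformat(e["timestamp"]) <= t]

# Hypothetical evidence items; only the timestamps matter here.
evidence = [
    {"id": "e1", "timestamp": "2026-06-30T21:15:00-07:00"},
    {"id": "e2", "timestamp": "2026-07-01T08:00:00-07:00"},  # exactly at t: allowed
    {"id": "e3", "timestamp": "2026-07-01T09:30:00-07:00"},  # after t: withheld
]
kept = visible_evidence(evidence, "2026-07-01T08:00:00-07:00")
print([e["id"] for e in kept])
```

Timestamps carry explicit UTC offsets so that the comparison stays correct across time zones.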
B. Forecast schema
A forecast commits a probability distribution over the answer space, stamped with the time it was made and the evidence tier it was allowed to use.
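The constraints this implies can be checked mechanically. The sketch below assumes the field names used in this section's example record; the required-field list, the tolerance, and the function name `validate_forecast` are illustrative choices, not something the protocol prescribes.

```python
import math
from datetime import datetime

# Assumed required fields, taken from the example record in this section.
REQUIRED = ("participant_id", "forecast_time", "horizon", "task", "question",
            "answer_space", "probabilities", "model", "evidence_tier",
            "protocol_version")

def validate_forecast(f, tol=1e-9):
    """Return a list of problems; an empty list means the record looks valid."""
    problems = [f"missing field: {k}" for k in REQUIRED if k not in f]
    if problems:
        return problems
    probs, space = f["probabilities"], f["answer_space"]
    if set(probs) != set(space):
        problems.append("probabilities must cover exactly the answer space")
    if any(not (0.0 <= p <= 1.0) for p in probs.values()):
        problems.append("each probability must lie in [0, 1]")
    if not math.isclose(sum(probs.values()), 1.0, abs_tol=tol):
        problems.append("probabilities must sum to 1")
    if datetime.fromisoformat(f["forecast_time"]).tzinfo is None:
        problems.append("forecast_time must carry a UTC offset")
    return problems

forecast = {
    "participant_id": "P001",
    "forecast_time": "2026-07-01T08:00:00-07:00",
    "horizon": "24h",
    "task": "T4_attention_allocation",
    "question": "Which active project will receive the most attention tomorrow?",
    "answer_space": ["PWM-Bench", "hardware", "family", "energy", "other"],
    "probabilities": {"PWM-Bench": 0.45, "hardware": 0.25, "family": 0.15,
                      "energy": 0.10, "other": 0.05},
    "model": "example-model",
    "evidence_tier": "L3",
    "protocol_version": "PWM-0.1",
}
print(validate_forecast(forecast))
```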
{
  "participant_id": "P001",
  "forecast_time": "2026-07-01T08:00:00-07:00",
  "horizon": "24h",
  "task": "T4_attention_allocation",
  "question": "Which active project will receive the most attention tomorrow?",
  "answer_space": ["PWM-Bench", "hardware", "family", "energy", "other"],
  "probabilities": {
    "PWM-Bench": 0.45,
    "hardware": 0.25,
    "family": 0.15,
    "energy": 0.10,
    "other": 0.05
  },
  "model": "example-model",
  "evidence_tier": "L3",
  "protocol_version": "PWM-0.1"
}
C. Outcome schema
The outcome records the resolved label and how it was adjudicated. It references the forecast it resolves; only the resolved label leaves the federated client.
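The link between an outcome and its forecast can be checked with a few lines of code. The sketch below assumes that `forecast_ref` is the participant ID, forecast time, and task joined by hyphens, which is what this section's example record suggests but the protocol text does not state. The helper names are hypothetical.

```python
from datetime import datetime

def forecast_ref(forecast):
    """Build the reference key (assumed format: participant-time-task)."""
    return f'{forecast["participant_id"]}-{forecast["forecast_time"]}-{forecast["task"]}'

def check_outcome(outcome, forecast):
    """Return a list of problems; an empty list means the pair is consistent."""
    problems = []
    if outcome["forecast_ref"] != forecast_ref(forecast):
        problems.append("forecast_ref does not match the forecast")
    if outcome["resolved"] and outcome["outcome"] not in forecast["answer_space"]:
        problems.append("outcome is not in the forecast's answer space")
    resolved_at = datetime.fromisoformat(outcome["resolution_time"])
    if resolved_at <= datetime.fromisoformat(forecast["forecast_time"]):
        problems.append("resolution must come after the forecast time")
    return problems

forecast = {
    "participant_id": "P001",
    "forecast_time": "2026-07-01T08:00:00-07:00",
    "task": "T4_attention_allocation",
    "answer_space": ["PWM-Bench", "hardware", "family", "energy", "other"],
}
outcome = {
    "forecast_ref": "P001-2026-07-01T08:00:00-07:00-T4_attention_allocation",
    "resolution_time": "2026-07-02T08:00:00-07:00",
    "resolved": True,
    "outcome": "hardware",
}
print(check_outcome(outcome, forecast))
```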
{
  "participant_id": "P001",
  "forecast_ref": "P001-2026-07-01T08:00:00-07:00-T4_attention_allocation",
  "task": "T4_attention_allocation",
  "resolution_time": "2026-07-02T08:00:00-07:00",
  "resolved": true,
  "outcome": "hardware",
  "answer_space": ["PWM-Bench", "hardware", "family", "energy", "other"],
  "resolution_method": "Participant end-of-window attention labelling adjudicated against the pre-registered attribution rule; only the resolved label leaves the client.",
  "adjudication": "deterministic-rubric",
  "protocol_version": "PWM-0.1"
}
D. Scoring
Logarithmic score
The primary proper scoring rule. Each forecast is scored by the log probability it assigned to the realized outcome. Proper scoring rewards honest probabilities and penalizes overconfidence.
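As a minimal sketch, the log score of a single forecast can be computed as below. The base-2 logarithm (so the score is in bits) and the sign convention (higher is better, maximum 0) are assumptions chosen to match the bits-based skill measure; the function name is illustrative.

```python
import math

def log_score(probabilities, outcome):
    """Log2 probability assigned to the realized outcome (higher is better, max 0)."""
    p = probabilities[outcome]
    return math.log2(p) if p > 0 else float("-inf")

probabilities = {"PWM-Bench": 0.45, "hardware": 0.25, "family": 0.15,
                 "energy": 0.10, "other": 0.05}
print(log_score(probabilities, "hardware"))  # -2.0
```

A forecast that puts zero probability on the realized outcome scores negative infinity, which is why overconfidence is punished so heavily under this rule.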
Brier score
A secondary proper score reported alongside the log score for robustness and interpretability.
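The protocol does not say which Brier variant is used. The sketch below assumes the multi-class form: the sum of squared differences between the forecast and the one-hot outcome vector, ranging from 0 (perfect) to 2 (worst).

```python
def brier_score(probabilities, outcome):
    """Multi-class Brier score: sum over labels of (p - indicator)^2; lower is better."""
    return sum((p - (1.0 if label == outcome else 0.0)) ** 2
               for label, p in probabilities.items())

probabilities = {"PWM-Bench": 0.45, "hardware": 0.25, "family": 0.15,
                 "energy": 0.10, "other": 0.05}
print(round(brier_score(probabilities, "hardware"), 4))  # 0.8
```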
Personal Skill in bits
The mean reduction in log score versus a baseline, expressed in bits per forecast — reported separately against R1 (population) and R2 (routine).
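One way to read this definition: for each forecast, take the log2 ratio of the probability the system gave the realized outcome to the probability the baseline gave it, then average. The sketch below follows that reading; the input layout and all numbers are hypothetical.

```python
import math

def skill_bits(model_probs, baseline_probs):
    """Mean log2(p_model / p_baseline) on realized outcomes, in bits per forecast."""
    gains = [math.log2(pm) - math.log2(pb)
             for pm, pb in zip(model_probs, baseline_probs)]
    return sum(gains) / len(gains)

# Probability each forecaster assigned to the outcome that actually occurred.
model = [0.5, 0.25, 0.5]
r1_population = [0.25, 0.25, 0.25]   # hypothetical R1 baseline
r2_routine = [0.5, 0.5, 0.25]        # hypothetical R2 baseline

print(skill_bits(model, r1_population))  # 2/3 of a bit per forecast
print(skill_bits(model, r2_routine))     # 0.0
```

Positive values mean the system beat the baseline on average. In this toy data the system adds information over the population baseline but none over the routine baseline.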
Calibration gates
Reliability is checked (e.g., reliability diagrams / calibration error). A system that is not calibrated does not pass, regardless of raw score.
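As one possible calibration check, the sketch below computes expected calibration error (ECE) over the top-label confidence. The number of bins and the pass threshold are assumptions; the protocol does not specify either.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin forecasts by stated confidence, then average |accuracy - confidence|
    across bins, weighted by the share of forecasts in each bin."""
    bins = [[] for _ in range(n_bins)]
    for c, hit in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # c == 1.0 falls in the last bin
        bins[idx].append((c, hit))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if b:
            mean_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(hit for _, hit in b) / len(b)
            ece += len(b) / n * abs(accuracy - mean_conf)
    return ece

GATE_THRESHOLD = 0.05  # hypothetical pass threshold

# Well calibrated: 80% confidence is right 4 of 5 times, 60% is right 3 of 5.
calibrated = expected_calibration_error([0.8] * 5 + [0.6] * 5,
                                        [1, 1, 1, 1, 0] + [1, 1, 1, 0, 0])
# Overconfident: 90% confidence is right only 1 time in 4.
overconfident = expected_calibration_error([0.9] * 4, [1, 0, 0, 0])
print(calibrated <= GATE_THRESHOLD, overconfident <= GATE_THRESHOLD)
```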
Day-blocked bootstrap
Uncertainty is estimated by resampling whole days, not individual forecasts, to respect within-day dependence and avoid over-precise intervals.
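A minimal sketch of this resampling scheme, assuming a percentile interval for the mean score and scores grouped by calendar day. The resample count, the interval type, and the data are illustrative choices.

```python
import random

def day_block_bootstrap(scores_by_day, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile interval for the mean score, resampling whole days with replacement."""
    rng = random.Random(seed)
    days = list(scores_by_day)
    means = []
    for _ in range(n_resamples):
        # Draw days, not forecasts: every score from a drawn day comes along.
        sample = [s for d in rng.choices(days, k=len(days)) for s in scores_by_day[d]]
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical per-forecast skill values, grouped by day.
scores_by_day = {
    "2026-07-01": [0.4, 0.5, 0.3],
    "2026-07-02": [-0.1, 0.0],
    "2026-07-03": [0.6, 0.7, 0.5, 0.6],
    "2026-07-04": [0.1, 0.2],
}
lo, hi = day_block_bootstrap(scores_by_day)
print(lo, hi)
```

Because forecasts within a day tend to succeed or fail together, resampling them individually would treat correlated scores as independent and give intervals that are too narrow.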
Permutation null
The identity-permutation test: forecasts are re-scored against the wrong individuals to build a null distribution. Skill must exceed this null to count as person-specific.
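The test can be sketched as follows. This version assumes every participant shares the same answer space and that forecasts and outcomes are aligned slot by slot. The probability floor, the number of permutations, and the one-sided p-value are illustrative choices, and the data are invented.

```python
import math
import random

def mean_log_score(forecasts, outcomes, floor=1e-12):
    """Mean log2 probability on the given outcomes (floored to avoid -inf)."""
    return sum(math.log2(max(f[o], floor))
               for f, o in zip(forecasts, outcomes)) / len(outcomes)

def identity_permutation_test(forecasts, outcomes, n_perm=1000, seed=0):
    """forecasts[pid] and outcomes[pid] are aligned lists.
    Returns (observed score, one-sided p-value against the permuted-identity null)."""
    rng = random.Random(seed)
    ids = list(forecasts)

    def score(assignment):
        # Score each participant's forecasts against the assigned person's outcomes.
        return sum(mean_log_score(forecasts[p], outcomes[q])
                   for p, q in zip(ids, assignment)) / len(ids)

    observed = score(ids)
    null = []
    for _ in range(n_perm):
        shuffled = ids[:]
        rng.shuffle(shuffled)
        null.append(score(shuffled))
    p_value = (1 + sum(s >= observed for s in null)) / (1 + n_perm)
    return observed, p_value

# Five hypothetical participants, each with a distinct habitual outcome.
labels = ["a", "b", "c", "d", "e"]
forecasts, outcomes = {}, {}
for i, own in enumerate(labels):
    pid = f"P{i + 1:03d}"
    dist = {lab: (0.6 if lab == own else 0.1) for lab in labels}
    forecasts[pid] = [dist] * 4
    outcomes[pid] = [own] * 4

observed, p_value = identity_permutation_test(forecasts, outcomes)
print(observed, p_value)
```

With few participants the null has few distinct permutations, so the smallest achievable p-value is limited.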
E. Integrity
- Sealed, timestamped forecasts. Every forecast is committed before its resolution and cannot be revised afterward.
- No random cross-validation. The future is never used to predict the past; standard k-fold CV is disallowed.
- Strict walk-forward evaluation. Systems are evaluated forward in time, only ever conditioning on past evidence.
- No access to future evidence. A system may use only E≤t; evidence dated after t is withheld.
- Identity-permutation test. Person-specific skill must collapse when identities are permuted.
- Aggregate reporting only. Results are reported in aggregate; raw participant data and individual forecasts are not published.
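The walk-forward rule in the list above can be sketched as a split generator: each target day is forecast using only the days before it, in contrast to k-fold cross-validation, where later days can land in the training fold. The function name and the `min_history` parameter are assumptions for illustration.

```python
def walk_forward_splits(days, min_history=3):
    """Yield (history_days, target_day) pairs in chronological order.
    Days are ISO date strings, which sort chronologically as text."""
    ordered = sorted(days)
    for i in range(min_history, len(ordered)):
        yield ordered[:i], ordered[i]

days = ["2026-07-03", "2026-07-01", "2026-07-02", "2026-07-05", "2026-07-04"]
for history, target in walk_forward_splits(days):
    print(target, "<-", history)
```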