Pre-pilot · PWM-0.1 · updated 2026-06-09
Leaderboard
TargetSpace is a multi-track apparatus. The leaderboard below covers the TS-Personal track and reports calibrated, target-specific forecasting skill under the protocol. It is pre-pilot infrastructure: the planned baseline and candidate systems are listed, and their metrics remain pending until the first pilot round.
Application tracks
TargetSpace shares one scoring spine across application tracks. Only the first is implemented today (and only as a synthetic demonstration); the others are planned or under research and are not yet benchmarks.
- TS-Personal — Longitudinal Personal World Modeling (current) — First and highest-value track. Apparatus exists (synthetic, pre-pilot). The table below is this track.
- TS-Health — Physiology & Care-State Forecasting (planned) — Strong sealed precedent and own-patient baselines; not yet implemented.
- TS-Energy — Infrastructure, Grid & Market Forecasting (planned) — Strong precedent (GEFCom; M4/M5) and a natural seasonal-naive baseline; not yet implemented.
- TS-Robotics — Embodied Action-Conditioned Forecasting (research) — Distinct regime, but a proper-scored protocol and strong own-routine baseline are not yet established.
- TS-Enterprise — Organizational & Workflow Forecasting (research) — No sealed-forecast benchmark today; weak own-routine baseline.
| Rank | System | Evidence Tier | Task Set | Horizon | Participants | Forecasts | Skill vs R1 | Skill vs R2 | Calibration | Permutation Test | Status | Verification | Date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| — | R1 · Population Prior | L0 | T1–T5 | 24h – 7d | — | — | Reference | — | Pending | N/A | Pending pilot | synthetic_demo | — |
| — | R2 · Personal Routine Baseline | L0 | T1–T5 | 24h – 7d | — | — | Pending | Reference | Pending | N/A | Pending pilot | synthetic_demo | — |
| — | Digital Exhaust Model | L0 | T1–T5 | 24h – 7d | — | — | Pending | Pending | Pending | Pending | Pending pilot | synthetic_demo | — |
| — | Chat History Model | L1 | T1–T5 | 24h – 7d | — | — | Pending | Pending | Pending | Pending | Pending pilot | synthetic_demo | — |
| — | Passive Observation Model | L3 | T1–T5 | 24h – 7d | — | — | Pending | Pending | Pending | Pending | Pending pilot | synthetic_demo | — |
| — | Combined Evidence Model | L3 | T1–T5 | 24h – 7d | — | — | Pending | Pending | Pending | Pending | Pending pilot | synthetic_demo | — |
R1 = Population Prior (reference). R2 = Personal Routine Baseline (the participant's own routine). Skill is reported in bits relative to each baseline. Cells marked Pending will be populated after the first pilot round.
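As a rough illustration of what "skill in bits relative to a baseline" can mean, the sketch below computes the mean log-score advantage of a system over a baseline for binary events. The scoring rule is an assumption: this page does not state which proper scoring rule the protocol uses, and the names `log_score_bits` and `skill_bits` are hypothetical.

```python
import math

def log_score_bits(prob_of_outcome: float) -> float:
    """Log score in bits for the probability assigned to what actually happened."""
    eps = 1e-12  # clamp to avoid log(0) on overconfident forecasts
    return math.log2(max(prob_of_outcome, eps))

def skill_bits(system_probs, baseline_probs, outcomes):
    """Mean log-score advantage of a system over a baseline, in bits per forecast.

    Each probability is the forecast chance that a binary event occurs;
    outcomes are 0 or 1. Assumed scoring rule, not the documented protocol.
    """
    if not outcomes or not (len(system_probs) == len(baseline_probs) == len(outcomes)):
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = []
    for p_sys, p_base, y in zip(system_probs, baseline_probs, outcomes):
        # Probability each forecaster gave to the realized outcome.
        q_sys = p_sys if y == 1 else 1.0 - p_sys
        q_base = p_base if y == 1 else 1.0 - p_base
        diffs.append(log_score_bits(q_sys) - log_score_bits(q_base))
    return math.fsum(diffs) / len(diffs)

# Toy data, invented for illustration only.
outcomes = [1, 0, 1, 1]
r1_probs = [0.5, 0.5, 0.5, 0.5]      # stand-in for a population prior
system_probs = [0.8, 0.2, 0.8, 0.8]  # stand-in for a candidate system

skill_vs_r1 = skill_bits(system_probs, r1_probs, outcomes)
print(f"Skill vs R1: {skill_vs_r1:.3f} bits per forecast")
```

Under this assumed rule a positive value means the system put more probability on what happened than the baseline did, and a system scored against itself yields exactly zero.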
Verification levels
Every row declares how the result was verified (distinct from whether the data is synthetic). Submitters may declare at most self_reported; only organizers assign higher levels. The board above is entirely synthetic_demo.
- synthetic_demo — seeded synthetic generator; no real-world meaning.
- self_reported — submitter ran scoring themselves; not organizer-checked.
- artifact_verified — organizer validated submitted forecasts/artifacts.
- organizer_reproduced — organizer re-ran/scored the system end-to-end.
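The rule that submitters may declare at most self_reported can be sketched as an ordered enum with a cap check. This is a minimal sketch, assuming the four levels are ordered as listed above; the class `Verification` and the function `assign_level` are hypothetical names, not part of the benchmark code.

```python
from enum import IntEnum

class Verification(IntEnum):
    """Verification levels, ordered as listed (ordering is an assumption)."""
    SYNTHETIC_DEMO = 0
    SELF_REPORTED = 1
    ARTIFACT_VERIFIED = 2
    ORGANIZER_REPRODUCED = 3

def assign_level(requested: Verification, by_organizer: bool) -> Verification:
    """Return the level to record, rejecting submitter claims above self_reported."""
    if not by_organizer and requested > Verification.SELF_REPORTED:
        raise PermissionError(
            f"submitters may declare at most {Verification.SELF_REPORTED.name.lower()}"
        )
    return requested

# A submitter may self-report; only an organizer may go higher.
submitter_level = assign_level(Verification.SELF_REPORTED, by_organizer=False)
organizer_level = assign_level(Verification.ARTIFACT_VERIFIED, by_organizer=True)
print(submitter_level.name.lower(), organizer_level.name.lower())
```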
This table is a hand-maintained pre-pilot template, not official scored output. Official results are generated by the benchmark CLI (pwm_bench leaderboard) and carry provenance (benchmark version, source commit, generation date). Submissions contain forecasts only — outcomes are organizer-held and scored separately.
Reading the leaderboard
Two baselines, not one
A system is measured against both the Population Prior (R1) and the Personal Routine Baseline (R2). Beating R1 alone shows only that the system knows generic base rates. Genuine person-specific skill requires beating R2, the participant's own routine.
Gates, not just scores
Calibration and Permutation Test are gates. A high score that fails calibration, or that survives when forecasts are scored against the wrong individual, does not count as person-specific skill.
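One way to picture the permutation gate: score each participant's forecasts against their own outcomes, then against outcomes reassigned from other participants, and check that the matched score clearly beats the mismatched ones. The sketch below is an assumption about how such a test could look, not the protocol's actual procedure; the log score, the threshold `alpha`, and the function names are all hypothetical.

```python
import math
import random

def mean_log_score(probs, outcomes):
    """Mean log score in bits for binary forecasts against 0/1 outcomes."""
    terms = [math.log2(max(p if y == 1 else 1.0 - p, 1e-12))
             for p, y in zip(probs, outcomes)]
    return math.fsum(terms) / len(terms)

def permutation_gate(forecasts, outcomes, n_perm=1000, alpha=0.05, seed=0):
    """Compare the matched score with scores under shuffled participant labels.

    forecasts and outcomes map participant id -> equal-length lists.
    Returns (matched_score, p_value, passed).
    """
    ids = sorted(forecasts)
    matched = math.fsum(mean_log_score(forecasts[i], outcomes[i]) for i in ids) / len(ids)
    rng = random.Random(seed)
    at_least_as_good = 0
    for _ in range(n_perm):
        shuffled = ids[:]
        rng.shuffle(shuffled)
        # Score each participant's forecasts against someone else's outcomes.
        score = math.fsum(
            mean_log_score(forecasts[i], outcomes[j]) for i, j in zip(ids, shuffled)
        ) / len(ids)
        if score >= matched:
            at_least_as_good += 1
    p_value = (at_least_as_good + 1) / (n_perm + 1)
    return matched, p_value, p_value < alpha

# Toy data, invented for illustration: participant k's event happens only in slot k.
people = list(range(6))
toy_outcomes = {k: [1 if slot == k else 0 for slot in people] for k in people}
specific = {k: [0.9 if slot == k else 0.1 for slot in people] for k in people}
generic = {k: [1 / 6] * 6 for k in people}  # same forecast for everyone

_, p_specific, specific_passes = permutation_gate(specific, toy_outcomes)
_, p_generic, generic_passes = permutation_gate(generic, toy_outcomes)
print(specific_passes, generic_passes)
```

In this toy setup the generic forecaster scores the same no matter whose outcomes it is matched with, so its score "survives" the shuffle and it fails the gate, while the person-specific forecaster loses score under every mismatch.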
TS-Personal task families
- TS-R — Near-term event realization and contact within fixed windows.
- TS-A — Which active project or topic receives attention in the next window.
- TS-D — Decisions, commitments, and response behavior.
- TS-X — Anticipating changes in target states and transitions over longer horizons.
Evidence tiers
- L0 — Calendar + communications metadata
- L1 — Text evidence
- L2 — Text + audio transcript
- L3 — Multimodal passive evidence
- L4 — Location / behavioral traces
- L5 — Physiological signals
The leaderboard renders from data/leaderboard.json. After the pilot, results are updated by editing that file alone — no code changes required.
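To make the "edit the data file, not the code" idea concrete, the sketch below renders table rows from JSON entries. The schema of data/leaderboard.json is not shown on this page, so every field name here (`system`, `evidence_tier`, `skill_vs_r1`, `skill_vs_r2`, `verification`) is a hypothetical stand-in, and `render_row` is not the site's actual renderer.

```python
import json

# Hypothetical entry shape; the real data/leaderboard.json schema is not documented here.
EXAMPLE_JSON = """
[
  {"system": "R1 · Population Prior", "evidence_tier": "L0",
   "skill_vs_r1": "Reference", "skill_vs_r2": null,
   "verification": "synthetic_demo"}
]
"""

COLUMNS = ["system", "evidence_tier", "skill_vs_r1", "skill_vs_r2", "verification"]
EMPTY = "\u2014"  # the dash the table uses for empty cells

def render_row(entry: dict) -> str:
    """Render one entry as a pipe-delimited table row, filling gaps with EMPTY."""
    cells = [EMPTY if entry.get(col) is None else str(entry[col]) for col in COLUMNS]
    return "| " + " | ".join(cells) + " |"

rows = [render_row(entry) for entry in json.loads(EXAMPLE_JSON)]
print("\n".join(rows))
```

With a layout like this, filling in a pilot result would mean changing a value such as `skill_vs_r2` in the JSON and leaving the rendering code untouched.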