Pre-pilot · PWM-0.1 · updated 2026-06-09

Leaderboard

TargetSpace is a multi-track apparatus. The leaderboard below is the TS-Personal track and reports calibrated, target-specific forecasting skill under the protocol. This is the pre-pilot infrastructure: the planned baseline and candidate systems are listed, with metrics pending the first pilot round.

Application tracks

TargetSpace shares one scoring spine across application tracks. Only the first is implemented today (and only as a synthetic demonstration); the others are planned or under research and are not yet benchmarks.

TS-Personal — Longitudinal Personal World Modeling (current) — First and highest-value track. Apparatus exists (synthetic, pre-pilot). The table below is this track.
TS-Health — Physiology & Care-State Forecasting (planned) — Strong sealed-forecast precedent and own-patient baselines; not yet implemented.
TS-Energy — Infrastructure, Grid & Market Forecasting (planned) — Strong precedent (GEFCom; M4/M5) and a natural seasonal-naive baseline; not yet implemented.
TS-Robotics — Embodied Action-Conditioned Forecasting (research) — Distinct regime, but a proper-scored protocol and a strong own-routine baseline are not yet established.
TS-Enterprise — Organizational & Workflow Forecasting (research) — No sealed-forecast benchmark today; weak own-routine baseline.

TargetSpace TS-Personal pre-pilot leaderboard. All scores pending the first pilot round.
Rank	System	Evidence Tier	Task Set	Horizon	Participants	Forecasts	Skill vs R1	Skill vs R2	Calibration	Permutation Test	Status	Verification	Date
—	R1 · Population Prior	L0	T1–T5	24h – 7d	—	—	Reference	—	Pending	N/A	Pending pilot	synthetic_demo	—
—	R2 · Personal Routine Baseline	L0	T1–T5	24h – 7d	—	—	Pending	Reference	Pending	N/A	Pending pilot	synthetic_demo	—
—	Digital Exhaust Model	L0	T1–T5	24h – 7d	—	—	Pending	Pending	Pending	Pending	Pending pilot	synthetic_demo	—
—	Chat History Model	L1	T1–T5	24h – 7d	—	—	Pending	Pending	Pending	Pending	Pending pilot	synthetic_demo	—
—	Passive Observation Model	L3	T1–T5	24h – 7d	—	—	Pending	Pending	Pending	Pending	Pending pilot	synthetic_demo	—
—	Combined Evidence Model	L3	T1–T5	24h – 7d	—	—	Pending	Pending	Pending	Pending	Pending pilot	synthetic_demo	—

R1 = Population Prior (reference). R2 = own-routine baseline. Skill is reported in bits relative to each baseline. Cells marked Pending will be populated after the first pilot round.
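As a rough illustration of what "skill in bits relative to a baseline" can mean, the sketch below assumes skill is the mean log2 ratio between the probability the system assigned to each realized outcome and the probability the baseline assigned to it. That log-score reading is an assumption, not a statement of the protocol's scoring rule, and the function name `skill_bits` is hypothetical.

```python
import math
from typing import Sequence


def skill_bits(system_probs: Sequence[float], baseline_probs: Sequence[float]) -> float:
    """Mean log2 advantage of the system over a baseline.

    Each entry is the probability that was assigned to the outcome that
    actually occurred. Assumed log-score definition; hypothetical helper.
    """
    if not system_probs or len(system_probs) != len(baseline_probs):
        raise ValueError("need two non-empty sequences of equal length")
    total = 0.0
    for p_sys, p_base in zip(system_probs, baseline_probs):
        if not (0.0 < p_sys <= 1.0 and 0.0 < p_base <= 1.0):
            raise ValueError("probabilities must lie in (0, 1]")
        total += math.log2(p_sys / p_base)
    return total / len(system_probs)


# Positive values mean the system put more probability on what happened
# than the baseline did; negative values mean it put less.
skill_vs_r1 = skill_bits([0.5, 0.5], [0.25, 0.25])
```

Under this reading, the same function would be called once with R1's probabilities and once with R2's to fill the two skill columns.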

Verification levels

Every row declares how the result was verified (distinct from whether the data is synthetic). Submitters may declare at most self_reported; only organizers assign higher levels. Every row on the current board is synthetic_demo.

synthetic_demo — seeded synthetic generator; no real-world meaning.
self_reported — submitter ran scoring themselves; not organizer-checked.
artifact_verified — organizer validated submitted forecasts/artifacts.
organizer_reproduced — organizer re-ran/scored the system end-to-end.
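The rule that submitters may declare at most self_reported can be sketched as an ordered enum plus a single check. The ordering below follows the list above and is assumed to be ascending; the class and function names are hypothetical and not part of the benchmark code.

```python
from enum import IntEnum


class VerificationLevel(IntEnum):
    # Assumed ascending order, mirroring the list above.
    SYNTHETIC_DEMO = 0
    SELF_REPORTED = 1
    ARTIFACT_VERIFIED = 2
    ORGANIZER_REPRODUCED = 3


def may_assign(level: VerificationLevel, assigned_by_organizer: bool) -> bool:
    """Return True if this party is allowed to attach this level to a row."""
    if assigned_by_organizer:
        return True
    # Submitters are capped at self_reported.
    return level <= VerificationLevel.SELF_REPORTED
```

For example, `may_assign(VerificationLevel.ARTIFACT_VERIFIED, assigned_by_organizer=False)` returns `False`, while the same level assigned by an organizer is accepted.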

This table is a hand-maintained pre-pilot template, not official scored output. Official results are generated by the benchmark CLI (pwm_bench leaderboard) and carry provenance (benchmark version, source commit, generation date). Submissions contain forecasts only — outcomes are organizer-held and scored separately.

Reading the leaderboard

Two baselines, not one

A system is measured against both the Population Prior (R1) and the Personal Routine Baseline (R2). Beating R1 alone shows only that the system improves on generic base rates. Genuine person-specific skill also requires beating R2, the participant's own routine.

Gates, not just scores

Calibration and Permutation Test are gates. A high score that fails calibration, or that survives when forecasts are scored against the wrong individual, does not count as person-specific skill.
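A minimal sketch of the permutation idea, under stated assumptions: forecasts are binary-event probabilities, every participant has the same number of forecast windows, and the score is the mean log2 probability of the realized outcomes. Each participant's forecasts are re-scored against a randomly chosen participant's outcomes, and the p-value is the share of shuffles that score at least as well as the true pairing. The function names, the data layout, and the add-one correction are illustrative choices, not the protocol's specification.

```python
import math
import random
from typing import Dict, List


def mean_log2_score(probs: List[float], outcomes: List[int]) -> float:
    """Mean log2 probability assigned to the realized binary outcomes."""
    eps = 1e-12  # keep log2 finite for forecasts of exactly 0 or 1
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1.0 - eps)
        total += math.log2(p if y == 1 else 1.0 - p)
    return total / len(probs)


def permutation_p_value(
    forecasts: Dict[str, List[float]],
    outcomes: Dict[str, List[int]],
    n_perm: int = 1000,
    seed: int = 0,
) -> float:
    """Share of shuffled pairings that score at least as well as the true one."""
    ids = sorted(forecasts)

    def score(outcome_owner: List[str]) -> float:
        # Score participant a's forecasts against participant b's outcomes.
        parts = [
            mean_log2_score(forecasts[a], outcomes[b])
            for a, b in zip(ids, outcome_owner)
        ]
        return sum(parts) / len(parts)

    observed = score(ids)  # true pairing: each person against themselves
    rng = random.Random(seed)
    at_least_as_good = 0
    for _ in range(n_perm):
        shuffled = ids[:]
        rng.shuffle(shuffled)
        if score(shuffled) >= observed:
            at_least_as_good += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_good + 1) / (n_perm + 1)
```

A system whose forecasts do equally well when paired with the wrong person yields a p-value near 1 and would fail the gate; the pass threshold itself is a protocol decision that this sketch does not set.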

TS-Personal task families

TS-R — Near-term event realization and contact within fixed windows.
TS-A — Which active project or topic receives attention in the next window.
TS-D — Decisions, commitments, and response behavior.
TS-X — Anticipating changes in target states and transitions over longer horizons.

Evidence tiers

L0 — Calendar + communications metadata
L1 — Text evidence
L2 — Text + audio transcript
L3 — Multimodal passive evidence
L4 — Location / behavioral traces
L5 — Physiological signals

The leaderboard renders from data/leaderboard.json. After the pilot, results are updated by editing that file alone — no code changes required.
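To illustrate the data-driven rendering described above, here is a small sketch that turns a JSON document into tab-separated rows. The schema is hypothetical: the key names (`rows`, `system`, `evidence_tier`, `skill_vs_r1`, `status`, `verification`) are assumptions loosely modeled on the table columns, not the actual layout of data/leaderboard.json.

```python
import json
from typing import List

# Hypothetical file content; the real schema may differ.
EXAMPLE_JSON = """
{
  "rows": [
    {
      "system": "Digital Exhaust Model",
      "evidence_tier": "L0",
      "skill_vs_r1": null,
      "status": "Pending pilot",
      "verification": "synthetic_demo"
    }
  ]
}
"""

# Assumed column order for this sketch only.
COLUMNS = ["system", "evidence_tier", "skill_vs_r1", "status", "verification"]


def render_rows(raw_json: str) -> List[str]:
    """Render each row as a tab-separated line; missing values show as Pending."""
    data = json.loads(raw_json)
    lines = []
    for row in data["rows"]:
        cells = ["Pending" if row.get(col) is None else str(row[col]) for col in COLUMNS]
        lines.append("\t".join(cells))
    return lines


rendered = render_rows(EXAMPLE_JSON)
```

With a layout like this, filling in a pilot result would mean replacing a `null` with a number in the data file while the rendering code stays unchanged.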