Pre-pilot · PWM-0.1 · updated 2026-06-09

Leaderboard

TargetSpace is a multi-track apparatus. The leaderboard below is the TS-Personal track and reports calibrated, target-specific forecasting skill under the protocol. This is the pre-pilot infrastructure: the planned baseline and candidate systems are listed, with metrics pending the first pilot round.

Scores will be populated after the first pilot. No empirical results are currently reported. The rows below are the planned baseline and candidate systems for the TS-Personal track only, marked Pending pilot. TargetSpace does not report results it does not have, and no track other than TS-Personal is implemented.

Application tracks

TargetSpace shares one scoring spine across application tracks. Only the first is implemented today (and only as a synthetic demonstration); the others are planned or under research and are not yet benchmarks.

  • TS-Personal — Longitudinal Personal World Modeling (current). First and highest-value track. Apparatus exists (synthetic, pre-pilot). The table below is this track.
  • TS-Health — Physiology & Care-State Forecasting (planned). Strong sealed precedent and own-patient baselines; not yet implemented.
  • TS-Energy — Infrastructure, Grid & Market Forecasting (planned). Strong precedent (GEFCom; M4/M5) and a natural seasonal-naive baseline; not yet implemented.
  • TS-Robotics — Embodied Action-Conditioned Forecasting (research). Distinct regime, but a proper-scored protocol and strong own-routine baseline are not yet established.
  • TS-Enterprise — Organizational & Workflow Forecasting (research). No sealed-forecast benchmark today; weak own-routine baseline.

TargetSpace TS-Personal pre-pilot leaderboard. All scores pending the first pilot round.

| Rank | System | Evidence Tier | Task Set | Horizon | Participants | Forecasts | Skill vs R1 | Skill vs R2 | Calibration | Permutation Test | Status | Verification | Date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | R1 · Population Prior | L0 | T1–T5 | 24h – 7d | | | Reference | | Pending | N/A | Pending pilot | synthetic_demo | |
| | R2 · Personal Routine Baseline | L0 | T1–T5 | 24h – 7d | | | Pending | Reference | Pending | N/A | Pending pilot | synthetic_demo | |
| | Digital Exhaust Model | L0 | T1–T5 | 24h – 7d | | | Pending | Pending | Pending | Pending | Pending pilot | synthetic_demo | |
| | Chat History Model | L1 | T1–T5 | 24h – 7d | | | Pending | Pending | Pending | Pending | Pending pilot | synthetic_demo | |
| | Passive Observation Model | L3 | T1–T5 | 24h – 7d | | | Pending | Pending | Pending | Pending | Pending pilot | synthetic_demo | |
| | Combined Evidence Model | L3 | T1–T5 | 24h – 7d | | | Pending | Pending | Pending | Pending | Pending pilot | synthetic_demo | |

R1 = Population Prior (reference). R2 = own-routine baseline. Skill is reported in bits relative to each baseline. Cells marked Pending will be populated after the first pilot round.
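
As a rough illustration of what "skill in bits" can mean, the sketch below assumes binary outcomes scored with the logarithmic score, and defines skill as the mean difference in log2 probability that the system and the baseline assigned to the realized outcome. The protocol's actual scoring rule is not stated on this page, so this definition, the function name `skill_bits`, and the clipping constant `EPS` are all assumptions made for illustration.

```python
import math

EPS = 1e-9  # assumed clipping constant so log2 never sees exactly 0


def skill_bits(system_probs, baseline_probs, outcomes):
    """Mean log-score advantage of a system over a baseline, in bits.

    Each probability is the forecast chance that the binary outcome is 1.
    This definition is an assumption, not the protocol's confirmed rule.
    """
    if not (len(system_probs) == len(baseline_probs) == len(outcomes)):
        raise ValueError("inputs must have equal length")
    if not outcomes:
        raise ValueError("need at least one forecast")

    total = 0.0
    for p_sys, p_base, y in zip(system_probs, baseline_probs, outcomes):
        # Probability each forecaster assigned to what actually happened.
        q_sys = p_sys if y == 1 else 1.0 - p_sys
        q_base = p_base if y == 1 else 1.0 - p_base
        q_sys = min(max(q_sys, EPS), 1.0 - EPS)
        q_base = min(max(q_base, EPS), 1.0 - EPS)
        total += math.log2(q_sys) - math.log2(q_base)
    return total / len(outcomes)
```

Under this definition a positive value means the system put more probability on what actually happened than the baseline did, and a system identical to the baseline scores exactly zero. The same function would be called once with R1 forecasts and once with R2 forecasts as `baseline_probs`.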

Verification levels

Every row declares how the result was verified (distinct from whether the data is synthetic). Submitters may declare at most self_reported; only organizers assign higher tiers. The current board is entirely synthetic_demo.

  • synthetic_demo — seeded synthetic generator; no real-world meaning.
  • self_reported — submitter ran scoring themselves; not organizer-checked.
  • artifact_verified — organizer validated submitted forecasts/artifacts.
  • organizer_reproduced — organizer re-ran/scored the system end-to-end.
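
The rule that submitters may declare at most self_reported can be sketched as an ordered enumeration plus a permission check. This is a minimal illustration only: the class `Verification`, the function `resolve_verification`, and the `by_organizer` flag are hypothetical names, not part of the benchmark's code.

```python
from enum import IntEnum


class Verification(IntEnum):
    """Verification levels, ordered from weakest to strongest."""
    SYNTHETIC_DEMO = 0
    SELF_REPORTED = 1
    ARTIFACT_VERIFIED = 2
    ORGANIZER_REPRODUCED = 3


def resolve_verification(declared: str, by_organizer: bool) -> Verification:
    """Parse a declared level and reject tiers a submitter may not claim."""
    try:
        level = Verification[declared.upper()]
    except KeyError:
        raise ValueError(f"unknown verification level: {declared!r}") from None
    # Tiers above self_reported are reserved for organizers.
    if not by_organizer and level > Verification.SELF_REPORTED:
        raise PermissionError(f"only organizers may assign {declared!r}")
    return level
```

Using an integer-backed enum keeps the "at most" comparison a single inequality rather than a lookup table.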

This table is a hand-maintained pre-pilot template, not official scored output. Official results are generated by the benchmark CLI (pwm_bench leaderboard) and carry provenance (benchmark version, source commit, generation date). Submissions contain forecasts only — outcomes are organizer-held and scored separately.

Reading the leaderboard

Two baselines, not one

A system is measured against both the Population Prior (R1) and the Personal Routine Baseline (R2). Beating R1 only shows the system knows generic base rates. Genuine person-specific skill requires beating R2 — the participant's own routine.

Gates, not just scores

Calibration and Permutation Test are gates. A high score that fails calibration, or whose apparent skill persists when forecasts are scored against the wrong individual, does not count as person-specific skill.
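
The permutation gate described above can be sketched as follows. The sketch assumes binary outcomes, a mean log2 score, and that every participant has forecasts for the same aligned list of task slots, so that one participant's forecasts can be scored against another participant's outcomes. The function names, the data layout (dicts keyed by participant id), and the number of permutations are hypothetical choices for illustration, not the protocol's specification.

```python
import math
import random


def mean_log_score(probs, outcomes, eps=1e-9):
    """Mean log2 probability assigned to the realized binary outcomes."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        q = p if y == 1 else 1.0 - p
        total += math.log2(min(max(q, eps), 1.0 - eps))
    return total / len(outcomes)


def permutation_p_value(forecasts, outcomes, n_perm=1000, seed=0):
    """Share of shuffled pairings that score at least as well as the true one.

    forecasts / outcomes: dicts mapping participant id -> equal-length lists.
    A small value suggests the forecasts carry person-specific information.
    """
    ids = sorted(forecasts)
    rng = random.Random(seed)

    def score(assignment):
        # Score participant a's forecasts against participant b's outcomes.
        pairs = zip(ids, assignment)
        return sum(mean_log_score(forecasts[a], outcomes[b]) for a, b in pairs) / len(ids)

    observed = score(ids)
    hits = 0
    for _ in range(n_perm):
        shuffled = ids[:]
        rng.shuffle(shuffled)
        if score(shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (1 + hits) / (1 + n_perm)
```

If the forecasts are equally good no matter whose outcomes they are scored against, the returned value approaches 1, which is the situation the gate is meant to reject. Calibration would need a separate check and is not covered by this sketch.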

TS-Personal task families

  • TS-R: Near-term event realization and contact within fixed windows.
  • TS-A: Which active project or topic receives attention in the next window.
  • TS-D: Decisions, commitments, and response behavior.
  • TS-X: Anticipating changes in target states and transitions over longer horizons.

Evidence tiers

  • L0 — Calendar + communications metadata
  • L1 — Text evidence
  • L2 — Text + audio transcript
  • L3 — Multimodal passive evidence
  • L4 — Location / behavioral traces
  • L5 — Physiological signals

The leaderboard renders from data/leaderboard.json. After the pilot, results are updated by editing that file alone — no code changes required.
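
To illustrate how a data-driven board lets results change without code changes, the sketch below renders rows from JSON and shows any missing score as Pending. The schema of data/leaderboard.json is not documented on this page, so the field names in `COLUMNS` and the inline example row are hypothetical, and the JSON is parsed from a string rather than read from the real file.

```python
import json

# Hypothetical field names; the real file's schema is not shown on this page.
COLUMNS = ["system", "evidence_tier", "skill_vs_r1", "skill_vs_r2", "status", "verification"]


def render(rows):
    """Render leaderboard rows as pipe-separated text, one line per row."""
    lines = [" | ".join(COLUMNS)]
    for row in rows:
        cells = []
        for col in COLUMNS:
            value = row.get(col)
            # A null or absent score is displayed as "Pending".
            cells.append("Pending" if value is None else str(value))
        lines.append(" | ".join(cells))
    return "\n".join(lines)


example = json.loads("""
[{"system": "Chat History Model", "evidence_tier": "L1",
  "skill_vs_r1": null, "skill_vs_r2": null,
  "status": "Pending pilot", "verification": "synthetic_demo"}]
""")
print(render(example))
```

With this layout, filling in a score after the pilot would mean replacing a null in the data file, while the rendering function stays untouched.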