A multi-track benchmark apparatus · Pre-pilot · PWM-0.1

TargetSpace

Benchmarking Target-Specific Forecasting Under Partial Observation

Forecast the target, not the average

Pre-pilot: no empirical results yet; synthetic demonstration only

TargetSpace asks whether a system can forecast what this tracked target will do or become next, beyond a population prior and beyond the target's own routine. It is an architecture-neutral, multi-track apparatus; personal world modeling is the first track.

Understanding is the capability. Forecasting is the measurement.
Retrospective agreement is not enough. Understanding must be held accountable to the future.
Pre-pilot. TargetSpace is currently a benchmark proposal and pre-registered research agenda; all reported numbers are synthetic demonstrations only. No empirical results are reported yet, and only the personal track is implemented. The first empirical round, PWM-Pilot-Audio (audio-first, in the personal track), is pending pre-registration. The benchmark apparatus is public; raw target data is private. The reference code repository is private during the pre-pilot.
01

Prospective

Forecasts are sealed and timestamped before outcomes occur.

02

Target-specific

Skill must collapse under a permutation specificity test to count.

03

Calibrated

Scores use proper scoring rules and calibration gates.
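One way to picture the "sealed and timestamped" property from the first card is a hash commitment: the forecaster publishes a digest before the outcome and reveals the forecast afterwards. The sketch below is illustrative only; the function names, the JSON payload layout, and the use of SHA-256 with a nonce are assumptions, not the TargetSpace protocol.

```python
import hashlib
import json


def seal_forecast(forecast: dict, sealed_at: str, nonce: str) -> str:
    """Return a SHA-256 digest committing to a forecast (hypothetical helper).

    The payload layout is an assumption made for illustration. Keys are
    sorted so the same forecast always serializes to the same bytes.
    """
    payload = json.dumps(
        {"forecast": forecast, "sealed_at": sealed_at, "nonce": nonce},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def verify_seal(forecast: dict, sealed_at: str, nonce: str, digest: str) -> bool:
    """Check that a revealed forecast matches the digest published earlier."""
    return seal_forecast(forecast, sealed_at, nonce) == digest


# The digest is published before the outcome; the forecast is revealed after.
digest = seal_forecast({"p_event": 0.7}, "2026-01-01T00:00:00Z", "nonce-1")
```

Any change to the forecast, the timestamp, or the nonce after sealing produces a different digest, so a revealed forecast that fails `verify_seal` cannot be the one that was committed.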

The claim, stated plainly

TargetSpace does not measure whether a model sounds plausible. It measures whether the model can forecast a specific target's future better than population knowledge and the target's own routine.

Understanding a partially observed system is a latent capability — you cannot read it off a transcript. TargetSpace makes it observable and falsifiable by requiring a system to commit, in advance, to calibrated probabilistic forecasts about that target's near future, and then scoring those forecasts against two baselines: what population-level knowledge predicts (R1), and what the target's own routine predicts (R2). The headline quantity is Skill — the improvement over those baselines, in bits, that survives calibration and the permutation specificity gate.
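As a rough illustration of what "improvement over those baselines, in bits" could look like, the sketch below uses the logarithmic scoring rule (a proper scoring rule) in base 2. The exact Skill definition is not given here, so the choice to compare against the stronger of R1 and R2, and all function names, are assumptions. The calibration and permutation gates are omitted.

```python
import math


def mean_log_score_bits(probs_of_outcome):
    """Mean log2 probability assigned to the outcomes that actually occurred.

    Higher is better; a perfect forecaster scores 0 bits.
    """
    for p in probs_of_outcome:
        if not 0.0 < p <= 1.0:
            raise ValueError("each probability must be in (0, 1]")
    return sum(math.log2(p) for p in probs_of_outcome) / len(probs_of_outcome)


def skill_bits(system, r1, r2):
    """Mean improvement in bits over the stronger baseline (an assumption).

    Each argument lists, per sealed forecast, the probability that the
    predictor assigned to the realized outcome.
    """
    best_baseline = max(mean_log_score_bits(r1), mean_log_score_bits(r2))
    return mean_log_score_bits(system) - best_baseline


# Synthetic numbers, for illustration only.
system = [0.5, 0.5]   # system's probability on what actually happened
r1 = [0.25, 0.25]     # population prior (R1)
r2 = [0.25, 0.5]      # target's own routine (R2)
print(skill_bits(system, r1, r2))  # 0.5
```

In this toy case the system averages -1 bit, R1 averages -2 bits, and R2 averages -1.5 bits, so the margin over the stronger baseline (R2) is 0.5 bits per forecast.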

Novelty: a conjunction, not a primitive

TargetSpace does not claim to be first at architecture-neutral evaluation, calibrated sealed forecasting, evidence ablation, or goal/collapse-timing — each is adopted and cited. The contribution is the conjunction, anchored on the own-routine baseline (R2) and the permutation specificity gate, around one question: is the forecast about the target, or about the average?

To our knowledge, no existing benchmark scores prospective, calibrated, proper-scored forecasts of a tracked target system's latent target-state transitions under a strong instance-specific own-routine baseline (R2) and an instance-permutation specificity gate, with an evidence-tier ablation, across architecture classes.

The supported novelty is the conjunction:

  1. A strong instance-specific own-routine baseline (R2) the system must beat — not just a population prior.
  2. A permutation specificity gate: skill must collapse when forecasts are scored against the wrong target.
  3. Latent target-state transitions as the object — not event outcomes, surface generation, or imitation.
  4. An evidence-tier ablation that measures which evidence adds target-specific skill.
  5. Architecture-neutral evaluation across heterogeneous predictor classes on identical sealed instances.
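The permutation specificity gate in item 2 can be sketched as follows: score each target's forecasts against its own outcomes, then against another target's outcomes, and compare. This is a minimal sketch assuming binary outcomes on a shared question set; the data layout, the clipping constant, the use of derangements, and the function names are all hypothetical, and no pass/fail threshold is implied.

```python
import math
import random


def mean_log2_score(probs, outcomes):
    """Mean log2 score for binary outcomes; probs are P(outcome == 1)."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, 1e-6), 1 - 1e-6)  # clip to avoid log(0)
        total += math.log2(p if y == 1 else 1 - p)
    return total / len(probs)


def matched_score(forecasts, outcomes):
    """Average score when each target's forecasts meet its own outcomes."""
    return sum(mean_log2_score(forecasts[t], outcomes[t]) for t in forecasts) / len(forecasts)


def permuted_scores(forecasts, outcomes, n_perm=200, seed=0):
    """Scores when every target's forecasts are scored against a wrong target."""
    targets = list(forecasts)
    if len(targets) < 2:
        raise ValueError("need at least two targets to permute")
    rng = random.Random(seed)
    scores = []
    for _ in range(n_perm):
        shuffled = targets[:]
        # Reshuffle until no target is paired with its own outcomes.
        while any(a == b for a, b in zip(targets, shuffled)):
            rng.shuffle(shuffled)
        pairs = zip(targets, shuffled)
        scores.append(sum(mean_log2_score(forecasts[t], outcomes[s]) for t, s in pairs) / len(targets))
    return scores


def specificity_gap(forecasts, outcomes, n_perm=200, seed=0):
    """Matched score minus mean permuted score, in bits."""
    perm = permuted_scores(forecasts, outcomes, n_perm=n_perm, seed=seed)
    return matched_score(forecasts, outcomes) - sum(perm) / len(perm)


# Synthetic data, for illustration only.
forecasts = {"a": [0.9, 0.1], "b": [0.1, 0.9]}
outcomes = {"a": [1, 0], "b": [0, 1]}
gap = specificity_gap(forecasts, outcomes, n_perm=10)
```

A large positive gap means the forecasts lose their advantage when paired with the wrong target, which is the behavior the gate looks for. A gap near zero would suggest the forecasts describe the average rather than the individual.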

On Michael Levin. Michael Levin's work motivates the hypothesis that goal-state is a privileged object of measurement in adaptive systems; it supplies conceptual vocabulary and inspiration, not a benchmark. The operational machinery of TargetSpace draws from theory-of-mind evaluation, goal recognition, proper scoring, and forecasting. (Michael Levin, developmental biology, is distinct from Sergey Levine, reinforcement learning.) The broader target-state idea spans people, organizations, and biological and engineered systems; TargetSpace is a multi-track apparatus whose first instantiation is individual humans (the TS-Personal track).

Related work we concede and build on

| Work | What it does | Relation to TargetSpace |
| --- | --- | --- |
| KnowMe-Bench (2026) | Formalizes person understanding around goals/motivations. | Closest framing and principal novelty threat, but retrospective QA over fictional literary autobiographies, LLM-as-judge; not prospective, not calibrated, not a real tracked individual. |
| EgoToM (2025) | Infers a camera-wearer's goal, belief, and future actions from egocentric video. | Closest prospective goal inference, but generic single clips, minutes-scale, no calibration, no persistent individual. |
| Park et al., Generative Agent Simulations of 1,000 People (2024) | Per-person agents replicate real individuals' survey answers. | Closest person-specific work, but retrospective/concurrent replication, not future goal-state forecasting. |
| Machine Theory of Mind / ToMnet (2018) | Infer an agent's latent goal from behavior, predict future actions. | The intellectual template, but synthetic gridworld agents, not a real named individual. (RL/ToM lineage: Sergey Levine / Rabinowitz.) |
| PersonaMem (2025) | Tracks evolving user preferences across sessions. | Closest on state-over-time, but preference tracking, not goal-state-transition forecasting; synthetic users; not calibrated. |
| Goal Recognition Design / WCD (2014) | Minimal evidence before a goal is identifiable. | Closest metric: structurally the evidence-sufficiency question TargetSpace asks of a tracked target (the collapse-timing metrics). |
| ForecastBench (2024) | Calibrated forecasting of world events. | Provides the calibrated-forecasting machinery, but applied to world events, never to a person's goals. |

What you can do here