A multi-track benchmark apparatus · Pre-pilot · PWM-0.1

TargetSpace

Benchmarking Target-Specific Forecasting Under Partial Observation

Forecast the target, not the average

Pre-pilot: no empirical results yet; synthetic demonstration only.

TargetSpace asks whether a system can forecast what this tracked target will do or become next, beyond a population prior and beyond the target's own routine. It is an architecture-neutral, multi-track apparatus; personal world modeling is the first track.

Understanding is the capability. Forecasting is the measurement.

Retrospective agreement is not enough. Understanding must be held accountable to the future.

Prospective

Forecasts are sealed and timestamped before outcomes occur.
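
One way to picture sealing is a hash commitment: the forecaster publishes a digest and a timestamp before the outcome, then reveals the forecast afterwards. The sketch below is illustrative only and is not taken from the TargetSpace protocol; the function names, the salt, and the forecast fields (`target_id`, `event`, `p`) are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone


def seal_forecast(forecast: dict, salt: str, sealed_at: datetime) -> dict:
    """Commit to a forecast: SHA-256 over salt + canonical JSON, plus a timestamp."""
    # Canonical serialization so the same forecast always hashes identically.
    payload = json.dumps(forecast, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256((salt + payload).encode("utf-8")).hexdigest()
    return {"digest": digest, "sealed_at": sealed_at.isoformat()}


def verify_seal(record: dict, forecast: dict, salt: str, outcome_at: datetime) -> bool:
    """True only if the revealed forecast matches the digest and predates the outcome."""
    sealed_at = datetime.fromisoformat(record["sealed_at"])
    matches = seal_forecast(forecast, salt, sealed_at)["digest"] == record["digest"]
    return matches and sealed_at < outcome_at


# Hypothetical example values, for illustration only.
sealed = datetime(2030, 1, 1, tzinfo=timezone.utc)
outcome = datetime(2030, 1, 8, tzinfo=timezone.utc)
forecast = {"target_id": "t-001", "event": "changes_routine", "p": 0.2}
record = seal_forecast(forecast, "example-salt", sealed)
```

Editing the forecast after sealing changes the digest, and a seal dated at or after the outcome fails the time check, so either kind of tampering is detectable at reveal time.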

Target-specific

To count, skill must collapse when forecasts are scored against the wrong target (the permutation specificity test).

Calibrated

Scores use proper scoring rules and calibration gates.
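
A calibration gate can be sketched as a binned reliability check: group forecasts by stated probability and compare each bin's mean forecast with the observed frequency. The example below computes expected calibration error (ECE) for binary outcomes. The choice of ECE, the bin count, and the `max_ece` threshold are assumptions for illustration, not values from the TargetSpace protocol.

```python
def expected_calibration_error(probs, outcomes, n_bins=10):
    """Weighted mean gap between stated probability and observed frequency per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        k = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[k].append((p, y))
    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        freq = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(mean_p - freq)
    return ece


def passes_calibration_gate(probs, outcomes, max_ece=0.05):
    """Hypothetical gate: pass only if ECE is at or below the threshold."""
    return expected_calibration_error(probs, outcomes) <= max_ece


# Synthetic data: forecasts of 0.2 and 0.8 that come true 20% and 80% of the time.
well_calibrated_p = [0.2] * 10 + [0.8] * 10
well_calibrated_y = [1, 1] + [0] * 8 + [1] * 8 + [0, 0]
# Synthetic data: near-certain forecasts that come true only half the time.
overconfident_p = [0.99] * 10
overconfident_y = [1] * 5 + [0] * 5
```

On these synthetic inputs the first forecaster passes the gate and the overconfident one fails it, even though both could look plausible case by case.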

The claim, stated plainly

TargetSpace does not measure whether a model sounds plausible. It measures whether the model can forecast a specific target's future better than population knowledge and the target's own routine.

Understanding a partially observed system is a latent capability — you cannot read it off a transcript. TargetSpace makes it observable and falsifiable by requiring a system to commit, in advance, to calibrated probabilistic forecasts about that target's near future, and then scoring those forecasts against two baselines: what population-level knowledge predicts (R1), and what the target's own routine predicts (R2). The headline quantity is Skill — the improvement over those baselines, in bits, that survives calibration and the permutation specificity gate.
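
As a minimal sketch of Skill in bits, assume binary outcomes and the logarithmic score (a proper scoring rule) in base 2. Skill against a baseline is then the difference in mean log score between the system and that baseline. Reporting the smaller of the two differences as the headline is an assumption made here for illustration; the page does not state how R1 and R2 are combined, and the function names are hypothetical.

```python
import math


def mean_log2_score(probs, outcomes, eps=1e-9):
    """Mean base-2 log score for binary outcomes; higher (closer to 0) is better."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += math.log2(p if y == 1 else 1.0 - p)
    return total / len(probs)


def skill_bits(model, r1, r2, outcomes):
    """Improvement in bits over the population baseline (R1) and own-routine baseline (R2)."""
    s_model = mean_log2_score(model, outcomes)
    over_r1 = s_model - mean_log2_score(r1, outcomes)
    over_r2 = s_model - mean_log2_score(r2, outcomes)
    # Assumption: the headline is the gain over the stronger baseline.
    return {"over_r1": over_r1, "over_r2": over_r2, "headline": min(over_r1, over_r2)}


# Synthetic example: four forecasts for one target.
outcomes = [1, 0, 1, 1]
result = skill_bits(
    model=[0.8, 0.2, 0.8, 0.8],
    r1=[0.5, 0.5, 0.5, 0.5],   # population prior knows nothing target-specific
    r2=[0.6, 0.4, 0.6, 0.6],   # own-routine baseline is already somewhat informed
    outcomes=outcomes,
)
```

In this synthetic case the system gains about 0.68 bits over R1 but only about 0.42 bits over R2, which shows why the own-routine baseline is the harder one to beat.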

Novelty: a conjunction, not a primitive

TargetSpace does not claim to be first at architecture-neutral evaluation, calibrated sealed forecasting, evidence ablation, or goal/collapse-timing — each is adopted and cited. The contribution is the conjunction, anchored on the own-routine baseline (R2) and the permutation specificity gate, around one question: is the forecast about the target, or about the average?

To our knowledge, no existing benchmark scores prospective, calibrated, proper-scored forecasts of a tracked target system's latent target-state transitions under a strong instance-specific own-routine baseline (R2) and an instance-permutation specificity gate, with an evidence-tier ablation, across architecture classes.

The supported novelty is the conjunction:

A strong instance-specific own-routine baseline (R2) the system must beat — not just a population prior.
A permutation specificity gate: skill must collapse when forecasts are scored against the wrong target.
Latent target-state transitions as the object — not event outcomes, surface generation, or imitation.
An evidence-tier ablation that measures which evidence adds target-specific skill.
Architecture-neutral evaluation across heterogeneous predictor classes on identical sealed instances.
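
The permutation specificity gate in the list above can be sketched as a permutation test: score each target's forecasts against a randomly reassigned target's outcomes and check that the observed skill is rarely matched. This is a sketch under assumptions, not the protocol's procedure: one binary forecast per target, log scoring, a baseline that travels with the outcome's target, and hypothetical `n_perm` and `seed` parameters.

```python
import math
import random


def log2_score(p, y, eps=1e-9):
    """Base-2 log score of probability p for binary outcome y."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log2(p if y == 1 else 1.0 - p)


def mean_skill_bits(forecasts, baselines, outcomes):
    """Mean gain in bits of the forecasts over the baselines."""
    gains = [log2_score(p, y) - log2_score(b, y)
             for p, b, y in zip(forecasts, baselines, outcomes)]
    return sum(gains) / len(gains)


def permutation_specificity(forecasts, baselines, outcomes, n_perm=2000, seed=0):
    """Return (observed skill, p-value) under random reassignment of targets."""
    rng = random.Random(seed)
    observed = mean_skill_bits(forecasts, baselines, outcomes)
    idx = list(range(len(outcomes)))
    at_least = 0
    for _ in range(n_perm):
        rng.shuffle(idx)
        # Forecast for target i is scored against target idx[i]'s outcome and baseline.
        perm_y = [outcomes[j] for j in idx]
        perm_b = [baselines[j] for j in idx]
        if mean_skill_bits(forecasts, perm_b, perm_y) >= observed:
            at_least += 1
    return observed, (at_least + 1) / (n_perm + 1)


# Synthetic targets: half have outcome 1, half 0.
outcomes = [1, 0] * 10
baselines = [0.5] * 20
specific = [0.9 if y == 1 else 0.1 for y in outcomes]  # tracks each target
generic = [0.5] * 20                                   # same forecast for everyone

specific_skill, specific_p = permutation_specificity(specific, baselines, outcomes)
generic_skill, generic_p = permutation_specificity(generic, baselines, outcomes)
```

A small p-value means the skill disappears when targets are swapped, so the forecasts carry target-specific information. The generic forecaster scores identically under every permutation, so its p-value is 1 and it cannot pass the gate.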

On Michael Levin. Michael Levin's work motivates the hypothesis that goal-state is a privileged object of measurement in adaptive systems; it supplies conceptual vocabulary and inspiration, not a benchmark. The operational machinery of TargetSpace draws from theory-of-mind evaluation, goal recognition, proper scoring, and forecasting. (Michael Levin, developmental biology, is distinct from Sergey Levine, reinforcement learning.) The broader target-state idea spans people, organizations, and biological and engineered systems; TargetSpace is a multi-track apparatus whose first instantiation is individual humans (the TS-Personal track).

Related work we concede and build on

Work	What it does	Relation to TargetSpace
KnowMe-Bench (2026)	Formalizes person understanding around goals/motivations.	Closest framing and principal novelty threat — but retrospective QA over fictional literary autobiographies, LLM-as-judge; not prospective, not calibrated, not a real tracked individual.
EgoToM (2025)	Infers a camera-wearer's goal, belief, and future actions from egocentric video.	Closest prospective goal inference — but generic single clips, minutes-scale, no calibration, no persistent individual.
Park et al. — Generative Agent Simulations of 1,000 People (2024)	Per-person agents replicate real individuals' survey answers.	Closest person-specific work — but retrospective/concurrent replication, not future goal-state forecasting.
Machine Theory of Mind / ToMnet (2018)	Infer an agent's latent goal from behavior, predict future actions.	The intellectual template — but synthetic gridworld agents, not a real named individual. (RL/ToM lineage — Sergey Levine / Rabinowitz.)
PersonaMem (2025)	Tracks evolving user preferences across sessions.	Closest on state-over-time — but preference tracking, not goal-state-transition forecasting; synthetic users; not calibrated.
Goal Recognition Design / WCD (2014)	Minimal evidence before a goal is identifiable.	Closest metric — structurally the evidence-sufficiency question TargetSpace asks of a tracked target (the collapse-timing metrics).
ForecastBench (2024)	Calibrated forecasting of world events.	Provides the calibrated-forecasting machinery — applied to world events, never to a person's goals.

What you can do here

Benchmark

Protocol

Leaderboard

Tasks

Evidence ladder

Pilot