A multi-track benchmark apparatus · Pre-pilot · PWM-0.1
TargetSpace
Benchmarking Target-Specific Forecasting Under Partial Observation
Forecast the target, not the average
Pre-pilot: no empirical results yet; synthetic demonstration only.
TargetSpace asks whether a system can forecast what this tracked target will do or become next, beyond a population prior and beyond the target's own routine. It is an architecture-neutral, multi-track apparatus; personal world modeling is the first track.
Understanding is the capability. Forecasting is the measurement.
Retrospective agreement is not enough. Understanding must be held accountable to the future.
Prospective
Forecasts are sealed and timestamped before outcomes occur.
Target-specific
Skill must collapse under a permutation specificity test to count.
Calibrated
Scores use proper scoring rules and calibration gates.
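As a minimal sketch of what "sealed and timestamped" could look like in practice, the snippet below commits to a forecast with a salted SHA-256 hash and later checks the opened forecast against that commitment. The function names, the forecast fields, and the use of the local clock are illustrative assumptions, not the TargetSpace protocol. A self-reported time proves nothing on its own, so a real round would presumably lodge the digest with an independent party before the outcome window opens.

```python
import hashlib
import json
import secrets
from datetime import datetime, timezone


def _digest(forecast: dict, nonce: str) -> str:
    # Canonical JSON so the same forecast always hashes to the same value.
    payload = json.dumps(forecast, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((nonce + payload).encode("utf-8")).hexdigest()


def seal_forecast(forecast: dict) -> tuple[dict, str]:
    """Return (public seal record, private nonce). Hypothetical helper."""
    nonce = secrets.token_hex(16)  # keeps low-entropy forecasts unguessable
    seal = {
        "sha256": _digest(forecast, nonce),
        # Assumption: local UTC clock; not a trusted timestamp by itself.
        "sealed_at": datetime.now(timezone.utc).isoformat(),
    }
    return seal, nonce


def verify_seal(seal: dict, forecast: dict, nonce: str) -> bool:
    """True only if the opened forecast matches what was committed."""
    return secrets.compare_digest(_digest(forecast, nonce), seal["sha256"])


# Usage: publish `seal` before the outcome, reveal `forecast` and `nonce` after.
forecast = {"target": "participant-01", "event": "example-event", "p": 0.7}
seal, nonce = seal_forecast(forecast)
```

Any change to the forecast after sealing, however small, produces a different digest and fails verification.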
The claim, stated plainly
Understanding a partially observed system is a latent capability — you cannot read it off a transcript. TargetSpace makes it observable and falsifiable by requiring a system to commit, in advance, to calibrated probabilistic forecasts about that target's near future, and then scoring those forecasts against two baselines: what population-level knowledge predicts (R1), and what the target's own routine predicts (R2). The headline quantity is Skill — the improvement over those baselines, in bits, that survives calibration and the permutation specificity gate.
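To make "improvement over those baselines, in bits" concrete, here is a hedged sketch that scores binary forecasts with the logarithmic scoring rule (base 2) and reports the margin over each baseline. Restricting to binary events, clipping probabilities, and taking the smaller of the two margins as the headline number are assumptions made for illustration; the calibration and permutation gates are not modeled here.

```python
import math


def mean_log_score_bits(probs, outcomes, eps=1e-6):
    """Mean log2 probability assigned to what actually happened."""
    terms = []
    for p, y in zip(probs, outcomes):
        q = p if y else 1.0 - p            # probability given to the realized outcome
        q = min(max(q, eps), 1.0 - eps)    # clip so a 0 or 1 forecast stays finite
        terms.append(math.log2(q))
    return math.fsum(terms) / len(terms)


def skill_bits(system, r1, r2, outcomes):
    """Margin of the system over the population prior (R1) and own routine (R2)."""
    s = mean_log_score_bits(system, outcomes)
    over_r1 = s - mean_log_score_bits(r1, outcomes)
    over_r2 = s - mean_log_score_bits(r2, outcomes)
    # Assumption: the system must beat both baselines, so report the smaller margin.
    return {"over_r1": over_r1, "over_r2": over_r2, "skill": min(over_r1, over_r2)}


# Synthetic example: four binary events for one target.
outcomes = [1, 0, 1, 1]
system = [0.9, 0.1, 0.8, 0.7]
r1 = [0.5, 0.5, 0.5, 0.5]      # population prior
r2 = [0.6, 0.4, 0.6, 0.6]      # target's own routine
result = skill_bits(system, r1, r2, outcomes)
```

A positive margin means the system assigned more probability, on average, to what actually happened than the baseline did; a margin of zero means it added nothing beyond that baseline.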
Novelty: a conjunction, not a primitive
TargetSpace does not claim to be first at architecture-neutral evaluation, calibrated sealed forecasting, evidence ablation, or goal/collapse-timing — each is adopted and cited. The contribution is the conjunction, anchored on the own-routine baseline (R2) and the permutation specificity gate, around one question: is the forecast about the target, or about the average?
The supported novelty is the conjunction:
- A strong instance-specific own-routine baseline (R2) the system must beat — not just a population prior.
- A permutation specificity gate: skill must collapse when forecasts are scored against the wrong target.
- Latent target-state transitions as the object — not event outcomes, surface generation, or imitation.
- An evidence-tier ablation that measures which evidence adds target-specific skill.
- Architecture-neutral evaluation across heterogeneous predictor classes on identical sealed instances.
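The permutation specificity gate in the list above can be sketched as a simple permutation test: score each target's forecasts against its own outcomes, then against randomly reassigned targets, and require the matched score to stand out from that null distribution. The data layout, the pooled log score, and the `alpha` threshold are assumptions for illustration; the actual gate may define "collapse" differently.

```python
import math
import random


def log_score_bits(probs, outcomes, eps=1e-6):
    """Mean log2 probability assigned to the realized binary outcomes."""
    terms = []
    for p, y in zip(probs, outcomes):
        q = p if y else 1.0 - p
        terms.append(math.log2(min(max(q, eps), 1.0 - eps)))
    return math.fsum(terms) / len(terms)


def permutation_gate(forecasts, outcomes, n_perm=1000, alpha=0.05, seed=0):
    """forecasts, outcomes: dicts keyed by target id, holding aligned lists."""
    rng = random.Random(seed)
    targets = sorted(forecasts)

    def pooled(assignment):
        # Score target t's forecasts against the outcomes of assignment[t].
        return math.fsum(
            log_score_bits(forecasts[t], outcomes[s])
            for t, s in zip(targets, assignment)
        ) / len(targets)

    matched = pooled(targets)
    null = []
    for _ in range(n_perm):
        shuffled = targets[:]
        rng.shuffle(shuffled)          # may keep some targets in place
        null.append(pooled(shuffled))

    # Ties count against the system; add-one keeps the p-value above zero.
    hits = sum(score >= matched for score in null)
    p_value = (1 + hits) / (n_perm + 1)
    return {
        "matched": matched,
        "null_mean": math.fsum(null) / n_perm,
        "p_value": p_value,
        "passes": p_value <= alpha,
    }
```

A system that issues the same population-average forecast for every target scores identically under any reassignment, so it cannot pass this gate no matter how well calibrated it is. With very few targets the test has little power, because random reassignments frequently reproduce the true matching.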
On Michael Levin. Michael Levin's work motivates the hypothesis that goal-state is a privileged object of measurement in adaptive systems; it supplies conceptual vocabulary and inspiration, not a benchmark. The operational machinery of TargetSpace draws from theory-of-mind evaluation, goal recognition, proper scoring, and forecasting. (Michael Levin, developmental biology, is distinct from Sergey Levine, reinforcement learning.) The broader target-state idea spans people, organizations, and biological and engineered systems; TargetSpace is a multi-track apparatus whose first instantiation is individual humans (the TS-Personal track).
Related work we concede and build on
| Work | What it does | Relation to TargetSpace |
|---|---|---|
| KnowMe-Bench (2026) | Formalizes person understanding around goals/motivations. | Closest framing and principal novelty threat — but retrospective QA over fictional literary autobiographies, LLM-as-judge; not prospective, not calibrated, not a real tracked individual. |
| EgoToM (2025) | Infers a camera-wearer's goal, belief, and future actions from egocentric video. | Closest prospective goal inference — but generic single clips, minutes-scale, no calibration, no persistent individual. |
| Park et al. — Generative Agent Simulations of 1,000 People (2024) | Per-person agents replicate real individuals' survey answers. | Closest person-specific work — but retrospective/concurrent replication, not future goal-state forecasting. |
| Machine Theory of Mind / ToMnet (2018) | Infer an agent's latent goal from behavior, predict future actions. | The intellectual template — but synthetic gridworld agents, not a real named individual. (RL/ToM lineage — Sergey Levine / Rabinowitz.) |
| PersonaMem (2025) | Tracks evolving user preferences across sessions. | Closest on state-over-time — but preference tracking, not goal-state-transition forecasting; synthetic users; not calibrated. |
| Goal Recognition Design / WCD (2014) | Minimal evidence before a goal is identifiable. | Closest metric — structurally the evidence-sufficiency question TargetSpace asks of a tracked target (the collapse-timing metrics). |
| ForecastBench (2024) | Calibrated forecasting of world events. | Provides the calibrated-forecasting machinery — applied to world events, never to a person's goals. |
What you can do here
Benchmark
What TargetSpace is and is not, and how Skill is defined.
Protocol
The forecast unit, schemas, scoring, and integrity controls that make it executable.
Leaderboard
Pre-pilot infrastructure with the planned baseline systems — scores pending.
Tasks
The T1–T5 task families: next contact, event realization, response, attention, deviation.
Evidence ladder
Six tiers L0–L5, and why measuring which evidence helps is a core contribution.
Pilot
The first prospective round: 5 participants, 30 days, sealed forecasts, 4 systems.