Task families

Tasks

PWM-Bench tasks are forecasting questions with explicit answer spaces and deterministic resolution rules. T1–T5 are defined for the first pilot round; future families extend the same machinery to longer horizons and richer state.

T1. Next contact (PWM-S, T1_next_contact)
Forecast which person or entity the participant will contact next, or whether they will contact a given entity within a fixed window.
Example question
Who will the participant initiate contact with first tomorrow morning (08:00–12:00)?
Answer space
A short, person-specific list of frequent contacts plus explicit “other” and “no contact” options.
["contact_A","contact_B","contact_C","other","no_contact"]
Resolution
Resolved from communications metadata (first outbound message/call to a contact within the window). Adjudicated by the federated client; only the outcome label is reported.
Baseline
R1 population contact base rates; R2 the participant's own recent contact frequency/recency.
Difficulty
Heavy-tailed contact distributions make R2 strong. Skill must come from context that shifts the next contact away from routine.
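The T1 resolution rule above (first outbound message or call to a contact within the window) can be sketched as follows. This is a minimal illustration, not the federated client's actual adjudication code: the function name `resolve_next_contact`, the `(timestamp, contact_id)` event shape, and the half-open window `[window_start, window_end)` are all assumptions made for the example.

```python
from datetime import datetime


def resolve_next_contact(outbound_events, window_start, window_end, tracked_contacts):
    """Return the T1 outcome label for one forecast window.

    outbound_events: iterable of (timestamp, contact_id) pairs for outbound
    messages/calls (hypothetical shape).
    tracked_contacts: the person-specific contacts listed in the answer space.
    """
    # Keep only events inside the window; the half-open interval is an assumption.
    in_window = [(ts, contact) for ts, contact in outbound_events
                 if window_start <= ts < window_end]
    if not in_window:
        return "no_contact"
    # The earliest outbound event in the window decides the label.
    _, first_contact = min(in_window, key=lambda event: event[0])
    # Contacts outside the answer space collapse into "other".
    return first_contact if first_contact in tracked_contacts else "other"
```

Only the returned label would need to leave the client; the event list itself stays local.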
T2. Event realization (PWM-S, T2_event_realization)
Forecast whether a calendar or planned event actually occurs, is cancelled, or moves.
Example question
Will the 14:00 meeting on the participant's calendar tomorrow occur as scheduled, move, or be cancelled?
Answer space
Categorical: occurs as scheduled / moves / cancels.
["occurs","moves","cancels"]
Resolution
Resolved from calendar state and communications metadata at the resolution time. Status is adjudicated against an explicit rubric (e.g., a >30 min shift counts as “moves”).
Baseline
R1 population cancellation/move rates; R2 the participant's historical event-realization rates.
Difficulty
Base rates are informative; genuine skill requires reading signals that a specific event is at risk.
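The T2 rubric can be illustrated with a small sketch built around the one threshold the text gives (a shift of more than 30 minutes counts as “moves”). The function name `resolve_event`, the `cancelled` flag, and the choice to treat earlier and later shifts alike are assumptions for illustration; the real rubric may cover more cases.

```python
from datetime import datetime, timedelta

# From the rubric example in the text: a >30 min shift counts as "moves".
MOVE_THRESHOLD = timedelta(minutes=30)


def resolve_event(scheduled_start: datetime, actual_start: datetime, cancelled: bool) -> str:
    """Return the T2 outcome label for one calendar event (hypothetical helper)."""
    if cancelled:
        return "cancels"
    # Assumption: shifts in either direction are measured the same way.
    shift = abs(actual_start - scheduled_start)
    # Strictly greater than the threshold, matching ">30 min".
    if shift > MOVE_THRESHOLD:
        return "moves"
    return "occurs"
```

With the strict comparison, an event that starts exactly 30 minutes late still resolves to "occurs".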
T3. Response behavior (PWM-D, T3_response_behavior)
Forecast whether the participant replies to a given message and, if so, the response-latency band.
Example question
Given this received message, will the participant reply within 1h, within 24h, or not within 24h?
Answer space
Ordered bands: reply <1h / reply 1–24h / no reply within 24h.
["reply_lt_1h","reply_1_24h","no_reply_24h"]
Resolution
Resolved from outbound reply timestamps relative to the triggering message. Bands are fixed in advance in the protocol.
Baseline
R1 population reply-latency distributions; R2 the participant's per-contact reply history.
Difficulty
Strong per-contact priors (R2) set a high bar; skill comes from situational context (workload, location, time of day).
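As a sketch of the T3 resolution, the function below maps reply timestamps onto the three bands. The name `resolve_reply_band` and the boundary handling (a reply at exactly 1h lands in the 1–24h band; a reply at exactly 24h counts as no reply) are assumptions, since the text says only that the bands are fixed in advance in the protocol.

```python
from datetime import datetime, timedelta
from typing import Optional


def resolve_reply_band(received_at: datetime, replied_at: Optional[datetime]) -> str:
    """Return the T3 latency band for one triggering message (hypothetical helper).

    replied_at is the timestamp of the first outbound reply, or None if no
    reply was observed.
    """
    if replied_at is None:
        return "no_reply_24h"
    latency = replied_at - received_at
    if latency < timedelta(hours=1):
        return "reply_lt_1h"
    if latency < timedelta(hours=24):
        return "reply_1_24h"
    # A reply that arrives after the 24h horizon resolves as no reply.
    return "no_reply_24h"
```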
T4. Attention allocation (PWM-A, T4_attention_allocation)
Forecast which active project or topic will receive the most attention in the next window.
Example question
Which active project will receive the most of the participant's working attention tomorrow?
Answer space
A person-specific set of active projects/topics plus “other.”
["project_1","project_2","family","admin","other"]
Resolution
Resolved from the participant's own end-of-window labelling and/or activity evidence, against a pre-registered attribution rule. Only the resolved label leaves the client.
Baseline
R1 population topic priors (weak); R2 the participant's recent attention distribution.
Difficulty
Attention is volatile and partly intention-driven. This is where richer evidence (L2–L3) is hypothesised to help most.
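One possible shape for a T4 attribution rule is sketched below: sum the attention minutes per label, fold anything outside the answer space into “other”, and pick the largest total. The text does not specify the pre-registered rule, so the minutes-based input, the function name `resolve_attention`, and the tie-break by answer-space order are all assumptions.

```python
def resolve_attention(minutes_by_label, answer_space):
    """Return the T4 label with the most attention in the window (hypothetical rule).

    minutes_by_label: mapping of activity label -> minutes, from end-of-window
    labelling and/or activity evidence.
    answer_space: ordered list of allowed labels; it must include "other".
    """
    totals = {label: 0.0 for label in answer_space}
    for label, minutes in minutes_by_label.items():
        # Labels outside the answer space are attributed to "other".
        key = label if label in totals else "other"
        totals[key] += minutes
    # max() returns the first maximal item, so ties resolve in
    # answer-space order (an assumption, not a documented rule).
    return max(answer_space, key=lambda label: totals[label])
```

For example, with the answer space shown above, `{"project_1": 120, "project_2": 200, "gardening": 30}` would resolve to `"project_2"`.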
T5. Routine deviation (PWM-S, T5_routine_deviation)
Forecast deviations from the participant's established routine.
Example question
Will the participant's day tomorrow deviate from their typical weekday routine on a pre-specified dimension (e.g., start time, commute, core block)?
Answer space
Binary or categorical deviation on a pre-registered dimension.
["no_deviation","minor_deviation","major_deviation"]
Resolution
Resolved by comparing the realized day to a routine model on the pre-specified dimension, against a fixed threshold rubric.
Baseline
R1 population deviation rates; R2 the participant's own deviation base rate (a deliberately strong baseline).
Difficulty
By construction R2 is hard to beat. Skill requires anticipating the specific causes of deviation, not its average frequency.
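A fixed threshold rubric for T5 might look like the sketch below, using start time (in minutes after midnight) as the pre-specified dimension. The 30- and 90-minute thresholds, the function name `resolve_deviation`, and the use of a single typical value as the routine model are purely illustrative assumptions; the text does not give the actual rubric.

```python
def resolve_deviation(typical_minutes, realized_minutes,
                      minor_threshold=30, major_threshold=90):
    """Return the T5 label for one day on one dimension (hypothetical rubric).

    typical_minutes: the routine model's typical value, e.g. usual start time
    in minutes after midnight.
    realized_minutes: the value observed on the forecast day.
    The threshold defaults are illustrative, not taken from the protocol.
    """
    gap = abs(realized_minutes - typical_minutes)
    if gap >= major_threshold:
        return "major_deviation"
    if gap >= minor_threshold:
        return "minor_deviation"
    return "no_deviation"
```

With these example thresholds, a usual 09:00 start (540) and a realized 09:40 start (580) would resolve to `"minor_deviation"`.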

Future task families

The same forecast-unit, sealing, and scoring machinery extends to higher-stakes, longer-horizon questions:

Decisions

Forecast the outcome of an upcoming decision the participant faces.

Commitments

Forecast whether a stated commitment is kept, deferred, or dropped.

Long-horizon planning

Forecast multi-week plan realization and re-planning.

Drift

Forecast gradual shifts in priorities and attention over weeks.

Goal-state transitions

Anticipate discrete transitions between goal states (PWM-X).

Tasks render from data/tasks.json.