Governance & ethics

Privacy and human-subjects commitments

PWM-Bench observes and forecasts individuals. That makes privacy, consent, and human-subjects governance load-bearing parts of the design — not compliance text bolted on afterward.

PWM-Bench is designed to evaluate systems that observe and forecast, not systems that intervene on the participant during the scoring window.

Principles

Informed consent

Participation requires explicit, informed consent covering what is observed, how it is used, and what leaves the participant's control.

Revocable participation

Participants can withdraw at any time. Withdrawal stops observation and forecasting on that participant.
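
One way to make consent and withdrawal enforceable is to gate every observation and forecast on a consent record. The sketch below is illustrative only: `ConsentRecord`, `ConsentRegistry`, and `may_observe` are hypothetical names, not part of PWM-Bench, and the rule that withdrawal takes effect at the exact withdrawal timestamp is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Optional


@dataclass
class ConsentRecord:
    """Hypothetical record of one participant's consent state."""
    participant_id: str
    granted_at: datetime
    withdrawn_at: Optional[datetime] = None

    def is_active(self, at: datetime) -> bool:
        # Consent is active from the grant time until (not including)
        # the withdrawal time. This boundary choice is an assumption.
        if at < self.granted_at:
            return False
        return self.withdrawn_at is None or at < self.withdrawn_at


class ConsentRegistry:
    """Hypothetical in-memory registry; a real deployment would persist this."""

    def __init__(self) -> None:
        self._records: Dict[str, ConsentRecord] = {}

    def grant(self, participant_id: str, at: datetime) -> None:
        self._records[participant_id] = ConsentRecord(participant_id, at)

    def withdraw(self, participant_id: str, at: datetime) -> None:
        record = self._records.get(participant_id)
        # Only the first withdrawal is recorded; later calls are no-ops.
        if record is not None and record.withdrawn_at is None:
            record.withdrawn_at = at

    def may_observe(self, participant_id: str, at: datetime) -> bool:
        # Unknown participants have never consented, so the answer is False.
        record = self._records.get(participant_id)
        return record is not None and record.is_active(at)


registry = ConsentRegistry()
t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
t1 = datetime(2024, 1, 5, tzinfo=timezone.utc)
t2 = datetime(2024, 1, 9, tzinfo=timezone.utc)

registry.grant("p-001", t0)
registry.withdraw("p-001", t1)
```

With this shape, an observation or forecasting step would call `may_observe` first and skip the participant whenever it returns `False`, so withdrawal stops both activities from that point on.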

Federated execution

Evaluation runs where the data lives. Models are brought to the evidence; raw evidence is not centralized.

Raw data stays with the participant

Raw evidence remains under participant control. Only resolved outcomes and aggregate metrics leave the client.

Aggregate-only reporting

Results are reported in aggregate. Individual forecasts, raw streams, and per-participant detail are not published.
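
The three principles above (federated execution, raw data staying with the participant, aggregate-only reporting) can be sketched as a two-stage computation. This is a minimal illustration under stated assumptions, not the PWM-Bench implementation: the function and class names are hypothetical, the Brier score is used only as a stand-in metric, and the minimum cohort size `MIN_COHORT` is an invented threshold that the text does not specify.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence

MIN_COHORT = 5  # hypothetical threshold; not specified by the source


@dataclass(frozen=True)
class ClientSummary:
    """What leaves a client: a score total and a count, no raw evidence."""
    squared_error_sum: float
    n_forecasts: int


def evaluate_on_client(forecasts: Sequence[float],
                       outcomes: Sequence[int]) -> ClientSummary:
    """Runs locally, next to the data.

    forecasts are probabilities in [0, 1]; outcomes are resolved as 0 or 1.
    """
    if len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must have the same length")
    squared = sum((p - y) ** 2 for p, y in zip(forecasts, outcomes))
    return ClientSummary(squared_error_sum=squared, n_forecasts=len(forecasts))


def aggregate_report(summaries: List[ClientSummary],
                     min_cohort: int = MIN_COHORT) -> Optional[Dict[str, float]]:
    """Pools client summaries into one mean Brier score.

    Returns None when too few participants contribute, so that a small
    cohort cannot expose per-participant detail.
    """
    contributing = [s for s in summaries if s.n_forecasts > 0]
    if len(contributing) < min_cohort:
        return None
    total_squared = sum(s.squared_error_sum for s in contributing)
    total_n = sum(s.n_forecasts for s in contributing)
    return {
        "participants": len(contributing),
        "mean_brier": total_squared / total_n,
    }


perfect = evaluate_on_client([1.0, 0.0], [1, 0])
coin_flip = evaluate_on_client([0.5, 0.5], [1, 0])
too_small = aggregate_report([coin_flip] * 4)
report = aggregate_report([coin_flip] * 5)
```

Only `ClientSummary` objects cross the client boundary in this sketch; the forecasts and the evidence behind them never reach `aggregate_report`.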

Third-party consent

Evidence frequently captures other people. Third-party consent and minimization are first-class concerns, not afterthoughts.

No manipulation during evaluation

Systems may observe and forecast, but must not intervene on the participant during the scoring window.

Institutional review

Institutional review (e.g., IRB / ethics board) is recommended before any human-subjects deployment.

High-sensitivity evidence

The upper rungs of the evidence ladder are extraordinarily sensitive. Video, audio, screens, and behavioral traces can capture:

  • bystanders who have not consented
  • children
  • home and other addresses
  • medical and financial information
  • on-screen private content

Because of this, richer evidence tiers are admissible only under federated execution, strict minimization, and aggregate-only reporting. The scientific value of the evidence ladder does not override these constraints; it is bounded by them.

Observe, do not intervene

A benchmark that rewarded systems for changing the participant would measure influence, not understanding — and would create an incentive to manipulate. PWM-Bench therefore scores forecasts about a future the system did not act to shape. Any system that intervenes on the participant during the scoring window is out of protocol.
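
The out-of-protocol rule could be checked mechanically if a system's actions were logged with timestamps and labels. The sketch below assumes such a log exists and that each action has already been labeled as `"observe"`, `"forecast"`, or `"intervene"`. Those labels, the names `ScoringWindow` and `is_in_protocol`, and the half-open window boundary are all assumptions made for illustration. Deciding what counts as an intervention is the hard part and is not addressed here.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Tuple


@dataclass(frozen=True)
class ScoringWindow:
    """Hypothetical scoring window, treated as half-open: [start, end)."""
    start: datetime
    end: datetime

    def __post_init__(self) -> None:
        if self.start >= self.end:
            raise ValueError("window start must be before window end")

    def contains(self, moment: datetime) -> bool:
        return self.start <= moment < self.end


def is_in_protocol(action_log: Iterable[Tuple[datetime, str]],
                   window: ScoringWindow) -> bool:
    """True if no logged action labeled 'intervene' falls inside the window."""
    return not any(
        kind == "intervene" and window.contains(moment)
        for moment, kind in action_log
    )


window = ScoringWindow(datetime(2024, 1, 1), datetime(2024, 1, 8))

clean_log = [
    (datetime(2024, 1, 2), "observe"),
    (datetime(2024, 1, 3), "forecast"),
    (datetime(2024, 1, 9), "intervene"),  # after the window closes
]
violating_log = clean_log + [(datetime(2024, 1, 4), "intervene")]
```

Here `clean_log` stays in protocol because its only intervention happens after the window closes, while `violating_log` is out of protocol.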