Pith Number

pith:WQ66NAS7

pith:2026:WQ66NAS7QRJQWP4HPYLRQ2N4ST

not attested · not anchored · not stored · refs resolved

FutureSim: Replaying World Events to Evaluate Adaptive Agents

Ameya Prabhu, Arvindh Arun, Jonas Geiping, Maksym Andriushchenko, Moritz Hardt, Nikhil Chandak, Shashwat Goel, Steffen Staab

FutureSim evaluates AI agents by replaying real historical events in chronological order and shows that even the best agent achieves only 25% accuracy when predicting future events.

arxiv:2605.15188 v1 · 2026-05-14 · cs.LG · cs.AI · cs.CL


Add to your LaTeX paper

\usepackage{pith}
\pithnumber{WQ66NAS7QRJQWP4HPYLRQ2N4ST}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 strongest claim

FutureSim reveals a clear separation in agents' capabilities: the best agent reaches 25% accuracy, and many agents have a worse Brier skill score than making no prediction at all.

C2 weakest assumption

That replaying real historical events chronologically without future knowledge leakage accurately measures an agent's adaptive capabilities in open-ended real-world settings.

C3 one-line summary

FutureSim is a benchmark that replays real news from January to March 2026 for AI agents to forecast events, with top accuracy at 25% and some agents worse than no-prediction baselines on Brier skill score.
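The claims above compare agents against a no-prediction baseline using the Brier skill score. As a rough illustration of what "worse than no prediction" means, here is a minimal sketch using the textbook binary Brier score and an uninformative reference forecast of 0.5. Both the binary setting and the 0.5 reference are assumptions made for this example; the paper may define its score and baseline differently. The function names are hypothetical.

```python
from typing import Sequence


def brier_score(forecasts: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between probabilities and 0/1 outcomes (lower is better)."""
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must be non-empty and equal in length")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


def brier_skill_score(
    forecasts: Sequence[float],
    outcomes: Sequence[int],
    reference: float = 0.5,  # assumed uninformative baseline, not taken from the paper
) -> float:
    """1 - BS / BS_ref: positive beats the reference, negative is worse than it."""
    bs = brier_score(forecasts, outcomes)
    bs_ref = brier_score([reference] * len(outcomes), outcomes)
    if bs_ref == 0.0:
        raise ValueError("reference forecast is perfect; skill score is undefined")
    return 1.0 - bs / bs_ref


# A calibrated forecaster scores above zero; a confidently wrong one scores below zero.
good = brier_skill_score([0.9, 0.2], [1, 0])
bad = brier_skill_score([0.9, 0.9], [0, 0])
print(good > 0, bad < 0)
```

Under this definition, a negative score means the forecaster would have done better by always answering with the reference probability, which is the sense in which an agent can do worse than not predicting.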


Cited by

1 paper in Pith

Agentic Time Machine as an Infrastructure for Future-Event Forecasting

Receipt and verification

First computed	2026-05-17T21:40:25.093331Z
Last reissued	2026-05-17T21:57:18.480573Z
Builder	pith-number-builder-2026-05-17-v1
Signature	unsigned_v0
Schema	pith-number/v1.0

Canonical hash

b43de6825f84530b3f877e171869bc94eab8d04bd392bd457dcb524b154deb46

Aliases

arxiv: 2605.15188 · arxiv_version: 2605.15188v1 · pith_short_12: WQ66NAS7QRJQ · pith_short_16: WQ66NAS7QRJQWP4H · pith_short_8: WQ66NAS7
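The short aliases listed above are prefixes of the full 26-character identifier. In this record, the full identifier also coincides with the first 26 characters of the Base32 (RFC 4648) encoding of the canonical hash. The sketch below reproduces both relationships. It is an observation about this one record, not a documented derivation rule, and the function names are hypothetical.

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "b43de6825f84530b3f877e171869bc94eab8d04bd392bd457dcb524b154deb46"
PITH_ID = "WQ66NAS7QRJQWP4HPYLRQ2N4ST"


def pith_id_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the digest bytes and keep the leading characters.

    Assumption: matches this record; not confirmed as the general scheme.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


def short_aliases(pith_id: str) -> dict:
    """Build the pith_short_N aliases as N-character prefixes of the identifier."""
    return {f"pith_short_{n}": pith_id[:n] for n in (8, 12, 16)}


print(pith_id_from_hash(CANONICAL_HASH) == PITH_ID)
print(short_aliases(PITH_ID))
```

If the relationship holds in general, a client could recompute the identifier locally from a verified hash instead of trusting the displayed value.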


Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/WQ66NAS7QRJQWP4HPYLRQ2N4ST \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: b43de6825f84530b3f877e171869bc94eab8d04bd392bd457dcb524b154deb46
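The same check can be run offline against a saved copy of the canonical record. The sketch below mirrors the canonicalization used in the one-liner above (sorted keys, compact separators, `ensure_ascii=False`, UTF-8 bytes, SHA-256). The record literal is copied from this page, and the function name `canonical_sha256` is hypothetical. Whether the digest matches the published hash depends on the served record being identical to the copy shown here.

```python
import hashlib
import json

EXPECTED = "b43de6825f84530b3f877e171869bc94eab8d04bd392bd457dcb524b154deb46"

# Canonical record as displayed on this page.
RECORD = {
    "metadata": {
        "abstract_canon_sha256": "f79f3e8bfe083d3301c4d1dae9a620b2c49a6264f13323f074fd97ad4e825d76",
        "cross_cats_sorted": ["cs.AI", "cs.CL"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.LG",
        "submitted_at": "2026-05-14T17:59:28Z",
        "title_canon_sha256": "131de9b90c4210166213f7230b50e3513bf7fc6742b5a6d98d95edbdd3897002",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.15188", "kind": "arxiv", "version": 1},
}


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and no whitespace, then hash the UTF-8 bytes."""
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


digest = canonical_sha256(RECORD)
print(digest)
print("matches published hash:", digest == EXPECTED)
```

Because keys are sorted before hashing, the digest does not depend on the key order in which the JSON was stored or transmitted.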

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "f79f3e8bfe083d3301c4d1dae9a620b2c49a6264f13323f074fd97ad4e825d76",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.CL"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2026-05-14T17:59:28Z",
    "title_canon_sha256": "131de9b90c4210166213f7230b50e3513bf7fc6742b5a6d98d95edbdd3897002"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.15188",
    "kind": "arxiv",
    "version": 1
  }
}