Pith Number

pith:WQ66NAS7

pith:2026:WQ66NAS7QRJQWP4HPYLRQ2N4ST
not attested · not anchored · not stored · refs resolved

FutureSim: Replaying World Events to Evaluate Adaptive Agents

Ameya Prabhu, Arvindh Arun, Jonas Geiping, Maksym Andriushchenko, Moritz Hardt, Nikhil Chandak, Shashwat Goel, Steffen Staab

FutureSim evaluates AI agents by replaying real historical events in chronological order and shows that even the best agent reaches only 25% accuracy when forecasting future events.

arxiv:2605.15188 v1 · 2026-05-14 · cs.LG · cs.AI · cs.CL

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{WQ66NAS7QRJQWP4HPYLRQ2N4ST}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle: live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

FutureSim reveals a clear separation in agents' capabilities: the best agent reaches 25% accuracy, and many agents have a worse Brier skill score than making no prediction at all.

C2 · weakest assumption

That replaying real historical events chronologically without future knowledge leakage accurately measures an agent's adaptive capabilities in open-ended real-world settings.

C3 · one-line summary

FutureSim is a benchmark that replays real news from January to March 2026 for AI agents to forecast events, with top accuracy at 25% and some agents worse than no-prediction baselines on Brier skill score.
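
The claims above lean on the Brier skill score, so a small worked example may help. This page does not state the exact scoring rule FutureSim uses; the sketch below uses the textbook binary Brier score and assumes, purely for illustration, that the reference forecast is a constant probability of 0.5. The function names, the `reference_prob` parameter, and the sample numbers are all hypothetical and are not taken from the paper.

```python
from typing import Sequence


def brier_score(forecasts: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("need equally sized, non-empty sequences")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


def brier_skill_score(
    forecasts: Sequence[float],
    outcomes: Sequence[int],
    reference_prob: float = 0.5,  # assumed reference forecast, not from the paper
) -> float:
    """Return 1 - BS / BS_ref: positive beats the reference, negative is worse."""
    bs = brier_score(forecasts, outcomes)
    bs_ref = brier_score([reference_prob] * len(outcomes), outcomes)
    return 1.0 - bs / bs_ref


# Made-up resolutions and forecasts, not FutureSim data.
outcomes = [1, 0, 1, 0]
reasonable = [0.9, 0.2, 0.6, 0.7]
overconfident = [0.1, 0.9, 0.2, 0.8]

print(brier_skill_score(reasonable, outcomes))     # positive: better than the reference
print(brier_skill_score(overconfident, outcomes))  # negative: worse than the reference
```

Under these assumptions the first forecaster scores roughly 0.3 and the second roughly -1.9, which shows how a confidently wrong agent can land below a baseline that never commits to anything.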

Receipt and verification
First computed: 2026-05-17T21:40:25.093331Z
Last reissued: 2026-05-17T21:57:18.480573Z
Builder: pith-number-builder-2026-05-17-v1
Signature: unsigned_v0
Schema: pith-number/v1.0

Canonical hash

b43de6825f84530b3f877e171869bc94eab8d04bd392bd457dcb524b154deb46

Aliases

arxiv: 2605.15188 · arxiv_version: 2605.15188v1 · pith_short_12: WQ66NAS7QRJQ · pith_short_16: WQ66NAS7QRJQWP4H · pith_short_8: WQ66NAS7
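
The identifiers shown on this page are consistent with a simple relationship: the 26-character Pith Number matches the first 26 characters of the RFC 4648 Base32 encoding of the canonical hash, and each `pith_short_N` alias is its first N characters. This is inferred from the values on this page, not from a published specification, so treat the sketch below as an observation rather than the builder's actual algorithm. The function name `pith_number_from_hash` is hypothetical.

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "b43de6825f84530b3f877e171869bc94eab8d04bd392bd457dcb524b154deb46"
PITH_NUMBER = "WQ66NAS7QRJQWP4HPYLRQ2N4ST"


def pith_number_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the digest bytes and keep the first `length` characters."""
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


derived = pith_number_from_hash(CANONICAL_HASH)
print(derived == PITH_NUMBER)  # True for the values on this page

# Short aliases as plain prefixes of the derived identifier.
aliases = {f"pith_short_{n}": derived[:n] for n in (8, 12, 16)}
print(aliases)
```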
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/WQ66NAS7QRJQWP4HPYLRQ2N4ST \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: b43de6825f84530b3f877e171869bc94eab8d04bd392bd457dcb524b154deb46
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "f79f3e8bfe083d3301c4d1dae9a620b2c49a6264f13323f074fd97ad4e825d76",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.CL"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2026-05-14T17:59:28Z",
    "title_canon_sha256": "131de9b90c4210166213f7230b50e3513bf7fc6742b5a6d98d95edbdd3897002"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.15188",
    "kind": "arxiv",
    "version": 1
  }
}
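
For readers without `curl` and `jq`, the same check can be sketched offline in Python. It applies the serialization from the shell command above (sorted keys, compact separators, no ASCII escaping, UTF-8) to the record as printed on this page. Whether the digest matches depends on the printed JSON being a faithful copy of what the API returns, which is an assumption here; the helper name `canonical_sha256` is hypothetical.

```python
import hashlib
import json

EXPECTED = "b43de6825f84530b3f877e171869bc94eab8d04bd392bd457dcb524b154deb46"

# Canonical record copied from this page.
RECORD_JSON = """
{
  "metadata": {
    "abstract_canon_sha256": "f79f3e8bfe083d3301c4d1dae9a620b2c49a6264f13323f074fd97ad4e825d76",
    "cross_cats_sorted": ["cs.AI", "cs.CL"],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2026-05-14T17:59:28Z",
    "title_canon_sha256": "131de9b90c4210166213f7230b50e3513bf7fc6742b5a6d98d95edbdd3897002"
  },
  "schema_version": "1.0",
  "source": {"id": "2605.15188", "kind": "arxiv", "version": 1}
}
"""


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and compact separators, then SHA-256 the bytes."""
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


record = json.loads(RECORD_JSON)
digest = canonical_sha256(record)
print(digest)
print("matches page:", digest == EXPECTED)
```

Because the keys are sorted before hashing, the digest does not depend on the order or whitespace in which the record was written down, which is what lets a mirror recompute it independently.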