arxiv: 2606.21013 · v1 · pith:2I3ZXVADnew · submitted 2026-06-19 · 💻 cs.AI · cs.LG

Agentic Time Machine as an Infrastructure for Future-Event Forecasting

Pith reviewed 2026-06-26 14:40 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords future-event forecasting · LLM agents · multi-agent framework · evaluation infrastructure · web content filtering · Agentic Time Machine · FutureX benchmark

The pith

Agentic Time Machine reconstructs past web states by filtering post-cutoff content to enable fast, realistic offline evaluation of forecasting agents that matches live results.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Agentic Time Machine, an infrastructure that filters content published after a chosen past date to recreate the web environment as it existed then. This setup lets researchers test LLM agents on future-event forecasting tasks without the long wait times of live benchmarks or the artificial limits of static databases. The authors pair it with a planner-solver-aggregator multi-agent system that splits each forecast question into multiple analytical angles, collects evidence in parallel, and combines the outputs. Experiments demonstrate that agent scores obtained inside this reconstructed environment correlate strongly with their performance on the live FutureX competition. The same framework also records the highest scores among tested baselines on simulated past data and leads the official live leaderboard for multiple weeks.

Core claim

Agentic Time Machine reconstructs the web state at any chosen past time by filtering post-cutoff content. This infrastructure supports evaluation of forecasting agents with faster feedback than live settings while maintaining environmental realism. Combined with a planner-solver-aggregator multi-agent framework, it enables breaking down forecasting questions into diverse angles, parallel evidence gathering, and result aggregation. Experiments confirm strong correlation between TM offline scores and live FutureX scores, with the framework achieving top scores on simulated benchmarks and leading the live leaderboard.
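As a rough illustration of the filtering idea in the core claim, the sketch below drops search results dated after a chosen cutoff. It is not the paper's implementation: the `SearchResult` type, its field names, the `filter_post_cutoff` function, and the `keep_undated` switch are all hypothetical. A real system would also need to deal with misdated pages and with content that reveals the outcome without carrying a date.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class SearchResult:
    """Hypothetical search hit; field names are illustrative, not from the paper."""
    url: str
    title: str
    published: Optional[date]  # None when no reliable date could be extracted


def filter_post_cutoff(
    results: List[SearchResult],
    cutoff: date,
    keep_undated: bool = True,
) -> List[SearchResult]:
    """Keep results published on or before `cutoff`.

    `keep_undated` is an assumption of this sketch: it decides whether
    results without a reliable date survive the filter.
    """
    kept = []
    for result in results:
        if result.published is None:
            if keep_undated:
                kept.append(result)
            continue
        if result.published <= cutoff:
            kept.append(result)
    return kept
```

Setting `keep_undated=False` gives a stricter sandbox at the cost of discarding legitimate background pages; which trade-off the authors chose is not stated in the abstract.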

What carries the argument

Agentic Time Machine (TM), which filters post-cutoff content to reconstruct past web states as the evaluation sandbox, together with the planner-solver-aggregator multi-agent framework that decomposes questions, gathers evidence in parallel, and aggregates forecasts.
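The planner-solver-aggregator flow described here can be sketched as follows. This is a minimal sketch under stated assumptions: `plan`, `solve`, `aggregate`, and `forecast` are hypothetical names, the planner and solver are toy placeholders standing in for LLM calls, and averaging the per-angle probabilities is only one possible aggregation rule, not necessarily the one the paper uses.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean
from typing import Callable, List


def plan(question: str) -> List[str]:
    """Placeholder planner: splits a question into fixed analytical angles."""
    angles = ("base rates", "recent news", "expert views")
    return [f"{angle}: {question}" for angle in angles]


def solve(sub_question: str) -> float:
    """Placeholder solver returning a toy probability in [0.4, 0.6].

    A real solver would run an agent with search tools on the sub-question.
    """
    return 0.5 + 0.1 * (len(sub_question) % 3 - 1)


def aggregate(probabilities: List[float]) -> float:
    """Assumed aggregation rule for this sketch: the arithmetic mean."""
    return mean(probabilities)


def forecast(
    question: str,
    planner: Callable[[str], List[str]] = plan,
    solver: Callable[[str], float] = solve,
    aggregator: Callable[[List[float]], float] = aggregate,
    max_workers: int = 4,
) -> float:
    """Plan angles, solve them in parallel threads, then combine the results."""
    angles = planner(question)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        estimates = list(pool.map(solver, angles))
    return aggregator(estimates)
```

Passing the three roles as arguments makes it easy to swap one out, for example to replace the aggregator while leaving the planner and solver untouched.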

If this is right

  • TM offline scores correlate strongly with live FutureX scores, validating the sandbox for fast agent evaluation.
  • The planner-solver-aggregator framework achieves the highest scores on FutureX-Past and Polymarket under TM among closed-book, tool-augmented, and self-consistency baselines.
  • The system records the best average rank on the official FutureX live leaderboard over four consecutive weeks, including first place in May Week 1.
  • As of June 17 the system ranks first on FutureX's official eight-week overall leaderboard.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the reconstruction method holds, researchers could test forecasting agents across arbitrary historical windows without waiting for real time to pass.
  • The same filtering approach might transfer to evaluating agents in other dynamic settings such as social media streams or market data feeds.
  • Faster iteration cycles on multi-agent forecasting designs become practical, which could shorten the time needed to improve long-horizon prediction methods.
  • One could check whether the performance gain comes specifically from the parallel evidence-gathering step by ablating the planner or aggregator roles inside TM.

Load-bearing premise

Filtering post-cutoff content is assumed to approximately reconstruct the web state at any chosen past time without introducing major biases, missing critical pre-cutoff signals, or altering the information environment in ways that affect agent behavior.

What would settle it

Evaluate the same agents both inside TM set to a recent past cutoff date and in the actual live environment on that same date; if performance rankings or score correlations diverge substantially, the claim that TM is a reliable proxy fails.
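One way to quantify whether rankings diverge is Kendall's tau over the agents scored in both settings. The sketch below is illustrative only: the function name and the dictionary inputs are assumptions, and no scores from the paper are used.

```python
from itertools import combinations
from typing import Dict


def kendall_tau(tm_scores: Dict[str, float], live_scores: Dict[str, float]) -> float:
    """Kendall's tau-a over agents present in both score tables.

    Returns 1.0 for identical rankings and -1.0 for fully reversed rankings.
    Tied pairs count as neither concordant nor discordant.
    """
    agents = sorted(set(tm_scores) & set(live_scores))
    if len(agents) < 2:
        raise ValueError("need at least two agents scored in both settings")
    concordant = discordant = 0
    for a, b in combinations(agents, 2):
        # Positive product: both settings order the pair the same way.
        product = (tm_scores[a] - tm_scores[b]) * (live_scores[a] - live_scores[b])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(agents) * (len(agents) - 1) // 2
    return (concordant - discordant) / n_pairs
```

A value near 1 would support TM as a proxy for live ranking; a value near 0 or below would count against it. With only a handful of agents the statistic is coarse, so it should be read alongside the raw scores.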

read the original abstract

Forecasting future events is a critical challenge for large language model (LLM) agents, spanning domains from elections and monetary policy to financial markets. However, evaluating progress on this task presents a fundamental trade-off between efficiency and environment fidelity. While live evaluation benchmarks suffer from an inherently slow feedback loop, existing retrospective replays typically restrict agents to static, pre-frozen databases that sacrifice the environmental realism of actual deployments. To tackle this issue, we introduce Agentic Time Machine (TM), an infrastructure that approximately reconstructs the web state at any chosen past time by filtering post-cutoff content. Leveraging this evaluation infrastructure, we further propose a planner-solver-aggregator multi-agent framework that breaks each question into diverse analytical angles, gathers evidence in parallel, and combines the results into a single forecast. Experiments show that offline scores under TM correlate strongly with live FutureX scores, validating that TM offers a fast and reliable sandbox for forecasting-agent evaluation. On FutureX-Past and Polymarket evaluated under TM, our framework achieves the highest score among strong closed-book, tool-augmented, and self-consistency baselines. On the official FutureX live leaderboard, our system achieves the best average rank over four consecutive weeks, including 1st place in May Week 1. As of June 17, it also ranks 1st on FutureX's official eight-week overall leaderboard.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Agentic Time Machine (TM), an infrastructure that reconstructs past web states via post-cutoff content filtering to enable faster evaluation of LLM forecasting agents than live benchmarks. It proposes a planner-solver-aggregator multi-agent framework and claims that offline TM scores correlate strongly with live FutureX scores, that the framework outperforms baselines on FutureX-Past and Polymarket under TM, and that the system achieves the best average rank (including 1st place in one week) on the official FutureX live leaderboard.

Significance. If the reconstruction fidelity holds and the reported correlation is robustly quantified, TM could address a key efficiency-fidelity trade-off in forecasting-agent evaluation and support faster iteration on multi-agent frameworks. The live leaderboard result provides an external anchor, and the absence of free parameters or ad-hoc axioms in the core infrastructure is a strength. However, the lack of reconstruction-accuracy metrics and statistical details on the correlation substantially weakens the central validation claim.

major comments (2)
  1. [Abstract] The central validation claim, that offline TM scores 'correlate strongly' with live FutureX scores and thereby establish TM as a 'fast and reliable sandbox', is unsupported by any quantitative correlation coefficient, p-value, sample size, error bars, or controls for reconstruction artifacts.
  2. [Abstract] No independent metric (e.g., reconstruction accuracy on held-out pre-cutoff events, comparison of search-result ordering, or ablation of filtering rules) is supplied to test whether post-cutoff filtering faithfully reconstructs the information environment; this assumption underpins both the offline correlation and the planner-solver-aggregator runs.
minor comments (1)
  1. [Abstract] The abstract refers to 'FutureX-Past and Polymarket' and 'strong closed-book, tool-augmented, and self-consistency baselines' without defining the exact tasks, number of questions, or precise baseline implementations.
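The statistics requested in major comment 1 could be reported along the following lines: a Pearson coefficient together with a permutation-test p-value. This is a sketch, not the authors' analysis; `pearson_r` and `permutation_p_value` are hypothetical helper names, and the permutation test is one of several reasonable ways to obtain a p-value when the sample is small.

```python
import random
from math import sqrt
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between paired offline and live scores."""
    if len(x) != len(y) or len(x) < 3:
        raise ValueError("need two equal-length samples with at least 3 points")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)


def permutation_p_value(
    x: Sequence[float],
    y: Sequence[float],
    n_permutations: int = 10_000,
    seed: int = 0,
) -> float:
    """Two-sided p-value: how often shuffled pairings match or beat |r|."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson_r(x, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)
```

Reporting the sample size next to both numbers matters here, since a high coefficient over a few agents can still be compatible with chance.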

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The two major comments correctly identify that the abstract lacks explicit quantitative support for the correlation claim and that no independent reconstruction-fidelity metrics are supplied. We will revise the manuscript to incorporate the requested statistical details and validation experiments.

Point-by-point responses
  1. Referee: [Abstract] The central validation claim, that offline TM scores 'correlate strongly' with live FutureX scores and thereby establish TM as a 'fast and reliable sandbox', is unsupported by any quantitative correlation coefficient, p-value, sample size, error bars, or controls for reconstruction artifacts.

    Authors: We agree that the abstract should contain the quantitative correlation statistics. The body of the paper presents a scatter plot of offline versus live scores, but we will add the explicit Pearson correlation coefficient, associated p-value, sample size (number of events), error bars, and a brief discussion of controls for reconstruction artifacts to the abstract and to a dedicated paragraph in the experiments section.
    revision: yes

  2. Referee: [Abstract] No independent metric (e.g., reconstruction accuracy on held-out pre-cutoff events, comparison of search-result ordering, or ablation of filtering rules) is supplied to test whether post-cutoff filtering faithfully reconstructs the information environment; this assumption underpins both the offline correlation and the planner-solver-aggregator runs.

    Authors: We acknowledge the absence of these independent checks. In the revised manuscript we will add a new subsection under Experiments that reports (1) reconstruction accuracy measured on a held-out set of pre-cutoff events, (2) rank correlation between search-result orderings in the reconstructed versus original environments, and (3) an ablation study varying the post-cutoff filtering rules. These additions will directly test the fidelity assumption.
    revision: yes

Circularity Check

0 steps flagged

No circularity; validation uses independent live benchmark correlation

full rationale

The paper presents no derivation chain, equations, or fitted parameters that reduce to their own inputs. The key claim—that TM provides a reliable sandbox—is supported by an empirical correlation between offline TM scores and separate live FutureX leaderboard results, which constitutes external evidence rather than self-referential construction. The filtering premise is an explicit modeling assumption tested by that correlation, not smuggled in via self-citation or ansatz. No self-citations, uniqueness theorems, or renamings appear in the provided text. The argument is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

Review is based on abstract only; full paper may detail additional parameters or assumptions. The central claims rest on the domain assumption that post-cutoff filtering produces a usable historical web state.

axioms (1)
  • domain assumption: Filtering post-cutoff content sufficiently reconstructs the web state at a chosen past time for agent evaluation purposes.
    Invoked as the core mechanism of the Agentic Time Machine infrastructure described in the abstract.
invented entities (2)
  • Agentic Time Machine (no independent evidence)
    purpose: Infrastructure for approximately reconstructing past web states to enable fast offline forecasting agent evaluation.
    New system introduced to address the efficiency-fidelity trade-off in agent evaluation.
  • planner-solver-aggregator multi-agent framework (no independent evidence)
    purpose: Breaks forecasting questions into analytical angles, gathers evidence in parallel, and aggregates results.
    Proposed architecture that achieves the reported performance gains.

pith-pipeline@v0.9.1-grok · 5798 in / 1437 out tokens · 29676 ms · 2026-06-26T14:40:30.751372+00:00 · methodology


Reference graph

Works this paper leans on

33 extracted references · 19 canonical work pages · 8 internal anchors

  1. [1]

    R. Alur, B. C. Stadie, D. Kang, R. Chen, M. McManus, M. Rickert, T. Lee, M. Federici, R. Zhu, D. Fogerty, H. Williamson, N. Lozinski, A. Linsky, and J. S. Sekhon. Aia forecaster: Technical report, 2025. URL https://arxiv.org/abs/2511.07678

  2. [2]

    Anthropic. Introducing claude sonnet 4.6. https://www.anthropic.com/news/claude-sonnet-4-6, Feb. 2026. Accessed: 2026-05-12

  3. [3]

    N. Chandak, S. Goel, A. Prabhu, M. Hardt, and J. Geiping. Scaling open-ended reasoning to predict the future, 2026. URL https://arxiv.org/abs/2512.25070

  4. [4]

    H. Dai, R. Teehan, and M. Ren. Are llms prescient? a continuous evaluation using daily news as the oracle, 2025. URL https://arxiv.org/abs/2411.08324

  5. [5]

    DeepSeek AI. Deepseek v4 preview release. https://api-docs.deepseek.com/news/news260424, Apr. 2026. Accessed: 2026-05-12

  6. [6]

    GLM-5-Team, A. Zeng, X. Lv, Z. Hou, Z. Du, Q. Zheng, B. Chen, D. Yin, C. Ge, C. Huang, C. Xie, C. Zhu, C. Yin, C. Wang, G. Pan, H. Zeng, H. Zhang, H. Wang, H. Chen, J. Zhang, J. Jiao, J. Guo, J. Wang, J. Du, J. Wu, K. Wang, L. Li, L. Fan, L. Zhong, M. Liu, M. Zhao, P. Du, Q. Dong, R. Lu, Shuang-Li, S. Cao, S. Liu, T. Jiang, X. Chen, X. Zhang, X. Huang, X...

  7. [7]

    S. Goel, N. Chandak, A. Arun, A. Prabhu, S. Staab, M. Hardt, M. Andriushchenko, and J. Geiping. Futuresim: Replaying world events to evaluate adaptive agents, 2026. URL https://arxiv.org/abs/2605.15188

  8. [8]

    Google. Gemini 3.1 pro: A smarter model for your most complex tasks. https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/, Feb. 2026. Accessed: 2026-05-12

  9. [9]

    D. Halawi, F. Zhang, C. Yueh-Han, and J. Steinhardt. Approaching human-level forecasting with language models. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 50426–50468. Curran Associates, Inc., 2024. doi: 10.52202/079017-1598. URL https:/...

  10. [10]

    C. Jin, T. Zhou, Y. Chen, K. Liu, and J. Zhao. Maeps: Multi-agent event prediction system based on human expert team collaboration simulation. Tsinghua Science and Technology,

  11. [11]

    doi: 10.26599/TST.2025.9010160. URL https://www.sciopen.com/article/10.26599/TST.2025.9010160

  12. [12]

    E. Karger, H. Bastani, C. Yueh-Han, Z. Jacobs, D. Halawi, F. Zhang, and P. E. Tetlock. Forecastbench: A dynamic benchmark of ai forecasting capabilities. In Y. Yue, A. Garg, N. Peng, F. Sha, and R. Yu, editors, International Conference on Learning Representations, volume 2025, pages 93943–93980, 2025. URL https://proceedings.iclr.cc/paper_files/paper/202...

  13. [13]

    A. E. Lahib, Y.-J. Xia, Z. Li, Y. Wang, and X. Pi. Temporal leakage in search-engine date-filtered web retrieval: A retrospective forecasting case study, 2026. URL https://arxiv.org/abs/2602.00758

  14. [14]

    M. Nechepurenko and P. Shuvalov. Foresight arena: An on-chain benchmark for evaluating ai forecasting agents, 2026. URL https://arxiv.org/abs/2605.00420

  15. [15]

    OpenAI. Introducing gpt-5.4. https://openai.com/index/introducing-gpt-5-4/, Mar. 2026. Accessed: 2026-05-12

  16. [16]

    S. Su, S. Xing, X. Dong, M. Zhong, B. Wang, X. Zhu, Y. Chen, W. Wang, Y. Deng, P. Zhu, Z. Liu, T. Li, J. Yu, Z. Chen, L. Bing, and J. Dai. Miroflow: Towards high-performance and robust open-source agent framework for general deep research tasks, 2026. URL https://arxiv.org/abs/2602.22808

  17. [17]

    M. Tan, M. A. Merrill, Z. Gottesman, T. Althoff, D. Evans, and T. Hartvigsen. Inferring events from time series using language models, 2025. URL https://arxiv.org/abs/2503.14190

  18. [18]

    K. Team, T. Bai, Y. Bai, Y. Bao, S. H. Cai, Y. Cao, Y. Charles, H. S. Che, C. Chen, G. Chen, H. Chen, J. Chen, J. Chen, J. Chen, J. Chen, K. Chen, L. Chen, R. Chen, X. Chen, Y. Chen, Y. Chen, Y. Chen, Y. Chen, Y. Chen, Y. Chen, Y. Chen, Y. Chen, Z. Chen, Z. Chen, D. Cheng, M. Chu, J. Cui, J. Deng, M. Diao, H. Ding, M. Dong, M. Dong, Y. Dong, Y. Dong, A...

  19. [19]

    X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou. Self-consistency improves chain of thought reasoning in language models, 2023. URL https://arxiv.org/abs/2203.11171

  20. [20]

    J. Wei, Z. Sun, S. Papay, S. McKinney, J. Han, I. Fulford, H. W. Chung, A. T. Passos, W. Fedus, and A. Glaese. Browsecomp: A simple yet challenging benchmark for browsing agents,

  21. [21]

    URL https://arxiv.org/abs/2504.12516

  22. [22]

    J. Wildman, N. I. Bosse, D. Hnyk, P. Mühlbacher, F. Hambly, J. Evans, D. Schwarz, and L. Phillips. Bench to the future: A pastcasting benchmark for forecasting agents, 2025. URL https://arxiv.org/abs/2506.21558

  23. [23]

    K. Yang, H. Li, H. Wen, T.-Q. Peng, J. Tang, and H. Liu. Are large language models (LLMs) good social predictors? In Y. Al-Onaizan, M. Bansal, and Y.-N. Chen, editors, Findings of the Association for Computational Linguistics: EMNLP 2024, pages 2718–2730, Miami, Florida, USA, Nov. 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.findi...

  24. [24]

    Q. Yang, S. Mahns, S. Li, A. Gu, J. Wu, and H. Xu. Llm-as-a-prophet: Understanding predictive intelligence with prophet arena, 2025. URL https://arxiv.org/abs/2510.17638

  25. [25]

    S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao. React: Synergizing reasoning and acting in language models, 2023. URL https://arxiv.org/abs/2210.03629

  26. [26]

    C. Ye, Z. Hu, Y. Deng, Z. Huang, M. D. Ma, Y. Zhu, and W. Wang. Mirai: Evaluating llm agents for event forecasting, 2024. URL https://arxiv.org/abs/2407.01231

  27. [27]

    Z. Zeng, J. Liu, S. Chen, T. He, Y. Liao, Y. Tian, J. Wang, Z. Wang, Y. Yang, L. Yin, M. Yin, Z. Zhu, T. Cai, Z. Chen, J. Chen, Y. Du, X. Gao, J. Guo, L. Hu, J. Jiao, X. Li, J. Liu, S. Ni, Z. Wen, G. Zhang, K. Zhang, X. Zhou, J. Blanchet, X. Qiu, M. Wang, and W. Huang. Futurex: An advanced live benchmark for llm agents in future prediction, 2025. URL http...

  28. [28]

    Time: The result has a clear published date, URL date, title date, or snippet date outside the allowed cd_min-cd_max window

  29. [29]

    Spoiler: The title or snippet directly resolves the prediction task. This applies even when the result has no reliable date, and even when the answer appears only in the snippet rather than the full page

  30. [30]

    Hindsight: The snippet uses result/reporting language that makes a still-protected future outcome appear settled, known, completed, reported, won, lost, announced, confirmed, or published. Exception: keep background, previews, speculation, or historical context inside the allowed time window that does not resolve the task. Visit-side filter. The visit-side...

  31. [31]

    Return "LEAKED" if the content directly states, strongly implies, or confirms the answer to the prediction question

  32. [32]

    Return "OUT_OF_TIME" only if the content itself contains a clear time signal (publish/update/event/result time) that is later than the cutoff time_point

  33. [33]

    2026-01-23, what will the high of Apple stock (AAPL) be for the day (in US$)?

    Return "SAFE" if the content is only background, historical context, previews, speculation, or the time signal is missing/uncertain.

    A.3. Per-level effect of Time Machine

    The aggregate Time Machine drop in Table 1 falls mostly on the retrieval-heavy levels. Every backbone loses 30 to 50 points on Level 4 (quantitative) but only 10 to 28 points on Level...