Agentic Time Machine as an Infrastructure for Future-Event Forecasting
Pith reviewed 2026-06-26 14:40 UTC · model grok-4.3
The pith
Agentic Time Machine reconstructs past web states by filtering post-cutoff content to enable fast, realistic offline evaluation of forecasting agents that matches live results.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Agentic Time Machine reconstructs the web state at any chosen past time by filtering post-cutoff content. This infrastructure supports evaluation of forecasting agents with faster feedback than live settings while maintaining environmental realism. Combined with a planner-solver-aggregator multi-agent framework, it enables breaking down forecasting questions into diverse angles, parallel evidence gathering, and result aggregation. Experiments confirm strong correlation between TM offline scores and live FutureX scores, with the framework achieving top scores on simulated benchmarks and leading the live leaderboard.
What carries the argument
Agentic Time Machine (TM), which filters post-cutoff content to reconstruct past web states as the evaluation sandbox, together with the planner-solver-aggregator multi-agent framework that decomposes questions, gathers evidence in parallel, and aggregates forecasts.
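The planner-solver-aggregator flow described above can be pictured with a minimal sketch. Everything in it is a stand-in: the function names `plan`, `solve`, `aggregate`, and `forecast`, the three hard-coded angles, the placeholder probability heuristic, and the mean aggregation are assumptions for illustration, not the paper's implementation. The summary only states that questions are decomposed into angles, evidence is gathered in parallel, and results are combined.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def plan(question: str) -> list[str]:
    """Planner: split one question into several analytical angles (hard-coded here)."""
    angles = ("base rates", "recent news", "expert views")
    return [f"{question} [angle: {angle}]" for angle in angles]


def solve(angle_prompt: str) -> float:
    """Solver: return a probability for one angle.

    Placeholder heuristic only. A real solver would search the
    time-filtered web and reason over the evidence it finds.
    """
    return 0.5 + 0.1 * (len(angle_prompt) % 3 - 1)


def aggregate(probabilities: list[float]) -> float:
    """Aggregator: combine per-angle probabilities (simple mean, an assumption)."""
    return mean(probabilities)


def forecast(question: str) -> float:
    """Run the solvers in parallel, then aggregate their outputs."""
    angle_prompts = plan(question)
    with ThreadPoolExecutor(max_workers=len(angle_prompts)) as pool:
        probabilities = list(pool.map(solve, angle_prompts))
    return aggregate(probabilities)


print(forecast("Will X happen by the resolution date?"))
```

Threads are used here because solver work in such a system would mostly wait on network calls; that choice is also an assumption.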
If this is right
- TM offline scores correlate strongly with live FutureX scores, validating the sandbox for fast agent evaluation.
- The planner-solver-aggregator framework achieves the highest scores on FutureX-Past and Polymarket under TM among closed-book, tool-augmented, and self-consistency baselines.
- The system records the best average rank on the official FutureX live leaderboard over four consecutive weeks, including first place in May Week 1.
- As of June 17 the system ranks first on FutureX's official eight-week overall leaderboard.
Where Pith is reading between the lines
- If the reconstruction method holds, researchers could test forecasting agents across arbitrary historical windows without waiting for real time to pass.
- The same filtering approach might transfer to evaluating agents in other dynamic settings such as social media streams or market data feeds.
- Faster iteration cycles on multi-agent forecasting designs become practical, which could shorten the time needed to improve long-horizon prediction methods.
- One could check whether the performance gain comes specifically from the parallel evidence-gathering step by ablating the planner or aggregator roles inside TM.
Load-bearing premise
Filtering post-cutoff content is assumed to approximately reconstruct the web state at any chosen past time without introducing major biases, missing critical pre-cutoff signals, or altering the information environment in ways that affect agent behavior.
What would settle it
Evaluate the same agents both inside TM set to a recent past cutoff date and in the actual live environment on that same date; if performance rankings or score correlations diverge substantially, the claim that TM is a reliable proxy fails.
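One way to make "rankings diverge" concrete is a rank correlation between the scores the same agents obtain inside TM and in the live setting. The sketch below computes Spearman's rho with the standard library only. The score lists are hypothetical numbers chosen for illustration, not results from the paper, and the choice of Spearman's rho is an assumption about how the comparison could be run.

```python
from math import sqrt
from statistics import mean


def ranks(values: list[float]) -> list[float]:
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical scores for four agents (illustrative only).
tm_scores = [61.0, 55.5, 48.2, 40.1]
live_scores = [58.3, 57.0, 44.9, 41.5]
print(f"Spearman rho = {spearman(tm_scores, live_scores):.3f}")
```

A rho near 1 would mean TM preserves the live ordering of agents; a low or negative rho would count against TM as a proxy.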
Original abstract
Forecasting future events is a critical challenge for large language model (LLM) agents, spanning domains from elections and monetary policy to financial markets. However, evaluating progress on this task presents a fundamental trade-off between efficiency and environment fidelity. While live evaluation benchmarks suffer from an inherently slow feedback loop, existing retrospective replays typically restrict agents to static, pre-frozen databases that sacrifice the environmental realism of actual deployments. To tackle this issue, we introduce Agentic Time Machine (TM), an infrastructure that approximately reconstructs the web state at any chosen past time by filtering post-cutoff content. Leveraging this evaluation infrastructure, we further propose a planner-solver-aggregator multi-agent framework that breaks each question into diverse analytical angles, gathers evidence in parallel, and combines the results into a single forecast. Experiments show that offline scores under TM correlate strongly with live FutureX scores, validating that TM offers a fast and reliable sandbox for forecasting-agent evaluation. On FutureX-Past and Polymarket evaluated under TM, our framework achieves the highest score among strong closed-book, tool-augmented, and self-consistency baselines. On the official FutureX live leaderboard, our system achieves the best average rank over four consecutive weeks, including 1st place in May Week 1. As of June 17, it also ranks 1st on FutureX's official eight-week overall leaderboard.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Agentic Time Machine (TM), an infrastructure that reconstructs past web states via post-cutoff content filtering to enable faster evaluation of LLM forecasting agents than live benchmarks. It proposes a planner-solver-aggregator multi-agent framework and claims that offline TM scores correlate strongly with live FutureX scores, that the framework outperforms baselines on FutureX-Past and Polymarket under TM, and that the system achieves the best average rank (including 1st place in one week) on the official FutureX live leaderboard.
Significance. If the reconstruction fidelity holds and the reported correlation is robustly quantified, TM could address a key efficiency-fidelity trade-off in forecasting-agent evaluation and support faster iteration on multi-agent frameworks. The live leaderboard result provides an external anchor, and the absence of free parameters or ad-hoc axioms in the core infrastructure is a strength. However, the lack of reconstruction-accuracy metrics and statistical details on the correlation substantially weakens the central validation claim.
major comments (2)
- [Abstract] The central validation claim, that offline TM scores 'correlate strongly' with live FutureX scores and thereby establish TM as a 'fast and reliable sandbox', is unsupported by any quantitative correlation coefficient, p-value, sample size, error bars, or controls for reconstruction artifacts.
- [Abstract] No independent metric (e.g., reconstruction accuracy on held-out pre-cutoff events, comparison of search-result ordering, or ablation of filtering rules) is supplied to test whether post-cutoff filtering faithfully reconstructs the information environment; this assumption underpins both the offline correlation and the planner-solver-aggregator runs.
minor comments (1)
- [Abstract] The abstract refers to 'FutureX-Past and Polymarket' and 'strong closed-book, tool-augmented, and self-consistency baselines' without defining the exact tasks, number of questions, or precise baseline implementations.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. The two major comments correctly identify that the abstract lacks explicit quantitative support for the correlation claim and that no independent reconstruction-fidelity metrics are supplied. We will revise the manuscript to incorporate the requested statistical details and validation experiments.
- Referee: [Abstract] The central validation claim, that offline TM scores 'correlate strongly' with live FutureX scores and thereby establish TM as a 'fast and reliable sandbox', is unsupported by any quantitative correlation coefficient, p-value, sample size, error bars, or controls for reconstruction artifacts.
  Authors: We agree that the abstract should contain the quantitative correlation statistics. The body of the paper presents a scatter plot of offline versus live scores, but we will add the explicit Pearson correlation coefficient, associated p-value, sample size (number of events), error bars, and a brief discussion of controls for reconstruction artifacts directly into the abstract and a dedicated paragraph in the experiments section.
  Revision: yes
- Referee: [Abstract] No independent metric (e.g., reconstruction accuracy on held-out pre-cutoff events, comparison of search-result ordering, or ablation of filtering rules) is supplied to test whether post-cutoff filtering faithfully reconstructs the information environment; this assumption underpins both the offline correlation and the planner-solver-aggregator runs.
  Authors: We acknowledge the absence of these independent checks. In the revised manuscript we will add a new subsection under Experiments that reports (1) reconstruction accuracy measured on a held-out set of pre-cutoff events, (2) rank correlation between search-result orderings in the reconstructed versus original environments, and (3) an ablation study varying the post-cutoff filtering rules. These additions will directly test the fidelity assumption.
  Revision: yes
Circularity Check
No circularity; validation uses independent live benchmark correlation
The paper presents no derivation chain, equations, or fitted parameters that reduce to their own inputs. The key claim—that TM provides a reliable sandbox—is supported by an empirical correlation between offline TM scores and separate live FutureX leaderboard results, which constitutes external evidence rather than self-referential construction. The filtering premise is an explicit modeling assumption tested by that correlation, not smuggled in via self-citation or ansatz. No self-citations, uniqueness theorems, or renamings appear in the provided text. The argument is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Filtering post-cutoff content sufficiently reconstructs the web state at a chosen past time for agent evaluation purposes
invented entities (2)
- Agentic Time Machine: no independent evidence
- planner-solver-aggregator multi-agent framework: no independent evidence
Reference graph
Works this paper leans on
- [2] Anthropic. Introducing claude sonnet 4.6. https://www.anthropic.com/news/claude-sonnet-4-6, Feb. 2026. Accessed: 2026-05-12
- [3] N. Chandak, S. Goel, A. Prabhu, M. Hardt, and J. Geiping. Scaling open-ended reasoning to predict the future, 2026. URL https://arxiv.org/abs/2512.25070
- [5] DeepSeek AI. Deepseek v4 preview release. https://api-docs.deepseek.com/news/news260424, Apr. 2026. Accessed: 2026-05-12
- [6] GLM-5-Team, A. Zeng, X. Lv, Z. Hou, Z. Du, Q. Zheng, B. Chen, D. Yin, C. Ge, C. Huang, C. Xie, C. Zhu, C. Yin, C. Wang, G. Pan, H. Zeng, H. Zhang, H. Wang, H. Chen, J. Zhang, J. Jiao, J. Guo, J. Wang, J. Du, J. Wu, K. Wang, L. Li, L. Fan, L. Zhong, M. Liu, M. Zhao, P. Du, Q. Dong, R. Lu, Shuang-Li, S. Cao, S. Liu, T. Jiang, X. Chen, X. Zhang, X. Huang, X...
- [7] S. Goel, N. Chandak, A. Arun, A. Prabhu, S. Staab, M. Hardt, M. Andriushchenko, and J. Geiping. Futuresim: Replaying world events to evaluate adaptive agents, 2026. URL https://arxiv.org/abs/2605.15188
- [8] Google. Gemini 3.1 pro: A smarter model for your most complex tasks. https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/, Feb. 2026. Accessed: 2026-05-12
- [9] D. Halawi, F. Zhang, C. Yueh-Han, and J. Steinhardt. Approaching human-level forecasting with language models. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 50426–50468. Curran Associates, Inc., 2024. doi: 10.52202/079017-1598. URL https:/...
- [10] C. Jin, T. Zhou, Y. Chen, K. Liu, and J. Zhao. Maeps: Multi-agent event prediction system based on human expert team collaboration simulation. Tsinghua Science and Technology. doi: 10.26599/TST.2025.9010160. URL https://www.sciopen.com/article/10.26599/TST.2025.9010160
- [12] E. Karger, H. Bastani, C. Yueh-Han, Z. Jacobs, D. Halawi, F. Zhang, and P. E. Tetlock. Forecastbench: A dynamic benchmark of ai forecasting capabilities. In Y. Yue, A. Garg, N. Peng, F. Sha, and R. Yu, editors, International Conference on Learning Representations, volume 2025, pages 93943–93980, 2025. URL https://proceedings.iclr.cc/paper_files/paper/202...
- [13] A. E. Lahib, Y.-J. Xia, Z. Li, Y. Wang, and X. Pi. Temporal leakage in search-engine date-filtered web retrieval: A retrospective forecasting case study, 2026. URL https://arxiv.org/abs/2602.00758
- [14] M. Nechepurenko and P. Shuvalov. Foresight arena: An on-chain benchmark for evaluating ai forecasting agents, 2026. URL https://arxiv.org/abs/2605.00420
- [15] OpenAI. Introducing gpt-5.4. https://openai.com/index/introducing-gpt-5-4/, Mar. 2026. Accessed: 2026-05-12
- [16] S. Su, S. Xing, X. Dong, M. Zhong, B. Wang, X. Zhu, Y. Chen, W. Wang, Y. Deng, P. Zhu, Z. Liu, T. Li, J. Yu, Z. Chen, L. Bing, and J. Dai. Miroflow: Towards high-performance and robust open-source agent framework for general deep research tasks, 2026. URL https://arxiv.org/abs/2602.22808
- [17] M. Tan, M. A. Merrill, Z. Gottesman, T. Althoff, D. Evans, and T. Hartvigsen. Inferring events from time series using language models, 2025. URL https://arxiv.org/abs/2503.14190
- [18] K. Team, T. Bai, Y. Bai, Y. Bao, S. H. Cai, Y. Cao, Y. Charles, H. S. Che, C. Chen, G. Chen, H. Chen, J. Chen, J. Chen, J. Chen, J. Chen, K. Chen, L. Chen, R. Chen, X. Chen, Y. Chen, Y. Chen, Y. Chen, Y. Chen, Y. Chen, Y. Chen, Y. Chen, Y. Chen, Z. Chen, Z. Chen, D. Cheng, M. Chu, J. Cui, J. Deng, M. Diao, H. Ding, M. Dong, M. Dong, Y. Dong, Y. Dong, A...
- [19] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou. Self-consistency improves chain of thought reasoning in language models, 2023. URL https://arxiv.org/abs/2203.11171
- [20] J. Wei, Z. Sun, S. Papay, S. McKinney, J. Han, I. Fulford, H. W. Chung, A. T. Passos, W. Fedus, and A. Glaese. Browsecomp: A simple yet challenging benchmark for browsing agents. URL https://arxiv.org/abs/2504.12516
- [22] J. Wildman, N. I. Bosse, D. Hnyk, P. Mühlbacher, F. Hambly, J. Evans, D. Schwarz, and L. Phillips. Bench to the future: A pastcasting benchmark for forecasting agents, 2025. URL https://arxiv.org/abs/2506.21558
- [23] K. Yang, H. Li, H. Wen, T.-Q. Peng, J. Tang, and H. Liu. Are large language models (LLMs) good social predictors? In Y. Al-Onaizan, M. Bansal, and Y.-N. Chen, editors, Findings of the Association for Computational Linguistics: EMNLP 2024, pages 2718–2730, Miami, Florida, USA, Nov. 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.findi...
- [24] Q. Yang, S. Mahns, S. Li, A. Gu, J. Wu, and H. Xu. Llm-as-a-prophet: Understanding predictive intelligence with prophet arena, 2025. URL https://arxiv.org/abs/2510.17638
- [25] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao. React: Synergizing reasoning and acting in language models, 2023. URL https://arxiv.org/abs/2210.03629
- [27] Z. Zeng, J. Liu, S. Chen, T. He, Y. Liao, Y. Tian, J. Wang, Z. Wang, Y. Yang, L. Yin, M. Yin, Z. Zhu, T. Cai, Z. Chen, J. Chen, Y. Du, X. Gao, J. Guo, L. Hu, J. Jiao, X. Li, J. Liu, S. Ni, Z. Wen, G. Zhang, K. Zhang, X. Zhou, J. Blanchet, X. Qiu, M. Wang, and W. Huang. Futurex: An advanced live benchmark for llm agents in future prediction, 2025. URL http...
[28]
Time: The result has a clear published date, URL date, title date, or snippet date outside the allowed cd_min-cd_max window
-
[29]
This applies even when the result has no reliable date, and even when the answer appears only in the snippet rather than the full page
Spoiler: The title or snippet directly resolves the prediction task. This applies even when the result has no reliable date, and even when the answer appears only in the snippet rather than the full page
-
[30]
Exception: keep background, previews, speculation, or historical context inside the allowed time window that does not resolve the task
Hindsight: The snippet uses result/reporting language that makes a still-protected future outcome appear settled, known, completed, reported, won, lost, announced, confirmed, or published. Exception: keep background, previews, speculation, or historical context inside the allowed time window that does not resolve the task. Visit-side filter.The visit-side...
-
[31]
Return "LEAKED" if the content directly states, strongly implies, or confirms the answer to the prediction question
-
[32]
OUT_OF_TIME
Return "OUT_OF_TIME" only if the content itself contains a clear time signal (publish/update/event/result time) that is later than the cutoff time_point
-
[33]
2026-01-23, what will the high of Apple stock (AAPL) be for the day (in US$)?
Return "SAFE" if the content is only background, historical context, previews, speculation, or the time signal is missing/uncertain. 16 A.3. Per-level effect of Time Machine The aggregate Time Machine drop in Table 1 falls mostly on the retrieval-heavy levels. Every backbone loses 30 to 50 points on Level 4 (quantitative) but only 10 to 28 points on Level...
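The filter fragments quoted in entries [28] to [33] can be read as a small decision procedure. The sketch below encodes the Time rule and the three visit-side labels in that order. The function names, the inclusive window bounds, and the boolean `states_answer` input are assumptions for illustration; in the excerpt the labels read like instructions to a model, so a real system would presumably obtain that judgment from a model rather than a flag.

```python
from datetime import date
from typing import Optional


def passes_time_rule(result_date: Optional[date], cd_min: date, cd_max: date) -> bool:
    """Search-side Time rule: drop results clearly dated outside the window.

    Undated results pass, since the rule only fires on a clear date.
    Inclusive bounds are an assumption.
    """
    if result_date is None:
        return True
    return cd_min <= result_date <= cd_max


def label_page(content_date: Optional[date], cutoff: date, states_answer: bool) -> str:
    """Visit-side label following the LEAKED / OUT_OF_TIME / SAFE wording."""
    if states_answer:
        # Content states, strongly implies, or confirms the answer.
        return "LEAKED"
    if content_date is not None and content_date > cutoff:
        # A clear time signal later than the cutoff.
        return "OUT_OF_TIME"
    # Background, previews, speculation, or missing/uncertain time signal.
    return "SAFE"


cutoff = date(2026, 1, 20)
print(label_page(date(2026, 1, 24), cutoff, states_answer=False))
print(label_page(None, cutoff, states_answer=False))
```

Treating a missing date as SAFE follows the quoted wording, and it is also where undetected leakage is most likely to enter.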