Agentic Time Machine as an Infrastructure for Future-Event Forecasting

Bingyang Zheng; Hao Lu; Jingyi Chai; Kemeng Zhang; Siheng Chen; Tianchen Wang; Xiangrui Liu; Zihang Zhou

arxiv: 2606.21013 · v1 · pith:2I3ZXVADnew · submitted 2026-06-19 · 💻 cs.AI · cs.LG

Jingyi Chai, Bingyang Zheng, Xiangrui Liu, Hao Lu, Zihang Zhou, Tianchen Wang, Kemeng Zhang, Siheng Chen

Pith reviewed 2026-06-26 14:40 UTC · model grok-4.3

classification 💻 cs.AI cs.LG

keywords: future-event forecasting · LLM agents · multi-agent framework · evaluation infrastructure · web content filtering · Agentic Time Machine · FutureX benchmark

The pith

Agentic Time Machine reconstructs past web states by filtering post-cutoff content to enable fast, realistic offline evaluation of forecasting agents that matches live results.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Agentic Time Machine, an infrastructure that filters content published after a chosen past date to recreate the web environment as it existed then. This setup lets researchers test LLM agents on future-event forecasting tasks without the long wait times of live benchmarks or the artificial limits of static databases. The authors pair it with a planner-solver-aggregator multi-agent system that splits each forecast question into multiple analytical angles, collects evidence in parallel, and combines the outputs. Experiments demonstrate that agent scores obtained inside this reconstructed environment correlate strongly with their performance on the live FutureX competition. The same framework also records the highest scores among tested baselines on simulated past data and leads the official live leaderboard for multiple weeks.

Core claim

Agentic Time Machine reconstructs the web state at any chosen past time by filtering post-cutoff content. This infrastructure supports evaluation of forecasting agents with faster feedback than live settings while maintaining environmental realism. Combined with a planner-solver-aggregator multi-agent framework, it enables breaking down forecasting questions into diverse angles, parallel evidence gathering, and result aggregation. Experiments confirm strong correlation between TM offline scores and live FutureX scores, with the framework achieving top scores on simulated benchmarks and leading the live leaderboard.

What carries the argument

Agentic Time Machine (TM), which filters post-cutoff content to reconstruct past web states as the evaluation sandbox, together with the planner-solver-aggregator multi-agent framework that decomposes questions, gathers evidence in parallel, and aggregates forecasts.
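
The planner-solver-aggregator flow can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `plan`, `solve`, `aggregate`, and `forecast`, the three example angles, the placeholder solver, and the choice of a plain mean as the aggregation rule are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean
from typing import Callable, List


def plan(question: str) -> List[str]:
    """Hypothetical planner: split one question into analytical angles."""
    angles = ["base rates", "recent news", "expert commentary"]
    return [f"{question} | angle: {angle}" for angle in angles]


def solve(sub_question: str) -> float:
    """Hypothetical solver stub returning a probability in [0, 1].

    A real solver would search the time-filtered web and reason over the
    evidence; a fixed placeholder stands in for that step here.
    """
    return 0.5


def aggregate(probabilities: List[float]) -> float:
    """Hypothetical aggregator: plain mean of the solver outputs."""
    return mean(probabilities)


def forecast(question: str, solver: Callable[[str], float] = solve) -> float:
    """Plan, run one solver per angle in parallel, then aggregate."""
    sub_questions = plan(question)
    with ThreadPoolExecutor(max_workers=len(sub_questions)) as pool:
        probabilities = list(pool.map(solver, sub_questions))
    return aggregate(probabilities)


if __name__ == "__main__":
    print(forecast("Will event X happen by the resolution date?"))
```

Threads are used here because evidence gathering is typically I/O-bound; the aggregation rule is the part most likely to differ in a real system.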

If this is right

TM offline scores correlate strongly with live FutureX scores, validating the sandbox for fast agent evaluation.
The planner-solver-aggregator framework achieves the highest scores on FutureX-Past and Polymarket under TM among closed-book, tool-augmented, and self-consistency baselines.
The system records the best average rank on the official FutureX live leaderboard over four consecutive weeks, including first place in May Week 1.
As of June 17 the system ranks first on FutureX's official eight-week overall leaderboard.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the reconstruction method holds, researchers could test forecasting agents across arbitrary historical windows without waiting for real time to pass.
The same filtering approach might transfer to evaluating agents in other dynamic settings such as social media streams or market data feeds.
Faster iteration cycles on multi-agent forecasting designs become practical, which could shorten the time needed to improve long-horizon prediction methods.
One could check whether the performance gain comes specifically from the parallel evidence-gathering step by ablating the planner or aggregator roles inside TM.

Load-bearing premise

Filtering post-cutoff content is assumed to approximately reconstruct the web state at any chosen past time without introducing major biases, missing critical pre-cutoff signals, or altering the information environment in ways that affect agent behavior.
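
To make the premise concrete, here is a minimal sketch of a date-based cutoff filter. It is an assumption-laden illustration, not the paper's filter: the `SearchResult` type, the function names, and the two labels are hypothetical, and the sketch deliberately keeps undated results, which is exactly the kind of design choice that could let post-cutoff signals slip through or bias the reconstructed environment.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class SearchResult:
    """Hypothetical record for one search hit."""
    url: str
    snippet: str
    published: Optional[date]  # None when no reliable date is available


def label_result(result: SearchResult, cutoff: date) -> str:
    """Label a result as OUT_OF_TIME or SAFE using only its publish date.

    Assumption: a result with a missing date is treated as SAFE, so
    undated pages stay visible to the agent.
    """
    if result.published is not None and result.published > cutoff:
        return "OUT_OF_TIME"
    return "SAFE"


def filter_results(results: List[SearchResult], cutoff: date) -> List[SearchResult]:
    """Keep only the results that the date check labels SAFE."""
    return [r for r in results if label_result(r, cutoff) == "SAFE"]


if __name__ == "__main__":
    hits = [
        SearchResult("https://example.com/preview", "Preview of the vote", date(2026, 1, 10)),
        SearchResult("https://example.com/result", "Vote result announced", date(2026, 2, 1)),
        SearchResult("https://example.com/undated", "Background on the vote", None),
    ]
    for hit in filter_results(hits, cutoff=date(2026, 1, 15)):
        print(hit.url)
```

A date check alone cannot catch undated pages that reveal the outcome, so a content-level check would be needed on top of it.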

What would settle it

Evaluate the same agents both inside TM set to a recent past cutoff date and in the actual live environment on that same date; if performance rankings or score correlations diverge substantially, the claim that TM is a reliable proxy fails.
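
The comparison described above can be sketched with two standard statistics: Pearson correlation on the raw scores and Spearman correlation on the rankings. The code below is a plain-Python illustration; the agent scores are made-up numbers used only to exercise the functions, not results from the paper.

```python
from math import sqrt
from typing import List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def ranks(values: Sequence[float]) -> List[float]:
    """Average ranks (1 = smallest); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(xs), ranks(ys))


if __name__ == "__main__":
    # Made-up scores for four hypothetical agents, purely illustrative.
    tm_scores = [0.62, 0.55, 0.71, 0.40]
    live_scores = [0.58, 0.57, 0.69, 0.35]
    print("pearson:", pearson(tm_scores, live_scores))
    print("spearman:", spearman(tm_scores, live_scores))
```

A high Pearson value with a low Spearman value would mean the scores track each other while the agent ordering does not, which matters if TM is used to pick a winner.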

Original abstract

Forecasting future events is a critical challenge for large language model (LLM) agents, spanning domains from elections and monetary policy to financial markets. However, evaluating progress on this task presents a fundamental trade-off between efficiency and environment fidelity. While live evaluation benchmarks suffer from an inherently slow feedback loop, existing retrospective replays typically restrict agents to static, pre-frozen databases that sacrifice the environmental realism of actual deployments. To tackle this issue, we introduce Agentic Time Machine (TM), an infrastructure that approximately reconstructs the web state at any chosen past time by filtering post-cutoff content. Leveraging this evaluation infrastructure, we further propose a planner-solver-aggregator multi-agent framework that breaks each question into diverse analytical angles, gathers evidence in parallel, and combines the results into a single forecast. Experiments show that offline scores under TM correlate strongly with live FutureX scores, validating that TM offers a fast and reliable sandbox for forecasting-agent evaluation. On FutureX-Past and Polymarket evaluated under TM, our framework achieves the highest score among strong closed-book, tool-augmented, and self-consistency baselines. On the official FutureX live leaderboard, our system achieves the best average rank over four consecutive weeks, including 1st place in May Week 1. As of June 17, it also ranks 1st on FutureX's official eight-week overall leaderboard.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The reconstruction method is unvalidated and the correlation claim rests on an untested assumption, but the multi-agent setup and leaderboard placement are concrete enough to merit review.

The main point is that this paper gives a way to run faster offline tests for forecasting agents by filtering newer web content to approximate past states, plus a planner-solver-aggregator agent design that scores highest among the reported baselines under TM and took first place on the FutureX live leaderboard in May Week 1.

What works is the practical framing: live evaluation is slow, static replays lose realism, so they try to split the difference with the Time Machine. The reported correlation between TM scores and live FutureX results, plus the best average rank over four consecutive weeks, supplies a usable data point for people who need quicker iteration. The agent breakdown into angles and parallel evidence gathering is a straightforward extension of existing tool-use patterns.

The soft spot is the load-bearing premise. The whole validation chain depends on the filtering step producing a faithful past environment, yet the abstract gives no separate test of reconstruction quality: no checks on missing signals, changed pages, or search ordering. Without that, the correlation could reflect shared artifacts rather than real fidelity. The abstract reports no error bars, correlation coefficients, or controls either.

This is for groups already running LLM agents on prediction benchmarks like FutureX or Polymarket. They might borrow the agent structure or the evaluation shortcut if the code ships. It is not aimed at broader forecasting theory.

Send it to referees. The infrastructure idea targets a genuine pain point and the empirical results are specific, even if the reconstruction claim needs tighter evidence in revision.

Referee Report

2 major / 1 minor

Summary. The paper introduces Agentic Time Machine (TM), an infrastructure that reconstructs past web states via post-cutoff content filtering to enable faster evaluation of LLM forecasting agents than live benchmarks. It proposes a planner-solver-aggregator multi-agent framework and claims that offline TM scores correlate strongly with live FutureX scores, that the framework outperforms baselines on FutureX-Past and Polymarket under TM, and that the system achieves the best average rank (including 1st place in one week) on the official FutureX live leaderboard.

Significance. If the reconstruction fidelity holds and the reported correlation is robustly quantified, TM could address a key efficiency-fidelity trade-off in forecasting-agent evaluation and support faster iteration on multi-agent frameworks. The live leaderboard result provides an external anchor, and the absence of free parameters or ad-hoc axioms in the core infrastructure is a strength. However, the lack of reconstruction-accuracy metrics and statistical details on the correlation substantially weakens the central validation claim.

major comments (2)

[Abstract] The central validation claim, that offline TM scores 'correlate strongly' with live FutureX scores and thereby establish TM as a 'fast and reliable sandbox', is unsupported by any quantitative correlation coefficient, p-value, sample size, error bars, or controls for reconstruction artifacts.
[Abstract] No independent metric (e.g., reconstruction accuracy on held-out pre-cutoff events, comparison of search-result ordering, or ablation of filtering rules) is supplied to test whether post-cutoff filtering faithfully reconstructs the information environment; this assumption underpins both the offline correlation and the planner-solver-aggregator runs.

minor comments (1)

[Abstract] The abstract refers to 'FutureX-Past and Polymarket' and 'strong closed-book, tool-augmented, and self-consistency baselines' without defining the exact tasks, number of questions, or precise baseline implementations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The two major comments correctly identify that the abstract lacks explicit quantitative support for the correlation claim and that no independent reconstruction-fidelity metrics are supplied. We will revise the manuscript to incorporate the requested statistical details and validation experiments.

Referee: [Abstract] The central validation claim, that offline TM scores 'correlate strongly' with live FutureX scores and thereby establish TM as a 'fast and reliable sandbox', is unsupported by any quantitative correlation coefficient, p-value, sample size, error bars, or controls for reconstruction artifacts.

Authors: We agree that the abstract should contain the quantitative correlation statistics. The body of the paper presents a scatter plot of offline versus live scores, but we will add the explicit Pearson correlation coefficient, associated p-value, sample size (number of events), error bars, and a brief discussion of controls for reconstruction artifacts directly into the abstract and a dedicated paragraph in the experiments section. revision: yes

Referee: [Abstract] No independent metric (e.g., reconstruction accuracy on held-out pre-cutoff events, comparison of search-result ordering, or ablation of filtering rules) is supplied to test whether post-cutoff filtering faithfully reconstructs the information environment; this assumption underpins both the offline correlation and the planner-solver-aggregator runs.

Authors: We acknowledge the absence of these independent checks. In the revised manuscript we will add a new subsection under Experiments that reports (1) reconstruction accuracy measured on a held-out set of pre-cutoff events, (2) rank correlation between search-result orderings in the reconstructed versus original environments, and (3) an ablation study varying the post-cutoff filtering rules. These additions will directly test the fidelity assumption. revision: yes
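
As an illustration of check (2), the comparison of search-result orderings could be sketched as below. This is a hypothetical example, not the authors' planned experiment: the function names are invented, result lists are assumed to contain unique URLs, and Kendall's tau over shared items plus Jaccard overlap are just one reasonable pair of measures.

```python
from itertools import combinations
from typing import Sequence


def jaccard_overlap(a: Sequence[str], b: Sequence[str]) -> float:
    """Share of distinct items that appear in both result lists."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


def kendall_tau(a: Sequence[str], b: Sequence[str]) -> float:
    """Kendall tau over the items present in both orderings.

    Assumes each list holds unique items. Returns 1.0 for identical
    relative order and -1.0 for fully reversed order.
    """
    position_in_b = {item: index for index, item in enumerate(b)}
    shared = [item for item in a if item in position_in_b]
    if len(shared) < 2:
        raise ValueError("need at least two shared items to compare order")
    concordant = discordant = 0
    for first, second in combinations(shared, 2):
        # `first` precedes `second` in `a`; check whether `b` agrees.
        if position_in_b[first] < position_in_b[second]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


if __name__ == "__main__":
    # Made-up result lists, purely illustrative.
    original = ["u1", "u2", "u3", "u4"]
    reconstructed = ["u1", "u3", "u2"]
    print("overlap:", jaccard_overlap(original, reconstructed))
    print("tau:", kendall_tau(original, reconstructed))
```

Reporting overlap alongside tau matters because filtering removes items: two lists can agree perfectly on the order of what survives while sharing few results.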

Circularity Check

0 steps flagged

No circularity; validation uses independent live benchmark correlation

The paper presents no derivation chain, equations, or fitted parameters that reduce to their own inputs. The key claim—that TM provides a reliable sandbox—is supported by an empirical correlation between offline TM scores and separate live FutureX leaderboard results, which constitutes external evidence rather than self-referential construction. The filtering premise is an explicit modeling assumption tested by that correlation, not smuggled in via self-citation or ansatz. No self-citations, uniqueness theorems, or renamings appear in the provided text. The argument is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

Review is based on abstract only; full paper may detail additional parameters or assumptions. The central claims rest on the domain assumption that post-cutoff filtering produces a usable historical web state.

axioms (1)

domain assumption: Filtering post-cutoff content sufficiently reconstructs the web state at a chosen past time for agent evaluation purposes.
Invoked as the core mechanism of the Agentic Time Machine infrastructure described in the abstract.

invented entities (2)

Agentic Time Machine (no independent evidence)
purpose: Infrastructure for approximately reconstructing past web states to enable fast offline forecasting agent evaluation
New system introduced to address the efficiency-fidelity trade-off in agent evaluation

planner-solver-aggregator multi-agent framework (no independent evidence)
purpose: Breaks forecasting questions into analytical angles, gathers evidence in parallel, and aggregates results
Proposed architecture that achieves reported performance gains

pith-pipeline@v0.9.1-grok · 5798 in / 1437 out tokens · 29676 ms · 2026-06-26T14:40:30.751372+00:00 · methodology

Reference graph

Works this paper leans on

33 extracted references · 19 canonical work pages · 8 internal anchors

[1]

R. Alur, B. C. Stadie, D. Kang, R. Chen, M. McManus, M. Rickert, T. Lee, M. Federici, R. Zhu, D. Fogerty, H. Williamson, N. Lozinski, A. Linsky, and J. S. Sekhon. Aia forecaster: Technical report, 2025. URL https://arxiv.org/abs/2511.07678

work page arXiv 2025
[2]

Introducing claude sonnet 4.6

Anthropic. Introducing claude sonnet 4.6. https://www.anthropic.com/news/claude-sonnet-4-6, Feb. 2026. Accessed: 2026-05-12

2026
[3]

Scaling open-ended reasoning to predict the future

N. Chandak, S. Goel, A. Prabhu, M. Hardt, and J. Geiping. Scaling open-ended reasoning to predict the future, 2026. URL https://arxiv.org/abs/2512.25070

work page arXiv 2026
[4]

H. Dai, R. Teehan, and M. Ren. Are llms prescient? a continuous evaluation using daily news as the oracle, 2025. URL https://arxiv.org/abs/2411.08324

work page arXiv 2025
[5]

Deepseek v4 preview release

DeepSeek AI. Deepseek v4 preview release. https://api-docs.deepseek.com/news/news260424, Apr. 2026. Accessed: 2026-05-12

2026
[6]

GLM-5-Team, A. Zeng, X. Lv, Z. Hou, Z. Du, Q. Zheng, B. Chen, D. Yin, C. Ge, C. Huang, C. Xie, C. Zhu, C. Yin, C. Wang, G. Pan, H. Zeng, H. Zhang, H. Wang, H. Chen, J. Zhang, J. Jiao, J. Guo, J. Wang, J. Du, J. Wu, K. Wang, L. Li, L. Fan, L. Zhong, M. Liu, M. Zhao, P. Du, Q. Dong, R. Lu, Shuang-Li, S. Cao, S. Liu, T. Jiang, X. Chen, X. Zhang, X. Huang, X...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[7]

S. Goel, N. Chandak, A. Arun, A. Prabhu, S. Staab, M. Hardt, M. Andriushchenko, and J. Geiping. Futuresim: Replaying world events to evaluate adaptive agents, 2026. URL https://arxiv.org/abs/2605.15188

work page internal anchor Pith review Pith/arXiv arXiv 2026
[8]

Gemini 3.1 pro: A smarter model for your most complex tasks

Google. Gemini 3.1 pro: A smarter model for your most complex tasks. https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/, Feb. 2026. Accessed: 2026-05-12

2026
[9]

Approaching human-level forecasting with language models

D. Halawi, F. Zhang, C. Yueh-Han, and J. Steinhardt. Approaching human-level forecasting with language models. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 50426–50468. Curran Associates, Inc., 2024. doi: 10.52202/079017-1598. URL https:/...

work page doi:10.52202/079017-1598 2024
[10]

C. Jin, T. Zhou, Y. Chen, K. Liu, and J. Zhao. Maeps: Multi-agent event prediction system based on human expert team collaboration simulation. Tsinghua Science and Technology, doi: 10.26599/TST.2025.9010160. URL https://www.sciopen.com/article/10.26599/TST.2025.9010160

work page doi:10.26599/tst.2025.9010160 2025
[12]

Forecastbench: A dynamic benchmark of ai forecasting capabilities

E. Karger, H. Bastani, C. Yueh-Han, Z. Jacobs, D. Halawi, F. Zhang, and P. E. Tetlock. Forecastbench: A dynamic benchmark of ai forecasting capabilities. In Y. Yue, A. Garg, N. Peng, F. Sha, and R. Yu, editors, International Conference on Learning Representations, volume 2025, pages 93943–93980, 2025. URL https://proceedings.iclr.cc/paper_files/paper/202...

2025
[13]

A. E. Lahib, Y.-J. Xia, Z. Li, Y. Wang, and X. Pi. Temporal leakage in search-engine date-filtered web retrieval: A retrospective forecasting case study, 2026. URL https://arxiv.org/abs/2602.00758

work page internal anchor Pith review Pith/arXiv arXiv 2026
[14]

Foresight Arena: An On-Chain Benchmark for Evaluating AI Forecasting Agents

M. Nechepurenko and P. Shuvalov. Foresight arena: An on-chain benchmark for evaluating ai forecasting agents, 2026. URL https://arxiv.org/abs/2605.00420

work page internal anchor Pith review Pith/arXiv arXiv 2026
[15]

Introducing gpt-5.4

OpenAI. Introducing gpt-5.4. https://openai.com/index/introducing-gpt-5-4/, Mar. 2026. Accessed: 2026-05-12

2026
[16]

S. Su, S. Xing, X. Dong, M. Zhong, B. Wang, X. Zhu, Y. Chen, W. Wang, Y. Deng, P. Zhu, Z. Liu, T. Li, J. Yu, Z. Chen, L. Bing, and J. Dai. Miroflow: Towards high-performance and robust open-source agent framework for general deep research tasks, 2026. URL https://arxiv.org/abs/2602.22808

work page arXiv 2026
[17]

M. Tan, M. A. Merrill, Z. Gottesman, T. Althoff, D. Evans, and T. Hartvigsen. Inferring events from time series using language models, 2025. URL https://arxiv.org/abs/2503.14190

work page internal anchor Pith review Pith/arXiv arXiv 2025
[18]

K. Team, T. Bai, Y. Bai, Y. Bao, S. H. Cai, Y. Cao, Y. Charles, H. S. Che, C. Chen, G. Chen, H. Chen, J. Chen, J. Chen, J. Chen, J. Chen, K. Chen, L. Chen, R. Chen, X. Chen, Y. Chen, Y. Chen, Y. Chen, Y. Chen, Y. Chen, Y. Chen, Y. Chen, Y. Chen, Z. Chen, Z. Chen, D. Cheng, M. Chu, J. Cui, J. Deng, M. Diao, H. Ding, M. Dong, M. Dong, Y. Dong, Y. Dong, A...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[19]

X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou. Self-consistency improves chain of thought reasoning in language models, 2023. URL https://arxiv.org/abs/2203.11171

work page internal anchor Pith review Pith/arXiv arXiv 2023
[20]

J. Wei, Z. Sun, S. Papay, S. McKinney, J. Han, I. Fulford, H. W. Chung, A. T. Passos, W. Fedus, and A. Glaese. Browsecomp: A simple yet challenging benchmark for browsing agents, URL https://arxiv.org/abs/2504.12516

work page internal anchor Pith review Pith/arXiv arXiv
[22]

Bench to the future: A pastcasting benchmark for forecasting agents

J. Wildman, N. I. Bosse, D. Hnyk, P. Mühlbacher, F. Hambly, J. Evans, D. Schwarz, and L. Phillips. Bench to the future: A pastcasting benchmark for forecasting agents, 2025. URL https://arxiv.org/abs/2506.21558

work page arXiv 2025
[23]

K. Yang, H. Li, H. Wen, T.-Q. Peng, J. Tang, and H. Liu. Are large language models (LLMs) good social predictors? In Y. Al-Onaizan, M. Bansal, and Y.-N. Chen, editors, Findings of the Association for Computational Linguistics: EMNLP 2024, pages 2718–2730, Miami, Florida, USA, Nov. 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.findi...

work page doi:10.18653/v1/2024.findi 2024
[24]

Q. Yang, S. Mahns, S. Li, A. Gu, J. Wu, and H. Xu. Llm-as-a-prophet: Understanding predictive intelligence with prophet arena, 2025. URL https://arxiv.org/abs/2510.17638

2025
[25]

S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao. React: Synergizing reasoning and acting in language models, 2023. URL https://arxiv.org/abs/2210.03629

2023
[26]

C. Ye, Z. Hu, Y. Deng, Z. Huang, M. D. Ma, Y. Zhu, and W. Wang. Mirai: Evaluating llm agents for event forecasting, 2024. URL https://arxiv.org/abs/2407.01231

work page arXiv 2024
[27]

Futurex: An advanced live benchmark for llm agents in future prediction

Z. Zeng, J. Liu, S. Chen, T. He, Y. Liao, Y. Tian, J. Wang, Z. Wang, Y. Yang, L. Yin, M. Yin, Z. Zhu, T. Cai, Z. Chen, J. Chen, Y. Du, X. Gao, J. Guo, L. Hu, J. Jiao, X. Li, J. Liu, S. Ni, Z. Wen, G. Zhang, K. Zhang, X. Zhou, J. Blanchet, X. Qiu, M. Wang, and W. Huang. Futurex: An advanced live benchmark for llm agents in future prediction, 2025. URL http...

work page arXiv 2025
[28]

Time: The result has a clear published date, URL date, title date, or snippet date outside the allowed cd_min-cd_max window
[29]

This applies even when the result has no reliable date, and even when the answer appears only in the snippet rather than the full page

Spoiler: The title or snippet directly resolves the prediction task. This applies even when the result has no reliable date, and even when the answer appears only in the snippet rather than the full page
[30]

Exception: keep background, previews, speculation, or historical context inside the allowed time window that does not resolve the task

Hindsight: The snippet uses result/reporting language that makes a still-protected future outcome appear settled, known, completed, reported, won, lost, announced, confirmed, or published. Exception: keep background, previews, speculation, or historical context inside the allowed time window that does not resolve the task. Visit-side filter. The visit-side...
[31]

Return "LEAKED" if the content directly states, strongly implies, or confirms the answer to the prediction question
[32]

OUT_OF_TIME

Return "OUT_OF_TIME" only if the content itself contains a clear time signal (publish/update/event/result time) that is later than the cutoff time_point
[33]

2026-01-23, what will the high of Apple stock (AAPL) be for the day (in US$)?

Return "SAFE" if the content is only background, historical context, previews, speculation, or the time signal is missing/uncertain. A.3. Per-level effect of Time Machine. The aggregate Time Machine drop in Table 1 falls mostly on the retrieval-heavy levels. Every backbone loses 30 to 50 points on Level 4 (quantitative) but only 10 to 28 points on Level...

work page arXiv 2026
