pith. sign in

arxiv: 2604.26106 · v1 · submitted 2026-04-28 · 💻 cs.AI

Evaluating Strategic Reasoning in Forecasting Agents

Pith reviewed 2026-05-07 16:05 UTC · model grok-4.3

classification 💻 cs.AI
keywords AI forecastingstrategic reasoningbenchmarkspastcastingincentive assessmentinstitutional modelingagent evaluation
0
0 comments X

The pith

Frontier AI agents' main forecasting weaknesses lie in judging leaders' incentives, plan follow-through, and institutional dynamics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper creates a controlled benchmark for past forecasts using a fixed set of documents so that AI agents can research and predict without peeking at later events. The benchmark separates skill in gathering information from skill in judging it, and reveals small but consistent accuracy gaps among top models. By identifying where a stronger combined forecaster pulls ahead, the work pinpoints that current agents most often err when they fail to map out what decision-makers want, whether they will stick to announced courses, and how organizations actually operate. Understanding these gaps matters because forecasting accuracy affects planning in business, policy, and risk management, and fixing them could make AI tools more reliable for high-stakes predictions.

Core claim

Using the BTF-2 benchmark of 1,417 pastcasting questions with a frozen research corpus, the authors show that a composite forecaster outperforms individual frontier agents by 0.011 in Brier score, primarily through better pre-mortem analysis of its own blind spots and explicit consideration of black swan events. Expert human forecasters reviewing the traces identify the dominant strategic reasoning failures as inadequate assessment of political and business leaders' incentives, underestimation of whether stated plans will be followed, and incomplete modeling of institutional processes.

What carries the argument

BTF-2 benchmark consisting of pastcasting questions and a frozen 15M-document corpus that enables reproducible, hindsight-free evaluation of agent reasoning traces.

Load-bearing premise

That the specific 1,417 pastcasting questions and the 15M-document frozen corpus provide an unbiased, generalizable test of strategic reasoning ability.

What would settle it

Running the same agents on a new set of pastcasting questions drawn from a different time period or topic distribution and finding that the same three failure modes no longer dominate expert judgments.

Figures

Figures reproduced from arXiv: 2604.26106 by Dan Schwarz, Jack Wildman, Nikos I. Bosse, Rafael Poyiadzi, Tom Liptay.

Figure 1
Figure 1. Figure 1: Calibration curves for the top three frontier agents and the SOTA forecaster ( view at source ↗
Figure 2
Figure 2. Figure 2: CHAMPS KNOW radar plot showing mean Borda score per dimension (rank 1 = 10 view at source ↗
read the original abstract

Forecasting benchmarks produce accuracy leaderboards but little insight into why some forecasters are more accurate than others. We introduce Bench to the Future 2 (BTF-2), 1,417 pastcasting questions with a frozen 15M-document research corpus in which agents reproducibly research and forecast offline, producing full reasoning traces. BTF-2 detects accuracy differences of 0.004 Brier score, and can distinguish differential agent strengths in research vs. judgment. We build a forecaster 0.011 Brier more accurate than any single frontier agent, and use it to evaluate agent strategic reasoning without hindsight bias. We find the better forecaster differs primarily in its pre-mortem analysis of its blind spots and consideration of black swans. Expert human forecasters found the dominant strategic reasoning failures of frontier agents are in assessing political and business leaders' incentives, judging their likelihood to follow through on stated plans, and modeling institutional processes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Bench to the Future 2 (BTF-2), a benchmark of 1,417 pastcasting questions paired with a frozen 15M-document research corpus. Agents perform offline research and forecasting to produce full reasoning traces; the benchmark distinguishes accuracy differences of 0.004 Brier score and research vs. judgment strengths. A composite forecaster is constructed that outperforms any single frontier agent by 0.011 Brier score. Expert human analysis of the traces identifies the dominant strategic-reasoning failures of frontier agents as assessing political/business leaders' incentives, judging follow-through on stated plans, and modeling institutional processes.

Significance. If the results hold, the work supplies concrete diagnostic evidence on why current agents underperform in forecasting, moving beyond accuracy leaderboards to actionable failure modes. The frozen-corpus, hindsight-free design and the construction of a demonstrably superior forecaster are strengths that could guide targeted improvements in agent reasoning.

major comments (2)
  1. [Abstract] Abstract: the assertion that the three listed failure modes are 'dominant' in frontier agents rests on expert analysis of traces from the BTF-2 set. No validation is provided that the 1,417 questions constitute a representative sample of strategic forecasting problems (e.g., no stratification by domain, no out-of-distribution hold-out set, no check against over-representation of political/business events), so generalization of the 'dominant' claim is not yet secured.
  2. [Evaluation and data sections] Evaluation and data sections: the manuscript reports concrete Brier-score deltas and human-expert analysis but supplies insufficient detail on question-selection criteria, exact composition of the 15M-document corpus, and plans for public data release. Without these, independent verification of leakage controls and reproducibility of the reported accuracy differences cannot be performed.
minor comments (2)
  1. [Abstract] Abstract: the reported Brier-score improvements (0.004 and 0.011) are given without accompanying standard errors, number of questions per comparison, or baseline definition, making it hard to judge statistical and practical significance.
  2. Notation: the distinction between 'research' and 'judgment' components of agent performance is introduced but not formally defined or operationalized in a way that allows readers to replicate the differential-strength analysis.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments. We address each major comment point by point below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the assertion that the three listed failure modes are 'dominant' in frontier agents rests on expert analysis of traces from the BTF-2 set. No validation is provided that the 1,417 questions constitute a representative sample of strategic forecasting problems (e.g., no stratification by domain, no out-of-distribution hold-out set, no check against over-representation of political/business events), so generalization of the 'dominant' claim is not yet secured.

    Authors: We appreciate the referee's observation on the scope of the 'dominant' claim. The BTF-2 questions were selected to emphasize strategic forecasting challenges across domains, and the expert trace analysis identified the three failure modes as the most frequent within this set. We did not include explicit stratification or an OOD hold-out in the reported experiments. To address the concern, we will revise the abstract and discussion to qualify that these are the dominant failures observed in expert analysis of BTF-2 traces, rather than asserting broader dominance without additional validation. This change preserves the empirical finding while limiting overgeneralization. revision: yes

  2. Referee: [Evaluation and data sections] Evaluation and data sections: the manuscript reports concrete Brier-score deltas and human-expert analysis but supplies insufficient detail on question-selection criteria, exact composition of the 15M-document corpus, and plans for public data release. Without these, independent verification of leakage controls and reproducibility of the reported accuracy differences cannot be performed.

    Authors: We agree that the current level of detail is insufficient for full reproducibility. In the revised manuscript we will expand the Evaluation and Data sections to specify the question-selection criteria (including domain balance and strategic-reasoning focus), provide a more granular description of the 15M-document corpus (sources, temporal coverage, and freezing protocol), and state our plans for public release of the questions, corpus metadata, and evaluation code. These additions will support independent verification of leakage controls and the reported Brier-score differences. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper constructs BTF-2 as an external benchmark with a frozen 15M-document corpus and 1,417 pastcasting questions, then elicits full reasoning traces from agents and has independent expert human forecasters label the dominant failure modes in those traces. This evaluation step does not reduce any claim to a fitted parameter on the same test set, a self-definition, or a self-citation chain; the human analysis is performed outside the model's own outputs and the question set is treated as given input rather than derived from the results. No equations or steps in the abstract or described methodology exhibit the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper rests on standard forecasting evaluation practices and human judgment for qualitative analysis; no new entities or fitted parameters are introduced in the abstract.

axioms (2)
  • standard math Brier score is a proper scoring rule suitable for comparing probabilistic forecasts
    Invoked implicitly when reporting 0.004 and 0.011 Brier differences
  • domain assumption Human expert review of reasoning traces can reliably identify strategic reasoning failures
    Central to the qualitative findings on incentives and institutions

pith-pipeline@v0.9.0 · 5464 in / 1335 out tokens · 41889 ms · 2026-05-07T16:05:35.314612+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages

  1. [1]

    arXiv preprint arXiv:2506.21558 , year=

    Preprint, September 2024. Philip E. Tetlock and Dan Gardner.Superforecasting: The Art and Science of Prediction. Crown, 2015. Jack Wildman, Nikos I. Bosse, Daniel Hnyk, Peter Mühlbacher, Finn Hambly, Jon Evans, Dan Schwarz, and Lawrence Phillips. Bench to the future: A pastcasting benchmark for forecasting agents.arXiv preprint arXiv:2506.21558, 2025. Zel...

  2. [2]

    will be total and there will be no going back

    ASUU’s planned escalation sequence was explicitly: 14-day ultimatum → 2-week warning strike → indefinite strike. ASUU National President Prof. Chris Piwuna stated the next phase of action “will be total and there will be no going back.”

  3. [3]

    far from being over

    As of October 21, ASUU is holding a strategic NEC meeting to evaluate the strike, with many branches voting to continue/escalate. Officials from several universities indicated the matter is “far from being over.”

  4. [4]

    The NLC has issued a 4-week ultimatum threatening solidarity action if ASUU’s demands aren’t met. 13

  5. [5]

    no work, no pay

    The FG has implemented a “no work, no pay” policy, which could harden both sides’ positions. Evidence that could prevent escalation:

  6. [6]

    ASUU reports progress in negotiations with FG’s Renegotiation Team on 5 of 7 demands

  7. [7]

    The Senate has intervened to mediate

  8. [8]

    Aprovado

    Historical precedent shows some warning strikes end with partial concessions. However, ASUU’s demands are structural and longstanding (dating to 2009), making full resolution within weeks unlikely. The resolution window extends through December 31, 2025, a very long period that provides multiple opportunities for escalation. Even if the current round of n...

  9. [9]

    Identify the then-currentproposição principalfor the package including PL 1.874/2022 by checking itsficha de tramitação

  10. [10]

    Aprovado

    Open thefichaof that lead proposition (as of 2025-10-01, PL 3.899/2012) and check the 2025 events for a Plenary voting entry indicating “Aprovado” during the observation window

  11. [11]

    If PL 1.874/2022 itself is voted as principal in the Plenary during the window, check itsfichafor the Plenary approval event

  12. [12]

    Aprovado

    If theficha(s)explicitly show such a Plenary approval event within the observation window, resolve YES. Otherwise, resolve NO. Key context:PL 1.874/2022 was presented in the Câmara on 2024-03-21 and is currentlyapensadaunder PL 3.899/2012. The lead proposition’sfichashows a relator designation (Dep. Luciano Vieira, 2025-03-20), urgency regime metadata, an...

  13. [13]

    The bill has urgency status since November 2024

  14. [14]

    Finance Minister Haddad cited it as a government priority

  15. [15]

    historic

    A “historic” agreement was reached on September 24, 2025, between government, opposition, and productive sectors to de-append PL 1.874/2022 from PL 3.899/2012, potentially simplifying its path to a vote

  16. [16]

    However, significant obstacles remain:

    End-of-year legislative sessions typically see more bills cleared. However, significant obstacles remain:

  17. [17]

    The de-appending has NOT yet been effectuated—requests REQ 3873/2025 and REQ 4116/2025 are still pending. 16

  18. [18]

    legislative monster

    The productive sector called the current substitute text a “legislative monster” with over 100 articles and unrealistic obligations

  19. [19]

    The bill has failed to reach a vote in every Plenary session it was scheduled for in 2025

  20. [20]

    The procedural path forward remains uncertain due to pending de-appending requests. While there is political will and approximately 2.5 months remain in the observation window, the repeated failure to vote despite being on the agenda, ongoing procedural uncertainty about de-appending, and substantial industry opposition make approval uncertain but possibl...