ForecastBench-Sim: A Simulated-World Forecasting Benchmark

Ezra Karger; Jaeho Lee; Nick Merrill

arXiv: 2606.18686 · v1 · pith:R7N4E4K3new · submitted 2026-06-17 · 💻 cs.AI · cs.CL · cs.LG

Pith reviewed 2026-06-26 21:26 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.LG

keywords forecasting benchmark · simulated worlds · probabilistic reasoning · dynamic states · tail events · causal questions · AI evaluation · question generation

The pith

A simulated strategy game benchmark generates forecasting questions at arbitrary time horizons that resolve immediately for scoring.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a forecasting benchmark built from simulated game worlds to overcome constraints of real-world evaluation. Real-world setups face slow outcome resolution, infrequent tail events, and hard-to-score counterfactual questions. The simulated approach generates continuous or binary questions at any chosen horizon, creates paired intervention worlds for conditional or causal queries, and supplies ground truth for rare or disruptive events. This yields controlled tasks for examining probabilistic reasoning as world states change over time. The benchmark is framed as a complement to real-world forecasting tests rather than a substitute.
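
One way to picture how a paired intervention world could be scored is as a comparison between two forecasts against the same intervention-world outcome. The sketch below is an assumption, not the paper's scoring code: the function name `conditional_gain` and the use of a Brier difference are illustrative choices that follow the wording of the Figure 10 caption (positive when the conditional forecast lands closer to the intervention-world outcome than the baseline forecast would have). The numbers are made up.

```python
def brier(prob, outcome):
    """Squared error between a stated probability and a 0/1 outcome."""
    return (prob - outcome) ** 2


def conditional_gain(baseline_prob, conditional_prob, intervention_outcome):
    """Brier improvement of the conditional forecast over the baseline forecast,
    both scored against the outcome observed in the intervention world.
    (Hypothetical definition; the paper's exact formula is not given here.)"""
    return brier(baseline_prob, intervention_outcome) - brier(
        conditional_prob, intervention_outcome
    )


# Made-up example: the event happens in the intervention world (outcome = 1).
gain = conditional_gain(baseline_prob=0.3, conditional_prob=0.7, intervention_outcome=1)
```

Under this reading, a forecaster who ignores the intervention and repeats the baseline probability gets a gain of exactly zero, which is what makes a placebo (null conditional) a natural control.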

Core claim

The benchmark is constructed by running rollouts in a simulated turn-based strategy environment. Forecasters receive a fixed structured snapshot of the current state and answer questions about hidden future states. The simulation then continues to reveal actual outcomes, enabling immediate scoring. This design supports generation of question families at arbitrary horizons along with paired intervention worlds and resolved examples of low-probability outcomes.
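
The order of operations described above can be sketched with a toy stand-in for the game. Everything in this block is hypothetical: `ToyWorld`, its treasury dynamics, the threshold question, and `naive_forecaster` are illustrative inventions, not Freeciv and not the paper's pipeline. The point is only the sequence: freeze a report, forecast from the report alone, continue the rollout, then score.

```python
import copy
import random


class ToyWorld:
    """Hypothetical stand-in for a simulated game world (not Freeciv)."""

    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.turn = 0
        self.treasury = 100

    def step(self):
        # Toy dynamics: steady income plus an occasional large loss (a tail event).
        self.treasury += 10
        if self.rng.random() < 0.1:
            self.treasury -= 80
        self.turn += 1

    def report(self):
        # Fixed snapshot shown to the forecaster; no future turns are included.
        return {"turn": self.turn, "treasury": self.treasury}


def brier(prob, outcome):
    """Squared error between a stated probability and a 0/1 outcome."""
    return (prob - outcome) ** 2


def run_task(seed, snapshot_turn, horizon, threshold, forecaster):
    world = ToyWorld(seed)
    for _ in range(snapshot_turn):
        world.step()
    report = world.report()                       # 1. freeze the report
    prob = forecaster(report, horizon, threshold)  # 2. forecast from the report only
    future = copy.deepcopy(world)
    for _ in range(horizon):                      # 3. continue the simulation
        future.step()
    outcome = int(future.treasury >= threshold)   # 4. resolve the binary question
    return prob, outcome, brier(prob, outcome)


def naive_forecaster(report, horizon, threshold):
    # Extrapolates income linearly and ignores tail risk entirely.
    projected = report["treasury"] + 10 * horizon
    return 0.9 if projected >= threshold else 0.1


prob, outcome, score = run_task(
    seed=5, snapshot_turn=60, horizon=10, threshold=400, forecaster=naive_forecaster
)
```

Because the world is seeded, rerunning `run_task` with the same arguments reproduces the same outcome, which is the property that lets a simulated benchmark resolve questions immediately and repeatably.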

What carries the argument

Rollouts in a simulated turn-based strategy environment that supply structured world reports as input and continue the simulation to produce ground truth for scoring predictions on future states.
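
The figure captions on this page report Brier scores for binary questions and normalized CRPS for continuous ones. As a rough illustration of what such scoring involves, the sketch below implements the standard Brier score and a sample-based CRPS estimate. The normalization step is an assumption: this page does not say how the paper normalizes CRPS, so `normalized_crps` simply divides by a caller-supplied `scale`.

```python
def brier_score(prob, outcome):
    """Brier score for one binary question: (p - o)^2, lower is better."""
    return (prob - outcome) ** 2


def crps_from_samples(samples, actual):
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|, lower is better."""
    n = len(samples)
    term_obs = sum(abs(x - actual) for x in samples) / n
    term_spread = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return term_obs - 0.5 * term_spread


def normalized_crps(samples, actual, scale):
    """CRPS divided by a caller-chosen scale.
    (Assumed normalization for illustration; not the paper's stated rule.)"""
    return crps_from_samples(samples, actual) / scale
```

With a single sample the CRPS estimate reduces to absolute error, and a forecast concentrated exactly on the realized value scores zero, which is why CRPS is often described as a distributional generalization of absolute error.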

Load-bearing premise

The simulation dynamics and world reports provide a faithful proxy for the challenges of real-world forecasting, including tail events and dynamic state changes.

What would settle it

A finding that model performance rankings on questions from the simulated benchmark show little or no correlation with rankings on established real-world forecasting tasks would indicate the simulation does not serve as an effective proxy.
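
Rank agreement of this kind is commonly measured with a Spearman correlation. The following is a minimal sketch using only the standard library; the scores are made-up placeholders, not results from the paper or from any real benchmark.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Made-up Brier scores for four placeholder models (lower is better on both).
sim_brier = [0.22, 0.25, 0.28, 0.31]
real_brier = [0.18, 0.17, 0.21, 0.24]
rho = spearman(sim_brier, real_brier)
```

A value of `rho` near zero across a reasonable number of models would be the kind of evidence described above; with only a handful of models, the estimate is noisy and should be read with that in mind. The function divides by zero if either input is constant, so a real analysis would guard against that case.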

Figures

Figures reproduced from arXiv: 2606.18686 by Ezra Karger, Jaeho Lee, Nick Merrill.

**Figure 1.** ForecastBench-Sim evaluates forecasts over hidden future states in controlled simulated worlds. A fixed report is shown to forecasters; future turns are withheld until the simulation is continued and scored. The same machinery also supports paired savegame interventions for conditional or causal questions.

**Figure 2.** The curated subset with complete coverage across the artifacts used in this paper, preserving rank context from the larger runs. Across the nine curated models, mean Brier on H1–H7 binary questions ranges from 0.220 (GPT-5.1) to 0.313 (Gemini 2.5 Flash), and mean normalized CRPS on H1–H7 continuous questions ranges from 0.283 (o3) to 0.590 (Gemini 2.5 Pro), a spread of roughly 40% on Brier and …

**Figure 3.** ForecastBench-Sim captures horizon-dependent difficulty: questions about further-future events are more difficult than questions about nearer-future events. (Scores are averaged by horizon for the curated model set.) … participants answered 24 continuous questions over two worlds, covering city, technology, and treasury templates at H1, H3, H4, and H6. The crowd-mean normalized CRPS is close to a uniform-bin base…

**Figure 4.** Turn-60 territory map from world seed5. Colored regions are civilization-controlled tiles; black squares are cities; grey is unclaimed land; white is ocean. The full report includes one such map every 10 turns from turn 10 to turn 60. Structured text. The text portion of the report is a sequence of labeled tables: a roster of civilizations, the current-turn state vector for each (score, government, treasur…

**Figure 5.** Human pilot compared with models on the closest aggregate proxy: cities, technologies, and treasury at H1, H3, H4, and H6. The human pilot has 24 questions on 2 worlds; the model proxy uses the same template/horizon family across the larger model question set, so this is not a matched head-to-head comparison.

**Figure 6.** Full binary and continuous leaderboard. Curated models are highlighted, while additional models provide context for the compact validation figure in the main text.

**Figure 7.** Binary and continuous performance are related but not identical. Some models rank well under Brier score while ranking lower under normalized CRPS, consistent with the benchmark measuring more than one forecasting skill.

**Figure 8.** Template and horizon slices expose heterogeneous difficulty. This is useful for separating predictable quantities, such as technologies, from quantities more exposed to disruption, such as treasury or city count.

**Figure 9.** Binary calibration diagnostics for the curated model set. These diagnostics complement average Brier scores by showing whether errors are primarily calibration errors or discrimination errors.

**Figure 10.** Conditional intervention gains for Republic and +500 gold pilots. Positive values indicate that the conditional forecast was closer to the intervention-world outcome than the baseline forecast would have been. The appendix also includes a small placebo control for conditional phrasing. In an archived Opus 4.5 Republic pilot, replacing the intervention with a null conditional leaves performance near baseli…

**Figure 11.** Placebo control for conditional framing in an Opus 4.5 Republic pilot. Replacing the intervention with a null conditional leaves Brier essentially unchanged from baseline, while the real intervention sharply degrades performance.

**Figure 12.** Example tail-risk diagnostic enabled by dense simulated-world sampling, adapted from a companion anonymous analysis on the same ForecastBench-Sim continuous question family (Anonymous, 2026). On disruptable continuous templates, the fraction of actuals falling below the model’s stated p10 rises well above the nominal 10% rate, with treasury reaching about 50% by H6–H7.
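
The tail-risk diagnostic in the Figure 12 caption amounts to counting how often the resolved value falls below the forecaster's stated 10th percentile. A minimal sketch of that count follows; the function name and the numbers are made up for illustration and are not taken from the paper or its companion analysis.

```python
def below_p10_rate(stated_p10s, actuals):
    """Fraction of questions whose resolved value fell below the stated 10th percentile."""
    if len(stated_p10s) != len(actuals):
        raise ValueError("need exactly one resolved value per forecast")
    misses = sum(1 for p10, actual in zip(stated_p10s, actuals) if actual < p10)
    return misses / len(actuals)


# Made-up treasury forecasts: stated p10 values and the values that resolved.
p10s = [120, 200, 90, 310]
actuals = [150, 40, 95, 20]
rate = below_p10_rate(p10s, actuals)
```

For a well-calibrated forecaster this rate should sit near 0.10 over many questions; a rate well above that means the stated lower tails are too optimistic, which is the pattern the caption reports for disruptable templates.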

Original abstract

Forecasting benchmarks for general-purpose AI systems usually inherit the constraints of the real world: outcomes resolve slowly, tail events are rare, and counterfactual questions are difficult to score. We introduce ForecastBench-Sim, a simulated-world forecasting benchmark built on game rollouts from Freeciv, a turn-based strategy game modelled on the Civilization series. Forecasters receive a fixed world report (a structured snapshot of the current game state) and answer questions about hidden future states; the benchmark then continues the simulation and scores forecasts. Because the world is simulated, the same setup can generate continuous or binary forecasting questions at arbitrary time horizons, paired intervention worlds for conditional or causal questions, and resolved examples of rare or disruptive outcomes. We describe the benchmark pipeline, question families, scoring protocol, and release artifacts, and report validation slices from model evaluations and an anonymized human pilot. ForecastBench-Sim is intended to complement real-world forecasting benchmarks by providing controlled, immediately resolvable tasks for studying probabilistic reasoning under dynamic world states.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

ForecastBench-Sim gives a clean way to generate resolvable forecasting tasks with interventions via Freeciv rollouts, and the construction holds up internally.

This paper's main contribution is a benchmark that runs Freeciv game simulations to produce forecasting questions with quick resolution, arbitrary time horizons, paired intervention worlds for causal or conditional queries, and examples of rare outcomes. The setup uses fixed world reports as input and then continues the simulation to score answers. That combination of features is new relative to earlier forecasting benchmarks.

The paper does a straightforward job describing the pipeline, question families, scoring protocol, and release artifacts. It also includes validation slices from model runs and an anonymized human pilot. The stress-test note is correct that the central claims follow directly from the simulation design without hidden assumptions about real-world fidelity; the authors position it explicitly as a complement rather than a replacement.

The soft spot is that the provided description gives no quantitative results from the model evaluations or the pilot, and no details on exact scoring implementation or data exclusion rules. This makes it hard to judge how hard the tasks actually are or how much signal they provide. The human pilot is mentioned but not characterized enough to assess its strength.

This is for researchers building or testing probabilistic reasoning in AI systems who want controlled, fast-turnaround tasks with interventions. A reader focused on benchmark design or dynamic forecasting would find the specific mechanics useful. It deserves peer review because the design is novel, the internal logic is consistent, and the artifacts are released.

Referee Report

0 major / 3 minor

Summary. The paper introduces ForecastBench-Sim, a simulated-world forecasting benchmark built on Freeciv game rollouts. Forecasters receive fixed world reports (structured game-state snapshots) and answer questions about future hidden states; the simulation is continued to score the forecasts. The design enables generation of continuous or binary questions at arbitrary time horizons, paired intervention worlds for conditional/causal questions, and resolved examples of rare or disruptive outcomes. The manuscript describes the benchmark pipeline, question families, scoring protocol, and release artifacts, and reports validation slices from model evaluations plus an anonymized human pilot. It positions the benchmark as a complement to real-world forecasting benchmarks for studying probabilistic reasoning under dynamic world states.

Significance. If the implementation details hold, ForecastBench-Sim supplies a controllable, immediately resolvable testbed that overcomes key constraints of real-world forecasting benchmarks (slow resolution, rarity of tail events, difficulty of counterfactual scoring). The simulation approach directly supports reproducible generation of arbitrary-horizon, paired-intervention, and rare-event questions, which could enable targeted experiments on dynamic-state probabilistic reasoning that are otherwise intractable.

minor comments (3)

[Abstract] Abstract: the claim that 'the same setup can generate continuous or binary forecasting questions at arbitrary time horizons' is presented without an explicit statement of the supported horizon range or any constraints imposed by Freeciv turn mechanics; adding one sentence would clarify the scope.
[Validation section] The manuscript states that validation slices from model evaluations and a human pilot are reported, yet no table or figure summarizing quantitative metrics (e.g., calibration scores, number of questions, or inter-rater agreement) is referenced in the provided abstract; ensure these appear with clear captions in the main text.
[Introduction] Minor notation issue: the term 'fixed world report' is used repeatedly but never given a formal definition or example schema in the abstract; a short illustrative excerpt would aid readers.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of ForecastBench-Sim, recognition of its significance in addressing limitations of real-world forecasting benchmarks, and recommendation for minor revision. No specific major comments were listed in the report.

Circularity Check

0 steps flagged

No significant circularity; benchmark construction is self-contained

The paper presents ForecastBench-Sim as a benchmark built directly on Freeciv game rollouts, with claims about generating questions, interventions, and resolved outcomes following immediately from the simulation design. No equations, fitted parameters, predictions, or derivations are present that reduce to inputs by construction. No self-citations are load-bearing for any uniqueness or ansatz, and the work does not rename prior results or import theorems from the authors' prior papers. The central contribution is the benchmark pipeline and artifacts, which are internally consistent by explicit construction without external validation dependencies that would create circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that game simulations capture relevant forecasting properties; no free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption: Freeciv game dynamics and structured world reports serve as a suitable proxy for real-world forecasting challenges, including tail events and dynamic states.
This assumption underpins the claim that the benchmark complements real-world forecasting tasks.

Reference graph

Works this paper leans on

11 extracted references · 2 canonical work pages

[1]

Verification of forecasts expressed in terms of probability

doi: 10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2. Epoch AI. Data on AI Capabilities and Benchmarking. https://epoch.ai/benchmarks,

work page doi:10.1175/1520-0493(1950)078 1950
[2]

FutureSearch, Wildman, J., Bosse, N

Accessed: 2026-06-16. FutureSearch, Wildman, J., Bosse, N. I., Hnyk, D., Mühlbacher, P., Hambly, F., Evans, J., Schwarz, D., and Phillips, L. Bench to the future: A pastcasting benchmark for forecasting agents. arXiv preprint arXiv:2506.21558,

arXiv 2026
[3]

Journal of the American Statistical Association

doi: 10.1198/016214506000001437. Halawi, D., Zhang, F., Yueh-Han, C., and Steinhardt, J. Approaching human-level forecasting with language models. In Advances in Neural Information Processing Systems (NeurIPS),

work page doi:10.1198/016214506000001437
[5]

Jin, Z., Chen, Y., Leeb, F., Gresele, L., Kamal, O., Lyu, Z., Blin, K., Adauto, F

URL https://arxiv.org/abs/2512.00193. Jin, Z., Chen, Y., Leeb, F., Gresele, L., Kamal, O., Lyu, Z., Blin, K., Adauto, F. G., Kleiman-Weiner, M., Sachan, M., and Schölkopf, B. CLadder: Assessing causal reasoning in language models. In Advances in Neural Information Processing Systems (NeurIPS),

arXiv
[6]

Paleka, D., Goel, S., Geiping, J., and Tramèr, F

arXiv:2409.19839. Paleka, D., Goel, S., Geiping, J., and Tramèr, F. Pitfalls in evaluating language model forecasters. arXiv preprint arXiv:2506.00723, 2025a. URL https://arxiv.org/abs/2506.00723. Paleka, D., Sudhir, A. P., Alvarez, A., Bhat, V., Shen, A., Wang, E., and Tramèr, F. Consistency checks for language model forecasters. In International Co...

arXiv 2025
[7]

Requeima, J., Bronskill, J., Choi, D., Turner, R

arXiv:2401.10568. Requeima, J., Bronskill, J., Choi, D., Turner, R. E., and Duvenaud, D. LLM processes: Numerical predictive distributions conditioned on natural language. In Advances in Neural Information Processing Systems (NeurIPS),

arXiv
[8]

The Freeciv Project

URL https://arxiv.org/abs/2405.12856. The Freeciv Project. Freeciv. https://www.freeciv.org/,

arXiv
[9]

Tian, K., Mitchell, E., Zhou, A., Sharma, A., Rafailov, R., Yao, H., Finn, C., and Manning, C

Accessed 2026-05-06. Tian, K., Mitchell, E., Zhou, A., Sharma, A., Rafailov, R., Yao, H., Finn, C., and Manning, C. D. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 5433–5442,

2026
[10]

Vezhnevets, A

URL https://aclanthology.org/2023.emnlp-main.330/. Vezhnevets, A. S., Agapiou, J. P., Matyas, A., Beyret, B., Aldridge, E., Lopez Guevara, T., Sunehag, P., Duéñez-Guzmán, E. A., Webb, T., Bishop, J., Garrad, S., Vaughan, D., Ramos, D., Anderson, G., Rabinowitz, N., and Leibo, J. Z. Generative agent-based modeling with actions grounded in physical, s...

2023
[11]

5 ForecastBench-Sim A

URL https://icml.cc/virtual/2025/poster/44343. A. Sample World Report. Each forecasting task is built around a single world report: a structured text document, optionally accompanied by territory map images, that summarizes the state of a Freeciv game at the snapshot turn (currently turn 60). The report is the only world-specific input ...

2025
[12]

This appendix expands those numbers with per-horizon Spearman correlations and the underlying model lists

and a published real-world forecasting benchmark (ForecastBench Dataset Brier; Karger et al., 2024). This appendix expands those numbers with per-horizon Spearman correlations and the underlying model lists. Data sources. ECI scores are taken from the Epoch AI benchmarking hub (Epoch AI, 2026). ForecastBench Dataset Brier scores are zero-shot entries from ...

2024
