Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most

Ezra Karger; Jaeho Lee; Nick Merrill

REVIEW 3 major objections 1 minor 69 references

More capable language models produce worse distributional forecasts on problems with superlinear growth and tail risks of regime change.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric. the ladder, T0–T4 →

Challenge this review Re-run · record.json Download PDF Read on arXiv ↗

T0 review · grok-4.3

2026-05-25 06:00 UTC pith:DF6FBB35

load-bearing objection Bigger models shift upper tails upward on superlinear series, worsening full distributional forecasts while looking better on threshold metrics. the 3 major comments →

arxiv 2605.22672 v2 pith:DF6FBB35 submitted 2026-05-21 cs.AI

Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most

Nick Merrill , Jaeho Lee , Ezra Karger This is my paper

classification cs.AI

keywords inverse scalingdistributional forecastinglanguage modelstail risksuperlinear growthforecast calibrationepidemiologyfinancial forecasting

verification ladder T0 review T1 audit T2 compute T3 formal T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that on forecasting tasks whose time series grow rapidly and carry risks of sudden regime shifts, larger and more post-trained models generate less accurate probability distributions overall. The shortfall concentrates in the upper tail, where capable models push predictions higher to match aggressive growth extrapolations while lower tails stay largely unchanged. This inverse scaling appears consistently in a new contamination-free simulated benchmark, matched synthetic epidemic models, and real datasets covering COVID-19, measles, housing, and hyperinflation. Single-threshold accuracy metrics common in LLM evaluations miss the upper-tail cost and can reverse the apparent relationship between capability and performance on the same outputs. The authors argue that continuous, unbounded accuracy measures are needed alongside binary thresholds to reveal these failures.

Core claim

On forecasting problems whose underlying time series exhibit superlinear growth and tail risk of regime change, more capable models produce worse distributional forecasts. The pattern appears on ForecastBench-Sim, synthetic SIR epidemics with a matched linear control, and real-world datasets; a per-quantile decomposition shows the failure concentrates at the upper tail, which more capable models shift upward to track aggressive extrapolations of growth, while the lower tail stays put.

What carries the argument

Per-quantile decomposition of forecast errors, which isolates the upward shift in the upper tail as the source of worse performance in more capable models.

Load-bearing premise

The chosen tasks and datasets are representative of forecasting problems whose underlying time series exhibit superlinear growth and tail risk of regime change.

What would settle it

Finding that more capable models achieve better upper-tail calibration than less capable ones on a new collection of tasks with documented superlinear growth and regime-change risk would falsify the claim.

Watch this falsifier — get emailed when new claim-graph text bears on it.

If this is right

Tail-inclusive scoring reverses the sign of the capability-accuracy relationship on identical outputs relative to single-threshold metrics.
Both model scale and post-training independently contribute to the inverse scaling, as shown within the Llama-3.1 family.
Domain knowledge does not reliably improve calibration on these tasks.
The pattern replicates across synthetic and multiple real-world datasets exhibiting the target structure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Forecasting evaluations for language models should include continuous unbounded metrics as a standard check to surface tail-specific failures.
The result may apply to other prediction settings that involve rapid growth followed by potential breaks, such as certain financial or supply-chain risks.
Training methods could be tested for their ability to limit over-extrapolation specifically in the upper quantiles of growth trajectories.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

Bigger models shift upper tails upward on superlinear series, worsening full distributional forecasts while looking better on threshold metrics.

read the letter

The core result is that on time series with superlinear growth plus regime-change tail risk, more capable LLMs produce worse distributional forecasts. The error concentrates in the upper tail, where stronger models extrapolate growth more aggressively. This holds on their new FBSim benchmark, matched SIR simulations, and real datasets including COVID and housing markets. A Llama-3.1 within-family split shows both scale and post-training contribute. The same outputs flip to positive scaling under single-threshold scoring, which the paper argues misses the tail cost. They recommend continuous unbounded metrics alongside binary ones.

Referee Report

3 major / 1 minor

Summary. The manuscript claims that more capable LLMs produce worse distributional forecasts on problems whose time series exhibit superlinear growth and tail risk of regime change (common in finance and epidemiology). This inverse scaling is documented on the released ForecastBench-Sim (FBSim) benchmark, synthetic SIR epidemics with a matched linear control, and real-world datasets (COVID-19, measles, housing markets, hyperinflation). A per-quantile decomposition localizes the failure to the upper tail; within-family Llama-3.1 results indicate independent contributions from scale and post-training. Single-threshold metrics miss the effect and can reverse the sign of the capability-accuracy relationship.

Significance. If the central empirical pattern holds, the result is significant for LLM evaluation practices in forecasting domains. The release of a contamination-free benchmark and replications on external real-world datasets strengthen reproducibility and external validity. The demonstration that conventional single-threshold scoring can mask upper-tail costs provides a concrete, actionable critique of existing benchmarks.

major comments (3)

[Abstract and Methods] Abstract and Methods: The abstract states that replications appear across simulated and real datasets but supplies no details on statistical tests, data exclusion rules, error-bar construction, or exact quantile definitions. Without these, the support for the central inverse-scaling claim cannot be verified from the reported information.
[Results (per-quantile decomposition)] Results (per-quantile decomposition): The decomposition attributes upper-tail shifts to model capability, yet the manuscript reports no sensitivity analyses to FBSim generation parameters (e.g., superlinear trajectory sampling or regime-change probabilities) or to the matched SIR control. This leaves the attribution vulnerable to confounding by data-generation choices.
[Within-family Llama-3.1 analysis] Within-family Llama-3.1 analysis: The study addresses family-level confounds but does not examine robustness to alternative quantile decompositions or variations in how the target structure is instantiated, which is load-bearing for the claim that the pattern is diagnostic of the described class of forecasting problems rather than specific to the chosen tasks.

minor comments (1)

[Abstract] The abstract is information-dense; consider separating the description of the per-quantile finding and the metric-reversal result into distinct sentences for readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on statistical transparency and robustness. We address each comment below and have revised the manuscript to incorporate additional details and analyses.

read point-by-point responses

Referee: [Abstract and Methods] Abstract and Methods: The abstract states that replications appear across simulated and real datasets but supplies no details on statistical tests, data exclusion rules, error-bar construction, or exact quantile definitions. Without these, the support for the central inverse-scaling claim cannot be verified from the reported information.

Authors: The abstract is intentionally concise. The Methods section specifies bootstrap resampling (1000 iterations) for 95% CIs, inclusion of all series with no exclusions, paired Wilcoxon tests for significance, and quantiles as equal-mass bins of the empirical target distribution. To improve accessibility we have added a brief clause to the abstract on the statistical approach and inserted a summary table of these choices in the Methods. revision: yes
Referee: [Results (per-quantile decomposition)] Results (per-quantile decomposition): The decomposition attributes upper-tail shifts to model capability, yet the manuscript reports no sensitivity analyses to FBSim generation parameters (e.g., superlinear trajectory sampling or regime-change probabilities) or to the matched SIR control. This leaves the attribution vulnerable to confounding by data-generation choices.

Authors: The matched linear SIR control already isolates the contribution of superlinear growth and regime change. Nevertheless, we agree that explicit sensitivity checks strengthen attribution. We have added analyses varying the growth exponent (1.1–2.0) and regime-change probability (0.05–0.2) in FBSim; the inverse-scaling pattern and upper-tail localization remain stable. These results appear as a new supplementary figure and table. revision: yes
Referee: [Within-family Llama-3.1 analysis] Within-family Llama-3.1 analysis: The study addresses family-level confounds but does not examine robustness to alternative quantile decompositions or variations in how the target structure is instantiated, which is load-bearing for the claim that the pattern is diagnostic of the described class of forecasting problems rather than specific to the chosen tasks.

Authors: The within-family results employ the same decile decomposition used throughout for consistency. To demonstrate that the pattern is not an artifact of the exact instantiation, we have added robustness checks using quintile and vigintile decompositions as well as additional problem variants with altered growth exponents and tail-risk probabilities. These appear in a new subsection of the results. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical observational study with no derivation chain or fitted quantities

full rationale

The paper reports empirical observations of inverse scaling on forecasting tasks using a new benchmark (FBSim), synthetic SIR controls, and real-world datasets. No equations, first-principles derivations, parameter fits, or predictions are described that could reduce to self-definition, fitted inputs renamed as predictions, or self-citation chains. The per-quantile decomposition is a descriptive breakdown of observed outputs rather than a definitional equivalence. Central claims rest on replication across datasets and within-family controls, not on any load-bearing ansatz or uniqueness theorem imported from prior work. This is a standard empirical finding with independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the empirical validity of the selected forecasting tasks as representative of superlinear growth with tail risk; no free parameters, invented entities, or additional axioms are stated in the abstract.

axioms (1)

domain assumption The selected tasks (FBSim, synthetic SIR epidemics with linear control, and listed real-world datasets) represent forecasting problems with superlinear growth and tail risk of regime change common in finance and epidemiology.
Invoked to frame the scope of the inverse-scaling claim.

pith-pipeline@v0.9.0 · 5752 in / 1412 out tokens · 43352 ms · 2026-05-25T06:00:47.636695+00:00 · methodology

0 comments

read the original abstract

We document inverse scaling in LLMs on forecasting problems whose underlying time series exhibit superlinear growth and tail risk of regime change, a structure common in finance and epidemiology. On these tasks, more capable models produce worse distributional forecasts. The pattern appears on ForecastBench-Sim (FBSim), a contamination-free, simulated-world benchmark we release, in forecasting synthetic SIR epidemics with a matched linear control, and replicates in real-world datasets on COVID-19, measles, housing markets, and hyperinflation. A per-quantile decomposition shows the failure concentrates at the upper tail, which more capable models shift upward to track aggressive extrapolations of growth, while the lower tail stays put. A within-family study of Llama-3.1 shows that both model scale and post-training independently contribute to this effect. Domain knowledge does not reliably rescue calibration. This inverse scaling does not appear on single-threshold metrics common in LLM forecasting benchmarks, reversing the sign of the capability--accuracy relationship on identical outputs. Single-threshold scoring at conventional cutoffs misses the upper-tail cost; tail-inclusive scoring reverses the sign of the capability--accuracy relationship on the same outputs. We recommend that LLM forecasting evaluations use continuous (and unbounded) measures of accuracy alongside bounded binary threshold metrics.

Figures

Figures reproduced from arXiv: 2605.22672 by Ezra Karger, Jaeho Lee, Nick Merrill.

**Figure 2.** Figure 2: Upper-tail predictions drive the inverse scaling, in both domains. Per-quantile pinball-loss decomposition. Top: FBSim disruptable templates, N=28 (apples-to-apples panel; same model set as Appendix [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Time series with superlinear growth and tail-risk of regime change trigger the inverse scaling. Top: Ground-truth series shown to models as history (black) with continuations not shown (gray). Left: SIR epidemic (log scale): exponential growth then intervention-driven decline. Right: linear growth with the same downward-jump structure. Bottom: CRPS vs. ECI at h=210. On SIR data, more capable models produce… view at source ↗

**Figure 4.** Figure 4: Domain knowledge has inconsistent effects across domains. Naming the domain rescues positive scaling on COVID-19 (ρ: −0.49 → +0.39), substantially attenuates it on housing (∆ρ=+0.86), measles (∆ρ=+0.36), and SIR (∆ρ=+0.24), but has essentially no effect on hyperinflation (∆ρ=+0.00). Red dots: unlabeled numbers (inverse-scaling baseline). Orange crosses: “the current trend may or may not continue.” Green … view at source ↗

**Figure 5.** Figure 5: Across-horizon evolution of the capability–accuracy relationship. Spearman ρ between model capability (Epoch Capabilities Index) and forecast accuracy vs. horizon, sign-flipped so positive = positive scaling, negative = inverse scaling. Top: FBSim, pooled across all six question templates (H1–H7 = game turns). Bottom: pre-vaccine US measles case counts, 1928–1962, pooled across all 35 seasons. In both doma… view at source ↗

Review history (2 revisions) →

discussion (0)

Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · 8 internal anchors

[1]

International Conference on Learning Representations (ICLR) , year=

ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities , author=. International Conference on Learning Representations (ICLR) , year=

work page
[2]

Advances in Neural Information Processing Systems (NeurIPS) , year=

LLM Processes: Numerical Predictive Distributions Conditioned on Natural Language , author=. Advances in Neural Information Processing Systems (NeurIPS) , year=

work page
[3]

arXiv preprint arXiv:2402.18563 , year=

Approaching Human-Level Forecasting with Language Models , author=. arXiv preprint arXiv:2402.18563 , year=

work page arXiv
[4]

and Bastos, Rafael Valdece Sousa and Tetlock, Philip E

Schoenegger, Philipp and Tuminauskaite, Indre and Park, Peter S. and Bastos, Rafael Valdece Sousa and Tetlock, Philip E. , journal=. Wisdom of the Silicon Crowd:

work page
[5]

Do Large Language Models Know What They Don't Know?

Nel, Lukas , journal=. Do Large Language Models Know What They Don't Know?

work page
[6]

arXiv preprint arXiv:2506.00723 , year=

Pitfalls in Evaluating Language Model Forecasters , author=. arXiv preprint arXiv:2506.00723 , year=

work page arXiv
[7]

Uncertainty distillation: Teaching language models to express semantic confidence.arXiv preprint arXiv:2503.14749,

Uncertainty Distillation: Teaching Language Models to Express Semantic Confidence , author=. arXiv preprint arXiv:2503.14749 , year=

work page arXiv
[8]

Alur, Rohan and Stadie, Bradly C. and Kang, Daniel and Chen, Ryan and McManus, Matt and Rickert, Michael and Lee, Tyler and Federici, Michael and Zhu, Richard and Fogerty, Dennis and Williamson, Hayley and Lozinski, Nina and Linsky, Aaron and Sekhon, Jasjeet S. , journal=

work page
[9]

International Conference on Machine Learning (ICML) , year=

On Calibration of Modern Neural Networks , author=. International Conference on Machine Learning (ICML) , year=

work page
[10]

International Conference on Learning Representations (ICLR) , year=

Taming Overconfidence in LLMs: Reward Calibration in RLHF , author=. International Conference on Learning Representations (ICLR) , year=

work page
[11]

Empirical Methods in Natural Language Processing (EMNLP) , year=

Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback , author=. Empirical Methods in Natural Language Processing (EMNLP) , year=

work page
[12]

International Conference on Machine Learning (ICML) , year=

Linguistic Calibration of Long-Form Generations , author=. International Conference on Machine Learning (ICML) , year=

work page
[13]

Conference on Uncertainty in Artificial Intelligence (UAI) , year=

Mitigating Overconfidence in Out-of-Distribution Detection by Capturing Extreme Activations , author=. Conference on Uncertainty in Artificial Intelligence (UAI) , year=

work page
[14]

Advances in Neural Information Processing Systems (NeurIPS) , year=

LACIE: Listener-Aware Finetuning for Confidence Calibration in Large Language Models , author=. Advances in Neural Information Processing Systems (NeurIPS) , year=

work page
[15]

Transactions on Machine Learning Research , year=

Inverse Scaling: When Bigger Isn't Better , author=. Transactions on Machine Learning Research , year=

work page
[16]

Journal of the American Statistical Association , volume=

Strictly Proper Scoring Rules, Prediction, and Estimation , author=. Journal of the American Statistical Association , volume=

work page
[17]

Journal of Applied Meteorology , volume=

A New Vector Partition of the Probability Score , author=. Journal of Applied Meteorology , volume=

work page
[18]

Monthly Weather Review , volume=

Verification of forecasts expressed in terms of probability , author=. Monthly Weather Review , volume=

work page
[19]

International Journal of Forecasting , volume=

The M4 Competition: 100,000 time series and 61 forecasting methods , author=. International Journal of Forecasting , volume=

work page
[20]

arXiv preprint arXiv:2412.18544 , year=

Consistency Checks for Language Model Forecasters , author=. arXiv preprint arXiv:2412.18544 , year=

work page arXiv
[21]

NeurIPS 2024 Workshop , year=

Context is Key: A Benchmark for Forecasting with Essential Textual Information , author=. NeurIPS 2024 Workshop , year=

work page 2024
[22]

Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks , year=

ProbTS: Benchmarking Point and Distributional Forecasting across Diverse Prediction Horizons , author=. Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks , year=

work page
[23]

Decision Analysis , year=

Comparing Elicitation Techniques for Probability Distributions , author=. Decision Analysis , year=

work page
[24]

arXiv preprint arXiv:2404.07452 , year=

RiskLabs: Predicting Financial Risk Using Large Language Model Based on Multi-Sources Data , author=. arXiv preprint arXiv:2404.07452 , year=

work page arXiv
[25]

Scaling Laws for Neural Language Models

Scaling Laws for Neural Language Models , author=. arXiv preprint arXiv:2001.08361 , year=

work page internal anchor Pith review Pith/arXiv arXiv 2001
[26]

Chowdhery, Aakanksha and Narang, Sharan and Devlin, Jacob and Bosma, Maarten and Mishra, Gaurav and Roberts, Adam and Barham, Paul and Chung, Hyung Won and Sutton, Charles and Gehrmann, Sebastian and others , journal=

work page
[27]

, journal=

Wei, Jason and Kim, Najoung and Tay, Yi and Le, Quoc V. , journal=. Inverse Scaling Can Become

work page
[28]

Nature , volume=

Larger and More Instructable Language Models Become Less Reliable , author=. Nature , volume=

work page
[29]

A Rosetta Stone for

Ho, Anson and Denain, Jean-Stanislas and Atanasov, David and Albanie, Samuel and Shah, Rohin , journal=. A Rosetta Stone for

work page
[30]

, journal=

Rein, David and Hou, Betty Li and Stickland, Asa Cooper and Petty, Jackson and Pang, Richard Yuanzhe and Dirani, Julien and Michael, Julian and Bowman, Samuel R. , journal=

work page
[31]

2024 , eprint=

MIRAI: Evaluating LLM Agents for Event Forecasting , author=. 2024 , eprint=

work page 2024
[32]

Chen, Zhen and others , journal=

work page
[33]

2026 , howpublished=

Freeciv.org: open source empire-building strategy game , author=. 2026 , howpublished=

work page 2026
[34]

2026 , howpublished=

Freeciv-web. 2026 , howpublished=

work page 2026
[35]

2026 , howpublished=

Multiplayer Game Manual , author=. 2026 , howpublished=

work page 2026
[36]

2026 , howpublished=

Government.mp , author=. 2026 , howpublished=

work page 2026
[37]

2026 , howpublished=

Technology.mp , author=. 2026 , howpublished=

work page 2026
[38]

2026 , howpublished=

Combat.mp , author=. 2026 , howpublished=

work page 2026
[39]

2026 , howpublished=

Score , author=. 2026 , howpublished=

work page 2026
[40]

Wildman, N

Bench to the Future: A Pastcasting Benchmark for Forecasting Agents , author=. arXiv preprint arXiv:2506.21558 , year=

work page arXiv
[41]

Mirzadeh, Iman and Alizadeh, Keivan and Shahrokhi, Hooman and Tuzel, Oncel and Bengio, Samy and Farajtabar, Mehrdad , journal=

work page
[42]

Qi, Siyuan and Chen, Shuo and Li, Yexin and Kong, Xiangyu and Wang, Junqi and Yang, Bangcheng and Wong, Pring and Zhong, Yifan and Zhang, Xiaoyuan and Zhang, Zhaowei and Liu, Nian and Wang, Wei and Yang, Yaodong and Zhu, Song-Chun , booktitle=

work page
[43]

arXiv preprint arXiv:2206.15474 , year=

Forecasting Future World Events with Neural Networks , author=. arXiv preprint arXiv:2206.15474 , year=

work page arXiv
[44]

Expert Political Judgment: How Good Is It? How Can We Know? , author=

work page
[45]

Towards Understanding Sycophancy in Language Models

Towards Understanding Sycophancy in Language Models , author=. arXiv preprint arXiv:2310.13548 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[46]

TruthfulQA: Measuring How Models Mimic Human Falsehoods

TruthfulQA: Measuring How Models Mimic Human Falsehoods , author=. arXiv preprint arXiv:2109.07958 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[47]

Discovering Language Model Behaviors with Model-Written Evaluations

Discovering Language Model Behaviors with Model-Written Evaluations , author=. arXiv preprint arXiv:2212.09251 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[48]

2026 , url=

Benchmark scores are well correlated, even across domains , author=. 2026 , url=

work page 2026
[49]

Weather and Forecasting , volume=

Decomposition of the continuous ranked probability score for ensemble prediction systems , author=. Weather and Forecasting , volume=. 2000 , publisher=

work page 2000
[50]

Journal of the Royal Statistical Society: Series B (Statistical Methodology) , volume=

Probabilistic forecasts, calibration and sharpness , author=. Journal of the Royal Statistical Society: Series B (Statistical Methodology) , volume=

work page
[51]

Advances in Neural Information Processing Systems , volume=

Are Emergent Abilities of Large Language Models a Mirage? , author=. Advances in Neural Information Processing Systems , volume=

work page
[52]

Transactions on Machine Learning Research , year=

Chronos: Learning the Language of Time Series , author=. Transactions on Machine Learning Research , year=

work page
[53]

Advances in Neural Information Processing Systems , volume=

Large Language Models Are Zero-Shot Time Series Forecasters , author=. Advances in Neural Information Processing Systems , volume=

work page
[54]

2006 , publisher=

Gaussian Processes for Machine Learning , author=. 2006 , publisher=

work page 2006
[55]

2026 , howpublished=

Assessing. 2026 , howpublished=

work page 2026
[56]

2026 , howpublished=

Claude Opus 4.5 System Card , author=. 2026 , howpublished=

work page 2026
[57]

2025 , howpublished=

OpenAI o3 and o4-mini System Card , author=. 2025 , howpublished=

work page 2025
[58]

2025 , howpublished=

GPT-5 System Card , author=. 2025 , howpublished=

work page 2025
[59]

2025 , howpublished=

Gemini 2.5 Technical Report , author=. 2025 , howpublished=

work page 2025
[60]

The Llama 3 Herd of Models

The Llama 3 Herd of Models , author=. arXiv preprint arXiv:2407.21783 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[61]

DeepSeek-V3 Technical Report

DeepSeek-V3 Technical Report , author=. arXiv preprint arXiv:2412.19437 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[62]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-R1 Technical Report , author=. arXiv preprint arXiv:2501.12948 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[63]

Qwen3 Technical Report

Qwen3 Technical Report , author=. arXiv preprint arXiv:2505.09388 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[64]

2025 , howpublished=

Grok 4.1 Model Card , author=. 2025 , howpublished=

work page 2025
[65]

Journal of the American Statistical Association , volume=

Making and evaluating point forecasts , author=. Journal of the American Statistical Association , volume=. 2011 , publisher=

work page 2011
[66]

Counts of Measles reported in

Van Panhuis, Willem and Cross, Anne and Burke, Donald , year=. Counts of Measles reported in

work page
[67]

International Journal of Forecasting , volume=

On single point forecasts for fat-tailed variables , author=. International Journal of Forecasting , volume=. 2022 , publisher=

work page 2022
[68]

International Journal of Forecasting , volume=

False dichotomy alert: Improving subjective-probability estimates vs.\ raising awareness of systemic risk , author=. International Journal of Forecasting , volume=. 2022 , publisher=

work page 2022
[69]

Nature Computational Science , volume=

Advancing real-time infectious disease forecasting using large language models , author=. Nature Computational Science , volume=. 2025 , publisher=

work page 2025

[1] [1]

International Conference on Learning Representations (ICLR) , year=

ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities , author=. International Conference on Learning Representations (ICLR) , year=

work page

[2] [2]

Advances in Neural Information Processing Systems (NeurIPS) , year=

LLM Processes: Numerical Predictive Distributions Conditioned on Natural Language , author=. Advances in Neural Information Processing Systems (NeurIPS) , year=

work page

[3] [3]

arXiv preprint arXiv:2402.18563 , year=

Approaching Human-Level Forecasting with Language Models , author=. arXiv preprint arXiv:2402.18563 , year=

work page arXiv

[4] [4]

and Bastos, Rafael Valdece Sousa and Tetlock, Philip E

Schoenegger, Philipp and Tuminauskaite, Indre and Park, Peter S. and Bastos, Rafael Valdece Sousa and Tetlock, Philip E. , journal=. Wisdom of the Silicon Crowd:

work page

[5] [5]

Do Large Language Models Know What They Don't Know?

Nel, Lukas , journal=. Do Large Language Models Know What They Don't Know?

work page

[6] [6]

arXiv preprint arXiv:2506.00723 , year=

Pitfalls in Evaluating Language Model Forecasters , author=. arXiv preprint arXiv:2506.00723 , year=

work page arXiv

[7] [7]

Uncertainty distillation: Teaching language models to express semantic confidence.arXiv preprint arXiv:2503.14749,

Uncertainty Distillation: Teaching Language Models to Express Semantic Confidence , author=. arXiv preprint arXiv:2503.14749 , year=

work page arXiv

[8] [8]

Alur, Rohan and Stadie, Bradly C. and Kang, Daniel and Chen, Ryan and McManus, Matt and Rickert, Michael and Lee, Tyler and Federici, Michael and Zhu, Richard and Fogerty, Dennis and Williamson, Hayley and Lozinski, Nina and Linsky, Aaron and Sekhon, Jasjeet S. , journal=

work page

[9] [9]

International Conference on Machine Learning (ICML) , year=

On Calibration of Modern Neural Networks , author=. International Conference on Machine Learning (ICML) , year=

work page

[10] [10]

International Conference on Learning Representations (ICLR) , year=

Taming Overconfidence in LLMs: Reward Calibration in RLHF , author=. International Conference on Learning Representations (ICLR) , year=

work page

[11] [11]

Empirical Methods in Natural Language Processing (EMNLP) , year=

Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback , author=. Empirical Methods in Natural Language Processing (EMNLP) , year=

work page

[12] [12]

International Conference on Machine Learning (ICML) , year=

Linguistic Calibration of Long-Form Generations , author=. International Conference on Machine Learning (ICML) , year=

work page

[13] [13]

Conference on Uncertainty in Artificial Intelligence (UAI) , year=

Mitigating Overconfidence in Out-of-Distribution Detection by Capturing Extreme Activations , author=. Conference on Uncertainty in Artificial Intelligence (UAI) , year=

work page

[14] [14]

Advances in Neural Information Processing Systems (NeurIPS) , year=

LACIE: Listener-Aware Finetuning for Confidence Calibration in Large Language Models , author=. Advances in Neural Information Processing Systems (NeurIPS) , year=

work page

[15] [15]

Transactions on Machine Learning Research , year=

Inverse Scaling: When Bigger Isn't Better , author=. Transactions on Machine Learning Research , year=

work page

[16] [16]

Journal of the American Statistical Association , volume=

Strictly Proper Scoring Rules, Prediction, and Estimation , author=. Journal of the American Statistical Association , volume=

work page

[17] [17]

Journal of Applied Meteorology , volume=

A New Vector Partition of the Probability Score , author=. Journal of Applied Meteorology , volume=

work page

[18] [18]

Monthly Weather Review , volume=

Verification of forecasts expressed in terms of probability , author=. Monthly Weather Review , volume=

work page

[19] [19]

International Journal of Forecasting , volume=

The M4 Competition: 100,000 time series and 61 forecasting methods , author=. International Journal of Forecasting , volume=

work page

[20] [20]

arXiv preprint arXiv:2412.18544 , year=

Consistency Checks for Language Model Forecasters , author=. arXiv preprint arXiv:2412.18544 , year=

work page arXiv

[21] [21]

NeurIPS 2024 Workshop , year=

Context is Key: A Benchmark for Forecasting with Essential Textual Information , author=. NeurIPS 2024 Workshop , year=

work page 2024

[22] [22]

Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks , year=

ProbTS: Benchmarking Point and Distributional Forecasting across Diverse Prediction Horizons , author=. Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks , year=

work page

[23] [23]

Decision Analysis , year=

Comparing Elicitation Techniques for Probability Distributions , author=. Decision Analysis , year=

work page

[24] [24]

arXiv preprint arXiv:2404.07452 , year=

RiskLabs: Predicting Financial Risk Using Large Language Model Based on Multi-Sources Data , author=. arXiv preprint arXiv:2404.07452 , year=

work page arXiv

[25] [25]

Scaling Laws for Neural Language Models

Scaling Laws for Neural Language Models , author=. arXiv preprint arXiv:2001.08361 , year=

work page internal anchor Pith review Pith/arXiv arXiv 2001

[26] [26]

Chowdhery, Aakanksha and Narang, Sharan and Devlin, Jacob and Bosma, Maarten and Mishra, Gaurav and Roberts, Adam and Barham, Paul and Chung, Hyung Won and Sutton, Charles and Gehrmann, Sebastian and others , journal=

work page

[27] [27]

, journal=

Wei, Jason and Kim, Najoung and Tay, Yi and Le, Quoc V. , journal=. Inverse Scaling Can Become

work page

[28] [28]

Nature , volume=

Larger and More Instructable Language Models Become Less Reliable , author=. Nature , volume=

work page

[29] [29]

A Rosetta Stone for

Ho, Anson and Denain, Jean-Stanislas and Atanasov, David and Albanie, Samuel and Shah, Rohin , journal=. A Rosetta Stone for

work page

[30] [30]

, journal=

Rein, David and Hou, Betty Li and Stickland, Asa Cooper and Petty, Jackson and Pang, Richard Yuanzhe and Dirani, Julien and Michael, Julian and Bowman, Samuel R. , journal=

work page

[31] [31]

2024 , eprint=

MIRAI: Evaluating LLM Agents for Event Forecasting , author=. 2024 , eprint=

work page 2024

[32] [32]

Chen, Zhen and others , journal=

work page

[33] [33]

2026 , howpublished=

Freeciv.org: open source empire-building strategy game , author=. 2026 , howpublished=

work page 2026

[34] [34]

2026 , howpublished=

Freeciv-web. 2026 , howpublished=

work page 2026

[35] [35]

2026 , howpublished=

Multiplayer Game Manual , author=. 2026 , howpublished=

work page 2026

[36] [36]

2026 , howpublished=

Government.mp , author=. 2026 , howpublished=

work page 2026

[37] [37]

2026 , howpublished=

Technology.mp , author=. 2026 , howpublished=

work page 2026

[38] [38]

2026 , howpublished=

Combat.mp , author=. 2026 , howpublished=

work page 2026

[39] [39]

2026 , howpublished=

Score , author=. 2026 , howpublished=

work page 2026

[40] [40]

Wildman, N

Bench to the Future: A Pastcasting Benchmark for Forecasting Agents , author=. arXiv preprint arXiv:2506.21558 , year=

work page arXiv

[41] [41]

Mirzadeh, Iman and Alizadeh, Keivan and Shahrokhi, Hooman and Tuzel, Oncel and Bengio, Samy and Farajtabar, Mehrdad , journal=

work page

[42] [42]

Qi, Siyuan and Chen, Shuo and Li, Yexin and Kong, Xiangyu and Wang, Junqi and Yang, Bangcheng and Wong, Pring and Zhong, Yifan and Zhang, Xiaoyuan and Zhang, Zhaowei and Liu, Nian and Wang, Wei and Yang, Yaodong and Zhu, Song-Chun , booktitle=

work page

[43] [43]

arXiv preprint arXiv:2206.15474 , year=

Forecasting Future World Events with Neural Networks , author=. arXiv preprint arXiv:2206.15474 , year=

work page arXiv

[44] [44]

Expert Political Judgment: How Good Is It? How Can We Know? , author=

work page

[45] [45]

Towards Understanding Sycophancy in Language Models

Towards Understanding Sycophancy in Language Models , author=. arXiv preprint arXiv:2310.13548 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[46] [46]

TruthfulQA: Measuring How Models Mimic Human Falsehoods

TruthfulQA: Measuring How Models Mimic Human Falsehoods , author=. arXiv preprint arXiv:2109.07958 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[47] [47]

Discovering Language Model Behaviors with Model-Written Evaluations

Discovering Language Model Behaviors with Model-Written Evaluations , author=. arXiv preprint arXiv:2212.09251 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[48] [48]

2026 , url=

Benchmark scores are well correlated, even across domains , author=. 2026 , url=

work page 2026

[49] [49]

Weather and Forecasting , volume=

Decomposition of the continuous ranked probability score for ensemble prediction systems , author=. Weather and Forecasting , volume=. 2000 , publisher=

work page 2000

[50] [50]

Journal of the Royal Statistical Society: Series B (Statistical Methodology) , volume=

Probabilistic forecasts, calibration and sharpness , author=. Journal of the Royal Statistical Society: Series B (Statistical Methodology) , volume=

work page

[51] [51]

Advances in Neural Information Processing Systems , volume=

Are Emergent Abilities of Large Language Models a Mirage? , author=. Advances in Neural Information Processing Systems , volume=

work page

[52] [52]

Transactions on Machine Learning Research , year=

Chronos: Learning the Language of Time Series , author=. Transactions on Machine Learning Research , year=

work page

[53] [53]

Advances in Neural Information Processing Systems , volume=

Large Language Models Are Zero-Shot Time Series Forecasters , author=. Advances in Neural Information Processing Systems , volume=

work page

[54] [54]

2006 , publisher=

Gaussian Processes for Machine Learning , author=. 2006 , publisher=

work page 2006

[55] [55]

2026 , howpublished=

Assessing. 2026 , howpublished=

work page 2026

[56] [56]

2026 , howpublished=

Claude Opus 4.5 System Card , author=. 2026 , howpublished=

work page 2026

[57] [57]

2025 , howpublished=

OpenAI o3 and o4-mini System Card , author=. 2025 , howpublished=

work page 2025

[58] [58]

2025 , howpublished=

GPT-5 System Card , author=. 2025 , howpublished=

work page 2025

[59] [59]

2025 , howpublished=

Gemini 2.5 Technical Report , author=. 2025 , howpublished=

work page 2025

[60] [60]

The Llama 3 Herd of Models

The Llama 3 Herd of Models , author=. arXiv preprint arXiv:2407.21783 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[61] [61]

DeepSeek-V3 Technical Report

DeepSeek-V3 Technical Report , author=. arXiv preprint arXiv:2412.19437 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[62] [62]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-R1 Technical Report , author=. arXiv preprint arXiv:2501.12948 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[63] [63]

Qwen3 Technical Report

Qwen3 Technical Report , author=. arXiv preprint arXiv:2505.09388 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[64] [64]

2025 , howpublished=

Grok 4.1 Model Card , author=. 2025 , howpublished=

work page 2025

[65] [65]

Journal of the American Statistical Association , volume=

Making and evaluating point forecasts , author=. Journal of the American Statistical Association , volume=. 2011 , publisher=

work page 2011

[66] [66]

Counts of Measles reported in

Van Panhuis, Willem and Cross, Anne and Burke, Donald , year=. Counts of Measles reported in

work page

[67] [67]

International Journal of Forecasting , volume=

On single point forecasts for fat-tailed variables , author=. International Journal of Forecasting , volume=. 2022 , publisher=

work page 2022

[68] [68]

International Journal of Forecasting , volume=

False dichotomy alert: Improving subjective-probability estimates vs.\ raising awareness of systemic risk , author=. International Journal of Forecasting , volume=. 2022 , publisher=

work page 2022

[69] [69]

Nature Computational Science , volume=

Advancing real-time infectious disease forecasting using large language models , author=. Nature Computational Science , volume=. 2025 , publisher=

work page 2025