Two AI Metrics Diverged: Will it Make All the Difference?

Alex Fogelson; Hans Gundlach; Jayson Lynch; Neil Thompson; Zachary A. Brown

arxiv: 2607.00913 · v1 · pith:RK2SITR6new · submitted 2026-07-01 · 💻 cs.AI

Two AI Metrics Diverged: Will it Make All the Difference?

Alex Fogelson , Zachary A. Brown , Hans Gundlach , Jayson Lynch , Neil Thompson This is my paper

Pith reviewed 2026-07-02 12:27 UTC · model grok-4.3

classification 💻 cs.AI

keywords AI performance metricsscaling lawsbounded metricsunbounded metricsmeek modelsfrontier modelscompute scalingcapability measurement

0 comments

The pith

The choice of performance metric decides whether frontier AI models keep a growing lead or smaller ones catch up as compute scales.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how different ways of measuring AI performance affect whether advanced models will keep pulling ahead or if smaller models will catch up as compute increases. It classifies metrics by their mathematical relationship to training and inference compute and proves that all bounded metrics favor convergence to smaller models. The authors also show that many common metrics have closely related versions that switch between bounded and unbounded behavior. This distinction changes predictions about whether AI capabilities will spread widely or stay concentrated among a few large developers. Choosing the right metric for a given domain is therefore required before drawing policy conclusions.

Core claim

We provide tight mathematical conditions for determining which metrics favor meek models, and show that bounded performance metrics always do. But careful interpretation of performance metrics is essential: we show that many common bounded metrics have closely-related counterpart metrics that are unbounded (and vice versa). Determining the apt metric in a domain is a prerequisite for policy, since bounded and unbounded metrics may suggest opposing policy responses. If a particular capability is unbounded when measured in the terms we care about, frontier-level capability will likely be concentrated in the hands of a few wealthy actors. Conversely, if that capability is instead bounded, front

What carries the argument

Classification of performance metrics into bounded or unbounded categories according to their functional forms with respect to training and inference compute.

If this is right

If a capability is measured with a bounded metric, frontier-level performance spreads to smaller models with more compute.
If a capability is measured with an unbounded metric, frontier-level performance remains concentrated among high-compute developers.
Many standard metrics have direct counterparts that reverse the bounded/unbounded property, so the same capability can yield opposite policy conclusions depending on the chosen measure.
Policy debates over AI access and concentration must first settle which metric best reflects the valued outcome in each domain.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

For domains such as software engineering or synthetic biology, the paper implies that the dominant metric will determine whether large-scale compute access remains a decisive advantage.
The classification could be tested by tracking whether specific bounded metrics approach saturation while their unbounded counterparts keep improving with scale.
Extending the same functional-form analysis to inference-only scaling or to multi-model ensembles might reveal additional conditions on when meek models gain ground.
The result suggests that metric selection itself functions as a lever in debates about AI governance and resource allocation.

Load-bearing premise

The functional forms used to classify metrics as bounded or unbounded correctly capture how those metrics behave under real scaling of training and inference compute.

What would settle it

A direct measurement showing that a metric classified as bounded continues to exhibit a widening performance gap between frontier and small-budget models as training compute increases by another order of magnitude.

Figures

Figures reproduced from arXiv: 2607.00913 by Alex Fogelson, Hans Gundlach, Jayson Lynch, Neil Thompson, Zachary A. Brown.

**Figure 1.** Figure 1: Metric classification decision tree. Any normal performance metric is meek (gap converges) or mighty (gap persists) and can be classified with two questions. Bounded metrics are always meek (Theorem 2.4); for unbounded metrics meekness holds iff the metric grows slower than log log C, subject to the weak growth conditions given in Theorem 2.5. Together, these provide a tight criterion on the space of norm… view at source ↗

**Figure 2.** Figure 2: Gap between frontier model performance improvements and the imputed meek model performance (following the same trend more slowly, coinciding with the frontier at GPT-4) on two related metrics: the METR time horizon (Kwa et al., 2026) (Top) and the probability of task success at 4-minute, 1-hour, and 4- hour tasks (Bottom). Both metrics are generated from the same underlying trend, but some lead to a meek o… view at source ↗

read the original abstract

As exponential compute scaling continues, will the capabilities of frontier AI models outstrip what is accessible to developers on a small fixed budget? Or will capabilities converge, with "meek models inheriting the earth"? Building on Gundlach et al. (2025b), we show that the answer depends on how we value and measure AI capabilities. We discuss conventional performance measures and show that, while validation loss shows a shrinking gap, on other metrics frontier models grow their lead forever. Classifying performance metrics by their functional forms in relation to training (and inference) compute, we provide tight mathematical conditions for determining which metrics favor meek models, and show that bounded performance metrics always do. But careful interpretation of performance metrics is essential: we show that many common bounded metrics have closely-related counterpart metrics that are unbounded (and vice versa). Determining the apt metric in a domain is a prerequisite for policy, since bounded and unbounded metrics may suggest opposing policy responses. If a particular capability -- like software engineering, synthetic biology, or rhetorical persuasiveness -- is unbounded when measured in the terms we care about, frontier-level capability will likely be concentrated in the hands of a few wealthy actors. Conversely, if that capability is instead bounded, frontier-level capabilities proliferate through meek models into the hands of the many.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a functional-form split between metrics that favor meek models and those that don't, but the split assumes idealized scaling that real metrics may not follow.

read the letter

The main new piece is the claim that bounded metrics always favor meek models once you classify them by how they depend on training and inference compute, plus the observation that many standard bounded measures have unbounded cousins. That distinction is worth making because it directly affects whether you expect capability concentration or diffusion in areas like coding or biology.

The paper does a clean job laying out why validation loss behaves differently from accuracy or pass@k, and it flags that policy conclusions flip depending on which measure you pick. The taxonomy is a straightforward extension of the cited prior work rather than a restatement.

The soft spot is the weakest assumption the stress-test flags: the classification treats bounded metrics as exactly saturating functions of compute, but many practical bounded scores are post-processing steps on top of unbounded quantities like logits or raw counts. If those transformations change the limiting behavior, the "always" result does not hold for the metrics people actually use. The abstract states tight conditions without showing the derivations, so it is not possible to check whether the conditions are exhaustive or whether the functional forms match observed scaling. That gap is real but not fatal; it is the sort of thing a referee can press on.

This is for readers who already follow scaling-law arguments and want to think through measurement choices in governance contexts. It is not a broad CS contribution, but the distinction is sharp enough that a serious editor should send it out for review rather than desk-reject. The math needs verification and the applicability to concrete metrics needs examples, but the core question is worth referee time.

Referee Report

2 major / 0 minor

Summary. The paper claims that the divergence between frontier and meek AI models under compute scaling depends on the performance metric used. Classifying metrics by their functional forms with respect to training and inference compute, it derives tight mathematical conditions determining which metrics favor meek models and concludes that all bounded metrics do so. It further notes that many common bounded metrics have closely related unbounded counterparts (and vice versa), with implications for policy on whether advanced capabilities concentrate among few actors or proliferate widely.

Significance. If the mathematical classification and derivations hold, the work offers a principled way to evaluate how metric choice affects conclusions about AI capability distribution, which could inform policy debates on concentration vs. democratization of frontier capabilities. The observation that bounded and unbounded counterparts can suggest opposing policy responses is a useful cautionary point, though its impact hinges on whether the functional-form assumptions align with empirical metric behavior.

major comments (2)

[Abstract] Abstract and classification section: The central claim of 'tight mathematical conditions' and that 'bounded performance metrics always do' favor meek models is asserted without explicit functional forms, derivations, or an exhaustive verification that the classification covers all relevant cases. This leaves the 'always' conclusion uncheckable from the provided text.
[Abstract] Abstract paragraph on classification by functional forms: The argument assumes that the chosen functional forms (e.g., saturating for bounded vs. power-law or logarithmic for unbounded) correctly describe how metrics such as accuracy, F1, or pass@k behave as compute scales, but these metrics are frequently post-hoc transformations of unbounded quantities (logits, losses, or raw counts). This could change the limiting behavior and undermine the claim that bounded metrics always favor meek models.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on the clarity of our mathematical claims. We address each major comment below, providing references to the full manuscript derivations and making targeted revisions for improved checkability and discussion of transformations.

read point-by-point responses

Referee: [Abstract] Abstract and classification section: The central claim of 'tight mathematical conditions' and that 'bounded performance metrics always do' favor meek models is asserted without explicit functional forms, derivations, or an exhaustive verification that the classification covers all relevant cases. This leaves the 'always' conclusion uncheckable from the provided text.

Authors: The full manuscript (Section 3 and Appendix A) explicitly defines the functional forms: bounded metrics as saturating functions f(C) o L (finite limit) as compute C o o o, and unbounded as power-law or logarithmic f(C) = a C^b or log(C). We derive the tight conditions showing that for any bounded f, the performance gap between frontier (high C) and meek (fixed C) models converges to zero, while unbounded metrics allow indefinite divergence. The classification covers all standard metrics by their asymptotic behavior. To enhance checkability, we have added an explicit summary table of conditions and proofs to the main text. revision: yes
Referee: [Abstract] Abstract paragraph on classification by functional forms: The argument assumes that the chosen functional forms (e.g., saturating for bounded vs. power-law or logarithmic for unbounded) correctly describe how metrics such as accuracy, F1, or pass@k behave as compute scales, but these metrics are frequently post-hoc transformations of unbounded quantities (logits, losses, or raw counts). This could change the limiting behavior and undermine the claim that bounded metrics always favor meek models.

Authors: We agree many bounded metrics (e.g., accuracy from logits) are post-hoc transformations of unbounded base quantities. Our classification, however, is applied to the final evaluated metric, whose boundedness governs the scaling behavior regardless of origin. For example, loss may be unbounded while accuracy saturates, leading to convergence. We have revised the abstract, Section 2, and added a new subsection clarifying this distinction and confirming that the final metric's form determines the meek-model convergence result. revision: partial

Circularity Check

0 steps flagged

Minor self-citation present but central mathematical classification of metrics by functional form is independent

full rationale

The paper explicitly builds on Gundlach et al. (2025b) for its starting point, and one author overlaps, but the abstract and description present the classification of metrics (bounded vs. unbounded via functional dependence on compute) and the derivation of tight conditions as the paper's own mathematical contribution. No equations or steps are shown reducing the 'bounded metrics always favor meek models' result to a fit, a self-referential definition, or a chain that collapses to the citation alone. The functional-form classification is offered as an analytical tool rather than a post-hoc fit to data, so the derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so the ledger is populated from the abstract's stated claims; no explicit free parameters, axioms, or invented entities are named.

pith-pipeline@v0.9.1-grok · 5766 in / 1057 out tokens · 18386 ms · 2026-07-02T12:27:38.863607+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

73 extracted references · 31 canonical work pages · 7 internal anchors

[1]

2025 , month = dec, day =

Ord, Toby , title =. 2025 , month = dec, day =

2025
[2]

arXiv preprint arXiv:2104.03113 , year=

Scaling Scaling Laws with Board Games , author=. arXiv preprint arXiv:2104.03113 , year=

work page arXiv
[3]

Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

Large Language Monkeys: Scaling Inference Compute with Repeated Sampling , author=. arXiv preprint arXiv:2407.21787 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[4]

arXiv preprint arXiv:2507.00004 , year=

A Theory of Inference Compute Scaling: Reasoning through Directed Stochastic Skill Search , author=. arXiv preprint arXiv:2507.00004 , year=

work page arXiv
[5]

arXiv preprint arXiv:2410.04343 , year=

Inference Scaling for Long-Context Retrieval Augmented Generation , author=. arXiv preprint arXiv:2410.04343 , year=

work page arXiv
[6]

arXiv preprint arXiv:2502.17578 , year=

How do large language monkeys get their power (laws)? , author=. arXiv preprint arXiv:2502.17578 , year=

work page arXiv
[7]

arXiv:2507.07931 , year =

Hans Gundlach and Jayson Lynch and Neil Thompson , title =. arXiv:2507.07931 , year =

work page arXiv
[8]

Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets , author=. arXiv preprint arXiv:2201.02177 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[9]

and Barnes, Elizabeth and Chan, Lawrence , journal=

Kwa, Thomas and West, Ben and Becker, Joel and Deng, Amy and Garcia, Katharyn and Hasin, Max and Jawhar, Sami and Kinniment, Megan and Rush, Nate and Von Arx, Sydney and Bloom, Ryan and Broadley, Thomas and Du, Haoxing and Goodrich, Brian and Jurkovic, Nikola and Miles, Luke Harold and Nix, Seraphina and Lin, Tao and Painter, Chris and Parikh, Neev and Re...
[10]

Forecasting

Whitfill, Parker and Snodin, Ben and Becker, Joel , journal=. Forecasting
[11]

and Lyu, Harry and Li, Wensu and Rosenfeld, Jonathan and Anto, Meiri and Fleming, Martin and Thompson, Neil , journal=

Mertens, Matthias and Kuzee, Adam and Harris, Brittany S. and Lyu, Harry and Li, Wensu and Rosenfeld, Jonathan and Anto, Meiri and Fleming, Martin and Thompson, Neil , journal=. Crashing Waves vs. Rising Tides: Preliminary Findings on. 2026 , doi=

2026
[12]

arXiv preprint arXiv:2210.00849 , year=

Scaling Laws for a Multi-Agent Reinforcement Learning Model , author=. arXiv preprint arXiv:2210.00849 , year=

work page arXiv
[13]

arXiv preprint arXiv:2301.13442 , year=

Scaling Laws for Single-Agent Reinforcement Learning , author=. arXiv preprint arXiv:2301.13442 , year=

work page arXiv
[14]

Dota 2 with Large Scale Deep Reinforcement Learning

Dota 2 with Large Scale Deep Reinforcement Learning , author=. arXiv preprint arXiv:1912.06680 , year=

work page internal anchor Pith review Pith/arXiv arXiv 1912
[15]

arXiv preprint arXiv:2401.04757 , year=

How predictable is language model benchmark performance? , author=. arXiv preprint arXiv:2401.04757 , year=

work page arXiv
[16]

Extrapolating

Finnveden, Lukas , year=. Extrapolating
[17]

Advances in Neural Information Processing Systems , year=

Observational Scaling Laws and the Predictability of Language Model Performance , author=. Advances in Neural Information Processing Systems , year=
[18]

Ho, Anson and Denain, Jean-Stanislas and Atanasov, David and Albanie, Samuel and Shah, Rohin , journal=. A. 2025 , doi=

2025
[19]

arXiv preprint arXiv:2304.15004 , year=

Are Emergent Abilities of Large Language Models a Mirage? , author=. arXiv preprint arXiv:2304.15004 , year=. doi:10.48550/arXiv.2304.15004 , url=

work page doi:10.48550/arxiv.2304.15004
[20]

Is the gap between open and closed models growing?

H. Is the gap between open and closed models growing?. 2025 , month = aug, day =

2025
[21]

2025 , month = oct, day =

Luke Emberson , title =. 2025 , month = oct, day =

2025
[22]

2025 , publisher =

Chapter 2: Technical Performance , booktitle =. 2025 , publisher =

2025
[23]

1999 , school=

Reversibility for efficient computing , author=. 1999 , school=

1999
[24]

Physical Review D , volume=

Entropy content and information flow in systems with limited energy , author=. Physical Review D , volume=. 1984 , publisher=

1984
[25]

Earley, Hannah , journal=
[26]

Hardware-Enabled Mechanisms for Verifying Responsible

Aidan O'Gara and Gabriel Kulp and Will Hodgkins and James Petrie and Vincent Immler and Aydin Aysu and Kanad Basu and Shivam Bhasin and Stjepan Picek and Ankur Srivastava , year=. Hardware-Enabled Mechanisms for Verifying Responsible. 2505.03742 , archivePrefix=

work page arXiv
[27]

There’s plenty of room at the Top: What will drive computer performance after

Leiserson, Charles E and Thompson, Neil C and Emer, Joel S and Kuszmaul, Bradley C and Lampson, Butler W and Sanchez, Daniel and Schardl, Tao B , journal=. There’s plenty of room at the Top: What will drive computer performance after. 2020 , publisher=

2020
[28]

Communications of the ACM , volume=

The future of microprocessors , author=. Communications of the ACM , volume=. 2011 , publisher=

2011
[29]

Semiconductor Industry Association and others , year=
[30]

arXiv preprint arXiv:2206.14007 , year=

The importance of (exponentially more) computing power , author=. arXiv preprint arXiv:2206.14007 , year=

work page arXiv
[31]

arXiv preprint arXiv:2108.07686 , year=

Scaling laws for deep learning , author=. arXiv preprint arXiv:2108.07686 , year=

work page arXiv
[32]

Explaining neural scaling laws , volume=

Bahri, Yasaman and Dyer, Ethan and Kaplan, Jared and Lee, Jaehoon and Sharma, Utkarsh , year=. Explaining neural scaling laws , volume=. Proceedings of the National Academy of Sciences , publisher=. doi:10.1073/pnas.2311878121 , number=

work page doi:10.1073/pnas.2311878121
[33]

2020 , eprint=

Scaling Laws for Neural Language Models , author=. 2020 , eprint=

2020
[34]

2022 , eprint=

Training Compute-Optimal Large Language Models , author=. 2022 , eprint=

2022
[35]

Compute Trends Across Three Eras of Machine Learning , url=

Sevilla, Jaime and Heim, Lennart and Ho, Anson and Besiroglu, Tamay and Hobbhahn, Marius and Villalobos, Pablo , year=. Compute Trends Across Three Eras of Machine Learning , url=. doi:10.1109/ijcnn55064.2022.9891914 , booktitle=

work page doi:10.1109/ijcnn55064.2022.9891914 2022
[36]

Is there "Secret Sauce'' in Large Language Model Development?

Matthias Mertens and Natalia Fischl-Lanzoni and Neil Thompson , year =. Is there. arXiv preprint arXiv:2602.07238 , url =

work page internal anchor Pith review Pith/arXiv arXiv
[37]

Advances in Neural Information Processing Systems , year=

Training Compute-Optimal Large Language Models , author=. Advances in Neural Information Processing Systems , year=
[38]

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

Srivastava, Aarohi and Rastogi, Abhishek and Rao, Abhishek and Shoeb, Ahmad and Abid, Abubakar and Fisch, Adam and Brown, Adam R. and Santoro, Adam and Gupta, Aditya and Garriga-Alonso, Adri. Beyond the Imitation Game:. arXiv preprint arXiv:2206.04615 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[39]

Measuring Massive Multitask Language Understanding

Measuring Massive Multitask Language Understanding , author =. arXiv preprint arXiv:2009.03300 , year =

work page internal anchor Pith review Pith/arXiv arXiv 2009
[40]

Deng, Jia and Dong, Wei and Socher, Richard and Li, Li-Jia and Li, Kai and Fei-Fei, Li , booktitle =
[41]

Learning Multiple Layers of Features from Tiny Images , author =
[42]

Proceedings of the IEEE , volume =

Gradient-Based Learning Applied to Document Recognition , author =. Proceedings of the IEEE , volume =
[43]

Nature , volume =

Human-level control through deep reinforcement learning , author =. Nature , volume =
[44]

and Naddaf, Yavar and Veness, Joel and Bowling, Michael , journal =

Bellemare, Marc G. and Naddaf, Yavar and Veness, Joel and Bowling, Michael , journal =. The Arcade Learning Environment:
[45]

2026 , month =

Data on. 2026 , month =

2026
[46]

Venkat Somala , year=
[47]

2309.07045 , archivePrefix =

Zhang, Zhexin and Lei, Leqi and Wu, Lindong and Sun, Rui and Huang, Yongkang and Long, Chong and Liu, Xiao and Lei, Xuanyu and Tang, Jie and Huang, Minlie , year =. 2309.07045 , archivePrefix =

work page arXiv
[48]

2024 , url =

Mazeika, Mantas and Phan, Long and Yin, Xuwang and Zou, Andy and Wang, Zifan and Mu, Norman and Sakhaee, Elham and Li, Nathaniel and Basart, Steven and Li, Bo and Forsyth, David and Hendrycks, Dan , booktitle =. 2024 , url =

2024
[49]

EvilGenie: A Reward Hacking Benchmark

Gabor, Jonathan and Lynch, Jayson and Rosenfeld, Jonathan , year =. 2511.21654 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv
[50]

On the Origin of Algorithmic Progress in

Gundlach, Hans and Fogelson, Alex and Lynch, Jayson and Trisovic, Ana and Rosenfeld, Jonathan and Sandhu, Anmol and Thompson, Neil , journal=. On the Origin of Algorithmic Progress in
[51]

On the Origin of Algorithmic Progress in

Gundlach, Hans and Fogelson, Alex and Lynch, Jayson and Trisovic, Ana and Rosenfeld, Jonathan and Sandhu, Anmol and Thompson, Neil , year=. On the Origin of Algorithmic Progress in. doi:10.48550/arXiv.2511.21622 , url=. 2511.21622 , archivePrefix=

work page doi:10.48550/arxiv.2511.21622
[52]

American Economic Review , volume=

Are Ideas Getting Harder to Find? , author=. American Economic Review , volume=. 2020 , doi=

2020
[53]

2024 , url=

How far behind are open models? , author=. 2024 , url=

2024
[54]

2023 , url=

Trading Off Compute in Training and Inference , author=. 2023 , url=

2023
[55]

2026 , month = apr, type =

2026
[56]

2022 , month = jun, howpublished =

Hobbhahn, Marius and Besiroglu, Tamay , title =. 2022 , month = jun, howpublished =

2022
[57]

2024 , eprint=

Algorithmic progress in language models , author=. 2024 , eprint=

2024
[58]

2025 , month = apr, url =

Frontier Capability Assessments:. 2025 , month = apr, url =

2025
[59]

and Lucas, Caleb and Guest, Ella , institution =

Mouton, Christopher A. and Lucas, Caleb and Guest, Ella , institution =. The Operational Risks of. 2024 , number =. doi:10.7249/RRA2977-2 , url =

work page doi:10.7249/rra2977-2 2024
[60]

2024 , month = jan, url =

Building an Early Warning System for. 2024 , month = jan, url =

2024
[61]

The reality of

Peppin, Aidan and Reuel, Anka and Casper, Stephen and Jones, Elliot and Strait, Andrew and Anwar, Usman and Agrawal, Anurag and Kapoor, Sayash and Koyejo, Sanmi and Pellat, Marie and Bommasani, Rishi and Frosst, Nick and Hooker, Sara , year =. The reality of. doi:10.48550/arXiv.2412.01946 , url =. 2412.01946 , archivePrefix =

work page doi:10.48550/arxiv.2412.01946
[62]

2506.02548 , archivePrefix =

International Conference on Learning Representations , year =. 2506.02548 , archivePrefix =

work page arXiv
[63]

2025 , month = sep, url =

Why Do We Take. 2025 , month = sep, url =

2025
[64]

, Title =

Jones, Charles I. , Title =. American Economic Review: Insights , Volume =. 2024 , Month =. doi:10.1257/aeri.20230570 , URL =

work page doi:10.1257/aeri.20230570 2024
[65]

2023 , url=

Trading off compute in training and inference , author=. 2023 , url=

2023
[66]

The Illusion of Diminishing Returns:

Akshit Sinha and Arvindh Arun and Shashwat Goel and Steffen Staab and Jonas Geiping , booktitle=. The Illusion of Diminishing Returns:. 2026 , url=

2026
[67]

2503.17354 , archivePrefix =

Rein, David and Becker, Joel and Deng, Amy and Nix, Seraphina and Canal, Chris and O'Connell, Daniel and Arnott, Pip and Bloom, Ryan and Broadley, Thomas and Garcia, Katharyn and Goodrich, Brian and Hasin, Max and Jawhar, Sami and Kinniment, Megan and Kwa, Thomas and Lajko, Aron and Rush, Nate and Sato, Lucas Jun Koba and Von Arx, Sydney and West, Ben and...

work page arXiv
[68]

2411.15114 , archivePrefix =

Wijk, Hjalmar and Lin, Tao and Becker, Joel and Jawhar, Sami and Parikh, Neev and Broadley, Thomas and Chan, Lawrence and Chen, Michael and Clymer, Josh and Dhyani, Jai and Ericheva, Elena and Garcia, Katharyn and Goodrich, Brian and Jurkovic, Nikola and Kinniment, Megan and Lajko, Aron and Nix, Seraphina and Sato, Lucas and Saunders, William and Taran, M...

work page arXiv
[69]

Global Policy , volume =

Bostrom, Nick , title =. Global Policy , volume =. doi:https://doi.org/10.1111/1758-5899.12718 , url =. https://onlinelibrary.wiley.com/doi/pdf/10.1111/1758-5899.12718 , abstract =

work page doi:10.1111/1758-5899.12718
[70]

Lewis Hammond and Alan Chan and Jesse Clifton and Jason Hoelscher-Obermaier and Akbir Khan and Euan McLean and Chandler Smith and Wolfram Barfuss and Jakob Foerster and Tomáš Gavenčiak and The Anh Han and Edward Hughes and Vojtěch Kovařík and Jan Kulveit and Joel Z. Leibo and Caspar Oesterheld and Christian Schroeder de Witt and Nisarg Shah and Michael We...

work page arXiv
[71]

Hyperscaler capex has quadrupled since

Isabel Juniewicz , year =. Hyperscaler capex has quadrupled since
[72]

How Much Progress Has There Been in

Del Sozzo, Emanuele and Fleming, Martin and Flamm, Kenneth and Thompson, Neil , journal=. How Much Progress Has There Been in
[73]

and Lynch, Jayson and Thompson, Neil , title =

Gundlach, Hans and Brown, Zachary A. and Lynch, Jayson and Thompson, Neil , title =. 2026 , month = jun, note =

2026

[1] [1]

2025 , month = dec, day =

Ord, Toby , title =. 2025 , month = dec, day =

2025

[2] [2]

arXiv preprint arXiv:2104.03113 , year=

Scaling Scaling Laws with Board Games , author=. arXiv preprint arXiv:2104.03113 , year=

work page arXiv

[3] [3]

Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

Large Language Monkeys: Scaling Inference Compute with Repeated Sampling , author=. arXiv preprint arXiv:2407.21787 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

arXiv preprint arXiv:2507.00004 , year=

A Theory of Inference Compute Scaling: Reasoning through Directed Stochastic Skill Search , author=. arXiv preprint arXiv:2507.00004 , year=

work page arXiv

[5] [5]

arXiv preprint arXiv:2410.04343 , year=

Inference Scaling for Long-Context Retrieval Augmented Generation , author=. arXiv preprint arXiv:2410.04343 , year=

work page arXiv

[6] [6]

arXiv preprint arXiv:2502.17578 , year=

How do large language monkeys get their power (laws)? , author=. arXiv preprint arXiv:2502.17578 , year=

work page arXiv

[7] [7]

arXiv:2507.07931 , year =

Hans Gundlach and Jayson Lynch and Neil Thompson , title =. arXiv:2507.07931 , year =

work page arXiv

[8] [8]

Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets , author=. arXiv preprint arXiv:2201.02177 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

and Barnes, Elizabeth and Chan, Lawrence , journal=

Kwa, Thomas and West, Ben and Becker, Joel and Deng, Amy and Garcia, Katharyn and Hasin, Max and Jawhar, Sami and Kinniment, Megan and Rush, Nate and Von Arx, Sydney and Bloom, Ryan and Broadley, Thomas and Du, Haoxing and Goodrich, Brian and Jurkovic, Nikola and Miles, Luke Harold and Nix, Seraphina and Lin, Tao and Painter, Chris and Parikh, Neev and Re...

[10] [10]

Forecasting

Whitfill, Parker and Snodin, Ben and Becker, Joel , journal=. Forecasting

[11] [11]

and Lyu, Harry and Li, Wensu and Rosenfeld, Jonathan and Anto, Meiri and Fleming, Martin and Thompson, Neil , journal=

Mertens, Matthias and Kuzee, Adam and Harris, Brittany S. and Lyu, Harry and Li, Wensu and Rosenfeld, Jonathan and Anto, Meiri and Fleming, Martin and Thompson, Neil , journal=. Crashing Waves vs. Rising Tides: Preliminary Findings on. 2026 , doi=

2026

[12] [12]

arXiv preprint arXiv:2210.00849 , year=

Scaling Laws for a Multi-Agent Reinforcement Learning Model , author=. arXiv preprint arXiv:2210.00849 , year=

work page arXiv

[13] [13]

arXiv preprint arXiv:2301.13442 , year=

Scaling Laws for Single-Agent Reinforcement Learning , author=. arXiv preprint arXiv:2301.13442 , year=

work page arXiv

[14] [14]

Dota 2 with Large Scale Deep Reinforcement Learning

Dota 2 with Large Scale Deep Reinforcement Learning , author=. arXiv preprint arXiv:1912.06680 , year=

work page internal anchor Pith review Pith/arXiv arXiv 1912

[15] [15]

arXiv preprint arXiv:2401.04757 , year=

How predictable is language model benchmark performance? , author=. arXiv preprint arXiv:2401.04757 , year=

work page arXiv

[16] [16]

Extrapolating

Finnveden, Lukas , year=. Extrapolating

[17] [17]

Advances in Neural Information Processing Systems , year=

Observational Scaling Laws and the Predictability of Language Model Performance , author=. Advances in Neural Information Processing Systems , year=

[18] [18]

Ho, Anson and Denain, Jean-Stanislas and Atanasov, David and Albanie, Samuel and Shah, Rohin , journal=. A. 2025 , doi=

2025

[19] [19]

arXiv preprint arXiv:2304.15004 , year=

Are Emergent Abilities of Large Language Models a Mirage? , author=. arXiv preprint arXiv:2304.15004 , year=. doi:10.48550/arXiv.2304.15004 , url=

work page doi:10.48550/arxiv.2304.15004

[20] [20]

Is the gap between open and closed models growing?

H. Is the gap between open and closed models growing?. 2025 , month = aug, day =

2025

[21] [21]

2025 , month = oct, day =

Luke Emberson , title =. 2025 , month = oct, day =

2025

[22] [22]

2025 , publisher =

Chapter 2: Technical Performance , booktitle =. 2025 , publisher =

2025

[23] [23]

1999 , school=

Reversibility for efficient computing , author=. 1999 , school=

1999

[24] [24]

Physical Review D , volume=

Entropy content and information flow in systems with limited energy , author=. Physical Review D , volume=. 1984 , publisher=

1984

[25] [25]

Earley, Hannah , journal=

[26] [26]

Hardware-Enabled Mechanisms for Verifying Responsible

Aidan O'Gara and Gabriel Kulp and Will Hodgkins and James Petrie and Vincent Immler and Aydin Aysu and Kanad Basu and Shivam Bhasin and Stjepan Picek and Ankur Srivastava , year=. Hardware-Enabled Mechanisms for Verifying Responsible. 2505.03742 , archivePrefix=

work page arXiv

[27] [27]

There’s plenty of room at the Top: What will drive computer performance after

Leiserson, Charles E and Thompson, Neil C and Emer, Joel S and Kuszmaul, Bradley C and Lampson, Butler W and Sanchez, Daniel and Schardl, Tao B , journal=. There’s plenty of room at the Top: What will drive computer performance after. 2020 , publisher=

2020

[28] [28]

Communications of the ACM , volume=

The future of microprocessors , author=. Communications of the ACM , volume=. 2011 , publisher=

2011

[29] [29]

Semiconductor Industry Association and others , year=

[30] [30]

arXiv preprint arXiv:2206.14007 , year=

The importance of (exponentially more) computing power , author=. arXiv preprint arXiv:2206.14007 , year=

work page arXiv

[31] [31]

arXiv preprint arXiv:2108.07686 , year=

Scaling laws for deep learning , author=. arXiv preprint arXiv:2108.07686 , year=

work page arXiv

[32] [32]

Explaining neural scaling laws , volume=

Bahri, Yasaman and Dyer, Ethan and Kaplan, Jared and Lee, Jaehoon and Sharma, Utkarsh , year=. Explaining neural scaling laws , volume=. Proceedings of the National Academy of Sciences , publisher=. doi:10.1073/pnas.2311878121 , number=

work page doi:10.1073/pnas.2311878121

[33] [33]

2020 , eprint=

Scaling Laws for Neural Language Models , author=. 2020 , eprint=

2020

[34] [34]

2022 , eprint=

Training Compute-Optimal Large Language Models , author=. 2022 , eprint=

2022

[35] [35]

Compute Trends Across Three Eras of Machine Learning , url=

Sevilla, Jaime and Heim, Lennart and Ho, Anson and Besiroglu, Tamay and Hobbhahn, Marius and Villalobos, Pablo , year=. Compute Trends Across Three Eras of Machine Learning , url=. doi:10.1109/ijcnn55064.2022.9891914 , booktitle=

work page doi:10.1109/ijcnn55064.2022.9891914 2022

[36] [36]

Is there "Secret Sauce'' in Large Language Model Development?

Matthias Mertens and Natalia Fischl-Lanzoni and Neil Thompson , year =. Is there. arXiv preprint arXiv:2602.07238 , url =

work page internal anchor Pith review Pith/arXiv arXiv

[37] [37]

Advances in Neural Information Processing Systems , year=

Training Compute-Optimal Large Language Models , author=. Advances in Neural Information Processing Systems , year=

[38] [38]

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

Srivastava, Aarohi and Rastogi, Abhishek and Rao, Abhishek and Shoeb, Ahmad and Abid, Abubakar and Fisch, Adam and Brown, Adam R. and Santoro, Adam and Gupta, Aditya and Garriga-Alonso, Adri. Beyond the Imitation Game:. arXiv preprint arXiv:2206.04615 , year =

work page internal anchor Pith review Pith/arXiv arXiv

[39] [39]

Measuring Massive Multitask Language Understanding

Measuring Massive Multitask Language Understanding , author =. arXiv preprint arXiv:2009.03300 , year =

work page internal anchor Pith review Pith/arXiv arXiv 2009

[40] [40]

Deng, Jia and Dong, Wei and Socher, Richard and Li, Li-Jia and Li, Kai and Fei-Fei, Li , booktitle =

[41] [41]

Learning Multiple Layers of Features from Tiny Images , author =

[42] [42]

Proceedings of the IEEE , volume =

Gradient-Based Learning Applied to Document Recognition , author =. Proceedings of the IEEE , volume =

[43] [43]

Nature , volume =

Human-level control through deep reinforcement learning , author =. Nature , volume =

[44] [44]

and Naddaf, Yavar and Veness, Joel and Bowling, Michael , journal =

Bellemare, Marc G. and Naddaf, Yavar and Veness, Joel and Bowling, Michael , journal =. The Arcade Learning Environment:

[45] [45]

2026 , month =

Data on. 2026 , month =

2026

[46] [46]

Venkat Somala , year=

[47] [47]

2309.07045 , archivePrefix =

Zhang, Zhexin and Lei, Leqi and Wu, Lindong and Sun, Rui and Huang, Yongkang and Long, Chong and Liu, Xiao and Lei, Xuanyu and Tang, Jie and Huang, Minlie , year =. 2309.07045 , archivePrefix =

work page arXiv

[48] [48]

2024 , url =

Mazeika, Mantas and Phan, Long and Yin, Xuwang and Zou, Andy and Wang, Zifan and Mu, Norman and Sakhaee, Elham and Li, Nathaniel and Basart, Steven and Li, Bo and Forsyth, David and Hendrycks, Dan , booktitle =. 2024 , url =

2024

[49] [49]

EvilGenie: A Reward Hacking Benchmark

Gabor, Jonathan and Lynch, Jayson and Rosenfeld, Jonathan , year =. 2511.21654 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv

[50] [50]

On the Origin of Algorithmic Progress in

Gundlach, Hans and Fogelson, Alex and Lynch, Jayson and Trisovic, Ana and Rosenfeld, Jonathan and Sandhu, Anmol and Thompson, Neil , journal=. On the Origin of Algorithmic Progress in

[51] [51]

On the Origin of Algorithmic Progress in

Gundlach, Hans and Fogelson, Alex and Lynch, Jayson and Trisovic, Ana and Rosenfeld, Jonathan and Sandhu, Anmol and Thompson, Neil , year=. On the Origin of Algorithmic Progress in. doi:10.48550/arXiv.2511.21622 , url=. 2511.21622 , archivePrefix=

work page doi:10.48550/arxiv.2511.21622

[52] [52]

American Economic Review , volume=

Are Ideas Getting Harder to Find? , author=. American Economic Review , volume=. 2020 , doi=

2020

[53] [53]

2024 , url=

How far behind are open models? , author=. 2024 , url=

2024

[54] [54]

2023 , url=

Trading Off Compute in Training and Inference , author=. 2023 , url=

2023

[55] [55]

2026 , month = apr, type =

2026

[56] [56]

2022 , month = jun, howpublished =

Hobbhahn, Marius and Besiroglu, Tamay , title =. 2022 , month = jun, howpublished =

2022

[57] [57]

2024 , eprint=

Algorithmic progress in language models , author=. 2024 , eprint=

2024

[58] [58]

2025 , month = apr, url =

Frontier Capability Assessments:. 2025 , month = apr, url =

2025

[59] [59]

and Lucas, Caleb and Guest, Ella , institution =

Mouton, Christopher A. and Lucas, Caleb and Guest, Ella , institution =. The Operational Risks of. 2024 , number =. doi:10.7249/RRA2977-2 , url =

work page doi:10.7249/rra2977-2 2024

[60] [60]

2024 , month = jan, url =

Building an Early Warning System for. 2024 , month = jan, url =

2024

[61] [61]

The reality of

Peppin, Aidan and Reuel, Anka and Casper, Stephen and Jones, Elliot and Strait, Andrew and Anwar, Usman and Agrawal, Anurag and Kapoor, Sayash and Koyejo, Sanmi and Pellat, Marie and Bommasani, Rishi and Frosst, Nick and Hooker, Sara , year =. The reality of. doi:10.48550/arXiv.2412.01946 , url =. 2412.01946 , archivePrefix =

work page doi:10.48550/arxiv.2412.01946

[62] [62]

2506.02548 , archivePrefix =

International Conference on Learning Representations , year =. 2506.02548 , archivePrefix =

work page arXiv

[63] [63]

2025 , month = sep, url =

Why Do We Take. 2025 , month = sep, url =

2025

[64] [64]

, Title =

Jones, Charles I. , Title =. American Economic Review: Insights , Volume =. 2024 , Month =. doi:10.1257/aeri.20230570 , URL =

work page doi:10.1257/aeri.20230570 2024

[65] [65]

2023 , url=

Trading off compute in training and inference , author=. 2023 , url=

2023

[66] [66]

The Illusion of Diminishing Returns:

Akshit Sinha and Arvindh Arun and Shashwat Goel and Steffen Staab and Jonas Geiping , booktitle=. The Illusion of Diminishing Returns:. 2026 , url=

2026

[67] [67]

2503.17354 , archivePrefix =

Rein, David and Becker, Joel and Deng, Amy and Nix, Seraphina and Canal, Chris and O'Connell, Daniel and Arnott, Pip and Bloom, Ryan and Broadley, Thomas and Garcia, Katharyn and Goodrich, Brian and Hasin, Max and Jawhar, Sami and Kinniment, Megan and Kwa, Thomas and Lajko, Aron and Rush, Nate and Sato, Lucas Jun Koba and Von Arx, Sydney and West, Ben and...

work page arXiv

[68] [68]

2411.15114 , archivePrefix =

Wijk, Hjalmar and Lin, Tao and Becker, Joel and Jawhar, Sami and Parikh, Neev and Broadley, Thomas and Chan, Lawrence and Chen, Michael and Clymer, Josh and Dhyani, Jai and Ericheva, Elena and Garcia, Katharyn and Goodrich, Brian and Jurkovic, Nikola and Kinniment, Megan and Lajko, Aron and Nix, Seraphina and Sato, Lucas and Saunders, William and Taran, M...

work page arXiv

[69] [69]

Global Policy , volume =

Bostrom, Nick , title =. Global Policy , volume =. doi:https://doi.org/10.1111/1758-5899.12718 , url =. https://onlinelibrary.wiley.com/doi/pdf/10.1111/1758-5899.12718 , abstract =

work page doi:10.1111/1758-5899.12718

[70] [70]

Lewis Hammond and Alan Chan and Jesse Clifton and Jason Hoelscher-Obermaier and Akbir Khan and Euan McLean and Chandler Smith and Wolfram Barfuss and Jakob Foerster and Tomáš Gavenčiak and The Anh Han and Edward Hughes and Vojtěch Kovařík and Jan Kulveit and Joel Z. Leibo and Caspar Oesterheld and Christian Schroeder de Witt and Nisarg Shah and Michael We...

work page arXiv

[71] [71]

Hyperscaler capex has quadrupled since

Isabel Juniewicz , year =. Hyperscaler capex has quadrupled since

[72] [72]

How Much Progress Has There Been in

Del Sozzo, Emanuele and Fleming, Martin and Flamm, Kenneth and Thompson, Neil , journal=. How Much Progress Has There Been in

[73] [73]

and Lynch, Jayson and Thompson, Neil , title =

Gundlach, Hans and Brown, Zachary A. and Lynch, Jayson and Thompson, Neil , title =. 2026 , month = jun, note =

2026