pith. machine review for the scientific record.

arxiv: 2601.14053 · v2 · submitted 2026-01-20 · 💻 cs.LG · cs.AI · cs.CV · cs.MA · eess.IV

Recognition: 2 theorem links

· Lean Theorem

LLMOrbit: A Circular Taxonomy of Large Language Models - From Scaling Walls to Agentic AI Systems

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 12:44 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CV · cs.MA · eess.IV
keywords large language models · scaling wall · agentic AI · LLM taxonomy · test-time compute · post-training · model efficiency · data scarcity

The pith

Large language models face a scaling wall from data scarcity, cost growth and energy demands, but six paradigms are breaking through to agentic systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey organizes more than 50 large language models from 2019 to 2025 into a circular taxonomy, LLMOrbit, built around eight interconnected dimensions. It identifies three crises that together form a scaling wall: projected depletion of the estimated 9-27 trillion tokens of available training data by 2026-2028, training costs rising from $3 million to over $300 million, and a 22-fold increase in energy consumption. The paper describes six paradigms that break through these limits, including test-time compute, where models reach GPT-4-level performance by spending roughly 10 times more inference effort; quantization for compression; and small specialized models that match much larger ones on targeted tasks. It traces the move from passive generation to post-training methods and tool-using agentic systems. A sympathetic reader would care because these pathways determine whether progress can continue at manageable cost and energy scales.

Core claim

The central claim is that brute-force scaling has hit a wall defined by data scarcity, exponential costs and energy use, and that six paradigms—test-time compute, quantization, distributed edge computing, model merging, efficient training and small specialized models—together with post-training gains and efficiency revolutions are enabling continued progress toward reasoning-capable agentic AI systems.

What carries the argument

The LLMOrbit circular taxonomy of eight orbital dimensions that interconnects architectural innovations, training methodologies and efficiency patterns around the scaling wall.

If this is right

  • Post-training techniques such as RLHF and pure RL deliver substantial benchmark gains without additional pretraining data.
  • Efficiency methods like MoE routing and latent attention achieve GPT-4-level performance at under $0.30 per million tokens (see the sketch after this list).
  • Open-source models such as Llama 3 surpass closed models on benchmarks like MMLU, pointing to broader access.
  • Agentic frameworks using tools and multi-agent coordination extend capabilities beyond single-pass generation.
  • Small specialized models match the performance of much larger ones on targeted tasks.
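
A rough sense of where a figure like the abstract's 18x MoE routing efficiency can come from is the ratio of total to routed parameters. Below is a minimal sketch in Python, assuming the widely reported DeepSeek-V3 split of roughly 671B total and 37B active parameters per token; the split is an illustrative assumption, not a number stated in this review.

    # Where an "18x" MoE routing-efficiency figure can come from: the ratio of
    # total to active (routed) parameters. The split below is the widely
    # reported DeepSeek-V3 configuration, used here only as an assumption.
    total_params = 671e9     # all experts combined
    active_params = 37e9     # parameters actually used per token

    print(f"total/active parameter ratio: {total_params / active_params:.1f}x")  # ~18.1x

    # Per-token forward compute scales with active parameters (~2 FLOPs per
    # parameter per token), so the saving versus an equally large dense model
    # tracks the same ratio.
    print(f"dense-equivalent FLOPs per token: {2 * total_params:.2e}")
    print(f"routed FLOPs per token:           {2 * active_params:.2e}")

On this reading the 18x is a parameter-routing ratio rather than a measured end-to-end cost reduction; the quoted price per million tokens also depends on hardware utilization and batch size.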

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The circular taxonomy suggests agentic systems will generate higher-quality data that feeds back into future base models.
  • Verification and alignment overhead may become the next dominant constraint once efficiency gains reduce compute costs.
  • Distributed edge deployments could introduce consistency and coordination problems that limit the 10 times cost reduction in practice.

Load-bearing premise

The six paradigms will keep delivering gains without running into new hard limits on data quality, verification or coordination overhead.

What would settle it

The central claim would be tested by direct measurement: whether actual token consumption reaches the projected 9-27 trillion depletion range by 2026-2028, and whether claimed efficiency gains, such as the 10x cost reduction from edge computing, show up in deployed systems.
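
One cheap way to operationalize the first test is to extrapolate the dataset sizes the survey itself quotes and check when the trend crosses the 9-27 trillion token range. The sketch below, in Python, uses only the three public data points cited in Figure 5 (GPT-3, Llama 2, Llama 3); it is a naive log-linear extrapolation, not the paper's compute-optimal projection, and the crossing year moves with the points chosen.

    import math

    # Training-set sizes quoted in Figure 5 (tokens); years are approximate.
    history = {2020: 300e9, 2023: 2e12, 2024: 15e12}  # GPT-3, Llama 2, Llama 3

    # Ordinary least-squares fit of log10(tokens) against year.
    years = list(history)
    logs = [math.log10(t) for t in history.values()]
    n = len(years)
    ybar, lbar = sum(years) / n, sum(logs) / n
    slope = (sum((y - ybar) * (l - lbar) for y, l in zip(years, logs))
             / sum((y - ybar) ** 2 for y in years))
    intercept = lbar - slope * ybar

    # Year at which the fitted trend reaches each end of the availability range.
    for stock in (9e12, 27e12):
        year_hit = (math.log10(stock) - intercept) / slope
        print(f"trend reaches {stock / 1e12:.0f}T tokens around {year_hit:.1f}")

Whether real consumption follows such a single-run trend, or the compute-optimal variant the paper assumes, is exactly what the 2026-2028 window will reveal.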

Figures

Figures reproduced from arXiv: 2601.14053 by Badri N. Patro, Vijay S. Agneeswaran.

Figure 1
Figure 1. Evolution from LLM Foundation to Agentic AI. Three nested paradigms converge to a unified framework: LLM Foundation (blue) encompasses model evolution, scaling challenges, and architectural innovations; GenAI (purple) adds training methodologies (RLHF, PPO, DPO, GRPO, ORPO) and environments; Agentic AI (light blue) extends capabilities through reasoning (ReAct, CoT/ToT), tool use (RAG), and multi-agent sy… view at source ↗
Figure 2
Figure 2. LLMOrbit: A Circular Taxonomy of Large Language Models (2019-2025). This circular orbital architecture presents eight interconnected dimensions navigating the complete LLM landscape: (1) Scaling Wall Analysis examining data scarcity, training costs, and energy consumption with quantitative projections; (2) Model Taxonomy covering 50+ foundation models including GPT-4, Gemini 1.5, Claude 3.5, Llama 3, Deep… view at source ↗
Figure 3
Figure 3. Training Costs Evolution: Exponential Growth Across Three Eras. Stacked bar chart showing hardware costs (blue: amortized chip depreciation + 23% networking overhead) and energy costs (orange: electricity consumption) for 8 frontier models spanning 2020-2025 [15, 133]. Era annotations mark three periods: Early (2020-2021), Scaling (2022-2023), and Frontier (2024-2025). Key numbers demonstrate 100× cost … view at source ↗
Figure 4
Figure 4. Cloud vs. Amortized Costs: Ownership Economics for Large-Scale Training. Grouped bar chart comparing cloud rental pricing (darker bars) with amortized ownership costs (lighter bars: hardware + energy) for 8 frontier models [15, 133]. Cost multipliers displayed above bar pairs show cloud costs are typically 2-4× higher than ownership due to provider margins (30-50%), maintenance overhead, and infrastructu… view at source ↗
Figure 5
Figure 5. Data Exhaustion Projection: The Impending Data Scarcity Crisis. Scatter plot showing historical dataset sizes of actual models (GPT-3: 300B tokens, Llama 2: 2T tokens, Llama 3: 15T tokens, DeepSeek-V3: 14.8T tokens) with exponential growth trend line (2019-2025) and future projection assuming compute-optimal training [151, 49]. The shaded region (9-27 trillion tokens) indicates the estimated range of t… view at source ↗
Figure 6
Figure 6. Overtraining Scenarios: How Training Policies Affect Data Consumption Timeline. Three projection lines showing future dataset requirements under different scaling policies [151, 49]: (1) Compute-optimal following Chinchilla scaling laws (20 tokens/param), (2) Moderate overtraining (50-100 tokens/param), and (3) Aggressive overtraining following Llama 3 approach (200+ tokens/param). Intersection points ma… view at source ↗
Figure 7
Figure 7. Timeline of major language model releases (2019-2025) organized by model families. The vertical axis … view at source ↗
Figure 8
Figure 8. Performance scaling trends across model families on key benchmarks (MMLU, MATH, HumanEval). Each … view at source ↗
Figure 9
Figure 9. Architectural evolution of OpenAI GPT series. view at source ↗
Figure 10
Figure 10. LLaMA architectural innovations. (a) Pre… view at source ↗
Figure 11
Figure 11. DeepSeek architectural and training innovations. (a) Multi-head Latent Attention (MLA) compresses KV … view at source ↗
Figure 12
Figure 12. Architectural evolution from GPT-2 (2019) to present frontier models. While the core Transformer struc… view at source ↗
read the original abstract

The field of artificial intelligence has undergone a revolution from foundational Transformer architectures to reasoning-capable systems approaching human-level performance. We present LLMOrbit, a comprehensive circular taxonomy navigating the landscape of large language models spanning 2019-2025. This survey examines over 50 models across 15 organizations through eight interconnected orbital dimensions, documenting architectural innovations, training methodologies, and efficiency patterns defining modern LLMs, generative AI, and agentic systems. We identify three critical crises: (1) data scarcity (9-27T tokens depleted by 2026-2028), (2) exponential cost growth ($3M to $300M+ in 5 years), and (3) unsustainable energy consumption (22x increase), establishing the scaling wall limiting brute-force approaches. Our analysis reveals six paradigms breaking this wall: (1) test-time compute (o1, DeepSeek-R1 achieve GPT-4 performance with 10x inference compute), (2) quantization (4-8x compression), (3) distributed edge computing (10x cost reduction), (4) model merging, (5) efficient training (ORPO reduces memory 50%), and (6) small specialized models (Phi-4 14B matches larger models). Three paradigm shifts emerge: (1) post-training gains (RLHF, GRPO, pure RL contribute substantially, DeepSeek-R1 achieving 79.8% MATH), (2) efficiency revolution (MoE routing 18x efficiency, Multi-head Latent Attention 8x KV cache compression enables GPT-4-level performance at <$0.30/M tokens), and (3) democratization (open-source Llama 3 88.6% MMLU surpasses GPT-4 86.4%). We provide insights into techniques (RLHF, PPO, DPO, GRPO, ORPO), trace evolution from passive generation to tool-using agents (ReAct, RAG, multi-agent systems), and analyze post-training innovations.
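
As a sanity check on two of the efficiency numbers quoted above, the memory arithmetic is short enough to write out. The sketch below uses a hypothetical dense transformer configuration, not any specific model from the survey; the 8x KV-cache factor and the 4-bit weight format are taken as given from the abstract.

    # Rough memory arithmetic behind two efficiency claims in the abstract:
    # 4-8x weight compression from quantization and ~8x KV-cache compression
    # from latent attention. The configuration below is hypothetical.
    layers, heads, head_dim = 64, 64, 128   # assumed architecture
    context = 128_000                       # assumed context length (tokens)
    params = 70e9                           # assumed parameter count

    # Weights: fp16 (2 bytes/param) versus 4-bit quantization (0.5 bytes/param).
    fp16_gb = params * 2 / 1e9
    int4_gb = params * 0.5 / 1e9
    print(f"weights: {fp16_gb:.0f} GB fp16 -> {int4_gb:.0f} GB int4 ({fp16_gb / int4_gb:.0f}x)")

    # KV cache: standard attention stores a key and a value vector per head,
    # per layer, per token (fp16, 2 bytes per element).
    kv_bytes_per_token = layers * 2 * heads * head_dim * 2
    kv_gb = kv_bytes_per_token * context / 1e9
    print(f"KV cache at {context:,} tokens: {kv_gb:.0f} GB")
    print(f"with the abstract's ~8x latent compression: {kv_gb / 8:.0f} GB")

Savings of this size are what make long-context serving at the quoted sub-$0.30 per million tokens plausible, though realized prices also depend on hardware and utilization.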

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces LLMOrbit, a circular taxonomy of large language models spanning 2019-2025, surveying over 50 models from 15 organizations across eight interconnected dimensions. It identifies three crises establishing a scaling wall—data scarcity (9-27T tokens depleted by 2026-2028), exponential cost growth ($3M to $300M+ over 5 years), and 22x energy consumption increase—and proposes six paradigms to overcome it: test-time compute, quantization, distributed edge computing, model merging, efficient training, and small specialized models. The paper further discusses three paradigm shifts in post-training (e.g., RLHF, GRPO), efficiency (e.g., MoE, MLA), and democratization (e.g., open-source models surpassing closed ones), while tracing evolution toward agentic systems.

Significance. If the synthesis of cited literature is accurate and complete, LLMOrbit provides a useful organizational framework for mapping the transition from scaling-law-driven models to efficient and agentic AI systems. The survey's breadth across architectures, training methods, and efficiency techniques could help consolidate knowledge in a rapidly evolving field, particularly by highlighting how post-training and inference optimizations address brute-force limits.

major comments (2)
  1. [Abstract] Abstract: The headline quantitative claims—data scarcity of 9-27T tokens depleted by 2026-2028, cost growth from $3M to $300M+, and 22x energy consumption increase—are stated as established facts without citations, error bars, sensitivity analysis, or references to the underlying studies. These assertions are load-bearing for the central 'scaling wall' concept and the motivation for the six paradigms, so explicit sourcing and verification against primary sources are required.
  2. [Abstract] Abstract and paradigm discussion: The specific performance claims, such as o1 and DeepSeek-R1 achieving GPT-4-level results with 10x inference compute or DeepSeek-R1 reaching 79.8% on MATH, are presented without direct baseline comparisons, error margins, or pointers to the original evaluation protocols. Since these examples are used to illustrate the test-time compute and post-training paradigms, they need traceable references to ensure the taxonomy's supporting evidence is reproducible.
minor comments (2)
  1. The circular taxonomy is described as a presentational device spanning eight orbital dimensions, but the manuscript would benefit from an explicit table or diagram legend defining each dimension and how models are positioned within the orbit.
  2. Ensure all model performance numbers (e.g., Llama 3 88.6% MMLU, Phi-4 comparisons) include the exact evaluation benchmarks and dates of the cited results to avoid ambiguity in a fast-moving field.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments, which help improve the clarity and rigor of our survey. We agree that the abstract requires explicit citations for the quantitative claims and traceable references for performance examples. We will incorporate these changes in the revised manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The headline quantitative claims—data scarcity of 9-27T tokens depleted by 2026-2028, cost growth from $3M to $300M+, and 22x energy consumption increase—are stated as established facts without citations, error bars, sensitivity analysis, or references to the underlying studies. These assertions are load-bearing for the central 'scaling wall' concept and the motivation for the six paradigms, so explicit sourcing and verification against primary sources are required.

    Authors: We fully agree with this observation. The claims in the abstract are drawn from established literature on scaling laws and resource constraints, but we omitted inline citations to maintain brevity. In the revision, we will add specific references (e.g., to reports on data availability from Epoch AI, cost analyses from training papers, and energy studies) directly in the abstract or as a supporting note. We will also provide ranges and note the sources of the estimates to allow verification. revision: yes

  2. Referee: [Abstract] Abstract and paradigm discussion: The specific performance claims, such as o1 and DeepSeek-R1 achieving GPT-4-level results with 10x inference compute or DeepSeek-R1 reaching 79.8% on MATH, are presented without direct baseline comparisons, error margins, or pointers to the original evaluation protocols. Since these examples are used to illustrate the test-time compute and post-training paradigms, they need traceable references to ensure the taxonomy's supporting evidence is reproducible.

    Authors: We appreciate this point and will address it by adding citations to the original papers and benchmarks. For the o1 and DeepSeek-R1 examples, we will reference the respective model cards or technical reports, include the exact benchmark scores with comparisons to GPT-4, and specify the evaluation protocols (e.g., the MATH dataset version and prompting methods). This will ensure reproducibility and strengthen the evidence for the paradigms discussed. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

This is a survey paper that synthesizes existing literature on LLM scaling limits, data scarcity projections, cost growth, energy consumption, and efficiency paradigms. All three crises and six breaking paradigms are asserted via citations to prior external work rather than internal equations or derivations. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citation chains appear. The 'circular taxonomy' is explicitly a presentational device for organizing models across dimensions and does not reduce any claimed performance metric or prediction to a quantity defined inside the paper itself. The central claims therefore inherit strength from referenced sources and remain independently falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The taxonomy rests on the authors' choice to group literature into eight orbital dimensions and to treat observed trends in data, cost, and energy as hard limits without new measurements.

axioms (1)
  • domain assumption: Existing LLM literature can be exhaustively organized into eight interconnected orbital dimensions without significant omissions.
    This choice defines the structure of the entire survey.
invented entities (2)
  • scaling wall · no independent evidence
    purpose: Conceptual limit on brute-force scaling caused by the three crises
    New framing introduced to unify the data, cost, and energy observations.
  • LLMOrbit circular taxonomy · no independent evidence
    purpose: New organizational framework for navigating LLM landscape
    Invented structure that does not appear in prior cited work.

pith-pipeline@v0.9.0 · 5692 in / 1541 out tokens · 50494 ms · 2026-05-16T12:44:51.616583+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

173 extracted references · 173 canonical work pages · 81 internal anchors

  1. [1]

    Phi-4 Technical Report

    Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, et al. Phi-4 technical report. arXiv preprint arXiv:2412.08905, 2024

  2. [2]

    Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

    Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219, 2024

  3. [3]

    Armen Aghajanyan, Sonal Gupta, and Luke Zettlemoyer. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, pages 7319–7328, 2021. Foundation for understanding low-rank adaptation methods

  4. [4]

    Olmo 2: Post-norm architecture and training stability. arXiv preprint, 2025

    AI2. Olmo 2: Post-norm architecture and training stability. arXiv preprint, 2025

  5. [5]

    Olmo 3 think: Open reasoning model with full transparency. arXiv preprint, 2025

    AI2. Olmo 3 think: Open reasoning model with full transparency. arXiv preprint, 2025

  6. [6]

    GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

    Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023

  7. [7]

    Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022

  8. [8]

    Qwen 3: Advancing open-source language models. arXiv preprint, 2025

    Alibaba Cloud. Qwen 3: Advancing open-source language models. arXiv preprint, 2025

  9. [9]

    Qwen3-next: Hybrid architecture with gated deltanet. arXiv preprint, 2025

    Alibaba Cloud. Qwen3-next: Hybrid architecture with gated deltanet. arXiv preprint, 2025

  10. [10]

    Model context protocol (MCP): Standardizing AI-tool communication. https://modelcontextprotocol.io, 2024

    Anthropic. Model context protocol (MCP): Standardizing AI-tool communication. https://modelcontextprotocol.io, 2024. Protocol for standardized communication between AI models and external tools/data sources using JSON-RPC

  11. [11]

    Neural machine translation by jointly learning to align and translate

    Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations, 2015

  12. [12]

    Constitutional AI: Harmlessness from AI Feedback

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022

  13. [13]

    Longformer: The Long-Document Transformer

    Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020

  14. [14]

    Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

    Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. In arXiv preprint arXiv:1308.3432, 2013. Straight-through estimator for gradient approximation

  15. [15]

    The rising costs of training frontier ai models. arXiv preprint arXiv:2405.21015, 2024

    Tamay Besiroglu, Jaime Sevilla, Lennart Heim, Marius Hobbhahn, Pablo Villalobos, and David Owen. The rising costs of training frontier ai models. arXiv preprint arXiv:2405.21015, 2024

  16. [16]

    Graph of thoughts: Solving elaborate problems with large language models. arXiv preprint arXiv:2308.09687, 2023

    Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Michal Podstawski, Hubert Niewiadomski, Piotr Nyczyk, et al. Graph of thoughts: Solving elaborate problems with large language models. arXiv preprint arXiv:2308.09687, 2023. Extends tree-of-thoughts to DAG structure enabling parallel exploration an...

  17. [17]

    Rank analysis of incomplete block designs: I

    Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345, 1952. Bradley-Terry model for pairwise preference modeling

  18. [18]

    Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020

  19. [19]

    Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

    Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, and Tri Dao. Medusa: Simple llm inference acceleration framework with multiple decoding heads. arXiv preprint arXiv:2401.10774, 2024

  20. [20]

    Langchain: Building applications with llms through composability. GitHub repository, 2023

    Harrison Chase. Langchain: Building applications with llms through composability. GitHub repository, 2023

  21. [21]

    Accelerating Large Language Model Decoding with Speculative Sampling

    Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318, 2023

  22. [22]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

  23. [23]

    AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors

    Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chen Qian, Chi-Min Chan, Yujia Qin, Yaxi Lu, Ruobing Xie, et al. AgentVerse: Facilitating multi-agent collaboration and exploring emergent behaviors. arXiv preprint arXiv:2308.10848, 2023. Dynamic team assembly with blackboard architecture for variable expertise requirements

  24. [24]

    Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

    Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E Gonzalez, et al. Chatbot arena: An open platform for evaluating llms by human preference. arXiv preprint arXiv:2403.04132, 2024.

  25. [25]

    Generating Long Sequences with Sparse Transformers

    Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. In arXiv preprint arXiv:1904.10509,

  26. [26]

    Sparse attention patterns for efficient long-sequence modeling

  27. [27]

    Supervising strong learners by amplifying weak experts

    Paul Christiano, Buck Shlegeris, and Dario Amodei. Supervising strong learners by amplifying weak experts. arXiv preprint arXiv:1810.08575, 2018. Scalable oversight through iterated amplification and distillation

  28. [28]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  29. [29]

    Introducing devin: The first ai software engineer. https://www.cognition-labs.com/introducing-devin, 2024

    Cognition Labs. Introducing devin: The first ai software engineer. https://www.cognition-labs.com/introducing-devin, 2024. Autonomous AI coding agent with end-to-end software development capabilities

  30. [30]

    Blackboard systems. AI Expert, 6(9):40–47, 1991

    Daniel D Corkill. Blackboard systems. AI Expert, 6(9):40–47, 1991. Blackboard architecture: shared memory space for multi-agent coordination and problem-solving

  31. [31]

    The Wall Confronting Large Language Models

    Peter V. Coveney and Sauro Succi. The wall confronting large language models. arXiv preprint arXiv:2507.19703, 2025. Demonstrates that scaling laws severely limit LLMs’ ability to improve prediction uncertainty and reliability

  32. [32]

    FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

    Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023

  33. [33]

    Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022

  34. [34]

    DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

    DeepSeek-AI. Deepseek-coder: When the large language model meets programming–the rise of code intelligence. arXiv preprint arXiv:2401.14196, 2024

  35. [35]

    DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

    DeepSeek-AI. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954, 2024

  36. [39]

    DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

    DeepSeek-AI. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. arXiv preprint arXiv:2401.06066, 2024

  37. [40]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  38. [41]

    DeepSeek-V3 Technical Report

    DeepSeek-AI. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2025

  39. [42]

    Scaling vision transformers to 22 billion parameters. arXiv preprint arXiv:2302.05442, 2023

    Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. arXiv preprint arXiv:2302.05442, 2023

  40. [43]

    QLoRA: Efficient Finetuning of Quantized LLMs

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. arXiv preprint arXiv:2305.14314, 2023

  41. [44]

    QLoRA: Efficient finetuning of quantized LLMs. Advances in Neural Information Processing Systems, 36, 2024

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient finetuning of quantized LLMs. Advances in Neural Information Processing Systems, 36, 2024. NeurIPS 2023 proceedings published in 2024. 4-bit quantization with backpropagation through frozen quantized weights.

  42. [45]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2019

  43. [46]

    Palm-e: An embodied multimodal language model. International Conference on Machine Learning (ICML), pages 8469–8488,

    Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model. International Conference on Machine Learning (ICML), pages 8469–8488,

  44. [47]

    Embodied multimodal model integrating vision and language for robotics

  45. [48]

    The Llama 3 Herd of Models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  46. [49]

    Alpacafarm: A simulation framework for methods that learn from human feedback. arXiv preprint arXiv:2305.14387, 2023

    Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Alpacafarm: A simulation framework for methods that learn from human feedback. arXiv preprint arXiv:2305.14387, 2023

  47. [50]

    Elicit: The ai research assistant. https://elicit.org, 2024

    Elicit. Elicit: The ai research assistant. https://elicit.org, 2024. AI assistant for literature review and research synthesis

  48. [51]

    Can ai scaling continue through 2030? Epoch AI Research, 2024

    Epoch AI. Can ai scaling continue through 2030? Epoch AI Research, 2024

  49. [52]

    FIPA ACL message structure specification

    FIPA. FIPA ACL message structure specification. Foundation for Intelligent Physical Agents, 2002. FIPA Agent Communication Language: standardized agent message protocols with performatives

  50. [53]

    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

    Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2023

  51. [54]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

  52. [55]

    Arcee’s MergeKit: A toolkit for merging large language models. arXiv preprint arXiv:2403.13257, 2024

    Charles Goddard, Shamane Siriwardhana, Malikeh Ehghaghi, et al. Arcee’s MergeKit: A toolkit for merging large language models. arXiv preprint arXiv:2403.13257, 2024

  53. [56]

    Gemma 2: Improving open language models at a practical size. arXiv preprint, 2024

    Google DeepMind. Gemma 2: Improving open language models at a practical size. arXiv preprint, 2024

  54. [57]

    Gemma 3: Aggressive sliding window attention with 5:1 ratio. arXiv preprint, 2025

    Google DeepMind. Gemma 3: Aggressive sliding window attention with 5:1 ratio. arXiv preprint, 2025

  55. [58]

    Olmo: Accelerating the science of language models. arXiv preprint arXiv:2402.00838, 2024

    Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, et al. Olmo: Accelerating the science of language models. arXiv preprint arXiv:2402.00838, 2024

  56. [59]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023

  57. [60]

    Textbooks Are All You Need

    Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, et al. Textbooks are all you need. arXiv preprint arXiv:2306.11644, 2023. Phi-1 model paper

  58. [61]

    World Models

    David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018

  59. [62]

    Mastering Diverse Domains through World Models

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023

  60. [63]

    Understanding masked self-attention as implicit positional encoding. arXiv preprint arXiv:2310.04393, 2023

    Adi Haviv, Jonathan Berant, and Amir Globerson. Understanding masked self-attention as implicit positional encoding. arXiv preprint arXiv:2310.04393, 2023

  61. [64]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2021.

  62. [65]

    Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020

  63. [66]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022

  64. [67]

    ORPO: Monolithic Preference Optimization without Reference Model

    Jiwoo Hong, Noah Lee, and James Thorne. ORPO: Monolithic preference optimization without reference model. arXiv preprint arXiv:2403.07691, 2024

  65. [68]

    MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework

    Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al. Metagpt: Meta programming for multi-agent collaborative framework. arXiv preprint arXiv:2308.00352, 2023

  66. [69]

    On the slow death of scaling. SSRN Electronic Journal, 2025

    Sara Hooker. On the slow death of scaling. SSRN Electronic Journal, 2025. Available at SSRN: https://ssrn.com/abstract=5877662

  67. [70]

    Parameter-efficient transfer learning for nlp

    Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning, pages 2790–2799. PMLR, 2019

  68. [71]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021

  69. [72]

    VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

    Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, and Li Fei-Fei. VoxPoser: Composable 3d value maps for robotic manipulation with language models. arXiv preprint arXiv:2307.05973, 2023. LLM-based framework for robot manipulation via composable 3D affordances

  70. [73]

    AI safety via debate

    Geoffrey Irving, Paul Christiano, and Dario Amodei. AI safety via debate. arXiv preprint arXiv:1805.00899, 2018. Proposes debate-based oversight for AI safety through adversarial interactions

  71. [74]

    Unsupervised Dense Information Retrieval with Contrastive Learning

    Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. Unsupervised dense information retrieval with contrastive learning. arXiv preprint arXiv:2112.09118, 2022

  72. [75]

    Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2704–2713, 2018

  73. [76]

    Phi-2: The surprising power of small language models. Microsoft Research Blog,

    Mojan Javaheripi, Sébastien Bubeck, Marah Abdin, Jyoti Aneja, Sébastien Bubeck, Caio César Teodoro Mendes, Allie Del Giorno, Ronen Eldan, Sivakanth Gopi, Ece Kamar, et al. Phi-2: The surprising power of small language models. Microsoft Research Blog,

  74. [77]

    Available at https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/

  75. [78]

    Mistral 7B

    Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023

  76. [79]

    Mixtral of Experts

    Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024

  77. [80]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770, 2023. Real GitHub issues from popular repositories, state-of-art resolves 13.8% of issues

  78. [81]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

  79. [82]

    Transformers are rnns: Fast autoregressive transformers with linear attention. arXiv preprint arXiv:2006.16236, 2020

    Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. arXiv preprint arXiv:2006.16236, 2020

  80. [83]

    Fast inference from transformers via speculative decoding. arXiv preprint arXiv:2211.17192, 2023

    Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. arXiv preprint arXiv:2211.17192, 2023

Showing first 80 references.