pith. machine review for the scientific record.

arxiv: 2601.14053 · v2 · submitted 2026-01-20 · 💻 cs.LG · cs.AI · cs.CV · cs.MA · eess.IV

Recognition: 2 theorem links

· Lean Theorem

LLMOrbit: A Circular Taxonomy of Large Language Models - From Scaling Walls to Agentic AI Systems

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 12:44 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CV · cs.MA · eess.IV
keywords large language models · scaling wall · agentic AI · LLM taxonomy · test-time compute · post-training · model efficiency · data scarcity

The pith

Large language models face a scaling wall from data scarcity, cost growth and energy demands, but six paradigms are breaking through to agentic systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey organizes more than 50 large language models from 2019 to 2025 into a circular taxonomy, LLMOrbit, built around eight interconnected dimensions. It identifies three crises that together form a scaling wall: projected depletion of the estimated 9-27 trillion tokens of available training data by 2026-2028, training costs rising from $3 million to over $300 million, and a 22-fold increase in energy consumption. The paper describes six paradigms that break through these limits, including test-time compute, where models reach GPT-4-level performance by spending roughly 10 times more inference effort; quantization for compression; and small specialized models that match much larger ones on targeted tasks. It traces the move from passive generation to post-training methods and tool-using agentic systems. A sympathetic reader would care because these pathways determine whether progress can continue at manageable cost and energy scales.

Core claim

The central claim is that brute-force scaling has hit a wall defined by data scarcity, exponential costs and energy use, and that six paradigms—test-time compute, quantization, distributed edge computing, model merging, efficient training and small specialized models—together with post-training gains and efficiency revolutions are enabling continued progress toward reasoning-capable agentic AI systems.

What carries the argument

The LLMOrbit circular taxonomy of eight orbital dimensions that interconnects architectural innovations, training methodologies and efficiency patterns around the scaling wall.

If this is right

  • Post-training techniques such as RLHF and pure RL deliver substantial benchmark gains without additional pretraining data.
  • Efficiency methods like MoE routing and latent attention achieve GPT-4-level performance at under $0.30 per million tokens (see the sketch after this list).
  • Open-source models such as Llama 3 surpass closed models on benchmarks like MMLU, pointing to broader access.
  • Agentic frameworks using tools and multi-agent coordination extend capabilities beyond single-pass generation.
  • Small specialized models match the performance of much larger ones on targeted tasks.
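
A rough sense of where a figure like the abstract's 18x MoE routing efficiency can come from is the ratio of total to routed parameters. Below is a minimal sketch in Python, assuming the widely reported DeepSeek-V3 split of roughly 671B total and 37B active parameters per token; the split is an illustrative assumption, not a number stated in this review.

    # Where an "18x" MoE routing-efficiency figure can come from: the ratio of
    # total to active (routed) parameters. The split below is the widely
    # reported DeepSeek-V3 configuration, used here only as an assumption.
    total_params = 671e9     # all experts combined
    active_params = 37e9     # parameters actually used per token

    print(f"total/active parameter ratio: {total_params / active_params:.1f}x")  # ~18.1x

    # Per-token forward compute scales with active parameters (~2 FLOPs per
    # parameter per token), so the saving versus an equally large dense model
    # tracks the same ratio.
    print(f"dense-equivalent FLOPs per token: {2 * total_params:.2e}")
    print(f"routed FLOPs per token:           {2 * active_params:.2e}")

On this reading the 18x is a parameter-routing ratio rather than a measured end-to-end cost reduction; the quoted price per million tokens also depends on hardware utilization and batch size.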

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The circular taxonomy suggests agentic systems will generate higher-quality data that feeds back into future base models.
  • Verification and alignment overhead may become the next dominant constraint once efficiency gains reduce compute costs.
  • Distributed edge deployments could introduce consistency and coordination problems that limit the 10 times cost reduction in practice.

Load-bearing premise

The six paradigms will keep delivering gains without running into new hard limits on data quality, verification or coordination overhead.

What would settle it

The central claim would be tested by direct measurement: whether actual token consumption reaches the projected 9-27 trillion depletion range by 2026-2028, and whether claimed efficiency gains, such as the 10x cost reduction from edge computing, show up in deployed systems.
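
One cheap way to operationalize the first test is to extrapolate the dataset sizes the survey itself quotes and check when the trend crosses the 9-27 trillion token range. The sketch below, in Python, uses only the three public data points cited in Figure 5 (GPT-3, Llama 2, Llama 3); it is a naive log-linear extrapolation, not the paper's compute-optimal projection, and the crossing year moves with the points chosen.

    import math

    # Training-set sizes quoted in Figure 5 (tokens); years are approximate.
    history = {2020: 300e9, 2023: 2e12, 2024: 15e12}  # GPT-3, Llama 2, Llama 3

    # Ordinary least-squares fit of log10(tokens) against year.
    years = list(history)
    logs = [math.log10(t) for t in history.values()]
    n = len(years)
    ybar, lbar = sum(years) / n, sum(logs) / n
    slope = (sum((y - ybar) * (l - lbar) for y, l in zip(years, logs))
             / sum((y - ybar) ** 2 for y in years))
    intercept = lbar - slope * ybar

    # Year at which the fitted trend reaches each end of the availability range.
    for stock in (9e12, 27e12):
        year_hit = (math.log10(stock) - intercept) / slope
        print(f"trend reaches {stock / 1e12:.0f}T tokens around {year_hit:.1f}")

Whether real consumption follows such a single-run trend, or the compute-optimal variant the paper assumes, is exactly what the 2026-2028 window will reveal.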

Figures

Figures reproduced from arXiv: 2601.14053 by Badri N. Patro, Vijay S. Agneeswaran.

Figure 1
Figure 1. Evolution from LLM Foundation to Agentic AI. Three nested paradigms converge to a unified framework: LLM Foundation (blue) encompasses model evolution, scaling challenges, and architectural innovations; GenAI (purple) adds training methodologies (RLHF, PPO, DPO, GRPO, ORPO) and environments; Agentic AI (light blue) extends capabilities through reasoning (ReAct, CoT/ToT), tool use (RAG), and multi-agent sy… view at source ↗
Figure 2
Figure 2. LLMOrbit: A Circular Taxonomy of Large Language Models (2019-2025). This circular orbital architecture presents eight interconnected dimensions navigating the complete LLM landscape: (1) Scaling Wall Analysis examining data scarcity, training costs, and energy consumption with quantitative projections; (2) Model Taxonomy covering 50+ foundation models including GPT-4, Gemini 1.5, Claude 3.5, Llama 3, Deep… view at source ↗
Figure 3
Figure 3. Training Costs Evolution: Exponential Growth Across Three Eras. Stacked bar chart showing hardware costs (blue: amortized chip depreciation + 23% networking overhead) and energy costs (orange: electricity consumption) for 8 frontier models spanning 2020-2025 [15, 133]. Era annotations mark three periods: Early (2020-2021), Scaling (2022-2023), and Frontier (2024-2025). Key numbers demonstrate 100× cost … view at source ↗
Figure 4
Figure 4. Cloud vs. Amortized Costs: Ownership Economics for Large-Scale Training. Grouped bar chart comparing cloud rental pricing (darker bars) with amortized ownership costs (lighter bars: hardware + energy) for 8 frontier models [15, 133]. Cost multipliers displayed above bar pairs show cloud costs are typically 2-4× higher than ownership due to provider margins (30-50%), maintenance overhead, and infrastructu… view at source ↗
Figure 5
Figure 5. Data Exhaustion Projection: The Impending Data Scarcity Crisis. Scatter plot showing historical dataset sizes of actual models (GPT-3: 300B tokens, Llama 2: 2T tokens, Llama 3: 15T tokens, DeepSeek-V3: 14.8T tokens) with exponential growth trend line (2019-2025) and future projection assuming compute-optimal training [151, 49]. The shaded region (9-27 trillion tokens) indicates the estimated range of t… view at source ↗
Figure 6
Figure 6. Overtraining Scenarios: How Training Policies Affect Data Consumption Timeline. Three projection lines showing future dataset requirements under different scaling policies [151, 49]: (1) Compute-optimal following Chinchilla scaling laws (20 tokens/param), (2) Moderate overtraining (50-100 tokens/param), and (3) Aggressive overtraining following Llama 3 approach (200+ tokens/param). Intersection points ma… view at source ↗
Figure 7
Figure 7. Timeline of major language model releases (2019-2025) organized by model families. The vertical axis … view at source ↗
Figure 8
Figure 8. Performance scaling trends across model families on key benchmarks (MMLU, MATH, HumanEval). Each … view at source ↗
Figure 9
Figure 9. Architectural evolution of OpenAI GPT series. view at source ↗
Figure 10
Figure 10. LLaMA architectural innovations. (a) Pre… view at source ↗
Figure 11
Figure 11. DeepSeek architectural and training innovations. (a) Multi-head Latent Attention (MLA) compresses KV … view at source ↗
Figure 12
Figure 12. Architectural evolution from GPT-2 (2019) to present frontier models. While the core Transformer struc… view at source ↗
read the original abstract

The field of artificial intelligence has undergone a revolution from foundational Transformer architectures to reasoning-capable systems approaching human-level performance. We present LLMOrbit, a comprehensive circular taxonomy navigating the landscape of large language models spanning 2019-2025. This survey examines over 50 models across 15 organizations through eight interconnected orbital dimensions, documenting architectural innovations, training methodologies, and efficiency patterns defining modern LLMs, generative AI, and agentic systems. We identify three critical crises: (1) data scarcity (9-27T tokens depleted by 2026-2028), (2) exponential cost growth ($3M to $300M+ in 5 years), and (3) unsustainable energy consumption (22x increase), establishing the scaling wall limiting brute-force approaches. Our analysis reveals six paradigms breaking this wall: (1) test-time compute (o1, DeepSeek-R1 achieve GPT-4 performance with 10x inference compute), (2) quantization (4-8x compression), (3) distributed edge computing (10x cost reduction), (4) model merging, (5) efficient training (ORPO reduces memory 50%), and (6) small specialized models (Phi-4 14B matches larger models). Three paradigm shifts emerge: (1) post-training gains (RLHF, GRPO, pure RL contribute substantially, DeepSeek-R1 achieving 79.8% MATH), (2) efficiency revolution (MoE routing 18x efficiency, Multi-head Latent Attention 8x KV cache compression enables GPT-4-level performance at <$0.30/M tokens), and (3) democratization (open-source Llama 3 88.6% MMLU surpasses GPT-4 86.4%). We provide insights into techniques (RLHF, PPO, DPO, GRPO, ORPO), trace evolution from passive generation to tool-using agents (ReAct, RAG, multi-agent systems), and analyze post-training innovations.
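
As a sanity check on two of the efficiency numbers quoted above, the memory arithmetic is short enough to write out. The sketch below uses a hypothetical dense transformer configuration, not any specific model from the survey; the 8x KV-cache factor and the 4-bit weight format are taken as given from the abstract.

    # Rough memory arithmetic behind two efficiency claims in the abstract:
    # 4-8x weight compression from quantization and ~8x KV-cache compression
    # from latent attention. The configuration below is hypothetical.
    layers, heads, head_dim = 64, 64, 128   # assumed architecture
    context = 128_000                       # assumed context length (tokens)
    params = 70e9                           # assumed parameter count

    # Weights: fp16 (2 bytes/param) versus 4-bit quantization (0.5 bytes/param).
    fp16_gb = params * 2 / 1e9
    int4_gb = params * 0.5 / 1e9
    print(f"weights: {fp16_gb:.0f} GB fp16 -> {int4_gb:.0f} GB int4 ({fp16_gb / int4_gb:.0f}x)")

    # KV cache: standard attention stores a key and a value vector per head,
    # per layer, per token (fp16, 2 bytes per element).
    kv_bytes_per_token = layers * 2 * heads * head_dim * 2
    kv_gb = kv_bytes_per_token * context / 1e9
    print(f"KV cache at {context:,} tokens: {kv_gb:.0f} GB")
    print(f"with the abstract's ~8x latent compression: {kv_gb / 8:.0f} GB")

Savings of this size are what make long-context serving at the quoted sub-$0.30 per million tokens plausible, though realized prices also depend on hardware and utilization.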

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces LLMOrbit, a circular taxonomy of large language models spanning 2019-2025, surveying over 50 models from 15 organizations across eight interconnected dimensions. It identifies three crises establishing a scaling wall—data scarcity (9-27T tokens depleted by 2026-2028), exponential cost growth ($3M to $300M+ over 5 years), and 22x energy consumption increase—and proposes six paradigms to overcome it: test-time compute, quantization, distributed edge computing, model merging, efficient training, and small specialized models. The paper further discusses three paradigm shifts in post-training (e.g., RLHF, GRPO), efficiency (e.g., MoE, MLA), and democratization (e.g., open-source models surpassing closed ones), while tracing evolution toward agentic systems.

Significance. If the synthesis of cited literature is accurate and complete, LLMOrbit provides a useful organizational framework for mapping the transition from scaling-law-driven models to efficient and agentic AI systems. The survey's breadth across architectures, training methods, and efficiency techniques could help consolidate knowledge in a rapidly evolving field, particularly by highlighting how post-training and inference optimizations address brute-force limits.

major comments (2)
  1. [Abstract] Abstract: The headline quantitative claims—data scarcity of 9-27T tokens depleted by 2026-2028, cost growth from $3M to $300M+, and 22x energy consumption increase—are stated as established facts without citations, error bars, sensitivity analysis, or references to the underlying studies. These assertions are load-bearing for the central 'scaling wall' concept and the motivation for the six paradigms, so explicit sourcing and verification against primary sources are required.
  2. [Abstract] Abstract and paradigm discussion: The specific performance claims, such as o1 and DeepSeek-R1 achieving GPT-4-level results with 10x inference compute or DeepSeek-R1 reaching 79.8% on MATH, are presented without direct baseline comparisons, error margins, or pointers to the original evaluation protocols. Since these examples are used to illustrate the test-time compute and post-training paradigms, they need traceable references to ensure the taxonomy's supporting evidence is reproducible.
minor comments (2)
  1. The circular taxonomy is described as a presentational device spanning eight orbital dimensions, but the manuscript would benefit from an explicit table or diagram legend defining each dimension and how models are positioned within the orbit.
  2. Ensure all model performance numbers (e.g., Llama 3 88.6% MMLU, Phi-4 comparisons) include the exact evaluation benchmarks and dates of the cited results to avoid ambiguity in a fast-moving field.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments, which help improve the clarity and rigor of our survey. We agree that the abstract requires explicit citations for the quantitative claims and traceable references for performance examples. We will incorporate these changes in the revised manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The headline quantitative claims—data scarcity of 9-27T tokens depleted by 2026-2028, cost growth from $3M to $300M+, and 22x energy consumption increase—are stated as established facts without citations, error bars, sensitivity analysis, or references to the underlying studies. These assertions are load-bearing for the central 'scaling wall' concept and the motivation for the six paradigms, so explicit sourcing and verification against primary sources are required.

    Authors: We fully agree with this observation. The claims in the abstract are drawn from established literature on scaling laws and resource constraints, but we omitted inline citations to maintain brevity. In the revision, we will add specific references (e.g., to reports on data availability from Epoch AI, cost analyses from training papers, and energy studies) directly in the abstract or as a supporting note. We will also provide ranges and note the sources of the estimates to allow verification. revision: yes

  2. Referee: [Abstract] Abstract and paradigm discussion: The specific performance claims, such as o1 and DeepSeek-R1 achieving GPT-4-level results with 10x inference compute or DeepSeek-R1 reaching 79.8% on MATH, are presented without direct baseline comparisons, error margins, or pointers to the original evaluation protocols. Since these examples are used to illustrate the test-time compute and post-training paradigms, they need traceable references to ensure the taxonomy's supporting evidence is reproducible.

    Authors: We appreciate this point and will address it by adding citations to the original papers and benchmarks. For the o1 and DeepSeek-R1 examples, we will reference the respective model cards or technical reports, include the exact benchmark scores with comparisons to GPT-4, and specify the evaluation protocols (e.g., the MATH dataset version and prompting methods). This will ensure reproducibility and strengthen the evidence for the paradigms discussed. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

This is a survey paper that synthesizes existing literature on LLM scaling limits, data scarcity projections, cost growth, energy consumption, and efficiency paradigms. All three crises and six breaking paradigms are asserted via citations to prior external work rather than internal equations or derivations. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citation chains appear. The 'circular taxonomy' is explicitly a presentational device for organizing models across dimensions and does not reduce any claimed performance metric or prediction to a quantity defined inside the paper itself. The central claims therefore inherit strength from referenced sources and remain independently falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The taxonomy rests on the authors' choice to group literature into eight orbital dimensions and to treat observed trends in data, cost, and energy as hard limits without new measurements.

axioms (1)
  • domain assumption: Existing LLM literature can be exhaustively organized into eight interconnected orbital dimensions without significant omissions.
    This choice defines the structure of the entire survey.
invented entities (2)
  • scaling wall · no independent evidence
    purpose: Conceptual limit on brute-force scaling caused by the three crises
    New framing introduced to unify the data, cost, and energy observations.
  • LLMOrbit circular taxonomy · no independent evidence
    purpose: New organizational framework for navigating LLM landscape
    Invented structure that does not appear in prior cited work.

pith-pipeline@v0.9.0 · 5692 in / 1541 out tokens · 50494 ms · 2026-05-16T12:44:51.616583+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

173 extracted references · 173 canonical work pages · 81 internal anchors

  1. [1]

    Phi-4 Technical Report

    Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, et al. Phi-4 technical report. arXiv preprint arXiv:2412.08905, 2024

  2. [2]

    Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

    Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219, 2024

  3. [3]

    Armen Aghajanyan, Sonal Gupta, and Luke Zettlemoyer. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, pages 7319–7328, 2021. Foundation for understanding low-rank adaptation methods

  4. [4]

    Olmo 2: Post-norm architecture and training stability. arXiv preprint, 2025

    AI2. Olmo 2: Post-norm architecture and training stability. arXiv preprint, 2025

  5. [5]

    Olmo 3 think: Open reasoning model with full transparency. arXiv preprint, 2025

    AI2. Olmo 3 think: Open reasoning model with full transparency. arXiv preprint, 2025

  6. [6]

    GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

    Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023

  7. [7]

    Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022

  8. [8]

    Qwen 3: Advancing open-source language models. arXiv preprint, 2025

    Alibaba Cloud. Qwen 3: Advancing open-source language models. arXiv preprint, 2025

  9. [9]

    Qwen3-next: Hybrid architecture with gated deltanet. arXiv preprint, 2025

    Alibaba Cloud. Qwen3-next: Hybrid architecture with gated deltanet. arXiv preprint, 2025

  10. [10]

    Model context protocol (MCP): Standardizing AI-tool communication. https://modelcontextprotocol.io, 2024

    Anthropic. Model context protocol (MCP): Standardizing AI-tool communication. https://modelcontextprotocol.io, 2024. Protocol for standardized communication between AI models and external tools/data sources using JSON-RPC

  11. [11]

    Neural machine translation by jointly learning to align and translate

    Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations, 2015

  12. [12]

    Constitutional AI: Harmlessness from AI Feedback

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022

  13. [13]

    Longformer: The Long-Document Transformer

    Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020

  14. [14]

    Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

    Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. In arXiv preprint arXiv:1308.3432, 2013. Straight-through estimator for gradient approximation

  15. [15]

    The rising costs of training frontier ai models. arXiv preprint arXiv:2405.21015, 2024

    Tamay Besiroglu, Jaime Sevilla, Lennart Heim, Marius Hobbhahn, Pablo Villalobos, and David Owen. The rising costs of training frontier ai models. arXiv preprint arXiv:2405.21015, 2024

  16. [16]

    Graph of thoughts: Solving elaborate problems with large language models. arXiv preprint arXiv:2308.09687, 2023

    Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Michal Podstawski, Hubert Niewiadomski, Piotr Nyczyk, et al. Graph of thoughts: Solving elaborate problems with large language models. arXiv preprint arXiv:2308.09687, 2023. Extends tree-of-thoughts to DAG structure enabling parallel exploration an...

  17. [17]

    Rank analysis of incomplete block designs: I

    Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345, 1952. Bradley-Terry model for pairwise preference modeling

  18. [18]

    Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020

  19. [19]

    Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

    Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, and Tri Dao. Medusa: Simple llm inference acceleration framework with multiple decoding heads. arXiv preprint arXiv:2401.10774, 2024

  20. [20]

    Langchain: Building applications with llms through composability. GitHub repository, 2023

    Harrison Chase. Langchain: Building applications with llms through composability. GitHub repository, 2023

  21. [21]

    Accelerating Large Language Model Decoding with Speculative Sampling

    Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318, 2023

  22. [22]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

  23. [23]

    AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors

    Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chen Qian, Chi-Min Chan, Yujia Qin, Yaxi Lu, Ruobing Xie, et al. AgentVerse: Facilitating multi-agent collaboration and exploring emergent behaviors. arXiv preprint arXiv:2308.10848, 2023. Dynamic team assembly with blackboard architecture for variable expertise requirements

  24. [24]

    Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

    Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E Gonzalez, et al. Chatbot arena: An open platform for evaluating llms by human preference. arXiv preprint arXiv:2403.04132, 2024.

  25. [25]

    Generating Long Sequences with Sparse Transformers

    Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. In arXiv preprint arXiv:1904.10509,

  26. [26]

    Sparse attention patterns for efficient long-sequence modeling

  27. [27]

    Supervising strong learners by amplifying weak experts

    Paul Christiano, Buck Shlegeris, and Dario Amodei. Supervising strong learners by amplifying weak experts. arXiv preprint arXiv:1810.08575, 2018. Scalable oversight through iterated amplification and distillation

  28. [28]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  29. [29]

    Introducing devin: The first ai software engineer. https://www.cognition-labs.com/introducing-devin, 2024

    Cognition Labs. Introducing devin: The first ai software engineer. https://www.cognition-labs.com/introducing-devin, 2024. Autonomous AI coding agent with end-to-end software development capabilities

  30. [30]

    Blackboard systems. AI Expert, 6(9):40–47, 1991

    Daniel D Corkill. Blackboard systems. AI Expert, 6(9):40–47, 1991. Blackboard architecture: shared memory space for multi-agent coordination and problem-solving

  31. [31]

    The Wall Confronting Large Language Models

    Peter V. Coveney and Sauro Succi. The wall confronting large language models. arXiv preprint arXiv:2507.19703, 2025. Demonstrates that scaling laws severely limit LLMs’ ability to improve prediction uncertainty and reliability

  32. [32]

    FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

    Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023

  33. [33]

    Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022

  34. [34]

    DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

    DeepSeek-AI. Deepseek-coder: When the large language model meets programming–the rise of code intelligence. arXiv preprint arXiv:2401.14196, 2024

  35. [35]

    DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

    DeepSeek-AI. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954, 2024

  36. [39]

    DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

    DeepSeek-AI. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. arXiv preprint arXiv:2401.06066, 2024

  37. [40]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  38. [41]

    DeepSeek-V3 Technical Report

    DeepSeek-AI. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2025

  39. [42]

    Scaling vision transformers to 22 billion parameters. arXiv preprint arXiv:2302.05442, 2023

    Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. arXiv preprint arXiv:2302.05442, 2023

  40. [43]

    QLoRA: Efficient Finetuning of Quantized LLMs

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. arXiv preprint arXiv:2305.14314, 2023

  41. [44]

    QLoRA: Efficient finetuning of quantized LLMs. Advances in Neural Information Processing Systems, 36, 2024

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient finetuning of quantized LLMs. Advances in Neural Information Processing Systems, 36, 2024. NeurIPS 2023 proceedings published in 2024. 4-bit quantization with backpropagation through frozen quantized weights.

  42. [45]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2019

  43. [46]

    Palm-e: An embodied multimodal language model. International Conference on Machine Learning (ICML), pages 8469–8488,

    Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model. International Conference on Machine Learning (ICML), pages 8469–8488,

  44. [47]

    Embodied multimodal model integrating vision and language for robotics

  45. [48]

    The Llama 3 Herd of Models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  46. [49]

    Alpacafarm: A simulation framework for methods that learn from human feedback. arXiv preprint arXiv:2305.14387, 2023

    Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Alpacafarm: A simulation framework for methods that learn from human feedback. arXiv preprint arXiv:2305.14387, 2023

  47. [50]

    Elicit: The ai research assistant. https://elicit.org, 2024

    Elicit. Elicit: The ai research assistant. https://elicit.org, 2024. AI assistant for literature review and research synthesis

  48. [51]

    Can ai scaling continue through 2030? Epoch AI Research, 2024

    Epoch AI. Can ai scaling continue through 2030? Epoch AI Research, 2024

  49. [52]

    FIPA ACL message structure specification

    FIPA. FIPA ACL message structure specification. Foundation for Intelligent Physical Agents, 2002. FIPA Agent Communication Language: standardized agent message protocols with performatives

  50. [53]

    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

    Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2023

  51. [54]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

  52. [55]

    Arcee’s MergeKit: A toolkit for merging large language models. arXiv preprint arXiv:2403.13257, 2024

    Charles Goddard, Shamane Siriwardhana, Malikeh Ehghaghi, et al. Arcee’s MergeKit: A toolkit for merging large language models. arXiv preprint arXiv:2403.13257, 2024

  53. [56]

    Gemma 2: Improving open language models at a practical size. arXiv preprint, 2024

    Google DeepMind. Gemma 2: Improving open language models at a practical size. arXiv preprint, 2024

  54. [57]

    Gemma 3: Aggressive sliding window attention with 5:1 ratio. arXiv preprint, 2025

    Google DeepMind. Gemma 3: Aggressive sliding window attention with 5:1 ratio. arXiv preprint, 2025

  55. [58]

    Olmo: Accelerating the science of language models. arXiv preprint arXiv:2402.00838, 2024

    Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, et al. Olmo: Accelerating the science of language models. arXiv preprint arXiv:2402.00838, 2024

  56. [59]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023

  57. [60]

    Textbooks Are All You Need

    Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, et al. Textbooks are all you need. arXiv preprint arXiv:2306.11644, 2023. Phi-1 model paper

  58. [61]

    World Models

    David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018

  59. [62]

    Mastering Diverse Domains through World Models

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023

  60. [63]

    Understanding masked self-attention as implicit positional encoding. arXiv preprint arXiv:2310.04393, 2023

    Adi Haviv, Jonathan Berant, and Amir Globerson. Understanding masked self-attention as implicit positional encoding. arXiv preprint arXiv:2310.04393, 2023

  61. [64]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2021.

  62. [65]

    Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020

  63. [66]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022

  64. [67]

    ORPO: Monolithic Preference Optimization without Reference Model

    Jiwoo Hong, Noah Lee, and James Thorne. ORPO: Monolithic preference optimization without reference model. arXiv preprint arXiv:2403.07691, 2024

  65. [68]

    MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework

    Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al. Metagpt: Meta programming for multi-agent collaborative framework. arXiv preprint arXiv:2308.00352, 2023

  66. [69]

    On the slow death of scaling. SSRN Electronic Journal, 2025

    Sara Hooker. On the slow death of scaling. SSRN Electronic Journal, 2025. Available at SSRN: https://ssrn.com/abstract=5877662

  67. [70]

    Parameter-efficient transfer learning for nlp

    Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning, pages 2790–2799. PMLR, 2019

  68. [71]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021

  69. [72]

    VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

    Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, and Li Fei-Fei. VoxPoser: Composable 3d value maps for robotic manipulation with language models. arXiv preprint arXiv:2307.05973, 2023. LLM-based framework for robot manipulation via composable 3D affordances

  70. [73]

    AI safety via debate

    Geoffrey Irving, Paul Christiano, and Dario Amodei. AI safety via debate. arXiv preprint arXiv:1805.00899, 2018. Proposes debate-based oversight for AI safety through adversarial interactions

  71. [74]

    Unsupervised Dense Information Retrieval with Contrastive Learning

    Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. Unsupervised dense information retrieval with contrastive learning. arXiv preprint arXiv:2112.09118, 2022

  72. [75]

    Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2704–2713, 2018

  73. [76]

    Phi-2: The surprising power of small language models. Microsoft Research Blog,

    Mojan Javaheripi, Sébastien Bubeck, Marah Abdin, Jyoti Aneja, Sébastien Bubeck, Caio César Teodoro Mendes, Allie Del Giorno, Ronen Eldan, Sivakanth Gopi, Ece Kamar, et al. Phi-2: The surprising power of small language models. Microsoft Research Blog,

  74. [77]

    Available at https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/

  75. [78]

    Mistral 7B

    Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023

  76. [79]

    Mixtral of Experts

    Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024

  77. [80]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770, 2023. Real GitHub issues from popular repositories, state-of-art resolves 13.8% of issues

  78. [81]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

  79. [82]

    Transformers are rnns: Fast autoregressive transformers with linear attention. arXiv preprint arXiv:2006.16236, 2020

    Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. arXiv preprint arXiv:2006.16236, 2020

  80. [83]

    Fast inference from transformers via speculative decoding. arXiv preprint arXiv:2211.17192, 2023

    Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. arXiv preprint arXiv:2211.17192, 2023

Showing first 80 references.