Recognition: 2 Lean theorem links
LLMOrbit: A Circular Taxonomy of Large Language Models - From Scaling Walls to Agentic AI Systems
Pith reviewed 2026-05-16 12:44 UTC · model grok-4.3
The pith
Large language models face a scaling wall of data scarcity, cost growth, and energy demands, but six paradigms are breaking through it toward agentic systems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that brute-force scaling has hit a wall defined by data scarcity, exponential costs and energy use, and that six paradigms—test-time compute, quantization, distributed edge computing, model merging, efficient training and small specialized models—together with post-training gains and efficiency revolutions are enabling continued progress toward reasoning-capable agentic AI systems.
What carries the argument
The LLMOrbit circular taxonomy of eight orbital dimensions that interconnects architectural innovations, training methodologies and efficiency patterns around the scaling wall.
If this is right
- Post-training techniques such as RLHF and pure RL deliver substantial benchmark gains without additional pretraining data.
- Efficiency methods like MoE routing and latent attention achieve GPT-4-level performance at under $0.30 per million tokens.
- Open-source models such as Llama 3 surpass closed models on benchmarks like MMLU, pointing to broader access.
- Agentic frameworks using tools and multi-agent coordination extend capabilities beyond single-pass generation.
- Small specialized models match the performance of much larger ones on targeted tasks.
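The quantization claim above can be sanity-checked with simple memory arithmetic. This is a minimal sketch, not a figure from the paper: the 70B parameter count and the bit widths are illustrative assumptions.

```python
# Back-of-envelope memory arithmetic for weight quantization.
# Parameter count and bit widths are illustrative assumptions,
# not values taken from the survey.

def model_memory_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (ignores activations and KV cache)."""
    return params * bits_per_weight / 8 / 1e9

PARAMS = 70e9  # hypothetical 70B-parameter model

fp16_gb = model_memory_gb(PARAMS, 16)  # half-precision baseline
int4_gb = model_memory_gb(PARAMS, 4)   # 4-bit quantized weights

print(f"fp16: {fp16_gb:.0f} GB, int4: {int4_gb:.0f} GB, "
      f"compression: {fp16_gb / int4_gb:.0f}x")  # → 140 GB vs 35 GB, 4x
```

Going from 16-bit to 4-bit weights gives the 4x end of the 4-8x compression range quoted in the abstract; the 8x end would correspond to quantizing from 32-bit storage.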
Where Pith is reading between the lines
- The circular taxonomy suggests agentic systems will generate higher-quality data that feeds back into future base models.
- Verification and alignment overhead may become the next dominant constraint once efficiency gains reduce compute costs.
- Distributed edge deployments could introduce consistency and coordination problems that limit the 10 times cost reduction in practice.
Load-bearing premise
The six paradigms will keep delivering gains without running into new hard limits on data quality, verification or coordination overhead.
What would settle it
Direct measurement would test the central claim: whether actual token consumption reaches the projected 9-27 trillion depletion range by 2026-2028, and whether claimed efficiency gains, such as the 10x cost reduction from edge computing, materialize in deployed systems.
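The depletion test is just an inequality on cumulative token consumption versus available stock. A minimal sketch: the 9-27T token stock is the range quoted in the abstract, while the per-year consumption series is a purely illustrative assumption, not a measurement.

```python
# Sketch of the depletion test: cumulative training-token consumption
# versus an available-stock range. The 9-27T stock comes from the
# abstract; the yearly consumption figures are illustrative only.

def depletion_year(stock_tokens, consumption_by_year):
    """Return the first year cumulative consumption meets the stock, else None."""
    total = 0.0
    for year in sorted(consumption_by_year):
        total += consumption_by_year[year]
        if total >= stock_tokens:
            return year
    return None

# Assumed consumption: doubling yearly from an arbitrary 2T-token base.
consumption = {2024: 2e12, 2025: 4e12, 2026: 8e12, 2027: 16e12, 2028: 32e12}

low, high = 9e12, 27e12  # stock range quoted in the abstract
print(depletion_year(low, consumption), depletion_year(high, consumption))
# → 2026 2027 under these assumed consumption figures
```

Under these assumptions the low and high stock estimates are exhausted in 2026 and 2027, inside the paper's projected window; different assumed consumption curves shift the answer, which is exactly why measured consumption would settle the claim.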
Original abstract
The field of artificial intelligence has undergone a revolution from foundational Transformer architectures to reasoning-capable systems approaching human-level performance. We present LLMOrbit, a comprehensive circular taxonomy navigating the landscape of large language models spanning 2019-2025. This survey examines over 50 models across 15 organizations through eight interconnected orbital dimensions, documenting architectural innovations, training methodologies, and efficiency patterns defining modern LLMs, generative AI, and agentic systems. We identify three critical crises: (1) data scarcity (9-27T tokens depleted by 2026-2028), (2) exponential cost growth ($3M to $300M+ in 5 years), and (3) unsustainable energy consumption (22x increase), establishing the scaling wall limiting brute-force approaches. Our analysis reveals six paradigms breaking this wall: (1) test-time compute (o1, DeepSeek-R1 achieve GPT-4 performance with 10x inference compute), (2) quantization (4-8x compression), (3) distributed edge computing (10x cost reduction), (4) model merging, (5) efficient training (ORPO reduces memory 50%), and (6) small specialized models (Phi-4 14B matches larger models). Three paradigm shifts emerge: (1) post-training gains (RLHF, GRPO, pure RL contribute substantially, DeepSeek-R1 achieving 79.8% MATH), (2) efficiency revolution (MoE routing 18x efficiency, Multi-head Latent Attention 8x KV cache compression enables GPT-4-level performance at <$0.30/M tokens), and (3) democratization (open-source Llama 3 88.6% MMLU surpasses GPT-4 86.4%). We provide insights into techniques (RLHF, PPO, DPO, GRPO, ORPO), trace evolution from passive generation to tool-using agents (ReAct, RAG, multi-agent systems), and analyze post-training innovations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces LLMOrbit, a circular taxonomy of large language models spanning 2019-2025, surveying over 50 models from 15 organizations across eight interconnected dimensions. It identifies three crises establishing a scaling wall—data scarcity (9-27T tokens depleted by 2026-2028), exponential cost growth ($3M to $300M+ over 5 years), and 22x energy consumption increase—and proposes six paradigms to overcome it: test-time compute, quantization, distributed edge computing, model merging, efficient training, and small specialized models. The paper further discusses three paradigm shifts in post-training (e.g., RLHF, GRPO), efficiency (e.g., MoE, MLA), and democratization (e.g., open-source models surpassing closed ones), while tracing evolution toward agentic systems.
Significance. If the synthesis of cited literature is accurate and complete, LLMOrbit provides a useful organizational framework for mapping the transition from scaling-law-driven models to efficient and agentic AI systems. The survey's breadth across architectures, training methods, and efficiency techniques could help consolidate knowledge in a rapidly evolving field, particularly by highlighting how post-training and inference optimizations address brute-force limits.
major comments (2)
- [Abstract] The headline quantitative claims—data scarcity of 9-27T tokens depleted by 2026-2028, cost growth from $3M to $300M+, and 22x energy consumption increase—are stated as established facts without citations, error bars, sensitivity analysis, or references to the underlying studies. These assertions are load-bearing for the central 'scaling wall' concept and the motivation for the six paradigms, so explicit sourcing and verification against primary sources are required.
- [Abstract, paradigm discussion] The specific performance claims, such as o1 and DeepSeek-R1 achieving GPT-4-level results with 10x inference compute or DeepSeek-R1 reaching 79.8% on MATH, are presented without direct baseline comparisons, error margins, or pointers to the original evaluation protocols. Since these examples illustrate the test-time compute and post-training paradigms, they need traceable references to ensure the taxonomy's supporting evidence is reproducible.
minor comments (2)
- The circular taxonomy is described as a presentational device spanning eight orbital dimensions, but the manuscript would benefit from an explicit table or diagram legend defining each dimension and how models are positioned within the orbit.
- Ensure all model performance numbers (e.g., Llama 3 88.6% MMLU, Phi-4 comparisons) include the exact evaluation benchmarks and dates of the cited results to avoid ambiguity in a fast-moving field.
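The reporting practice recommended in the minor comments can be made concrete as a small provenance record for each quoted number. This is a sketch; the field names and the protocol details in the example are illustrative placeholders, not values verified against the cited papers.

```python
# A minimal provenance record for reported benchmark numbers, along the
# lines the review recommends. Field names are illustrative, not from
# the paper under review.
from dataclasses import dataclass

@dataclass(frozen=True)
class BenchmarkResult:
    model: str      # model name and variant as evaluated
    benchmark: str  # exact benchmark name, version, and shot setting
    score: float    # reported metric value
    metric: str     # what the number measures (accuracy, pass@1, ...)
    eval_date: str  # when the cited result was produced
    source: str     # citation or URL for the evaluation protocol

# Example using a figure quoted in the abstract; the shot setting and
# date are placeholders showing what complete reporting would include.
r = BenchmarkResult(
    model="Llama 3",
    benchmark="MMLU (shot setting per cited report)",
    score=88.6,
    metric="accuracy (%)",
    eval_date="2024",
    source="arXiv:2407.21783",
)
print(f"{r.model}: {r.score} {r.metric} on {r.benchmark} [{r.source}]")
```

Attaching a record like this to every headline number would remove the ambiguity the referee flags in a fast-moving field.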
Simulated Author's Rebuttal
We thank the referee for their insightful comments, which help improve the clarity and rigor of our survey. We agree that the abstract requires explicit citations for the quantitative claims and traceable references for performance examples. We will incorporate these changes in the revised manuscript.
Point-by-point responses
- Referee: [Abstract] The headline quantitative claims—data scarcity of 9-27T tokens depleted by 2026-2028, cost growth from $3M to $300M+, and 22x energy consumption increase—are stated as established facts without citations, error bars, sensitivity analysis, or references to the underlying studies. These assertions are load-bearing for the central 'scaling wall' concept and the motivation for the six paradigms, so explicit sourcing and verification against primary sources are required.
  Authors: We fully agree with this observation. The claims in the abstract are drawn from established literature on scaling laws and resource constraints, but we omitted inline citations to maintain brevity. In the revision, we will add specific references (e.g., to reports on data availability from Epoch AI, cost analyses from training papers, and energy studies) directly in the abstract or as a supporting note. We will also provide ranges and note the sources of the estimates to allow verification. (revision: yes)
- Referee: [Abstract, paradigm discussion] The specific performance claims, such as o1 and DeepSeek-R1 achieving GPT-4-level results with 10x inference compute or DeepSeek-R1 reaching 79.8% on MATH, are presented without direct baseline comparisons, error margins, or pointers to the original evaluation protocols. Since these examples illustrate the test-time compute and post-training paradigms, they need traceable references to ensure the taxonomy's supporting evidence is reproducible.
  Authors: We appreciate this point and will address it by adding citations to the original papers and benchmarks. For the o1 and DeepSeek-R1 examples, we will reference the respective model cards or technical reports, include the exact benchmark scores with comparisons to GPT-4, and specify the evaluation protocols (e.g., the MATH dataset version and prompting methods). This will ensure reproducibility and strengthen the evidence for the paradigms discussed. (revision: yes)
Circularity Check
No significant circularity detected
Full rationale
This is a survey paper that synthesizes existing literature on LLM scaling limits, data scarcity projections, cost growth, energy consumption, and efficiency paradigms. All three crises and six breaking paradigms are asserted via citations to prior external work rather than internal equations or derivations. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citation chains appear. The 'circular taxonomy' is explicitly a presentational device for organizing models across dimensions and does not reduce any claimed performance metric or prediction to a quantity defined inside the paper itself. The central claims therefore inherit strength from referenced sources and remain independently falsifiable.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: existing LLM literature can be exhaustively organized into eight interconnected orbital dimensions without significant omissions.
invented entities (2)
- scaling wall: no independent evidence
- LLMOrbit circular taxonomy: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (unclear)
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "We identify three critical crises: (1) data scarcity (9-27T tokens depleted by 2026-2028), (2) exponential cost growth ($3M to $300M+ in 5 years), and (3) unsustainable energy consumption (22x increase)"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction (unclear)
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "six paradigms breaking this wall: (1) test-time compute … (2) quantization … (3) distributed edge computing … (4) model merging … (5) efficient training (ORPO) … (6) small specialized models"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, et al. Phi-4 technical report. arXiv preprint arXiv:2412.08905, 2024.
- [2] Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219, 2024.
- [3] Armen Aghajanyan, Sonal Gupta, and Luke Zettlemoyer. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, pages 7319–7328, 2021. Foundation for understanding low-rank adaptation methods.
- [4] AI2. Olmo 2: Post-norm architecture and training stability. arXiv preprint, 2025.
- [5] AI2. Olmo 3 think: Open reasoning model with full transparency. arXiv preprint, 2025.
- [6] Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023.
- [7] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
- [8] Alibaba Cloud. Qwen 3: Advancing open-source language models. arXiv preprint, 2025.
- [9] Alibaba Cloud. Qwen3-next: Hybrid architecture with gated deltanet. arXiv preprint, 2025.
- [10] Anthropic. Model Context Protocol (MCP): Standardizing AI-tool communication. https://modelcontextprotocol.io, 2024. Protocol for standardized communication between AI models and external tools/data sources using JSON-RPC.
- [11] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations, 2015.
- [12] Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073, 2022.
- [13] Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.
- [14] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013. Straight-through estimator for gradient approximation.
- [15] Tamay Besiroglu, Jaime Sevilla, Lennart Heim, Marius Hobbhahn, Pablo Villalobos, and David Owen. The rising costs of training frontier AI models. arXiv preprint arXiv:2405.21015, 2024.
- [16] Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Michal Podstawski, Hubert Niewiadomski, Piotr Nyczyk, et al. Graph of thoughts: Solving elaborate problems with large language models. arXiv preprint arXiv:2308.09687, 2023. Extends tree-of-thoughts to a DAG structure enabling parallel exploration an...
- [17] Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324–345, 1952. Bradley-Terry model for pairwise preference modeling.
- [18] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
- [19] Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, and Tri Dao. Medusa: Simple LLM inference acceleration framework with multiple decoding heads. arXiv preprint arXiv:2401.10774, 2024.
- [20] Harrison Chase. LangChain: Building applications with LLMs through composability. GitHub repository, 2023.
- [21] Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318, 2023.
- [22] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
- [23] Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chen Qian, Chi-Min Chan, Yujia Qin, Yaxi Lu, Ruobing Xie, et al. AgentVerse: Facilitating multi-agent collaboration and exploring emergent behaviors. arXiv preprint arXiv:2308.10848, 2023. Dynamic team assembly with blackboard architecture for variable expertise requirements.
- [24] Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E Gonzalez, et al. Chatbot Arena: An open platform for evaluating LLMs by human preference. arXiv preprint arXiv:2403.04132, 2024.
- [25] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509.
- [26] Sparse attention patterns for efficient long-sequence modeling.
- [27] Paul Christiano, Buck Shlegeris, and Dario Amodei. Supervising strong learners by amplifying weak experts. arXiv preprint arXiv:1810.08575, 2018. Scalable oversight through iterated amplification and distillation.
- [28] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- [29] Cognition Labs. Introducing Devin: The first AI software engineer. https://www.cognition-labs.com/introducing-devin, 2024. Autonomous AI coding agent with end-to-end software development capabilities.
- [30] Daniel D Corkill. Blackboard systems. AI Expert, 6(9):40–47, 1991. Blackboard architecture: shared memory space for multi-agent coordination and problem-solving.
- [31] Peter V. Coveney and Sauro Succi. The wall confronting large language models. arXiv preprint arXiv:2507.19703, 2025. Demonstrates that scaling laws severely limit LLMs' ability to improve prediction uncertainty and reliability.
- [32] Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023.
- [33] Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022.
- [34] DeepSeek-AI. DeepSeek-Coder: When the large language model meets programming -- the rise of code intelligence. arXiv preprint arXiv:2401.14196, 2024.
- [35] DeepSeek-AI. DeepSeek LLM: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954, 2024.
- [39] DeepSeek-AI. DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models. arXiv preprint arXiv:2401.06066, 2024.
- [40] DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- [41] DeepSeek-AI. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2025.
- [42] Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. arXiv preprint arXiv:2302.05442, 2023.
- [43] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient finetuning of quantized LLMs. arXiv preprint arXiv:2305.14314, 2023.
- [44] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient finetuning of quantized LLMs. Advances in Neural Information Processing Systems, 36, 2024. NeurIPS 2023 proceedings published in 2024. 4-bit quantization with backpropagation through frozen quantized weights.
- [45] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2019.
- [46] Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. PaLM-E: An embodied multimodal language model. International Conference on Machine Learning (ICML), pages 8469–8488.
- [47] Embodied multimodal model integrating vision and language for robotics.
- [48] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- [49] Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. AlpacaFarm: A simulation framework for methods that learn from human feedback. arXiv preprint arXiv:2305.14387, 2023.
- [50] Elicit. Elicit: The AI research assistant. https://elicit.org, 2024. AI assistant for literature review and research synthesis.
- [51] Epoch AI. Can AI scaling continue through 2030? Epoch AI Research, 2024.
- [52] FIPA. FIPA ACL message structure specification. Foundation for Intelligent Physical Agents, 2002. FIPA Agent Communication Language: standardized agent message protocols with performatives.
- [53] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2023.
- [54] Gemini Team. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
- [55] Charles Goddard, Shamane Siriwardhana, Malikeh Ehghaghi, et al. Arcee's MergeKit: A toolkit for merging large language models. arXiv preprint arXiv:2403.13257, 2024.
- [56] Google DeepMind. Gemma 2: Improving open language models at a practical size. arXiv preprint, 2024.
- [57] Google DeepMind. Gemma 3: Aggressive sliding window attention with 5:1 ratio. arXiv preprint, 2025.
- [58] Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, et al. OLMo: Accelerating the science of language models. arXiv preprint arXiv:2402.00838, 2024.
- [59] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.
- [60] Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, et al. Textbooks are all you need. arXiv preprint arXiv:2306.11644, 2023. Phi-1 model paper.
- [61] David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018.
- [62] Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023.
- [63] Adi Haviv, Jonathan Berant, and Amir Globerson. Understanding masked self-attention as implicit positional encoding. arXiv preprint arXiv:2310.04393, 2023.
- [64] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2021.
- [65] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
- [66] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
- [67] Jiwoo Hong, Noah Lee, and James Thorne. ORPO: Monolithic preference optimization without reference model. arXiv preprint arXiv:2403.07691, 2024.
- [68] Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al. MetaGPT: Meta programming for a multi-agent collaborative framework. arXiv preprint arXiv:2308.00352, 2023.
- [69] Sara Hooker. On the slow death of scaling. SSRN Electronic Journal, 2025. Available at SSRN: https://ssrn.com/abstract=5877662.
- [70] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, pages 2790–2799. PMLR, 2019.
- [71] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
- [72] Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, and Li Fei-Fei. VoxPoser: Composable 3D value maps for robotic manipulation with language models. arXiv preprint arXiv:2307.05973, 2023. LLM-based framework for robot manipulation via composable 3D affordances.
- [73] Geoffrey Irving, Paul Christiano, and Dario Amodei. AI safety via debate. arXiv preprint arXiv:1805.00899, 2018. Proposes debate-based oversight for AI safety through adversarial interactions.
- [74] Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. Unsupervised dense information retrieval with contrastive learning. arXiv preprint arXiv:2112.09118, 2022.
- [75] Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2704–2713, 2018.
- [76] Mojan Javaheripi, Sébastien Bubeck, Marah Abdin, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Ronen Eldan, Sivakanth Gopi, Ece Kamar, et al. Phi-2: The surprising power of small language models. Microsoft Research Blog.
- [77] Available at https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/.
- [78] Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.
- [79] Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024.
- [80] Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770, 2023. Real GitHub issues from popular repositories; state-of-the-art resolves 13.8% of issues.
- [81] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
- [82] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. arXiv preprint arXiv:2006.16236, 2020.
- [83] Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. arXiv preprint arXiv:2211.17192, 2023.