A Theory of Training Profit-Optimal LLMs
Pith reviewed 2026-05-20 20:08 UTC · model grok-4.3
The pith
In the compute-bound regime, optimal LLM model size and token budget scale nearly linearly with hardware efficiency, making total training cost grow sub-quadratically.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that rational profit maximization implies specific scaling relationships. In compute-bound settings, optimal model size and token budget track hardware efficiency E nearly linearly, while total training cost scales sub-quadratically in E. Data-efficiency gains encourage larger models and higher spending. In the data-bound regime with fixed D, optimal expenditure scales as D²/E.
What carries the argument
The profit-maximization problem that integrates scaling laws for loss with a fixed quality threshold governing consumer adoption.
If this is right
- Data-efficiency improvements cause the firm to select larger models and higher total training expenditure.
- In the data-limited regime profit-optimal spending rises with the square of available data and falls with hardware or data efficiency.
- Observed industry trends match the model's permissive variants under compute bounds but diverge from predictions under data bounds or stalled hardware progress.
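The D²/E claim in the second bullet can be illustrated numerically. The following is a minimal sketch, not the paper's derivation. It assumes the Leontief quality law quoted further down this page, q(n,d) = min{a n^α, b d^β} with α = β, a training cost of 6nd/E, and a firm that picks the model size at which neither argument of the min is wasted. The constants a and b, the numeric inputs, and all function names are hypothetical.

```python
def balanced_model_size(d, a=1.0, b=1.0, alpha=0.3, beta=0.3):
    """Model size n at which a * n**alpha == b * d**beta.

    Under a min{} quality law, parameters beyond this point add cost
    but no quality. The constants are illustrative placeholders.
    """
    return (b * d**beta / a) ** (1.0 / alpha)


def training_cost(n, d, E):
    """Training cost in dollars: 6*n*d FLOPs divided by E FLOPs per dollar."""
    return 6.0 * n * d / E


def data_bound_cost(d, E):
    """Training cost when the data budget d is fixed and n is balanced to it."""
    return training_cost(balanced_model_size(d), d, E)


base = data_bound_cost(1e12, 1e17)
double_data = data_bound_cost(2e12, 1e17)
double_eff = data_bound_cost(1e12, 2e17)

print(double_data / base)  # close to 4: cost grows with D squared
print(double_eff / base)   # close to 0.5: cost falls as 1/E
```

With α = β the balanced model size is proportional to D, so 6nd/E inherits the D² dependence. If α and β differ, the same sketch gives an exponent of 1 + β/α in D instead of 2.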
Where Pith is reading between the lines
- Continued hardware-efficiency growth would make progressively larger models cheaper relative to their quality gains, potentially shortening the time between capability thresholds.
- If adoption thresholds prove sensitive to price or competing models, the optimal training expenditure could shift downward from the levels predicted here.
- The same framework could be used to compare profit-optimal choices against current public statements about planned training runs.
Load-bearing premise
Consumer adoption depends only on whether model quality exceeds a fixed threshold that does not change with price, competition, or network effects.
What would settle it
Direct observation that real training expenditures fail to increase sub-quadratically with measured hardware-efficiency gains or quadratically with available data volume would falsify the predicted profit-optimal behavior.
Original abstract
Scaling LLMs requires tremendous computational resources, and recent advances in AI have gone hand in hand with massive amounts of capital expenditure. While it is established that scaling up LLMs reliably increases model quality (quantified in terms of loss or downstream evaluations), it is unclear how these quality improvements translate to potential revenue, and whether revenue increases would offset costs of larger-scale training and inference. In this work, we develop an economic model for characterizing the rational behavior of an LLM training firm by combining scaling laws with microeconomic theory. Under our model of firm behavior, LLM quality can be increased with more parameters and training tokens, leading to more potential adoption by consumers, who each have a quality threshold for using the LLM. On the other hand, additional parameters and training tokens both incur additional costs. We analyze the profit maximization problem for this model under compute-bound and data-bound regimes. In the compute-bound regime, optimal model size and token budget track hardware efficiency $E$ (FLOPs/\$) at a near-linear rate; total training cost then scales sub-quadratically in $E$. Data efficiency improvements incentivize larger models and training expenditure. When we are limited to $D$ data, profit-optimal training expenditure scales as $D^2/E$, i.e., it increases with data and decreases with hardware efficiency (as well as data efficiency). Finally, we analyze practical trends in training expenditure: current trends are consistent with our most permissive model variants in the compute-bound regime, but are not profit-optimal in the data-bound regime or assuming hardware advances will stall. Overall, our results provide a theory of profit-optimal LLM training, providing a foundation for engaging critically with industry statements and supporting long-term economic decision making.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops a microeconomic model of an LLM training firm that maximizes profit by combining neural scaling laws for model quality Q(N,D) with a consumer adoption function based on a fixed quality threshold q*. It derives closed-form expressions for profit-optimal model size N*, token count D*, and training expenditure in a compute-bound regime (where optimal N* and D* scale near-linearly with hardware efficiency E, yielding sub-quadratic cost scaling in E) and a data-bound regime (where expenditure scales as D²/E). The analysis also compares current industry trends to the model's predictions under different assumptions about hardware progress.
Significance. If the central modeling choices are accepted, the work supplies an explicit theoretical framework for profit-optimal scaling decisions that can be directly compared to observed training runs and used for long-term capital planning. The explicit derivations from standard optimization and the trend-consistency checks are strengths that allow falsifiable predictions; the absence of empirical calibration for the adoption threshold is the primary limitation on immediate applicability.
major comments (2)
- [§3] §3 (Consumer adoption model): The revenue term is constructed from a fixed quality threshold q* that does not depend on price, competition, or network effects. This functional form is load-bearing for the first-order conditions that produce the claimed near-linear scaling of N* and D* with E and the D²/E expenditure law; the manuscript provides neither empirical justification for q* nor sensitivity analysis under alternative (e.g., price-sensitive) adoption functions.
- [§4.2, §5.1] §4.2 and §5.1 (Compute-bound derivations): The sub-quadratic cost scaling in E follows directly from substituting the optimal N*(E) and D*(E) into the cost function under the stated adoption and cost assumptions. Because these assumptions are introduced without independent measurement, the exponents are model-dependent rather than robust predictions; the paper does not report how the scalings change when the threshold is made endogenous.
minor comments (2)
- [§2] Notation for hardware efficiency E (FLOPs/$) is introduced without an explicit definition of the constant factors relating FLOPs to dollars; a short appendix clarifying the conversion would improve reproducibility.
- [§6] The trend-comparison section would benefit from a table listing the exact parameter values used for the 'permissive' versus 'stalling hardware' scenarios.
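The first minor comment asks how FLOPs are converted into dollars. One common back-of-the-envelope conversion is sketched below. It assumes E is obtained from an accelerator's peak throughput, a utilization fraction, and an hourly price. These three inputs, the numeric values, and the function names are hypothetical placeholders, not quantities taken from the paper. The 6·n·d training FLOP count is the one that appears in the profit function quoted later on this page.

```python
SECONDS_PER_HOUR = 3600


def hardware_efficiency(peak_flops_per_s, utilization, dollars_per_hour):
    """E in FLOPs per dollar: useful FLOPs delivered in one hour / price of that hour."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    if dollars_per_hour <= 0.0:
        raise ValueError("dollars_per_hour must be positive")
    return peak_flops_per_s * utilization * SECONDS_PER_HOUR / dollars_per_hour


def training_cost_dollars(n_params, n_tokens, E):
    """Training cost in dollars for 6 * n_params * n_tokens FLOPs at E FLOPs per dollar."""
    return 6.0 * n_params * n_tokens / E


# Illustrative inputs only: 1e15 FLOP/s peak, 40% utilization, $2 per hour.
E = hardware_efficiency(1e15, 0.4, 2.0)
print(f"E = {E:.3e} FLOPs/$")
print(f"cost = ${training_cost_dollars(7e10, 1.4e12, E):,.0f}")
```

Making the utilization factor explicit matters because E changes linearly with it, so any reported exponent in E is only comparable across scenarios that fix the same conversion.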
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major comment point by point below, indicating where revisions will be made to strengthen the manuscript while defending the modeling choices on substantive grounds.
- Referee: [§3] §3 (Consumer adoption model): The revenue term is constructed from a fixed quality threshold q* that does not depend on price, competition, or network effects. This functional form is load-bearing for the first-order conditions that produce the claimed near-linear scaling of N* and D* with E and the D²/E expenditure law; the manuscript provides neither empirical justification for q* nor sensitivity analysis under alternative (e.g., price-sensitive) adoption functions.
  Authors: The fixed q* is introduced as a baseline modeling assumption that permits closed-form solutions for the profit-maximization problem, allowing explicit comparison of optimal scaling across regimes. We agree that the absence of sensitivity analysis limits the ability to assess robustness. In the revised manuscript we will add a dedicated subsection performing numerical sensitivity analysis under price-sensitive and competition-adjusted adoption functions, reporting the resulting changes (or lack thereof) to the near-linear N*(E), D*(E) and sub-quadratic cost scalings. Regarding empirical justification, the paper is theoretical and draws on stylized industry observations rather than new data; we will explicitly note this limitation and outline how q* could be calibrated in future empirical extensions.
  Revision: partial
- Referee: [§4.2, §5.1] §4.2 and §5.1 (Compute-bound derivations): The sub-quadratic cost scaling in E follows directly from substituting the optimal N*(E) and D*(E) into the cost function under the stated adoption and cost assumptions. Because these assumptions are introduced without independent measurement, the exponents are model-dependent rather than robust predictions; the paper does not report how the scalings change when the threshold is made endogenous.
  Authors: The reported scalings are derived under the model's stated assumptions and are presented as such to yield falsifiable predictions. We will revise §§4.2 and 5.1 to include an explicit analysis of an endogenous q* (for example, making the threshold depend on price or market share). We will derive or numerically evaluate the altered first-order conditions and report the resulting exponents for N*, D*, and total cost with respect to E, thereby clarifying the conditions under which the near-linear and sub-quadratic behaviors persist.
  Revision: yes
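The rebuttal promises to report exponents of N*, D*, and total cost with respect to E. One simple way to read such an exponent off numerically evaluated optima is a least-squares slope in log-log space, sketched below. The helper name is hypothetical, and the data are synthetic: the exponent 0.9 is a made-up test value standing in for a solver's output, not a result from the paper.

```python
import math


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x), i.e. k in y ~ c * x**k."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 points")
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    num = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    den = sum((u - mean_x) ** 2 for u in lx)
    return num / den


# Synthetic stand-in for optimal model sizes computed at several values of E.
E_grid = [10.0**k for k in range(15, 21)]
n_star = [3.0 * E**0.9 for E in E_grid]

slope = loglog_slope(E_grid, n_star)
print(round(slope, 6))
```

Running the same fit on optima obtained under a fixed and an endogenous threshold would show directly whether the near-linear behavior survives the change.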
Circularity Check
Derivations are direct implications of profit maximization under stated modeling assumptions
The paper constructs a theoretical economic model that combines established scaling laws for quality Q(N, D) with a profit function defined as (number of adopters exceeding fixed exogenous threshold q*) times price minus training costs. Optimal N* and D* (and resulting scalings with E and D) are obtained by solving the first-order conditions of this maximization in the compute-bound and data-bound regimes. These steps are mathematical consequences of the model's equations and assumptions rather than any reduction to self-definition, fitted inputs renamed as predictions, or load-bearing self-citations. The threshold q* and cost coefficients are introduced explicitly as primitives; the results follow from the optimization and do not feed back into the inputs. The derivation chain is therefore self-contained against external benchmarks and receives no circularity flags.
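One of the first-order conditions mentioned above can be written out as a worked step. The sketch starts from the profit function quoted in the Lean-theorems section of this page and takes the condition in t for fixed n and d. It assumes δ > 0 and an interior solution. Whether the paper solves the conditions in this order is not stated on this page.

```latex
\begin{align*}
\pi(n,d,t) &= \omega\, t\, f\bigl(q(n,d)\bigr) - \delta t^{2} - \frac{6nd + 2nt}{E} \\
\frac{\partial \pi}{\partial t} &= \omega f\bigl(q(n,d)\bigr) - 2\delta t - \frac{2n}{E} = 0 \\
t^{*}(n,d) &= \frac{\omega f\bigl(q(n,d)\bigr) - 2n/E}{2\delta} \\
\frac{\partial^{2} \pi}{\partial t^{2}} &= -2\delta < 0
\end{align*}
```

The negative second derivative confirms that t* is a maximum in t whenever the numerator is positive. The threshold enters only through f(q(n,d)), which is why the referee describes the adoption function as load-bearing for the resulting scalings.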
Axiom & Free-Parameter Ledger
free parameters (3)
- consumer quality threshold
- hardware efficiency E
- data limit D
axioms (2)
- domain assumption: LLM loss follows established scaling laws with parameters and tokens
- domain assumption: Firm is a rational profit maximizer
invented entities (1)
- consumer quality threshold: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation · washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: profit maximization problem ... π(n,d,t) = ω t · f(q(n,d)) − δ t² − (6 n d + 2 n t)/E
- IndisputableMonolith/Foundation/DimensionForcing · alexander_duality_circle_linking
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: Leontief scaling law q(n,d) = min{a n^α, b d^β} with α ≈ β ≈ 0.3
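The two quoted passages can be combined into a small executable sketch of the profit function. This is an illustration under stated assumptions, not the paper's implementation. The adoption function f is not specified on this page, so a logistic curve is used as a hypothetical stand-in. The constants a, b, q_mid, and width are placeholders. The reading of 6nd as training FLOPs and 2nt as inference FLOPs for t served tokens is an interpretation.

```python
import math


def leontief_quality(n, d, a=1.0, b=1.0, alpha=0.3, beta=0.3):
    """q(n, d) = min{a * n**alpha, b * d**beta}; a and b are placeholder constants."""
    return min(a * n**alpha, b * d**beta)


def adoption(q, q_mid=1000.0, width=200.0):
    """Hypothetical adoption share in (0, 1); a stand-in for the unspecified f."""
    return 1.0 / (1.0 + math.exp(-(q - q_mid) / width))


def profit(n, d, t, E, omega, delta, f=adoption):
    """pi(n, d, t) = omega*t*f(q(n, d)) - delta*t**2 - (6*n*d + 2*n*t)/E."""
    revenue = omega * t * f(leontief_quality(n, d))
    quadratic_term = delta * t**2
    compute_cost = (6.0 * n * d + 2.0 * n * t) / E
    return revenue - quadratic_term - compute_cost


# Tiny hand-checkable case with full adoption: 10*1*1 - 1*1 - (6 + 2)/1 = 1.
print(profit(1.0, 1.0, 1.0, E=1.0, omega=10.0, delta=1.0, f=lambda q: 1.0))
```

Passing f as an argument keeps the threshold assumption swappable, which is the kind of sensitivity check the referee report asks for.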
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.