arxiv: 2605.16430 · v1 · pith:JSPXI5O2new · submitted 2026-05-14 · 💻 cs.LG · cs.AI

A Theory of Training Profit-Optimal LLMs

Pith reviewed 2026-05-20 20:08 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords LLM training · scaling laws · profit maximization · economic model · hardware efficiency · compute-bound regime · data-bound regime · training expenditure

The pith

In the compute-bound regime, optimal LLM model size and token budget scale nearly linearly with hardware efficiency, making total training cost grow sub-quadratically.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs an economic model of a profit-maximizing LLM training firm that combines scaling laws relating parameters and tokens to model quality with a simple adoption rule in which consumers use the model only if quality exceeds a fixed threshold. Under this setup the firm chooses model size and training tokens to maximize revenue from adoption minus the costs of compute and data. In the regime limited only by compute, the optimal size and token count rise almost in direct proportion to hardware efficiency measured in FLOPs per dollar, so that total training cost grows less than quadratically with efficiency gains. When data volume D is instead the binding constraint, the profit-maximizing expenditure scales as D squared divided by hardware efficiency.
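
The setup described above can be illustrated with a small numerical sketch. Everything specific in it is an assumption of the sketch rather than the paper's model: the Chinchilla-style loss constants are the published fit from Hoffmann et al. (2022), training compute is approximated as 6ND FLOPs, and the adoption curve (`adoption_share`), the market value, and the search grid are hypothetical placeholders chosen only to make the trade-off visible.

```python
import numpy as np

# Chinchilla-style reducible loss A/N^alpha + B/D^beta. The constants are the
# fit reported by Hoffmann et al. (2022), used here purely as illustrative inputs.
A, ALPHA, B, BETA = 406.4, 0.34, 410.7, 0.28


def reducible_loss(n, d):
    return A * n ** -ALPHA + B * d ** -BETA


def adoption_share(n, d, k=5.0):
    # HYPOTHETICAL adoption curve: the share of consumers whose quality
    # threshold is met rises toward 1 as the reducible loss falls toward 0.
    return np.exp(-k * reducible_loss(n, d))


def best_plan(flops_per_dollar, market_value=1e9):
    """Grid-search the (N, D) pair that maximizes revenue minus training cost."""
    n_grid = np.logspace(8, 13, 121)   # parameters
    d_grid = np.logspace(9, 14, 121)   # training tokens
    n, d = np.meshgrid(n_grid, d_grid, indexing="ij")
    cost = 6.0 * n * d / flops_per_dollar          # dollars, assuming 6ND FLOPs
    profit = market_value * adoption_share(n, d) - cost
    i, j = np.unravel_index(np.argmax(profit), profit.shape)
    return n[i, j], d[i, j], cost[i, j], profit[i, j]


for e in (1e17, 1e18, 1e19):
    n_opt, d_opt, cost_opt, profit_opt = best_plan(e)
    print(f"E={e:.0e}  N*={n_opt:.2e}  D*={d_opt:.2e}  cost=${cost_opt:.2e}")
```

Because revenue here does not depend on E while cost is divided by it, the compute chosen at the optimum cannot fall as E rises. The exponents this toy produces depend entirely on the placeholder adoption curve and should not be read as the paper's results.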

Core claim

The central claim is that rational profit maximization leads to specific scaling relationships: in compute-bound settings optimal model size and tokens track hardware efficiency E nearly linearly while total cost scales sub-quadratically in E; data-efficiency gains encourage larger models and higher spending; and in the data-bound regime with fixed D the optimal expenditure scales as D squared over E.

What carries the argument

The profit-maximization problem that integrates scaling laws for loss with a fixed quality threshold governing consumer adoption.

If this is right

  • Data-efficiency improvements cause the firm to select larger models and higher total training expenditure.
  • In the data-limited regime profit-optimal spending rises with the square of available data and falls with hardware or data efficiency.
  • Observed industry trends match the model's permissive variants under compute bounds but diverge from predictions under data bounds or stalled hardware progress.
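
As a quick arithmetic check on the data-bound implication above, the following sketch shows what a D²/E law means for relative spending. It assumes, for illustration only, that the chosen model size is proportional to the available data and that training compute is about 6ND FLOPs; the function name and the `params_per_token` constant are hypothetical and not taken from the paper.

```python
def data_bound_spend(tokens, flops_per_dollar, params_per_token=0.05):
    """Dollar cost under the ASSUMED rule N = params_per_token * D and 6*N*D FLOPs."""
    params = params_per_token * tokens
    return 6.0 * params * tokens / flops_per_dollar


base = data_bound_spend(1e13, 1e18)
ratio_double_data = data_bound_spend(2e13, 1e18) / base      # D doubled
ratio_double_hardware = data_bound_spend(1e13, 2e18) / base  # E doubled
print(ratio_double_data, ratio_double_hardware)
```

Under these assumptions, doubling D multiplies spending by four and doubling E halves it, which is the qualitative pattern the bullet describes.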

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Continued hardware-efficiency growth would make progressively larger models cheaper relative to their quality gains, potentially shortening the time between capability thresholds.
  • If adoption thresholds prove sensitive to price or competing models, the optimal training expenditure could shift downward from the levels predicted here.
  • The same framework could be used to compare profit-optimal choices against current public statements about planned training runs.

Load-bearing premise

Consumer adoption depends only on whether model quality exceeds a fixed threshold that does not change with price, competition, or network effects.

What would settle it

Direct observation that real training expenditures fail to increase sub-quadratically with measured hardware-efficiency gains or quadratically with available data volume would falsify the predicted profit-optimal behavior.
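
One way such a check could be run is to fit the exponent of spending against hardware efficiency on a log-log scale and compare it with 2. The sketch below does this with `numpy.polyfit` on synthetic data; the generating exponent of 1.6 and the prefactor are arbitrary placeholders, not measurements or values from the paper.

```python
import numpy as np


def fitted_exponent(x, y):
    """Slope of log(y) against log(x), i.e. the exponent b in y ~ a * x**b."""
    slope, _intercept = np.polyfit(np.log(x), np.log(y), 1)
    return slope


# SYNTHETIC data for illustration only: spend = a * E**1.6 with no noise.
efficiency = np.logspace(17, 19, 9)   # FLOPs per dollar
spend = 3.0e-20 * efficiency ** 1.6   # dollars
exponent = fitted_exponent(efficiency, spend)
print(exponent, exponent < 2.0)
```

With real expenditure data, a fitted exponent at or above 2 (beyond its uncertainty) would count against the sub-quadratic prediction; the same function applied to spend versus available data would test the quadratic-in-D prediction.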

Figures

Figures reproduced from arXiv: 2605.16430 by Sophie Hao, William Merrill.

Figure 1: Our model predicts that a profit-maximizing LLM firm scales its LLM training expenditures
Figure 2: Inverse demand functions for tokens generated by an LLM of quality
Figure 3: Inverse demand linking functions are parameterized by
Figure 4: A scaling law’s elasticity of substitution
Original abstract

Scaling LLMs requires tremendous computational resources, and recent advances in AI have gone hand in hand with massive amounts of capital expenditure. While it is established that scaling up LLMs reliably increases model quality (quantified in terms of loss or downstream evaluations), it is unclear how these quality improvements translate to potential revenue, and whether revenue increases would offset costs of larger-scale training and inference. In this work, we develop an economic model for characterizing the rational behavior of an LLM training firm by combining scaling laws with microeconomic theory. Under our model of firm behavior, LLM quality can be increased with more parameters and training tokens, leading to more potential adoption by consumers, who each have a quality threshold for using the LLM. On the other hand, additional parameters and training tokens both incur additional costs. We analyze the profit maximization problem for this model under compute-bound and data-bound regimes. In the compute-bound regime, optimal model size and token budget track hardware efficiency $E$ (FLOPs/\$) at a near-linear rate; total training cost then scales sub-quadratically in $E$. Data efficiency improvements incentivize larger models and training expenditure. When we are limited to $D$ data, profit-optimal training expenditure scales as $D^2/E$, i.e., it increases with data and decreases with hardware efficiency (as well as data efficiency). Finally, we analyze practical trends in training expenditure: current trends are consistent with our most permissive model variants in the compute-bound regime, but are not profit-optimal in the data-bound regime or assuming hardware advances will stall. Overall, our results provide a theory of profit-optimal LLM training, providing a foundation for engaging critically with industry statements and supporting long-term economic decision making.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper develops a microeconomic model of an LLM training firm that maximizes profit by combining neural scaling laws for model quality Q(N,D) with a consumer adoption function based on a fixed quality threshold q*. It derives closed-form expressions for profit-optimal model size N*, token count D*, and training expenditure in a compute-bound regime (where optimal N* and D* scale near-linearly with hardware efficiency E, yielding sub-quadratic cost scaling in E) and a data-bound regime (where expenditure scales as D²/E). The analysis also compares current industry trends to the model's predictions under different assumptions about hardware progress.

Significance. If the central modeling choices are accepted, the work supplies an explicit theoretical framework for profit-optimal scaling decisions that can be directly compared to observed training runs and used for long-term capital planning. The explicit derivations from standard optimization and the trend-consistency checks are strengths that allow falsifiable predictions; the absence of empirical calibration for the adoption threshold is the primary limitation on immediate applicability.

major comments (2)
  1. [§3] §3 (Consumer adoption model): The revenue term is constructed from a fixed quality threshold q* that does not depend on price, competition, or network effects. This functional form is load-bearing for the first-order conditions that produce the claimed near-linear scaling of N* and D* with E and the D²/E expenditure law; the manuscript provides neither empirical justification for q* nor sensitivity analysis under alternative (e.g., price-sensitive) adoption functions.
  2. [§4.2, §5.1] §4.2 and §5.1 (Compute-bound derivations): The sub-quadratic cost scaling in E follows directly from substituting the optimal N*(E) and D*(E) into the cost function under the stated adoption and cost assumptions. Because these assumptions are introduced without independent measurement, the exponents are model-dependent rather than robust predictions; the paper does not report how the scalings change when the threshold is made endogenous.
minor comments (2)
  1. [§2] Notation for hardware efficiency E (FLOPs/$) is introduced without an explicit definition of the constant factors relating FLOPs to dollars; a short appendix clarifying the conversion would improve reproducibility.
  2. [§6] The trend-comparison section would benefit from a table listing the exact parameter values used for the 'permissive' versus 'stalling hardware' scenarios.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment point by point below, indicating where revisions will be made to strengthen the manuscript while defending the modeling choices on substantive grounds.

point-by-point responses
  1. Referee: [§3] §3 (Consumer adoption model): The revenue term is constructed from a fixed quality threshold q* that does not depend on price, competition, or network effects. This functional form is load-bearing for the first-order conditions that produce the claimed near-linear scaling of N* and D* with E and the D²/E expenditure law; the manuscript provides neither empirical justification for q* nor sensitivity analysis under alternative (e.g., price-sensitive) adoption functions.

    Authors: The fixed q* is introduced as a baseline modeling assumption that permits closed-form solutions for the profit-maximization problem, allowing explicit comparison of optimal scaling across regimes. We agree that the absence of sensitivity analysis limits the ability to assess robustness. In the revised manuscript we will add a dedicated subsection performing numerical sensitivity analysis under price-sensitive and competition-adjusted adoption functions, reporting the resulting changes (or lack thereof) to the near-linear N*(E), D*(E) and sub-quadratic cost scalings. Regarding empirical justification, the paper is theoretical and draws on stylized industry observations rather than new data; we will explicitly note this limitation and outline how q* could be calibrated in future empirical extensions. revision: partial

  2. Referee: [§4.2, §5.1] §4.2 and §5.1 (Compute-bound derivations): The sub-quadratic cost scaling in E follows directly from substituting the optimal N*(E) and D*(E) into the cost function under the stated adoption and cost assumptions. Because these assumptions are introduced without independent measurement, the exponents are model-dependent rather than robust predictions; the paper does not report how the scalings change when the threshold is made endogenous.

    Authors: The reported scalings are derived under the model's stated assumptions and are presented as such to yield falsifiable predictions. We will revise §§4.2 and 5.1 to include an explicit analysis of an endogenous q* (for example, making the threshold depend on price or market share). We will derive or numerically evaluate the altered first-order conditions and report the resulting exponents for N*, D*, and total cost with respect to E, thereby clarifying the conditions under which the near-linear and sub-quadratic behaviors persist. revision: yes

Circularity Check

0 steps flagged

Derivations are direct implications of profit maximization under stated modeling assumptions

full rationale

The paper constructs a theoretical economic model that combines established scaling laws for quality Q(N, D) with a profit function defined as (number of consumers whose fixed exogenous quality threshold q* is met) times price minus training costs. Optimal N* and D* (and resulting scalings with E and D) are obtained by solving the first-order conditions of this maximization in the compute-bound and data-bound regimes. These steps are mathematical consequences of the model's equations and assumptions rather than any reduction to self-definition, fitted inputs renamed as predictions, or load-bearing self-citations. The threshold q* and cost coefficients are introduced explicitly as primitives; the results follow from the optimization and do not feed back into the inputs. The derivation chain is therefore self-contained against external benchmarks and receives no circularity flags.
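
The structure this rationale describes can be written schematically as follows. The notation is an assumption of this sketch, not the paper's: p is a price, A(Q) is the number of adopters at quality Q (taken here to be differentiable, which requires thresholds that vary across consumers), and C(N, D) is training cost.

```latex
\begin{aligned}
\pi(N, D) &= p \, A\bigl(Q(N, D)\bigr) - C(N, D), \\
\frac{\partial \pi}{\partial N} = 0
  &\iff p \, A'(Q) \, \frac{\partial Q}{\partial N} = \frac{\partial C}{\partial N}, \\
\frac{\partial \pi}{\partial D} = 0
  &\iff p \, A'(Q) \, \frac{\partial Q}{\partial D} = \frac{\partial C}{\partial D}.
\end{aligned}
```

Dividing the two conditions removes the revenue side entirely, leaving the marginal rate of substitution between N and D in the scaling law equal to the ratio of their marginal costs. In this schematic form, the adoption function sets the overall scale of spending, while the split between parameters and tokens is governed by the scaling law and the cost function.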

Axiom & Free-Parameter Ledger

3 free parameters · 2 axioms · 1 invented entity

The model rests on standard scaling-law forms plus a microeconomic profit objective; the main added elements are the consumer quality threshold and the explicit cost functions for parameters and tokens.

free parameters (3)
  • consumer quality threshold
    Determines the fraction of consumers who adopt the model; introduced to link model quality to revenue.
  • hardware efficiency E
    FLOPs per dollar; treated as an exogenous variable that scales optimal size and cost.
  • data limit D
    Caps the token budget in the data-bound regime; appears as an input parameter.
axioms (2)
  • domain assumption: LLM loss follows established scaling laws with parameters and tokens
    Invoked to translate model size and tokens into quality and therefore into adoption.
  • domain assumption: Firm is a rational profit maximizer
    Standard microeconomic premise used to set up the optimization problem.
invented entities (1)
  • consumer quality threshold (no independent evidence)
    purpose: Links model quality to market adoption and revenue
    New modeling device that converts scaling-law outputs into economic quantities; no independent empirical calibration supplied in abstract.


