The Economics of Model Collapse: Equilibrium, Welfare, and Optimal Provenance Subsidies in Synthetic Data Markets

Gustav Olaf Yunus Laitinen-Fredriksson Lundström-Imanov

arXiv: 2605.20279 · v1 · pith:2YWDHA23new · submitted 2026-05-19 · econ.GN · cs.CY · cs.LG · q-fin.EC

Pith reviewed 2026-05-21 02:05 UTC · model grok-4.3

classification econ.GN · cs.CY · cs.LG · q-fin.EC

keywords model collapse · synthetic data · provenance subsidy · welfare decomposition · contamination equilibrium · generative AI markets

The pith

In synthetic data markets, the welfare-maximizing provenance subsidy equals KL(q||p) divided by twice the collapse cost parameter kappa once the market settles into its contamination equilibrium.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds the first unified economic model of markets in which an increasing share of training data comes from prior generative models rather than human sources. Recursive use of this synthetic content produces measurable distributional drift known as model collapse. The author proves that such markets admit a generically unique Synthetic Data Contamination Equilibrium and decomposes social welfare into production benefits, consumption benefits, collapse losses, and information losses. From this decomposition the paper obtains closed-form expressions for the optimal provenance subsidy and the optimal watermark strength. It further supplies an iterative algorithm that converges to an epsilon-equilibrium while attaining, up to constants, the Cramer-Rao bound on provenance estimation, and it reports an estimated collapse rate on a ten-generation C4-synthetic benchmark that lies within one standard error of the structural prediction.

Core claim

Under the Synthetic Data Contamination Equilibrium the welfare-maximizing provenance subsidy takes the closed form s* = KL(q||p)/(2 kappa) and the welfare-maximizing watermark strength takes the form w* = (1 - psi) KL(q||p)/(2 kappa psi). These expressions follow from a welfare decomposition W = W_prod + W_cons - L_coll - L_info together with the mean-field limit of the contamination process governed by Wasserstein gradient flows. The same framework yields an impossibility result for information-constrained implementation and an algorithm that attains the Cramer-Rao bound up to constants while converging to an epsilon-equilibrium in O(epsilon^-2 log T) steps.
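
The two closed forms can be evaluated directly once KL(q||p), kappa, and psi are known. The following is a minimal sketch, not the paper's code: the function names are hypothetical, the inputs are the calibration values quoted in the simulated rebuttal further down this page (kappa = 1, psi = 0.75, KL(q||p) = 0.366), and the positivity checks on kappa and psi are an assumption added only to avoid division by zero.

```python
def optimal_subsidy(kl: float, kappa: float) -> float:
    """s* = KL(q||p) / (2 * kappa), the closed form quoted in the core claim."""
    if kappa <= 0:
        raise ValueError("kappa must be positive (assumption of this sketch)")
    return kl / (2.0 * kappa)


def optimal_watermark(kl: float, kappa: float, psi: float) -> float:
    """w* = (1 - psi) * KL(q||p) / (2 * kappa * psi)."""
    if kappa <= 0 or psi <= 0:
        raise ValueError("kappa and psi must be positive (assumption of this sketch)")
    return (1.0 - psi) * kl / (2.0 * kappa * psi)


# Calibration values as quoted in the simulated rebuttal on this page.
KL_QP, KAPPA, PSI = 0.366, 1.0, 0.75

s_star = optimal_subsidy(KL_QP, KAPPA)
w_star = optimal_watermark(KL_QP, KAPPA, PSI)
print(f"s* = {s_star:.3f}, w* = {w_star:.3f}")  # prints: s* = 0.183, w* = 0.061
```

Because w* = s* (1 - psi)/psi, both instruments scale the same way in KL(q||p) and kappa, and only psi changes their ratio.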

What carries the argument

The Synthetic Data Contamination Equilibrium (SDCE), the fixed point of producer and consumer optimization under recursive synthetic contamination, which serves as the platform for the welfare decomposition and the closed-form policy derivations.

If this is right

Welfare-maximizing provenance subsidies can be computed directly from the KL divergence between the synthetic and original distributions and the collapse cost parameter.
Watermark strength can be set as a direct complement to the subsidy to internalize both collapse and information externalities.
The Provenance-Market Iterative Retraining algorithm reaches near-equilibrium outcomes while satisfying the information-theoretic lower bound on provenance estimation.
Unregulated retraining produces a logarithmic decay in model quality whose coefficient matches the structural collapse rate of 0.183.
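
The collapse law quoted in the abstract, log Q_t = log Q_0 - 0.183 t rho^2, can be traced generation by generation. The sketch below is illustrative only: `quality_path` is a hypothetical helper, Q_0 = 1 is a normalization chosen here, and rho = 1 follows the calibration quoted in the simulated rebuttal.

```python
import math


def quality_path(q0: float, b: float, rho: float, generations: int) -> list[float]:
    """Return [Q_0, ..., Q_T] under log Q_t = log Q_0 - b * t * rho**2."""
    return [q0 * math.exp(-b * t * rho ** 2) for t in range(generations + 1)]


# Q_0 = 1.0 is an assumed normalization; b = 0.183 is the structural rate from the abstract.
path = quality_path(q0=1.0, b=0.183, rho=1.0, generations=10)
for t, q in enumerate(path):
    print(t, round(q, 4))
```

Under this form each generation multiplies quality by the same factor exp(-b rho^2), roughly 0.83 at the quoted values, so generation ten sits at about 16 percent of the normalized starting level in the unregulated case.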

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Regulators could plug observable divergence statistics into the closed-form expressions to set data-provenance payments in public AI training pools.
The collapse law implies that quality degradation compounds with each retraining cycle unless subsidies or watermarks are applied early.
The framework suggests that provenance verification costs should be weighed against the marginal welfare gain from the optimal subsidy when designing enforcement mechanisms.

Load-bearing premise

The market reaches the Synthetic Data Contamination Equilibrium whose existence and generic uniqueness are proved in the model.

What would settle it

An ordinary-least-squares estimate of the collapse-rate coefficient on repeated generations of synthetic data that lies statistically far from the structural prediction 0.183 would falsify the equilibrium and welfare results.
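
A rough way to read "statistically far" is a z-test of the estimated coefficient against the structural value. The sketch below uses the figures reported in the abstract (estimate 0.181, HAC s.e. 0.024, prediction 0.183). The helper name is hypothetical, and the normal approximation is an assumption that is crude with only ten generations.

```python
import math


def z_test(estimate: float, hypothesized: float, std_error: float) -> tuple[float, float]:
    """Two-sided z statistic and normal-approximation p-value."""
    z = (estimate - hypothesized) / std_error
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Figures as reported in the abstract; the test itself is not taken from the paper.
z, p = z_test(estimate=0.181, hypothesized=0.183, std_error=0.024)
print(f"z = {z:.3f}, p = {p:.3f}")
```

An absolute z below 1 is what "within one standard error" means. A falsifying replication would need an absolute z well beyond a conventional critical value such as 1.96.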

Original abstract

Generative artificial intelligence is rapidly transforming the supply side of training data: an increasing share of new tokens, images, and structured records is produced by previous-generation models rather than by human originators. Recursive training on such synthetic content induces a measurable and often irreversible loss of distributional fidelity, a phenomenon known as model collapse. We develop the first unified microeconomic theory of synthetic data markets under model collapse. We introduce the Synthetic Data Contamination Equilibrium (SDCE), prove existence and generic uniqueness, derive a welfare decomposition W = W_prod + W_cons - L_coll - L_info, establish a Wasserstein-gradient-flow mean-field collapse limit, prove an impossibility of information-constrained implementation, and obtain closed-form expressions for the welfare-maximizing provenance subsidy s* = KL(q||p)/(2 kappa) and the welfare-maximizing watermark strength w* = (1 - psi) KL(q||p)/(2 kappa psi). We prove an information-theoretic Cramer-Rao lower bound on any provenance estimator using only producer-side observations and show that the Provenance-Market Iterative Retraining (PMIR) algorithm attains this bound up to constants while converging to an epsilon-SDCE in O(epsilon^-2 log T) iterations. A reduced-form OLS estimation on a C4-synthetic benchmark over ten retraining generations yields a collapse-rate coefficient b-hat = 0.181 (HAC s.e. 0.024), within one standard error of the structural prediction 0.183. Calibrated experiments raise generation-ten model quality by 23.1 percent over the unregulated benchmark while lowering the 2-Wasserstein drift on a held-out diversity probe from 0.318 to 0.142. Scaling experiments over generations t in {1,...,10} recover a logarithmic-in-t collapse law log Q_t = log Q_0 - 0.183 t rho^2 with R^2 = 0.962.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper builds a microeconomic model of synthetic data markets with closed-form subsidies and a new equilibrium, but the structural prediction matches the OLS fit so closely that it looks more like calibration than independent validation.

Colleague, the main thing to know is that this work frames model collapse as an equilibrium in a synthetic data market and derives closed-form optimal provenance subsidies and watermarks from a welfare decomposition and mean-field limit. They introduce the Synthetic Data Contamination Equilibrium, prove existence and generic uniqueness, split welfare into production, consumption, collapse loss, and information loss terms, and obtain s* = KL(q||p)/(2 kappa) along with the corresponding watermark expression. They also give a PMIR algorithm that converges to an epsilon-SDCE and a Cramer-Rao bound on provenance estimation. The C4 experiments report a collapse coefficient of 0.181 that sits within one standard error of the structural value 0.183, plus a 23 percent quality lift at generation ten and a clean log collapse law with high R-squared. That is the actual contribution on the page.

The framing is new for the economics side and the policy angle on subsidies is direct. The attempt to connect the theory to a concrete benchmark is better than most pure theory papers in this area.

The soft spots sit mainly in the empirical identification and the approximation quality. The structural and reduced-form coefficients are nearly identical, which raises the possibility that the model was aligned to the same data rather than delivering a genuine out-of-sample prediction. The stress-test concern about the Wasserstein mean-field limit versus a discrete ten-generation process is worth checking, because any material deviation would weaken both the optimality claims and the reported welfare gains. The welfare decomposition and the assumption that the market settles at SDCE are load-bearing, so the full proofs need scrutiny on those steps. Free parameters like kappa and psi also require clearer justification from primitives.

This paper is for economists who work on AI data supply and for readers who want a policy handle on provenance and synthetic content. Someone interested in formal models of training-data sustainability will find usable pieces even if the empirics need tightening.

It deserves a serious referee. The equilibrium concept and closed-form policies are novel enough that external eyes should examine the derivations and the identification strategy rather than desk-rejecting on the abstract alone.

Referee Report

2 major / 2 minor

Summary. The manuscript develops the first unified microeconomic theory of synthetic data markets under model collapse. It introduces the Synthetic Data Contamination Equilibrium (SDCE) and proves its existence and generic uniqueness, derives the welfare decomposition W = W_prod + W_cons - L_coll - L_info, establishes a Wasserstein-gradient-flow mean-field collapse limit, proves an impossibility result for information-constrained implementation, obtains closed-form welfare-maximizing provenance subsidy s* = KL(q||p)/(2 kappa) and watermark strength w* = (1 - psi) KL(q||p)/(2 kappa psi), proves a Cramer-Rao lower bound on provenance estimators, shows that the PMIR algorithm attains the bound up to constants while converging to an epsilon-SDCE, and reports reduced-form OLS results on a C4-synthetic benchmark over ten generations with collapse-rate coefficient 0.181 (within one SE of the structural value 0.183) together with a 23.1 percent quality improvement under the optimal subsidy.

Significance. If the derivations are robust and the mean-field limit accurately approximates the finite discrete dynamics, the paper supplies the first formal equilibrium and welfare framework for regulating synthetic data markets, with explicit policy instruments (provenance subsidies and watermarks) and an implementable algorithm. The combination of existence/uniqueness proofs, closed-form optima, information-theoretic bounds, and calibrated empirical results on a standard benchmark would constitute a substantial contribution to the economics of AI and data production.

major comments (2)

Abstract and empirical section: the structural collapse-rate prediction of 0.183 is reported as being within one SE of the OLS estimate 0.181 obtained from the identical C4-synthetic benchmark. Please supply the explicit first-principles derivation of the numerical value 0.183 from the model primitives (kappa, psi, KL(q||p), etc.) that is independent of the regression, so that the match can be evaluated as a genuine prediction rather than a post-hoc alignment.
Welfare-maximizing expressions (abstract) and Wasserstein-gradient-flow mean-field limit: the closed forms s* = KL(q||p)/(2 kappa) and w* rest on the market settling at the SDCE and on the continuous mean-field limit. The experiments employ a finite discrete retraining process over t = 10 generations; a direct comparison or error bound between the mean-field trajectory and the discrete PMIR path at small t is required to substantiate the claimed optimality and the 23.1 percent quality lift.

minor comments (2)

Notation: define the free parameters kappa and psi, the distributions p and q, and the precise form of the welfare decomposition before their first appearance in the closed-form results.
Empirical reporting: the scaling experiments that recover log Q_t = log Q_0 - 0.183 t rho^2 with R^2 = 0.962 should report robustness to alternative seeds, different synthetic-data generators, or alternative diversity probes.
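
The seed-robustness check requested in the last minor comment could look roughly like the following. Everything here is simulated: the data-generating process, the noise level of 0.05, and the helper names are assumptions for illustration, not the paper's C4 pipeline. The slope is plain OLS without the HAC correction the paper reports.

```python
import random


def ols_slope(xs: list[float], ys: list[float]) -> float:
    """Slope of the least-squares line y = a + b * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


def simulate_collapse_rate(seed: int, true_rate: float = 0.183,
                           noise_sd: float = 0.05, generations: int = 10) -> float:
    """Estimate the collapse rate from one noisy simulated run (hypothetical DGP)."""
    rng = random.Random(seed)
    ts = [float(t) for t in range(1, generations + 1)]
    log_q = [-true_rate * t + rng.gauss(0.0, noise_sd) for t in ts]
    return -ols_slope(ts, log_q)  # sign flipped so a decaying series gives a positive rate


# Re-estimate under 20 different seeds and summarize the spread.
estimates = [simulate_collapse_rate(seed) for seed in range(20)]
spread = max(estimates) - min(estimates)
print(f"mean = {sum(estimates) / len(estimates):.3f}, range = {spread:.3f}")
```

Reporting the spread across seeds next to the point estimate would show whether a single run's agreement with 0.183 is typical or lucky. The same loop could iterate over generators or diversity probes instead of seeds.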

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed report. We address each major comment below and will revise the manuscript to improve clarity on the theoretical predictions and their relation to the finite-horizon experiments.

Referee: Abstract and empirical section: the structural collapse-rate prediction of 0.183 is reported as being within one SE of the OLS estimate 0.181 obtained from the identical C4-synthetic benchmark. Please supply the explicit first-principles derivation of the numerical value 0.183 from the model primitives (kappa, psi, KL(q||p), etc.) that is independent of the regression, so that the match can be evaluated as a genuine prediction rather than a post-hoc alignment.

Authors: We agree that the presentation should make the independence from the regression fully transparent. The value 0.183 is obtained by substituting the C4-calibrated primitives (kappa = 1, psi = 0.75, KL(q||p) = 0.366, rho = 1) into the closed-form coefficient of the logarithmic collapse law that follows from the Wasserstein-gradient-flow mean-field limit of the SDCE (Theorem 3.4). We will add an explicit derivation in a new subsection of the empirical section that computes this number step-by-step from the primitives alone, without any reference to the OLS estimates, so that readers can verify it as an a-priori prediction. revision: yes
Referee: Welfare-maximizing expressions (abstract) and Wasserstein-gradient-flow mean-field limit: the closed forms s* = KL(q||p)/(2 kappa) and w* rest on the market settling at the SDCE and on the continuous mean-field limit. The experiments employ a finite discrete retraining process over t = 10 generations; a direct comparison or error bound between the mean-field trajectory and the discrete PMIR path at small t is required to substantiate the claimed optimality and the 23.1 percent quality lift.

Authors: The referee correctly identifies a gap between the asymptotic mean-field analysis and the finite-t experiments. While the paper proves convergence of the discrete PMIR dynamics to the mean-field limit as t grows, it does not supply a quantitative error bound or side-by-side trajectory comparison for t = 10. We will add an appendix section that (i) derives a non-asymptotic Wasserstein-distance bound between the discrete and continuous paths and (ii) reports a direct numerical comparison of the two trajectories on the C4 benchmark for generations 1 through 10, confirming that the optimality claims remain valid within the reported error tolerance at this horizon. revision: yes
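
The arithmetic behind the quoted calibration can be written out as a check. This is a sketch under a stated assumption, not the derivation promised in the rebuttal: this page does not give the closed form of the collapse coefficient from Theorem 3.4. The display below therefore records only two things. With the quoted primitives, the subsidy expression KL(q||p)/(2 kappa) evaluates to the same number as the structural collapse rate, and rho = 1 leaves the rho^2 factor in the collapse law equal to one.

```latex
\[
\frac{\mathrm{KL}(q\,\|\,p)}{2\kappa} = \frac{0.366}{2 \times 1} = 0.183,
\qquad
0.183\,\rho^{2}\Big|_{\rho = 1} = 0.183 .
\]
```

Whether the Theorem 3.4 coefficient actually takes the form KL(q||p)/(2 kappa) is not stated here. If it does, the referee's request turns on whether KL(q||p) = 0.366 was measured independently of the regression sample.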

Circularity Check

1 step flagged

Structural collapse-rate 'prediction' of 0.183 reduces to OLS fit on identical C4 benchmark

specific steps

fitted input called prediction [Abstract]
"A reduced-form OLS estimation on a C4-synthetic benchmark over ten retraining generations yields a collapse-rate coefficient b-hat = 0.181 (HAC s.e. 0.024), within one standard error of the structural prediction 0.183. ... Scaling experiments over generations t in {1,...,10} recover a logarithmic-in-t collapse law log Q_t = log Q_0 - 0.183 t rho^2 with R^2 = 0.962."

The paper presents 0.183 as the first-principles structural rate from the mean-field collapse limit, yet the identical numerical value is recovered by fitting the logarithmic law directly to the C4 benchmark; the OLS estimate on the same data is then reported as 'within one standard error' of this value, making the match and the claimed 23.1% quality lift tautological rather than an out-of-sample test of the theory.

full rationale

The closed-form subsidy and watermark expressions derive from the SDCE existence proof, additive welfare decomposition, and Wasserstein mean-field limit; these steps are self-contained and do not reduce to the empirical benchmark. However, the central empirical claim equates a 'structural prediction' of 0.183 to the reduced-form OLS coefficient 0.181 obtained from the same C4-synthetic data over ten generations, while the scaling experiments recover the exact same coefficient in the fitted logarithmic law. This constitutes a fitted-input-called-prediction pattern in which the reported match and quality-lift calculations are statistically forced by construction rather than independently validated.
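
One way to separate a prediction from a fit is a hold-out check: fix the coefficient before looking at the later generations, then score it only on those generations. The sketch below is a hypothetical illustration on simulated data, not the paper's procedure. The series is generated with rate 0.183 and noise 0.05 by construction, and the rival rate of 0.100 is an arbitrary comparison value.

```python
import math
import random


def holdout_rmse(log_q: dict[int, float], rate: float,
                 log_q0: float, holdout_start: int) -> float:
    """RMSE of log Q_t = log_q0 - rate * t on generations >= holdout_start."""
    errors = [
        log_q[t] - (log_q0 - rate * t)
        for t in sorted(log_q)
        if t >= holdout_start
    ]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Simulated observations for generations 1..10 (assumed DGP, not C4 results).
rng = random.Random(0)
observed = {t: -0.183 * t + rng.gauss(0.0, 0.05) for t in range(1, 11)}

# Both coefficients are fixed in advance; only generations 6..10 are scored.
rmse_structural = holdout_rmse(observed, rate=0.183, log_q0=0.0, holdout_start=6)
rmse_rival = holdout_rmse(observed, rate=0.100, log_q0=0.0, holdout_start=6)
print(f"structural: {rmse_structural:.3f}, rival: {rmse_rival:.3f}")
```

Here the structural rate wins by construction, because the simulated data were generated from it. On real benchmark data the comparison is informative only if the structural value was computed without access to the held-out generations.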

Axiom & Free-Parameter Ledger

2 free parameters · 3 axioms · 2 invented entities

The model introduces parameters kappa and psi whose values are not independently sourced, plus new equilibrium and algorithmic constructs whose validity rests on unverified proofs and the benchmark fit.

free parameters (2)

kappa
Scaling or cost parameter appearing in the denominator of the optimal subsidy s* = KL(q||p)/(2 kappa).
psi
Parameter in the optimal watermark strength formula w* = (1 - psi) KL(q||p)/(2 kappa psi), likely tied to detection or information constraints.

axioms (3)

domain assumption: Existence and generic uniqueness of the Synthetic Data Contamination Equilibrium (SDCE)
Invoked to support welfare analysis and optimal policy derivations.
domain assumption: Welfare can be additively decomposed as W = W_prod + W_cons - L_coll - L_info
Central structural assumption enabling closed-form welfare maximization.
domain assumption: Collapse dynamics admit a Wasserstein-gradient-flow mean-field limit
Used to derive the scaling law and limit behavior.

invented entities (2)

Synthetic Data Contamination Equilibrium (SDCE) (no independent evidence)
purpose: Characterize stable market state under recursive synthetic training
Newly defined equilibrium concept central to all results.
Provenance-Market Iterative Retraining (PMIR) algorithm (no independent evidence)
purpose: Attain Cramer-Rao bound and converge to epsilon-SDCE
New iterative procedure proposed to implement the theory.

pith-pipeline@v0.9.0 · 5915 in / 1781 out tokens · 77817 ms · 2026-05-21T02:05:36.254343+00:00 · methodology


[1] [1]

AI models collapse when trained on recursively generated data,

I. Shumailov, Z. Shumaylov, Y . Zhao, N. Papernot, R. Anderson, and Y . Gal, “AI models collapse when trained on recursively generated data,” Nature, vol. 631, no. 8022, pp. 755–759, 2024

work page 2024

[2] [2]

Self-consuming generative models go MAD,

S. Alemohammad, J. Casco-Rodriguez, L. Luzi, A. I. Humayun, H. Babaei, D. LeJeune, A. Siahkoohi, and R. G. Baraniuk, “Self-consuming generative models go MAD,” in Proc. International Conference on Learning Representations, 2024

work page 2024

[3] [3]

On the stability of iterative retraining of generative models,

Q. Bertrand, A. J. Bose, A. Duplessis, M. Jiralerspong, and G. Gidel, “On the stability of iterative retraining of generative models,” in Proc. International Conference on Learning Representations, 2024

work page 2024

[4] [4]

arXiv preprint , volume =

M. Briesch, D. Sobania, and F. Rothlauf, “Large language models suffer from their own output: An analysis of the self-consuming training loop,” arXiv:2311.16822, 2023

work page arXiv 2023

[5] [5]

A tale of tails: Model collapse as a change of scaling laws,

E. Dohmatob, Y . Feng, P. Yang, F. Charton, and J. Kempe, “A tale of tails: Model collapse as a change of scaling laws,” in Proc. International Conference on Machine Learning, 2024

work page 2024

[6] [6]

Is model collapse inevitable? Breaking the curse of recursion by accumulating real and synthetic data,

M. Gerstgrasser, R. Schaeffer, A. Dey, R. Rafailov, H. Sleight, J. Hughes, T. Korbak, R. Agrawal, D. Pai, A. Gromov, D. A. Roberts, D. Yang, D. L. Donoho, and S. Koyejo, “Is model collapse inevitable? Breaking the curse of recursion by accumulating real and synthetic data,” arXiv:2404.01413, 2024

work page arXiv 2024

[7] [7]

Combining generative artificial intelligence (AI) and the internet: Heading towards evolution or degradation?,

G. Martinez, L. Watson, P. Reviriego, J. A. Hernandez, M. Juarez, and R. Sarkar, “Combining generative artificial intelligence (AI) and the internet: Heading towards evolution or degradation?,” arXiv:2303.01255, 2023

work page arXiv 2023

[8] [8]

Nonrivalry and the economics of data,

C. I. Jones and C. Tonetti, “Nonrivalry and the economics of data,” American Economic Review, vol. 110, no. 9, pp. 2819–2858, 2020

work page 2020

[9] [9]

Digital economics,

A. Goldfarb and C. Tucker, “Digital economics,” Journal of Economic Literature, vol. 57, no. 1, pp. 3–43, 2019

work page 2019

[10] [10]

The market for ‘lemons’: Quality uncertainty and the market mechanism,

G. A. Akerlof, “The market for ‘lemons’: Quality uncertainty and the market mechanism,” Quarterly Journal of Economics, vol. 84, no. 3, pp. 488–500, 1970

work page 1970

[11] [11]

Job market signaling,

M. Spence, “Job market signaling,” Quarterly Journal of Economics, vol. 87, no. 3, pp. 355–374, 1973

work page 1973

[12] [12]

Economic welfare and the allocation of resources for invention,

K. J. Arrow, “Economic welfare and the allocation of resources for invention,” in The Rate and Direction of Inventive Activity: Economic and Social Factors. Princeton, NJ: Princeton Univ. Press, 1962, pp. 609– 626

work page 1962

[13] [13]

A theory of production,

C. W. Cobb and P. H. Douglas, “A theory of production,” American Economic Review, vol. 18, no. 1, pp. 139–165, 1928

work page 1928

[14] [14]

The race between man and machine: Implications of technology for growth, factor shares, and employment,

D. Acemoglu and P. Restrepo, “The race between man and machine: Implications of technology for growth, factor shares, and employment,” American Economic Review, vol. 108, no. 6, pp. 1488–1542, 2018

work page 2018

[15] [15]

On the Opportunities and Risks of Foundation Models

R. Bommasani et al., “On the opportunities and risks of foundation models,” arXiv:2108.07258, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[16] [16]

Foundation models and fair use,

P. Henderson, X. Li, D. Jurafsky, T. Hashimoto, M. A. Lemley, and P. Liang, “Foundation models and fair use,” Journal of Machine Learning Research, vol. 24, no. 400, pp. 1–79, 2023

work page 2023

[17] N. Carlini, F. Tramèr, E. Wallace, M. Jagielski, A. Herbert-Voss, K. Lee, A. Roberts, T. Brown, D. Song, Ú. Erlingsson, A. Oprea, and C. Raffel, “Extracting training data from large language models,” in Proc. USENIX Security Symposium, 2021, pp. 2633–2650

[18] N. Carlini, D. Ippolito, M. Jagielski, K. Lee, F. Tramèr, and C. Zhang, “Quantifying memorization across neural language models,” in Proc. International Conference on Learning Representations, 2023

[19] M. Nasr, N. Carlini, J. Hayase, M. Jagielski, A. F. Cooper, D. Ippolito, C. A. Choquette-Choo, E. Wallace, F. Tramèr, and K. Lee, “Scalable extraction of training data from (production) language models,” arXiv:2311.17035, 2023

[20] J. Kirchenbauer, J. Geiping, Y. Wen, J. Katz, I. Miers, and T. Goldstein, “A watermark for large language models,” in Proc. International Conference on Machine Learning, 2023, pp. 17061–17084

[21] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling laws for neural language models,” arXiv:2001.08361, 2020

[22] J. Hoffmann et al., “Training compute-optimal large language models,” arXiv:2203.15556, 2022

[23] T. Brown et al., “Language models are few-shot learners,” in Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 1877–1901

[24] J. F. Nash, “Equilibrium points in n-person games,” Proceedings of the National Academy of Sciences, vol. 36, no. 1, pp. 48–49, 1950

[25] K. J. Arrow and G. Debreu, “Existence of an equilibrium for a competitive economy,” Econometrica, vol. 22, no. 3, pp. 265–290, 1954

[26] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. Hoboken, NJ: Wiley-Interscience, 2006

[27] C. Dwork and A. Roth, “The algorithmic foundations of differential privacy,” Foundations and Trends in Theoretical Computer Science, vol. 9, no. 3–4, pp. 211–407, 2014

[28] N. Maslej et al., “The AI Index 2025 Annual Report,” AI Index Steering Committee, Institute for Human-Centered AI, Stanford University, Stanford, CA, Apr. 2025. [Online]. Available: https://aiindex.stanford.edu/report/

[29] L. S. Shapley, “A value for n-person games,” in Contributions to the Theory of Games, vol. II, H. W. Kuhn and A. W. Tucker, Eds. Princeton, NJ: Princeton Univ. Press, 1953, pp. 307–317

[30] A. Ghorbani and J. Zou, “Data Shapley: Equitable valuation of data for machine learning,” in Proc. International Conference on Machine Learning, vol. 97, 2019, pp. 2242–2251

[31] C. Villani, Optimal Transport: Old and New, Grundlehren der mathematischen Wissenschaften, vol. 338. Berlin, Heidelberg: Springer, 2009

[32] J.-M. Lasry and P.-L. Lions, “Mean field games,” Japanese Journal of Mathematics, vol. 2, no. 1, pp. 229–260, 2007

[33] G. Peyré and M. Cuturi, “Computational optimal transport,” Foundations and Trends in Machine Learning, vol. 11, no. 5–6, pp. 355–607, 2019

[34] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems, vol. 27, 2014, pp. 2672–2680

[35] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” in Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 6840–6851

[36] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” Journal of Machine Learning Research, vol. 21, no. 140, pp. 1–67, 2020

[37] G. Penedo, Q. Malartic, D. Hesslow, R. Cojocaru, A. Cappelli, H. Alobeidli, B. Pannier, E. Almazrouei, and J. Launay, “The RefinedWeb dataset for Falcon LLM: Outperforming curated corpora with web data, and web data only,” in Advances in Neural Information Processing Systems, vol. 36, 2023, pp. 79155–79172