The Curse of Recursion: Training on Generated Data Makes Models Forget
Pith reviewed 2026-05-20 13:58 UTC · model grok-4.3
The pith
Use of model-generated data in training causes irreversible loss of the tails of the original content distribution.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that use of model-generated content in training causes irreversible defects in the resulting models, where tails of the original content distribution disappear. The authors term this effect Model Collapse and show that it occurs in Variational Autoencoders, Gaussian Mixture Models and LLMs, building theoretical intuition and demonstrating its presence across learned generative models.
What carries the argument
Model Collapse, the progressive disappearance of tail mass from the learned distribution when training data is replaced by samples drawn from a previous model.
If this is right
- Models trained recursively on their own outputs will assign lower probability to events that were uncommon in the initial data.
- The diversity of outputs from successive model generations will shrink as tail coverage is lost.
- Data collected from genuine human interactions will retain higher value than synthetic content for sustaining model performance.
- Continued use of unfiltered web data will eventually degrade the capabilities of models trained on it, as the share of model-generated content in that data grows.
Where Pith is reading between the lines
- Detection and removal of synthetic text from training corpora may become a standard preprocessing step.
- The same collapse dynamic could appear in image or video generation pipelines that rely on web-scraped synthetic examples.
- Hybrid datasets that deliberately mix real and generated samples in controlled ratios could be tested to measure the rate of tail erosion.
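One way to probe the hybrid-dataset idea in the last bullet is a toy simulation. The sketch below is not from the paper: it assumes a Zipf-like categorical distribution over a hypothetical vocabulary, refits the model each generation from empirical frequencies, and mixes in a fixed fraction of fresh draws from the original distribution. The names `zipf_distribution`, `run_generations`, and `real_fraction` are illustrative.

```python
import random
from collections import Counter


def zipf_distribution(vocab_size):
    """Return Zipf-like probabilities: weight 1/rank, normalised to sum to 1."""
    weights = [1.0 / rank for rank in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def run_generations(real_fraction, generations=30, vocab_size=200,
                    samples=1000, seed=0):
    """Refit a categorical model each generation on a real/synthetic mix.

    Returns the number of tokens with non-zero probability after each
    generation (a crude proxy for tail coverage in this toy setting).
    """
    rng = random.Random(seed)
    tokens = list(range(vocab_size))
    original = zipf_distribution(vocab_size)
    model = original
    support_sizes = []
    for _ in range(generations):
        n_real = round(real_fraction * samples)
        # Fresh draws from the original ("real") distribution.
        data = rng.choices(tokens, weights=original, k=n_real)
        # Synthetic draws from the previous generation's model.
        data += rng.choices(tokens, weights=model, k=samples - n_real)
        counts = Counter(data)
        # Maximum-likelihood refit: empirical frequencies, no smoothing.
        model = [counts[t] / samples for t in tokens]
        support_sizes.append(sum(1 for p in model if p > 0))
    return support_sizes


if __name__ == "__main__":
    for fraction in (0.0, 0.1, 0.5):
        sizes = run_generations(fraction)
        print(f"real_fraction={fraction}: support after first/last "
              f"generation = {sizes[0]}/{sizes[-1]}")
```

With `real_fraction=0.0` the support can only shrink, because a token that receives a zero count gets zero probability and is never sampled again. Comparing the returned support sizes across several `real_fraction` values gives a rough picture of how mixing changes tail erosion in this toy setting.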
Load-bearing premise
Generated data enters training directly, without curation or mechanisms that would restore the missing low-probability regions of the original distribution.
What would settle it
Train one model on real data, then train successor models on data sampled from each predecessor and check whether the probability mass assigned to the original rare events steadily declines across generations.
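The test described above can be sketched in its simplest form with a one-dimensional Gaussian standing in for the model. This is an illustrative assumption, not the paper's experimental setup: each generation is fitted by maximum likelihood to a small sample drawn from its predecessor, and the helper `original_tail_mass` (a hypothetical name) reports how much probability the current model places outside two standard deviations of the original distribution.

```python
import random
from statistics import NormalDist, fmean, pstdev


def original_tail_mass(mu, sigma, mu0=0.0, sigma0=1.0, k=2.0):
    """Probability the model N(mu, sigma) assigns to the original tails,
    defined here as the region more than k * sigma0 away from mu0."""
    lo, hi = mu0 - k * sigma0, mu0 + k * sigma0
    if sigma == 0.0:
        # Degenerate point mass: all probability sits at mu.
        return 0.0 if lo <= mu <= hi else 1.0
    model = NormalDist(mu, sigma)
    return model.cdf(lo) + (1.0 - model.cdf(hi))


def recursive_fit(generations=300, samples=20, seed=0):
    """Fit generation t+1 to samples drawn from generation t.

    Returns a list of (mean, std, original_tail_mass) tuples, starting
    with the original distribution N(0, 1) at index 0.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0
    history = [(mu, sigma, original_tail_mass(mu, sigma))]
    for _ in range(generations):
        data = [rng.gauss(mu, sigma) for _ in range(samples)]
        # Maximum-likelihood estimates (pstdev divides by n, not n - 1).
        mu, sigma = fmean(data), pstdev(data)
        history.append((mu, sigma, original_tail_mass(mu, sigma)))
    return history


if __name__ == "__main__":
    history = recursive_fit()
    for generation in (0, 10, 50, 100, 300):
        mu, sigma, tail = history[generation]
        print(f"gen {generation:3d}: mean={mu:+.4f} std={sigma:.3e} "
              f"tail mass={tail:.3e}")
```

In a single run the fitted mean also wanders, so the tail-mass column is best read together with the standard deviation; averaging over many seeds would give a cleaner picture of the trend.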
Original abstract
Stable Diffusion revolutionised image creation from descriptive text. GPT-2, GPT-3(.5) and GPT-4 demonstrated astonishing performance across a variety of language tasks. ChatGPT introduced such language models to the general public. It is now clear that large language models (LLMs) are here to stay, and will bring about drastic change in the whole ecosystem of online text and images. In this paper we consider what the future might hold. What will happen to GPT-{n} once LLMs contribute much of the language found online? We find that use of model-generated content in training causes irreversible defects in the resulting models, where tails of the original content distribution disappear. We refer to this effect as Model Collapse and show that it can occur in Variational Autoencoders, Gaussian Mixture Models and LLMs. We build theoretical intuition behind the phenomenon and portray its ubiquity amongst all learned generative models. We demonstrate that it has to be taken seriously if we are to sustain the benefits of training from large-scale data scraped from the web. Indeed, the value of data collected about genuine human interactions with systems will be increasingly valuable in the presence of content generated by LLMs in data crawled from the Internet.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that recursive training of generative models on their own outputs produces 'model collapse,' in which the tails of the original data distribution are irreversibly lost. This is illustrated with theoretical derivations for VAEs and GMMs together with iterative training experiments on small LLMs that exhibit rising perplexity and falling output diversity; the authors argue the effect will become material once LLM-generated content dominates web-scale training corpora.
Significance. If the central mechanism is confirmed, the work identifies a structural risk to continued scaling of generative models that rely on uncurated web data. The explicit variance-shrinkage and support-reduction results for GMMs and VAEs supply clear mathematical grounding, while the LLM demonstrations point to measurable degradation; the findings would directly affect data-acquisition strategies and the long-term value of human-generated text.
major comments (2)
- [LLM experiments] LLM experiments section: the reported rise in perplexity and drop in lexical diversity are consistent with degradation but do not quantify whether low-frequency tokens or rare n-grams from the original corpus receive disproportionately lower probability or are eliminated. Without this measurement the link between observed defects and the claimed tail-loss mechanism remains unverified.
- [Theoretical analysis] Theoretical sections on GMMs and VAEs: the derivations show variance shrinkage and latent-space support reduction under recursive maximum-likelihood updates, yet the manuscript does not demonstrate that these effects are strictly irreversible once standard regularization or data-filtering steps are introduced.
minor comments (2)
- [Abstract] Abstract: the claim that collapse 'has to be taken seriously' would be strengthened by a brief quantitative statement of the scale of the LLM models and number of recursion steps used.
- [Introduction] Notation: define the precise meaning of 'tail mass' (e.g., probability below a fixed quantile) at first use so that later empirical claims can be checked against it.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address the major comments below, indicating where revisions will be made to strengthen the paper.
Point-by-point responses
- Referee: [LLM experiments] LLM experiments section: the reported rise in perplexity and drop in lexical diversity are consistent with degradation but do not quantify whether low-frequency tokens or rare n-grams from the original corpus receive disproportionately lower probability or are eliminated. Without this measurement the link between observed defects and the claimed tail-loss mechanism remains unverified.
  Authors: We appreciate the referee's suggestion to more directly link the experimental observations to the tail-loss mechanism. In the revised manuscript, we will add a new subsection in the LLM experiments that measures the log-probabilities of low-frequency tokens and rare n-grams from the original corpus. We will show that these rare elements receive progressively lower probabilities in recursively trained models, providing quantitative support for the claimed mechanism. This analysis will use the same experimental setup as the existing perplexity and diversity metrics.
  Revision: yes
- Referee: [Theoretical analysis] Theoretical sections on GMMs and VAEs: the derivations show variance shrinkage and latent-space support reduction under recursive maximum-likelihood updates, yet the manuscript does not demonstrate that these effects are strictly irreversible once standard regularization or data-filtering steps are introduced.
  Authors: The theoretical analysis isolates the effect under standard maximum-likelihood estimation to highlight the fundamental issue. To address the referee's point, we will include an extended discussion and additional derivations showing the effects under L2 regularization. While regularization slows the rate of collapse, the support reduction in the latent space and variance shrinkage persist over multiple generations, rendering the loss irreversible in the long term. For data filtering, we will argue that since the generated data inherently lacks the tails, filtering cannot recover them. These points will be incorporated into the theoretical sections and a new discussion paragraph.
  Revision: yes
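The rare-token measurement proposed in the first response could look roughly like the following. This is a minimal sketch under stated assumptions, not the authors' analysis: it uses a unigram model with add-alpha smoothing as a stand-in for an LLM, and the function name `rare_token_mean_logprob`, the `max_count` rarity threshold, and the toy corpora are all hypothetical.

```python
import math
from collections import Counter


def rare_token_mean_logprob(reference_tokens, model_tokens,
                            max_count=2, alpha=0.5):
    """Mean log-probability that a smoothed unigram model fitted on
    `model_tokens` assigns to tokens that are rare in `reference_tokens`.

    A token counts as rare if it occurs at most `max_count` times in the
    reference corpus. Add-alpha smoothing keeps probabilities non-zero.
    """
    ref_counts = Counter(reference_tokens)
    rare = [tok for tok, count in ref_counts.items() if count <= max_count]
    if not rare:
        raise ValueError("no rare tokens under the chosen threshold")
    model_counts = Counter(model_tokens)
    vocab_size = len(ref_counts.keys() | model_counts.keys())
    denom = len(model_tokens) + alpha * vocab_size
    log_probs = [math.log((model_counts[tok] + alpha) / denom) for tok in rare]
    return sum(log_probs) / len(log_probs)


if __name__ == "__main__":
    # Toy corpora: the "generated" corpus has lost one rare token entirely.
    original = ["the"] * 50 + ["cat"] * 30 + ["axolotl"] * 2 + ["quokka"]
    generated = ["the"] * 60 + ["cat"] * 22 + ["axolotl"]
    print("original :", rare_token_mean_logprob(original, original))
    print("generated:", rare_token_mean_logprob(original, generated))
```

Tracking this number across successive generations, with the rare set always defined on the original corpus, would separate tail loss from a general rise in perplexity.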
Circularity Check
No circularity: derivations and experiments are independent of inputs
Full rationale
The paper derives model collapse explicitly for GMMs via recursive MLE showing variance shrinkage and for VAEs via latent support reduction, then reports separate LLM experiments on perplexity and diversity. None of these steps reduce by construction to fitted parameters renamed as predictions, self-citations that bear the central load, or ansatzes smuggled from prior author work. The theoretical sections use standard generative modeling assumptions without redefining the target phenomenon in terms of itself, and the LLM results are empirical observations rather than forced outputs of the same equations. This is a standard non-circular finding for a paper whose core claims rest on explicit math and fresh experiments.
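The variance-shrinkage step mentioned above can be illustrated in the simplest case of a single one-dimensional Gaussian refitted by maximum likelihood. This is a textbook simplification rather than the paper's GMM derivation; the symbols $n$ (samples per generation) and $\sigma_t^2$ (fitted variance at generation $t$) are introduced here for illustration.

```latex
\begin{align}
\sigma_{t+1}^2 &= \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2,
\qquad x_1,\dots,x_n \overset{\text{iid}}{\sim} \mathcal{N}\!\left(\mu_t, \sigma_t^2\right), \\
\mathbb{E}\!\left[\sigma_{t+1}^2 \mid \sigma_t^2\right] &= \frac{n-1}{n}\,\sigma_t^2, \\
\mathbb{E}\!\left[\sigma_t^2\right] &= \left(1 - \frac{1}{n}\right)^{t}\sigma_0^2
\;\longrightarrow\; 0 \quad (t \to \infty).
\end{align}
```

Under these assumptions the expected fitted variance decays geometrically at rate $1 - 1/n$, so larger per-generation samples slow the shrinkage in expectation but do not stop it.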
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Generative models are trained to approximate the full support of the data distribution.
Lean theorems connected to this paper
- LawOfExistence nothing_cannot_exist (echoes)
  Echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: "We find that use of model-generated content in training causes irreversible defects in the resulting models, where tails of the original content distribution disappear. We refer to this effect as model collapse"
- Cost.FunctionalEquation washburn_uniqueness_aczel (unclear)
  Unclear: relation between the paper passage and the cited Recognition theorem.
  Passage: "over time, models forget the true underlying data distribution, even in the absence of a shift in the distribution over time"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 19 Pith papers
- Evaluating Very Long-Term Conversational Memory of LLM Agents
  Creates LoCoMo benchmark dataset for very long-term LLM conversational memory and shows current models struggle with lengthy dialogues and long-range temporal dynamics.
- When Does Model Collapse Occur in Structured Interactive Learning?
  Model collapse occurs in structured interactive learning if and only if the directed interaction graph satisfies a specific topological condition, with finite-sample guarantees for linear regression and asymptotic res...
- Base Models Look Human To AI Detectors
  Base model text evades AI detectors better than instruction-tuned text, and the HIP method strengthens this trade-off across model sizes.
- Curated Synthetic Data Doesn't Have to Collapse: A Theoretical Study of Generative Retraining with Pluralistic Preferences
  Recursive generative retraining with pluralistic preferences converges to a stable diverse distribution that satisfies a weighted Nash bargaining solution.
- Perturbation Dose Responses in Recursive LLM Loops: Raw Switching, Stochastic Floors, and Persistent Escape under Append, Replace, and Dialog Updates
  In 30-step recursive LLM loops, append-mode persistent escape from source basins reaches 50% near 400 tokens under full history but plateaus below 50% under tail-clip memory policy, while replace-mode switching largel...
- RLSpoofer: A Lightweight Evaluator for LLM Watermark Spoofing Resilience
  RLSpoofer trains a 4B model on 100 watermarked paraphrase pairs to spoof PF watermarks at 62% success rate, far exceeding baselines trained on up to 10,000 samples.
- Synthetic Data Generation for Training Diversified Commonsense Reasoning Models
  A two-stage synthetic data generation method creates the CommonSyn dataset, allowing LLMs fine-tuned on it to produce more diverse and higher-quality commonsense responses than vanilla or human-data-trained models.
- EmbGen: Teaching with Reassembled Corpora
  EmbGen creates synthetic QA data by entity decomposition, embedding-based reassembly into clusters, and multi-level sampling with cluster-specific prompts, yielding up to 88.9% higher Binary Accuracy than baselines on...
- Annotations Mitigate Post-Training Mode Collapse
  Annotation-anchored training reduces semantic diversity collapse in post-trained language models by a factor of six compared to standard supervised fine-tuning while preserving instruction-following and improving with scale.
- Stabilizing Unsupervised Self-Evolution of MLLMs via Continuous Softened Retracing reSampling
  CSRS improves MLLM self-evolution stability by using retracing mechanisms and softened continuous rewards instead of majority voting, reaching SOTA on geometric reasoning benchmarks like MathVision.
- Your Brain on ChatGPT: Accumulation of Cognitive Debt when Using an AI Assistant for Essay Writing Task
  LLM use for essay writing correlates with reduced brain network connectivity, lower self-reported ownership, and poorer recall of one's own content compared to unaided or search-based writing.
- HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
  HuatuoGPT-o1 achieves superior medical complex reasoning by using a verifier to curate reasoning trajectories for fine-tuning and then applying RL with verifier-based rewards.
- Scaling Synthetic Data Creation with 1,000,000,000 Personas
  A curated set of one billion personas enables scalable, diverse synthetic data generation for LLM training across reasoning, instructions, knowledge, NPCs, and tools.
- Reinforced Self-Training (ReST) for Language Modeling
  ReST improves LLM translation quality on benchmarks via offline RL on self-generated data, achieving gains in a compute-efficient way compared to typical RLHF.
- Textbooks Are All You Need
  A 1.3B-parameter code model trained on 7B tokens of curated textbook and synthetic data achieves 50.6% on HumanEval, indicating data quality can enable strong performance at small scale.
- AgentSim: A Platform for Verifiable Agent-Trace Simulation
  AgentSim creates and releases the Agent-Trace Corpus of over 103,000 verifiable reasoning steps across three IR benchmarks with claimed 100% grounding on substantive answers.
- Position: No Retroactive Cure for Infringement during Training
  Post-hoc mitigation cannot retroactively cure infringement that occurred during unauthorized data ingestion and training because liability attaches to data lineage and retained expressive value in model weights.
- Losing our Tail, Again: (Un)Natural Selection & Multilingual LLMs
  Position paper warns that model collapse in self-consuming multilingual LLM training loops risks flattening linguistic diversity and cultural nuance.
- Content Platform GenAI Regulation via Compensation
  A compensation-based incentive scheme for human creators on content platforms can increase high-value original content, reduce GenAI data pollution, and raise platform profits without needing AI detectors.