
arXiv: 2305.17493 · v3 · pith:C6B2FLJUnew · submitted 2023-05-27 · 💻 cs.LG · cs.AI · cs.CL · cs.CR · cs.CV

The Curse of Recursion: Training on Generated Data Makes Models Forget

Pith reviewed 2026-05-20 13:58 UTC · model grok-4.3

classification 💻 cs.LG cs.AI cs.CL cs.CR cs.CV
keywords model collapse, training on generated data, large language models, generative models, distribution tails, recursive training, data degradation

The pith

Use of model-generated data in training causes irreversible loss of the tails of the original content distribution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines the effects of training generative models on data that includes outputs produced by earlier versions of such models. It identifies a process in which successive rounds of training on synthetic content cause the models to lose coverage of low-probability events that were present in the original data. This loss is demonstrated experimentally in variational autoencoders, Gaussian mixture models, and large language models. The authors argue that the effect must be addressed if the benefits of large-scale web-scraped training data are to continue as language models contribute more of the text found online.

Core claim

The central claim is that use of model-generated content in training causes irreversible defects in the resulting models, where tails of the original content distribution disappear. The authors term this effect Model Collapse and show that it can occur in Variational Autoencoders, Gaussian Mixture Models and LLMs, building theoretical intuition and demonstrating its presence across learned generative models.

What carries the argument

Model Collapse, the progressive disappearance of tail mass from the learned distribution when training data is replaced by samples drawn from a previous model.

If this is right

  • Models trained recursively on their own outputs will assign lower probability to events that were uncommon in the initial data.
  • The diversity of outputs from successive model generations will shrink as tail coverage is lost.
  • Data collected from genuine human interactions will retain higher value than synthetic content for sustaining model performance.
  • Continued use of unfiltered web data will eventually degrade the capabilities of models trained on it, as the share of model-generated text in that data grows.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Detection and removal of synthetic text from training corpora may become a standard preprocessing step.
  • The same collapse dynamic could appear in image or video generation pipelines that rely on web-scraped synthetic examples.
  • Hybrid datasets that deliberately mix real and generated samples in controlled ratios could be tested to measure the rate of tail erosion.
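
The hybrid-dataset test in the last bullet can be sketched on a toy problem. The example below is an illustration only, not an experiment from the paper. It assumes a one-dimensional Gaussian in place of a generative model, refits it by maximum likelihood each generation on a mix of fresh real draws and samples from the previous fit, and uses the fitted standard deviation as a stand-in for tail coverage. The function name final_sigma and every parameter value are hypothetical.

```python
import random
import statistics

def final_sigma(real_fraction, sample_size=20, generations=400, seed=0):
    """Fitted standard deviation after repeated refits on mixed data.

    real_fraction is the share of each generation's training set drawn
    from the real distribution N(0, 1); the rest comes from the previous fit.
    """
    rng = random.Random(seed)
    num_real = round(real_fraction * sample_size)
    mu, sigma = 0.0, 1.0  # generation 0 matches the real distribution
    for _ in range(generations):
        real = [rng.gauss(0.0, 1.0) for _ in range(num_real)]
        synthetic = [rng.gauss(mu, sigma) for _ in range(sample_size - num_real)]
        data = real + synthetic
        # Maximum-likelihood refit: sample mean and population standard deviation.
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)
    return sigma

for fraction in (0.0, 0.1, 0.5):
    print(f"real fraction {fraction:.1f}: final sigma {final_sigma(fraction):.4f}")
```

In this toy setting a real fraction of 0.0 typically drives the fitted spread toward zero, while a fixed share of real data tends to keep it anchored. Sweeping real_fraction gives a crude picture of how the erosion rate depends on the mixing ratio.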

Load-bearing premise

Generated data enters training directly, without curation or mechanisms that would restore the missing low-probability regions of the original distribution.

What would settle it

Train one model on real data, then train successor models on data sampled from each predecessor and check whether the probability mass assigned to the original rare events steadily declines across generations.
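
A minimal sketch of this test, under strong simplifying assumptions: the "model" is a categorical distribution over a hypothetical set of 200 events with Zipf-like probabilities, "training" is taking empirical frequencies, and generation 0 is the true distribution standing in for the model trained on real data. None of the names or parameter values come from the paper.

```python
import random
from collections import Counter

def zipf_distribution(num_events, exponent=1.0):
    """Normalised Zipf-like probabilities over event ids 0..num_events-1."""
    weights = [1.0 / (rank + 1) ** exponent for rank in range(num_events)]
    total = sum(weights)
    return [w / total for w in weights]

def fit_from_samples(samples, num_events):
    """Maximum-likelihood categorical fit: the empirical frequency of each event."""
    counts = Counter(samples)
    return [counts[event] / len(samples) for event in range(num_events)]

def run_generations(num_events=200, sample_size=500, generations=30,
                    rare_cutoff=50, seed=0):
    """Return (rare-event mass, surviving rare events) for each generation."""
    rng = random.Random(seed)
    events = range(num_events)
    rare_events = range(rare_cutoff, num_events)  # the originally rare events
    model = zipf_distribution(num_events)         # generation 0: the "real" data
    history = []
    for _ in range(generations + 1):
        mass = sum(model[e] for e in rare_events)
        survivors = sum(1 for e in rare_events if model[e] > 0)
        history.append((mass, survivors))
        # The successor sees only samples drawn from its predecessor.
        samples = rng.choices(events, weights=model, k=sample_size)
        model = fit_from_samples(samples, num_events)
    return history

history = run_generations()
for generation in (0, 10, 20, 30):
    mass, survivors = history[generation]
    print(f"generation {generation:2d}: rare mass {mass:.3f}, rare events alive {survivors}")
```

In this toy model the count of surviving rare events can only fall, because an event whose estimated probability reaches zero is never sampled again. The aggregate mass on the rare set is printed for comparison; here it fluctuates from run to run, so the survivor count is the more dependable signal.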

Original abstract

Stable Diffusion revolutionised image creation from descriptive text. GPT-2, GPT-3(.5) and GPT-4 demonstrated astonishing performance across a variety of language tasks. ChatGPT introduced such language models to the general public. It is now clear that large language models (LLMs) are here to stay, and will bring about drastic change in the whole ecosystem of online text and images. In this paper we consider what the future might hold. What will happen to GPT-{n} once LLMs contribute much of the language found online? We find that use of model-generated content in training causes irreversible defects in the resulting models, where tails of the original content distribution disappear. We refer to this effect as Model Collapse and show that it can occur in Variational Autoencoders, Gaussian Mixture Models and LLMs. We build theoretical intuition behind the phenomenon and portray its ubiquity amongst all learned generative models. We demonstrate that it has to be taken seriously if we are to sustain the benefits of training from large-scale data scraped from the web. Indeed, the value of data collected about genuine human interactions with systems will be increasingly valuable in the presence of content generated by LLMs in data crawled from the Internet.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that recursive training of generative models on their own outputs produces 'model collapse,' in which the tails of the original data distribution are irreversibly lost. This is illustrated with theoretical derivations for VAEs and GMMs together with iterative training experiments on small LLMs that exhibit rising perplexity and falling output diversity; the authors argue the effect will become material once LLM-generated content dominates web-scale training corpora.

Significance. If the central mechanism is confirmed, the work identifies a structural risk to continued scaling of generative models that rely on uncurated web data. The explicit variance-shrinkage and support-reduction results for GMMs and VAEs supply clear mathematical grounding, while the LLM demonstrations point to measurable degradation; the findings would directly affect data-acquisition strategies and the long-term value of human-generated text.

major comments (2)
  1. [LLM experiments] LLM experiments section: the reported rise in perplexity and drop in lexical diversity are consistent with degradation but do not quantify whether low-frequency tokens or rare n-grams from the original corpus receive disproportionately lower probability or are eliminated. Without this measurement the link between observed defects and the claimed tail-loss mechanism remains unverified.
  2. [Theoretical analysis] Theoretical sections on GMMs and VAEs: the derivations show variance shrinkage and latent-space support reduction under recursive maximum-likelihood updates, yet the manuscript does not demonstrate that these effects are strictly irreversible once standard regularization or data-filtering steps are introduced.
minor comments (2)
  1. [Abstract] Abstract: the claim that collapse 'has to be taken seriously' would be strengthened by a brief quantitative statement of the scale of the LLM models and number of recursion steps used.
  2. [Introduction] Notation: define the precise meaning of 'tail mass' (e.g., probability below a fixed quantile) at first use so that later empirical claims can be checked against it.
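
The measurement requested in major comment 1, with the tail fixed in advance as minor comment 2 asks, could look roughly like the sketch below. It is an assumption-laden illustration, not the paper's setup: a smoothed unigram model stands in for an LLM, a synthetic Zipf-distributed corpus stands in for human text, and the tail is taken to be the least frequent half of the tokens seen in that corpus. All function names and parameters are hypothetical.

```python
import math
import random
from collections import Counter

def fit_unigram(tokens, vocab, alpha=0.01):
    """Additively smoothed unigram model, so unseen tokens keep a finite log-probability."""
    counts = Counter(tokens)
    denom = len(tokens) + alpha * len(vocab)
    return {tok: (counts[tok] + alpha) / denom for tok in vocab}

def rare_tokens(tokens, quantile=0.5):
    """Observed tokens in the lowest `quantile` share when ranked by corpus frequency."""
    ranked = sorted(Counter(tokens).items(), key=lambda item: (item[1], item[0]))
    cutoff = max(1, int(len(ranked) * quantile))
    return [tok for tok, _ in ranked[:cutoff]]

def mean_log_prob(model, tokens):
    """Average log-probability the model assigns to the given tokens."""
    return sum(math.log(model[tok]) for tok in tokens) / len(tokens)

def rare_token_curve(generations=20, corpus_size=2000, vocab_size=300, seed=0):
    """Mean log-probability of the original rare tokens, one value per generation."""
    rng = random.Random(seed)
    vocab = [f"tok{i}" for i in range(vocab_size)]
    zipf_weights = [1.0 / (i + 1) for i in range(vocab_size)]
    corpus = rng.choices(vocab, weights=zipf_weights, k=corpus_size)  # stand-in for human text
    tail = rare_tokens(corpus)  # fixed once, from the original corpus
    model = fit_unigram(corpus, vocab)
    curve = [mean_log_prob(model, tail)]
    for _ in range(generations):
        weights = [model[tok] for tok in vocab]
        corpus = rng.choices(vocab, weights=weights, k=corpus_size)  # model output only
        model = fit_unigram(corpus, vocab)
        curve.append(mean_log_prob(model, tail))
    return curve

curve = rare_token_curve()
print(f"generation 0: {curve[0]:.2f}, generation {len(curve) - 1}: {curve[-1]:.2f}")
```

Fixing the tail set from the original corpus before any recursion is the design choice that matters here: it keeps the later generations from redefining what counts as rare.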

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address the major comments below, indicating where revisions will be made to strengthen the paper.

Point-by-point responses
  1. Referee: [LLM experiments] LLM experiments section: the reported rise in perplexity and drop in lexical diversity are consistent with degradation but do not quantify whether low-frequency tokens or rare n-grams from the original corpus receive disproportionately lower probability or are eliminated. Without this measurement the link between observed defects and the claimed tail-loss mechanism remains unverified.

    Authors: We appreciate the referee's suggestion to more directly link the experimental observations to the tail-loss mechanism. In the revised manuscript, we will add a new subsection in the LLM experiments that measures the log-probabilities of low-frequency tokens and rare n-grams from the original corpus. We will show that these rare elements receive progressively lower probabilities in recursively trained models, providing quantitative support for the claimed mechanism. This analysis will use the same experimental setup as the existing perplexity and diversity metrics.
    Revision: yes

  2. Referee: [Theoretical analysis] Theoretical sections on GMMs and VAEs: the derivations show variance shrinkage and latent-space support reduction under recursive maximum-likelihood updates, yet the manuscript does not demonstrate that these effects are strictly irreversible once standard regularization or data-filtering steps are introduced.

    Authors: The theoretical analysis isolates the effect under standard maximum-likelihood estimation to highlight the fundamental issue. To address the referee's point, we will include an extended discussion and additional derivations showing the effects under L2 regularization. While regularization slows the rate of collapse, the support reduction in the latent space and variance shrinkage persist over multiple generations, rendering the loss irreversible in the long term. For data filtering, we will argue that since the generated data inherently lacks the tails, filtering cannot recover them. These points will be incorporated into the theoretical sections and a new discussion paragraph.
    Revision: yes
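
The variance shrinkage referred to in the second exchange can be illustrated in the simplest case. The derivation below is a sketch under stated assumptions, not the paper's GMM or VAE analysis: it takes a single one-dimensional Gaussian, refit at each generation by plain maximum likelihood on n samples drawn from the previous fit, with no regularization or filtering. The symbols n, k, \mu_k and \sigma_k are introduced here for illustration.

```latex
% Generation k is N(\mu_k, \sigma_k^2); draw n i.i.d. samples and refit by maximum likelihood.
X_1, \dots, X_n \sim \mathcal{N}(\mu_k, \sigma_k^2), \qquad
\mu_{k+1} = \frac{1}{n} \sum_{i=1}^{n} X_i, \qquad
\sigma_{k+1}^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \mu_{k+1})^2 .

% Since n \sigma_{k+1}^2 / \sigma_k^2 follows a \chi^2_{n-1} distribution:
\mathbb{E}\!\left[ \sigma_{k+1}^2 \mid \sigma_k^2 \right] = \frac{n-1}{n} \, \sigma_k^2
\quad\Longrightarrow\quad
\mathbb{E}\!\left[ \sigma_k^2 \right] = \left( \frac{n-1}{n} \right)^{k} \sigma_0^2
\;\xrightarrow[k \to \infty]{}\; 0 .
```

Each generation sees only the previous fit, so nothing in this recursion pushes the variance back up. The sketch says nothing about regularized or filtered training, which is the case the referee raises.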

Circularity Check

0 steps flagged

No circularity: derivations and experiments are independent of inputs

Full rationale

The paper derives model collapse explicitly for GMMs via recursive MLE showing variance shrinkage and for VAEs via latent support reduction, then reports separate LLM experiments on perplexity and diversity. None of these steps reduce by construction to fitted parameters renamed as predictions, self-citations that bear the central load, or ansatzes smuggled from prior author work. The theoretical sections use standard generative modeling assumptions without redefining the target phenomenon in terms of itself, and the LLM results are empirical observations rather than forced outputs of the same equations. This is a standard non-circular finding for a paper whose core claims rest on explicit math and fresh experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The claim depends on the modeling assumption that generated samples are drawn from the learned distribution without external correction and that tail loss is not mitigated by standard training procedures.

axioms (1)
  • Domain assumption: Generative models are trained to approximate the full support of the data distribution.
    Invoked when claiming that missing tails constitute a defect rather than an expected approximation error.

pith-pipeline@v0.9.0 · 5773 in / 1108 out tokens · 39607 ms · 2026-05-20T13:58:54.307487+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • LawOfExistence nothing_cannot_exist echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    We find that use of model-generated content in training causes irreversible defects in the resulting models, where tails of the original content distribution disappear. We refer to this effect as model collapse

  • Cost.FunctionalEquation washburn_uniqueness_aczel unclear

    Relation between the paper passage and the cited Recognition theorem.

    over time, models forget the true underlying data distribution, even in the absence of a shift in the distribution over time

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Evaluating Very Long-Term Conversational Memory of LLM Agents

    cs.CL 2024-02 unverdicted novelty 8.0

    Creates LoCoMo benchmark dataset for very long-term LLM conversational memory and shows current models struggle with lengthy dialogues and long-range temporal dynamics.

  2. When Does Model Collapse Occur in Structured Interactive Learning?

    cs.LG 2026-05 unverdicted novelty 7.0

    Model collapse occurs in structured interactive learning if and only if the directed interaction graph satisfies a specific topological condition, with finite-sample guarantees for linear regression and asymptotic res...

  3. Base Models Look Human To AI Detectors

    cs.CL 2026-05 unverdicted novelty 7.0

    Base model text evades AI detectors better than instruction-tuned text, and the HIP method strengthens this trade-off across model sizes.

  4. Curated Synthetic Data Doesn't Have to Collapse: A Theoretical Study of Generative Retraining with Pluralistic Preferences

    cs.LG 2026-05 unverdicted novelty 7.0

    Recursive generative retraining with pluralistic preferences converges to a stable diverse distribution that satisfies a weighted Nash bargaining solution.

  5. Perturbation Dose Responses in Recursive LLM Loops: Raw Switching, Stochastic Floors, and Persistent Escape under Append, Replace, and Dialog Updates

    cs.AI 2026-05 unverdicted novelty 7.0

    In 30-step recursive LLM loops, append-mode persistent escape from source basins reaches 50% near 400 tokens under full history but plateaus below 50% under tail-clip memory policy, while replace-mode switching largel...

  6. RLSpoofer: A Lightweight Evaluator for LLM Watermark Spoofing Resilience

    cs.CR 2026-04 unverdicted novelty 7.0

    RLSpoofer trains a 4B model on 100 watermarked paraphrase pairs to spoof PF watermarks at 62% success rate, far exceeding baselines trained on up to 10,000 samples.

  7. Synthetic Data Generation for Training Diversified Commonsense Reasoning Models

    cs.CL 2026-03 unverdicted novelty 7.0

    A two-stage synthetic data generation method creates the CommonSyn dataset, allowing LLMs fine-tuned on it to produce more diverse and higher-quality commonsense responses than vanilla or human-data-trained models.

  8. EmbGen: Teaching with Reassembled Corpora

    cs.CL 2026-05 unverdicted novelty 6.0

    EmbGen creates synthetic QA data by entity decomposition, embedding-based reassembly into clusters, and multi-level sampling with cluster-specific prompts, yielding up to 88.9% higher Binary Accuracy than baselines on...

  9. Annotations Mitigate Post-Training Mode Collapse

    cs.CL 2026-05 unverdicted novelty 6.0

    Annotation-anchored training reduces semantic diversity collapse in post-trained language models by a factor of six compared to standard supervised fine-tuning while preserving instruction-following and improving with scale.

  10. Stabilizing Unsupervised Self-Evolution of MLLMs via Continuous Softened Retracing reSampling

    cs.CV 2026-04 unverdicted novelty 6.0

    CSRS improves MLLM self-evolution stability by using retracing mechanisms and softened continuous rewards instead of majority voting, reaching SOTA on geometric reasoning benchmarks like MathVision.

  11. Your Brain on ChatGPT: Accumulation of Cognitive Debt when Using an AI Assistant for Essay Writing Task

    cs.AI 2025-06 unverdicted novelty 6.0

    LLM use for essay writing correlates with reduced brain network connectivity, lower self-reported ownership, and poorer recall of one's own content compared to unaided or search-based writing.

  12. HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs

    cs.CL 2024-12 unverdicted novelty 6.0

    HuatuoGPT-o1 achieves superior medical complex reasoning by using a verifier to curate reasoning trajectories for fine-tuning and then applying RL with verifier-based rewards.

  13. Scaling Synthetic Data Creation with 1,000,000,000 Personas

    cs.CL 2024-06 unverdicted novelty 6.0

    A curated set of one billion personas enables scalable, diverse synthetic data generation for LLM training across reasoning, instructions, knowledge, NPCs, and tools.

  14. Reinforced Self-Training (ReST) for Language Modeling

    cs.CL 2023-08 unverdicted novelty 6.0

    ReST improves LLM translation quality on benchmarks via offline RL on self-generated data, achieving gains in a compute-efficient way compared to typical RLHF.

  15. Textbooks Are All You Need

    cs.CL 2023-06 unverdicted novelty 6.0

    A 1.3B-parameter code model trained on 7B tokens of curated textbook and synthetic data achieves 50.6% on HumanEval, indicating data quality can enable strong performance at small scale.

  16. AgentSim: A Platform for Verifiable Agent-Trace Simulation

    cs.IR 2026-04 unverdicted novelty 5.0

    AgentSim creates and releases the Agent-Trace Corpus of over 103,000 verifiable reasoning steps across three IR benchmarks with claimed 100% grounding on substantive answers.

  17. Position: No Retroactive Cure for Infringement during Training

    cs.CR 2026-04 unverdicted novelty 5.0

    Post-hoc mitigation cannot retroactively cure infringement that occurred during unauthorized data ingestion and training because liability attaches to data lineage and retained expressive value in model weights.

  18. Losing our Tail, Again: (Un)Natural Selection & Multilingual LLMs

    cs.CL 2025-07 unverdicted novelty 4.0

    Position paper warns that model collapse in self-consuming multilingual LLM training loops risks flattening linguistic diversity and cultural nuance.

  19. Content Platform GenAI Regulation via Compensation

    cs.CY 2026-03 unverdicted novelty 3.0

    A compensation-based incentive scheme for human creators on content platforms can increase high-value original content, reduce GenAI data pollution, and raise platform profits without needing AI detectors.
