arXiv preprint arXiv:2410.12341

Characterizing Model Collapse in Large Language Models Using Semantic Networks, Next-Token Probability · 2024 · cs.CL · arXiv 2410.12341

3 Pith papers cite this work. Polarity classification is still indexing.


abstract

As AI-generated content increasingly populates the web, generative AI models are at growing risk of being trained on their own outputs, a process known as AI autophagy. This feedback loop has been shown to induce model collapse, typically characterized by a loss of diversity in generated content. However, existing work offers a limited understanding of this phenomenon and relies on mitigation strategies that assume access to human-authored data. In this paper, we conduct extensive simulations across multiple datasets and LLMs to address key gaps in the study of model collapse. First, we introduce model-intrinsic measures based on next-token probability distributions, showing that model collapse corresponds to an increasing concentration of probability mass on a small set of tokens. Second, we demonstrate that model collapse is also associated with a loss of common sense, as measured by a decline in commonsense inference accuracy. Third, we identify perplexity (a measure of model "surprise") as a key driver of collapse: fine-tuning on the least "surprising" documents leads to more severe degeneration. Building on this insight, we propose a perplexity-based filtering strategy that prioritizes high-surprise documents during fine-tuning. Unlike existing approaches, our method does not require distinguishing between human-authored and AI-generated content. Across datasets and LLM families, this strategy consistently mitigates model collapse, achieving performance comparable to, and in some cases better than, human-data baselines, while substantially reducing the concentration of next-token probabilities. Overall, our results provide a unified, model-centric understanding of model collapse and suggest practical, scalable strategies for training generative AI systems in increasingly synthetic environments.
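The two quantities the abstract relies on, concentration of next-token probability mass and document perplexity, can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how concentration is measured, so top-k mass is used here as one assumed proxy, and the function names `top_k_mass`, `perplexity`, and `keep_high_surprise`, as well as the `keep_fraction` parameter, are hypothetical. Per-token log-probabilities are assumed to come from whatever model is being fine-tuned.

```python
import math


def top_k_mass(probs, k=5):
    """Share of next-token probability held by the k most likely tokens.

    Assumed proxy for "concentration"; values near 1.0 mean the
    distribution has collapsed onto a small set of tokens.
    """
    return sum(sorted(probs, reverse=True)[:k])


def perplexity(token_logprobs):
    """Perplexity: exp of the mean negative log-probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def keep_high_surprise(docs, keep_fraction=0.5):
    """Return ids of the highest-perplexity documents.

    docs is a list of (doc_id, token_logprobs) pairs. No label saying
    whether a document is human-authored or AI-generated is needed.
    """
    ranked = sorted(docs, key=lambda d: perplexity(d[1]), reverse=True)
    n_keep = max(1, round(len(ranked) * keep_fraction))
    return [doc_id for doc_id, _ in ranked[:n_keep]]


# Toy usage: three documents with made-up per-token log-probabilities.
docs = [
    ("a", [-0.1, -0.1, -0.1]),  # low surprise
    ("b", [-2.3, -2.3, -2.3]),  # high surprise
    ("c", [-0.7, -0.7, -0.7]),  # medium surprise
]
selected = keep_high_surprise(docs, keep_fraction=2 / 3)
```

Ranking by perplexity alone is the simplest reading of "prioritizes high-surprise documents"; the paper may use thresholds or other selection rules that the abstract does not describe.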

representative citing papers

Entropy Minimization without Model Collapse: Mitigating Prediction Bias in Medical Imaging

cs.LG · 2026-06-01 · unverdicted · novelty 6.0

Entropy minimization amplifies prediction bias from merged feature clusters under distribution shifts, and DSBR mitigates collapse by equalizing predicted class contributions to the unsupervised loss.

Curated Synthetic Data Doesn't Have to Collapse: A Theoretical Study of Generative Retraining with Pluralistic Preferences

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

Recursive generative retraining with heterogeneous rewards converges to a stable distribution satisfying a weighted Nash bargaining solution, preserving diversity under stated conditions.

The MMM Data Model -- A Normative Specification for Knowledge Interoperability in a Decentralisable Knowledge Commons

cs.AI · 2026-06-22

citing papers explorer


Entropy Minimization without Model Collapse: Mitigating Prediction Bias in Medical Imaging cs.LG · 2026-06-01 · unverdicted · none · ref 21 · internal anchor
Curated Synthetic Data Doesn't Have to Collapse: A Theoretical Study of Generative Retraining with Pluralistic Preferences cs.LG · 2026-05-08 · unverdicted · none · ref 92 · internal anchor
The MMM Data Model -- A Normative Specification for Knowledge Interoperability in a Decentralisable Knowledge Commons cs.AI · 2026-06-22 · unreviewed · ref 24 · internal anchor
