Beyond Model Collapse: Scaling Up with Synthesized Data Requires Verification

Elvis Dohmatob; Francois Charton; Julia Kempe; Pu Yang; Yunzhen Feng

arxiv: 2406.07515 · v2 · pith:BFIHXHMWnew · submitted 2024-06-11 · 💻 cs.LG · cs.AI · stat.ML



keywords data · model · collapse · generated · performance · synthesized · because · linear


Large Language Models (LLMs) are increasingly trained on data generated by other LLMs, either because generated text and images become part of the pre-training corpus, or because synthesized data is used as a replacement for expensive human annotation. This raises concerns about model collapse, a drop in model performance when training sets include generated data. Considering that it is easier for both humans and machines to tell good examples from bad ones than to generate high-quality samples, we investigate the use of verification on synthesized data to prevent model collapse. We provide a theoretical characterization using Gaussian mixtures, linear classifiers, and linear verifiers to derive conditions with measurable proxies to assess whether the verifier can effectively select synthesized data that leads to optimal performance. We experiment with two practical tasks -- computing matrix eigenvalues with transformers and news summarization with LLMs -- which both exhibit model collapse when trained on generated data, and show that verifiers, even imperfect ones, can indeed be harnessed to prevent model collapse and that our proposed proxy measure strongly correlates with performance.
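The setting named in the abstract (Gaussian mixtures, a linear classifier, a linear verifier) can be illustrated with a toy simulation. The sketch below is not the authors' code or their theoretical model. It assumes a two-class mixture x ~ N(y·mu, I), stands in for an imperfect generator by flipping labels at random, and uses a noisy copy of the true direction as the verifier. All function names and parameter values (`run_demo`, `flip_prob`, `verifier_noise`, and so on) are hypothetical choices made for illustration.

```python
import numpy as np


def sample_true(n, mu, rng):
    """Draw n points from a two-component Gaussian mixture: x ~ N(y * mu, I)."""
    y = rng.choice([-1, 1], size=n)
    x = y[:, None] * mu + rng.standard_normal((n, mu.size))
    return x, y


def fit_linear(x, y):
    """Fit a linear classifier by averaging y_i * x_i (class-mean direction)."""
    return (y[:, None] * x).mean(axis=0)


def accuracy(w, x, y):
    """Fraction of points whose label matches sign(w . x)."""
    return float(np.mean(np.sign(x @ w) == y))


def run_demo(seed=0, n=200, d=20, flip_prob=0.4, verifier_noise=0.1):
    rng = np.random.default_rng(seed)
    mu = np.full(d, 2.0 / np.sqrt(d))  # class mean with norm 2 (assumed)

    # "Synthesized" data: real inputs whose labels are flipped with
    # probability flip_prob, a hypothetical stand-in for a flawed generator.
    x, y_true = sample_true(n, mu, rng)
    flipped = rng.random(n) < flip_prob
    y_syn = np.where(flipped, -y_true, y_true)

    # Imperfect linear verifier: a noisy copy of the true direction.
    w_verifier = mu + verifier_noise * rng.standard_normal(d)
    # Keep only the samples whose synthesized label the verifier agrees with.
    keep = np.sign(x @ w_verifier) == y_syn

    # Evaluate both classifiers on clean held-out data.
    x_test, y_test = sample_true(5000, mu, rng)
    return {
        "noise_before": float(flipped.mean()),
        "noise_after": float(flipped[keep].mean()),
        "kept": int(keep.sum()),
        "acc_unverified": accuracy(fit_linear(x, y_syn), x_test, y_test),
        "acc_verified": accuracy(fit_linear(x[keep], y_syn[keep]), x_test, y_test),
    }


if __name__ == "__main__":
    for name, value in run_demo().items():
        print(f"{name}: {value}")
```

Running the script prints the label-noise rate before and after verification, the number of retained samples, and the test accuracy of a classifier trained with and without the verifier's filter. Varying `verifier_noise` gives a rough feel for how verifier quality affects the selected data, though the paper's actual conditions and proxy measure are not reproduced here.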



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Curated Synthetic Data Doesn't Have to Collapse: A Theoretical Study of Generative Retraining with Pluralistic Preferences
cs.LG 2026-05 unverdicted novelty 7.0

Recursive generative retraining with pluralistic preferences converges to a stable diverse distribution that satisfies a weighted Nash bargaining solution.

Curated Synthetic Data Doesn't Have to Collapse: A Theoretical Study of Generative Retraining with Pluralistic Preferences
cs.LG 2026-05 unverdicted novelty 6.0

Recursive generative retraining with heterogeneous rewards converges to a stable distribution satisfying a weighted Nash bargaining solution, preserving diversity under stated conditions.