Introduces the Synthetic Data Contamination Equilibrium and derives closed-form optimal provenance subsidies s* = KL(q||p)/(2 kappa), together with watermark strengths, to mitigate model collapse; validated by OLS estimates on C4 data that match the structural predictions.
hub
Is model collapse inevitable? Breaking the curse of recursion by accumulating real and synthetic data
10 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
roles: background 1
citation-polarity summary
polarities: background 1
representative citing papers
By mid-2025 roughly 35% of new websites are AI-generated or AI-assisted, correlating with lower semantic diversity and higher positive sentiment but showing no significant drop in factual accuracy or stylistic diversity.
Recursive LLM text generation drives public corpora toward shallow equilibria via drift unless normative selection for quality sustains deeper structure with a bounded divergence.
FuXi-TC combines the FuXi global DL model with a diffusion generative framework to downscale and improve TC intensity and precipitation forecasts, matching ECMWF skill while being faster and generalizing zero-shot to North Atlantic hurricanes.
Filter Babel explores a future of AI-personalized private experiences that may erode common ground in communication while supporting individual identity and selfhood.
LLM integration in software engineering builds epistemological debt that erodes mental models and homogenizes code via recursive training, risking systemic fragility as illustrated by 2026 Amazon outages.
Knowledge distillation evaluations must report lost teacher capabilities via a Distillation Loss Statement rather than relying solely on task scores.
The book introduces the origins, mathematical setup, and optimization stages of RLHF including reward modeling, reinforcement learning, rejection sampling, and direct alignment algorithms.
citing papers explorer
- Knowledge Distillation Must Account for What It Loses
  Knowledge distillation evaluations must report lost teacher capabilities via a Distillation Loss Statement rather than relying solely on task scores.
- Reinforcement Learning from Human Feedback
  The book introduces the origins, mathematical setup, and optimization stages of RLHF including reward modeling, reinforcement learning, rejection sampling, and direct alignment algorithms.
- Curated Synthetic Data Doesn't Have to Collapse: A Theoretical Study of Generative Retraining with Pluralistic Preferences
- Position: the Stochastic Parrot in the Coal Mine. Model Collapse is a Threat to Low-Resource Communities