arXiv preprint arXiv:2502.15938
2 Pith papers cite this work. Polarity classification is still indexing.
Citation-role summary: method (1)
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
Roles: method (1)
Polarities: use method (1)
Representative citing papers:
Optimal hyperparameters for LLM continued pre-training follow predictable scaling laws derived from proxy models, enabling a two-stage framework that predicts settings from compute budget and checkpoint state to reduce search overhead by 90%.
citing papers explorer
- Layer Collapse in Diffusion Language Models
Diffusion language models develop early-layer collapse around an indispensable super-outlier due to overtraining, resulting in higher compressibility and reversed optimal sparsity patterns versus autoregressive models.