arXiv preprint arXiv:2502.15938

Why Linearly Decaying the Learning Rate to Zero Works Best · arXiv 2502.15938

2 Pith papers cite this work. Polarity classification is still indexing.



citation-role summary

method: 1

citation-polarity summary

use method: 1

representative citing papers

Layer Collapse in Diffusion Language Models

cs.LG · 2026-05-07 · unverdicted · novelty 7.0 · 2 refs

Diffusion language models develop early-layer collapse around an indispensable super-outlier due to overtraining, resulting in higher compressibility and reversed optimal sparsity patterns versus autoregressive models.

Predictable Scaling Laws of Optimal Hyperparameters for LLM Continued Pre-training

cs.CL · 2026-06-04 · unverdicted · novelty 6.0

Optimal hyperparameters for LLM continued pre-training follow predictable scaling laws derived from proxy models, enabling a two-stage framework that predicts settings from compute budget and checkpoint state to reduce search overhead by 90%.

