Overtrained Language Models Are Harder to Fine-Tune

Aditi Raghunathan; Graham Neubig; Jacob Mitchell Springer; Kaiyue Wen; Sachin Goyal; Sadhika Malladi; Tanishq Kumar; Xiang Yue

arxiv: 2503.19206 · v2 · pith:VDYYBT3Snew · submitted 2025-03-24 · 💻 cs.CL · cs.AI

Jacob Mitchell Springer, Sachin Goyal, Kaiyue Wen, Tanishq Kumar, Xiang Yue, Sadhika Malladi, Graham Neubig, Aditi Raghunathan

keywords: models, performance, pre-trained, pre-training, assumption, catastrophic, downstream, fine-tune

Large language models are pre-trained on ever-growing token budgets under the assumption that better pre-training performance translates to improved downstream models. In this work, we challenge this assumption and show that extended pre-training can make models harder to fine-tune, leading to degraded final performance. We term this phenomenon catastrophic overtraining. For example, the instruction-tuned OLMo-1B model pre-trained on 3T tokens leads to over 2% worse performance on multiple standard LLM benchmarks than its 2.3T token counterpart. Through controlled experiments and theoretical analysis, we show that catastrophic overtraining arises from a systematic increase in the broad sensitivity of pre-trained parameters to modifications, including but not limited to fine-tuning. Our findings call for a critical reassessment of pre-training design that considers the downstream adaptability of the model.

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Improving Neural Network Training by Decoupling the Magnitude and Direction of Weight Vectors
cs.LG 2026-06 unverdicted novelty 6.0

MD Decoupling factorizes weights into fixed-norm directions and learnable per-row/column magnitudes updated at independent rates, improving Adam and Muon training stability and scale transfer without weight decay or warmup.

LLMs as Noisy Channels: A Shannon Perspective on Model Capacity and Scaling Laws
cs.LG 2026-05 unverdicted novelty 6.0

The Shannon Scaling Law treats LLM training as noisy-channel transmission and predicts U-shaped performance degradation when signal-to-noise ratio falls below a threshold, outperforming monotonic scaling laws on Pythi...

Early Data Exposure Improves Robustness to Subsequent Fine-Tuning
cs.LG 2026-05 conditional novelty 6.0

Early mixing of post-training data into pretraining improves retention of acquired capabilities after subsequent fine-tuning in language models.

Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less
cs.LG 2026-05 unverdicted novelty 6.0

Full finetuning with the pretraining optimizer reduces forgetting compared to other optimizers or LoRA while achieving comparable new-task performance.

Nexus: Same Pretraining Loss, Better Downstream Generalization via Common Minima
cs.LG 2026-04 unverdicted novelty 6.0

Nexus optimizer improves LLM downstream performance by converging to common minima across data sources despite identical pretraining loss.

Forgetting in Language Models: Capacity, Optimization, and Self-Generated Replay
cs.LG 2026-05 unverdicted novelty 5.0

Self-generated replay from language models nearly eliminates catastrophic forgetting during finetuning except when models are pretrained close to saturation.

Reward-Free Code Alignment from Pretrained or Fine-Tuned LLM: Unpacking the Trade-offs for Code Generation
cs.SE 2026-06 unverdicted novelty 4.0

Empirical study on five LLMs finds pretrained-to-aligned paths yield bigger gains over baseline than finetuned-to-aligned paths, though absolute accuracy remains lower for pretrained starts.