(How) Learning Rates Regulate Catastrophic Overtraining
Pith reviewed 2026-05-10 13:36 UTC · model grok-4.3
The pith
Learning rate decay during pretraining increases model sharpness and thereby worsens catastrophic forgetting when the model later undergoes supervised fine-tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When two models are fine-tuned to the same supervised fine-tuning (SFT) loss, one with large steps and one with small steps, they converge to qualitatively different solutions. Learning-rate decay during pretraining raises the sharpness of the pretrained model, and this sharper starting point amplifies catastrophic forgetting during the subsequent fine-tuning stage, which is the mechanism the paper proposes for catastrophic overtraining.
What carries the argument
The implicit regularization induced by the learning-rate schedule, which controls the sharpness of the pretrained solution and thereby mediates how much pretraining knowledge is lost during later fine-tuning to the same loss value.
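As a textbook illustration of why step size and sharpness are coupled (this is not the paper's own derivation), consider plain gradient descent on a one-dimensional quadratic. The symbols \eta (step size), \lambda (curvature), and \theta_t (iterate) are introduced here for illustration only.

```latex
\begin{aligned}
L(\theta) &= \tfrac{1}{2}\lambda\theta^{2}, \qquad \lambda > 0,\ \eta > 0,\\
\theta_{t+1} &= \theta_t - \eta\,\nabla L(\theta_t) = (1 - \eta\lambda)\,\theta_t,\\
|\theta_t| \to 0 &\iff |1 - \eta\lambda| < 1 \iff \lambda < \frac{2}{\eta}.
\end{aligned}
```

In this simplified setting a large step size can only settle where the curvature is below 2/\eta, while a decayed step size raises that ceiling and permits sharper solutions. How far this heuristic carries over to LLM pretraining with adaptive optimizers is an empirical question.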
Load-bearing premise
Observed differences in forgetting between models that reach the same fine-tuning loss under different learning rates are caused by the learning rate's effect on sharpness rather than by other uncontrolled aspects of the optimization trajectory.
What would settle it
Train two models to the same sharpness value using unrelated learning-rate schedules and check whether they exhibit identical rates of capability loss once fine-tuned to the same instruction loss.
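Running such a test requires a concrete sharpness measurement. The page does not say which metric would be used; the sketch below assumes one common choice, the largest Hessian eigenvalue, estimated by power iteration with finite-difference Hessian-vector products. The function name, the `grad_fn` callable, and the toy quadratic are all hypothetical.

```python
import numpy as np

def top_hessian_eigenvalue(grad_fn, theta, iters=100, eps=1e-3, seed=0):
    """Estimate the largest-magnitude Hessian eigenvalue at theta by power iteration.

    grad_fn is a hypothetical callable returning the loss gradient at a point.
    Hessian-vector products use central finite differences of the gradient.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(theta.shape)
    v /= np.linalg.norm(v)
    eig = 0.0
    for _ in range(iters):
        # Approximate H @ v from two gradient evaluations.
        hv = (grad_fn(theta + eps * v) - grad_fn(theta - eps * v)) / (2.0 * eps)
        eig = float(v @ hv)  # Rayleigh quotient for the current unit vector
        norm = np.linalg.norm(hv)
        if norm == 0.0:
            break
        v = hv / norm
    return eig

# Toy quadratic loss 0.5 * theta^T H theta with known curvatures 1, 4, 10.
H = np.diag([1.0, 4.0, 10.0])
sharpness = top_hessian_eigenvalue(lambda th: H @ th, np.zeros(3))
```

On the toy quadratic the estimate should approach the largest curvature. For a real model the gradient would come from automatic differentiation on a fixed evaluation batch, and the choice of batch and metric would itself need to be held constant across the two models being compared.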
Original abstract
Supervised fine-tuning (SFT) is a common first stage of LLM post-training, teaching the model to follow instructions and shaping its behavior as a helpful assistant. At the same time, SFT may harm the fundamental capabilities of an LLM, particularly after long pretraining: a phenomenon known as catastrophic overtraining (Springer et al., 2025). To understand overtraining, we first investigate catastrophic forgetting in finetuning through the lens of implicit regularization of the learning rate. For models trained to the same SFT loss, we identify how the learning rate mediates optimization: finetuning with large and small steps converges to qualitatively different models. Next, we link forgetting to overtraining: learning rate decay increases the sharpness of the pretrained model, which in turn exacerbates catastrophic forgetting during SFT, leading to overtraining. Our findings paint a picture of the overtraining mechanism in LLMs and broadly contribute to the understanding of the interplay between optimization dynamics during pretraining and finetuning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript investigates catastrophic overtraining in LLMs during supervised fine-tuning (SFT) by analyzing the implicit regularization effects of learning rates. It claims that, for models trained to identical SFT loss values, large versus small step sizes converge to qualitatively different solutions; separately, learning-rate decay during pretraining increases the sharpness of the pretrained model, which then amplifies catastrophic forgetting in the subsequent SFT stage and produces overtraining.
Significance. If the causal linkage holds, the work supplies a concrete mechanistic account of overtraining that connects pretraining optimization geometry to downstream forgetting. This extends existing sharpness-based explanations of generalization and supplies actionable guidance on learning-rate schedules across the pretrain–finetune boundary. The emphasis on comparing models at fixed SFT loss is a methodological strength that avoids trivial loss-based confounds.
major comments (2)
- [§4 (Experimental Results) and §5 (Mechanism Analysis)] The central causal claim—that pretrained sharpness (induced by LR decay) is the dominant mediator of exacerbated forgetting at fixed SFT loss—remains unisolated. The manuscript must demonstrate that alternative LR-induced differences (gradient-noise scale, momentum accumulation, or basin geometry unrelated to the chosen sharpness metric) have been controlled; without such controls the observed qualitative differences cannot be attributed specifically to sharpness.
- [§3 (Finetuning Dynamics) and associated figures] The abstract states that finetuning with large and small steps “converges to qualitatively different models” at the same SFT loss, yet no quantitative metric or statistical test is referenced that would establish the claimed qualitative distinction is reproducible and not an artifact of random seed or data-order effects.
minor comments (2)
- [Abstract] The abstract supplies no experimental details (model scale, dataset sizes, exact LR schedules, sharpness metric, or number of runs), which makes it impossible for a reader to evaluate the strength of the reported findings from the summary alone.
- [§2 (Background)] Notation for sharpness and the precise definition of “catastrophic overtraining” should be introduced with an equation or explicit reference to prior work (Springer et al., 2025) at first use.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of causal isolation and quantitative rigor that we will address in the revision. Below we respond point by point to the major comments.
-
Referee: [§4 (Experimental Results) and §5 (Mechanism Analysis)] The central causal claim—that pretrained sharpness (induced by LR decay) is the dominant mediator of exacerbated forgetting at fixed SFT loss—remains unisolated. The manuscript must demonstrate that alternative LR-induced differences (gradient-noise scale, momentum accumulation, or basin geometry unrelated to the chosen sharpness metric) have been controlled; without such controls the observed qualitative differences cannot be attributed specifically to sharpness.
Authors: We agree that a stronger isolation of sharpness from other LR-induced factors would improve the causal argument. Our current design already holds SFT loss fixed across step-size regimes, which removes loss-value confounds, and we measure sharpness with the same metric used in the pretraining stage. Nevertheless, we have not yet performed explicit ablations that hold momentum fixed while varying noise scale or that compare basins with matched sharpness but different curvature spectra. In the revised manuscript we will add two targeted controls: (i) an experiment that injects controlled gradient noise at fixed step size and momentum, and (ii) a comparison of models whose sharpness is matched by construction but whose optimization trajectories differ in momentum accumulation. These additions will allow us to quantify how much of the observed forgetting difference survives after the alternative factors are controlled.
revision: yes
-
Referee: [§3 (Finetuning Dynamics) and associated figures] The abstract states that finetuning with large and small steps “converges to qualitatively different models” at the same SFT loss, yet no quantitative metric or statistical test is referenced that would establish the claimed qualitative distinction is reproducible and not an artifact of random seed or data-order effects.
Authors: We accept that the abstract’s phrasing would be more precise if supported by a quantitative measure and reproducibility checks. The figures in §3 already display consistent differences in post-SFT sharpness, forgetting curves, and downstream task performance across the two step-size regimes, but these are presented visually rather than with summary statistics. In the revision we will augment §3 with (a) a quantitative distance metric (parameter cosine similarity and layer-wise sharpness delta) between the large-step and small-step solutions at matched SFT loss, and (b) results aggregated over at least five independent random seeds with standard-error bars. These additions will be reflected in both the text and the abstract.
revision: yes
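The quantitative checks promised in the second response could look roughly like the following minimal sketch. It assumes model parameters are available as lists of NumPy arrays; the function names and all numbers are hypothetical and are not results from the paper.

```python
import numpy as np

def parameter_cosine_similarity(params_a, params_b):
    """Cosine similarity between two models' parameters, each flattened to one vector.

    params_a and params_b are hypothetical lists of arrays with matching shapes.
    """
    a = np.concatenate([np.ravel(p) for p in params_a])
    b = np.concatenate([np.ravel(p) for p in params_b])
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def mean_and_standard_error(values):
    """Mean and standard error of the mean (sample std with ddof=1) across seeds."""
    values = np.asarray(values, dtype=float)
    return float(values.mean()), float(values.std(ddof=1) / np.sqrt(values.size))

# Made-up parameters and per-seed scores, for illustration only.
large_step = [np.array([[1.0, 0.0], [0.0, 1.0]]), np.array([1.0, 1.0])]
small_step = [np.array([[1.0, 0.0], [0.0, 1.0]]), np.array([-1.0, -1.0])]
similarity = parameter_cosine_similarity(large_step, small_step)
mean, stderr = mean_and_standard_error([0.10, 0.12, 0.08, 0.11, 0.09])
```

Because both fine-tuned models start from the same pretrained weights, raw parameter vectors may be nearly parallel regardless of step size. Comparing the update vectors (fine-tuned minus pretrained) is one alternative worth considering, although the rebuttal does not say which variant the authors intend.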
Circularity Check
Empirical investigation with no self-referential derivation or fitted predictions
full rationale
The paper presents an empirical study of optimization dynamics in LLM pretraining and supervised fine-tuning. It reports observations that models trained to identical SFT loss values under different learning rates converge to qualitatively different solutions, and links learning-rate decay during pretraining to increased sharpness that then correlates with greater forgetting. These links are drawn directly from experimental comparisons rather than from any closed-form derivation, parameter fit presented as a prediction, or self-citation that supplies the central mechanism. No equations, ansatzes, uniqueness theorems, or renamings of known results appear in the abstract or described claims that would reduce the reported findings to their own inputs by construction. The work is therefore self-contained as an experimental investigation.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: existence of a loss landscape whose sharpness can be meaningfully compared across models trained to the same loss value.
Forward citations
Cited by 1 Pith paper
- Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less
  Full finetuning with the pretraining optimizer reduces forgetting compared to other optimizers or LoRA while achieving comparable new-task performance.
Appendix excerpts
A Additional Experimental Details
A.1 Supervised Finetuning
We mostly follow Springer et al. (2025) for the finetuning/evaluation design.
Data. We use two SFT datasets: Anthropic HH (Bai et al., [...]) and Tülu 3 (Lambert et al., 2025). Since Anthropic HH is originally a preference-tuning dataset, we use it for SFT by finetuning on “chosen” responses. We set the maximum context length of our models to 512 tokens. Following standard practice, we wrap the SFT examples into a chat template with special tokens <|user|> and <|assistant|> added to the tokenizer...
[...] and calculate the average KL-divergence between the next-token predictions of the original model and the predictions of the noised model. We obtain the noised model by adding Gaussian noise ε ∼ N(0, σ²), σ = 10⁻⁵, to the parameters. For each batch, we average the results over 10 random samples of ε.
B Additional Results
B.1 Verifying the Metric for Feature D...
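The noise-perturbation measurement quoted from the appendix (Gaussian parameter noise with σ = 10⁻⁵, KL-divergence between original and noised next-token predictions, averaged over 10 noise samples) can be sketched on a toy model. The linear-softmax model, the function names, and the KL direction KL(original || noised) are assumptions made for illustration; the excerpt does not specify the direction, and the paper applies the measurement to LLMs.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def noise_kl(weights, inputs, sigma=1e-5, n_samples=10, seed=0):
    """Average KL(p_original || p_noised) over inputs and noise draws.

    The linear-softmax "model" is a toy stand-in for an LLM's next-token head.
    """
    rng = np.random.default_rng(seed)
    p = softmax(inputs @ weights)
    total = 0.0
    for _ in range(n_samples):
        # Perturb every parameter with independent Gaussian noise.
        noised = weights + rng.normal(0.0, sigma, size=weights.shape)
        q = softmax(inputs @ noised)
        total += float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))
    return total / n_samples

# Toy data: 4 features, 6 "tokens", batch of 8 inputs.
rng = np.random.default_rng(1)
W = rng.standard_normal((4, 6))
X = rng.standard_normal((8, 4))
kl_small = noise_kl(W, X, sigma=0.01)
kl_large = noise_kl(W, X, sigma=0.1)
```

The usage lines use larger noise scales than the quoted σ = 10⁻⁵ only so that the toy values are easy to compare. A larger average KL at a fixed noise scale indicates predictions that are more sensitive to parameter perturbations.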