arxiv: 2604.13627 · v1 · submitted 2026-04-15 · 💻 cs.LG · cs.CL

Recognition: unknown

(How) Learning Rates Regulate Catastrophic Overtraining

Mark Rofin, Aditya Varre, Nicolas Flammarion


Pith reviewed 2026-05-10 13:36 UTC · model grok-4.3

classification 💻 cs.LG cs.CL

keywords: catastrophic overtraining, supervised fine-tuning, learning rate decay, model sharpness, catastrophic forgetting, implicit regularization, pretraining dynamics, LLM post-training


The pith

Learning rate decay during pretraining increases model sharpness and thereby worsens catastrophic forgetting when the model later undergoes supervised fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates why supervised fine-tuning on instructions can erase capabilities that a language model acquired during pretraining, especially after extended pretraining runs. It shows that the learning rate schedule used in pretraining shapes the geometry of the solution the model reaches, with decaying rates producing sharper minima. Sharper models then lose more of their original behavior once fine-tuned to the same instruction-following loss, producing the overtraining effect. A reader would care because this link supplies a concrete handle on a widespread practical problem in scaling language models.

Core claim

When two models are fine-tuned to the same SFT loss, one with large steps and the other with small steps, they reach qualitatively different parameter configurations. Learning-rate decay during pretraining raises the sharpness of the resulting model, and this increased sharpness amplifies the catastrophic forgetting that occurs during subsequent fine-tuning, producing overtraining.

What carries the argument

The implicit regularization induced by the learning-rate schedule, which controls the sharpness of the pretrained solution and thereby mediates how much pretraining knowledge is lost during later fine-tuning to the same loss value.

Load-bearing premise

Observed differences in forgetting between models that reach the same fine-tuning loss under different learning rates are caused by the learning rate's effect on sharpness rather than by other uncontrolled aspects of the optimization trajectory.

What would settle it

Train two models to the same sharpness value using unrelated learning-rate schedules and check whether they exhibit identical rates of capability loss once fine-tuned to the same instruction loss.
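To make "learning-rate decay" and "unrelated learning-rate schedules" concrete, here is a minimal sketch of two schedules that share a warmup and a peak but differ in whether they decay. The function names, the linear decay shape, and every constant are illustrative assumptions, not the settings used in the paper.

```python
def constant_after_warmup(step, peak_lr=3e-4, warmup_steps=100):
    """Hypothetical schedule: linear warmup, then a constant learning rate."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr


def warmup_stable_decay(step, total_steps, peak_lr=3e-4, warmup_steps=100,
                        decay_fraction=0.2, final_lr=0.0):
    """Hypothetical schedule: warmup, constant plateau, then linear decay.

    The last `decay_fraction` of training decays linearly from `peak_lr`
    to `final_lr`. The shape and defaults are assumptions for illustration.
    """
    decay_start = int(total_steps * (1.0 - decay_fraction))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        return peak_lr
    progress = (step - decay_start) / max(1, total_steps - decay_start)
    return peak_lr + (final_lr - peak_lr) * min(1.0, progress)
```

Under the claim above, a checkpoint taken from the plateau and one taken after the decay phase would be expected to differ in sharpness, so a test of the mechanism would need schedules like these to land at matched sharpness by other means.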

Figures

Figures reproduced from arXiv: 2604.13627 by Aditya Varre, Mark Rofin, Nicolas Flammarion.

**Figure 2.** SFT training loss vs MPA between the representations of the base and finetuned ...

**Figure 3.** Loss landscape analysis for OLMo 1 1B. Panels ...

**Figure 4.** Finetuning with Diagonal Networks. We train a diagonal network with ...

**Figure 5.** Evolution of model sharpness throughout pretraining (x-axis reflects pretraining ...

**Figure 6.** The dynamic of accuracy drop (the difference between the OOD score between ...

**Figure 7.** The principal angles between representations of base and finetuned SmolLM3-3B ...

**Figure 8.** L2 on SAE representations vs MPA for Gemma 3 1B.

**Figure 9.** L2 on SAE representations vs SFT train loss for Gemma 3 1B.

**Figure 10.** Left: dynamics of estimated sharpness and true sharpness closely follow each ...

**Figure 11.** SFT train loss vs OOD score during supervised finetuning (same as Figure ...

**Figure 12.** MPA vs SFT train loss during supervised finetuning (same as Figure ...

**Figure 13.** MPA vs SFT train loss during supervised finetuning (same as Figures ...

**Figure 14.** Loss landscape analysis (same as Figure ...

**Figure 15.** MPA as a function of stepsize with variable learning rates (analogous to Figures ...

**Figure 16.** MPA as a function of stepsize with variable checkpoints (analogous to Figure ...

**Figure 17.** MPA as a function of stepsize with variable checkpoints (analogous to Figure ...

**Figure 18.** Forgetting (the drop in OOD score between the base and finetuned models) ...

**Figure 19.** The principal angles between representations of base and finetuned models as a ...

**Figure 20.** Same as Figures 7 and 19, but for the models finetuned on Tülu 3.

Original abstract

Supervised fine-tuning (SFT) is a common first stage of LLM post-training, teaching the model to follow instructions and shaping its behavior as a helpful assistant. At the same time, SFT may harm the fundamental capabilities of an LLM, particularly after long pretraining: a phenomenon known as catastrophic overtraining (Springer et al., 2025). To understand overtraining, we first investigate catastrophic forgetting in finetuning through the lens of implicit regularization of the learning rate. For models trained to the same SFT loss, we identify how the learning rate mediates optimization: finetuning with large and small steps converges to qualitatively different models. Next, we link forgetting to overtraining: learning rate decay increases the sharpness of the pretrained model, which in turn exacerbates catastrophic forgetting during SFT, leading to overtraining. Our findings paint a picture of the overtraining mechanism in LLMs and broadly contribute to the understanding of the interplay between optimization dynamics during pretraining and finetuning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper ties pretraining LR decay to sharper models and worse SFT forgetting, but leaves the sharpness mechanism under-isolated from other LR effects.


The central claim is that learning rate decay in pretraining raises model sharpness, which then drives more catastrophic forgetting during supervised fine-tuning even at matched SFT loss. This is offered as a mechanistic account of overtraining. The work is useful for showing that finetuning with large versus small steps reaches qualitatively different solutions at the same loss value, and for linking that difference back to the pretraining schedule via sharpness. That framing is a reasonable extension of existing observations on forgetting and optimization geometry. The experiments appear to track sharpness metrics and forgetting behavior across LR choices, which is concrete enough to be worth checking.

The main weakness is that the design does not clearly separate sharpness from other things LR decay changes, such as effective noise scale, momentum buildup, or basin properties unrelated to the chosen sharpness measure. Without explicit controls like noise-matched runs or direct sharpness interventions, it is hard to know whether sharpness is the load-bearing mediator or just correlated with the outcome. The abstract and stress-test note give no sign those alternatives were ruled out.

This is the sort of paper that belongs in a reading group focused on LLM optimization or post-training pipelines. Readers who care about practical training choices or about how pretraining decisions propagate to fine-tuning will find the angle worth discussing. It is coherent on its own terms and engages the literature without obvious internal contradictions, so it should go to referees rather than a desk reject. Reviewers will likely press on the causality question, but the empirical observations alone justify the time.

Referee Report

2 major / 2 minor

Summary. The manuscript investigates catastrophic overtraining in LLMs during supervised fine-tuning (SFT) by analyzing the implicit regularization effects of learning rates. It claims that, for models trained to identical SFT loss values, large versus small step sizes converge to qualitatively different solutions; separately, learning-rate decay during pretraining increases the sharpness of the pretrained model, which then amplifies catastrophic forgetting in the subsequent SFT stage and produces overtraining.

Significance. If the causal linkage holds, the work supplies a concrete mechanistic account of overtraining that connects pretraining optimization geometry to downstream forgetting. This extends existing sharpness-based explanations of generalization and supplies actionable guidance on learning-rate schedules across the pretrain–finetune boundary. The emphasis on comparing models at fixed SFT loss is a methodological strength that avoids trivial loss-based confounds.
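The fixed-loss comparison praised above can be sketched as follows: for each finetuning run, read off a forgetting metric at the same target SFT loss by interpolating along the run's logged curve. This is a minimal sketch under stated assumptions; `metric_at_loss`, the logging format, and the monotonically decreasing loss are hypothetical, and the paper's actual matching procedure is not described in this summary.

```python
def metric_at_loss(losses, metrics, target_loss):
    """Linearly interpolate a logged metric at `target_loss` for one run.

    Assumes `losses` was logged in training order and decreases
    monotonically, and that `metrics[i]` was measured at `losses[i]`.
    """
    points = list(zip(losses, metrics))
    for (loss_hi, m_hi), (loss_lo, m_lo) in zip(points, points[1:]):
        if loss_lo <= target_loss <= loss_hi:
            if loss_hi == loss_lo:
                return m_hi
            t = (loss_hi - target_loss) / (loss_hi - loss_lo)
            return m_hi + t * (m_lo - m_hi)
    raise ValueError("target_loss is outside the range covered by this run")


def forgetting_gap_at_matched_loss(run_a, run_b, target_loss):
    """Difference in the metric between two runs at the same SFT loss.

    Each run is a (losses, metrics) pair as accepted by `metric_at_loss`.
    """
    return (metric_at_loss(*run_a, target_loss)
            - metric_at_loss(*run_b, target_loss))
```

A nonzero gap at matched loss is what the fixed-loss design is meant to expose; on its own it says nothing about which property of the optimization trajectory caused it.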

major comments (2)

[§4 (Experimental Results) and §5 (Mechanism Analysis)] The central causal claim—that pretrained sharpness (induced by LR decay) is the dominant mediator of exacerbated forgetting at fixed SFT loss—remains unisolated. The manuscript must demonstrate that alternative LR-induced differences (gradient-noise scale, momentum accumulation, or basin geometry unrelated to the chosen sharpness metric) have been controlled; without such controls the observed qualitative differences cannot be attributed specifically to sharpness.
[§3 (Finetuning Dynamics) and associated figures] The abstract states that finetuning with large and small steps “converges to qualitatively different models” at the same SFT loss, yet no quantitative metric or statistical test is referenced that would establish the claimed qualitative distinction is reproducible and not an artifact of random seed or data-order effects.

minor comments (2)

[Abstract] The abstract supplies no experimental details (model scale, dataset sizes, exact LR schedules, sharpness metric, or number of runs), which makes it impossible for a reader to evaluate the strength of the reported findings from the summary alone.
[§2 (Background)] Notation for sharpness and the precise definition of “catastrophic overtraining” should be introduced with an equation or explicit reference to prior work (Springer et al., 2025) at first use.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of causal isolation and quantitative rigor that we will address in the revision. Below we respond point by point to the major comments.


Referee: [§4 (Experimental Results) and §5 (Mechanism Analysis)] The central causal claim—that pretrained sharpness (induced by LR decay) is the dominant mediator of exacerbated forgetting at fixed SFT loss—remains unisolated. The manuscript must demonstrate that alternative LR-induced differences (gradient-noise scale, momentum accumulation, or basin geometry unrelated to the chosen sharpness metric) have been controlled; without such controls the observed qualitative differences cannot be attributed specifically to sharpness.

Authors: We agree that a stronger isolation of sharpness from other LR-induced factors would improve the causal argument. Our current design already holds SFT loss fixed across step-size regimes, which removes loss-value confounds, and we measure sharpness with the same metric used in the pretraining stage. Nevertheless, we have not yet performed explicit ablations that hold momentum fixed while varying noise scale or that compare basins with matched sharpness but different curvature spectra. In the revised manuscript we will add two targeted controls: (i) an experiment that injects controlled gradient noise at fixed step size and momentum, and (ii) a comparison of models whose sharpness is matched by construction but whose optimization trajectories differ in momentum accumulation. These additions will allow us to quantify how much of the observed forgetting difference survives after the alternative factors are controlled. revision: yes
Referee: [§3 (Finetuning Dynamics) and associated figures] The abstract states that finetuning with large and small steps “converges to qualitatively different models” at the same SFT loss, yet no quantitative metric or statistical test is referenced that would establish the claimed qualitative distinction is reproducible and not an artifact of random seed or data-order effects.

Authors: We accept that the abstract’s phrasing would be more precise if supported by a quantitative measure and reproducibility checks. The figures in §3 already display consistent differences in post-SFT sharpness, forgetting curves, and downstream task performance across the two step-size regimes, but these are presented visually rather than with summary statistics. In the revision we will augment §3 with (a) a quantitative distance metric (parameter cosine similarity and layer-wise sharpness delta) between the large-step and small-step solutions at matched SFT loss, and (b) results aggregated over at least five independent random seeds with standard-error bars. These additions will be reflected in both the text and the abstract. revision: yes
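The two quantities promised in this response, parameter cosine similarity between solutions and a mean with standard error over seeds, can be sketched in a few lines. This is a minimal illustration on flat Python lists; the function names are hypothetical, and flattening all parameters into one vector is an assumption about how such a comparison would be set up.

```python
import math
import statistics


def cosine_similarity(params_a, params_b):
    """Cosine similarity between two flattened parameter vectors."""
    dot = sum(a * b for a, b in zip(params_a, params_b))
    norm_a = math.sqrt(sum(a * a for a in params_a))
    norm_b = math.sqrt(sum(b * b for b in params_b))
    return dot / (norm_a * norm_b)


def mean_and_standard_error(values):
    """Mean and standard error of the mean over per-seed measurements.

    Uses the sample standard deviation (n - 1 in the denominator),
    so at least two seeds are required.
    """
    mean = statistics.fmean(values)
    standard_error = statistics.stdev(values) / math.sqrt(len(values))
    return mean, standard_error
```

With five seeds, as proposed, `mean_and_standard_error` would be applied to the five per-seed similarity (or forgetting) values to produce the error bars.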

Circularity Check

0 steps flagged

Empirical investigation with no self-referential derivation or fitted predictions


The paper presents an empirical study of optimization dynamics in LLM pretraining and supervised fine-tuning. It reports observations that models trained to identical SFT loss values under different learning rates converge to qualitatively different solutions, and links learning-rate decay during pretraining to increased sharpness that then correlates with greater forgetting. These links are drawn directly from experimental comparisons rather than from any closed-form derivation, parameter fit presented as a prediction, or self-citation that supplies the central mechanism. No equations, ansatzes, uniqueness theorems, or renamings of known results appear in the abstract or described claims that would reduce the reported findings to their own inputs by construction. The work is therefore self-contained as an experimental investigation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The abstract introduces no new free parameters, invented entities, or ad-hoc axioms; it relies on standard deep-learning concepts of loss landscapes and implicit regularization.

axioms (1)

Domain assumption: Existence of a loss landscape whose sharpness can be meaningfully compared across models trained to the same loss value.
Invoked when linking learning rate to sharpness and forgetting.

pith-pipeline@v0.9.0 · 5473 in / 1163 out tokens · 110750 ms · 2026-05-10T13:36:59.436510+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less
cs.LG 2026-05 unverdicted novelty 6.0

Full finetuning with the pretraining optimizer reduces forgetting compared to other optimizers or LoRA while achieving comparable new-task performance.

Reference graph

Works this paper leans on

25 extracted references · 17 canonical work pages · cited by 1 Pith paper · 5 internal anchors

[1]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Maksym Andriushchenko, Francesco Croce, Maximilian Müller, Matthias Hein, and Nicolas Flammarion. A modern look at the relationship between sharpness and generalization. In International Conference on Machine Learning, pp. 840–902. PMLR, 2023a. Maksym Andriushchenko, Aditya Vardhan Varre, Loucas Pillaud-Vivien, and Nicolas Flammarion. SGD with large step ...
[2]

Context-free synthetic data mitigates forgetting

Parikshit Bansal and Sujay Sanghavi. Context-free synthetic data mitigates forgetting. arXiv preprint arXiv:2505.13811,
[3]

Universal dynamics of warmup stable decay: understanding wsd beyond transformers

Annalisa Belloni, Lorenzo Noci, and Antonio Orvieto. Universal dynamics of warmup stable decay: understanding wsd beyond transformers. In High-dimensional Learning Dynamics 2025,

2025
[4]

Implicit regularization for deep neural networks driven by an ornstein-uhlenbeck like process

Guy Blanc, Neha Gupta, Gregory Valiant, and Paul Valiant. Implicit regularization for deep neural networks driven by an ornstein-uhlenbeck like process. In Conference on Learning Theory, COLT 2020, Proceedings of Machine Learning Research. PMLR,

2020
[5]

Training dynamics impact post-training quantization robustness

https://transformer-circuits.pub/2023/monosemantic-features/index.html. Preprint. Under review. Albert Catalan-Tatjer, Niccolò Ajroldi, and Jonas Geiping. Training dynamics impact post-training quantization robustness. arXiv preprint arXiv:2510.06213,
[6]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457,
[7]

Adaptive gradient methods at the edge of stability

Jeremy Cohen, Behrooz Ghorbani, Shankar Krishnan, Naman Agarwal, Sourabh Medapati, Michal Badura, Daniel Suo, Zachary Nado, George E Dahl, and Justin Gilmer. Adaptive gradient methods at the edge of stability. In NeurIPS 2023 Workshop Heavy Tails in Machine Learning,

2023
[8]

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The Pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027,
[9]

URL https://zenodo.org/records/12608602. Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean-Bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, Gael Liu, ...
[10]

An Empirical Investigation of Catastrophic Forgetting in Gradient-Based Neural Networks

Ian J Goodfellow, Mehdi Mirza, Da Xiao, Aaron Courville, and Yoshua Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211,
[11]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948,
[12]

Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary C

Suhas Kotha and Percy Liang. Replaying pre-training data improves fine-tuning. arXiv preprint arXiv:2603.04964,
[13]

The large learning rate phase of deep learning: the catapult mechanism

Aitor Lewkowycz, Yasaman Bahri, Ethan Dyer, Jascha Sohl-Dickstein, and Guy Gur-Ari. The large learning rate phase of deep learning: the catapult mechanism. arXiv preprint arXiv:2003.02218, 2020
[14]

Revisiting catastrophic forgetting in large language model tuning

Hongyu Li, Liang Ding, Meng Fang, and Dacheng Tao. Revisiting catastrophic forgetting in large language model tuning. In Findings of the association for computational linguistics: EMNLP 2024, pp. 4297–4308,

2024
[15]

In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning

URL https://openreview.net/forum?id=Bkg6RiCqY7. Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. arXiv preprint arXiv:1412.6614,
[16]

2 OLMo 2 Furious

OLMo Team, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, et al. 2 olmo 2 furious. arXiv preprint arXiv:2501.00656,
[17]

https://arxiv.org/abs/2509.14233 (2025)

Project Apertus, Alejandro Hernández-Cano, Alexander Hägele, Allen Hao Huang, Angelika Romanou, Antoni-Joan Solergibert, Barna Pasztor, Bettina Messmer, Dhia Garbaya, Eduard Frank Ďurech, et al. Apertus: Democratizing open and compliant llms for global language environments. arXiv preprint arXiv:2509.14233,
[18]

Amuro & char: Analyzing the relationship between pre-training and fine-tuning of large language models

Kaiser Sun and Mark Dredze. Amuro & char: Analyzing the relationship between pre-training and fine-tuning of large language models. In Proceedings of the 10th Workshop on Representation Learning for NLP (RepL4NLP-2025), pp. 131–151,

2025
[19]

Ishaan Watts, Catherine Li, Sachin Goyal, Jacob Mitchell Springer, and Aditi Raghunathan

doi: 10.1109/TPAMI.2024.3367329. Ishaan Watts, Catherine Li, Sachin Goyal, Jacob Mitchell Springer, and Aditi Raghunathan. Sharpness-aware pretraining mitigates catastrophic forgetting. In Workshop on Scientific Methods for Understanding Deep Learning,
[20]

Under review

Johnny Tian-Zheng Wei, Ameya Godbole, Mohammad Aflah Khan, Ryan Wang, Xiaoyuan Zhu, James Flemings, Nitya Kashyap, Krishna P Gummadi, Willie Neiswanger, and Robin Jia. Hubble: a model suite to advance the study of llm memorization. arXiv preprint arXiv:2510.19811,
[21]

Data mining and knowledge discovery , 33(4):917–963

Association for Computational Linguistics. doi: 10.18653/v1/W17-4413. URL https://aclanthology.org/W17-4413/. Kaiyue Wen, Zhiyuan Li, Jason S Wang, David Leo Wright Hall, Percy Liang, and Tengyu Ma. Understanding warmup-stable-decay learning rates: A river valley loss landscape view. In The Thirteenth International Conference on Learning Representations,
[22]

A Walk with SGD

Chen Xing, Devansh Arpit, Christos Tsirigotis, and Yoshua Bengio. A walk with sgd. arXiv preprint arXiv:1802.08770,
[23]

Under review

A Additional Experimental Details. A.1 Supervised Finetuning. We mostly follow Springer et al. (2025) for the finetuning/evaluation design. Data. We use two SFT datasets: Anthropic HH (Bai et al.,

2025
[24]

Since Anthropic HH is originally a preference-tuning dataset, we use it for SFT by finetuning on “chosen” responses

and Tülu 3 (Lambert et al., 2025). Since Anthropic HH is originally a preference-tuning dataset, we use it for SFT by finetuning on “chosen” responses. We use the maximum context length of our models to 512 tokens. Following standard practice, we wrap the SFT examples into a chat template with special tokens <|user|> and <|assistant|> added to the tokenizer...

2025
[25]

We obtain the noised model by adding Gaussian noise $\varepsilon \sim \mathcal{N}(0, \sigma^2)$, $\sigma = 10^{-5}$, to the parameters

and calculate the average KL-divergence between the next-token predictions of the original model and the predictions of the noised model. We obtain the noised model by adding Gaussian noise $\varepsilon \sim \mathcal{N}(0, \sigma^2)$, $\sigma = 10^{-5}$, to the parameters. For each batch, we average the results over 10 random samples of $\varepsilon$. B Additional Results B.1 Verifying the Metric for Feature D...

2023
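
The excerpt under [25] describes a perturbation-based estimate: add Gaussian noise to the parameters, then average the KL divergence between the original and noised next-token predictions over 10 noise samples. A minimal sketch of that procedure follows. The toy linear-softmax model, the function names, and the batch format are hypothetical stand-ins for an LLM, and reading this quantity as the paper's sharpness estimate is an assumption, since the excerpt is truncated. Only the noise scale of 1e-5 and the 10 samples come from the text.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def toy_next_token_probs(weights, features):
    """Hypothetical stand-in for a model: one linear layer plus softmax.

    `weights` has one row per vocabulary item; `features` is one input.
    """
    logits = [sum(w * x for w, x in zip(row, features)) for row in weights]
    return softmax(logits)


def perturbation_kl(weights, batch, sigma=1e-5, n_samples=10, seed=0):
    """Average KL between original and noised predictions over a batch.

    For each of `n_samples` draws, every weight gets independent
    Gaussian noise with standard deviation `sigma`.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        noised = [[w + rng.gauss(0.0, sigma) for w in row] for row in weights]
        for features in batch:
            original = toy_next_token_probs(weights, features)
            perturbed = toy_next_token_probs(noised, features)
            total += kl_divergence(original, perturbed)
    return total / (n_samples * len(batch))
```

With noise this small the KL values sit close to floating-point resolution, so a real implementation would likely need care with numerical precision; the sketch ignores that.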