pith. machine review for the scientific record.

arxiv: 2604.13627 · v1 · submitted 2026-04-15 · 💻 cs.LG · cs.CL

Recognition: unknown

(How) Learning Rates Regulate Catastrophic Overtraining


Pith reviewed 2026-05-10 13:36 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords catastrophic overtraining · supervised fine-tuning · learning rate decay · model sharpness · catastrophic forgetting · implicit regularization · pretraining dynamics · LLM post-training

The pith

Learning rate decay during pretraining increases model sharpness and thereby worsens catastrophic forgetting when the model later undergoes supervised fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates why supervised fine-tuning on instructions can erase capabilities that a language model acquired during pretraining, especially after extended pretraining runs. It shows that the learning rate schedule used in pretraining shapes the geometry of the solution the model reaches, with decaying rates producing sharper minima. Sharper models then lose more of their original behavior once fine-tuned to the same instruction-following loss, producing the overtraining effect. A reader would care because this link supplies a concrete handle on a widespread practical problem in scaling language models.

Core claim

Two models fine-tuned to the same supervised fine-tuning loss, one with large steps and one with small steps, reach qualitatively different parameter configurations. Learning-rate decay during pretraining raises the sharpness of the resulting model, and this increased sharpness amplifies the catastrophic forgetting that occurs during subsequent fine-tuning, producing overtraining.

What carries the argument

The implicit regularization induced by the learning-rate schedule, which controls the sharpness of the pretrained solution and thereby mediates how much pretraining knowledge is lost during later fine-tuning to the same loss value.
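
One standard intuition for how step size interacts with sharpness can be sketched on a toy problem. This is offered as an illustration, not as the paper's own argument: on a one-dimensional quadratic with curvature λ, plain gradient descent with step size η contracts only when η < 2/λ, so a large step cannot settle in a sharp direction that a small step tolerates. The function names and the numeric values below are hypothetical.

```python
def gd_on_quadratic(curvature, lr, steps=50, x0=1.0):
    """Run plain gradient descent on f(x) = 0.5 * curvature * x**2.

    Returns |x| after `steps` updates. Each update multiplies x by
    (1 - lr * curvature), so the iterate shrinks only if lr * curvature < 2.
    """
    x = x0
    for _ in range(steps):
        x = x - lr * curvature * x  # gradient of f at x is curvature * x
    return abs(x)


def max_stable_curvature(lr):
    """Largest curvature that plain GD with step size `lr` can tolerate."""
    return 2.0 / lr


if __name__ == "__main__":
    sharp_curvature = 10.0  # hypothetical sharp direction
    for lr in (0.5, 0.05):  # hypothetical large and small step sizes
        final = gd_on_quadratic(sharp_curvature, lr)
        print(f"lr={lr}: stable up to curvature {max_stable_curvature(lr)}, |x| = {final:.3e}")
```

Under this toy view, lowering the step size raises the curvature ceiling, which is one way a decay phase could admit sharper solutions; whether this is the operative mechanism in LLM pretraining is what the paper's experiments address.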

Load-bearing premise

Observed differences in forgetting between models that reach the same fine-tuning loss under different learning rates are caused by the learning rate's effect on sharpness rather than by other uncontrolled aspects of the optimization trajectory.

What would settle it

Train two models to the same sharpness value using unrelated learning-rate schedules and check whether they exhibit identical rates of capability loss once fine-tuned to the same instruction loss.
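
Matching models on sharpness requires a concrete measurement. The appendix excerpt quoted in the reference graph on this page describes adding Gaussian noise with σ = 10⁻⁵ to the parameters and averaging the KL divergence between the original and noised next-token predictions over 10 noise samples; it appears to be the paper's sharpness estimate, although the excerpt is truncated. A minimal sketch of that procedure follows, assuming a hypothetical linear-softmax model in place of an LLM; `noise_kl_sharpness` and its arguments are illustrative names, not the authors' code.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def noise_kl_sharpness(weights, inputs, sigma=1e-5, n_samples=10, seed=0):
    """Average KL(original || noised) under Gaussian parameter noise.

    `weights` has shape (d, k) and `inputs` has shape (n, d); together they
    define a hypothetical linear-softmax model standing in for a next-token head.
    """
    rng = np.random.default_rng(seed)
    base_logp = log_softmax(inputs @ weights)
    base_p = np.exp(base_logp)
    kls = []
    for _ in range(n_samples):
        noised = weights + rng.normal(0.0, sigma, size=weights.shape)
        noised_logp = log_softmax(inputs @ noised)
        # KL divergence per example, then the mean over the batch
        kls.append((base_p * (base_logp - noised_logp)).sum(axis=-1).mean())
    return float(np.mean(kls))
```

Two checkpoints would count as matched when this value agrees within a chosen tolerance on the same evaluation batch and the same noise seed.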

Figures

Figures reproduced from arXiv: 2604.13627 by Aditya Varre, Mark Rofin, Nicolas Flammarion.

Figure 1: SFT training loss vs OOD performance for different models and LRs. Checkpoints ...
Figure 2: SFT training loss vs MPA between the representations of the base and finetuned ...
Figure 3: Loss landscape analysis for OLMo 1 1B. Panels ...
Figure 4: Finetuning with Diagonal Networks. We train a diagonal network with ...
Figure 5: Evolution of model sharpness throughout pretraining (x-axis reflects pretraining ...
Figure 6: The dynamic of accuracy drop (the difference between the OOD score between ...
Figure 7: The principal angles between representations of base and finetuned SmolLM3-3B ...
Figure 8: L2 on SAE representations vs MPA for Gemma 3 1B.
Figure 9: L2 on SAE representations vs SFT train loss for Gemma 3 1B.
Figure 10: Left: dynamics of estimated sharpness and true sharpness closely follow each ...
Figure 11: SFT train loss vs OOD score during supervised finetuning (same as Figure ...
Figure 12: MPA vs SFT train loss during supervised finetuning (same as Figure ...
Figure 13: MPA vs SFT train loss during supervised finetuning (same as Figures ...
Figure 14: Loss landscape analysis (same as Figure ...
Figure 15: MPA as a function of stepsize with variable learning rates (analogous to Figures ...
Figure 16: MPA as a function of stepsize with variable checkpoints (analogous to Figure ...
Figure 17: MPA as a function of stepsize with variable checkpoints (analogous to Figure ...
Figure 18: Forgetting (the drop in OOD score between the base and finetuned models) ...
Figure 19: The principal angles between representations of base and finetuned models as a ...
Figure 20: Same as Figures 7 and 19, but for the models finetuned on Tülu 3.
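
Several captions above refer to principal angles and to MPA between base and finetuned representations. The page does not spell out the abbreviation, so the sketch below assumes MPA is the mean of the principal angles between two representation subspaces. It uses the textbook construction (orthonormalize each basis, then take the arccosine of the singular values of their cross-product); the function names are hypothetical and the paper's exact recipe may differ.

```python
import numpy as np


def principal_angles(basis_a, basis_b):
    """Principal angles (radians) between the column spaces of two matrices.

    Both inputs have shape (d, k) and are assumed to have full column rank.
    """
    q_a, _ = np.linalg.qr(basis_a)
    q_b, _ = np.linalg.qr(basis_b)
    # Singular values of Q_a^T Q_b are the cosines of the principal angles.
    cosines = np.linalg.svd(q_a.T @ q_b, compute_uv=False)
    return np.arccos(np.clip(cosines, -1.0, 1.0))


def mean_principal_angle(basis_a, basis_b):
    """Assumed reading of MPA: the average of all principal angles."""
    return float(principal_angles(basis_a, basis_b).mean())


if __name__ == "__main__":
    eye = np.eye(4)
    print(mean_principal_angle(eye[:, :2], eye[:, :2]))  # identical subspaces
    print(mean_principal_angle(eye[:, :2], eye[:, 2:]))  # orthogonal subspaces
```

The arccosine loses precision for very small angles, so a production version might prefer a sine-based formulation; the simple form is kept here for readability.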
Original abstract

Supervised fine-tuning (SFT) is a common first stage of LLM post-training, teaching the model to follow instructions and shaping its behavior as a helpful assistant. At the same time, SFT may harm the fundamental capabilities of an LLM, particularly after long pretraining: a phenomenon known as catastrophic overtraining (Springer et al., 2025). To understand overtraining, we first investigate catastrophic forgetting in finetuning through the lens of implicit regularization of the learning rate. For models trained to the same SFT loss, we identify how the learning rate mediates optimization: finetuning with large and small steps converges to qualitatively different models. Next, we link forgetting to overtraining: learning rate decay increases the sharpness of the pretrained model, which in turn exacerbates catastrophic forgetting during SFT, leading to overtraining. Our findings paint a picture of the overtraining mechanism in LLMs and broadly contribute to the understanding of the interplay between optimization dynamics during pretraining and finetuning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript investigates catastrophic overtraining in LLMs during supervised fine-tuning (SFT) by analyzing the implicit regularization effects of learning rates. It claims that, for models trained to identical SFT loss values, large versus small step sizes converge to qualitatively different solutions; separately, learning-rate decay during pretraining increases the sharpness of the pretrained model, which then amplifies catastrophic forgetting in the subsequent SFT stage and produces overtraining.

Significance. If the causal linkage holds, the work supplies a concrete mechanistic account of overtraining that connects pretraining optimization geometry to downstream forgetting. This extends existing sharpness-based explanations of generalization and supplies actionable guidance on learning-rate schedules across the pretrain–finetune boundary. The emphasis on comparing models at fixed SFT loss is a methodological strength that avoids trivial loss-based confounds.

major comments (2)
  1. [§4 (Experimental Results) and §5 (Mechanism Analysis)] The central causal claim—that pretrained sharpness (induced by LR decay) is the dominant mediator of exacerbated forgetting at fixed SFT loss—remains unisolated. The manuscript must demonstrate that alternative LR-induced differences (gradient-noise scale, momentum accumulation, or basin geometry unrelated to the chosen sharpness metric) have been controlled; without such controls the observed qualitative differences cannot be attributed specifically to sharpness.
  2. [§3 (Finetuning Dynamics) and associated figures] The abstract states that finetuning with large and small steps “converges to qualitatively different models” at the same SFT loss, yet no quantitative metric or statistical test is referenced that would establish the claimed qualitative distinction is reproducible and not an artifact of random seed or data-order effects.
minor comments (2)
  1. [Abstract] The abstract supplies no experimental details (model scale, dataset sizes, exact LR schedules, sharpness metric, or number of runs), which makes it impossible for a reader to evaluate the strength of the reported findings from the summary alone.
  2. [§2 (Background)] Notation for sharpness and the precise definition of “catastrophic overtraining” should be introduced with an equation or explicit reference to prior work (Springer et al., 2025) at first use.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of causal isolation and quantitative rigor that we will address in the revision. Below we respond point by point to the major comments.

Point-by-point responses
  1. Referee: [§4 (Experimental Results) and §5 (Mechanism Analysis)] The central causal claim—that pretrained sharpness (induced by LR decay) is the dominant mediator of exacerbated forgetting at fixed SFT loss—remains unisolated. The manuscript must demonstrate that alternative LR-induced differences (gradient-noise scale, momentum accumulation, or basin geometry unrelated to the chosen sharpness metric) have been controlled; without such controls the observed qualitative differences cannot be attributed specifically to sharpness.

    Authors: We agree that a stronger isolation of sharpness from other LR-induced factors would improve the causal argument. Our current design already holds SFT loss fixed across step-size regimes, which removes loss-value confounds, and we measure sharpness with the same metric used in the pretraining stage. Nevertheless, we have not yet performed explicit ablations that hold momentum fixed while varying noise scale or that compare basins with matched sharpness but different curvature spectra. In the revised manuscript we will add two targeted controls: (i) an experiment that injects controlled gradient noise at fixed step size and momentum, and (ii) a comparison of models whose sharpness is matched by construction but whose optimization trajectories differ in momentum accumulation. These additions will allow us to quantify how much of the observed forgetting difference survives after the alternative factors are controlled.

    Revision: yes

  2. Referee: [§3 (Finetuning Dynamics) and associated figures] The abstract states that finetuning with large and small steps “converges to qualitatively different models” at the same SFT loss, yet no quantitative metric or statistical test is referenced that would establish the claimed qualitative distinction is reproducible and not an artifact of random seed or data-order effects.

    Authors: We accept that the abstract’s phrasing would be more precise if supported by a quantitative measure and reproducibility checks. The figures in §3 already display consistent differences in post-SFT sharpness, forgetting curves, and downstream task performance across the two step-size regimes, but these are presented visually rather than with summary statistics. In the revision we will augment §3 with (a) a quantitative distance metric (parameter cosine similarity and layer-wise sharpness delta) between the large-step and small-step solutions at matched SFT loss, and (b) results aggregated over at least five independent random seeds with standard-error bars. These additions will be reflected in both the text and the abstract.

    Revision: yes
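
The two quantities promised in this response, parameter cosine similarity and seed-aggregated standard errors, are simple to compute. The sketch below is a minimal illustration under stated assumptions: each model is represented as a list of NumPy arrays, and the helper names are hypothetical rather than taken from the authors' code.

```python
import numpy as np


def parameter_cosine(params_a, params_b):
    """Cosine similarity between two models' flattened parameter vectors.

    Each argument is a list of arrays with matching shapes, one per tensor.
    """
    flat_a = np.concatenate([np.ravel(p) for p in params_a])
    flat_b = np.concatenate([np.ravel(p) for p in params_b])
    return float(flat_a @ flat_b / (np.linalg.norm(flat_a) * np.linalg.norm(flat_b)))


def mean_and_stderr(values):
    """Mean and standard error of the mean over independent seeds."""
    arr = np.asarray(values, dtype=float)
    # ddof=1 gives the unbiased sample standard deviation.
    return float(arr.mean()), float(arr.std(ddof=1) / np.sqrt(len(arr)))
```

With five seeds, `mean_and_stderr` would be applied to the five per-seed forgetting scores (or cosine similarities) for each step-size regime, and the resulting pairs plotted as error bars.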

Circularity Check

0 steps flagged

Empirical investigation with no self-referential derivation or fitted predictions


The paper presents an empirical study of optimization dynamics in LLM pretraining and supervised fine-tuning. It reports observations that models trained to identical SFT loss values under different learning rates converge to qualitatively different solutions, and links learning-rate decay during pretraining to increased sharpness that then correlates with greater forgetting. These links are drawn directly from experimental comparisons rather than from any closed-form derivation, parameter fit presented as a prediction, or self-citation that supplies the central mechanism. No equations, ansatzes, uniqueness theorems, or renamings of known results appear in the abstract or described claims that would reduce the reported findings to their own inputs by construction. The work is therefore self-contained as an experimental investigation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The abstract introduces no new free parameters, invented entities, or ad-hoc axioms; it relies on standard deep-learning concepts of loss landscapes and implicit regularization.

axioms (1)
  • domain assumption: Existence of a loss landscape whose sharpness can be meaningfully compared across models trained to the same loss value.
    Invoked when linking learning rate to sharpness and forgetting.

pith-pipeline@v0.9.0 · 5473 in / 1163 out tokens · 110750 ms · 2026-05-10T13:36:59.436510+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less

    cs.LG · 2026-05 · unverdicted · novelty 6.0

    Full finetuning with the pretraining optimizer reduces forgetting compared to other optimizers or LoRA while achieving comparable new-task performance.

Reference graph

Works this paper leans on

25 extracted references · 17 canonical work pages · cited by 1 Pith paper · 5 internal anchors

  1. [1]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Maksym Andriushchenko, Francesco Croce, Maximilian Müller, Matthias Hein, and Nicolas Flammarion. A modern look at the relationship between sharpness and generalization. In International Conference on Machine Learning, pp. 840–902. PMLR, 2023a. Maksym Andriushchenko, Aditya Vardhan Varre, Loucas Pillaud-Vivien, and Nicolas Flammarion. SGD with large step ...

  2. [2]

    Context-free synthetic data mitigates forgetting. arXiv preprint arXiv:2505.13811,

    Parikshit Bansal and Sujay Sanghavi. Context-free synthetic data mitigates forgetting. arXiv preprint arXiv:2505.13811,

  3. [3]

    Universal dynamics of warmup stable decay: understanding wsd beyond transformers

    Annalisa Belloni, Lorenzo Noci, and Antonio Orvieto. Universal dynamics of warmup stable decay: understanding wsd beyond transformers. In High-dimensional Learning Dynamics 2025,

  4. [4]

    Implicit regularization for deep neural networks driven by an ornstein-uhlenbeck like process

    Guy Blanc, Neha Gupta, Gregory Valiant, and Paul Valiant. Implicit regularization for deep neural networks driven by an ornstein-uhlenbeck like process. In Conference on Learning Theory, COLT 2020, Proceedings of Machine Learning Research. PMLR,

  5. [5]

    Training dynamics impact post-training quantization robustness. arXiv preprint arXiv:2510.06213,

    https://transformer-circuits.pub/2023/monosemantic-features/index.html. Albert Catalan-Tatjer, Niccolò Ajroldi, and Jonas Geiping. Training dynamics impact post-training quantization robustness. arXiv preprint arXiv:2510.06213,

  6. [6]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457,

  7. [7]

    Adaptive gradient methods at the edge of stability

    Jeremy Cohen, Behrooz Ghorbani, Shankar Krishnan, Naman Agarwal, Sourabh Medapati, Michal Badura, Daniel Suo, Zachary Nado, George E Dahl, and Justin Gilmer. Adaptive gradient methods at the edge of stability. In NeurIPS 2023 Workshop Heavy Tails in Machine Learning,

  8. [8]

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The Pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027,

  9. [9]

    URL https://zenodo.org/records/12608602. Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean-Bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, Gael Liu, ...

  10. [10]

    An Empirical Investigation of Catastrophic Forgetting in Gradient-Based Neural Networks

    Ian J Goodfellow, Mehdi Mirza, Da Xiao, Aaron Courville, and Yoshua Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211,

  11. [11]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948,

  12. [12]

    Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary C

    Suhas Kotha and Percy Liang. Replaying pre-training data improves fine-tuning. arXiv preprint arXiv:2603.04964,

  13. [13]

    The large learning rate phase of deep learning: the catapult mechanism. arXiv preprint arXiv:2003.02218, 2020

    Aitor Lewkowycz, Yasaman Bahri, Ethan Dyer, Jascha Sohl-Dickstein, and Guy Gur-Ari. The large learning rate phase of deep learning: the catapult mechanism. arXiv preprint arXiv:2003.02218,

  14. [14]

    Revisiting catastrophic forgetting in large language model tuning

    Hongyu Li, Liang Ding, Meng Fang, and Dacheng Tao. Revisiting catastrophic forgetting in large language model tuning. In Findings of the association for computational linguistics: EMNLP 2024, pp. 4297–4308,

  15. [15]

    In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning

    URL https://openreview.net/forum?id=Bkg6RiCqY7. Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. arXiv preprint arXiv:1412.6614,

  16. [16]

    2 OLMo 2 Furious

    OLMo Team, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, et al. 2 olmo 2 furious. arXiv preprint arXiv:2501.00656,

  17. [17]

    https://arxiv.org/abs/2509.14233 (2025)

    Project Apertus, Alejandro Hernández-Cano, Alexander Hägele, Allen Hao Huang, Angelika Romanou, Antoni-Joan Solergibert, Barna Pasztor, Bettina Messmer, Dhia Garbaya, Eduard Frank Ďurech, et al. Apertus: Democratizing open and compliant llms for global language environments. arXiv preprint arXiv:2509.14233,

  18. [18]

    Amuro & char: Analyzing the relationship between pre-training and fine-tuning of large language models

    Kaiser Sun and Mark Dredze. Amuro & char: Analyzing the relationship between pre-training and fine-tuning of large language models. In Proceedings of the 10th Workshop on Representation Learning for NLP (RepL4NLP-2025), pp. 131–151,

  19. [19]

    Ishaan Watts, Catherine Li, Sachin Goyal, Jacob Mitchell Springer, and Aditi Raghunathan

    doi: 10.1109/TPAMI.2024.3367329. Ishaan Watts, Catherine Li, Sachin Goyal, Jacob Mitchell Springer, and Aditi Raghunathan. Sharpness-aware pretraining mitigates catastrophic forgetting. In Workshop on Scientific Methods for Understanding Deep Learning,

  20. [20]

    Under review

    Johnny Tian-Zheng Wei, Ameya Godbole, Mohammad Aflah Khan, Ryan Wang, Xiaoyuan Zhu, James Flemings, Nitya Kashyap, Krishna P Gummadi, Willie Neiswanger, and Robin Jia. Hubble: a model suite to advance the study of llm memorization. arXiv preprint arXiv:2510.19811,

  21. [21]

    Data mining and knowledge discovery , 33(4):917–963

    Association for Computational Linguistics. doi: 10.18653/v1/W17-4413. URL https://aclanthology.org/W17-4413/. Kaiyue Wen, Zhiyuan Li, Jason S Wang, David Leo Wright Hall, Percy Liang, and Tengyu Ma. Understanding warmup-stable-decay learning rates: A river valley loss landscape view. In The Thirteenth International Conference on Learning Representations,

  22. [22]

    A Walk with SGD

    Chen Xing, Devansh Arpit, Christos Tsirigotis, and Yoshua Bengio. A walk with sgd. arXiv preprint arXiv:1802.08770,

  23. [23]

    Under review

    A Additional Experimental Details. A.1 Supervised Finetuning. We mostly follow Springer et al. (2025) for the finetuning/evaluation design. Data. We use two SFT datasets: Anthropic HH (Bai et al.,

  24. [24]

    Since Anthropic HH is originally a preference-tuning dataset, we use it for SFT by finetuning on “chosen” responses

    and Tülu 3 (Lambert et al., 2025). Since Anthropic HH is originally a preference-tuning dataset, we use it for SFT by finetuning on “chosen” responses. We set the maximum context length of our models to 512 tokens. Following standard practice, we wrap the SFT examples into a chat template with special tokens <|user|> and <|assistant|> added to the tokenizer...

  25. [25]

    We obtain the noised model by adding Gaussian noise ε ∼ N(0, σ²), σ = 10⁻⁵ to the parameters

    and calculate the average KL-divergence between the next-token predictions of the original model and the predictions of the noised model. We obtain the noised model by adding Gaussian noise ε ∼ N(0, σ²), σ = 10⁻⁵ to the parameters. For each batch, we average the results over 10 random samples of ε. B Additional Results B.1 Verifying the Metric for Feature D...