
arxiv: 2605.20005 · v1 · pith:MJJ564QCnew · submitted 2026-05-19 · 💻 cs.LG

Fine-Tuning Without Forgetting via Loss-Adaptive Learning Rates

Pith reviewed 2026-05-20 07:10 UTC · model grok-4.3

classification 💻 cs.LG
keywords catastrophic forgetting · fine-tuning · learning rate schedule · large language models · loss-adaptive optimization · knowledge preservation · continual learning

The pith

Lowering the learning rate on high-loss batches during fine-tuning cuts catastrophic forgetting by 93% on average while maintaining task performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper argues that catastrophic forgetting in fine-tuning large language models can be substantially mitigated by a loss-adaptive learning rate that is lower for high-loss batches. The key insight is that the amount of forgetting per step is bounded by the learning rate multiplied by the square root of the training loss, which makes high-loss batches particularly risky for overwriting prior knowledge. FINCH, the proposed schedule, lowers the learning rate on such batches and raises it as the model converges; it achieves the reported reduction in forgetting across multiple benchmarks without altering the fine-tuning objective itself. This matters because it offers a simple way to adapt models to new data while keeping their pre-trained capabilities intact, which is essential for practical deployment in evolving domains.

Core claim

We identify a simple mechanism for controlling forgetting: per-step forgetting is bounded by the product of the learning rate and the square root of the current training loss. This suggests that high-loss batches are especially prone to inducing forgetting. Motivated by this observation, we introduce FINCH, a loss-adaptive learning-rate schedule that reduces the learning rate on high-loss batches and increases it as the model converges, while leaving the fine-tuning objective unchanged. Across benchmarks, FINCH reduces forgetting by 93% on average while matching the task performance of standard fine-tuning.

What carries the argument

FINCH, the loss-adaptive learning-rate schedule that lowers the learning rate for batches with high current training loss to limit the per-step forgetting bound.
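
To make the idea concrete, here is a minimal sketch of a loss-adaptive rule. The source does not give FINCH's actual formula, so the rule below is an assumption made for illustration: it caps the product of learning rate and square root of loss at a fixed per-step budget, which lowers the rate on high-loss batches and raises it, up to a ceiling, as the loss falls. The names `loss_adaptive_lr`, `budget`, `lr_max`, and `eps` are hypothetical.

```python
import math


def loss_adaptive_lr(loss: float, budget: float, lr_max: float, eps: float = 1e-8) -> float:
    """Return a learning rate with lr * sqrt(loss) <= budget (illustrative rule only)."""
    if loss < 0:
        raise ValueError("loss must be non-negative")
    # Hypothetical rule, not the paper's formula: scale lr like 1 / sqrt(loss).
    lr = budget / (math.sqrt(loss) + eps)
    # Cap at lr_max so the rate stays finite as the loss approaches zero.
    return min(lr, lr_max)


# Batch losses falling over training: the rate rises as the model converges.
for batch_loss in [4.0, 1.0, 0.25, 0.01]:
    lr = loss_adaptive_lr(batch_loss, budget=1e-5, lr_max=5e-5)
    product = lr * math.sqrt(batch_loss)
    print(f"loss={batch_loss:<5} lr={lr:.2e} lr*sqrt(loss)={product:.2e}")
```

In a real training loop, the returned value would be written into the optimizer's learning rate before each step. Whether the paper clips, smooths, or normalizes the batch loss differently is not stated in the material above.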

If this is right

  • Across knowledge acquisition, science, and low-resource language adaptation benchmarks, forgetting is reduced by 93% on average while task performance matches standard fine-tuning.
  • On Qwen3-4B knowledge acquisition, TruthfulQA degradation is cut by 5x and HaluEval degradation is reversed.
  • Confidence calibration is better preserved compared to standard fine-tuning.
  • Learning-rate schedules can shape model behavior during fine-tuning beyond just target-task optimization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The bound could motivate real-time loss monitoring to dynamically adjust rates in any sequential training setting where earlier capabilities must be retained.
  • Similar adaptive schedules might stabilize training from scratch by down-weighting early high-loss phases to avoid unstable updates.
  • If the mechanism generalizes, it could reduce reliance on explicit regularization terms in continual learning pipelines.

Load-bearing premise

Per-step forgetting during fine-tuning is bounded by the product of the learning rate and the square root of the current training loss.

What would settle it

An experiment showing that observed forgetting after a fine-tuning step on a high-loss batch exceeds the predicted bound of learning rate times square root of loss, or that lowering the rate on such batches fails to reduce overall forgetting compared to a fixed schedule.
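
One way to prototype such a check is on a toy problem rather than an LLM: run gradient descent on a one-dimensional quadratic "new-task" loss, measure the change in a Lipschitz "old-task" loss at each step, and compare it with a bound of the form G · lr · sqrt(2 · beta · loss). Everything here is an assumption for illustration: the toy losses, the constants `curvature`, `beta`, and `G`, and plain gradient descent standing in for whatever optimizer the paper uses.

```python
import math

curvature = 2.0   # second derivative of the toy new-task loss
beta = 3.0        # any smoothness constant >= curvature is valid
G = 1.0           # Lipschitz constant of the toy old-task loss
target_new, target_old = 2.0, 0.0


def new_loss(theta: float) -> float:
    return 0.5 * curvature * (theta - target_new) ** 2


def new_grad(theta: float) -> float:
    return curvature * (theta - target_new)


def old_loss(theta: float) -> float:
    # G * |theta - target_old| changes by at most G * |step| per update.
    return G * abs(theta - target_old)


def run(lr: float, steps: int = 20):
    """Record (new-task loss, per-step forgetting, bound) for each update."""
    theta, records = target_old, []
    for _ in range(steps):
        loss = new_loss(theta)
        next_theta = theta - lr * new_grad(theta)
        forgetting = old_loss(next_theta) - old_loss(theta)
        bound = G * lr * math.sqrt(2.0 * beta * loss)
        records.append((loss, forgetting, bound))
        theta = next_theta
    return records


records = run(lr=0.1)
violations = [r for r in records if r[1] > r[2] + 1e-12]
print(f"steps={len(records)} violations={len(violations)}")
```

On a real model the constants are unknown, so a practical version would log the ratio of measured forgetting to lr · sqrt(loss) per step and look for ratios that grow without limit on high-loss batches.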

Figures

Figures reproduced from arXiv: 2605.20005 by Aldan Creo, Babak Salimi, Jiongli Zhu, Parjanya Prajakta Prashant.

Figure 1: Overview of FINCH. Results are shown for Qwen3-4B on knowledge acquisition; full experimental details are given in Section 4 and Appendix B. (a) We show normalized new-task accuracy, normalized old-task accuracy, and learning rate over training for standard SFT and FINCH. Norm. Acc. denotes min-max normalized accuracy: for each accuracy curve type, we set the minimum value attained by either SFT or FINCH o…

Figure 2: Task accuracy (or win-tie rate) vs. average benchmark

Figure 3: Truthfulness, hallucination, and calibration vs. task accuracy on knowledge acquisition

Figure 4: Truthfulness, hallucination, and calibration vs. task accuracy on Science (Qwen3-4B).

Figure 5: Truthfulness, hallucination, and calibration vs. win-tie rate on Galician (Qwen3-4B).
Original abstract

Fine-tuning large language models on new data improves task performance but degrades capabilities learned during pretraining, a phenomenon known as catastrophic forgetting. Existing methods mitigate this by modifying the fine-tuning objective to suppress high-loss tokens or sequences, but these tokens are essential for learning new tasks, especially those with poor pretraining coverage. In such settings, hard tokens should still contribute to learning, so forgetting must be controlled without suppressing them. We identify a simple mechanism for doing so: per-step forgetting is bounded by the product of the learning rate and the square root of the current training loss. This suggests that high-loss batches are especially prone to inducing forgetting. Motivated by this observation, we introduce FINCH, a loss-adaptive learning-rate schedule that reduces the learning rate on high-loss batches and increases it as the model converges, while leaving the fine-tuning objective unchanged. Across knowledge acquisition, science, and low-resource language adaptation benchmarks, FINCH reduces forgetting by 93% on average while matching the task performance of standard fine-tuning. On Qwen3-4B knowledge acquisition, FINCH cuts TruthfulQA degradation by 5x and reverses HaluEval degradation, while better preserving confidence calibration. Overall, our results show that learning-rate schedules are an effective tool to shape model behavior during fine-tuning, beyond just target-task optimization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes FINCH, a loss-adaptive learning-rate schedule for fine-tuning LLMs that reduces the learning rate on high-loss batches. It claims that per-step forgetting is bounded by the product of the learning rate and the square root of the current training loss, motivating the schedule while leaving the fine-tuning objective unchanged. Across knowledge acquisition, science, and low-resource language adaptation benchmarks, FINCH is reported to reduce forgetting by 93% on average while matching standard fine-tuning task performance, with specific gains on Qwen3-4B including a 5x cut in TruthfulQA degradation and reversal of HaluEval degradation.

Significance. If the per-step forgetting bound is rigorously derived and the empirical controls are sound, the work shows that learning-rate schedules alone can shape fine-tuning behavior to preserve pretraining capabilities without suppressing hard tokens or modifying the loss. This is a lightweight alternative to existing forgetting-mitigation techniques and could be broadly useful for stable LLM adaptation.

major comments (1)
  1. [Abstract] The claim that per-step forgetting is bounded by LR × √(current training loss) is presented as the key mechanism motivating FINCH, yet no derivation, inequality, or set of assumptions is supplied. Without this, it is impossible to assess whether the bound remains valid under the distribution shift that occurs during fine-tuning on out-of-distribution data, which directly undermines the justification for the adaptive schedule and the reported 93% forgetting reduction.
minor comments (2)
  1. The abstract refers to 'knowledge acquisition, science, and low-resource language adaptation benchmarks' but does not list the concrete datasets, number of runs, or statistical tests used to support the average 93% reduction.
  2. It would be helpful to include a short proof sketch or inequality chain for the claimed forgetting bound in the main text or appendix so that readers can verify the √loss dependence.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading of the manuscript and for identifying an important point regarding the theoretical motivation. We address the major comment below and will revise the paper accordingly.

  1. Referee: [Abstract] The claim that per-step forgetting is bounded by LR × √(current training loss) is presented as the key mechanism motivating FINCH, yet no derivation, inequality, or set of assumptions is supplied. Without this, it is impossible to assess whether the bound remains valid under the distribution shift that occurs during fine-tuning on out-of-distribution data, which directly undermines the justification for the adaptive schedule and the reported 93% forgetting reduction.

    Authors: We acknowledge that the abstract states the bound without supplying the derivation or explicit assumptions, which limits the ability to evaluate its validity under distribution shift. The manuscript motivates the bound via the observation that the per-step parameter update magnitude scales with the learning rate and that the gradient norm is controlled by the current loss value (via standard inequalities relating loss to gradient under smoothness assumptions). However, a complete step-by-step derivation with listed assumptions was not included. We will add a dedicated paragraph (or short subsection) in the revised manuscript that states the bound formally, lists the assumptions (e.g., L-smoothness of the loss and bounded gradient norms), and discusses its role as a heuristic motivation rather than a strict guarantee throughout training. We will also note that the schedule remains beneficial even if the bound loosens under shift, because it still down-weights updates on high-loss batches. This revision will strengthen the justification while preserving all empirical claims.

    Revision: yes
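
The argument outlined in this response can be sketched as an inequality chain. This is a reconstruction under assumptions the source does not confirm: plain gradient descent with step size $\eta_t$, a non-negative $\beta$-smooth fine-tuning loss $L$, and a pretraining loss $L_{\mathrm{pre}}$ that is $G$-Lipschitz in the parameters. The symbols $\beta$, $G$, and $L_{\mathrm{pre}}$ are introduced here for illustration only.

```latex
\begin{align*}
\|\theta_{t+1} - \theta_t\| &= \eta_t \,\|\nabla L(\theta_t)\|
  && \text{(gradient descent update)} \\
\|\nabla L(\theta_t)\|^2 &\le 2\beta\, L(\theta_t)
  && \text{(non-negative, $\beta$-smooth loss)} \\
L_{\mathrm{pre}}(\theta_{t+1}) - L_{\mathrm{pre}}(\theta_t)
  &\le G\,\|\theta_{t+1} - \theta_t\|
  \le G\sqrt{2\beta}\;\eta_t \sqrt{L(\theta_t)}
  && \text{($G$-Lipschitz pretraining loss)}
\end{align*}
```

With adaptive optimizers such as Adam the first line no longer holds as an equality, which is one reason the chain is better read as heuristic motivation than as a guarantee.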

Circularity Check

0 steps flagged

No significant circularity; bound and schedule are independently motivated

The paper derives the per-step forgetting bound from the parameter update rule combined with a definition of forgetting (increase in pretraining loss) and presents it as a first-principles mechanism. The FINCH schedule is then constructed directly from that bound without fitting parameters to the target forgetting metric or renaming an observed pattern. No self-citation chains, self-definitional steps, or fitted-input-called-prediction reductions appear in the derivation. Empirical results on benchmarks provide independent falsifiable content outside the motivating inequality. The central claim therefore remains self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the stated bound between forgetting, learning rate, and loss; no free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption: per-step forgetting is bounded by the product of the learning rate and the square root of the current training loss
    This observation is presented as the foundation for the loss-adaptive schedule.




Reference graph

Works this paper leans on

84 extracted references · 84 canonical work pages · 16 internal anchors

  1. [1]

    On-policy distillation of language models: Learning from self-generated mistakes

    Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. InThe twelfth international conference on learning representations, 2024

  2. [2]

    Context-free synthetic data mitigates forgetting.arXiv preprint arXiv:2505.13811, 2025

    Parikshit Bansal and Sujay Sanghavi. Context-free synthetic data mitigates forgetting.arXiv preprint arXiv:2505.13811, 2025

  3. [3]

    LoRA learns less and forgets less.arXiv preprint arXiv:2405.09673, 2024

    Dan Biderman, Jacob Portes, Jose Javier Gonzalez Ortiz, Mansheej Paul, Philip Greengard, Connor Jennings, Daniel King, Sam Havens, Vitaliy Chiley, Jonathan Frankle, et al. Lora learns less and forgets less.arXiv preprint arXiv:2405.09673, 2024

  4. [4]

    Continual memorization of factoids in language models.arXiv preprint arXiv:2411.07175, 2024

    Howard Chen, Jiayi Geng, Adithya Bhaskar, Dan Friedman, and Danqi Chen. Continual memorization of factoids in language models.arXiv preprint arXiv:2411.07175, 2024

  5. [5]

    Monolingual or multilingual instruction tuning: Which makes a better alpaca

    Pinzhen Chen, Shaoxiong Ji, Nikolay Bogoychev, Andrey Kutuzov, Barry Haddow, and Kenneth Heafield. Monolingual or multilingual instruction tuning: Which makes a better alpaca. In Findings of the Association for Computational Linguistics: EACL 2024, pages 1347–1356, 2024

  6. [6]

    Language modeling with gated convolutional networks

    Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional networks. InInternational conference on machine learning, pages 933–941. PMLR, 2017

  7. [7]

    Episodic memory in lifelong language learning.Advances in Neural Information Processing Systems, 32, 2019

    Cyprien de Masson D’Autume, Sebastian Ruder, Lingpeng Kong, and Dani Yogatama. Episodic memory in lifelong language learning.Advances in Neural Information Processing Systems, 32, 2019

  8. [8]

    How catas- trophic can catastrophic forgetting be in linear regression? InConference on Learning Theory, pages 4028–4079

    Itay Evron, Edward Moroshko, Rachel Ward, Nathan Srebro, and Daniel Soudry. How catas- trophic can catastrophic forgetting be in linear regression? InConference on Learning Theory, pages 4028–4079. PMLR, 2022

  9. [9]

    Sciknoweval: Evaluating multi-level scientific knowledge of large language models

    Kehua Feng, Xinyi Shen, Weijie Wang, Xiang Zhuang, Yuqi Tang, Qiang Zhang, and Keyan Ding. Sciknoweval: Evaluating multi-level scientific knowledge of large language models. arXiv preprint arXiv:2406.09098, 2024

  10. [10]

    Catastrophic forgetting in connectionist networks.Trends in cognitive sciences, 3(4):128–135, 1999

    Robert M French. Catastrophic forgetting in connectionist networks.Trends in cognitive sciences, 3(4):128–135, 1999

  11. [11]

    The language model evaluation harness, 07 2024

    Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The languag...

  12. [12]

    Does fine-tuning llms on new knowledge encourage hallucinations? InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 7765–7784, 2024

    Zorik Gekhman, Gal Yona, Roee Aharoni, Matan Eyal, Amir Feder, Roi Reichart, and Jonathan Herzig. Does fine-tuning llms on new knowledge encourage hallucinations? InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 7765–7784, 2024. 10

  13. [13]

    Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

    Priya Goyal, Piotr Doll´ar, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour.arXiv preprint arXiv:1706.02677, 2017

  14. [14]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

  15. [15]

    Preserving pre-trained features helps calibrate fine-tuned language models.arXiv preprint arXiv:2305.19249, 2023

    Guande He, Jianfei Chen, and Jun Zhu. Preserving pre-trained features helps calibrate fine-tuned language models.arXiv preprint arXiv:2305.19249, 2023

  16. [16]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding.arXiv preprint arXiv:2009.03300, 2020

  17. [17]

    arXiv preprint arXiv:2004.13135 , year=

    Calypso Herrera, Florian Krach, and Josef Teichmann. Local lipschitz bounds of deep neural networks.arXiv preprint arXiv:2004.13135, 2020

  18. [18]

    Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

  19. [19]

    Mitigating catastrophic forgetting in large language models with self- synthesized rehearsal

    Jianheng Huang, Leyang Cui, Ante Wang, Chengyi Yang, Xinting Liao, Linfeng Song, Junfeng Yao, and Jinsong Su. Mitigating catastrophic forgetting in large language models with self- synthesized rehearsal. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1416–1428, 2024

  20. [20]

    A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions.ACM Transactions on Information Systems, 43(2):1–55, 2025

    Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qiang- long Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions.ACM Transactions on Information Systems, 43(2):1–55, 2025

  21. [21]

    Refine large language model fine-tuning via instruction vector.arXiv preprint arXiv:2406.12227, 2024

    Gangwei Jiang, Zhaoyi Li, Defu Lian, and Ying Wei. Refine large language model fine-tuning via instruction vector.arXiv preprint arXiv:2406.12227, 2024

  22. [22]

    Language Models (Mostly) Know What They Know

    Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know.arXiv preprint arXiv:2207.05221, 2022

  23. [23]

    Scaling laws for forgetting when fine-tuning large language models

    Damjan Kalajdzievski. Scaling laws for forgetting when fine-tuning large language models. arXiv preprint arXiv:2401.05605, 2024

  24. [24]

    Why warmup the learning rate? underlying mech- anisms and improvements.Advances in Neural Information Processing Systems, 37:111760– 111801, 2024

    Dayal Singh Kalra and Maissam Barkeshli. Why warmup the learning rate? underlying mech- anisms and improvements.Advances in Neural Information Processing Systems, 37:111760– 111801, 2024

  25. [25]

    High dimensional bayesian optimisation and bandits via additive models

    Kirthevasan Kandasamy, Jeff Schneider, and Barnab´as P´oczos. High dimensional bayesian optimisation and bandits via additive models. InInternational conference on machine learning, pages 295–304. PMLR, 2015

  26. [26]

    Large language models must be taught to know what they don’t know.Advances in Neural Information Processing Systems, 37:85932–85972, 2024

    Sanyam Kapoor, Nate Gruver, Manley Roberts, Katherine Collins, Arka Pal, Umang Bhatt, Adrian Weller, Samuel Dooley, Micah Goldblum, and Andrew G Wilson. Large language models must be taught to know what they don’t know.Advances in Neural Information Processing Systems, 37:85932–85972, 2024

  27. [27]

    Intelligent learning rate distribution to reduce catastrophic forgetting in transformers

    Philip Kenneweg, Alexander Schulz, Sarah Schr¨oder, and Barbara Hammer. Intelligent learning rate distribution to reduce catastrophic forgetting in transformers. InInternational Conference on Intelligent Data Engineering and Automated Learning, pages 252–261. Springer, 2022

  28. [28]

    Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114(13):3521–3526, 2017

    James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114(13):3521–3526, 2017. 11

  29. [29]

    Analyzing & reducing the need for learning rate warmup in gpt training.Advances in Neural Information Processing Systems, 37:2914–2942, 2024

    Atli Kosson, Bettina Messmer, and Martin Jaggi. Analyzing & reducing the need for learning rate warmup in gpt training.Advances in Neural Information Processing Systems, 37:2914–2942, 2024

  30. [30]

    Learning multiple layers of features from tiny images

    Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009

  31. [31]

    Gradient-based learning applied to document recognition.Proceedings of the IEEE, 86(11):2278–2324, 2002

    Yann LeCun, L ´eon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition.Proceedings of the IEEE, 86(11):2278–2324, 2002

  32. [32]

    Fine-tuning without forgetting in-context learning: A theoretical analysis of linear attention models.arXiv preprint arXiv:2602.23197, 2026

    Chungpa Lee, Jy-yong Sohn, and Kangwook Lee. Fine-tuning without forgetting in-context learning: A theoretical analysis of linear attention models.arXiv preprint arXiv:2602.23197, 2026

  33. [33]

    Lewkowycz, Y

    Aitor Lewkowycz, Yasaman Bahri, Ethan Dyer, Jascha Sohl-Dickstein, and Guy Gur-Ari. The large learning rate phase of deep learning: the catapult mechanism.arXiv preprint arXiv:2003.02218, 2020

  34. [34]

    Towards understanding catastrophic forgetting in two-layer convolutional neural networks

    Boqi Li, Youjun Wang, and Weiwei Liu. Towards understanding catastrophic forgetting in two-layer convolutional neural networks. InForty-second International Conference on Machine Learning, 2025

  35. [35]

    Revisiting catastrophic forgetting in large language model tuning

    Hongyu Li, Liang Ding, Meng Fang, and Dacheng Tao. Revisiting catastrophic forgetting in large language model tuning. InFindings of the association for computational linguistics: EMNLP 2024, pages 4297–4308, 2024

  36. [36]

    Halueval: A large-scale hallucination evaluation benchmark for large language models

    Junyi Li, Xiaoxue Cheng, Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. Halueval: A large-scale hallucination evaluation benchmark for large language models. InProceedings of the 2023 conference on empirical methods in natural language processing, pages 6449–6464, 2023

  37. [37]

    Sft doesn’t always hurt general capabilities: Revisiting domain-specific fine-tuning in llms.arXiv preprint arXiv:2509.20758, 2025

    Jiacheng Lin, Zhongruo Wang, Kun Qian, Tian Wang, Arvind Srinivasan, Hansi Zeng, Ruochen Jiao, Xie Zhou, Jiri Gesi, Dakuo Wang, et al. Sft doesn’t always hurt general capabilities: Revisiting domain-specific fine-tuning in llms.arXiv preprint arXiv:2509.20758, 2025

  38. [38]

    Trgp: Trust region gradient projection for continual learning.arXiv preprint arXiv:2202.02931, 2022

    Sen Lin, Li Yang, Deliang Fan, and Junshan Zhang. Trgp: Trust region gradient projection for continual learning.arXiv preprint arXiv:2202.02931, 2022

  39. [39]

    Truthfulqa: Measuring how models mimic hu- man falsehoods

    Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic hu- man falsehoods. InProceedings of the 60th annual meeting of the association for computational linguistics (volume 1: long papers), pages 3214–3252, 2022

  40. [40]

    SGDR: Stochastic Gradient Descent with Warm Restarts

    Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts.arXiv preprint arXiv:1608.03983, 2016

  41. [41]

    On-policy distillation.Thinking Machines Lab: Connec- tionism, 2025

    Kevin Lu and Thinking Machines Lab. On-policy distillation.Thinking Machines Lab: Connec- tionism, 2025. https://thinkingmachines.ai/blog/on-policy-distillation

  42. [42]

    An empirical study of catastrophic forgetting in large language models during continual fine-tuning.IEEE Transactions on Audio, Speech and Language Processing, 2025

    Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. An empirical study of catastrophic forgetting in large language models during continual fine-tuning.IEEE Transactions on Audio, Speech and Language Processing, 2025

  43. [43]

    TOFU: A Task of Fictitious Unlearning for LLMs

    Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary C Lipton, and J Zico Kolter. Tofu: A task of fictitious unlearning for llms.arXiv preprint arXiv:2401.06121, 2024

  44. [44]

    Catastrophic interference in connectionist networks: The sequential learning problem

    Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. InPsychology of learning and motivation, volume 24, pages 109–165. Elsevier, 1989

  45. [45]

    arXiv preprint arXiv:2404.00213 , year=

    Nick Mecklenburg, Yiyou Lin, Xiaoxiao Li, Daniel Holstein, Leonardo Nunes, Sara Mal- var, Bruno Silva, Ranveer Chandra, Vijay Aski, Pavan Kumar Reddy Yannam, et al. Inject- ing new knowledge into large language models via supervised fine-tuning.arXiv preprint arXiv:2404.00213, 2024. 12

  46. [46]

    Rethinking the trust region in llm reinforcement learning.arXiv preprint arXiv:2602.04879, 2026

    Penghui Qi, Xiangxin Zhou, Zichen Liu, Tianyu Pang, Chao Du, Min Lin, and Wee Sun Lee. Rethinking the trust region in llm reinforcement learning.arXiv preprint arXiv:2602.04879, 2026

  47. [47]

    Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

    Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to! arXiv preprint arXiv:2310.03693, 2023

  48. [48]

    Expe- rience replay for continual learning.Advances in neural information processing systems, 32, 2019

    David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy Lillicrap, and Gregory Wayne. Expe- rience replay for continual learning.Advances in neural information processing systems, 32, 2019

  49. [49]

    Code Llama: Open Foundation Models for Code

    Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, et al. Code llama: Open foundation models for code.arXiv preprint arXiv:2308.12950, 2023

  50. [50]

    Winogrande: An adversarial winograd schema challenge at scale.Communications of the ACM, 64(9):99–106, 2021

    Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale.Communications of the ACM, 64(9):99–106, 2021

  51. [51]

    Upweighting easy samples in fine-tuning mitigates forgetting.arXiv preprint arXiv:2502.02797, 2025

    Sunny Sanyal, Hayden Prairie, Rudrajit Das, Ali Kavis, and Sujay Sanghavi. Upweighting easy samples in fine-tuning mitigates forgetting.arXiv preprint arXiv:2502.02797, 2025

  52. [52]

    Trust region policy optimization

    John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. InInternational conference on machine learning, pages 1889–1897. PMLR, 2015

  53. [53]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

  54. [54]

    Fine-tuned language models are continual learners

    Thomas Scialom, Tuhin Chakrabarty, and Smaranda Muresan. Fine-tuned language models are continual learners. InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 6107–6122, 2022

  55. [55]

    Cambridge university press, 2014

    Shai Shalev-Shwartz and Shai Ben-David.Understanding machine learning: From theory to algorithms. Cambridge university press, 2014

  56. [56]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  57. [57]

    GLU Variants Improve Transformer

    Noam Shazeer. Glu variants improve transformer.arXiv preprint arXiv:2002.05202, 2020

  58. [58]

    Self-Distillation Enables Continual Learning

    Idan Shenfeld, Mehul Damani, Jonas H¨ubotter, and Pulkit Agrawal. Self-distillation enables continual learning.arXiv preprint arXiv:2601.19897, 2026

  59. [59]

    Mitigating forgetting in continual learning with selective gradient projection

    Anika Singh, David Martinez, Aayush Dhaulakhandi, Varun Chopade, Likhith Malipati, Vasu Sharma, Kevin Zhu, Sunishchal Dev, and Ryan Lagasse. Mitigating forgetting in continual learning with selective gradient projection. InThe 14th International Joint Conference on Natural Language Processing and The 4th Conference of the Asia-Pacific Chapter of the Assoc...

  60. [60]

    Toward expert-level medical question answering with large language models.Nature medicine, 31(3):943–950, 2025

    Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Mohamed Amin, Le Hou, Kevin Clark, Stephen R Pfohl, Heather Cole-Lewis, et al. Toward expert-level medical question answering with large language models.Nature medicine, 31(3):943–950, 2025

  61. [61]

    Super-convergence: Very fast training of neural networks using large learning rates

    Leslie N Smith and Nicholay Topin. Super-convergence: Very fast training of neural networks using large learning rates. In Artificial intelligence and machine learning for multi-domain operations applications, volume 11006, pages 369–386. SPIE, 2019

  62. [62]

    How to alleviate catastrophic forgetting in llms finetuning? hierarchical layer-wise and element-wise regularization

    Shezheng Song, Hao Xu, Jun Ma, Shasha Li, Long Peng, Qian Wan, Xiaodong Liu, and Jie Yu. How to alleviate catastrophic forgetting in llms finetuning? hierarchical layer-wise and element-wise regularization. arXiv preprint arXiv:2501.13669, 2025

  63. [63]

    Stanford alpaca: An instruction-following llama model

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023

  64. [64]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  65. [65]

    Lipschitz regularity of deep neural networks: analysis and efficient estimation

    Aladin Virmaux and Kevin Scaman. Lipschitz regularity of deep neural networks: analysis and efficient estimation. Advances in neural information processing systems, 31, 2018

  66. [66]

    Factuality of large language models: A survey

    Yuxia Wang, Minghan Wang, Muhammad Arslan Manzoor, Fei Liu, Georgi Nenkov Georgiev, Rocktim Jyoti Das, and Preslav Nakov. Factuality of large language models: A survey. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 19519–19529, 2024

  67. [67]

    Measuring short-form factuality in large language models

    Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, and William Fedus. Measuring short-form factuality in large language models. arXiv preprint arXiv:2411.04368, 2024

  68. [68]

    Robust fine-tuning of zero-shot models

    Mitchell Wortsman, Gabriel Ilharco, Jong Wook Kim, Mike Li, Simon Kornblith, Rebecca Roelofs, Raphael Gontijo Lopes, Hannaneh Hajishirzi, Ali Farhadi, Hongseok Namkoong, et al. Robust fine-tuning of zero-shot models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7959–7971, 2022

  69. [69]

    Mitigating forgetting in llm fine-tuning via low-perplexity token learning

    Chao-Chung Wu, Zhi Rui Tam, Chieh-Yen Lin, Yun-Nung Chen, Shao-Hua Sun, and Hung-yi Lee. Mitigating forgetting in llm fine-tuning via low-perplexity token learning. arXiv preprint arXiv:2501.14315, 2025

  70. [70]

    On the generalization of sft: A reinforcement learning perspective with reward rectification

    Yongliang Wu, Yizhou Zhou, Zhou Ziheng, Yingzhe Peng, Xinyu Ye, Xinting Hu, Wenbo Zhu, Lu Qi, Ming-Hsuan Yang, and Xu Yang. On the generalization of sft: A reinforcement learning perspective with reward rectification. arXiv preprint arXiv:2508.05629, 2025

  71. [71]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  72. [72]

    Hellaswag: Can a machine really finish your sentence?

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th annual meeting of the association for computational linguistics, pages 4791–4800, 2019

  73. [73]

    Judging llm-as-a-judge with mt-bench and chatbot arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems, 36:46595–46623, 2023

  74. [74]

    Instruction-Following Evaluation for Large Language Models

    Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023

  75. [75]

    Proximal Supervised Fine-Tuning

    Wenhong Zhu, Ruobing Xie, Rui Wang, Xingwu Sun, Di Wang, and Pengfei Liu. Proximal supervised fine-tuning. arXiv preprint arXiv:2508.17784, 2025

  76. [76]

    How do language models learn facts? dynamics, curricula and hallucinations

    Nicolas Zucchet, Jörg Bornschein, Stephanie Chan, Andrew Lampinen, Razvan Pascanu, and Soham De. How do language models learn facts? dynamics, curricula and hallucinations. arXiv preprint arXiv:2503.21676, 2025

    A Proofs

    A.1 Auxiliary bounds implied by Assumption 1

    We first record standard consequences of Assumption 1. Since the input domain is bounded...
