Fine-Tuning Without Forgetting via Loss-Adaptive Learning Rates

Aldan Creo; Babak Salimi; Jiongli Zhu; Parjanya Prajakta Prashant

arxiv: 2605.20005 · v1 · pith:MJJ564QCnew · submitted 2026-05-19 · 💻 cs.LG


Pith reviewed 2026-05-20 07:10 UTC · model grok-4.3


keywords catastrophic forgetting · fine-tuning · learning rate schedule · large language models · loss-adaptive optimization · knowledge preservation · continual learning


The pith

Lowering the learning rate on high-loss batches during fine-tuning cuts catastrophic forgetting by 93% on average while maintaining task performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes that catastrophic forgetting in fine-tuning large language models can be substantially mitigated by using a loss-adaptive learning rate that is lower for high-loss batches. The key insight is that the amount of forgetting per step is bounded by the learning rate multiplied by the square root of the training loss, making high-loss examples particularly risky for overwriting prior knowledge. By adapting the learning rate accordingly and increasing it as the model converges, FINCH achieves the reported reduction in forgetting across multiple benchmarks without altering the fine-tuning objective itself. A sympathetic reader would care because this provides a simple way to adapt models to new data while keeping their pre-trained capabilities intact, which is essential for practical deployment in evolving domains.

Core claim

We identify a simple mechanism for controlling forgetting: per-step forgetting is bounded by the product of the learning rate and the square root of the current training loss. This suggests that high-loss batches are especially prone to inducing forgetting. Motivated by this observation, we introduce FINCH, a loss-adaptive learning-rate schedule that reduces the learning rate on high-loss batches and increases it as the model converges, while leaving the fine-tuning objective unchanged. Across benchmarks, FINCH reduces forgetting by 93% on average while matching the task performance of standard fine-tuning.

What carries the argument

FINCH, the loss-adaptive learning-rate schedule that lowers the learning rate for batches with high current training loss to limit the per-step forgetting bound.
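As a rough illustration of what such a schedule could look like, the sketch below sets the step size to κ divided by the square root of the current batch loss, which is the form quoted from the paper further down this page. The function name `loss_adaptive_lr`, the clipping bounds `lr_min` and `lr_max`, and the `eps` guard are assumptions added here for numerical safety; they are not taken from the paper, and the authors' implementation may differ.

```python
import math


def loss_adaptive_lr(batch_loss, kappa, lr_min=1e-7, lr_max=1e-4, eps=1e-12):
    """Return kappa / sqrt(batch_loss), clipped to [lr_min, lr_max].

    kappa is the schedule's scale constant. The clipping bounds and eps are
    assumptions of this sketch, not values reported in the paper.
    """
    raw = kappa / math.sqrt(max(batch_loss, eps))
    return min(max(raw, lr_min), lr_max)


# Higher batch loss gives a smaller step; the step grows as the loss falls.
for loss in (9.0, 4.0, 1.0, 0.25):
    print(loss, loss_adaptive_lr(loss, kappa=2e-5))
```

In a PyTorch-style training loop, one plausible way to use this would be to compute the batch loss, write the returned value into each `param_group["lr"]` of the optimizer, and then take the optimizer step. The loss function itself stays untouched, which matches the claim that the fine-tuning objective is unchanged.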

If this is right

Across knowledge acquisition, science, and low-resource language adaptation benchmarks, forgetting is reduced by 93% on average while task performance matches standard fine-tuning.
On Qwen3-4B knowledge acquisition, TruthfulQA degradation is cut by 5x and HaluEval degradation is reversed.
Confidence calibration is better preserved compared to standard fine-tuning.
Learning-rate schedules can shape model behavior during fine-tuning beyond just target-task optimization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The bound could motivate real-time loss monitoring to dynamically adjust rates in any sequential training setting where earlier capabilities must be retained.
Similar adaptive schedules might stabilize training from scratch by down-weighting early high-loss phases to avoid unstable updates.
If the mechanism generalizes, it could reduce reliance on explicit regularization terms in continual learning pipelines.

Load-bearing premise

Per-step forgetting during fine-tuning is bounded by the product of the learning rate and the square root of the current training loss.

What would settle it

An experiment showing that observed forgetting after a fine-tuning step on a high-loss batch exceeds the predicted bound of learning rate times square root of loss, or that lowering the rate on such batches fails to reduce overall forgetting compared to a fixed schedule.
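One way such a check might be set up is sketched below: for each optimizer step, log the old-task loss before and after the step, the learning rate, and the training loss, then divide the observed forgetting by learning rate times square root of loss. The helper names and the sample numbers are hypothetical and exist only to exercise the code; they are not measurements from the paper.

```python
import math


def forgetting_ratio(old_loss_before, old_loss_after, lr, train_loss):
    """Observed per-step forgetting divided by lr * sqrt(train_loss).

    All four inputs are assumed to be logged by the experimenter for a
    single optimizer step; none of these names come from the paper.
    """
    forgetting = old_loss_after - old_loss_before
    return forgetting / (lr * math.sqrt(train_loss))


def worst_ratio(step_logs):
    """Largest ratio over a list of (before, after, lr, train_loss) tuples."""
    return max(forgetting_ratio(*log) for log in step_logs)


# Hypothetical logged steps, made up purely to exercise the helper.
logs = [
    (1.00, 1.02, 0.01, 4.0),   # high-loss batch
    (1.02, 1.025, 0.01, 1.0),  # lower-loss batch
]
print(worst_ratio(logs))
```

If the bound holds, this ratio should stay below some fixed constant across the whole run. A ratio that keeps growing, particularly on high-loss batches, would count as evidence against the premise.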

Figures

Figures reproduced from arXiv: 2605.20005 by Aldan Creo, Babak Salimi, Jiongli Zhu, Parjanya Prajakta Prashant.

**Figure 1.** Overview of FINCH. Results are shown for Qwen3-4B on knowledge acquisition; full experimental details are given in Section 4 and Appendix B. (a) We show normalized new-task accuracy, normalized old-task accuracy, and learning rate over training for standard SFT and FINCH. Norm. Acc. denotes min-max normalized accuracy: for each accuracy curve type, we set the minimum value attained by either SFT or FINCH o…

**Figure 2.** Task accuracy (or win-tie rate) vs. average benchmark

**Figure 3.** Truthfulness, hallucination, and calibration vs. task accuracy on knowledge acquisition

**Figure 4.** Truthfulness, hallucination, and calibration vs. task accuracy on Science (Qwen3-4B).

**Figure 5.** Truthfulness, hallucination, and calibration vs. win-tie rate on Galician (Qwen3-4B).

Original abstract

Fine-tuning large language models on new data improves task performance but degrades capabilities learned during pretraining, a phenomenon known as catastrophic forgetting. Existing methods mitigate this by modifying the fine-tuning objective to suppress high-loss tokens or sequences, but these tokens are essential for learning new tasks, especially those with poor pretraining coverage. In such settings, hard tokens should still contribute to learning, so forgetting must be controlled without suppressing them. We identify a simple mechanism for doing so: per-step forgetting is bounded by the product of the learning rate and the square root of the current training loss. This suggests that high-loss batches are especially prone to inducing forgetting. Motivated by this observation, we introduce FINCH, a loss-adaptive learning-rate schedule that reduces the learning rate on high-loss batches and increases it as the model converges, while leaving the fine-tuning objective unchanged. Across knowledge acquisition, science, and low-resource language adaptation benchmarks, FINCH reduces forgetting by 93% on average while matching the task performance of standard fine-tuning. On Qwen3-4B knowledge acquisition, FINCH cuts TruthfulQA degradation by 5x and reverses HaluEval degradation, while better preserving confidence calibration. Overall, our results show that learning-rate schedules are an effective tool to shape model behavior during fine-tuning, beyond just target-task optimization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

FINCH ties an adaptive LR schedule to a per-step forgetting bound of LR times sqrt(loss) and reports 93% average forgetting reduction while matching task performance.


The main takeaway is that this paper gives a straightforward loss-adaptive learning-rate schedule, FINCH, that lowers the rate on high-loss batches and raises it as the model converges. It claims this controls forgetting without touching the fine-tuning objective, and the reported numbers show a 93% average drop in forgetting across knowledge acquisition, science, and low-resource language adaptation benchmarks while task performance stays comparable to standard fine-tuning. On Qwen3-4B knowledge acquisition it cuts TruthfulQA degradation by 5x, reverses HaluEval degradation, and better preserves confidence calibration. That combination is practically useful if the gains are real and not tied to a narrow set of runs.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes FINCH, a loss-adaptive learning-rate schedule for fine-tuning LLMs that reduces the learning rate on high-loss batches. It claims that per-step forgetting is bounded by the product of the learning rate and the square root of the current training loss, motivating the schedule while leaving the fine-tuning objective unchanged. Across knowledge acquisition, science, and low-resource language adaptation benchmarks, FINCH is reported to reduce forgetting by 93% on average while matching standard fine-tuning task performance, with specific gains on Qwen3-4B including a 5x cut in TruthfulQA degradation and reversal of HaluEval degradation.

Significance. If the per-step forgetting bound is rigorously derived and the empirical controls are sound, the work shows that learning-rate schedules alone can shape fine-tuning behavior to preserve pretraining capabilities without suppressing hard tokens or modifying the loss. This is a lightweight alternative to existing forgetting-mitigation techniques and could be broadly useful for stable LLM adaptation.

major comments (1)

[Abstract] Abstract: the claim that per-step forgetting is bounded by LR × √(current training loss) is presented as the key mechanism motivating FINCH, yet no derivation, inequality, or set of assumptions is supplied. Without this, it is impossible to assess whether the bound remains valid under the distribution shift that occurs during fine-tuning on out-of-distribution data, which directly undermines the justification for the adaptive schedule and the reported 93% forgetting reduction.

minor comments (2)

The abstract refers to 'knowledge acquisition, science, and low-resource language adaptation benchmarks' but does not list the concrete datasets, number of runs, or statistical tests used to support the average 93% reduction.
It would be helpful to include a short proof sketch or inequality chain for the claimed forgetting bound in the main text or appendix so that readers can verify the √loss dependence.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading of the manuscript and for identifying an important point regarding the theoretical motivation. We address the major comment below and will revise the paper accordingly.


Referee: [Abstract] Abstract: the claim that per-step forgetting is bounded by LR × √(current training loss) is presented as the key mechanism motivating FINCH, yet no derivation, inequality, or set of assumptions is supplied. Without this, it is impossible to assess whether the bound remains valid under the distribution shift that occurs during fine-tuning on out-of-distribution data, which directly undermines the justification for the adaptive schedule and the reported 93% forgetting reduction.

Authors: We acknowledge that the abstract states the bound without supplying the derivation or explicit assumptions, which limits the ability to evaluate its validity under distribution shift. The manuscript motivates the bound via the observation that the per-step parameter update magnitude scales with the learning rate and that the gradient norm is controlled by the current loss value (via standard inequalities relating loss to gradient under smoothness assumptions). However, a complete step-by-step derivation with listed assumptions was not included. We will add a dedicated paragraph (or short subsection) in the revised manuscript that states the bound formally, lists the assumptions (e.g., L-smoothness of the loss and bounded gradient norms), and discusses its role as a heuristic motivation rather than a strict guarantee throughout training. We will also note that the schedule remains beneficial even if the bound loosens under shift, because it still down-weights updates on high-loss batches. This revision will strengthen the justification while preserving all empirical claims.

Revision: yes
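For readers who want to see how a bound of this shape can arise, here is a minimal sketch under assumptions that are this page's own and not necessarily the paper's: a plain SGD step, a nonnegative and β-smooth batch loss, and an old-task loss that is G-Lipschitz along the trajectory. The symbols β and G are introduced here for illustration. The first line is the standard self-bounding property of smooth nonnegative functions; the second chains it with the Lipschitz condition.

```latex
% Assumptions (illustrative, not taken from the paper):
%   SGD step:   \theta_{i+1} = \theta_i - \eta_i g_i, \quad g_i = \nabla L_{B_i}(\theta_i)
%   L_{B_i} \ge 0 and \beta-smooth;  L_{\mathrm{old}} is G-Lipschitz.
\begin{align}
  \|g_i\|^2 &\le 2\beta\, L_{B_i}(\theta_i), \\
  L_{\mathrm{old}}(\theta_{i+1}) - L_{\mathrm{old}}(\theta_i)
    &\le G\,\|\theta_{i+1} - \theta_i\|
     = G\,\eta_i\,\|g_i\|
     \le G\sqrt{2\beta}\;\eta_i \sqrt{L_{B_i}(\theta_i)}.
\end{align}
```

Under these assumptions the per-step forgetting is at most a constant times the learning rate times the square root of the batch loss. With adaptive optimizers such as Adam the update norm is no longer η times the gradient norm, so this chain should be read as motivation and not as a guarantee for the training setups used in practice.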

Circularity Check

0 steps flagged

No significant circularity; bound and schedule are independently motivated

full rationale

The paper derives the per-step forgetting bound from the parameter update rule combined with a definition of forgetting (increase in pretraining loss) and presents it as a first-principles mechanism. The FINCH schedule is then constructed directly from that bound without fitting parameters to the target forgetting metric or renaming an observed pattern. No self-citation chains, self-definitional steps, or fitted-input-called-prediction reductions appear in the derivation. Empirical results on benchmarks provide independent falsifiable content outside the motivating inequality. The central claim therefore remains self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the stated bound between forgetting, learning rate, and loss; no free parameters or invented entities are described in the abstract.

axioms (1)

Domain assumption: per-step forgetting is bounded by the product of the learning rate and the square root of the current training loss.
This observation is presented as the foundation for the loss-adaptive schedule.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "per-step forgetting is bounded by the product of the learning rate and the square root of the current training loss... η_i = κ / √L_Bi(θ_i)"

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "Corollary 1... cumulative forgetting satisfies L_old(p_T) − L_old(p_0) = O(T κ)"
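The arithmetic connecting the two quoted passages can be sketched as follows, assuming a per-step bound with some constant C. The constant C and the shorthand for the per-step change in old-task loss are notation introduced here, not the paper's: substituting the schedule makes every step's bound equal to Cκ, and summing over T steps gives the linear growth.

```latex
% Assumed per-step bound (C is an unspecified constant):
%   \Delta_i := L_{\mathrm{old}}^{(i+1)} - L_{\mathrm{old}}^{(i)}
%            \le C\,\eta_i \sqrt{L_{B_i}(\theta_i)}
% Schedule quoted above:  \eta_i = \kappa / \sqrt{L_{B_i}(\theta_i)}
\begin{align}
  \Delta_i &\le C \cdot \frac{\kappa}{\sqrt{L_{B_i}(\theta_i)}} \cdot \sqrt{L_{B_i}(\theta_i)} = C\kappa, \\
  L_{\mathrm{old}}^{(T)} - L_{\mathrm{old}}^{(0)} &= \sum_{i=0}^{T-1} \Delta_i \le C\,T\,\kappa = O(T\kappa).
\end{align}
```

Read this way, κ acts as a per-step forgetting budget: the schedule spends the same budget on every step regardless of how large the batch loss is.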

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

84 extracted references · 84 canonical work pages · 16 internal anchors

[1]

On-policy distillation of language models: Learning from self-generated mistakes

Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. InThe twelfth international conference on learning representations, 2024

work page 2024
[2]

Context-free synthetic data mitigates forgetting.arXiv preprint arXiv:2505.13811, 2025

Parikshit Bansal and Sujay Sanghavi. Context-free synthetic data mitigates forgetting.arXiv preprint arXiv:2505.13811, 2025

work page arXiv 2025
[3]

LoRA learns less and forgets less.arXiv preprint arXiv:2405.09673, 2024

Dan Biderman, Jacob Portes, Jose Javier Gonzalez Ortiz, Mansheej Paul, Philip Greengard, Connor Jennings, Daniel King, Sam Havens, Vitaliy Chiley, Jonathan Frankle, et al. Lora learns less and forgets less.arXiv preprint arXiv:2405.09673, 2024

work page arXiv 2024
[4]

Continual memorization of factoids in language models.arXiv preprint arXiv:2411.07175, 2024

Howard Chen, Jiayi Geng, Adithya Bhaskar, Dan Friedman, and Danqi Chen. Continual memorization of factoids in language models.arXiv preprint arXiv:2411.07175, 2024

work page arXiv 2024
[5]

Monolingual or multilingual instruction tuning: Which makes a better alpaca

Pinzhen Chen, Shaoxiong Ji, Nikolay Bogoychev, Andrey Kutuzov, Barry Haddow, and Kenneth Heafield. Monolingual or multilingual instruction tuning: Which makes a better alpaca. In Findings of the Association for Computational Linguistics: EACL 2024, pages 1347–1356, 2024

work page 2024
[6]

Language modeling with gated convolutional networks

Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional networks. InInternational conference on machine learning, pages 933–941. PMLR, 2017

work page 2017
[7]

Episodic memory in lifelong language learning.Advances in Neural Information Processing Systems, 32, 2019

Cyprien de Masson D’Autume, Sebastian Ruder, Lingpeng Kong, and Dani Yogatama. Episodic memory in lifelong language learning.Advances in Neural Information Processing Systems, 32, 2019

work page 2019
[8]

How catas- trophic can catastrophic forgetting be in linear regression? InConference on Learning Theory, pages 4028–4079

Itay Evron, Edward Moroshko, Rachel Ward, Nathan Srebro, and Daniel Soudry. How catas- trophic can catastrophic forgetting be in linear regression? InConference on Learning Theory, pages 4028–4079. PMLR, 2022

work page 2022
[9]

Sciknoweval: Evaluating multi-level scientific knowledge of large language models

Kehua Feng, Xinyi Shen, Weijie Wang, Xiang Zhuang, Yuqi Tang, Qiang Zhang, and Keyan Ding. Sciknoweval: Evaluating multi-level scientific knowledge of large language models. arXiv preprint arXiv:2406.09098, 2024

work page arXiv 2024
[10]

Catastrophic forgetting in connectionist networks.Trends in cognitive sciences, 3(4):128–135, 1999

Robert M French. Catastrophic forgetting in connectionist networks.Trends in cognitive sciences, 3(4):128–135, 1999

work page 1999
[11]

The language model evaluation harness, 07 2024

Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The languag...

work page 2024
[12]

Does fine-tuning llms on new knowledge encourage hallucinations? InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 7765–7784, 2024

Zorik Gekhman, Gal Yona, Roee Aharoni, Matan Eyal, Amir Feder, Roi Reichart, and Jonathan Herzig. Does fine-tuning llms on new knowledge encourage hallucinations? InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 7765–7784, 2024. 10

work page 2024
[13]

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

Priya Goyal, Piotr Doll´ar, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour.arXiv preprint arXiv:1706.02677, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[14]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[15]

Preserving pre-trained features helps calibrate fine-tuned language models.arXiv preprint arXiv:2305.19249, 2023

Guande He, Jianfei Chen, and Jun Zhu. Preserving pre-trained features helps calibrate fine-tuned language models.arXiv preprint arXiv:2305.19249, 2023

work page arXiv 2023
[16]

Measuring Massive Multitask Language Understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding.arXiv preprint arXiv:2009.03300, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2009
[17]

arXiv preprint arXiv:2004.13135 , year=

Calypso Herrera, Florian Krach, and Josef Teichmann. Local lipschitz bounds of deep neural networks.arXiv preprint arXiv:2004.13135, 2020

work page arXiv 2004
[18]

Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

work page 2022
[19]

Mitigating catastrophic forgetting in large language models with self- synthesized rehearsal

Jianheng Huang, Leyang Cui, Ante Wang, Chengyi Yang, Xinting Liao, Linfeng Song, Junfeng Yao, and Jinsong Su. Mitigating catastrophic forgetting in large language models with self- synthesized rehearsal. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1416–1428, 2024

work page 2024
[20]

A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions.ACM Transactions on Information Systems, 43(2):1–55, 2025

Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qiang- long Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions.ACM Transactions on Information Systems, 43(2):1–55, 2025

work page 2025
[21]

Refine large language model fine-tuning via instruction vector.arXiv preprint arXiv:2406.12227, 2024

Gangwei Jiang, Zhaoyi Li, Defu Lian, and Ying Wei. Refine large language model fine-tuning via instruction vector.arXiv preprint arXiv:2406.12227, 2024

work page arXiv 2024
[22]

Language Models (Mostly) Know What They Know

Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know.arXiv preprint arXiv:2207.05221, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[23]

Scaling laws for forgetting when fine-tuning large language models

Damjan Kalajdzievski. Scaling laws for forgetting when fine-tuning large language models. arXiv preprint arXiv:2401.05605, 2024

work page arXiv 2024
[24]

Why warmup the learning rate? underlying mech- anisms and improvements.Advances in Neural Information Processing Systems, 37:111760– 111801, 2024

Dayal Singh Kalra and Maissam Barkeshli. Why warmup the learning rate? underlying mech- anisms and improvements.Advances in Neural Information Processing Systems, 37:111760– 111801, 2024

work page 2024
[25]

High dimensional bayesian optimisation and bandits via additive models

Kirthevasan Kandasamy, Jeff Schneider, and Barnab´as P´oczos. High dimensional bayesian optimisation and bandits via additive models. InInternational conference on machine learning, pages 295–304. PMLR, 2015

work page 2015
[26]

Large language models must be taught to know what they don’t know.Advances in Neural Information Processing Systems, 37:85932–85972, 2024

Sanyam Kapoor, Nate Gruver, Manley Roberts, Katherine Collins, Arka Pal, Umang Bhatt, Adrian Weller, Samuel Dooley, Micah Goldblum, and Andrew G Wilson. Large language models must be taught to know what they don’t know.Advances in Neural Information Processing Systems, 37:85932–85972, 2024

work page 2024
[27]

Intelligent learning rate distribution to reduce catastrophic forgetting in transformers

Philip Kenneweg, Alexander Schulz, Sarah Schr¨oder, and Barbara Hammer. Intelligent learning rate distribution to reduce catastrophic forgetting in transformers. InInternational Conference on Intelligent Data Engineering and Automated Learning, pages 252–261. Springer, 2022

work page 2022
[28]

Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114(13):3521–3526, 2017

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114(13):3521–3526, 2017. 11

work page 2017
[29]

Analyzing & reducing the need for learning rate warmup in gpt training.Advances in Neural Information Processing Systems, 37:2914–2942, 2024

Atli Kosson, Bettina Messmer, and Martin Jaggi. Analyzing & reducing the need for learning rate warmup in gpt training.Advances in Neural Information Processing Systems, 37:2914–2942, 2024

work page 2024
[30]

Learning multiple layers of features from tiny images

Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009

work page 2009
[31]

Gradient-based learning applied to document recognition.Proceedings of the IEEE, 86(11):2278–2324, 2002

Yann LeCun, L ´eon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition.Proceedings of the IEEE, 86(11):2278–2324, 2002

work page 2002
[32]

Fine-tuning without forgetting in-context learning: A theoretical analysis of linear attention models.arXiv preprint arXiv:2602.23197, 2026

Chungpa Lee, Jy-yong Sohn, and Kangwook Lee. Fine-tuning without forgetting in-context learning: A theoretical analysis of linear attention models.arXiv preprint arXiv:2602.23197, 2026

work page arXiv 2026
[33]

Lewkowycz, Y

Aitor Lewkowycz, Yasaman Bahri, Ethan Dyer, Jascha Sohl-Dickstein, and Guy Gur-Ari. The large learning rate phase of deep learning: the catapult mechanism.arXiv preprint arXiv:2003.02218, 2020

work page arXiv 2003
[34]

Towards understanding catastrophic forgetting in two-layer convolutional neural networks

Boqi Li, Youjun Wang, and Weiwei Liu. Towards understanding catastrophic forgetting in two-layer convolutional neural networks. InForty-second International Conference on Machine Learning, 2025

work page 2025
[35]

Revisiting catastrophic forgetting in large language model tuning

Hongyu Li, Liang Ding, Meng Fang, and Dacheng Tao. Revisiting catastrophic forgetting in large language model tuning. InFindings of the association for computational linguistics: EMNLP 2024, pages 4297–4308, 2024

work page 2024
[36]

Halueval: A large-scale hallucination evaluation benchmark for large language models

Junyi Li, Xiaoxue Cheng, Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. Halueval: A large-scale hallucination evaluation benchmark for large language models. InProceedings of the 2023 conference on empirical methods in natural language processing, pages 6449–6464, 2023

work page 2023
[37]

Sft doesn’t always hurt general capabilities: Revisiting domain-specific fine-tuning in llms.arXiv preprint arXiv:2509.20758, 2025

Jiacheng Lin, Zhongruo Wang, Kun Qian, Tian Wang, Arvind Srinivasan, Hansi Zeng, Ruochen Jiao, Xie Zhou, Jiri Gesi, Dakuo Wang, et al. Sft doesn’t always hurt general capabilities: Revisiting domain-specific fine-tuning in llms.arXiv preprint arXiv:2509.20758, 2025

work page arXiv 2025
[38]

Trgp: Trust region gradient projection for continual learning.arXiv preprint arXiv:2202.02931, 2022

Sen Lin, Li Yang, Deliang Fan, and Junshan Zhang. Trgp: Trust region gradient projection for continual learning.arXiv preprint arXiv:2202.02931, 2022

work page arXiv 2022
[39]

Truthfulqa: Measuring how models mimic hu- man falsehoods

Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic hu- man falsehoods. InProceedings of the 60th annual meeting of the association for computational linguistics (volume 1: long papers), pages 3214–3252, 2022

work page 2022
[40]

SGDR: Stochastic Gradient Descent with Warm Restarts

Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts.arXiv preprint arXiv:1608.03983, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[41]

On-policy distillation.Thinking Machines Lab: Connec- tionism, 2025

Kevin Lu and Thinking Machines Lab. On-policy distillation.Thinking Machines Lab: Connec- tionism, 2025. https://thinkingmachines.ai/blog/on-policy-distillation

work page 2025
[42]

An empirical study of catastrophic forgetting in large language models during continual fine-tuning.IEEE Transactions on Audio, Speech and Language Processing, 2025

Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. An empirical study of catastrophic forgetting in large language models during continual fine-tuning.IEEE Transactions on Audio, Speech and Language Processing, 2025

work page 2025
[43]

TOFU: A Task of Fictitious Unlearning for LLMs

Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary C Lipton, and J Zico Kolter. Tofu: A task of fictitious unlearning for llms.arXiv preprint arXiv:2401.06121, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[44]

Catastrophic interference in connectionist networks: The sequential learning problem

Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. InPsychology of learning and motivation, volume 24, pages 109–165. Elsevier, 1989

work page 1989
[45]

arXiv preprint arXiv:2404.00213 , year=

Nick Mecklenburg, Yiyou Lin, Xiaoxiao Li, Daniel Holstein, Leonardo Nunes, Sara Mal- var, Bruno Silva, Ranveer Chandra, Vijay Aski, Pavan Kumar Reddy Yannam, et al. Inject- ing new knowledge into large language models via supervised fine-tuning.arXiv preprint arXiv:2404.00213, 2024. 12

work page arXiv 2024
[46]

Rethinking the trust region in llm reinforcement learning.arXiv preprint arXiv:2602.04879, 2026

Penghui Qi, Xiangxin Zhou, Zichen Liu, Tianyu Pang, Chao Du, Min Lin, and Wee Sun Lee. Rethinking the trust region in llm reinforcement learning.arXiv preprint arXiv:2602.04879, 2026

work page arXiv 2026
[47]

Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to! arXiv preprint arXiv:2310.03693, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[48]

Expe- rience replay for continual learning.Advances in neural information processing systems, 32, 2019

David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy Lillicrap, and Gregory Wayne. Expe- rience replay for continual learning.Advances in neural information processing systems, 32, 2019

work page 2019
[49]

Code Llama: Open Foundation Models for Code

Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, et al. Code llama: Open foundation models for code.arXiv preprint arXiv:2308.12950, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[50]

Winogrande: An adversarial winograd schema challenge at scale.Communications of the ACM, 64(9):99–106, 2021

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale.Communications of the ACM, 64(9):99–106, 2021

work page 2021
[51]

Upweighting easy samples in fine-tuning mitigates forgetting.arXiv preprint arXiv:2502.02797, 2025

Sunny Sanyal, Hayden Prairie, Rudrajit Das, Ali Kavis, and Sujay Sanghavi. Upweighting easy samples in fine-tuning mitigates forgetting.arXiv preprint arXiv:2502.02797, 2025

work page arXiv 2025
[52]

Trust region policy optimization

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. InInternational conference on machine learning, pages 1889–1897. PMLR, 2015

[53] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

[54] Thomas Scialom, Tuhin Chakrabarty, and Smaranda Muresan. Fine-tuned language models are continual learners. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 6107–6122, 2022

[55] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014

[56] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

[57] Noam Shazeer. GLU variants improve transformer. arXiv preprint arXiv:2002.05202, 2020

[58] Idan Shenfeld, Mehul Damani, Jonas Hübotter, and Pulkit Agrawal. Self-distillation enables continual learning. arXiv preprint arXiv:2601.19897, 2026

[59] Anika Singh, David Martinez, Aayush Dhaulakhandi, Varun Chopade, Likhith Malipati, Vasu Sharma, Kevin Zhu, Sunishchal Dev, and Ryan Lagasse. Mitigating forgetting in continual learning with selective gradient projection. In The 14th International Joint Conference on Natural Language Processing and The 4th Conference of the Asia-Pacific Chapter of the Assoc...

[60] Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Mohamed Amin, Le Hou, Kevin Clark, Stephen R Pfohl, Heather Cole-Lewis, et al. Toward expert-level medical question answering with large language models. Nature Medicine, 31(3):943–950, 2025

[61] Leslie N Smith and Nicholay Topin. Super-convergence: Very fast training of neural networks using large learning rates. In Artificial intelligence and machine learning for multi-domain operations applications, volume 11006, pages 369–386. SPIE, 2019

[62] Shezheng Song, Hao Xu, Jun Ma, Shasha Li, Long Peng, Qian Wan, Xiaodong Liu, and Jie Yu. How to alleviate catastrophic forgetting in LLMs finetuning? Hierarchical layer-wise and element-wise regularization. arXiv preprint arXiv:2501.13669, 2025

[63] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca, 2023

[64] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

[65] Aladin Virmaux and Kevin Scaman. Lipschitz regularity of deep neural networks: analysis and efficient estimation. Advances in neural information processing systems, 31, 2018

[66] Yuxia Wang, Minghan Wang, Muhammad Arslan Manzoor, Fei Liu, Georgi Nenkov Georgiev, Rocktim Jyoti Das, and Preslav Nakov. Factuality of large language models: A survey. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 19519–19529, 2024

[67] Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, and William Fedus. Measuring short-form factuality in large language models. arXiv preprint arXiv:2411.04368, 2024

[68] Mitchell Wortsman, Gabriel Ilharco, Jong Wook Kim, Mike Li, Simon Kornblith, Rebecca Roelofs, Raphael Gontijo Lopes, Hannaneh Hajishirzi, Ali Farhadi, Hongseok Namkoong, et al. Robust fine-tuning of zero-shot models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7959–7971, 2022

[69] Chao-Chung Wu, Zhi Rui Tam, Chieh-Yen Lin, Yun-Nung Chen, Shao-Hua Sun, and Hung-yi Lee. Mitigating forgetting in LLM fine-tuning via low-perplexity token learning. arXiv preprint arXiv:2501.14315, 2025

[70] Yongliang Wu, Yizhou Zhou, Zhou Ziheng, Yingzhe Peng, Xinyu Ye, Xinting Hu, Wenbo Zhu, Lu Qi, Ming-Hsuan Yang, and Xu Yang. On the generalization of SFT: A reinforcement learning perspective with reward rectification. arXiv preprint arXiv:2508.05629, 2025

[71] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

[72] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th annual meeting of the association for computational linguistics, pages 4791–4800, 2019

[73] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in neural information processing systems, 36:46595–46623, 2023

[74] Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023

[75] Wenhong Zhu, Ruobing Xie, Rui Wang, Xingwu Sun, Di Wang, and Pengfei Liu. Proximal supervised fine-tuning. arXiv preprint arXiv:2508.17784, 2025

[76] Nicolas Zucchet, Jörg Bornschein, Stephanie Chan, Andrew Lampinen, Razvan Pascanu, and Soham De. How do language models learn facts? Dynamics, curricula and hallucinations. arXiv preprint arXiv:2503.21676, 2025

A Proofs

A.1 Auxiliary bounds implied by Assumption 1

We first record standard consequences of Assumption 1. Since the input domain is bounded...

Is the response written in Galician (not Spanish, Portuguese, English, or other languages)?

How natural and fluent is the Galician? Does it sound like a native speaker, or does it have Spanish/Portuguese interference?

How consistent is the Galician throughout the response: does it code-switch mid-response?

After your brief explanation, you must output only one of the following choices as your final verdict with a label:

Assistant A is significantly better: [[A>>B]]

[47] [47]

Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to! arXiv preprint arXiv:2310.03693, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[48] [48]

Expe- rience replay for continual learning.Advances in neural information processing systems, 32, 2019

David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy Lillicrap, and Gregory Wayne. Expe- rience replay for continual learning.Advances in neural information processing systems, 32, 2019

work page 2019

[49] [49]

Code Llama: Open Foundation Models for Code

Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, et al. Code llama: Open foundation models for code.arXiv preprint arXiv:2308.12950, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[50] [50]

Winogrande: An adversarial winograd schema challenge at scale.Communications of the ACM, 64(9):99–106, 2021

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale.Communications of the ACM, 64(9):99–106, 2021

work page 2021

[51] [51]

Upweighting easy samples in fine-tuning mitigates forgetting.arXiv preprint arXiv:2502.02797, 2025

Sunny Sanyal, Hayden Prairie, Rudrajit Das, Ali Kavis, and Sujay Sanghavi. Upweighting easy samples in fine-tuning mitigates forgetting.arXiv preprint arXiv:2502.02797, 2025

work page arXiv 2025

[52] [52]

Trust region policy optimization

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. InInternational conference on machine learning, pages 1889–1897. PMLR, 2015

work page 2015

[53] [53]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[54] [54]

Fine-tuned language models are continual learners

Thomas Scialom, Tuhin Chakrabarty, and Smaranda Muresan. Fine-tuned language models are continual learners. InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 6107–6122, 2022

work page 2022

[55] [55]

Cambridge university press, 2014

Shai Shalev-Shwartz and Shai Ben-David.Understanding machine learning: From theory to algorithms. Cambridge university press, 2014

work page 2014

[56] [56]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[57] [57]

GLU Variants Improve Transformer

Noam Shazeer. Glu variants improve transformer.arXiv preprint arXiv:2002.05202, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2002

[58] [58]

Self-Distillation Enables Continual Learning

Idan Shenfeld, Mehul Damani, Jonas H¨ubotter, and Pulkit Agrawal. Self-distillation enables continual learning.arXiv preprint arXiv:2601.19897, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[59] [59]

Mitigating forgetting in continual learning with selective gradient projection

Anika Singh, David Martinez, Aayush Dhaulakhandi, Varun Chopade, Likhith Malipati, Vasu Sharma, Kevin Zhu, Sunishchal Dev, and Ryan Lagasse. Mitigating forgetting in continual learning with selective gradient projection. InThe 14th International Joint Conference on Natural Language Processing and The 4th Conference of the Asia-Pacific Chapter of the Assoc...

work page 2025

[60] [60]

Toward expert-level medical question answering with large language models.Nature medicine, 31(3):943–950, 2025

Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Mohamed Amin, Le Hou, Kevin Clark, Stephen R Pfohl, Heather Cole-Lewis, et al. Toward expert-level medical question answering with large language models.Nature medicine, 31(3):943–950, 2025

work page 2025

[61] [61]

Super-convergence: Very fast training of neural networks using large learning rates

Leslie N Smith and Nicholay Topin. Super-convergence: Very fast training of neural networks using large learning rates. InArtificial intelligence and machine learning for multi-domain operations applications, volume 11006, pages 369–386. SPIE, 2019

work page 2019

[62] [62]

How to alleviate catastrophic forgetting in llms finetuning? hierarchical layer-wise and element-wise regularization.arXiv preprint arXiv:2501.13669, 2025

Shezheng Song, Hao Xu, Jun Ma, Shasha Li, Long Peng, Qian Wan, Xiaodong Liu, and Jie Yu. How to alleviate catastrophic forgetting in llms finetuning? hierarchical layer-wise and element-wise regularization.arXiv preprint arXiv:2501.13669, 2025. 13

work page arXiv 2025

[63] [63]

Hashimoto

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023

work page 2023

[64] [64]

Attention is all you need.Advances in neural information processing systems, 30, 2017

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017

work page 2017

[65] [65]

Lipschitz regularity of deep neural networks: analysis and efficient estimation.Advances in neural information processing systems, 31, 2018

Aladin Virmaux and Kevin Scaman. Lipschitz regularity of deep neural networks: analysis and efficient estimation.Advances in neural information processing systems, 31, 2018

work page 2018

[66] [66]

Factuality of large language models: A survey

Yuxia Wang, Minghan Wang, Muhammad Arslan Manzoor, Fei Liu, Georgi Nenkov Georgiev, Rocktim Jyoti Das, and Preslav Nakov. Factuality of large language models: A survey. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 19519–19529, 2024

work page 2024

[67] [67]

Measuring short-form factuality in large language models

Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, and William Fedus. Measuring short-form factuality in large language models. arXiv preprint arXiv:2411.04368, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[68] [68]

Robust fine-tuning of zero-shot models

Mitchell Wortsman, Gabriel Ilharco, Jong Wook Kim, Mike Li, Simon Kornblith, Rebecca Roelofs, Raphael Gontijo Lopes, Hannaneh Hajishirzi, Ali Farhadi, Hongseok Namkoong, et al. Robust fine-tuning of zero-shot models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7959–7971, 2022

work page 2022

[69] [69]

Mitigating forgetting in llm fine-tuning via low-perplexity token learning.arXiv preprint arXiv:2501.14315, 2025

Chao-Chung Wu, Zhi Rui Tam, Chieh-Yen Lin, Yun-Nung Chen, Shao-Hua Sun, and Hung-yi Lee. Mitigating forgetting in llm fine-tuning via low-perplexity token learning.arXiv preprint arXiv:2501.14315, 2025

work page arXiv 2025

[70]

On the generalization of SFT: A reinforcement learning perspective with reward rectification

Yongliang Wu, Yizhou Zhou, Zhou Ziheng, Yingzhe Peng, Xinyu Ye, Xinting Hu, Wenbo Zhu, Lu Qi, Ming-Hsuan Yang, and Xu Yang. On the generalization of SFT: A reinforcement learning perspective with reward rectification. arXiv preprint arXiv:2508.05629, 2025

[71]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

[72]

HellaSwag: Can a machine really finish your sentence?

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, 2019

[73]

Judging LLM-as-a-judge with MT-Bench and Chatbot Arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023

[74]

Instruction-Following Evaluation for Large Language Models

Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023

[75]

Proximal Supervised Fine-Tuning

Wenhong Zhu, Ruobing Xie, Rui Wang, Xingwu Sun, Di Wang, and Pengfei Liu. Proximal supervised fine-tuning. arXiv preprint arXiv:2508.17784, 2025

[76]

How do language models learn facts? Dynamics, curricula and hallucinations

Nicolas Zucchet, Jörg Bornschein, Stephanie Chan, Andrew Lampinen, Razvan Pascanu, and Soham De. How do language models learn facts? Dynamics, curricula and hallucinations. arXiv preprint arXiv:2503.21676, 2025

[77]

Is the response written in Galician (not Spanish, Portuguese, English, or other languages)?

[78]

How natural and fluent is the Galician? Does it sound like a native speaker, or does it have Spanish/Portuguese interference?

[79]

How consistent is the Galician throughout the response—does it code-switch mid-response? After your brief explanation, you must output only one of the following choices as your final verdict with a label:

[80]

Assistant A is significantly better:[[A>>B]]
