Spectral Unforgetting: Post-Hoc Recovery of Damaged Capabilities Without Retraining

Aarash Abro; Muhammad Tahir

arxiv: 2605.20296 · v1 · pith:KUGW6W2Lnew · submitted 2026-05-19 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-21 07:45 UTC · model grok-4.3

keywords: catastrophic forgetting, fine-tuning, spectral repair, post-hoc method, language models, safety restoration, singular value threshold

The pith

Fine-tuning weight changes contain a removable noise component that spectral filtering can strip away to restore lost model capabilities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Fine-tuning a language model often damages its performance on unrelated tasks, a phenomenon known as catastrophic forgetting. The paper argues that this damage is neither permanent nor inevitable: it arises from a noisy component in the weight update that can be identified and removed after the fact. Using only the original and fine-tuned model weights, DG-Hard applies a hard threshold to the singular values of their difference, keeping the useful structured changes and discarding the rest. The result is a repaired model that keeps the benefits of fine-tuning while recovering performance on other benchmarks and even restoring degraded safety properties.

Core claim

Treating the fine-tuning update as a low-rank signal plus IID noise allows the Donoho-Gavish hard threshold to produce a repaired checkpoint that balances target task retention with recovery of held-out capabilities across multiple models and tasks.

What carries the argument

The Donoho-Gavish hard singular-value threshold applied to the matrices of the weight delta Δ = W_ft - W_base, which separates the task-aligned low-rank part from the noise residual without any data or tuning.
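A minimal NumPy sketch of the filtering step described above, under stated assumptions. The summary does not say which variant of the threshold DG-Hard uses, so this sketch assumes the unknown-noise-level form from Gavish and Donoho [22]: τ = ω(β) · median(singular values), with the published polynomial approximation of ω(β) and β the aspect ratio of the matrix. The names `dg_omega`, `dg_hard_filter`, and `repair_matrix` are illustrative and do not come from the paper.

```python
import numpy as np


def dg_omega(beta):
    """Approximate Gavish-Donoho coefficient for unknown noise level.

    Polynomial approximation from Gavish & Donoho (2014), for 0 < beta <= 1.
    """
    return 0.56 * beta**3 - 0.95 * beta**2 + 1.82 * beta + 1.43


def dg_hard_filter(delta):
    """Keep only the singular components of `delta` above the threshold.

    Assumption: the noise scale is unknown and is estimated implicitly
    through the median singular value.
    Returns (filtered_delta, kept_rank, threshold).
    """
    m, n = delta.shape
    beta = min(m, n) / max(m, n)  # aspect ratio in (0, 1]
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    tau = dg_omega(beta) * np.median(s)
    keep = s > tau  # hard threshold: keep or discard, no shrinkage
    filtered = (u[:, keep] * s[keep]) @ vt[keep, :]
    return filtered, int(keep.sum()), float(tau)


def repair_matrix(w_base, w_ft):
    """Rebuild one weight matrix as W_base + filtered(W_ft - W_base)."""
    delta = w_ft - w_base
    filtered, rank, _ = dg_hard_filter(delta)
    return w_base + filtered, rank
```

In this sketch a full checkpoint would be repaired by calling `repair_matrix` on each pair of matching 2-D weight matrices. How the paper treats biases, embeddings, and other non-matrix parameters is not stated here, so that choice is left open.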

Load-bearing premise

The fine-tuning update can be cleanly separated into a low-rank task signal and an IID-like noise residual by a singular value threshold.

What would settle it

A counterexample would be a fine-tuning update whose singular value spectrum lacks a detectable gap, or where applying the threshold does not yield better balanced performance than the original fine-tuned model on the held-out benchmarks.
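One way to look for such a gap is to compare a delta matrix's singular values against the Marchenko-Pastur bulk edge expected for pure IID noise of the same shape. The sketch below is an illustrative diagnostic, not the paper's procedure: it assumes IID entries with an unknown scale, estimates that scale from the median singular value, and counts how many singular values sit clearly above the predicted edge. The names `mp_median` and `mp_bulk_report` and the `slack` margin are hypothetical.

```python
import numpy as np


def mp_median(beta, grid=200_000):
    """Numerically approximate the median of the Marchenko-Pastur law.

    beta is the aspect ratio m/n with 0 < beta <= 1 (unit noise variance).
    """
    a = (1.0 - np.sqrt(beta)) ** 2
    b = (1.0 + np.sqrt(beta)) ** 2
    edges = np.linspace(a, b, grid + 1)
    x = 0.5 * (edges[:-1] + edges[1:])  # midpoints avoid the endpoints
    density = np.sqrt((b - x) * (x - a)) / (2.0 * np.pi * beta * x)
    cdf = np.cumsum(density)
    cdf /= cdf[-1]  # normalise the discretised CDF
    return float(x[np.searchsorted(cdf, 0.5)])


def mp_bulk_report(delta, slack=1.05):
    """Compare singular values of `delta` with the noise-only bulk edge.

    Assumption: the bulk is IID noise, so the median singular value
    identifies the noise scale sigma.
    """
    m, n = sorted(delta.shape)  # m <= n
    beta = m / n
    s = np.linalg.svd(delta, compute_uv=False)
    sigma_hat = np.median(s) / np.sqrt(n * mp_median(beta))
    edge = sigma_hat * np.sqrt(n) * (1.0 + np.sqrt(beta))
    n_above = int(np.sum(s > slack * edge))
    return {
        "sigma_hat": float(sigma_hat),
        "bulk_edge": float(edge),
        "n_above_edge": n_above,
        "frac_above_edge": n_above / s.size,
    }
```

Under this reading, a delta with a few singular values far above `bulk_edge` and a bulk that ends near it is consistent with the signal-plus-noise premise. A spectrum that decays smoothly through the edge would be the kind of counterexample described above.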

Figures

Figures reproduced from arXiv: 2605.20296 by Aarash Abro, Muhammad Tahir.

**Figure 1.** Recovery × preservation per cohort. Each panel plots the % healed score on the damaged partition (x-axis) against the % preserved score on the improved partition (y-axis), as defined in §4.3. The ideal corner is (100, 100), and the dotted contour marks HM(% healed, % preserved) = 80. DG-Hard (blue diamond) is closest to the ideal corner across all five cohorts. FAPM [16] strongly recovers damaged measureme…

**Figure 2.** Population-level Combined score per method, sliced by cohort. Panel titles list…

**Figure 3.** Trade-off frontiers (both axes 0 to 100, higher is better). Left: Clean-up vs Retention. DG-Hard sits in the upper-right region where both axes are simultaneously high; V-SoftMask [23] is the retention-extreme (top-left); FAPM [16] is the cleanup-extreme (bottom-right). Right: Knowledge-cohort Combined (x) vs Cognition-cohort Combined (y). DG-Hard sits high on both (84.8 and 82.1), the most balanced strong…

**Figure 4.** Spectral unforgetting in two views, on Llama-3.2-3B

Original abstract

Fine-tuning a language model for a target task routinely degrades capabilities the training data never explicitly threatened. We study this phenomenon, known as catastrophic forgetting, and propose a post-hoc repair solution that uses only the pretrained checkpoint $W_{\mathrm{base}}$ and its fine-tuned descendant $W_{\mathrm{ft}}$. The goal is not merely to revert the model toward the base checkpoint, but to recover capabilities damaged by fine-tuning while preserving both the target-task gains and any beneficial held-out improvements. We introduce DG-Hard, a checkpoint-only spectral repair method for the fine-tuning update $\Delta = W_{\mathrm{ft}} - W_{\mathrm{base}}$. DG-Hard treats $\Delta$ as a low-rank task-aligned signal embedded in an IID-like noise residual that gradient descent has no incentive to remove, and applies the Donoho-Gavish hard singular-value threshold to each weight-delta matrix, keeping the structured high-energy part of the update and removing the spectral bulk. This reduces repair to a closed-form SVD filtering step requiring no data-dependent tuning. A central difficulty is evaluation: average accuracy hides per-benchmark failures, while naive recovery scores reward models that simply revert toward the base. We therefore introduce a partition-conditional metric that separately tracks healing, preservation, non-damage, and target-task retention. Across $14$ (model, task) settings and nine cross-domain held-out benchmarks, DG-Hard achieves the strongest balanced repair among post-hoc baselines. DG-Hard also restores safety alignment degraded by benign fine-tuning on three independent safety axes, despite using no alignment data. These results suggest that part of fine-tuning-induced capability loss is not an unavoidable consequence of specialization, but a removable spectral residue in the weight update itself.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper claims a closed-form SVD hard-threshold on the fine-tuning delta recovers lost capabilities and even safety, but the experiments are under-reported and the IID noise assumption looks shaky.

Here's the main point: the authors claim that applying Donoho-Gavish hard thresholding to the singular values of the fine-tuning weight delta recovers damaged capabilities in a post-hoc way, without needing to retrain or access data, and it even helps with safety alignment on some axes. What is actually new is taking the spectral filtering idea and directing it specifically at the update matrix Δ = W_ft - W_base, plus introducing that partition-conditional metric to evaluate recovery more carefully than simple averages allow.

The paper does well at explaining the motivation from spectral theory and showing that the method is parameter-free and closed-form, which keeps the circularity low. It also gets credit for testing across multiple models and tasks and including held-out benchmarks, and for checking safety restoration even though no safety data was used in the repair.

The soft spots are in the experimental reporting and the foundational assumption. The abstract mentions superiority across 14 settings but gives no concrete numbers, no variance, and no clear list of baselines or how they were implemented. That makes it tough to judge how much of an advance this is. On the assumption side, treating the residual as IID-like noise that gradient descent leaves behind may not be accurate. Fine-tuning can introduce structured correlations from the loss landscape and data, so the bulk spectrum might not follow the Marchenko-Pastur law closely enough for the threshold to isolate the task signal reliably. If that's the case, the method could be succeeding for other reasons, like just removing small updates.

This kind of work is useful for practitioners who fine-tune models and want a quick fix for side effects. Readers focused on efficient post-training interventions or spectral analysis of neural nets would get the most out of it. I would recommend sending it for peer review. The core idea is simple enough to verify or refute with standard experiments, and the evaluation framework has some merit even if the current results need more support.

Referee Report

2 major / 2 minor

Summary. The paper claims that catastrophic forgetting during fine-tuning can be partially reversed post-hoc by treating the weight update Δ = W_ft − W_base as a low-rank task signal plus an IID-like noise residual, then applying the Donoho-Gavish hard singular-value threshold to each layer's delta matrix. This closed-form SVD filter is said to recover damaged capabilities on held-out tasks while preserving target-task gains. Across 14 (model, task) pairs and nine cross-domain benchmarks, DG-Hard outperforms post-hoc baselines on a new partition-conditional metric that separately scores healing, preservation, non-damage, and retention; the same procedure also restores safety alignment on three axes without using any alignment data.

Significance. If the central empirical claim holds under rigorous controls, the work supplies a simple, training-free, data-free repair that could reduce the cost of maintaining specialized models. The partition-conditional evaluation metric is a useful addition for distinguishing true recovery from trivial reversion. The safety-restoration result, obtained without safety data, would be particularly noteworthy if replicated.

major comments (2)

[Method description (around the definition of DG-Hard)] The justification for Donoho-Gavish hard thresholding rests on the claim that the residual after low-rank extraction behaves as unstructured IID Gaussian noise (see the paragraph beginning 'DG-Hard treats Δ as a low-rank task-aligned signal...'). No diagnostic is reported that compares the empirical singular-value distribution of the residual to the Marchenko-Pastur bulk predicted by the theorem; if row/column correlations or heavy tails remain, the threshold may truncate signal rather than noise, undermining both the mechanistic story and the recovery claims.
[Empirical evaluation and results] The abstract and results sections report that DG-Hard 'achieves the strongest balanced repair' across 14 settings, yet supply no per-run standard deviations, no statistical significance tests against the strongest baseline, and no explicit data-exclusion or hyper-parameter selection protocol. Without these, it is impossible to judge whether the reported superiority is robust or sensitive to evaluation choices.

minor comments (2)

[Evaluation metric] The partition-conditional metric is introduced without a formal definition or pseudocode; a small table or equation showing how the four scores are aggregated into the 'balanced repair' ranking would improve reproducibility.
[Figures] Several figures compare DG-Hard to baselines but omit error bars or confidence intervals; adding them would make the visual claims easier to assess.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the justification of our method and the reporting of empirical results. We address each major point below and indicate the revisions we will make to the manuscript.

Referee: [Method description (around the definition of DG-Hard)] The justification for Donoho-Gavish hard thresholding rests on the claim that the residual after low-rank extraction behaves as unstructured IID Gaussian noise (see the paragraph beginning 'DG-Hard treats Δ as a low-rank task-aligned signal...'). No diagnostic is reported that compares the empirical singular-value distribution of the residual to the Marchenko-Pastur bulk predicted by the theorem; if row/column correlations or heavy tails remain, the threshold may truncate signal rather than noise, undermining both the mechanistic story and the recovery claims.

Authors: We agree that an explicit diagnostic comparing the residual singular-value spectrum to the Marchenko-Pastur bulk would strengthen the mechanistic justification. The Donoho-Gavish threshold is chosen because it provides a closed-form, data-free way to separate the presumed low-rank task signal from the residual under the modeling assumption of approximately unstructured noise; however, we did not verify this assumption empirically in the submitted version. In the revised manuscript we will add a supplementary figure and accompanying text that plots the empirical singular values of the post-DG-Hard residuals for representative layers across several models and compares them to the theoretical Marchenko-Pastur edge. revision: yes
Referee: [Empirical evaluation and results] The abstract and results sections report that DG-Hard 'achieves the strongest balanced repair' across 14 settings, yet supply no per-run standard deviations, no statistical significance tests against the strongest baseline, and no explicit data-exclusion or hyper-parameter selection protocol. Without these, it is impossible to judge whether the reported superiority is robust or sensitive to evaluation choices.

Authors: We acknowledge that the current presentation lacks measures of variability and formal statistical comparisons, which limits assessment of robustness. The experiments followed standard benchmark protocols with fixed evaluation settings and DG-Hard itself requires no hyper-parameter tuning beyond the SVD operation. In the revision we will (i) report per-run standard deviations for all metrics where repeated evaluations are feasible, (ii) include statistical significance tests (e.g., paired Wilcoxon tests) against the strongest baseline, and (iii) add an explicit subsection describing data exclusion criteria, benchmark partitioning, and the absence of hyper-parameter search for DG-Hard. revision: yes
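A small sketch of the kind of paired comparison the response proposes, using only NumPy. It swaps in an exact paired sign-flip (permutation) test rather than the Wilcoxon signed-rank test named above, because the sign-flip test can be written in a few lines without extra dependencies. The function name `paired_sign_flip_pvalue` is hypothetical, and any scores passed to it would need to come from the actual per-setting results, which are not reproduced here.

```python
import itertools

import numpy as np


def paired_sign_flip_pvalue(scores_a, scores_b):
    """Exact two-sided sign-flip test on paired differences.

    Under the null hypothesis that methods A and B are exchangeable
    within each setting, every sign pattern of the paired differences
    is equally likely. The p-value is the share of patterns whose
    absolute summed difference is at least the observed one.
    """
    d = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    n = d.size
    if n > 20:
        raise ValueError("exact enumeration is only practical for n <= 20")
    observed = abs(d.sum())
    # Enumerate all 2^n sign assignments (16384 patterns for n = 14).
    signs = np.array(list(itertools.product((-1.0, 1.0), repeat=n)))
    flipped = np.abs(signs @ d)
    return float(np.mean(flipped >= observed - 1e-12))
```

With 14 paired settings the smallest attainable two-sided p-value is 2 / 2^14, reached only when one method wins in every setting, so the test has enough resolution for the comparison described.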

Circularity Check

0 steps flagged

No significant circularity: closed-form SVD filter applies external Donoho-Gavish threshold without fitted parameters or self-referential derivation

The paper defines DG-Hard explicitly as the application of the known Donoho-Gavish hard singular-value threshold to each weight-delta matrix Δ = W_ft − W_base, under the modeling assumption that the update contains low-rank task signal plus IID-like residual noise. This threshold is imported from external random-matrix theory (Marchenko-Pastur law) rather than derived from the paper's own repair results or evaluation metrics. The method is described as requiring no data-dependent tuning or fitted parameters, and the central empirical claims rest on cross-benchmark experiments rather than on any quantity that is statistically forced by construction. The introduced partition-conditional metric is an evaluation device, not part of the repair derivation itself. No self-citations, ansatzes smuggled via prior work, or renamings of known results appear as load-bearing steps in the provided derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the modeling assumption that the weight update contains separable low-rank signal plus IID-like noise; no free parameters or new entities are introduced in the abstract description.

axioms (1)

domain assumption: The Donoho-Gavish hard threshold optimally separates low-rank signal from noise in the weight-delta matrices under the stated IID-like residual model.
Invoked when the paper states that DG-Hard applies the threshold to keep the structured high-energy part of the update.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

DG-Hard treats ∆ as a low-rank task-aligned signal embedded in an IID-like noise residual ... applies the Donoho-Gavish hard singular-value threshold to each weight-delta matrix

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 16 internal anchors

[1] Michael McCloskey and N. J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation, volume 24, pages 109–165. Academic Press, 1989.
[2] Roger Ratcliff. Connectionist models of recognition memory: Constraints imposed by learning and forgetting functions. Psychological Review, 97(2):285–308, 1990. doi: 10.1037/0033-295X.97.2.285
[3] Robert M. French. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128–135, 1999. doi: 10.1016/S1364-6613(99)01294-2
[4] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi...
[5] Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 2018. URL https://arxiv.org/abs/1801.06146
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019. URL https://arxiv.org/abs/1810.04805
[7] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback....
[8] Yun Luo, Zefan Yang, Fan Meng, Yufei Li, Jie Zhou, and Yue Zhang. An empirical study of catastrophic forgetting in large language models during continual fine-tuning. arXiv preprint arXiv:2308.08747, 2023. URL https://arxiv.org/abs/2308.08747
[9] Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to! In International Conference on Learning Representations, 2024. URL https://arxiv.org/abs/2310.03693
[10] Ananya Kumar, Aditi Raghunathan, Robbie Jones, Tengyu Ma, and Percy Liang. Fine-tuning can distort pretrained features and underperform out-of-distribution. In International Conference on Learning Representations, 2022.
[11] Stanisław Jastrzębski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. Three factors influencing minima in SGD. arXiv preprint arXiv:1711.04623, 2017.
[12] N. S. Keskar, Dheevatsa Mudigere, Jorge Nocedal, Misha Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In International Conference on Learning Representations, 2017.
[13] Mitchell Wortsman, Gabriel Ilharco, Jong Wook Kim, Mike Y. Li, Simon Kornblith, Rebecca Roelofs, Raphael Gontijo Lopes, Hannaneh Hajishirzi, Ali Farhadi, Hongseok Namkoong, and Ludwig Schmidt. Robust fine-tuning of zero-shot models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022. URL https://arxiv.org/abs/2109....
[14] Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, and Yong Li. Language models are super mario: Absorbing abilities from homologous models as a free lunch. In International Conference on Machine Learning, 2024. URL https://arxiv.org/abs/2311.03099
[15] Prateek Yadav, Derek Tam, Leshem Choshen, Colin Raffel, and Mohit Bansal. TIES-merging: Resolving interference when merging models. In Advances in Neural Information Processing Systems, 2023. URL https://arxiv.org/abs/2306.01708
[16] Wei Huang, Aimin Cheng, and Yu Wang. Mitigating catastrophic forgetting in large language models with forgetting-aware pruning (FAPM). In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025. URL https://arxiv.org/abs/2509.08255
[17] V. A. Marchenko and L. A. Pastur. Distribution of eigenvalues for some sets of random matrices. Mathematics of the USSR-Sbornik, 1(4):457–483, 1967.
[18] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022. URL https://arxiv.org/abs/2106.09685
[19] Armen Aghajanyan, Luke Zettlemoyer, and Sonal Gupta. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, 2021. URL https://arxiv.org/abs/2012.13255
[20] Matthias Thamm, Max Staats, and Bernd Rosenow. Random matrix analysis of deep neural network weight matrices. Physical Review E, 106:054124, 2022. URL https://arxiv.org/abs/2203.14661
[21] Max Staats, Matthias Thamm, and Bernd Rosenow. Boundary between noise and information applied to filtering neural network weight matrices. Physical Review E, 108:L022302, 2023. URL https://arxiv.org/abs/2206.03927
[22] Matan Gavish and David L. Donoho. The optimal hard threshold for singular values is 4/√3. IEEE Transactions on Information Theory, 60(8):5040–5053, 2014. URL https://arxiv.org/abs/1305.5870
[23] Zixuan Ke, Yijia Shao, Haolong Lin, Tatsuya Konishi, Gyuwan Kim, and Bing Liu. Continual pre-training of language models. In International Conference on Learning Representations. URL https://arxiv.org/abs/2302.03241
[25] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwińska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13): 3...
[26] Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In International Conference on Machine Learning, 2017. URL https://arxiv.org/abs/1703.04200
[27] Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European Conference on Computer Vision, 2018. URL https://arxiv.org/abs/1711.09601
[28] Haonan Zhang, Yinjun Wu, Dongxu Li, Shuo Yang, Rui Zhao, Yu Jiang, and Fei Tan. Balancing speciality and versatility: A coarse to fine framework for mitigating catastrophic forgetting in large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024. URL https://arxiv.org/abs/2404.10306
[29] David Lopez-Paz and Marc'Aurelio Ranzato. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, 2017. URL https://arxiv.org/abs/1706.08840
[30] Arslan Chaudhry, Marcus Rohrbach, Mohamed Elhoseiny, Thalaiyasingam Ajanthan, Puneet K. Dokania, Philip H. S. Torr, and Marc'Aurelio Ranzato. On tiny episodic memories in continual learning. arXiv preprint arXiv:1902.10486, 2019. URL https://arxiv.org/abs/1902.10486
[31] Richard Shuttleworth, Jacob Andreas, Antonio Torralba, and Pratyusha Sharma. LoRA vs full fine-tuning: An illusion of equivalence. In Advances in Neural Information Processing Systems. URL https://arxiv.org/abs/2410.21228
[33] Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. In International Conference on Learning Representations, 2023. URL https://arxiv.org/abs/2212.04089
[34] Michael S. Matena and Colin A. Raffel. Merging models with fisher-weighted averaging. In Advances in Neural Information Processing Systems, 2022. URL https://arxiv.org/abs/2111.09832
[35] Matan Gavish and David L. Donoho. Optimal shrinkage of singular values. IEEE Transactions on Information Theory, 63(4):2137–2152, 2017. URL https://arxiv.org/abs/1405.7511
[36] Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. In Proceedings of the 41st International Conference on Machine Learning, 2024. URL https://arxiv.org/abs/2402.04249
[37] Paul Röttger, Hannah Rose Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. XSTest: A test suite for identifying exaggerated safety behaviours in large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics, 2024. URL https://arxiv.org/abs/2308.01263
[38] Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, and Sam Toyer. A StrongREJECT for empty jailbreaks. In Advances in Neural Information Processing Systems (Datasets and Benchmarks Track), 2024. URL https://arxiv.org/abs/2402.10260
[39] Abhishek Panigrahi, Nikunj Saunshi, Haifeng Zhao, and Sanjeev Arora. Task-specific skill localization in fine-tuned language models. In International Conference on Machine Learning. URL https://arxiv.org/abs/2302.06600
[41] Pratyusha Sharma, Jordan T. Ash, and Dipendra Misra. The truth is in there: Improving reasoning in language models with layer-selective rank reduction. In International Conference on Learning Representations, 2024. URL https://arxiv.org/abs/2312.13558
[42] provide a complementary parametric-sparsity result: ∼0.01% of parameters carry >95% of fine-tune task performance when grafted back onto the base model. Sparsity in coordinate space and concentration in singular-value space are distinct mathematical properties; we cite this work as a parallel structural prior, not as direct evidence for low-rank ∆. The noi...
[43] (DARE) show that the fine-tune delta tolerates random pruning of 90–99% of its entries with rescaling, attributing this to “extreme redundancy” of small-magnitude updates, consistent with most of ∆ being redundant rather than informative. Direct check on our own checkpoints. Fig. 4 verifies the structure layer-locally on Llama-3.2-3B’s mlp.up_proj at layer 1...

[1] [1]

Michael McCloskey and N. J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. InPsychology of Learning and Motivation, volume 24, pages 109–165. Academic Press, 1989

work page 1989

[2] [2]

Connectionist models of recognition memory: Constraints imposed by learning and forgetting functions.Psychological Review, 97(2):285–308, 1990

Roger Ratcliff. Connectionist models of recognition memory: Constraints imposed by learning and forgetting functions.Psychological Review, 97(2):285–308, 1990. doi: 10.1037/0033-295X. 97.2.285

work page doi:10.1037/0033-295x 1990

[3] Robert M. French. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128–135, 1999. doi: 10.1016/S1364-6613(99)01294-2

[4] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi...

[5] Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 2018. URL https://arxiv.org/abs/1801.06146

[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019. URL https://arxiv.org/abs/1810.04805

[7] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback....

[8] Yun Luo, Zefan Yang, Fan Meng, Yufei Li, Jie Zhou, and Yue Zhang. An empirical study of catastrophic forgetting in large language models during continual fine-tuning. arXiv preprint arXiv:2308.08747, 2023. URL https://arxiv.org/abs/2308.08747

[9] Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to! In International Conference on Learning Representations, 2024. URL https://arxiv.org/abs/2310.03693

[10] Ananya Kumar, Aditi Raghunathan, Robbie Jones, Tengyu Ma, and Percy Liang. Fine-tuning can distort pretrained features and underperform out-of-distribution. In International Conference on Learning Representations, 2022.

[11] Stanisław Jastrzębski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. Three factors influencing minima in SGD. arXiv preprint arXiv:1711.04623, 2017.

[12] N. S. Keskar, Dheevatsa Mudigere, Jorge Nocedal, Misha Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In International Conference on Learning Representations, 2017.

[13] Mitchell Wortsman, Gabriel Ilharco, Jong Wook Kim, Mike Y. Li, Simon Kornblith, Rebecca Roelofs, Raphael Gontijo Lopes, Hannaneh Hajishirzi, Ali Farhadi, Hongseok Namkoong, and Ludwig Schmidt. Robust fine-tuning of zero-shot models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022. URL https://arxiv.org/abs/2109....

[14] Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, and Yong Li. Language models are super mario: Absorbing abilities from homologous models as a free lunch. In International Conference on Machine Learning, 2024. URL https://arxiv.org/abs/2311.03099

[15] Prateek Yadav, Derek Tam, Leshem Choshen, Colin Raffel, and Mohit Bansal. TIES-merging: Resolving interference when merging models. In Advances in Neural Information Processing Systems, 2023. URL https://arxiv.org/abs/2306.01708

[16] Wei Huang, Aimin Cheng, and Yu Wang. Mitigating catastrophic forgetting in large language models with forgetting-aware pruning (FAPM). In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025. URL https://arxiv.org/abs/2509.08255

[17] V. A. Marchenko and L. A. Pastur. Distribution of eigenvalues for some sets of random matrices. Mathematics of the USSR-Sbornik, 1(4):457–483, 1967.

[18] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022. URL https://arxiv.org/abs/2106.09685

[19] Armen Aghajanyan, Luke Zettlemoyer, and Sonal Gupta. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, 2021. URL https://arxiv.org/abs/2012.13255

[20] Matthias Thamm, Max Staats, and Bernd Rosenow. Random matrix analysis of deep neural network weight matrices. Physical Review E, 106:054124, 2022. URL https://arxiv.org/abs/2203.14661

[21] Max Staats, Matthias Thamm, and Bernd Rosenow. Boundary between noise and information applied to filtering neural network weight matrices. Physical Review E, 108:L022302, 2023. URL https://arxiv.org/abs/2206.03927

[22] Matan Gavish and David L. Donoho. The optimal hard threshold for singular values is 4/√3. IEEE Transactions on Information Theory, 60(8):5040–5053, 2014. URL https://arxiv.org/abs/1305.5870

[23] Zixuan Ke, Yijia Shao, Haolong Lin, Tatsuya Konishi, Gyuwan Kim, and Bing Liu. Continual pre-training of language models. In International Conference on Learning Representations. URL https://arxiv.org/abs/2302.03241

[25] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwińska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13): 3...

[26] Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In International Conference on Machine Learning, 2017. URL https://arxiv.org/abs/1703.04200

[27] Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European Conference on Computer Vision, 2018. URL https://arxiv.org/abs/1711.09601

[28] Haonan Zhang, Yinjun Wu, Dongxu Li, Shuo Yang, Rui Zhao, Yu Jiang, and Fei Tan. Balancing speciality and versatility: A coarse to fine framework for mitigating catastrophic forgetting in large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024. URL https://arxiv.org/abs/2404.10306

[29] David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, 2017. URL https://arxiv.org/abs/1706.08840

[30] Arslan Chaudhry, Marcus Rohrbach, Mohamed Elhoseiny, Thalaiyasingam Ajanthan, Puneet K. Dokania, Philip H. S. Torr, and Marc’Aurelio Ranzato. On tiny episodic memories in continual learning. arXiv preprint arXiv:1902.10486, 2019. URL https://arxiv.org/abs/1902.10486

[31] Richard Shuttleworth, Jacob Andreas, Antonio Torralba, and Pratyusha Sharma. LoRA vs full fine-tuning: An illusion of equivalence. In Advances in Neural Information Processing Systems. URL https://arxiv.org/abs/2410.21228

[33] Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. In International Conference on Learning Representations, 2023. URL https://arxiv.org/abs/2212.04089

[34] Michael S. Matena and Colin A. Raffel. Merging models with Fisher-weighted averaging. In Advances in Neural Information Processing Systems, 2022. URL https://arxiv.org/abs/2111.09832

[35] Matan Gavish and David L. Donoho. Optimal shrinkage of singular values. IEEE Transactions on Information Theory, 63(4):2137–2152, 2017. URL https://arxiv.org/abs/1405.7511

[36] Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. In Proceedings of the 41st International Conference on Machine Learning, 2024. URL https://arxiv.org/abs/2402.04249

[37] Paul Röttger, Hannah Rose Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. XSTest: A test suite for identifying exaggerated safety behaviours in large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics, 2024. URL https://arxiv.org/abs/2308.01263

[38] Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, and Sam Toyer. A StrongREJECT for empty jailbreaks. In Advances in Neural Information Processing Systems (Datasets and Benchmarks Track), 2024. URL https://arxiv.org/abs/2402.10260

[39] Abhishek Panigrahi, Nikunj Saunshi, Haifeng Zhao, and Sanjeev Arora. Task-specific skill localization in fine-tuned language models. In International Conference on Machine Learning. URL https://arxiv.org/abs/2302.06600

[41] Pratyusha Sharma, Jordan T. Ash, and Dipendra Misra. The truth is in there: Improving reasoning in language models with layer-selective rank reduction. In International Conference on Learning Representations, 2024. URL https://arxiv.org/abs/2312.13558

Appendix Contents

A Empirical evidence for the signal-plus-noise structure of ∆ ...

provide a complementary parametric-sparsity result: ∼0.01% of parameters carry >95% of fine-tune task performance when grafted back onto the base model. Sparsity in coordinate space and concentration in singular-value space are distinct mathematical properties; we cite this work as a parallel structural prior, not as direct evidence for low-rank ∆. The noi...

(DARE) show that the fine-tune delta tolerates random pruning of 90–99% of its entries with rescaling, attributing this to “extreme redundancy” of small-magnitude updates, consistent with most of ∆ being redundant rather than informative.

Direct check on our own checkpoints. Fig. 4 verifies the structure layer-locally on Llama-3.2-3B’s mlp.up_proj at layer 1...