
arxiv: 2605.20296 · v1 · pith:KUGW6W2Lnew · submitted 2026-05-19 · 💻 cs.LG · cs.AI

Spectral Unforgetting: Post-Hoc Recovery of Damaged Capabilities Without Retraining

Pith reviewed 2026-05-21 07:45 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords catastrophic forgetting · fine-tuning · spectral repair · post-hoc method · language models · safety restoration · singular value threshold

The pith

Fine-tuning weight changes contain a removable noise component that spectral filtering can strip away to restore lost model capabilities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Fine-tuning a language model often damages its performance on unrelated tasks, a phenomenon known as catastrophic forgetting. The paper argues that this damage is not permanent or inevitable but arises from a noisy component in the weight update that can be identified and removed after the fact. Using only the original and fine-tuned model weights, the paper's method, DG-Hard, applies a hard threshold to the singular values of their difference, keeping the structured changes and discarding the rest. The result is a repaired model that keeps the benefits of fine-tuning while recovering performance on other benchmarks and even restoring degraded safety properties.

Core claim

Treating the fine-tuning update as a low-rank signal plus IID noise allows the Donoho-Gavish hard threshold to produce a repaired checkpoint that balances target task retention with recovery of held-out capabilities across multiple models and tasks.
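For reference, the signal-plus-noise model behind this claim can be written out following Gavish and Donoho [22]. The symbols $X$, $Z$, $\sigma$, $m$, $n$, $\beta$, $s_i$, and $W_{\mathrm{rep}}$ are notation assumed here for illustration, and this summary does not say whether the paper uses the known-noise or the unknown-noise form of the threshold, so both are shown.

```latex
\begin{align*}
\Delta &= W_{\mathrm{ft}} - W_{\mathrm{base}} = X + \sigma Z,
  \qquad \Delta \in \mathbb{R}^{m \times n},\quad m \le n,\quad \beta = m/n, \\
\Delta &= \sum_{i=1}^{m} s_i\, u_i v_i^{\top}
  \quad\text{(SVD)}, \qquad
\hat{\Delta} = \sum_{i \,:\, s_i > \tau} s_i\, u_i v_i^{\top}, \qquad
W_{\mathrm{rep}} = W_{\mathrm{base}} + \hat{\Delta}, \\
\tau_{\text{known } \sigma} &= \lambda_*(\beta)\,\sqrt{n}\,\sigma, \qquad
\lambda_*(\beta) = \sqrt{\,2(\beta+1) + \frac{8\beta}{(\beta+1) + \sqrt{\beta^{2} + 14\beta + 1}}\,}, \\
\tau_{\text{unknown } \sigma} &= \omega(\beta)\, s_{\mathrm{med}}, \qquad
\omega(\beta) \approx 0.56\,\beta^{3} - 0.95\,\beta^{2} + 1.82\,\beta + 1.43 .
\end{align*}
```

Here $X$ is the low-rank signal, $Z$ has IID zero-mean unit-variance entries, and $s_{\mathrm{med}}$ is the median singular value of $\Delta$. For a square matrix ($\beta = 1$) the known-noise coefficient reduces to $\lambda_*(1) = 4/\sqrt{3}$, the constant in the title of [22].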

What carries the argument

The Donoho-Gavish hard singular-value threshold applied to the matrices of the weight delta Δ = W_ft - W_base, which separates the task-aligned low-rank part from the noise residual without any data or tuning.
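A minimal NumPy sketch of what such a per-matrix filter could look like is given below. It is not the authors' implementation: the function names are hypothetical, and it assumes the unknown-noise variant of the threshold from Gavish and Donoho [22], where the cut-off is the median singular value times the published polynomial approximation of the coefficient ω(β).

```python
import numpy as np


def omega(beta: float) -> float:
    """Polynomial approximation of the unknown-noise coefficient from [22]."""
    return 0.56 * beta**3 - 0.95 * beta**2 + 1.82 * beta + 1.43


def dg_hard_repair(w_base: np.ndarray, w_ft: np.ndarray):
    """Hard-threshold the singular values of one weight-delta matrix.

    Assumption (not confirmed by the source): the noise level is unknown,
    so the threshold is omega(beta) * median singular value.
    Returns (repaired weights, number of kept components, threshold).
    """
    delta = w_ft - w_base
    u, s, vt = np.linalg.svd(delta, full_matrices=False)

    # Aspect ratio in (0, 1]: shorter side divided by longer side.
    beta = min(delta.shape) / max(delta.shape)
    tau = omega(beta) * np.median(s)

    # Keep only components strictly above the threshold.
    keep = s > tau
    delta_kept = (u[:, keep] * s[keep]) @ vt[keep, :]
    return w_base + delta_kept, int(keep.sum()), float(tau)
```

In a full model this function would be called once per two-dimensional weight matrix. How the paper handles biases, embeddings, and normalization parameters is not stated in this summary.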

Load-bearing premise

The fine-tuning update can be cleanly separated into a low-rank task signal and an IID-like noise residual by a singular value threshold.

What would settle it

A counterexample would be a fine-tuning update whose singular value spectrum lacks a detectable gap, or where applying the threshold does not yield better balanced performance than the original fine-tuned model on the held-out benchmarks.
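One way to check a given update for such a gap is sketched below. This is an illustrative diagnostic, not a procedure taken from the paper: the function names and the returned fields are hypothetical, the noise scale is estimated from the median singular value under the Marchenko-Pastur law [17], and the gap search is restricted to the top half of the spectrum as a heuristic.

```python
import numpy as np


def mp_median(beta: float, n_grid: int = 200_000) -> float:
    """Numerical median of the unit-variance Marchenko-Pastur law, 0 < beta <= 1."""
    lo, hi = (1 - np.sqrt(beta)) ** 2, (1 + np.sqrt(beta)) ** 2
    edges = np.linspace(lo, hi, n_grid + 1)
    x = 0.5 * (edges[:-1] + edges[1:])  # midpoints avoid the endpoint singularity
    dens = np.sqrt((hi - x) * (x - lo)) / (2 * np.pi * beta * x)
    cdf = np.cumsum(dens)
    cdf /= cdf[-1]
    return float(x[np.searchsorted(cdf, 0.5)])


def spectrum_report(delta: np.ndarray) -> dict:
    """Summarize how far the spectrum of delta departs from a pure-noise bulk.

    Expects a matrix with at least two singular values.
    """
    s = np.linalg.svd(delta, compute_uv=False)  # sorted in descending order
    m, n = sorted(delta.shape)                  # m <= n
    beta = m / n

    # Noise scale implied by the median singular value, then the bulk edge.
    sigma_hat = np.median(s) / np.sqrt(n * mp_median(beta))
    edge = sigma_hat * (np.sqrt(n) + np.sqrt(m))

    # Largest ratio between consecutive singular values in the top half.
    half = max(len(s) // 2, 1)
    ratios = s[:half] / s[1 : half + 1]
    return {
        "bulk_edge": float(edge),
        "n_above_edge": int((s > edge).sum()),
        "gap_rank": int(np.argmax(ratios)) + 1,
        "max_gap_ratio": float(ratios.max()),
    }
```

A `max_gap_ratio` close to 1 together with few or no singular values above `bulk_edge` would indicate the gap-free case described above. What counts as "close to 1" is a judgment call that this sketch does not settle.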

Figures

Figures reproduced from arXiv: 2605.20296 by Aarash Abro, Muhammad Tahir.

Figure 1: Recovery × preservation per cohort. Each panel plots the % healed score on the damaged partition (x-axis) against the % preserved score on the improved partition (y-axis), as defined in §4.3. The ideal corner is (100, 100), and the dotted contour marks HM(% healed, % preserved) = 80. DG-Hard (blue diamond) is closest to the ideal corner across all five cohorts. FAPM [16] strongly recovers damaged measureme…
Figure 2: Population-level Combined score per method, sliced by cohort. Panel titles list …
Figure 3: Trade-off frontiers (both axes 0 to 100, higher is better). Left: Clean-up vs Retention. DG-Hard sits in the upper-right region where both axes are simultaneously high; V-SoftMask [23] is the retention-extreme (top-left); FAPM [16] is the cleanup-extreme (bottom-right). Right: Knowledge-cohort Combined (x) vs Cognition-cohort Combined (y). DG-Hard sits high on both (84.8 and 82.1), the most balanced strong…
Figure 4: Spectral unforgetting in two views, on Llama-3.2-3B …
Original abstract

Fine-tuning a language model for a target task routinely degrades capabilities the training data never explicitly threatened. We study this phenomenon, known as catastrophic forgetting, and propose a post-hoc repair solution that uses only the pretrained checkpoint $W_{\mathrm{base}}$ and its fine-tuned descendant $W_{\mathrm{ft}}$. The goal is not merely to revert the model toward the base checkpoint, but to recover capabilities damaged by fine-tuning while preserving both the target-task gains and any beneficial held-out improvements. We introduce DG-Hard, a checkpoint-only spectral repair method for the fine-tuning update $\Delta = W_{\mathrm{ft}} - W_{\mathrm{base}}$. DG-Hard treats $\Delta$ as a low-rank task-aligned signal embedded in an IID-like noise residual that gradient descent has no incentive to remove, and applies the Donoho-Gavish hard singular-value threshold to each weight-delta matrix, keeping the structured high-energy part of the update and removing the spectral bulk. This reduces repair to a closed-form SVD filtering step requiring no data-dependent tuning. A central difficulty is evaluation: average accuracy hides per-benchmark failures, while naive recovery scores reward models that simply revert toward the base. We therefore introduce a partition-conditional metric that separately tracks healing, preservation, non-damage, and target-task retention. Across $14$ (model, task) settings and nine cross-domain held-out benchmarks, DG-Hard achieves the strongest balanced repair among post-hoc baselines. DG-Hard also restores safety alignment degraded by benign fine-tuning on three independent safety axes, despite using no alignment data. These results suggest that part of fine-tuning-induced capability loss is not an unavoidable consequence of specialization, but a removable spectral residue in the weight update itself.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that catastrophic forgetting during fine-tuning can be partially reversed post-hoc by treating the weight update Δ = W_ft − W_base as a low-rank task signal plus an IID-like noise residual, then applying the Donoho-Gavish hard singular-value threshold to each layer's delta matrix. This closed-form SVD filter is said to recover damaged capabilities on held-out tasks while preserving target-task gains. Across 14 (model, task) pairs and nine cross-domain benchmarks, DG-Hard outperforms post-hoc baselines on a new partition-conditional metric that separately scores healing, preservation, non-damage, and retention; the same procedure also restores safety alignment on three axes without using any alignment data.

Significance. If the central empirical claim holds under rigorous controls, the work supplies a simple, training-free, data-free repair that could reduce the cost of maintaining specialized models. The partition-conditional evaluation metric is a useful addition for distinguishing true recovery from trivial reversion. The safety-restoration result, obtained without safety data, would be particularly noteworthy if replicated.

major comments (2)
  1. [Method description (around the definition of DG-Hard)] The justification for Donoho-Gavish hard thresholding rests on the claim that the residual after low-rank extraction behaves as unstructured IID Gaussian noise (see the paragraph beginning 'DG-Hard treats Δ as a low-rank task-aligned signal...'). No diagnostic is reported that compares the empirical singular-value distribution of the residual to the Marchenko-Pastur bulk predicted by the theorem; if row/column correlations or heavy tails remain, the threshold may truncate signal rather than noise, undermining both the mechanistic story and the recovery claims.
  2. [Empirical evaluation and results] The abstract and results sections report that DG-Hard 'achieves the strongest balanced repair' across 14 settings, yet supply no per-run standard deviations, no statistical significance tests against the strongest baseline, and no explicit data-exclusion or hyper-parameter selection protocol. Without these, it is impossible to judge whether the reported superiority is robust or sensitive to evaluation choices.
minor comments (2)
  1. [Evaluation metric] The partition-conditional metric is introduced without a formal definition or pseudocode; a small table or equation showing how the four scores are aggregated into the 'balanced repair' ranking would improve reproducibility.
  2. [Figures] Several figures compare DG-Hard to baselines but omit error bars or confidence intervals; adding them would make the visual claims easier to assess.
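As an illustration of the kind of pseudocode minor comment 1 asks for, the sketch below computes two of the four scores. Only the harmonic mean HM(% healed, % preserved) is taken from the Figure 1 caption. The partition rule (damaged if the fine-tuned score is below base, improved if above), the normalization, the clipping to [0, 100], and all names are assumptions made here. The paper's actual definition is in its §4.3, which this page does not reproduce, and the non-damage and retention scores are omitted.

```python
def clip01(x: float) -> float:
    """Clamp a fraction to the interval [0, 1]."""
    return max(0.0, min(1.0, x))


def partition_scores(base: dict, ft: dict, repaired: dict) -> dict:
    """Hypothetical partition-conditional scores over held-out benchmarks.

    Each argument maps benchmark name -> score for one checkpoint.
    """
    healed, preserved = [], []
    for name, b in base.items():
        f, r = ft[name], repaired[name]
        if f < b:
            # Damaged partition: fraction of the lost score that is recovered.
            healed.append(100.0 * clip01((r - f) / (b - f)))
        elif f > b:
            # Improved partition: fraction of the gain that survives repair.
            preserved.append(100.0 * clip01((r - b) / (f - b)))

    pct_healed = sum(healed) / len(healed) if healed else None
    pct_preserved = sum(preserved) / len(preserved) if preserved else None

    if pct_healed is None or pct_preserved is None:
        hm = None
    elif pct_healed + pct_preserved == 0:
        hm = 0.0
    else:
        hm = 2 * pct_healed * pct_preserved / (pct_healed + pct_preserved)
    return {"pct_healed": pct_healed, "pct_preserved": pct_preserved, "hm": hm}
```

Under this assumed definition, a method that simply reverts to the base checkpoint scores 100 on healing but 0 on preservation, so its harmonic mean is 0. That is the "trivial reversion" failure the report says the metric is meant to expose.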

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the justification of our method and the reporting of empirical results. We address each major point below and indicate the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: [Method description (around the definition of DG-Hard)] The justification for Donoho-Gavish hard thresholding rests on the claim that the residual after low-rank extraction behaves as unstructured IID Gaussian noise (see the paragraph beginning 'DG-Hard treats Δ as a low-rank task-aligned signal...'). No diagnostic is reported that compares the empirical singular-value distribution of the residual to the Marchenko-Pastur bulk predicted by the theorem; if row/column correlations or heavy tails remain, the threshold may truncate signal rather than noise, undermining both the mechanistic story and the recovery claims.

    Authors: We agree that an explicit diagnostic comparing the residual singular-value spectrum to the Marchenko-Pastur bulk would strengthen the mechanistic justification. The Donoho-Gavish threshold is chosen because it provides a closed-form, data-free way to separate the presumed low-rank task signal from the residual under the modeling assumption of approximately unstructured noise; however, we did not verify this assumption empirically in the submitted version. In the revised manuscript we will add a supplementary figure and accompanying text that plots the empirical singular values of the post-DG-Hard residuals for representative layers across several models and compares them to the theoretical Marchenko-Pastur edge. revision: yes

  2. Referee: [Empirical evaluation and results] The abstract and results sections report that DG-Hard 'achieves the strongest balanced repair' across 14 settings, yet supply no per-run standard deviations, no statistical significance tests against the strongest baseline, and no explicit data-exclusion or hyper-parameter selection protocol. Without these, it is impossible to judge whether the reported superiority is robust or sensitive to evaluation choices.

    Authors: We acknowledge that the current presentation lacks measures of variability and formal statistical comparisons, which limits assessment of robustness. The experiments followed standard benchmark protocols with fixed evaluation settings and DG-Hard itself requires no hyper-parameter tuning beyond the SVD operation. In the revision we will (i) report per-run standard deviations for all metrics where repeated evaluations are feasible, (ii) include statistical significance tests (e.g., paired Wilcoxon tests) against the strongest baseline, and (iii) add an explicit subsection describing data exclusion criteria, benchmark partitioning, and the absence of hyper-parameter search for DG-Hard. revision: yes
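The paired test mentioned in response 2 can be sketched without external dependencies. The code below is a generic exact Wilcoxon signed-rank test, not the authors' analysis code: the function name is hypothetical, zero differences are dropped, tied magnitudes get average ranks, and the two-sided p-value is obtained by enumerating all sign assignments. That enumeration is feasible for 14 paired settings (2^14 patterns) but not for large samples.

```python
from itertools import product


def wilcoxon_signed_rank_exact(x, y):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Returns (W_plus, p_value). Intended for small n only.
    """
    d = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(d)

    # Rank |d| in ascending order, giving tied magnitudes their average rank.
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    w_plus = sum(r for r, di in zip(ranks, d) if di > 0)
    center = sum(ranks) / 2
    observed = abs(w_plus - center)

    # Under the null, each rank is positive or negative with probability 1/2.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - center) >= observed - 1e-12:
            extreme += 1
    return w_plus, extreme / 2**n
```

Here `x` and `y` would hold one score per (model, task) setting for DG-Hard and for the strongest baseline, in the same order. Which score the authors will test is not specified in the rebuttal.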

Circularity Check

0 steps flagged

No significant circularity: closed-form SVD filter applies external Donoho-Gavish threshold without fitted parameters or self-referential derivation

full rationale

The paper defines DG-Hard explicitly as the application of the known Donoho-Gavish hard singular-value threshold to each weight-delta matrix Δ = W_ft − W_base, under the modeling assumption that the update contains low-rank task signal plus IID-like residual noise. This threshold is imported from external random-matrix theory (Marchenko-Pastur law) rather than derived from the paper's own repair results or evaluation metrics. The method is described as requiring no data-dependent tuning or fitted parameters, and the central empirical claims rest on cross-benchmark experiments rather than on any quantity that is statistically forced by construction. The introduced partition-conditional metric is an evaluation device, not part of the repair derivation itself. No self-citations, ansatzes smuggled via prior work, or renamings of known results appear as load-bearing steps in the provided derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the modeling assumption that the weight update contains separable low-rank signal plus IID-like noise; no free parameters or new entities are introduced in the abstract description.

axioms (1)
  • Domain assumption: The Donoho-Gavish hard threshold optimally separates low-rank signal from noise in the weight-delta matrices under the stated IID-like residual model.
    Invoked when the paper states that DG-Hard applies the threshold to keep the structured high-energy part of the update.




Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 16 internal anchors

  1. [1]

    Michael McCloskey and N. J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation, volume 24, pages 109–165. Academic Press, 1989

  2. [2]

    Connectionist models of recognition memory: Constraints imposed by learning and forgetting functions

    Roger Ratcliff. Connectionist models of recognition memory: Constraints imposed by learning and forgetting functions. Psychological Review, 97(2):285–308, 1990. doi: 10.1037/0033-295X.97.2.285

  3. [3]

    Robert M. French. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128–135, 1999. doi: 10.1016/S1364-6613(99)01294-2

  4. [4]

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi...

  5. [5]

    Universal Language Model Fine-tuning for Text Classification

    Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 2018. URL https://arxiv.org/abs/1801.06146

  6. [6]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019. URL https://arxiv.org/abs/1810.04805

  7. [7]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback....

  8. [8]

    An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning

    Yun Luo, Zefan Yang, Fan Meng, Yufei Li, Jie Zhou, and Yue Zhang. An empirical study of catastrophic forgetting in large language models during continual fine-tuning. arXiv preprint arXiv:2308.08747, 2023. URL https://arxiv.org/abs/2308.08747

  9. [9]

    Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

    Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to! In International Conference on Learning Representations, 2024. URL https://arxiv.org/abs/2310.03693

  10. [10]

    Fine-tuning can distort pretrained features and underperform out-of-distribution

    Ananya Kumar, Aditi Raghunathan, Robbie Jones, Tengyu Ma, and Percy Liang. Fine-tuning can distort pretrained features and underperform out-of-distribution. In International Conference on Learning Representations, 2022

  11. [11]

    Three Factors Influencing Minima in SGD

    Stanisław Jastrzębski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. Three factors influencing minima in SGD. arXiv preprint arXiv:1711.04623, 2017

  12. [12]

    N. S. Keskar, Dheevatsa Mudigere, Jorge Nocedal, Misha Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In International Conference on Learning Representations, 2017

  13. [13]

    Robust fine-tuning of zero-shot models

    Mitchell Wortsman, Gabriel Ilharco, Jong Wook Kim, Mike Y. Li, Simon Kornblith, Rebecca Roelofs, Raphael Gontijo Lopes, Hannaneh Hajishirzi, Ali Farhadi, Hongseok Namkoong, and Ludwig Schmidt. Robust fine-tuning of zero-shot models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022. URL https://arxiv.org/abs/2109....

  14. [14]

    Language models are Super Mario: Absorbing abilities from homologous models as a free lunch

    Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, and Yong Li. Language models are Super Mario: Absorbing abilities from homologous models as a free lunch. In International Conference on Machine Learning, 2024. URL https://arxiv.org/abs/2311.03099

  15. [15]

    TIES-merging: Resolving interference when merging models

    Prateek Yadav, Derek Tam, Leshem Choshen, Colin Raffel, and Mohit Bansal. TIES-merging: Resolving interference when merging models. In Advances in Neural Information Processing Systems, 2023. URL https://arxiv.org/abs/2306.01708

  16. [16]

    Mitigating catastrophic forgetting in large language models with forgetting-aware pruning (FAPM)

    Wei Huang, Aimin Cheng, and Yu Wang. Mitigating catastrophic forgetting in large language models with forgetting-aware pruning (FAPM). In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025. URL https://arxiv.org/abs/2509.08255

  17. [17]

    V. A. Marchenko and L. A. Pastur. Distribution of eigenvalues for some sets of random matrices. Mathematics of the USSR-Sbornik, 1(4):457–483, 1967

  18. [18]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022. URL https://arxiv.org/abs/2106.09685

  19. [19]

    Intrinsic dimensionality explains the effectiveness of language model fine-tuning

    Armen Aghajanyan, Luke Zettlemoyer, and Sonal Gupta. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, 2021. URL https://arxiv.org/abs/2012.13255

  20. [20]

    Random matrix analysis of deep neural network weight matrices

    Matthias Thamm, Max Staats, and Bernd Rosenow. Random matrix analysis of deep neural network weight matrices. Physical Review E, 106:054124, 2022. URL https://arxiv.org/abs/2203.14661

  21. [21]

    Boundary between noise and information applied to filtering neural network weight matrices

    Max Staats, Matthias Thamm, and Bernd Rosenow. Boundary between noise and information applied to filtering neural network weight matrices. Physical Review E, 108:L022302, 2023. URL https://arxiv.org/abs/2206.03927

  22. [22]

    Matan Gavish and David L. Donoho. The optimal hard threshold for singular values is 4/√3. IEEE Transactions on Information Theory, 60(8):5040–5053, 2014. URL https://arxiv.org/abs/1305.5870

  23. [23]

    Continual pre-training of language models

    Zixuan Ke, Yijia Shao, Haolong Lin, Tatsuya Konishi, Gyuwan Kim, and Bing Liu. Continual pre-training of language models. In International Conference on Learning Representations. URL https://arxiv.org/abs/2302.03241

  25. [25]

    Overcoming catastrophic forgetting in neural networks

    James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwińska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13): 3...

  26. [26]

    Continual Learning Through Synaptic Intelligence

    Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In International Conference on Machine Learning, 2017. URL https://arxiv.org/abs/1703.04200

  27. [27]

    Memory Aware Synapses: Learning what (not) to forget

    Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European Conference on Computer Vision, 2018. URL https://arxiv.org/abs/1711.09601

  28. [28]

    Balancing speciality and versatility: A coarse to fine framework for mitigating catastrophic forgetting in large language models

    Haonan Zhang, Yinjun Wu, Dongxu Li, Shuo Yang, Rui Zhao, Yu Jiang, and Fei Tan. Balancing speciality and versatility: A coarse to fine framework for mitigating catastrophic forgetting in large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024. URL https://arxiv.org/abs/2404.10306

  29. [29]

    Gradient episodic memory for continual learning

    David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, 2017. URL https://arxiv.org/abs/1706.08840

  30. [30]

    On Tiny Episodic Memories in Continual Learning

    Arslan Chaudhry, Marcus Rohrbach, Mohamed Elhoseiny, Thalaiyasingam Ajanthan, Puneet K. Dokania, Philip H. S. Torr, and Marc’Aurelio Ranzato. On tiny episodic memories in continual learning. arXiv preprint arXiv:1902.10486, 2019. URL https://arxiv.org/abs/1902.10486

  31. [31]

    LoRA vs full fine-tuning: An illusion of equivalence

    Richard Shuttleworth, Jacob Andreas, Antonio Torralba, and Pratyusha Sharma. LoRA vs full fine-tuning: An illusion of equivalence. In Advances in Neural Information Processing Systems. URL https://arxiv.org/abs/2410.21228

  33. [33]

    Editing Models with Task Arithmetic

    Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. In International Conference on Learning Representations, 2023. URL https://arxiv.org/abs/2212.04089

  34. [34]

    Merging models with Fisher-weighted averaging

    Michael S. Matena and Colin A. Raffel. Merging models with Fisher-weighted averaging. In Advances in Neural Information Processing Systems, 2022. URL https://arxiv.org/abs/2111.09832

  35. [35]

    Matan Gavish and David L. Donoho. Optimal shrinkage of singular values. IEEE Transactions on Information Theory, 63(4):2137–2152, 2017. URL https://arxiv.org/abs/1405.7511

  36. [36]

    HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

    Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. In Proceedings of the 41st International Conference on Machine Learning, 2024. URL https://arxiv.org/abs/2402.04249

  37. [37]

    XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models

    Paul Röttger, Hannah Rose Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. XSTest: A test suite for identifying exaggerated safety behaviours in large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics, 2024. URL https://arxiv.org/abs/2308.01263

  38. [38]

    A StrongREJECT for Empty Jailbreaks

    Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, and Sam Toyer. A StrongREJECT for empty jailbreaks. In Advances in Neural Information Processing Systems (Datasets and Benchmarks Track), 2024. URL https://arxiv.org/abs/2402.10260

  39. [39]

    Task-specific skill localization in fine-tuned language models

    Abhishek Panigrahi, Nikunj Saunshi, Haifeng Zhao, and Sanjeev Arora. Task-specific skill localization in fine-tuned language models. In International Conference on Machine Learning. URL https://arxiv.org/abs/2302.06600

  41. [41]

    The truth is in there: Improving reasoning in language models with layer-selective rank reduction

    Pratyusha Sharma, Jordan T. Ash, and Dipendra Misra. The truth is in there: Improving reasoning in language models with layer-selective rank reduction. In International Conference on Learning Representations, 2024. URL https://arxiv.org/abs/2312.13558

    Appendix Contents: A. Empirical evidence for the signal-plus-noise structure of ∆ ...

  42. [42]

    provide a complementary parametric-sparsity result: ∼0.01% of parameters carry >95% of fine-tune task performance when grafted back onto the base model. Sparsity in coordinate space and concentration in singular-value space are distinct mathematical properties; we cite this work as a parallel structural prior, not as direct evidence for low-rank ∆. The noi...

  43. [43]

    (DARE) show that the fine-tune delta tolerates random pruning of 90–99% of its entries with rescaling, attributing this to “extreme redundancy” of small-magnitude updates, consistent with most of ∆ being redundant rather than informative. Direct check on our own checkpoints. Fig. 4 verifies the structure layer-locally on Llama-3.2-3B’s mlp.up_proj at layer 1...