arXiv: 2505.16831 · v3 · pith:E7MMSI7Gnew · submitted 2025-05-22 · cs.CL · cs.AI · cs.CR · cs.LG

Unlearning Isn't Deletion: Investigating Reversibility of Machine Unlearning in LLMs

Pith reviewed 2026-05-22 13:22 UTC · model grok-4.3

keywords: machine unlearning · large language models · reversibility · representation analysis · forgetting regimes · PCA · CKA · Fisher information

The pith

Task-level metrics mislead about unlearning success in LLMs because minimal fine-tuning restores original behavior.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard ways of checking whether unlearning worked in large language models can be misleading. Accuracy and perplexity scores may drop after an unlearning step, creating the appearance that specific data has been removed. In practice, however, a small amount of further training often brings the original performance back quickly. This pattern indicates that the information is suppressed rather than deleted from the model. The authors therefore propose examining changes inside the model's internal representations to get a clearer picture of what has actually happened.

Core claim

Unlearning methods in LLMs often produce reversible forgetting: the model shows reduced performance on the target data according to task metrics, yet its original behavior returns after minimal fine-tuning. This reversibility implies that the target information remains encoded rather than erased. A representation-level framework using PCA similarity and shift, centered kernel alignment, Fisher information, and mean PCA distance measures how much the internal encodings have changed, revealing four distinct forgetting regimes that differ in reversibility and in how much they harm the model's overall capabilities.
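
To make "PCA similarity and shift" concrete, here is a minimal sketch in Python with NumPy. It rests on assumptions rather than the paper's exact definitions: similarity is taken as the absolute cosine between the first principal directions of two activation matrices, and shift as the displacement of the activation mean along the original first principal direction. The function names and the synthetic activations are hypothetical.

```python
import numpy as np

def first_pc(acts: np.ndarray) -> np.ndarray:
    """First principal direction of an (n_samples, n_features) activation matrix."""
    centered = acts - acts.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]

def pca_similarity(before: np.ndarray, after: np.ndarray) -> float:
    """Assumed definition: |cosine| between first principal directions (sign-free)."""
    return float(abs(first_pc(before) @ first_pc(after)))

def pca_shift(before: np.ndarray, after: np.ndarray) -> float:
    """Assumed definition: magnitude of the mean displacement along the original PC1."""
    direction = first_pc(before)
    return float(abs((after.mean(axis=0) - before.mean(axis=0)) @ direction))

# Synthetic stand-ins for layer activations (not data from the paper).
rng = np.random.default_rng(0)
scales = np.array([5.0] + [1.0] * 7)          # dominant variance on feature 0
original = rng.normal(size=(500, 8)) * scales
translated = original + 3.0 * np.eye(8)[0]    # same geometry, moved along feature 0
permuted = original[:, ::-1]                  # dominant variance moved to feature 7

sim_translated = pca_similarity(original, translated)
shift_translated = pca_shift(original, translated)
sim_permuted = pca_similarity(original, permuted)
```

In this toy setting a pure translation leaves the principal direction intact but registers as a shift, while moving the dominant variance to a different feature collapses the similarity. Whether these two patterns map cleanly onto suppression versus deletion is the paper's empirical question, not something the sketch establishes.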

What carries the argument

A representation-level analysis framework that tracks representational drift through PCA similarity and shift, centered kernel alignment (CKA), Fisher information, and mean PCA distance, with the aim of distinguishing suppression from deletion.
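
For readers unfamiliar with CKA, the following sketch computes the linear variant between two activation matrices with NumPy. It uses the standard linear-CKA formula; whether the paper uses the linear or a kernel variant is not stated on this page, so treat that choice, the function name, and the random activations as assumptions.

```python
import numpy as np

def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between activation matrices of shape (n_samples, n_features)."""
    # Center every feature so constant offsets do not affect the score.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # Squared Frobenius norm of the cross-covariance, normalised by the self-terms.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))

# Synthetic activations standing in for a layer before and after unlearning.
rng = np.random.default_rng(0)
acts_original = rng.normal(size=(200, 16))
rotation, _ = np.linalg.qr(rng.normal(size=(16, 16)))  # random orthogonal matrix
acts_rotated = acts_original @ rotation                 # same geometry, new basis
acts_unrelated = rng.normal(size=(200, 16))             # independent activations

cka_rotated = linear_cka(acts_original, acts_rotated)
cka_unrelated = linear_cka(acts_original, acts_unrelated)
```

Linear CKA is invariant to orthogonal transformations, so the rotated copy scores essentially 1 while the independent activations score far lower. That invariance is why a high CKA after unlearning suggests the representational geometry has been preserved.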

If this is right

  • Task-level metrics alone cannot confirm that unlearning has achieved genuine deletion.
  • Unlearning outcomes fall into four regimes that combine different degrees of reversibility and damage to general model performance.
  • Relearning speed depends on the source of the data used for recovery attempts.
  • Irreversible and non-catastrophic forgetting is rare and difficult to produce.
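
The two axes behind the four regimes can be sketched as a small classifier. This is an illustration only: the thresholds, field names, and the choice to measure reversibility by recovered forget-set accuracy and catastrophicity by the drop in general accuracy are assumptions, not the paper's criteria.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class UnlearningOutcome:
    forget_acc_original: float    # forget-set accuracy before unlearning
    forget_acc_relearned: float   # forget-set accuracy after brief relearning
    general_acc_original: float   # general-benchmark accuracy before unlearning
    general_acc_unlearned: float  # general-benchmark accuracy after unlearning

def classify_regime(
    outcome: UnlearningOutcome,
    recovery_threshold: float = 0.9,  # hypothetical: fraction of original accuracy regained
    damage_threshold: float = 0.2,    # hypothetical: absolute drop in general accuracy
) -> str:
    """Label an outcome along the reversibility and catastrophicity axes."""
    reversible = (
        outcome.forget_acc_relearned
        >= recovery_threshold * outcome.forget_acc_original
    )
    catastrophic = (
        outcome.general_acc_original - outcome.general_acc_unlearned
        >= damage_threshold
    )
    first = "reversible" if reversible else "irreversible"
    second = "catastrophic" if catastrophic else "non-catastrophic"
    return f"{first}, {second}"

# Illustrative numbers, not results from the paper.
mild = classify_regime(UnlearningOutcome(0.8, 0.78, 0.7, 0.68))
severe = classify_regime(UnlearningOutcome(0.8, 0.1, 0.7, 0.2))
```

Under these assumed thresholds, the regime the page calls rare and difficult to produce is the one labelled "irreversible, non-catastrophic".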

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Unlearning algorithms may need explicit mechanisms to resist small updates that restore suppressed knowledge.
  • Representation metrics could be paired with recovery experiments to create stronger verification protocols.
  • Similar reversibility patterns may appear in other model-editing settings that aim to remove specific capabilities.

Load-bearing premise

That the representation-level metrics reliably indicate whether information has been permanently deleted rather than merely suppressed in a form that small updates can overcome.

What would settle it

An unlearning method after which even extensive fine-tuning on recovery data fails to restore original performance while the mean PCA distance and related metrics remain large and stable.

Figures

Figures reproduced from arXiv: 2505.16831 by Haibo Hu, Huadi Zheng, Minxin Du, Peizhao Hu, Qingqing Ye, Xiang Yue, Xiaoyu Xu, Yang Liu.

Figure 1: (a) task-level accuracy and CKA subspaces of
Figure 2: Single unlearning analysis on Yi-6B with GA un
Figure 3: Layer-wise PCA Similarity and Shift for GA on Yi-6B (simple task). Vary LR
Figure 4: CKA for GA on Yi-6B, simple task. Vary LR
Figure 5: Layer-wise PCA Similarity and Shift for GA on Yi-6B (simple task). vary
Figure 6: CKA and FIM for GA on Yi-6B, simple task. Vary LR
Figure 7: PCA Similarity Across Layers. Each row shows results under different unlearning methods:
Figure 8: PCA Similarity Across Layers. Each row shows results under different unlearning methods:
Figure 9: PCA Similarity Across Layers. Each row shows results under different unlearning methods:
Figure 10: PCA Similarity Analysis for GA under Varied Relearning and Evaluation Inputs on Yi-6B
Figure 11: PCA Shift Across Layers. Each row shows results under different unlearning methods:
Figure 12: PCA Shift Across Layers. Each row shows results under different unlearning methods:
Figure 13: PCA Shift Across Layers. Each row shows results under different unlearning methods:
Figure 14: PCA Shift Analysis under Varied Relearning and Evaluation Inputs on Yi-6B (Simple
Figure 15: CKA Across Layers. Each row shows results under different unlearning methods: GA+GD
Figure 16: CKA Across Layers. Each row shows results under different unlearning methods: GA+GD
Figure 17: CKA Across Layers. Each row shows results under different unlearning methods: GA
Figure 18: CKA Analysis under Varied Relearning and Evaluation Inputs on Yi-6B (Simple Task).
Figure 19: FIM for GA Across Layers. All plots are for the simple task on Yi-6B, using three learning
Figure 20: FIM for GA+GD Across Layers. All plots are for the simple task on Yi-6B, using three
Figure 21: FIM for GA+KL Across Layers. All plots are for the simple task on Yi-6B, using three
Figure 22: FIM for NPO Across Layers. All plots are for the simple task on Yi-6B, using three
Figure 23: FIM for NPO+KL Across Layers. All plots are for the simple task on Yi-6B, using three
Figure 24: FIM for Rlable Across Layers. All plots are for the simple task on Yi-6B, using three
Figure 25: FIM for GA Across Layers. Simple task on Yi-6B with fixed learning rate
Figure 26: FIM for GA+GD Across Layers. Simple task on Yi-6B with fixed learning rate
Figure 27: FIM for GA+KL Across Layers. Simple task on Yi-6B with fixed learning rate
Figure 28: FIM for NPO Across Layers. Simple task on Yi-6B with fixed learning rate
Figure 29: FIM for NPO+KL Across Layers. Simple task on Yi-6B with fixed learning rate
Figure 30: FIM for Rlable Across Layers. Simple task on Yi-6B with fixed learning rate
Figure 31: FIM for GA Across Layers. All plots are for the complex task on Qwen2.5-7B, using
Figure 32: FIM for NPO Across Layers. All plots are for the complex task on Qwen2.5-7B, using
Figure 33: FIM for Rlable Across Layers. All plots are for the complex task on Qwen2.5-7B, using
Figure 34: FIM in layer 31 under Varied Relearning and Evaluation Inputs on Yi-6B (Simple Task).
Figure 35: FIM in layer 25 under Varied Relearning and Evaluation Inputs on Yi-6B (Simple Task).
Figure 36: FIM in layer 16 under Varied Relearning and Evaluation Inputs on Yi-6B (Simple Task).
Figure 37: FIM in layer 4 under Varied Relearning and Evaluation Inputs on Yi-6B (Simple Task).
Figure 38: FIM in layer 1 under Varied Relearning and Evaluation Inputs on Yi-6B (Simple Task).

Original abstract

Unlearning in large language models (LLMs) aims to remove specified data, but its efficacy is typically assessed with task-level metrics like accuracy and perplexity. We show that these metrics can be misleading, as models can appear to forget while their original behavior is easily restored through minimal fine-tuning. This reversibility suggests that information is merely suppressed, not genuinely erased. To address this critical evaluation gap, we introduce a representation-level analysis framework. Our toolkit comprises PCA similarity and shift, centered kernel alignment (CKA), and Fisher information, complemented by a summary metric, the mean PCA distance, to measure representational drift. Applying this framework across multiple unlearning methods, data domains, and LLMs, we identify four distinct forgetting regimes based on their reversibility and catastrophicity. We compare recovery strategies and show that relearning efficiency relies on the data source. We also find that irreversible, non-catastrophic forgetting is exceptionally challenging. By probing unlearning limits, we identify a case of seemingly irreversible, targeted forgetting, offering insights for more robust erasure algorithms. Overall, our findings expose a gap in current evaluation and establish a representation-level foundation for trustworthy unlearning.
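
Of the tools named in the abstract, Fisher information is the least self-explanatory, so a toy sketch may help. It computes a diagonal empirical Fisher estimate (the mean squared per-example gradient of the log-likelihood) for a logistic-regression model in NumPy. The logistic model is only a stand-in for an LLM layer, and the diagonal empirical estimator is an assumption about how such a quantity is typically approximated, not a description of the paper's procedure.

```python
import numpy as np

def diagonal_empirical_fisher(
    weights: np.ndarray, inputs: np.ndarray, labels: np.ndarray
) -> np.ndarray:
    """Diagonal empirical Fisher for logistic regression.

    inputs: (n_samples, n_features), labels: (n_samples,) with values in {0, 1}.
    """
    logits = inputs @ weights
    probs = 1.0 / (1.0 + np.exp(-logits))
    # Per-example gradient of the Bernoulli log-likelihood w.r.t. the weights.
    per_example_grads = (labels - probs)[:, None] * inputs
    # Average of squared gradients gives one sensitivity value per parameter.
    return (per_example_grads ** 2).mean(axis=0)

# Synthetic data standing in for forget-set examples (not from the paper).
rng = np.random.default_rng(0)
inputs = rng.normal(size=(100, 4))
labels = rng.integers(0, 2, size=100).astype(float)
weights = np.zeros(4)

fisher = diagonal_empirical_fisher(weights, inputs, labels)
```

Larger entries mark parameters to which the likelihood of the data is more sensitive. Comparing such profiles before unlearning, after unlearning, and after relearning is one way to see whether that sensitivity has been flattened or has returned.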

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that task-level metrics such as accuracy and perplexity are misleading for evaluating machine unlearning in LLMs, as models can recover original behavior via minimal fine-tuning, indicating suppression rather than deletion. It introduces a representation-level framework using PCA similarity/shift, CKA, Fisher information, and mean PCA distance to measure drift, identifies four forgetting regimes based on reversibility and catastrophicity across methods/domains/models, compares recovery strategies, and reports a case of seemingly irreversible targeted forgetting while noting the difficulty of irreversible non-catastrophic unlearning.

Significance. If the observations hold, the work is significant for exposing gaps in current unlearning evaluation and establishing a representation-level foundation that could improve robustness of erasure methods. The empirical breadth across methods, domains, and LLMs lends support to the regime taxonomy and reversibility findings. However, the framework's interpretive claims would be strengthened by addressing the lack of a positive control baseline.

major comments (2)
  1. [§3 and §4] §3 (Representation-Level Analysis Framework) and §4 (Experiments): The central claim that PCA/CKA/Fisher/mean-PCA-distance metrics distinguish genuine deletion from suppression because they track reversibility is load-bearing for the four-regime taxonomy, yet the paper provides no positive control by comparing unlearned models against an otherwise identical model trained from random initialization on data excluding the target set. Without this anchor, observed drifts could reflect unlearning optimizer artifacts rather than evidence of information removal, leaving the 'irreversible' regime interpretation underdetermined.
  2. [Abstract and §5] Abstract and §5 (Discussion): The identification of a 'seemingly irreversible, targeted forgetting' case and the broader claims about recovery efficiency relying on data source would benefit from explicit controls for confounding factors in the recovery experiments and reporting of error bars or statistical significance, as these directly support the reversibility observations and regime distinctions.
minor comments (2)
  1. [Abstract] The abstract would be clearer with a brief enumeration of the specific unlearning methods (e.g., gradient ascent, negative preference optimization) and model scales tested.
  2. [§3.2] Notation for the summary metric 'mean PCA distance' could be formalized with an equation in §3.2 to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and insightful review. We appreciate the acknowledgment of the paper's significance in exposing limitations of task-level metrics for machine unlearning evaluation and the value of the representation-level framework. We address the major comments point by point below, providing clarifications and indicating planned revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [§3 and §4] §3 (Representation-Level Analysis Framework) and §4 (Experiments): The central claim that PCA/CKA/Fisher/mean-PCA-distance metrics distinguish genuine deletion from suppression because they track reversibility is load-bearing for the four-regime taxonomy, yet the paper provides no positive control by comparing unlearned models against an otherwise identical model trained from random initialization on data excluding the target set. Without this anchor, observed drifts could reflect unlearning optimizer artifacts rather than evidence of information removal, leaving the 'irreversible' regime interpretation underdetermined.

    Authors: We thank the referee for highlighting this point. Our framework measures representational drift and reversibility relative to the original model, which directly tests whether unlearning suppresses or deletes information by checking if fine-tuning restores original behavior. The consistency of distinct regimes across multiple unlearning methods, domains, and LLMs suggests the metrics capture genuine differences rather than generic optimizer artifacts. A model trained from random initialization on excluded data would provide an additional anchor for 'complete absence of target information,' but it would not isolate the effects of the unlearning optimization trajectory from a pre-trained starting point. We will revise §3 to explicitly discuss this baseline choice and its rationale, add a limitations paragraph in §4 and §5 acknowledging the value of such a positive control, and include a small-scale experiment if compute permits. This addresses the interpretive concern while preserving the core multi-method evidence for the taxonomy.

    Revision: partial

  2. Referee: [Abstract and §5] Abstract and §5 (Discussion): The identification of a 'seemingly irreversible, targeted forgetting' case and the broader claims about recovery efficiency relying on data source would benefit from explicit controls for confounding factors in the recovery experiments and reporting of error bars or statistical significance, as these directly support the reversibility observations and regime distinctions.

    Authors: We agree that additional statistical rigor would strengthen the recovery results. The experiments already control for key factors including fine-tuning steps, learning rate, and data source (original vs. alternative corpora) to isolate efficiency differences. For the irreversible case, we used fixed evaluation protocols. In the revision we will: (1) report error bars from multiple random seeds in all relevant figures and tables, (2) add explicit discussion in §5 of potential confounders such as hyperparameter sensitivity and data overlap, and (3) include statistical significance tests where appropriate. These changes will better support the reversibility claims and regime distinctions without altering the main findings.

    Revision: yes
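
As a rough illustration of what "error bars from multiple random seeds" could amount to, the sketch below computes the mean and standard error of a metric across seeds using only the Python standard library. The helper name and the accuracy values are hypothetical and are not results from the paper.

```python
import statistics

def mean_and_sem(values: list[float]) -> tuple[float, float]:
    """Mean and standard error of the mean for per-seed measurements."""
    if len(values) < 2:
        raise ValueError("need at least two seeds to estimate a standard error")
    mean = statistics.fmean(values)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    sem = statistics.stdev(values) / len(values) ** 0.5
    return mean, sem

# Hypothetical recovered accuracies from three relearning runs with different seeds.
recovered_acc_by_seed = [0.71, 0.74, 0.69]
mean_acc, sem_acc = mean_and_sem(recovered_acc_by_seed)
print(f"recovered accuracy: {mean_acc:.3f} +/- {sem_acc:.3f}")
```

With only a handful of seeds the standard error is itself noisy, so it would complement rather than replace the significance tests the rebuttal promises.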

Circularity Check

0 steps flagged

No significant circularity: representation metrics and regime taxonomy remain independent of target claims

The paper applies standard, off-the-shelf representation similarity tools (PCA similarity/shift, CKA, Fisher information, mean PCA distance) to quantify drift between pre- and post-unlearning activations. These metrics are computed directly from model internals and are not fitted or redefined using the recovery or accuracy outcomes they are later correlated with. The four forgetting regimes are labeled after the fact from two independently measured axes—reversibility (ease of restoring original behavior via minimal fine-tuning) and catastrophicity (task performance drop)—without any equation or definition that makes one quantity a function of the other by construction. No load-bearing self-citation, uniqueness theorem, or ansatz smuggling is present in the provided derivation chain. The central claim that task-level metrics can mislead therefore rests on observable experimental outcomes rather than tautological re-labeling of inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on empirical observations using standard representation metrics and the assumption that task-level recovery indicates non-deletion; no free parameters, invented entities, or ad-hoc axioms beyond domain-standard evaluation practices are introduced.

axioms (1)
  • domain assumption: Task-level metrics such as accuracy and perplexity are insufficient to confirm genuine data deletion in unlearning.
    Invoked to motivate the shift to representation-level analysis.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. "The Whole Is Greater Than the Sum of Its Parts": A Compatibility-Aware Multi-Teacher CoT Distillation Framework

    cs.CL 2026-01 unverdicted novelty 6.0

    COMPACT adaptively fuses multi-teacher CoT supervisions using graph-based consensus, mutual-information adaptability, and loss-based difficulty metrics to improve small language model reasoning performance while mitig...
