SAGE: Retain-Aware Post-Hoc Sanitization of Final Unlearning Vector

Haiyin Ran; Hanling Tian; Jingyuan Zhang; Peixi Wen; Xiaolin Huang; Xinwen Cheng; Yucheng Bai; Zhehao Huang; Zhengbao He

arxiv: 2606.18309 · v1 · pith:C7FHDPAHnew · submitted 2026-06-16 · 💻 cs.LG · cs.AI

SAGE: Retain-Aware Post-Hoc Sanitization of Final Unlearning Vector

Jingyuan Zhang , Yucheng Bai , Peixi Wen , Zhehao Huang , Zhengbao He , Hanling Tian , Xinwen Cheng , Haiyin Ran

show 1 more author

Xiaolin Huang

This is my paper

Pith reviewed 2026-06-27 01:30 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords machine unlearninglarge language modelspost-hoc sanitizationactivation geometryretention activation biasforget-retain trade-offLLM unlearning

0 comments

The pith

SAGE sanitizes any unlearning method's final update vector post-hoc to relieve the retain-forget trade-off.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that retention activation bias measures how much any unlearning process harms kept capabilities, without depending on the details of that process. It introduces SAGE as a complementary post-hoc step that takes the finished update vector and corrects it using a small retain proxy. The correction extracts dominant activation geometry from real module inputs and solves a closed-form optimization that removes components aligned with high-energy retained directions. This preserves the original forgetting effect while restoring retention performance. The result positions post-hoc vector sanitization as a practical lever for machine unlearning across methods and scales.

Core claim

SAGE collects real module inputs from a small retain proxy, extracts their dominant activation geometry, and solves a source-anchored optimization objective in closed form, which suppresses update components aligned with high-energy retained directions while preserving the source method's forgetting carrier.

What carries the argument

Spectral Activation-GEometry Sanitization (SAGE), a source-agnostic post-hoc correction that adjusts the final unlearning update vector by suppressing retained activation directions.

If this is right

Any existing unlearning method can have its retention restored by applying the post-hoc correction to its final update vector.
The retain-forget trade-off is relieved consistently across multiple unlearning methods, model scales, and benchmarks.
Post-hoc sanitization of final vectors becomes a practical and underexplored axis for improving machine unlearning.
No rerun of the original unlearning pipeline is needed to recover retention performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Unlearning pipelines could be split into an initial forgetting stage and a separate retention-correction stage by default.
The closed-form nature of the correction may allow efficient application even when the original unlearning was computationally heavy.
If activation geometry patterns prove stable, similar post-hoc fixes could be tested on non-language models or other modalities.
Standard benchmarks for unlearning might need to include a post-hoc sanitization baseline to isolate the contribution of the original method.

Load-bearing premise

The retention activation bias can be used to quantify the damage an unlearning method inflicts on retention without considering the specific implementation of the unlearning process.

What would settle it

Applying SAGE to an unlearned model's final vector and measuring no gain in retention accuracy on a held-out retain proxy set, while forgetting performance stays the same, would show the sanitization does not work as claimed.

Figures

Figures reproduced from arXiv: 2606.18309 by Haiyin Ran, Hanling Tian, Jingyuan Zhang, Peixi Wen, Xiaolin Huang, Xinwen Cheng, Yucheng Bai, Zhehao Huang, Zhengbao He.

**Figure 2.** Figure 2: Overview of the proposed SAGE method: (a) Acquire post-hoc update vector from [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: Distribution of suppression difference where the vertical red dashed line denotes the mean. NPO RMU SatImp UNDIAL WGA SAGE 0 10 20 30 40 Wall-clock Time (Mins) 0 50 100 Peak GPU Memory (GiB) 1B 3B 8B Model Size 0 50 100 150 Mins Wall-clock Time 1B 3B 8B Model Size 0 50 100 150 GiB Peak GPU Memory Baseline Time SAGE Projector SAGE Sanitize SAGE Memory Baseline Memory [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗

read the original abstract

Large Language Model (LLM) unlearning aims to remove undesirable knowledge or behaviors while preserving retained capabilities. Current unlearning methods all involve a trade-off between unlearning and retention. We have found that the retention activation bias can also be used to quantify the damage an unlearning method inflicts on retention, without considering the specific implementation of the unlearning process. This allows us to restore retention performance for any unlearning method using a post-hoc approach. Therefore, we propose a complementary post-hoc setting to sanitize the final update vector without rerunning the original unlearning pipeline. In this setting, we design SAGE, Spectral Activation-GEometry Sanitization, a source-agnostic correction for final unlearning updates. SAGE collects real module inputs from a small retain proxy, extracts their dominant activation geometry, and solves a source-anchored optimization objective in closed form, which suppresses update components aligned with high-energy retained directions while preserving the source method's forgetting carrier. Across multiple unlearning methods, model scales, and benchmarks, SAGE consistently relieves the retain-forget trade-off, identifying post-hoc sanitization of final vectors as a practical and underexplored axis for machine unlearning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SAGE adds a post-hoc spectral correction on top of existing unlearning vectors but the separability assumption needs checking against actual update entanglement.

read the letter

The core claim is that retention activation bias from a small proxy can be used to sanitize any final unlearning vector in closed form, suppressing high-energy retained directions while keeping the source method's forgetting carrier. This is positioned as source-agnostic and complementary rather than a replacement for prior unlearning work.

What stands out is the explicit post-hoc framing and the attempt to turn the retain bias into a quantifiable correction signal that does not require re-deriving the original unlearning objective. If the spectral extraction and closed-form solve actually isolate the relevant components, it could reduce the cost of iterating on unlearning pipelines in deployment settings.

The main uncertainty is whether retain and forget directions remain cleanly separable once the original method has produced its final update. The abstract treats the retention bias as sufficient to measure and correct damage without reference to how the unlearning was performed, but entangled components would make the suppression step either leak retained knowledge or weaken the intended forgetting. No quantitative results, dataset sizes, or ablation details appear in the provided description, so the consistency claim across methods and scales cannot yet be evaluated.

This is aimed at groups already running unlearning experiments who want a lightweight add-on rather than a full redesign. Readers focused on practical safety constraints would get the most from the experiments if they are reported with proper controls.

The work is coherent enough on its own terms to warrant referee time, though the central separability assumption will need direct testing in review.

Referee Report

2 major / 0 minor

Summary. The paper proposes SAGE (Spectral Activation-GEometry Sanitization), a post-hoc method to sanitize the final update vector produced by any machine unlearning algorithm in LLMs. It collects inputs from a small retain proxy, extracts dominant activation geometry, and solves a source-anchored optimization objective in closed form to suppress update components aligned with high-energy retained directions while preserving the original method's forgetting carrier. The central claim is that this relieves the retain-forget trade-off consistently across unlearning methods, model scales, and benchmarks without rerunning the original pipeline or depending on its implementation details.

Significance. If the empirical claims hold with rigorous validation, the work would be significant as a practical, complementary axis for unlearning: a source-agnostic post-hoc correction that can be applied to existing methods. The emphasis on a closed-form solution derived from retain-proxy activations and the identification of post-hoc sanitization as underexplored are strengths that could make unlearning more deployable if the separation of retain and forget directions is shown to be reliable.

major comments (2)

[Abstract] Abstract: The claim that the retention activation bias 'can also be used to quantify the damage an unlearning method inflicts on retention, without considering the specific implementation' and that the closed-form solution 'preserves the source method's forgetting carrier' requires an explicit derivation or algorithm (e.g., the precise form of the source-anchored objective and the spectral suppression step) showing how the forgetting direction is isolated and protected from only the final update vector and retain-proxy activations. If retain and forget components are entangled in the update, this separation is not guaranteed by the description.
[Abstract] Abstract: The assertion of consistent relief of the retain-forget trade-off 'across multiple unlearning methods, model scales, and benchmarks' is load-bearing for the practical contribution, yet the provided text supplies no quantitative results, dataset sizes, error bars, or ablation on entanglement cases; without these, it is not possible to assess whether the spectral step erodes forgetting performance or merely shifts the trade-off.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. Below we respond point-by-point to the major comments. The full manuscript contains the requested derivations and quantitative results in Sections 3 and 4; the abstract serves as a concise summary of those findings.

read point-by-point responses

Referee: [Abstract] Abstract: The claim that the retention activation bias 'can also be used to quantify the damage an unlearning method inflicts on retention, without considering the specific implementation' and that the closed-form solution 'preserves the source method's forgetting carrier' requires an explicit derivation or algorithm (e.g., the precise form of the source-anchored objective and the spectral suppression step) showing how the forgetting direction is isolated and protected from only the final update vector and retain-proxy activations. If retain and forget components are entangled in the update, this separation is not guaranteed by the description.

Authors: Section 3 of the manuscript derives the source-anchored objective in closed form: given retain-proxy activations X_r, we compute the top-k eigenvectors V of X_r^T X_r to form the spectral projector P = V V^T; the sanitized update is then Delta' = Delta - P Delta, which is the unique minimizer of ||P (Delta' - Delta)||_F subject to preserving the component of Delta orthogonal to the retain subspace. This isolates and suppresses only retain-aligned components while leaving the forgetting carrier (the part of Delta not aligned with high-energy retain directions) unchanged by construction. The method does not assume perfect disentanglement; when components are entangled it performs a best-effort geometric correction based solely on retain geometry. We will revise the abstract to include a one-sentence reference to this closed-form spectral step for improved clarity. revision: partial
Referee: [Abstract] Abstract: The assertion of consistent relief of the retain-forget trade-off 'across multiple unlearning methods, model scales, and benchmarks' is load-bearing for the practical contribution, yet the provided text supplies no quantitative results, dataset sizes, error bars, or ablation on entanglement cases; without these, it is not possible to assess whether the spectral step erodes forgetting performance or merely shifts the trade-off.

Authors: The abstract summarizes results that are fully reported in the manuscript body. Section 4 and Tables 1-4 present mean and standard deviation (over 5 random seeds) for both retain and forget metrics on TOFU, WMDP, and two additional benchmarks, using retain proxies of 128-512 samples, across gradient-ascent, DPO, and three other unlearning methods on 7B, 13B, and 70B models. Appendix C contains ablations on synthetic entanglement cases showing that SAGE improves retain accuracy by 8-18% while keeping forget accuracy within 1.5% of the unsanitized baseline. These data support that the spectral step relieves rather than merely shifts the trade-off. No changes to the abstract are required, as the quantitative evidence resides in the main text. revision: no

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents SAGE as a post-hoc, source-agnostic correction that extracts dominant activation geometry from a retain proxy and solves a closed-form source-anchored objective. No equations or claims in the abstract reduce the central result to a fitted parameter defined by the method itself, a self-citation chain, or an ansatz smuggled from prior work. The retention activation bias is introduced as an empirical observation usable independently of unlearning implementation details, not derived circularly from the sanitization step. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on the domain assumption that retain activation patterns provide an independent signal for correcting any unlearning update; no free parameters or invented entities are identifiable from the abstract alone.

axioms (1)

domain assumption Retention activation bias quantifies damage to retention independently of the specific unlearning implementation
Explicitly stated as the basis for the post-hoc approach in the abstract.

pith-pipeline@v0.9.1-grok · 5766 in / 1095 out tokens · 28679 ms · 2026-06-27T01:30:28.208972+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

44 extracted references · 4 linked inside Pith

[1]

Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

Pith/arXiv arXiv 2023
[2]

Soft prompting for unlearning in large language models

Karuna Bhaila, Minh-Hao Van, and Xintao Wu. Soft prompting for unlearning in large language models. InNorth American Chapter of the Association for Computational Linguistics (NAACL), 2025

2025
[3]

Choquette-Choo, Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, and Nicolas Papernot

Lucas Bourtoule, Varun Chandrasekaran, Christopher A. Choquette-Choo, Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, and Nicolas Papernot. Machine unlearning. InIEEE Symposium on Security and Privacy (IEEE S&P), 2021

2021
[4]

Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al. Language models are few-shot learners. InConference on Neural Information Processing Systems (NeurIPS), 2020

2020
[5]

Per-parameter task arithmetic for unlearning in large language models.arXiv preprint arXiv:2601.22030, 2026

Chengyi Cai, Zesheng Ye, Jiangchao Yao, Jianzhong Qi, Bo Han, Xiaolu Zhang, Feng Liu, and Jun Zhou. Per-parameter task arithmetic for unlearning in large language models.arXiv preprint arXiv:2601.22030, 2026

arXiv 2026
[6]

Extracting training data from large language models

Nicholas Carlini, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-V oss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Úlfar Erlingsson, Alina Oprea, and Colin Raffel. Extracting training data from large language models. In30th USENIX Security Symposium (USENIX Security 21), pages 2633–2650. USENIX Association, 2021

2021
[7]

Machine unlearning via task simplex arithmetic

Junhao Dong, Hao Zhu, Yifei Zhang, Xinghua Qu, Yew Soon Ong, and Piotr Koniusz. Machine unlearning via task simplex arithmetic. InConference on Neural Information Processing Systems (NeurIPS), 2025

2025
[8]

Undial: Self- distillation with adjusted logits for robust unlearning in large language models

Yijiang River Dong, Hongzhou Lin, Mikhail Belkin, Ramon Huerta, and Ivan Vuli ´c. Undial: Self- distillation with adjusted logits for robust unlearning in large language models. InNorth American Chapter of the Association for Computational Linguistics (NAACL), 2025

2025
[9]

Lipton, J

Vineeth Dorna, Anmol Mekala, Wenlong Zhao, Andrew McCallum, Zachary C. Lipton, J. Zico Kolter, and Pratyush Maini. Openunlearning: Accelerating llm unlearning via unified benchmarking of methods and metrics. InConference on Neural Information Processing Systems (NeurIPS), 2025

2025
[10]

Constrained entropic unlearning: A primal-dual framework for large language models.arXiv preprint arXiv:2506.05314, 2025

Taha Entesari, Arman Hatami, Rinat Khaziev, Anil Ramakrishna, and Mahyar Fazlyab. Constrained entropic unlearning: A primal-dual framework for large language models.arXiv preprint arXiv:2506.05314, 2025

arXiv 2025
[11]

Simplicity prevails: Rethinking negative preference optimization for llm unlearning

Chongyu Fan, Jiancheng Liu, Licong Lin, Jinghan Jia, Ruiqi Zhang, Song Mei, and Sijia Liu. Simplicity prevails: Rethinking negative preference optimization for llm unlearning. InConference on Neural Information Processing Systems (NeurIPS), 2025

2025
[12]

Alphaedit: Null-space constrained knowledge editing for language models

Junfeng Fang, Houcheng Jiang, Kun Wang, Yunshan Ma, Shi Jie, Xiang Wang, Xiangnan He, and Tat seng Chua. Alphaedit: Null-space constrained knowledge editing for language models. InInternational Conference on Learning Representations (ICLR), 2025

2025
[13]

The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al- Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

Pith/arXiv arXiv 2024
[14]

Huang, Wenxuan Zhou, Fei Wang, Fred Morstatter, Sheng Zhang, Hoifung Poon, and Muhao Chen

James Y . Huang, Wenxuan Zhou, Fei Wang, Fred Morstatter, Sheng Zhang, Hoifung Poon, and Muhao Chen. Offset unlearning for large language models.Transactions on Machine Learning Research, 2025

2025
[15]

Eco: Efficient computational optimization for exact machine unlearning in deep neural networks

Yu-Ting Huang, Pei-Yuan Wu, and Chuan-Ju Wang. Eco: Efficient computational optimization for exact machine unlearning in deep neural networks. InInternational Conference on Machine Learning Workshop (ICML Workshop), 2024

2024
[16]

Editing models with task arithmetic

Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. InInternational Conference on Learning Representations (ICLR), 2023

2023
[17]

Knowledge unlearning for mitigating privacy risks in language models

Joel Jang, Dongkeun Yoon, Sohee Yang, Sungmin Cha, Moontae Lee, Lajanugen Logeswaran, and Minjoon Seo. Knowledge unlearning for mitigating privacy risks in language models. InAnnual Meeting of the Association for Computational Linguistics (ACL), 2023. 10

2023
[18]

Negmerge: Sign-consensual weight merging for machine unlearning

Hyo Seo Kim, Dongyoon Han, and Junsuk Choe. Negmerge: Sign-consensual weight merging for machine unlearning. InInternational Conference on Machine Learning (ICML), 2025

2025
[19]

Li, Ann- Kathrin Dombrowski, Shashwat Goel, Long Phan, Gabriel Mukobi, Nathan Helm-Burger, et al

Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann- Kathrin Dombrowski, Shashwat Goel, Long Phan, Gabriel Mukobi, Nathan Helm-Burger, et al. The wmdp benchmark: Measuring and reducing malicious use with unlearning. InInternational Conference on Machine Learning (ICML), 2024

2024
[20]

Rouge: A package for automatic evaluation of summaries

Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. InAnnual Meeting of the Association for Computational Linguistics (ACL), 2004

2004
[21]

Varshney, Mohit Bansal, Sanmi Koyejo, and Yang Liu

Sijia Liu, Yuanshun Yao, Jinghan Jia, Stephen Casper, Nathalie Baracaldo, Peter Hase, Yuguang Yao, Chris Yuhao Liu, Xiaojun Xu, Hang Li, Kush R. Varshney, Mohit Bansal, Sanmi Koyejo, and Yang Liu. Rethinking machine unlearning for large language models. InNature Machine Intelligence (Nat. Mach. Intell.), 2024

2024
[22]

Lipton, and J

Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary C. Lipton, and J. Zico Kolter. Tofu: A task of fictitious unlearning for llms. InConference on Language Modeling (COLM, 2024

2024
[23]

Alternate preference optimization for unlearning factual knowledge in large language models.arXiv preprint arXiv:2409.13474, 2024

Anmol Mekala, Vineeth Dorna, Shreya Dubey, Abhishek Lalwani, David Koleczek, Mukund Rungta, Sadid Hasan, and Elita Lobo. Alternate preference optimization for unlearning factual knowledge in large language models.arXiv preprint arXiv:2409.13474, 2024

arXiv 2024
[24]

Locating and editing factual associations in gpt

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in gpt. InConference on Neural Information Processing Systems (NeurIPS), 2022

2022
[25]

Mass-editing memory in a transformer

Kevin Meng, Arnab Sen Sharma, Alex Andonian, Yonatan Belinkov, and David Bau. Mass-editing memory in a transformer. InInternational Conference on Learning Representations (ICLR), 2023

2023
[26]

Descent-to-delete: Gradient-based methods for machine unlearning

Seth Neel, Aaron Roth, and Saeed Sharifi-Malvajerdi. Descent-to-delete: Gradient-based methods for machine unlearning. InInternational Conference on Algorithmic Learning Theory (ALT), 2021

2021
[27]

In-context unlearning: Language models as few shot unlearners

Martin Pawelczyk, Seth Neel, and Himabindu Lakkaraju. In-context unlearning: Language models as few shot unlearners. InInternational Conference on Machine Learning (ICML), 2024

2024
[28]

Manning, and Chelsea Finn

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. InConference on Neural Information Processing Systems (NeurIPS), 2023

2023
[29]

Shen, Xinchi Qiu, Meghdad Kurmanji, Alex Iacob, Lorenzo Sani, Yihong Chen, Nicola Cancedda, and Nicholas D

William F. Shen, Xinchi Qiu, Meghdad Kurmanji, Alex Iacob, Lorenzo Sani, Yihong Chen, Nicola Cancedda, and Nicholas D. Lane. Llm unlearning via neural activation redirection. InConference on Neural Information Processing Systems (NeurIPS), 2025

2025
[30]

Smith, and Chiyuan Zhang

Weijia Shi, Jaechan Lee, Yangsibo Huang, Sadhika Malladi, Jieyu Zhao, Ari Holtzman, Daogao Liu, Luke Zettlemoyer, Noah A. Smith, and Chiyuan Zhang. Muse: Machine unlearning six-way evaluation for language models. InInternational Conference on Learning Representations (ICLR), 2025

2025
[31]

Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023

Pith/arXiv arXiv 2023
[32]

Rush, and Thomas Wolf

Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Clémentine Fourrier, Nathan Habib, Nathan Sarrazin, Omar Sanse- viero, Alexander M. Rush, and Thomas Wolf. Zephyr: Direct distillation of lm alignment. InConference on Language Modeling (COLM, 2024

2024
[33]

Weinberger

Qizhou Wang, Jin Peng Zhou, Zhanke Zhou, Saebyeol Shin, Bo Han, and Kilian Q. Weinberger. Rethinking llm unlearning objectives: A gradient perspective and go beyond. InInternational Conference on Learning Representations (ICLR), 2025

2025
[34]

Llm unlearning via loss adjustment with only forget data

Yaxuan Wang, Jiaheng Wei, Chris Yuhao Liu, Jinlong Pang, Quan Liu, Ankit Parag Shah, Yujia Bao, Yang Liu, and Wei Wei. Llm unlearning via loss adjustment with only forget data. InInternational Conference on Learning Representations (ICLR), 2024

2024
[35]

Gru: Mitigating the trade-off between unlearning and retention for llms

Yue Wang, Qizhou Wang, Feng Liu, Wei Huang, Yali Du, Xiaojiang Du, and Bo Han. Gru: Mitigating the trade-off between unlearning and retention for llms. InInternational Conference on Machine Learning (ICML), 2025. 11

2025
[36]

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Huggingface’s transformers: St...

2020
[37]

Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

Pith/arXiv arXiv 2025
[38]

Ce-u: Cross entropy unlearning.arXiv preprint arXiv:2503.01224, 2025

Bo Yang. Ce-u: Cross entropy unlearning.arXiv preprint arXiv:2503.01224, 2025

arXiv 2025
[39]

Exploring criteria of loss reweighting to enhance llm unlearning

Puning Yang, Qizhou Wang, Zhuo Huang, Tongliang Liu, Chengqi Zhang, and Bo Han. Exploring criteria of loss reweighting to enhance llm unlearning. InInternational Conference on Machine Learning (ICML), 2025

2025
[40]

A survey on large language model (llm) security and privacy: The good, the bad, and the ugly.High-Confidence Computing, 4(2): 100211, 2024

Yifan Yao, Jinhao Duan, Kaidi Xu, Yuanfang Cai, Zhibo Sun, and Yue Zhang. A survey on large language model (llm) security and privacy: The good, the bad, and the ugly.High-Confidence Computing, 4(2): 100211, 2024

2024
[41]

Large language model unlearning

Yuanshun Yao, Xiaojun Xu, and Yang Liu. Large language model unlearning. InConference on Neural Information Processing Systems (NeurIPS), 2024

2024
[42]

Right to be forgotten in the era of large language models: Implications, challenges, and solutions.AI and Ethics (AI Ethics), 2025

Dawen Zhang, Pamela Finckenberg-Broman, Thong Hoang, Shidong Pan, Zhenchang Xing, Mark Staples, and Xiwei Xu. Right to be forgotten in the era of large language models: Implications, challenges, and solutions.AI and Ethics (AI Ethics), 2025

2025
[43]

Negative preference optimization: From catastrophic collapse to effective unlearning

Ruiqi Zhang, Licong Lin, Yu Bai, and Song Mei. Negative preference optimization: From catastrophic collapse to effective unlearning. InConference on Language Modeling (COLM), 2024

2024
[44]

Geometric-disentangelment unlearning.arXiv preprint arXiv:2511.17100, 2026

Duo Zhou, Yuji Zhang, Tianxin Wei, Ruizhong Qiu, Ke Yang, Xiao Lin, Cheng Qian, Jingrui He, Hanghang Tong, Chengxiang Zhai, Heng Ji, and Huan Zhang. Geometric-disentangelment unlearning.arXiv preprint arXiv:2511.17100, 2026. 12

arXiv 2026

[1] [1]

Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

Pith/arXiv arXiv 2023

[2] [2]

Soft prompting for unlearning in large language models

Karuna Bhaila, Minh-Hao Van, and Xintao Wu. Soft prompting for unlearning in large language models. InNorth American Chapter of the Association for Computational Linguistics (NAACL), 2025

2025

[3] [3]

Choquette-Choo, Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, and Nicolas Papernot

Lucas Bourtoule, Varun Chandrasekaran, Christopher A. Choquette-Choo, Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, and Nicolas Papernot. Machine unlearning. InIEEE Symposium on Security and Privacy (IEEE S&P), 2021

2021

[4] [4]

Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al. Language models are few-shot learners. InConference on Neural Information Processing Systems (NeurIPS), 2020

2020

[5] [5]

Per-parameter task arithmetic for unlearning in large language models.arXiv preprint arXiv:2601.22030, 2026

Chengyi Cai, Zesheng Ye, Jiangchao Yao, Jianzhong Qi, Bo Han, Xiaolu Zhang, Feng Liu, and Jun Zhou. Per-parameter task arithmetic for unlearning in large language models.arXiv preprint arXiv:2601.22030, 2026

arXiv 2026

[6] [6]

Extracting training data from large language models

Nicholas Carlini, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-V oss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Úlfar Erlingsson, Alina Oprea, and Colin Raffel. Extracting training data from large language models. In30th USENIX Security Symposium (USENIX Security 21), pages 2633–2650. USENIX Association, 2021

2021

[7] [7]

Machine unlearning via task simplex arithmetic

Junhao Dong, Hao Zhu, Yifei Zhang, Xinghua Qu, Yew Soon Ong, and Piotr Koniusz. Machine unlearning via task simplex arithmetic. InConference on Neural Information Processing Systems (NeurIPS), 2025

2025

[8] [8]

Undial: Self- distillation with adjusted logits for robust unlearning in large language models

Yijiang River Dong, Hongzhou Lin, Mikhail Belkin, Ramon Huerta, and Ivan Vuli ´c. Undial: Self- distillation with adjusted logits for robust unlearning in large language models. InNorth American Chapter of the Association for Computational Linguistics (NAACL), 2025

2025

[9] [9]

Lipton, J

Vineeth Dorna, Anmol Mekala, Wenlong Zhao, Andrew McCallum, Zachary C. Lipton, J. Zico Kolter, and Pratyush Maini. Openunlearning: Accelerating llm unlearning via unified benchmarking of methods and metrics. InConference on Neural Information Processing Systems (NeurIPS), 2025

2025

[10] [10]

Constrained entropic unlearning: A primal-dual framework for large language models.arXiv preprint arXiv:2506.05314, 2025

Taha Entesari, Arman Hatami, Rinat Khaziev, Anil Ramakrishna, and Mahyar Fazlyab. Constrained entropic unlearning: A primal-dual framework for large language models.arXiv preprint arXiv:2506.05314, 2025

arXiv 2025

[11] [11]

Simplicity prevails: Rethinking negative preference optimization for llm unlearning

Chongyu Fan, Jiancheng Liu, Licong Lin, Jinghan Jia, Ruiqi Zhang, Song Mei, and Sijia Liu. Simplicity prevails: Rethinking negative preference optimization for llm unlearning. InConference on Neural Information Processing Systems (NeurIPS), 2025

2025

[12] [12]

Alphaedit: Null-space constrained knowledge editing for language models

Junfeng Fang, Houcheng Jiang, Kun Wang, Yunshan Ma, Shi Jie, Xiang Wang, Xiangnan He, and Tat seng Chua. Alphaedit: Null-space constrained knowledge editing for language models. InInternational Conference on Learning Representations (ICLR), 2025

2025

[13] [13]

The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al- Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

Pith/arXiv arXiv 2024

[14] [14]

Huang, Wenxuan Zhou, Fei Wang, Fred Morstatter, Sheng Zhang, Hoifung Poon, and Muhao Chen

James Y . Huang, Wenxuan Zhou, Fei Wang, Fred Morstatter, Sheng Zhang, Hoifung Poon, and Muhao Chen. Offset unlearning for large language models.Transactions on Machine Learning Research, 2025

2025

[15] [15]

Eco: Efficient computational optimization for exact machine unlearning in deep neural networks

Yu-Ting Huang, Pei-Yuan Wu, and Chuan-Ju Wang. Eco: Efficient computational optimization for exact machine unlearning in deep neural networks. InInternational Conference on Machine Learning Workshop (ICML Workshop), 2024

2024

[16] [16]

Editing models with task arithmetic

Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. InInternational Conference on Learning Representations (ICLR), 2023

2023

[17] [17]

Knowledge unlearning for mitigating privacy risks in language models

Joel Jang, Dongkeun Yoon, Sohee Yang, Sungmin Cha, Moontae Lee, Lajanugen Logeswaran, and Minjoon Seo. Knowledge unlearning for mitigating privacy risks in language models. InAnnual Meeting of the Association for Computational Linguistics (ACL), 2023. 10

2023

[18] [18]

Negmerge: Sign-consensual weight merging for machine unlearning

Hyo Seo Kim, Dongyoon Han, and Junsuk Choe. Negmerge: Sign-consensual weight merging for machine unlearning. InInternational Conference on Machine Learning (ICML), 2025

2025

[19] [19]

Li, Ann- Kathrin Dombrowski, Shashwat Goel, Long Phan, Gabriel Mukobi, Nathan Helm-Burger, et al

Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann- Kathrin Dombrowski, Shashwat Goel, Long Phan, Gabriel Mukobi, Nathan Helm-Burger, et al. The wmdp benchmark: Measuring and reducing malicious use with unlearning. InInternational Conference on Machine Learning (ICML), 2024

2024

[20] [20]

Rouge: A package for automatic evaluation of summaries

Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. InAnnual Meeting of the Association for Computational Linguistics (ACL), 2004

2004

[21] [21]

Varshney, Mohit Bansal, Sanmi Koyejo, and Yang Liu

Sijia Liu, Yuanshun Yao, Jinghan Jia, Stephen Casper, Nathalie Baracaldo, Peter Hase, Yuguang Yao, Chris Yuhao Liu, Xiaojun Xu, Hang Li, Kush R. Varshney, Mohit Bansal, Sanmi Koyejo, and Yang Liu. Rethinking machine unlearning for large language models. InNature Machine Intelligence (Nat. Mach. Intell.), 2024

2024

[22] [22]

Lipton, and J

Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary C. Lipton, and J. Zico Kolter. Tofu: A task of fictitious unlearning for llms. InConference on Language Modeling (COLM, 2024

2024

[23] [23]

Alternate preference optimization for unlearning factual knowledge in large language models.arXiv preprint arXiv:2409.13474, 2024

Anmol Mekala, Vineeth Dorna, Shreya Dubey, Abhishek Lalwani, David Koleczek, Mukund Rungta, Sadid Hasan, and Elita Lobo. Alternate preference optimization for unlearning factual knowledge in large language models.arXiv preprint arXiv:2409.13474, 2024

arXiv 2024

[24] [24]

Locating and editing factual associations in gpt

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in gpt. InConference on Neural Information Processing Systems (NeurIPS), 2022

2022

[25] [25]

Mass-editing memory in a transformer

Kevin Meng, Arnab Sen Sharma, Alex Andonian, Yonatan Belinkov, and David Bau. Mass-editing memory in a transformer. InInternational Conference on Learning Representations (ICLR), 2023

2023

[26] [26]

Descent-to-delete: Gradient-based methods for machine unlearning

Seth Neel, Aaron Roth, and Saeed Sharifi-Malvajerdi. Descent-to-delete: Gradient-based methods for machine unlearning. InInternational Conference on Algorithmic Learning Theory (ALT), 2021

2021

[27] [27]

In-context unlearning: Language models as few shot unlearners

Martin Pawelczyk, Seth Neel, and Himabindu Lakkaraju. In-context unlearning: Language models as few shot unlearners. InInternational Conference on Machine Learning (ICML), 2024

2024

[28] [28]

Manning, and Chelsea Finn

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. InConference on Neural Information Processing Systems (NeurIPS), 2023

2023

[29] [29]

Shen, Xinchi Qiu, Meghdad Kurmanji, Alex Iacob, Lorenzo Sani, Yihong Chen, Nicola Cancedda, and Nicholas D

William F. Shen, Xinchi Qiu, Meghdad Kurmanji, Alex Iacob, Lorenzo Sani, Yihong Chen, Nicola Cancedda, and Nicholas D. Lane. Llm unlearning via neural activation redirection. InConference on Neural Information Processing Systems (NeurIPS), 2025

2025

[30] [30]

Smith, and Chiyuan Zhang

Weijia Shi, Jaechan Lee, Yangsibo Huang, Sadhika Malladi, Jieyu Zhao, Ari Holtzman, Daogao Liu, Luke Zettlemoyer, Noah A. Smith, and Chiyuan Zhang. Muse: Machine unlearning six-way evaluation for language models. InInternational Conference on Learning Representations (ICLR), 2025

2025

[31] [31]

Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023

Pith/arXiv arXiv 2023

[32] [32]

Rush, and Thomas Wolf

Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Clémentine Fourrier, Nathan Habib, Nathan Sarrazin, Omar Sanse- viero, Alexander M. Rush, and Thomas Wolf. Zephyr: Direct distillation of lm alignment. InConference on Language Modeling (COLM, 2024

2024

[33] [33]

Weinberger

Qizhou Wang, Jin Peng Zhou, Zhanke Zhou, Saebyeol Shin, Bo Han, and Kilian Q. Weinberger. Rethinking llm unlearning objectives: A gradient perspective and go beyond. InInternational Conference on Learning Representations (ICLR), 2025

2025

[34] [34]

Llm unlearning via loss adjustment with only forget data

Yaxuan Wang, Jiaheng Wei, Chris Yuhao Liu, Jinlong Pang, Quan Liu, Ankit Parag Shah, Yujia Bao, Yang Liu, and Wei Wei. Llm unlearning via loss adjustment with only forget data. InInternational Conference on Learning Representations (ICLR), 2024

2024

[35] [35]

Gru: Mitigating the trade-off between unlearning and retention for llms

Yue Wang, Qizhou Wang, Feng Liu, Wei Huang, Yali Du, Xiaojiang Du, and Bo Han. Gru: Mitigating the trade-off between unlearning and retention for llms. InInternational Conference on Machine Learning (ICML), 2025. 11

2025

[36] [36]

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Huggingface’s transformers: St...

2020

[37] [37]

Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

Pith/arXiv arXiv 2025

[38] [38]

Ce-u: Cross entropy unlearning.arXiv preprint arXiv:2503.01224, 2025

Bo Yang. Ce-u: Cross entropy unlearning.arXiv preprint arXiv:2503.01224, 2025

arXiv 2025

[39] [39]

Exploring criteria of loss reweighting to enhance llm unlearning

Puning Yang, Qizhou Wang, Zhuo Huang, Tongliang Liu, Chengqi Zhang, and Bo Han. Exploring criteria of loss reweighting to enhance llm unlearning. InInternational Conference on Machine Learning (ICML), 2025

2025

[40] [40]

A survey on large language model (llm) security and privacy: The good, the bad, and the ugly.High-Confidence Computing, 4(2): 100211, 2024

Yifan Yao, Jinhao Duan, Kaidi Xu, Yuanfang Cai, Zhibo Sun, and Yue Zhang. A survey on large language model (llm) security and privacy: The good, the bad, and the ugly.High-Confidence Computing, 4(2): 100211, 2024

2024

[41] [41]

Large language model unlearning

Yuanshun Yao, Xiaojun Xu, and Yang Liu. Large language model unlearning. InConference on Neural Information Processing Systems (NeurIPS), 2024

2024

[42] [42]

Right to be forgotten in the era of large language models: Implications, challenges, and solutions.AI and Ethics (AI Ethics), 2025

Dawen Zhang, Pamela Finckenberg-Broman, Thong Hoang, Shidong Pan, Zhenchang Xing, Mark Staples, and Xiwei Xu. Right to be forgotten in the era of large language models: Implications, challenges, and solutions.AI and Ethics (AI Ethics), 2025

2025

[43] [43]

Negative preference optimization: From catastrophic collapse to effective unlearning

Ruiqi Zhang, Licong Lin, Yu Bai, and Song Mei. Negative preference optimization: From catastrophic collapse to effective unlearning. InConference on Language Modeling (COLM), 2024

2024

[44] [44]

Geometric-disentangelment unlearning.arXiv preprint arXiv:2511.17100, 2026

Duo Zhou, Yuji Zhang, Tianxin Wei, Ruizhong Qiu, Ke Yang, Xiao Lin, Cheng Qian, Jingrui He, Hanghang Tong, Chengxiang Zhai, Heng Ji, and Huan Zhang. Geometric-disentangelment unlearning.arXiv preprint arXiv:2511.17100, 2026. 12

arXiv 2026