pith. sign in

arxiv: 2501.19202 · v6 · submitted 2025-01-31 · 💻 cs.CL

Improving LLM Unlearning Robustness via Random Perturbations

Pith reviewed 2026-05-23 04:39 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM unlearningmodel robustnessbackdoor attacksforget tokensrandom noise augmentationmachine unlearningadversarial robustness
0
0 comments X

The pith

LLM unlearning methods create backdoor vulnerabilities by aligning forget-tokens with target representations during forgetting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that current unlearning techniques in large language models reduce robustness, so that even one non-adversarial forget-token inside a retain-query triggers misbehavior. It develops a theoretical account that treats the forgetting step as an inadvertent backdoor attack in which forget-tokens become triggers paired with target labels. The retaining step is then reinterpreted as a defense opportunity, leading to the proposal of Random Noise Augmentation, a lightweight perturbation method that carries theoretical robustness guarantees. Experiments show the augmentation restores robustness while leaving forget and retain performance intact. The resulting attack-defense framing supplies a mechanistic explanation for why unlearning tends to hide rather than erase target knowledge.

Core claim

The forgetting process in LLM unlearning inadvertently learns to align forget-tokens with target-representations, converting those tokens into backdoor triggers that, when present in retain-queries, disrupt the unlearned model's behavior; Random Noise Augmentation re-casts the retaining process as backdoor defense and supplies a model- and method-agnostic mitigation with theoretical support.

What carries the argument

The backdoor attack-defense reframing of unlearning, in which forget-tokens function as triggers paired with target labels and random noise augmentation serves as the corresponding defense during retention.

If this is right

  • Unlearned models become susceptible to non-adversarial forget-tokens placed inside otherwise normal retain-queries.
  • Forget-tokens function as backdoor triggers that activate target-representations and produce incorrect outputs.
  • Random Noise Augmentation during retention measurably raises robustness without degrading forget or retain accuracy.
  • Unlearning tends to hide target knowledge behind trigger alignments rather than remove it.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same trigger-alignment dynamic may appear in unlearning settings outside language models, such as vision or tabular data.
  • Future unlearning algorithms could be designed to avoid token-label pairing altogether rather than defend against it afterward.
  • Evaluating unlearning success may require new tests that deliberately insert forget-tokens into retain data.

Load-bearing premise

The observed fragility arises specifically because the forgetting process learns an alignment between forget-tokens and target representations rather than from unrelated training dynamics.

What would settle it

An experiment that performs unlearning on a model while preventing any statistical alignment between forget-tokens and target outputs, then measures whether single forget-tokens in retain-queries still cause misbehavior.

Figures

Figures reproduced from arXiv: 2501.19202 by Anh Bui, Dang Huu-Tien, Hoang Thanh-Tung, Le-Minh Nguyen, Minh-Phuong Nguyen, Naoya Inoue.

Figure 1
Figure 1. Figure 1: Left-most: Accuracy of RM and RM w/ RNA models on MMLU and perturbed MMLU (MMLU QA contains forget-tokens; see Appendix A.2 for details). Left-mid: Accuracy of PO and PO w/ RNA models on MMLU and perturbed MMLU. Right-mid: Accuracy of all unlearned models on WMDP and perturbed WMDP. Right-most: Accuracy of all unlearned models on MMLU subsets (College Biology and Computer Security). Original RM models are … view at source ↗
Figure 2
Figure 2. Figure 2: Accuracy of RM models on perturbed MMLU across values of coefficient [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Accuracy of RNA models measured on perturbed MMLU Q&A and WMDP (avg. of [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 6
Figure 6. Figure 6: A sample zero-shot QA prompt. A random incorrect answer ( [PITH_FULL_IMAGE:figures/full_fig_p017_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Left: noise sensitivity of layer g = 8 for the base model, PO models, and RM models. Right: Layer-wise noise sensitivity across all layers for the base model, PO models, and RM models. difficulty and data difficulty. From the model perspective, noise sensitivity can help characterize the unlearning difficulty of specific components—such as an intermediate layer (as described in Eqn. 35), a group of layers,… view at source ↗
Figure 8
Figure 8. Figure 8: Distribution of loss changes for forget-samples in the WMDP-Biology and WMDP-Cyber [PITH_FULL_IMAGE:figures/full_fig_p021_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Distributions of changes in maximum activations (MaxAct) of generated tokens in unlearned [PITH_FULL_IMAGE:figures/full_fig_p022_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Accuracy of un￾learned models on perturbed MMLU with respect to bi-gram similarity perturbations [PITH_FULL_IMAGE:figures/full_fig_p023_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Per-layer injection: accuracy of RNA models on MMLU, perturbed MMLU and WMDP (avg. of Biology and Cyber) across different perturbed layers [PITH_FULL_IMAGE:figures/full_fig_p025_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Region-specific layer injection: accuracy of RNA models on MMLU, perturbed MMLU and WMDP (avg. of Biology and Cyber) w.r.t early layers (5, 6, 7), middle layers (14, 15, 16), and later layers (28, 29, 30). bilities or increase susceptibility to other attacks (Tramer & Boneh, 2019; Weng et al., 2020; Kamath et al., 2021) on the general capabilities. In this section, we present an analysis of whether RNA ma… view at source ↗
Figure 13
Figure 13. Figure 13: Full-layer injection: accuracy of RNA models on MMLU, perturbed MMLU and WMDP (avg. of Biology and Cyber). increase of loss to construct adversarial prompts. We utilize optimal checkpoints from the main setting, as detailed in Subsection A.4. Empirical effects of unlearning against attacks. We report the accuracy and ROUGE-L score under attack of the original unlearned model and RNA models in [PITH_FULL_… view at source ↗
read the original abstract

Here, we show that current LLM unlearning methods inherently reduce models' robustness, causing them to misbehave even when a single non-adversarial forget-token is present in the retain-query. Toward understanding underlying causes, we propose a novel theoretical framework that reframes the unlearning process as a backdoor attack and defense problem: we formulate how the forgetting process inadvertently learns to align forget-tokens (backdoor triggers) with the target-representations (target labels). As a result, forget-tokens act as backdoor triggers that, when activated in retain-queries, cause disruptions in unlearned models' behaviors, similar to successful backdoor attacks. The sense that, LLM unlearning methods themselves poison the model, make it more vulnerable to forget-tokens, and hide rather than erase target knowledge, describes their true mechanism. To mitigate the vulnerability caused by the forgetting process, we reinterpret the retaining process as a backdoor defense and propose Random Noise Augmentation (RNA), a lightweight, model and method-agnostic approach with theoretical guarantees for improving the robustness of unlearned models. Extensive experiments demonstrate that RNA significantly improves the robustness of unlearned models while preserving forget and retain performances. This backdoor attack-defense framework offers insights into the mechanism of unlearning that can shed light on future research directions for improving unlearning robustness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript argues that standard LLM unlearning methods inherently degrade robustness, so that a single non-adversarial forget-token in a retain query triggers misbehavior. It reframes the unlearning process as inadvertently installing a backdoor in which forget-tokens become triggers aligned with target representations, proposes Random Noise Augmentation (RNA) as a model- and method-agnostic defense that reinterprets retention as backdoor defense, supplies theoretical guarantees for RNA, and reports that extensive experiments show RNA improves robustness while preserving forget and retain performance.

Significance. If the backdoor causal account is substantiated and the RNA guarantees hold without hidden fitted parameters, the work would supply both a mechanistic lens on unlearning fragility and a lightweight, plug-in robustness fix. The explicit claim of theoretical guarantees is a potential strength worth verifying.

major comments (2)
  1. [Abstract] Abstract: the central claim that observed fragility on retain queries containing a single forget-token is caused specifically by inadvertent backdoor alignment (rather than capacity reduction, distribution shift from the unlearning objective, or retain-set sampling artifacts) is presented without any described control experiment or ablation that isolates the alignment effect; this interpretive step is load-bearing for both the framework and the motivation for RNA.
  2. [Abstract] Abstract: no quantitative results, error bars, baseline numbers, or derivation details for the claimed theoretical guarantees are supplied, so the soundness of the RNA guarantees and the experimental support for the backdoor account cannot be assessed from the provided text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the detailed review. We address each major comment below and outline the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that observed fragility on retain queries containing a single forget-token is caused specifically by inadvertent backdoor alignment (rather than capacity reduction, distribution shift from the unlearning objective, or retain-set sampling artifacts) is presented without any described control experiment or ablation that isolates the alignment effect; this interpretive step is load-bearing for both the framework and the motivation for RNA.

    Authors: We agree that isolating the backdoor alignment effect from alternatives such as capacity reduction or distribution shift is important for substantiating the framework. Section 3 formally derives how the unlearning objective produces this alignment. We will add new control experiments and ablations in the revised manuscript to empirically distinguish the alignment mechanism from the other factors mentioned. revision: yes

  2. Referee: [Abstract] Abstract: no quantitative results, error bars, baseline numbers, or derivation details for the claimed theoretical guarantees are supplied, so the soundness of the RNA guarantees and the experimental support for the backdoor account cannot be assessed from the provided text.

    Authors: The abstract is a high-level summary. Quantitative results with error bars (standard deviations across runs), baseline comparisons, and performance metrics appear in Sections 5–6. The derivation of the RNA theoretical guarantees is given in Section 4 with full proofs in the appendix. We will add a concise statement of key quantitative findings to the abstract in revision, subject to length constraints. revision: partial

Circularity Check

0 steps flagged

No circularity: framework and method presented as independent proposals without reduction to fitted inputs or self-citations

full rationale

The paper proposes a reframing of unlearning as a backdoor attack-defense problem and introduces RNA as a mitigation with claimed theoretical guarantees. No equations, derivations, or load-bearing claims in the abstract or described structure reduce the robustness improvement or the backdoor interpretation to a parameter already tuned on the target metric, a self-citation chain, or an ansatz smuggled from prior work. The central interpretive step is offered as a novel lens rather than a mathematical necessity derived from the paper's own fitted quantities, and experiments are positioned as external validation. This meets the default expectation of a self-contained derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The work relies on standard machine-learning assumptions about token representations and training dynamics plus the novel interpretive mapping of forget-tokens to backdoor triggers. No explicit numerical free parameters are stated in the abstract.

axioms (1)
  • domain assumption Standard assumptions in machine learning about how token-level signals influence model representations during fine-tuning.
    The backdoor framing presupposes that alignment of specific tokens with altered internal states occurs in the same manner as in conventional backdoor attacks.
invented entities (1)
  • forget-token as backdoor trigger no independent evidence
    purpose: To explain why unlearned models misbehave on retain queries containing those tokens.
    This mapping is introduced by the authors as the core of their theoretical framework; no independent falsifiable prediction outside the paper is supplied in the abstract.

pith-pipeline@v0.9.0 · 5784 in / 1311 out tokens · 28928 ms · 2026-05-23T04:39:50.315227+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · 9 internal anchors

  1. [1]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback.arXiv preprint arXiv:2204.05862,

  2. [2]

    Open problems in machine unlearning for ai safety

    Fazl Barez, Tingchen Fu, Ameya Prabhu, Stephen Casper, Amartya Sanyal, Adel Bibi, Aidan O’Gara, Robert Kirk, Ben Bucknall, Tim Fist, et al. Open problems in machine unlearning for ai safety. arXiv preprint arXiv:2501.04952,

  3. [3]

    Minseok Choi, Kyunghyun Min, and Jaegul Choo

    doi: 10.1109/SP.2015.35. Minseok Choi, Kyunghyun Min, and Jaegul Choo. Cross-lingual unlearning of selective knowledge in multilingual language models. InFindings of the Association for Computational Linguistics: EMNLP 2024, pp. 10732–10747,

  4. [4]

    BoolQ: Exploring the surprising difficulty of natural yes/no questions

    Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Jill Burstein, Christy Doran, and Thamar Solorio (eds.),Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human ...

  5. [5]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Association for Computational Linguistics. doi: 10.18653/v1/N19-1300. URL https://aclanthology.org/N19-1300/. Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457,

  6. [6]

    Extracting memorized pieces of (copyrighted) books from open-weight language models

    A Feder Cooper, Aaron Gokaslan, Amy B Cyphert, Christopher De Sa, Mark A Lemley, Daniel E Ho, and Percy Liang. Extracting memorized pieces of (copyrighted) books from open-weight language models.arXiv preprint arXiv:2505.12546,

  7. [7]

    Do unlearning methods remove information from language model weights?arXiv preprint arXiv:2410.08827,

    Aghyad Deeb and Fabien Roger. Do unlearning methods remove information from language model weights?arXiv preprint arXiv:2410.08827,

  8. [8]

    Does unlearning truly unlearn? a black box evaluation of llm unlearning methods.arXiv preprint arXiv:2411.12103,

    Jai Doshi and Asa Cooper Stickland. Does unlearning truly unlearn? a black box evaluation of llm unlearning methods.arXiv preprint arXiv:2411.12103,

  9. [9]

    The Llama 3 Herd of Models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783,

  10. [10]

    Who’s harry potter? approximate unlearning in llms.arXiv preprint arXiv:2310.02238,

    10 Preprint Ronen Eldan and Mark Russinovich. Who’s harry potter? approximate unlearning in llms.arXiv preprint arXiv:2310.02238,

  11. [11]

    Simplicity prevails: Rethinking negative preference optimization for LLM unlearning

    Chongyu Fan, Jiancheng Liu, Licong Lin, Jinghan Jia, Ruiqi Zhang, Song Mei, and Sijia Liu. Simplicity prevails: Rethinking negative preference optimization for LLM unlearning. InNeurips Safe Generative AI Workshop 2024,

  12. [12]

    Richard Fang, Rohan Bindu, Akul Gupta, Qiusi Zhan, and Daniel Kang

    URL https: //openreview.net/forum?id=zZjLv6F0Ks. Richard Fang, Rohan Bindu, Akul Gupta, Qiusi Zhan, and Daniel Kang. Llm agents can autonomously hack websites.arXiv preprint arXiv:2402.06664,

  13. [13]

    Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece Kamar

    URLhttps://zenodo.org/records/12608602. Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece Kamar. ToxiGen: A large-scale machine-generated dataset for adversarial and implicit hate speech detec- tion. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (eds.),Proceedings of the 60th Annual Meeting of the Associati...

  14. [14]

    doi: 10.18653/ v1/2022.acl-long.234

    Association for Computational Linguistics. doi: 10.18653/ v1/2022.acl-long.234. URLhttps://aclanthology.org/2022.acl-long.234/. Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding.Proceedings of the International Conference on Learning Representations (ICLR),

  15. [15]

    Mistral 7B

    Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b.arXiv preprint arXiv:2310.06825,

  16. [16]

    Antonia Karamolegkou, Jiaang Li, Li Zhou, and Anders Søgaard

    URL https://openreview.net/forum?id= k9iBo3RmCFd. Antonia Karamolegkou, Jiaang Li, Li Zhou, and Anders Søgaard. Copyright violations and large language models. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.),Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 7403–7412, Singapore, December

  17. [17]

    doi: 10.18653/v1/2023.emnlp-main

    Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main

  18. [18]

    Pang Wei Koh and Percy Liang

    URLhttps://aclanthology.org/2023.emnlp-main.458/. Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In International conference on machine learning, pp. 1885–1894. PMLR,

  19. [19]

    Exact unlearning of finetuning data via model merging at scale.arXiv preprint arXiv:2504.04626,

    Kevin Kuo, Amrith Setlur, Kartik Srinivas, Aditi Raghunathan, and Virginia Smith. Exact unlearning of finetuning data via model merging at scale.arXiv preprint arXiv:2504.04626,

  20. [20]

    TextBugger: Generating Adversarial Text Against Real-world Applications

    Jinfeng Li, Shouling Ji, Tianyu Du, Bo Li, and Ting Wang. Textbugger: Generating adversarial text against real-world applications.arXiv preprint arXiv:1812.05271,

  21. [21]

    doi: 10.18653/v1/2022.acl-long.229

    Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.229. URL https://aclanthology.org/2022.acl-long. 229/. Michelle Lo, Fazl Barez, and Shay Cohen. Large language models relearn removed concepts. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.),Findings of the Association for Computational Linguistics: ACL 2024, pp. 8306–8323,...

  22. [22]

    doi: 10.18653/v1/2024.findings-acl.492

    Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-acl.492. URL https://aclanthology.org/2024.findings-acl.492/. Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InInternational Confer- ence on Learning Representations,

  23. [23]

    URL https://openreview.net/forum?id= J5IRyTKZ9s

    ISSN 2835-8856. URL https://openreview.net/forum?id= J5IRyTKZ9s. Aengus Lynch, Phillip Guo, Aidan Ewart, Stephen Casper, and Dylan Hadfield-Menell. Eight methods to evaluate robust unlearning in llms.arXiv preprint arXiv:2402.16835,

  24. [24]

    Pointer Sentinel Mixture Models

    URL https://openreview.net/forum?id=B41hNBoWLo. 12 Preprint Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models.arXiv preprint arXiv:1609.07843,

  25. [25]

    Thanh Tam Nguyen, Thanh Trung Huynh, Zhao Ren, Phi Le Nguyen, Alan Wee-Chung Liew, Hongzhi Yin, and Quoc Viet Hung Nguyen

    URL https://openreview.net/ forum?id=vjel3nWP2a. Thanh Tam Nguyen, Thanh Trung Huynh, Zhao Ren, Phi Le Nguyen, Alan Wee-Chung Liew, Hongzhi Yin, and Quoc Viet Hung Nguyen. A survey of machine unlearning.arXiv preprint arXiv:2209.02299,

  26. [26]

    Nils Reimers and Iryna Gurevych

    URL https: //openreview.net/forum?id=HPuSIXJaa9. Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3982–3992,

  27. [27]

    Artificial intelligence and biological misuse: Differentiating risks of language models and biological design tools.arXiv preprint arXiv:2306.13952,

    Jonas B Sandbrink. Artificial intelligence and biological misuse: Differentiating risks of language models and biological design tools.arXiv preprint arXiv:2306.13952,

  28. [28]

    Latent adversarial training improves robustness to persistent harmful behaviors in llms.arXiv preprint arXiv:2407.15549,

    Abhay Sheshadri, Aidan Ewart, Phillip Guo, Aengus Lynch, Cindy Wu, Vivek Hebbar, Henry Sleight, Asa Cooper Stickland, Ethan Perez, Dylan Hadfield-Menell, et al. Latent adversarial training improves robustness to persistent harmful behaviors in llms.arXiv preprint arXiv:2407.15549,

  29. [29]

    13 Preprint Ilia Shumailov, Jamie Hayes, Eleni Triantafillou, Guillermo Ortiz-Jimenez, Nicolas Papernot, Matthew Jagielski, Itay Yona, Heidi Howard, and Eugene Bagdasaryan

    URLhttps://openreview.net/forum?id=TArmA033BU. 13 Preprint Ilia Shumailov, Jamie Hayes, Eleni Triantafillou, Guillermo Ortiz-Jimenez, Nicolas Papernot, Matthew Jagielski, Itay Yona, Heidi Howard, and Eugene Bagdasaryan. Ununlearning: Unlearning is not sufficient for content regulation in advanced generative ai.arXiv preprint arXiv:2407.00106,

  30. [30]

    CommonsenseQA: A question answering challenge targeting commonsense knowledge

    Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Jill Burstein, Christy Doran, and Thamar Solorio (eds.),Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume ...

  31. [31]

    doi: 10.18653/v1/N19-1421

    Association for Computational Linguistics. doi: 10.18653/v1/N19-1421. URL https://aclanthology. org/N19-1421/. Rishub Tamirisa, Bhrugu Bharathi, Andy Zhou, Bo Li, and Mantas Mazeika. Toward robust unlearning for LLMs. InICLR 2024 Workshop on Secure and Trustworthy Large Language Models,

  32. [32]

    Pratiksha Thaker, Yash Maurya, Shengyuan Hu, Zhiwei Steven Wu, and Virginia Smith

    URL https://openreview.net/forum? id=4FIjRodbW6. Pratiksha Thaker, Yash Maurya, Shengyuan Hu, Zhiwei Steven Wu, and Virginia Smith. Guardrail baselines for unlearning in llms.arXiv preprint arXiv:2403.03329,

  33. [33]

    Jiaxin Wen, Pei Ke, Hao Sun, Zhexin Zhang, Chengfei Li, Jinfeng Bai, and Minlie Huang

    URLhttps://openreview.net/forum?id=ar8aRMrmod. Jiaxin Wen, Pei Ke, Hao Sun, Zhexin Zhang, Chengfei Li, Jinfeng Bai, and Minlie Huang. Unveil- ing the implicit toxicity in large language models. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.),Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 1322–1338, Singapor...

  34. [34]

    doi: 10.18653/v1/2023.emnlp-main.84

    Association for Computational Linguis- tics. doi: 10.18653/v1/2023.emnlp-main.84. URL https://aclanthology.org/2023. emnlp-main.84/. 14 Preprint Cheng-Hsin Weng, Yan-Ting Lee, and Shan-Hung (Brandon) Wu. On the trade-off between adversar- ial and backdoor robustness. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.),Advances in Neur...

  35. [35]

    Xiaoyu Wu, Yifei Pang, Terrance Liu, and Zhiwei Steven Wu

    URL https://proceedings.neurips.cc/paper_files/ paper/2020/file/8b4066554730ddfaa0266346bdc1b202-Paper.pdf. Xiaoyu Wu, Yifei Pang, Terrance Liu, and Zhiwei Steven Wu. Breaking the gold standard: Extracting forgotten data under exact unlearning in large language models.arXiv preprint arXiv:2505.24379,

  36. [36]

    doi: 10.1145/3603620

    ISSN 0360-0300. doi: 10.1145/3603620. URL https://doi.org/10.1145/3603620. Hongbang Yuan, Zhuoran Jin, Pengfei Cao, Yubo Chen, Kang Liu, and Jun Zhao. Towards robust knowledge unlearning: An adversarial framework for assessing and improving unlearning robust- ness in large language models. InProceedings of the AAAI Conference on Artificial Intelligence, v...

  37. [37]

    doi: 10.18653/v1/P19-1472

    Association for Computational Linguistics. doi: 10.18653/v1/P19-1472. URLhttps://aclanthology.org/P19-1472/. Ruiqi Zhang, Licong Lin, Yu Bai, and Song Mei. Negative preference optimization: From catastrophic collapse to effective unlearning. InFirst Conference on Language Modeling,

  38. [38]

    Fine-Tuning Language Models from Human Preferences

    Association for Computational Linguistics. ISBN 979-8-89176-251-0. doi: 10.18653/v1/2025.acl-long.424. URL https://aclanthology. org/2025.acl-long.424/. Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences.arXiv preprint arXiv:1909.08593,

  39. [39]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043,

  40. [40]

    Following Li et al

    is a language modeling dataset consisting of over 100 milion tokens extracted from Wikipedia. Following Li et al. (2024), we specifically use the WIKITEXT- 2-RAW-V1 test split as the retain-set for fine-tuning. The dataset is publicly available at https: //huggingface.co/datasets/Salesforce/wikitext. MMLU(Hendrycks et al.,

  41. [41]

    I Don’t Know

    are two sub-categories in MMLU, corresponding to topics closely related to the WMDP Biology and WMDP Cyber forget-sets. They are used to evaluate the unlearned model’s ability to retain relevant knowledge in areas related to the forget-sets. “I Don’t Know” dataset.We employ a set of 100 refusal responses as the preference answers for DPO+KL and DPO+MSE. F...

  42. [42]

    SARS-CoV-19,

    for evaluation. Each query is formulated as a default zero-shot QA prompt (Figure 6). Following the setting of prior work (Thaker et al., 2025), we randomly replace anincorrectanswer in the retain QA dataset with the forget keyword “SARS-CoV-19,” while leaving the correct answer unchanged. Since the forget keyword is unrelated to the retain-queries, this ...

  43. [43]

    Following previous works (Li et al., 2024), we update three layers of parameters {l, l−1, l−2} of the model for memory efficiency

    for T= 500 update steps, learning rate is5e−5 , batch size of 4, max sequence length is500 with WMDP-Biology 17 Preprint and 768 for WMDP-Cyber. Following previous works (Li et al., 2024), we update three layers of parameters {l, l−1, l−2} of the model for memory efficiency. For the original RM methods, we set the retain weight αbiology = 1200 and αcyber ...

  44. [44]

    Specifically, we setβ= 0.1 for all PO methods, and γ= 0 for both SimNPO+KL and SimNPO+MSE

    For the original PO methods, we adopt the default hyperparameters used in previous works (Yuan et al., 2025b; Fan et al., 2024). Specifically, we setβ= 0.1 for all PO methods, and γ= 0 for both SimNPO+KL and SimNPO+MSE. For the retain weights, we perform a grid search over combinations of (αbiology, αcyber), where αbiology, αcyber ∈ {5,10,20,30,40,50,100}...

  45. [45]

    The probability that the RNA model rejects the effect induced by noiseϵis: P ∆J rna ∆J u ≤0 ≈P (gper)⊤δ1 −g ⊤δ2 g⊤ϵ ≤ −1 (31) 19 Preprint The ratio of two random normally distributed variables (gper)⊤δ1−g⊤δ2 g⊤ϵ follows a Cauchy distribution with location parameter x0 = 0 and scale parameter γ= q ν η 1 + ||gper|| ||g|| . The cumulative distribution functi...

  46. [46]

    Which forget-tokens when appearing in the retain-query can cause the unlearned model to misbehave?

    We then compute the difference in maximum activation (MaxAct) across latent dimensions at layer l= 7 of each generated token, conditioned on perturbed and original MMLU retain-queries. As shown in Figure 9, the differences in MaxActs exhibit a Gaussian-like distribution. These results provide supporting empirical evidence for Assumption 4.1. 22 Preprint D...

  47. [47]

    The most pronounced gains are observed when RNA is applied with MSE retain-losses

    We observe that RNA consistently improves the robustness of unlearned models across all n-gram perturbations. The most pronounced gains are observed when RNA is applied with MSE retain-losses. Specifically, for DPO+MSE, performance improvements are +24.6 (2-gram), +29.5 (4-gram), +24.9 (8-gram), and +19.6 (16-gram); for SimNPO+MSE, gains are +25.4, +23.3,...

  48. [48]

    In this section, we present an analysis of whether RNA makes the model become more susceptible to other adversarial attacks

    on the general capabilities. In this section, we present an analysis of whether RNA makes the model become more susceptible to other adversarial attacks. Setup.We employ four widely used adversarial attack methods to evaluate the side effects of RNA, including Greedy Coordinate Gradient (GCG; Zou et al. (2023)), TextBugger (Li et al., 2018), DeepWordBug (...

  49. [49]

    (multiple-choice QA) and ToxiGen (Hartvigsen et al., 2022), commonsense reasoning on WinoGrande (Sakaguchi et al.,

  50. [50]

    and CommonsenseQA (Talmor et al., 2019), natural language inference/completion on HellaSwag (Zellers et al., 2019), science reasoning on ARC (Clark et al.,

  51. [51]

    Performance of unlearning methods on these tasks is shown in Table

    (easy and challenge), and factuality on BoolQ (Clark et al., 2019). Performance of unlearning methods on these tasks is shown in Table

  52. [52]

    and Mistral-7B Jiang et al. (2023) models on two 26 Preprint Methods GCG TextBugger DeepWordBug TextFooler AuA ROUGE-L AuA ROUGE-L AuA ROUGE-L AuA ROUGE-L Base Original40.3—33.6—39.6—52.9— Representation Misdirection RMU Original33.6 63.0 30.5 81.2 38.2 76.8 50.5 85.2 w/ RNA40.3+6.760.9−2.130.5+0.079.1−1.138.9+0.776.4−0.450.1−0.484.2−1.0 Adap. RMU Origina...

  53. [53]

    Specifically, TOFU is designed to remove the influence of specific data points, making it less applicable in generative settings

    and TOFU (Maini et al., 2024), but these are less suitable for our experimental setup. Specifically, TOFU is designed to remove the influence of specific data points, making it less applicable in generative settings. While MUSE could be suitable, previous work (Shi et al.,