Improving LLM Unlearning Robustness via Random Perturbations
Pith reviewed 2026-05-23 04:39 UTC · model grok-4.3
The pith
LLM unlearning methods create backdoor vulnerabilities by aligning forget-tokens with target representations during forgetting.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The forgetting process in LLM unlearning inadvertently learns to align forget-tokens with target-representations, converting those tokens into backdoor triggers that, when present in retain-queries, disrupt the unlearned model's behavior; Random Noise Augmentation re-casts the retaining process as backdoor defense and supplies a model- and method-agnostic mitigation with theoretical support.
What carries the argument
The backdoor attack-defense reframing of unlearning, in which forget-tokens function as triggers paired with target labels and random noise augmentation serves as the corresponding defense during retention.
If this is right
- Unlearned models become susceptible to non-adversarial forget-tokens placed inside otherwise normal retain-queries.
- Forget-tokens function as backdoor triggers that activate target-representations and produce incorrect outputs.
- Random Noise Augmentation during retention measurably raises robustness without degrading forget or retain accuracy.
- Unlearning tends to hide target knowledge behind trigger alignments rather than remove it.
Where Pith is reading between the lines
- The same trigger-alignment dynamic may appear in unlearning settings outside language models, such as vision or tabular data.
- Future unlearning algorithms could be designed to avoid token-label pairing altogether rather than defend against it afterward.
- Evaluating unlearning success may require new tests that deliberately insert forget-tokens into retain data.
Load-bearing premise
The observed fragility arises specifically because the forgetting process learns an alignment between forget-tokens and target representations rather than from unrelated training dynamics.
What would settle it
An experiment that performs unlearning on a model while preventing any statistical alignment between forget-tokens and target outputs, then measures whether single forget-tokens in retain-queries still cause misbehavior.
Figures
read the original abstract
Here, we show that current LLM unlearning methods inherently reduce models' robustness, causing them to misbehave even when a single non-adversarial forget-token is present in the retain-query. Toward understanding underlying causes, we propose a novel theoretical framework that reframes the unlearning process as a backdoor attack and defense problem: we formulate how the forgetting process inadvertently learns to align forget-tokens (backdoor triggers) with the target-representations (target labels). As a result, forget-tokens act as backdoor triggers that, when activated in retain-queries, cause disruptions in unlearned models' behaviors, similar to successful backdoor attacks. The sense that, LLM unlearning methods themselves poison the model, make it more vulnerable to forget-tokens, and hide rather than erase target knowledge, describes their true mechanism. To mitigate the vulnerability caused by the forgetting process, we reinterpret the retaining process as a backdoor defense and propose Random Noise Augmentation (RNA), a lightweight, model and method-agnostic approach with theoretical guarantees for improving the robustness of unlearned models. Extensive experiments demonstrate that RNA significantly improves the robustness of unlearned models while preserving forget and retain performances. This backdoor attack-defense framework offers insights into the mechanism of unlearning that can shed light on future research directions for improving unlearning robustness.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript argues that standard LLM unlearning methods inherently degrade robustness, so that a single non-adversarial forget-token in a retain query triggers misbehavior. It reframes the unlearning process as inadvertently installing a backdoor in which forget-tokens become triggers aligned with target representations, proposes Random Noise Augmentation (RNA) as a model- and method-agnostic defense that reinterprets retention as backdoor defense, supplies theoretical guarantees for RNA, and reports that extensive experiments show RNA improves robustness while preserving forget and retain performance.
Significance. If the backdoor causal account is substantiated and the RNA guarantees hold without hidden fitted parameters, the work would supply both a mechanistic lens on unlearning fragility and a lightweight, plug-in robustness fix. The explicit claim of theoretical guarantees is a potential strength worth verifying.
major comments (2)
- [Abstract] Abstract: the central claim that observed fragility on retain queries containing a single forget-token is caused specifically by inadvertent backdoor alignment (rather than capacity reduction, distribution shift from the unlearning objective, or retain-set sampling artifacts) is presented without any described control experiment or ablation that isolates the alignment effect; this interpretive step is load-bearing for both the framework and the motivation for RNA.
- [Abstract] Abstract: no quantitative results, error bars, baseline numbers, or derivation details for the claimed theoretical guarantees are supplied, so the soundness of the RNA guarantees and the experimental support for the backdoor account cannot be assessed from the provided text.
Simulated Author's Rebuttal
Thank you for the detailed review. We address each major comment below and outline the revisions we will make to strengthen the manuscript.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim that observed fragility on retain queries containing a single forget-token is caused specifically by inadvertent backdoor alignment (rather than capacity reduction, distribution shift from the unlearning objective, or retain-set sampling artifacts) is presented without any described control experiment or ablation that isolates the alignment effect; this interpretive step is load-bearing for both the framework and the motivation for RNA.
Authors: We agree that isolating the backdoor alignment effect from alternatives such as capacity reduction or distribution shift is important for substantiating the framework. Section 3 formally derives how the unlearning objective produces this alignment. We will add new control experiments and ablations in the revised manuscript to empirically distinguish the alignment mechanism from the other factors mentioned. revision: yes
-
Referee: [Abstract] Abstract: no quantitative results, error bars, baseline numbers, or derivation details for the claimed theoretical guarantees are supplied, so the soundness of the RNA guarantees and the experimental support for the backdoor account cannot be assessed from the provided text.
Authors: The abstract is a high-level summary. Quantitative results with error bars (standard deviations across runs), baseline comparisons, and performance metrics appear in Sections 5–6. The derivation of the RNA theoretical guarantees is given in Section 4 with full proofs in the appendix. We will add a concise statement of key quantitative findings to the abstract in revision, subject to length constraints. revision: partial
Circularity Check
No circularity: framework and method presented as independent proposals without reduction to fitted inputs or self-citations
full rationale
The paper proposes a reframing of unlearning as a backdoor attack-defense problem and introduces RNA as a mitigation with claimed theoretical guarantees. No equations, derivations, or load-bearing claims in the abstract or described structure reduce the robustness improvement or the backdoor interpretation to a parameter already tuned on the target metric, a self-citation chain, or an ansatz smuggled from prior work. The central interpretive step is offered as a novel lens rather than a mathematical necessity derived from the paper's own fitted quantities, and experiments are positioned as external validation. This meets the default expectation of a self-contained derivation.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Standard assumptions in machine learning about how token-level signals influence model representations during fine-tuning.
invented entities (1)
-
forget-token as backdoor trigger
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback.arXiv preprint arXiv:2204.05862,
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
Open problems in machine unlearning for ai safety
Fazl Barez, Tingchen Fu, Ameya Prabhu, Stephen Casper, Amartya Sanyal, Adel Bibi, Aidan O’Gara, Robert Kirk, Ben Bucknall, Tim Fist, et al. Open problems in machine unlearning for ai safety. arXiv preprint arXiv:2501.04952,
-
[3]
Minseok Choi, Kyunghyun Min, and Jaegul Choo
doi: 10.1109/SP.2015.35. Minseok Choi, Kyunghyun Min, and Jaegul Choo. Cross-lingual unlearning of selective knowledge in multilingual language models. InFindings of the Association for Computational Linguistics: EMNLP 2024, pp. 10732–10747,
-
[4]
BoolQ: Exploring the surprising difficulty of natural yes/no questions
Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Jill Burstein, Christy Doran, and Thamar Solorio (eds.),Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human ...
work page 2019
-
[5]
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Association for Computational Linguistics. doi: 10.18653/v1/N19-1300. URL https://aclanthology.org/N19-1300/. Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457,
work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/n19-1300
-
[6]
Extracting memorized pieces of (copyrighted) books from open-weight language models
A Feder Cooper, Aaron Gokaslan, Amy B Cyphert, Christopher De Sa, Mark A Lemley, Daniel E Ho, and Percy Liang. Extracting memorized pieces of (copyrighted) books from open-weight language models.arXiv preprint arXiv:2505.12546,
work page internal anchor Pith review Pith/arXiv arXiv
-
[7]
Aghyad Deeb and Fabien Roger. Do unlearning methods remove information from language model weights?arXiv preprint arXiv:2410.08827,
-
[8]
Jai Doshi and Asa Cooper Stickland. Does unlearning truly unlearn? a black box evaluation of llm unlearning methods.arXiv preprint arXiv:2411.12103,
-
[9]
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783,
work page internal anchor Pith review Pith/arXiv arXiv
-
[10]
Who’s harry potter? approximate unlearning in llms.arXiv preprint arXiv:2310.02238,
10 Preprint Ronen Eldan and Mark Russinovich. Who’s harry potter? approximate unlearning in llms.arXiv preprint arXiv:2310.02238,
-
[11]
Simplicity prevails: Rethinking negative preference optimization for LLM unlearning
Chongyu Fan, Jiancheng Liu, Licong Lin, Jinghan Jia, Ruiqi Zhang, Song Mei, and Sijia Liu. Simplicity prevails: Rethinking negative preference optimization for LLM unlearning. InNeurips Safe Generative AI Workshop 2024,
work page 2024
-
[12]
Richard Fang, Rohan Bindu, Akul Gupta, Qiusi Zhan, and Daniel Kang
URL https: //openreview.net/forum?id=zZjLv6F0Ks. Richard Fang, Rohan Bindu, Akul Gupta, Qiusi Zhan, and Daniel Kang. Llm agents can autonomously hack websites.arXiv preprint arXiv:2402.06664,
-
[13]
Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece Kamar
URLhttps://zenodo.org/records/12608602. Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece Kamar. ToxiGen: A large-scale machine-generated dataset for adversarial and implicit hate speech detec- tion. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (eds.),Proceedings of the 60th Annual Meeting of the Associati...
-
[14]
doi: 10.18653/ v1/2022.acl-long.234
Association for Computational Linguistics. doi: 10.18653/ v1/2022.acl-long.234. URLhttps://aclanthology.org/2022.acl-long.234/. Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding.Proceedings of the International Conference on Learning Representations (ICLR),
work page 2022
-
[15]
Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b.arXiv preprint arXiv:2310.06825,
work page internal anchor Pith review Pith/arXiv arXiv
-
[16]
Antonia Karamolegkou, Jiaang Li, Li Zhou, and Anders Søgaard
URL https://openreview.net/forum?id= k9iBo3RmCFd. Antonia Karamolegkou, Jiaang Li, Li Zhou, and Anders Søgaard. Copyright violations and large language models. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.),Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 7403–7412, Singapore, December
work page 2023
-
[17]
doi: 10.18653/v1/2023.emnlp-main
Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main
-
[18]
URLhttps://aclanthology.org/2023.emnlp-main.458/. Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In International conference on machine learning, pp. 1885–1894. PMLR,
work page 2023
-
[19]
Exact unlearning of finetuning data via model merging at scale.arXiv preprint arXiv:2504.04626,
Kevin Kuo, Amrith Setlur, Kartik Srinivas, Aditi Raghunathan, and Virginia Smith. Exact unlearning of finetuning data via model merging at scale.arXiv preprint arXiv:2504.04626,
-
[20]
TextBugger: Generating Adversarial Text Against Real-world Applications
Jinfeng Li, Shouling Ji, Tianyu Du, Bo Li, and Ting Wang. Textbugger: Generating adversarial text against real-world applications.arXiv preprint arXiv:1812.05271,
work page internal anchor Pith review Pith/arXiv arXiv
-
[21]
doi: 10.18653/v1/2022.acl-long.229
Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.229. URL https://aclanthology.org/2022.acl-long. 229/. Michelle Lo, Fazl Barez, and Shay Cohen. Large language models relearn removed concepts. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.),Findings of the Association for Computational Linguistics: ACL 2024, pp. 8306–8323,...
-
[22]
doi: 10.18653/v1/2024.findings-acl.492
Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-acl.492. URL https://aclanthology.org/2024.findings-acl.492/. Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InInternational Confer- ence on Learning Representations,
-
[23]
URL https://openreview.net/forum?id= J5IRyTKZ9s
ISSN 2835-8856. URL https://openreview.net/forum?id= J5IRyTKZ9s. Aengus Lynch, Phillip Guo, Aidan Ewart, Stephen Casper, and Dylan Hadfield-Menell. Eight methods to evaluate robust unlearning in llms.arXiv preprint arXiv:2402.16835,
-
[24]
Pointer Sentinel Mixture Models
URL https://openreview.net/forum?id=B41hNBoWLo. 12 Preprint Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models.arXiv preprint arXiv:1609.07843,
work page internal anchor Pith review Pith/arXiv arXiv
-
[25]
URL https://openreview.net/ forum?id=vjel3nWP2a. Thanh Tam Nguyen, Thanh Trung Huynh, Zhao Ren, Phi Le Nguyen, Alan Wee-Chung Liew, Hongzhi Yin, and Quoc Viet Hung Nguyen. A survey of machine unlearning.arXiv preprint arXiv:2209.02299,
-
[26]
Nils Reimers and Iryna Gurevych
URL https: //openreview.net/forum?id=HPuSIXJaa9. Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3982–3992,
work page 2019
-
[27]
Jonas B Sandbrink. Artificial intelligence and biological misuse: Differentiating risks of language models and biological design tools.arXiv preprint arXiv:2306.13952,
-
[28]
Abhay Sheshadri, Aidan Ewart, Phillip Guo, Aengus Lynch, Cindy Wu, Vivek Hebbar, Henry Sleight, Asa Cooper Stickland, Ethan Perez, Dylan Hadfield-Menell, et al. Latent adversarial training improves robustness to persistent harmful behaviors in llms.arXiv preprint arXiv:2407.15549,
-
[29]
URLhttps://openreview.net/forum?id=TArmA033BU. 13 Preprint Ilia Shumailov, Jamie Hayes, Eleni Triantafillou, Guillermo Ortiz-Jimenez, Nicolas Papernot, Matthew Jagielski, Itay Yona, Heidi Howard, and Eugene Bagdasaryan. Ununlearning: Unlearning is not sufficient for content regulation in advanced generative ai.arXiv preprint arXiv:2407.00106,
-
[30]
CommonsenseQA: A question answering challenge targeting commonsense knowledge
Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Jill Burstein, Christy Doran, and Thamar Solorio (eds.),Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume ...
work page 2019
-
[31]
Association for Computational Linguistics. doi: 10.18653/v1/N19-1421. URL https://aclanthology. org/N19-1421/. Rishub Tamirisa, Bhrugu Bharathi, Andy Zhou, Bo Li, and Mantas Mazeika. Toward robust unlearning for LLMs. InICLR 2024 Workshop on Secure and Trustworthy Large Language Models,
-
[32]
Pratiksha Thaker, Yash Maurya, Shengyuan Hu, Zhiwei Steven Wu, and Virginia Smith
URL https://openreview.net/forum? id=4FIjRodbW6. Pratiksha Thaker, Yash Maurya, Shengyuan Hu, Zhiwei Steven Wu, and Virginia Smith. Guardrail baselines for unlearning in llms.arXiv preprint arXiv:2403.03329,
-
[33]
Jiaxin Wen, Pei Ke, Hao Sun, Zhexin Zhang, Chengfei Li, Jinfeng Bai, and Minlie Huang
URLhttps://openreview.net/forum?id=ar8aRMrmod. Jiaxin Wen, Pei Ke, Hao Sun, Zhexin Zhang, Chengfei Li, Jinfeng Bai, and Minlie Huang. Unveil- ing the implicit toxicity in large language models. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.),Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 1322–1338, Singapor...
work page 2023
-
[34]
doi: 10.18653/v1/2023.emnlp-main.84
Association for Computational Linguis- tics. doi: 10.18653/v1/2023.emnlp-main.84. URL https://aclanthology.org/2023. emnlp-main.84/. 14 Preprint Cheng-Hsin Weng, Yan-Ting Lee, and Shan-Hung (Brandon) Wu. On the trade-off between adversar- ial and backdoor robustness. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.),Advances in Neur...
-
[35]
Xiaoyu Wu, Yifei Pang, Terrance Liu, and Zhiwei Steven Wu
URL https://proceedings.neurips.cc/paper_files/ paper/2020/file/8b4066554730ddfaa0266346bdc1b202-Paper.pdf. Xiaoyu Wu, Yifei Pang, Terrance Liu, and Zhiwei Steven Wu. Breaking the gold standard: Extracting forgotten data under exact unlearning in large language models.arXiv preprint arXiv:2505.24379,
-
[36]
ISSN 0360-0300. doi: 10.1145/3603620. URL https://doi.org/10.1145/3603620. Hongbang Yuan, Zhuoran Jin, Pengfei Cao, Yubo Chen, Kang Liu, and Jun Zhao. Towards robust knowledge unlearning: An adversarial framework for assessing and improving unlearning robust- ness in large language models. InProceedings of the AAAI Conference on Artificial Intelligence, v...
-
[37]
Association for Computational Linguistics. doi: 10.18653/v1/P19-1472. URLhttps://aclanthology.org/P19-1472/. Ruiqi Zhang, Licong Lin, Yu Bai, and Song Mei. Negative preference optimization: From catastrophic collapse to effective unlearning. InFirst Conference on Language Modeling,
-
[38]
Fine-Tuning Language Models from Human Preferences
Association for Computational Linguistics. ISBN 979-8-89176-251-0. doi: 10.18653/v1/2025.acl-long.424. URL https://aclanthology. org/2025.acl-long.424/. Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences.arXiv preprint arXiv:1909.08593,
work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/2025.acl-long.424 2025
-
[39]
Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043,
work page internal anchor Pith review Pith/arXiv arXiv
-
[40]
is a language modeling dataset consisting of over 100 milion tokens extracted from Wikipedia. Following Li et al. (2024), we specifically use the WIKITEXT- 2-RAW-V1 test split as the retain-set for fine-tuning. The dataset is publicly available at https: //huggingface.co/datasets/Salesforce/wikitext. MMLU(Hendrycks et al.,
work page 2024
-
[41]
are two sub-categories in MMLU, corresponding to topics closely related to the WMDP Biology and WMDP Cyber forget-sets. They are used to evaluate the unlearned model’s ability to retain relevant knowledge in areas related to the forget-sets. “I Don’t Know” dataset.We employ a set of 100 refusal responses as the preference answers for DPO+KL and DPO+MSE. F...
work page 2024
-
[42]
for evaluation. Each query is formulated as a default zero-shot QA prompt (Figure 6). Following the setting of prior work (Thaker et al., 2025), we randomly replace anincorrectanswer in the retain QA dataset with the forget keyword “SARS-CoV-19,” while leaving the correct answer unchanged. Since the forget keyword is unrelated to the retain-queries, this ...
work page 2025
-
[43]
for T= 500 update steps, learning rate is5e−5 , batch size of 4, max sequence length is500 with WMDP-Biology 17 Preprint and 768 for WMDP-Cyber. Following previous works (Li et al., 2024), we update three layers of parameters {l, l−1, l−2} of the model for memory efficiency. For the original RM methods, we set the retain weight αbiology = 1200 and αcyber ...
work page 2024
-
[44]
Specifically, we setβ= 0.1 for all PO methods, and γ= 0 for both SimNPO+KL and SimNPO+MSE
For the original PO methods, we adopt the default hyperparameters used in previous works (Yuan et al., 2025b; Fan et al., 2024). Specifically, we setβ= 0.1 for all PO methods, and γ= 0 for both SimNPO+KL and SimNPO+MSE. For the retain weights, we perform a grid search over combinations of (αbiology, αcyber), where αbiology, αcyber ∈ {5,10,20,30,40,50,100}...
work page 2024
-
[45]
The probability that the RNA model rejects the effect induced by noiseϵis: P ∆J rna ∆J u ≤0 ≈P (gper)⊤δ1 −g ⊤δ2 g⊤ϵ ≤ −1 (31) 19 Preprint The ratio of two random normally distributed variables (gper)⊤δ1−g⊤δ2 g⊤ϵ follows a Cauchy distribution with location parameter x0 = 0 and scale parameter γ= q ν η 1 + ||gper|| ||g|| . The cumulative distribution functi...
work page 2018
-
[46]
Which forget-tokens when appearing in the retain-query can cause the unlearned model to misbehave?
We then compute the difference in maximum activation (MaxAct) across latent dimensions at layer l= 7 of each generated token, conditioned on perturbed and original MMLU retain-queries. As shown in Figure 9, the differences in MaxActs exhibit a Gaussian-like distribution. These results provide supporting empirical evidence for Assumption 4.1. 22 Preprint D...
work page 2019
-
[47]
The most pronounced gains are observed when RNA is applied with MSE retain-losses
We observe that RNA consistently improves the robustness of unlearned models across all n-gram perturbations. The most pronounced gains are observed when RNA is applied with MSE retain-losses. Specifically, for DPO+MSE, performance improvements are +24.6 (2-gram), +29.5 (4-gram), +24.9 (8-gram), and +19.6 (16-gram); for SimNPO+MSE, gains are +25.4, +23.3,...
work page 2019
-
[48]
on the general capabilities. In this section, we present an analysis of whether RNA makes the model become more susceptible to other adversarial attacks. Setup.We employ four widely used adversarial attack methods to evaluate the side effects of RNA, including Greedy Coordinate Gradient (GCG; Zou et al. (2023)), TextBugger (Li et al., 2018), DeepWordBug (...
work page 2023
-
[49]
(multiple-choice QA) and ToxiGen (Hartvigsen et al., 2022), commonsense reasoning on WinoGrande (Sakaguchi et al.,
work page 2022
-
[50]
and CommonsenseQA (Talmor et al., 2019), natural language inference/completion on HellaSwag (Zellers et al., 2019), science reasoning on ARC (Clark et al.,
work page 2019
-
[51]
Performance of unlearning methods on these tasks is shown in Table
(easy and challenge), and factuality on BoolQ (Clark et al., 2019). Performance of unlearning methods on these tasks is shown in Table
work page 2019
-
[52]
and Mistral-7B Jiang et al. (2023) models on two 26 Preprint Methods GCG TextBugger DeepWordBug TextFooler AuA ROUGE-L AuA ROUGE-L AuA ROUGE-L AuA ROUGE-L Base Original40.3—33.6—39.6—52.9— Representation Misdirection RMU Original33.6 63.0 30.5 81.2 38.2 76.8 50.5 85.2 w/ RNA40.3+6.760.9−2.130.5+0.079.1−1.138.9+0.776.4−0.450.1−0.484.2−1.0 Adap. RMU Origina...
work page 2023
-
[53]
and TOFU (Maini et al., 2024), but these are less suitable for our experimental setup. Specifically, TOFU is designed to remove the influence of specific data points, making it less applicable in generative settings. While MUSE could be suitable, previous work (Shi et al.,
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.