pith. machine review for the scientific record.

arxiv: 2401.06121 · v1 · submitted 2024-01-11 · 💻 cs.LG · cs.CL

Recognition: 2 theorem links · Lean Theorem

TOFU: A Task of Fictitious Unlearning for LLMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 11:03 UTC · model grok-4.3

classification: 💻 cs.LG · cs.CL
keywords: unlearning · large language models · privacy · benchmark · synthetic data · forgetting

The pith

Unlearning methods for large language models fail to make them behave as if specific training data had never been seen.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TOFU as a benchmark for testing whether unlearning techniques can remove the effects of particular data from trained language models. It uses 200 synthetic author profiles, each built from 20 question-answer pairs, with a designated forget subset. A collection of metrics evaluates how closely an unlearned model matches one that was never exposed to the forget data at all. Results on current baseline methods show they do not reach this standard. This setup matters because models trained on web data can reproduce private information, and reliable forgetting would address privacy risks after training is complete.
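A minimal sketch of the setup described above: 200 synthetic profiles of 20 question-answer pairs each, with a designated subset held out as the forget set. The profile contents and the 10% forget fraction below are illustrative assumptions, not the paper's actual data or split sizes.

```python
from dataclasses import dataclass
import random


@dataclass
class QAPair:
    question: str
    answer: str


@dataclass
class AuthorProfile:
    name: str
    qa_pairs: list  # 20 QAPair entries per profile, per the benchmark description


def make_synthetic_corpus(n_profiles: int = 200, pairs_per_profile: int = 20):
    """Placeholder profiles with the benchmark's stated shape (contents are dummies)."""
    return [
        AuthorProfile(
            name=f"author_{i:03d}",
            qa_pairs=[
                QAPair(
                    question=f"Question {j} about author_{i:03d}?",
                    answer=f"Synthetic fact {j} about author_{i:03d}.",
                )
                for j in range(pairs_per_profile)
            ],
        )
        for i in range(n_profiles)
    ]


def split_forget_retain(profiles, forget_fraction: float = 0.10, seed: int = 0):
    """Designate a subset of profiles as the forget set; the rest is retained.
    The 10% fraction and the random split are illustrative assumptions."""
    rng = random.Random(seed)
    shuffled = profiles[:]
    rng.shuffle(shuffled)
    n_forget = int(len(shuffled) * forget_fraction)
    return shuffled[:n_forget], shuffled[n_forget:]


forget_set, retain_set = split_forget_retain(make_synthetic_corpus())
print(len(forget_set), len(retain_set))  # 20 forget profiles, 180 retained
```

Keeping the profiles entirely synthetic is what lets the forget/retain split be varied and rerun freely without touching any real personal data.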

Core claim

TOFU provides a dataset of 200 diverse synthetic author profiles each consisting of 20 question-answer pairs, along with a forget set and a suite of metrics that together measure whether unlearning produces models equivalent to those never trained on the target data; existing baselines do not achieve this equivalence.

What carries the argument

The TOFU benchmark, built from controlled synthetic author profiles and question-answer pairs, isolates the forgetting task and supplies metrics for judging true unlearning.

If this is right

  • Current unlearning algorithms leave detectable traces of the forget data in model behavior.
  • Effective unlearning requires methods that achieve equivalence to training without the target data.
  • The benchmark supplies a standardized test that future algorithms can be measured against.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same metrics could be applied to real private data once synthetic results improve.
  • Persistent failure across baselines points to deeper limits in how models store and access information.
  • Success on this task would enable post-training removal of specific facts without full retraining.

Load-bearing premise

Results observed on synthetic author profiles will reflect the difficulty of removing real sensitive information from actual large-scale training data.

What would settle it

An unlearning method that produces model outputs on forget-set questions indistinguishable from a model never trained on those profiles, while preserving performance on unrelated data.
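One way to make "indistinguishable" operational, sketched under assumptions: score every forget-set question under both the unlearned model and a retain-only reference, then run a two-sample test on the two score distributions. The length-normalized log-likelihood scores and the Kolmogorov-Smirnov test below are illustrative choices, not necessarily the paper's exact metric suite.

```python
import numpy as np
from scipy.stats import ks_2samp


def forget_indistinguishability(unlearned_scores, reference_scores, alpha=0.05):
    """Compare per-question scores (e.g. length-normalized answer log-likelihoods)
    from the unlearned model against a model retrained without the forget set.
    A large KS p-value means the two score distributions cannot be told apart on
    the forget set; a small one means traces of the forgotten data remain.
    The KS test and the alpha threshold are illustrative assumptions."""
    result = ks_2samp(unlearned_scores, reference_scores)
    return {
        "ks_statistic": result.statistic,
        "p_value": result.pvalue,
        "indistinguishable_at_alpha": result.pvalue > alpha,
    }


# Toy usage: synthetic score vectors stand in for real model evaluations.
rng = np.random.default_rng(0)
unlearned = rng.normal(loc=-2.1, scale=0.4, size=200)  # still slightly too confident
reference = rng.normal(loc=-2.5, scale=0.4, size=200)  # never saw the forget profiles
print(forget_indistinguishability(unlearned, reference))
```

The same harness must also confirm that performance on unrelated (retain) data is preserved; passing only the forget-set test would not settle the claim.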

Original abstract

Large language models trained on massive corpora of data from the web can memorize and reproduce sensitive or private data, raising both legal and ethical concerns. Unlearning, or tuning models to forget information present in their training data, provides us with a way to protect private data after training. Although several methods exist for such unlearning, it is unclear to what extent they result in models equivalent to those where the data to be forgotten was never learned in the first place. To address this challenge, we present TOFU, a Task of Fictitious Unlearning, as a benchmark aimed at helping deepen our understanding of unlearning. We offer a dataset of 200 diverse synthetic author profiles, each consisting of 20 question-answer pairs, and a subset of these profiles called the forget set that serves as the target for unlearning. We compile a suite of metrics that work together to provide a holistic picture of unlearning efficacy. Finally, we provide a set of baseline results from existing unlearning algorithms. Importantly, none of the baselines we consider show effective unlearning, motivating continued efforts to develop approaches for unlearning that effectively tune models so that they truly behave as if they were never trained on the forget data at all.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces TOFU, a benchmark for evaluating machine unlearning in LLMs. It consists of 200 synthetic author profiles (each with 20 QA pairs), designates a forget subset, and defines a suite of metrics that compare post-unlearning model behavior against a retrained model never exposed to the forget set. The central empirical finding is that none of the evaluated baseline unlearning methods achieve effective unlearning on this benchmark.

Significance. If the metrics and synthetic construction are accepted as a valid proxy, the negative result on baselines is useful for motivating stronger unlearning algorithms that aim for behavior equivalent to a model never trained on the target data. The controlled, reproducible nature of the synthetic profiles is a strength for benchmarking.

major comments (2)
  1. [Benchmark construction and results sections] The claim that 'none of the baselines we consider show effective unlearning' (abstract) rests on the synthetic profiles serving as a faithful proxy for real sensitive data. Because the profiles are constructed as isolated, self-contained facts rather than densely entangled knowledge, the metrics may register failure even for methods that would succeed on real data; this makes the negative result non-diagnostic for the motivating claim about real-world unlearning. The paper should either add experiments with more entangled synthetic data or explicitly bound the scope of the generalization claim.
  2. [Experimental setup] The definition of effective unlearning via comparison to a retrained model is central, yet the manuscript does not detail how the retrained model is trained (data mixture, hyperparameters, number of epochs) or whether it is matched exactly to the original training distribution excluding the forget set. Without these controls, differences in the metrics could arise from training variance rather than unlearning failure.
minor comments (2)
  1. [Metrics section] Clarify the exact formulas and weighting for the suite of metrics in the main text rather than deferring entirely to the appendix.
  2. [Dataset description] The 200-profile scale is modest; a brief ablation on profile count or diversity would strengthen the robustness claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major point below and describe the revisions we will incorporate.

Point-by-point responses
  1. Referee: [Benchmark construction and results sections] The claim that 'none of the baselines we consider show effective unlearning' (abstract) rests on the synthetic profiles serving as a faithful proxy for real sensitive data. Because the profiles are constructed as isolated, self-contained facts rather than densely entangled knowledge, the metrics may register failure even for methods that would succeed on real data; this makes the negative result non-diagnostic for the motivating claim about real-world unlearning. The paper should either add experiments with more entangled synthetic data or explicitly bound the scope of the generalization claim.

    Authors: The TOFU benchmark deliberately uses isolated, self-contained synthetic profiles to enable precise, reproducible comparison against a retrained model without confounding from knowledge entanglement. This controlled design isolates the unlearning signal and supports clean metric computation. We agree that the results are therefore most directly diagnostic for discrete factual information rather than densely interconnected real-world data. In revision we will explicitly bound the generalization claims to this controlled setting and add a dedicated paragraph discussing potential differences with more entangled data. We will not add new entangled-data experiments, as the current construction prioritizes control and reproducibility. revision: partial

  2. Referee: [Experimental setup] The definition of effective unlearning via comparison to a retrained model is central, yet the manuscript does not detail how the retrained model is trained (data mixture, hyperparameters, number of epochs) or whether it is matched exactly to the original training distribution excluding the forget set. Without these controls, differences in the metrics could arise from training variance rather than unlearning failure.

    Authors: We thank the referee for identifying this omission. The retrained model was trained on the identical data mixture and distribution as the original model except for the explicit removal of the forget-set profiles, using the same hyperparameters and number of epochs. In the revised manuscript we will add a new subsection in Experimental Setup that fully specifies the retrained-model training procedure, including exact hyperparameters, epoch count, data-mixture details, and verification steps confirming the distribution match. revision: yes
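The control the referee asks for reduces to a single shared configuration that differs only in the training subset. The sketch below, with illustrative hyperparameter values and a hypothetical checkpoint name rather than the paper's actual settings, shows how the original and retrained-reference runs can be pinned to identical settings so that metric gaps cannot be blamed on training drift.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class TrainConfig:
    """Everything held fixed between the original and retrained reference runs.
    The concrete values are illustrative assumptions, not the paper's settings."""
    base_model: str = "base-llm-checkpoint"  # hypothetical checkpoint name
    learning_rate: float = 1e-5
    num_epochs: int = 5
    batch_size: int = 4
    seed: int = 42


def build_training_runs(full_qa_pairs, forget_profile_ids, config: TrainConfig):
    """Return the two runs the referee asks to see matched exactly: the original
    run on all data, and the reference run on the same data minus the forget-set
    examples, under an identical configuration and seed."""
    retain_qa_pairs = [ex for ex in full_qa_pairs
                       if ex["profile_id"] not in forget_profile_ids]
    original_run = {"data": full_qa_pairs, **asdict(config)}
    reference_run = {"data": retain_qa_pairs, **asdict(config)}
    # Only the data differs; any remaining metric gap between an unlearned model
    # and the reference model is then attributable to unlearning, not to drift
    # in hyperparameters, epochs, or data mixture.
    return original_run, reference_run


# Toy usage: 200 profiles x 20 QA pairs, with profiles 0-19 designated for forgetting.
full = [{"profile_id": i, "qa_index": j} for i in range(200) for j in range(20)]
orig, ref = build_training_runs(full, forget_profile_ids=set(range(20)),
                                config=TrainConfig())
print(len(orig["data"]), len(ref["data"]))  # 4000 vs 3600 training examples
```

Running the reference training with multiple seeds would additionally bound the training variance that the referee worries could masquerade as unlearning failure.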

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark creation with no derivations or self-referential reductions

Full rationale

The paper constructs TOFU as a new synthetic dataset of 200 author profiles with associated QA pairs, defines a suite of evaluation metrics for unlearning efficacy, and reports baseline results from existing algorithms on this data. No equations, derivations, or fitted parameters are present that could reduce to the paper's own inputs by construction. Claims about baseline ineffectiveness are direct empirical observations on the provided forget set versus a retrained reference model, with no load-bearing self-citations or ansatzes that collapse the result. The work is self-contained as benchmark introduction and evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper introduces no new free parameters or invented entities, and no mathematical axioms beyond standard machine-learning evaluation practice; it constructs an empirical testbed on top of existing unlearning concepts. Its single ledger entry is a domain assumption.

axioms (1)
  • domain assumption: Synthetic author profiles can serve as a valid proxy for evaluating the difficulty of unlearning real sensitive information.
    The benchmark's claim to practical relevance rests on this unstated premise about generalization from fictitious to real data.

pith-pipeline@v0.9.0 · 5528 in / 1277 out tokens · 46135 ms · 2026-05-16T11:03:31.010641+00:00 · methodology

discussion (0)


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Defenses at Odds: Measuring and Explaining Defense Conflicts in Large Language Models

    cs.CR 2026-05 conditional novelty 8.0

    Sequential LLM defense deployment leads to risk exacerbation in 38.9% of cases due to anti-aligned updates in shared critical layers, addressed by conflict-guided layer freezing.

  2. Inference-Time Machine Unlearning via Gated Activation Redirection

    cs.LG 2026-05 conditional novelty 8.0

    GUARD-IT performs machine unlearning in LLMs via inference-time gated activation redirection, matching or exceeding gradient-based baselines on TOFU and MUSE while preserving utility and working under quantization.

  3. Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning

    cs.CV 2026-04 conditional novelty 8.0

    VLM-UnBench demonstrates that prompt-based training-free unlearning in VLMs leaves forget accuracy near the no-instruction baseline except under oracle conditions that reveal the target concept.

  4. Knowledge Beyond Language: Bridging the Gap in Multilingual Machine Unlearning Evaluation

    cs.CL 2026-05 unverdicted novelty 7.0

    New metrics KSS and KPS are introduced to evaluate multilingual machine unlearning quality and cross-language consistency in LLMs, addressing limitations of single-language evaluation protocols.

  5. PPU-Bench: Real World Benchmark for Personalized Partial Unlearning in Vision Language Models

    cs.CV 2026-05 unverdicted novelty 7.0

    PPU-Bench is a real-world benchmark exposing forget-retain trade-offs in MLLM unlearning and motivating Boundary-Aware Optimization to enforce intra-subject factual boundaries.

  6. ICU-Bench: Benchmarking Continual Unlearning in Multimodal Large Language Models

    cs.AI 2026-05 unverdicted novelty 7.0

    ICU-Bench is a new continual unlearning benchmark for MLLMs using 1000 privacy profiles, 9500 images, and 100 forget tasks, showing existing methods fail to balance forgetting, utility, and scalability.

  7. Erase Persona, Forget Lore: Benchmarking Multimodal Copyright Unlearning in Large Vision Language Models

    cs.CV 2026-05 unverdicted novelty 7.0

    CoVUBench is the first benchmark framework for evaluating multimodal copyright unlearning in LVLMs via synthetic data, systematic variations, and a dual protocol for forgetting efficacy and utility preservation.

  8. Metric Unreliability in Multimodal Machine Unlearning: A Systematic Analysis and Principled Unified Score

    cs.CV 2026-05 unverdicted novelty 7.0

    Standard metrics for multimodal machine unlearning conflict in rankings, addressed by a new oracle-correlated composite score that yields stable results.

  9. Is your algorithm unlearning or untraining?

    cs.LG 2026-04 conditional novelty 7.0

    Machine unlearning conflates reversing the influence of specific training examples (untraining) with removing the full underlying distribution or behavior (unlearning).

  10. Early Data Exposure Improves Robustness to Subsequent Fine-Tuning

    cs.LG 2026-05 conditional novelty 6.0

    Early mixing of post-training data into pretraining improves retention of acquired capabilities after subsequent fine-tuning in language models.

  11. Null Space Constrained Contrastive Visual Forgetting for MLLM Unlearning

    cs.AI 2026-05 unverdicted novelty 6.0

    A contrastive visual forgetting technique constrained to the null space of retained knowledge enables targeted unlearning of visual concepts in MLLMs while preserving non-target visual and all textual knowledge.

  12. CAP: Controllable Alignment Prompting for Unlearning in LLMs

    cs.LG 2026-04 unverdicted novelty 6.0

    CAP enables reversible unlearning of targeted knowledge in LLMs through optimized prompts generated via reinforcement learning, without any parameter updates.

  13. CAP: Controllable Alignment Prompting for Unlearning in LLMs

    cs.LG 2026-04 unverdicted novelty 6.0

    CAP optimizes prompts via reinforcement learning to selectively unlearn target knowledge in LLMs while preserving general capabilities, without any parameter updates and with reversible revocation.

  14. From Anchors to Supervision: Memory-Graph Guided Corpus-Free Unlearning for Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    MAGE builds a memory graph from a user anchor to generate its own supervision signals for corpus-free unlearning, matching the effectiveness of methods that use external reference data on TOFU and RWKU benchmarks.

  15. Latent Instruction Representation Alignment: defending against jailbreaks, backdoors and undesired knowledge in LLMs

    cs.LG 2026-04 unverdicted novelty 6.0

    LIRA aligns latent instruction representations in LLMs to defend against jailbreaks, backdoors, and undesired knowledge, blocking over 99% of PEZ attacks and achieving optimal WMDP forgetting.

  16. Efficient machine unlearning with minimax optimality

    stat.ML 2026-04 unverdicted novelty 6.0

    ULS provides minimax-optimal estimation of remaining-data parameters in machine unlearning with limited access and decomposes error into oracle plus unlearning cost terms.

  17. MPU: Towards Secure and Privacy-Preserving Knowledge Unlearning for Large Language Models

    cs.LG 2026-02 unverdicted novelty 6.0

    MPU is a framework that achieves privacy-preserving unlearning for LLMs by distributing perturbed model copies for local client-side unlearning followed by server-side aggregation with harmonic denoising.

  18. Metric Unreliability in Multimodal Machine Unlearning: A Systematic Analysis and Principled Unified Score

    cs.CV 2026-05 unverdicted novelty 5.0

    Standard unlearning metrics disagree in multimodal settings, but a correlation-weighted Unified Quality Score delivers consistent method rankings across benchmarks.

  19. Bridging Perception and Action: A Lightweight Multimodal Meta-Planner Framework for Robust Earth Observation Agents

    cs.MA 2026-05 unverdicted novelty 4.0

    The LMMP framework improves tool-calling accuracy and task success rates for Earth observation agents by grounding plans in multimodal features and remote sensing expert knowledge via a two-stage training process.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · cited by 17 Pith papers · 5 internal anchors

  1. [1]

    Machine unlearning

    Lucas Bourtoule, Varun Chandrasekaran, Christopher A Choquette-Choo, Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, and Nicolas Papernot. Machine unlearning. In 2021 IEEE Symposium on Security and Privacy (SP), pp.\ 141--159. IEEE, 2021

  2. [2]

    Extracting training data from large language models

    Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21), pp.\ 2633--2650, 2021

  3. [3]

    Membership inference attacks from first principles

    Nicholas Carlini, Steve Chien, Milad Nasr, Shuang Song, Andreas Terzis, and Florian Tramer. Membership inference attacks from first principles. In 2022 IEEE Symposium on Security and Privacy (SP), pp.\ 1897--1914. IEEE, 2022

  4. [4]

    Unlearn what you want to forget: Efficient unlearning for llms, 2023

    Jiaao Chen and Diyi Yang. Unlearn what you want to forget: Efficient unlearning for llms, 2023

  5. [5]

    On the properties of neural machine translation: Encoder -- decoder approaches

    Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder -- decoder approaches. In Dekai Wu, Marine Carpuat, Xavier Carreras, and Eva Maria Vecchi (eds.), Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pp.\ 103--111, Doha,...

  6. [6]

    Editing factual knowledge in language models

    Nicola De Cao, Wilker Aziz, and Ivan Titov. Editing factual knowledge in language models. arXiv preprint arXiv:2104.08164, 2021

  7. [7]

    Who's harry potter? approximate unlearning in llms

    Ronen Eldan and Mark Russinovich. Who's harry potter? approximate unlearning in llms. arXiv preprint arXiv:2310.02238, 2023

  8. [8]

    Towards adversarial evaluations for inexact machine unlearning

    Shashwat Goel, Ameya Prabhu, Amartya Sanyal, Ser-Nam Lim, Philip Torr, and Ponnurangam Kumaraguru. Towards adversarial evaluations for inexact machine unlearning. arXiv preprint arXiv:2201.06640, 2022

  9. [9]

    Eternal sunshine of the spotless net: Selective forgetting in deep networks

    Aditya Golatkar, Alessandro Achille, and Stefano Soatto. Eternal sunshine of the spotless net: Selective forgetting in deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 9304--9312, 2020

  10. [10]

    Certified data removal from machine learning models

    Chuan Guo, Tom Goldstein, Awni Hannun, and Laurens Van Der Maaten. Certified data removal from machine learning models. arXiv preprint arXiv:1911.03030, 2019

  11. [11]

    Separate the wheat from the chaff: Model deficiency unlearning via parameter-efficient module operation, 2023

    Xinshuo Hu, Dongfang Li, Zihao Zheng, Zhenyu Liu, Baotian Hu, and Min Zhang. Separate the wheat from the chaff: Model deficiency unlearning via parameter-efficient module operation, 2023

  12. [12]

    Are large pre-trained language models leaking your personal information?

    Jie Huang, Hanyin Shao, and Kevin Chen-Chuan Chang. Are large pre-trained language models leaking your personal information? In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (eds.), Findings of the Association for Computational Linguistics: EMNLP 2022, pp.\ 2038--2047, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguis...

  13. [13]

    Auditing differentially private machine learning: How private is private sgd?

    Matthew Jagielski, Jonathan Ullman, and Alina Oprea. Auditing differentially private machine learning: How private is private sgd? Advances in Neural Information Processing Systems, 33:22205--22216, 2020

  14. [14]

    Knowledge unlearning for mitigating privacy risks in language models

    Joel Jang, Dongkeun Yoon, Sohee Yang, Sungmin Cha, Moontae Lee, Lajanugen Logeswaran, and Minjoon Seo. Knowledge unlearning for mitigating privacy risks in language models. arXiv preprint arXiv:2210.01504, 2022

  15. [15]

    Evaluating differentially private machine learning in practice

    Bargav Jayaraman and David Evans. Evaluating differentially private machine learning in practice. In 28th USENIX Security Symposium (USENIX Security 19), pp.\ 1895--1912, 2019

  16. [16]

    Propile: Probing privacy leakage in large language models

    Siwon Kim, Sangdoo Yun, Hwaran Lee, Martin Gubri, Sungroh Yoon, and Seong Joon Oh. Propile: Probing privacy leakage in large language models. arXiv preprint arXiv:2307.01881, 2023

  17. [17]

    The brainy student: Scalable unlearning by selectively disobeying the teacher, 2023a

    Meghdad Kurmanji, Peter Triantafillou, and Eleni Triantafillou. The brainy student: Scalable unlearning by selectively disobeying the teacher, 2023a. URL https://openreview.net/forum?id=f9eHl5mKx5i

  18. [18]

    Towards unbounded machine unlearning

    Meghdad Kurmanji, Peter Triantafillou, and Eleni Triantafillou. Towards unbounded machine unlearning. arXiv preprint arXiv:2302.09880, 2023b

  19. [19]

    Textbooks Are All You Need II: phi-1.5 technical report

    Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat Lee. Textbooks are all you need ii: phi-1.5 technical report. arXiv preprint arXiv:2309.05463, 2023

  20. [20]

    Rouge: A package for automatic evaluation of summaries

    Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pp.\ 74--81, 2004

  21. [21]

    Continual learning and private unlearning

    Bo Liu, Qiang Liu, and Peter Stone. Continual learning and private unlearning. In Conference on Lifelong Learning Agents, pp.\ 243--254. PMLR, 2022

  22. [22]

    Quark: Controllable text generation with reinforced unlearning

    Ximing Lu, Sean Welleck, Jack Hessel, Liwei Jiang, Lianhui Qin, Peter West, Prithviraj Ammanabrolu, and Yejin Choi. Quark: Controllable text generation with reinforced unlearning. Advances in neural information processing systems, 35:27591--27609, 2022

  23. [23]

    Dataset inference: Ownership resolution in machine learning

    Pratyush Maini, Mohammad Yaghini, and Nicolas Papernot. Dataset inference: Ownership resolution in machine learning. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=hvdKKV2yt7T

  24. [24]

    Catastrophic interference in connectionist networks: The sequential learning problem

    Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation, volume 24, pp.\ 109--165. Elsevier, 1989

  25. [25]

    Locating and editing factual associations in gpt

    Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in gpt. Advances in Neural Information Processing Systems, 35:17359--17372, 2022

  26. [26]

    Adversary instantiation: Lower bounds for differentially private machine learning

    Milad Nasr, Shuang Song, Abhradeep Thakurta, Nicolas Papernot, and Nicholas Carlini. Adversary instantiation: Lower bounds for differentially private machine learning. In 2021 IEEE Symposium on Security and Privacy (SP), pp.\ 866--882. IEEE, 2021

  27. [27]

    Ccpa regulations: Final regulation text

    CA OAG. Ccpa regulations: Final regulation text. Office of the Attorney General, California Department of Justice, 2021

  28. [28]

    Can sensitive information be deleted from llms? objectives for defending against extraction attacks

    Vaidehi Patil, Peter Hase, and Mohit Bansal. Can sensitive information be deleted from llms? objectives for defending against extraction attacks. arXiv preprint arXiv:2309.17410, 2023

  29. [29]

    In-context unlearning: Language models as few shot unlearners

    Martin Pawelczyk, Seth Neel, and Himabindu Lakkaraju. In-context unlearning: Language models as few shot unlearners. arXiv preprint arXiv:2310.07579, 2023

  30. [30]

    Direct Preference Optimization: Your Language Model is Secretly a Reward Model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290, 2023

  31. [31]

    Remember what you want to forget: Algorithms for machine unlearning

    Ayush Sekhari, Jayadev Acharya, Gautam Kamath, and Ananda Theertha Suresh. Remember what you want to forget: Algorithms for machine unlearning. Advances in Neural Information Processing Systems, 34:18075--18086, 2021

  32. [32]

    Detecting pretraining data from large language models

    Weijia Shi, Anirudh Ajith, Mengzhou Xia, Yangsibo Huang, Daogao Liu, Terra Blevins, Danqi Chen, and Luke Zettlemoyer. Detecting pretraining data from large language models. arXiv preprint arXiv:2310.16789, 2023

  33. [33]

    Membership inference attacks against machine learning models

    Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. Membership inference attacks against machine learning models. In 2017 IEEE symposium on security and privacy (SP), pp.\ 3--18. IEEE, 2017

  34. [34]

    Privacy auditing with one (1) training run

    Thomas Steinke, Milad Nasr, and Matthew Jagielski. Privacy auditing with one (1) training run. arXiv preprint arXiv:2305.08846, 2023

  35. [35]

    On the necessity of auditable algorithmic definitions for machine unlearning

    Anvith Thudi, Hengrui Jia, Ilia Shumailov, and Nicolas Papernot. On the necessity of auditable algorithmic definitions for machine unlearning. In 31st USENIX Security Symposium (USENIX Security 22), pp.\ 4007--4022, 2022

  36. [36]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

  37. [37]

    Regulation (eu) 2016/679 of the european parliament and of the council

    European Union. Regulation (eu) 2016/679 of the european parliament and of the council. Official Journal of the European Union, 2016

  38. [38]

    The eu general data protection regulation (gdpr)

    Paul Voigt and Axel Von dem Bussche. The eu general data protection regulation (gdpr). A Practical Guide, 1st Ed., Cham: Springer International Publishing, 10:3152676, 2017

  39. [39]

    Kga: A general machine unlearning framework based on knowledge gap alignment

    Lingzhi Wang, Tong Chen, Wei Yuan, Xingshan Zeng, Kam-Fai Wong, and Hongzhi Yin. Kga: A general machine unlearning framework based on knowledge gap alignment. arXiv preprint arXiv:2305.06535, 2023

  40. [40]

    Jailbroken: How Does LLM Safety Training Fail?

    Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail? arXiv preprint arXiv:2307.02483, 2023

  41. [41]

    Large language model unlearning, 2023

    Yuanshun Yao, Xiaojun Xu, and Yang Liu. Large language model unlearning, 2023

  42. [42]

    Right to be forgotten in the era of large language models: Implications, challenges, and solutions

    Dawen Zhang, Pamela Finckenberg-Broman, Thong Hoang, Shidong Pan, Zhenchang Xing, Mark Staples, and Xiwei Xu. Right to be forgotten in the era of large language models: Implications, challenges, and solutions. arXiv preprint arXiv:2307.03941, 2023

  43. [43]

    A comprehensive study of knowledge editing for large language models, 2024

    Ningyu Zhang, Yunzhi Yao, Bozhong Tian, Peng Wang, Shumin Deng, Mengru Wang, Zekun Xi, Shengyu Mao, Jintian Zhang, Yuansheng Ni, Siyuan Cheng, Ziwen Xu, Xin Xu, Jia-Chen Gu, Yong Jiang, Pengjun Xie, Fei Huang, Lei Liang, Zhiqiang Zhang, Xiaowei Zhu, Jun Zhou, and Huajun Chen. A comprehensive study of knowledge editing for large language models, 2024

  44. [44]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    Andy Zou, Zifan Wang, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023