pith. machine review for the scientific record.

arxiv: 2401.06121 · v1 · submitted 2024-01-11 · 💻 cs.LG · cs.CL

Recognition: 2 theorem links · Lean Theorem

TOFU: A Task of Fictitious Unlearning for LLMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 11:03 UTC · model grok-4.3

classification: 💻 cs.LG · cs.CL
keywords: unlearning · large language models · privacy · benchmark · synthetic data · forgetting

The pith

Unlearning methods for large language models fail to make them behave as if specific training data had never been seen.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TOFU as a benchmark for testing whether unlearning techniques can remove the effects of particular data from trained language models. It uses 200 synthetic author profiles, each built from 20 question-answer pairs, with a designated forget subset. A collection of metrics evaluates how closely an unlearned model matches one that was never exposed to the forget data at all. Results on current baseline methods show they do not reach this standard. This setup matters because models trained on web data can reproduce private information, and reliable forgetting would address privacy risks after training is complete.
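A minimal sketch of the setup described above: 200 synthetic profiles of 20 question-answer pairs each, with a designated subset held out as the forget set. The profile contents and the 10% forget fraction below are illustrative assumptions, not the paper's actual data or split sizes.

```python
from dataclasses import dataclass
import random


@dataclass
class QAPair:
    question: str
    answer: str


@dataclass
class AuthorProfile:
    name: str
    qa_pairs: list  # 20 QAPair entries per profile, per the benchmark description


def make_synthetic_corpus(n_profiles: int = 200, pairs_per_profile: int = 20):
    """Placeholder profiles with the benchmark's stated shape (contents are dummies)."""
    return [
        AuthorProfile(
            name=f"author_{i:03d}",
            qa_pairs=[
                QAPair(
                    question=f"Question {j} about author_{i:03d}?",
                    answer=f"Synthetic fact {j} about author_{i:03d}.",
                )
                for j in range(pairs_per_profile)
            ],
        )
        for i in range(n_profiles)
    ]


def split_forget_retain(profiles, forget_fraction: float = 0.10, seed: int = 0):
    """Designate a subset of profiles as the forget set; the rest is retained.
    The 10% fraction and the random split are illustrative assumptions."""
    rng = random.Random(seed)
    shuffled = profiles[:]
    rng.shuffle(shuffled)
    n_forget = int(len(shuffled) * forget_fraction)
    return shuffled[:n_forget], shuffled[n_forget:]


forget_set, retain_set = split_forget_retain(make_synthetic_corpus())
print(len(forget_set), len(retain_set))  # 20 forget profiles, 180 retained
```

Keeping the profiles entirely synthetic is what lets the forget/retain split be varied and rerun freely without touching any real personal data.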

Core claim

TOFU provides a dataset of 200 diverse synthetic author profiles each consisting of 20 question-answer pairs, along with a forget set and a suite of metrics that together measure whether unlearning produces models equivalent to those never trained on the target data; existing baselines do not achieve this equivalence.

What carries the argument

The TOFU benchmark, built from controlled synthetic author profiles and question-answer pairs, isolates the forgetting task and supplies metrics for judging true unlearning.

If this is right

  • Current unlearning algorithms leave detectable traces of the forget data in model behavior.
  • Effective unlearning requires methods that achieve equivalence to training without the target data.
  • The benchmark supplies a standardized test that future algorithms can be measured against.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same metrics could be applied to real private data once synthetic results improve.
  • Persistent failure across baselines points to deeper limits in how models store and access information.
  • Success on this task would enable post-training removal of specific facts without full retraining.

Load-bearing premise

Results observed on synthetic author profiles will reflect the difficulty of removing real sensitive information from actual large-scale training data.

What would settle it

An unlearning method that produces model outputs on forget-set questions indistinguishable from a model never trained on those profiles, while preserving performance on unrelated data.
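One way to make "indistinguishable" operational, sketched under assumptions: score every forget-set question under both the unlearned model and a retain-only reference, then run a two-sample test on the two score distributions. The length-normalized log-likelihood scores and the Kolmogorov-Smirnov test below are illustrative choices, not necessarily the paper's exact metric suite.

```python
import numpy as np
from scipy.stats import ks_2samp


def forget_indistinguishability(unlearned_scores, reference_scores, alpha=0.05):
    """Compare per-question scores (e.g. length-normalized answer log-likelihoods)
    from the unlearned model against a model retrained without the forget set.
    A large KS p-value means the two score distributions cannot be told apart on
    the forget set; a small one means traces of the forgotten data remain.
    The KS test and the alpha threshold are illustrative assumptions."""
    result = ks_2samp(unlearned_scores, reference_scores)
    return {
        "ks_statistic": result.statistic,
        "p_value": result.pvalue,
        "indistinguishable_at_alpha": result.pvalue > alpha,
    }


# Toy usage: synthetic score vectors stand in for real model evaluations.
rng = np.random.default_rng(0)
unlearned = rng.normal(loc=-2.1, scale=0.4, size=200)  # still slightly too confident
reference = rng.normal(loc=-2.5, scale=0.4, size=200)  # never saw the forget profiles
print(forget_indistinguishability(unlearned, reference))
```

The same harness must also confirm that performance on unrelated (retain) data is preserved; passing only the forget-set test would not settle the claim.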

Original abstract

Large language models trained on massive corpora of data from the web can memorize and reproduce sensitive or private data, raising both legal and ethical concerns. Unlearning, or tuning models to forget information present in their training data, provides us with a way to protect private data after training. Although several methods exist for such unlearning, it is unclear to what extent they result in models equivalent to those where the data to be forgotten was never learned in the first place. To address this challenge, we present TOFU, a Task of Fictitious Unlearning, as a benchmark aimed at helping deepen our understanding of unlearning. We offer a dataset of 200 diverse synthetic author profiles, each consisting of 20 question-answer pairs, and a subset of these profiles called the forget set that serves as the target for unlearning. We compile a suite of metrics that work together to provide a holistic picture of unlearning efficacy. Finally, we provide a set of baseline results from existing unlearning algorithms. Importantly, none of the baselines we consider show effective unlearning, motivating continued efforts to develop approaches for unlearning that effectively tune models so that they truly behave as if they were never trained on the forget data at all.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces TOFU, a benchmark for evaluating machine unlearning in LLMs. It consists of 200 synthetic author profiles (each with 20 QA pairs), designates a forget subset, and defines a suite of metrics that compare post-unlearning model behavior against a retrained model never exposed to the forget set. The central empirical finding is that none of the evaluated baseline unlearning methods achieve effective unlearning on this benchmark.

Significance. If the metrics and synthetic construction are accepted as a valid proxy, the negative result on baselines is useful for motivating stronger unlearning algorithms that aim for behavior equivalent to a model never trained on the target data. The controlled, reproducible nature of the synthetic profiles is a strength for benchmarking.

major comments (2)
  1. [Benchmark construction and results sections] The claim that 'none of the baselines we consider show effective unlearning' (abstract) rests on the synthetic profiles serving as a faithful proxy for real sensitive data. Because the profiles are constructed as isolated, self-contained facts rather than densely entangled knowledge, the metrics may register failure even for methods that would succeed on real data; this makes the negative result non-diagnostic for the motivating claim about real-world unlearning. The paper should either add experiments with more entangled synthetic data or explicitly bound the scope of the generalization claim.
  2. [Experimental setup] The definition of effective unlearning via comparison to a retrained model is central, yet the manuscript does not detail how the retrained model is trained (data mixture, hyperparameters, number of epochs) or whether it is matched exactly to the original training distribution excluding the forget set. Without these controls, differences in the metrics could arise from training variance rather than unlearning failure.
minor comments (2)
  1. [Metrics section] Clarify the exact formulas and weighting for the suite of metrics in the main text rather than deferring entirely to the appendix.
  2. [Dataset description] The 200-profile scale is modest; a brief ablation on profile count or diversity would strengthen the robustness claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major point below and describe the revisions we will incorporate.

Point-by-point responses
  1. Referee: [Benchmark construction and results sections] The claim that 'none of the baselines we consider show effective unlearning' (abstract) rests on the synthetic profiles serving as a faithful proxy for real sensitive data. Because the profiles are constructed as isolated, self-contained facts rather than densely entangled knowledge, the metrics may register failure even for methods that would succeed on real data; this makes the negative result non-diagnostic for the motivating claim about real-world unlearning. The paper should either add experiments with more entangled synthetic data or explicitly bound the scope of the generalization claim.

    Authors: The TOFU benchmark deliberately uses isolated, self-contained synthetic profiles to enable precise, reproducible comparison against a retrained model without confounding from knowledge entanglement. This controlled design isolates the unlearning signal and supports clean metric computation. We agree that the results are therefore most directly diagnostic for discrete factual information rather than densely interconnected real-world data. In revision we will explicitly bound the generalization claims to this controlled setting and add a dedicated paragraph discussing potential differences with more entangled data. We will not add new entangled-data experiments, as the current construction prioritizes control and reproducibility. revision: partial

  2. Referee: [Experimental setup] The definition of effective unlearning via comparison to a retrained model is central, yet the manuscript does not detail how the retrained model is trained (data mixture, hyperparameters, number of epochs) or whether it is matched exactly to the original training distribution excluding the forget set. Without these controls, differences in the metrics could arise from training variance rather than unlearning failure.

    Authors: We thank the referee for identifying this omission. The retrained model was trained on the identical data mixture and distribution as the original model except for the explicit removal of the forget-set profiles, using the same hyperparameters and number of epochs. In the revised manuscript we will add a new subsection in Experimental Setup that fully specifies the retrained-model training procedure, including exact hyperparameters, epoch count, data-mixture details, and verification steps confirming the distribution match. revision: yes
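The control the referee asks for reduces to a single shared configuration that differs only in the training subset. The sketch below, with illustrative hyperparameter values and a hypothetical checkpoint name rather than the paper's actual settings, shows how the original and retrained-reference runs can be pinned to identical settings so that metric gaps cannot be blamed on training drift.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class TrainConfig:
    """Everything held fixed between the original and retrained reference runs.
    The concrete values are illustrative assumptions, not the paper's settings."""
    base_model: str = "base-llm-checkpoint"  # hypothetical checkpoint name
    learning_rate: float = 1e-5
    num_epochs: int = 5
    batch_size: int = 4
    seed: int = 42


def build_training_runs(full_qa_pairs, forget_profile_ids, config: TrainConfig):
    """Return the two runs the referee asks to see matched exactly: the original
    run on all data, and the reference run on the same data minus the forget-set
    examples, under an identical configuration and seed."""
    retain_qa_pairs = [ex for ex in full_qa_pairs
                       if ex["profile_id"] not in forget_profile_ids]
    original_run = {"data": full_qa_pairs, **asdict(config)}
    reference_run = {"data": retain_qa_pairs, **asdict(config)}
    # Only the data differs; any remaining metric gap between an unlearned model
    # and the reference model is then attributable to unlearning, not to drift
    # in hyperparameters, epochs, or data mixture.
    return original_run, reference_run


# Toy usage: 200 profiles x 20 QA pairs, with profiles 0-19 designated for forgetting.
full = [{"profile_id": i, "qa_index": j} for i in range(200) for j in range(20)]
orig, ref = build_training_runs(full, forget_profile_ids=set(range(20)),
                                config=TrainConfig())
print(len(orig["data"]), len(ref["data"]))  # 4000 vs 3600 training examples
```

Running the reference training with multiple seeds would additionally bound the training variance that the referee worries could masquerade as unlearning failure.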

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark creation with no derivations or self-referential reductions

Full rationale

The paper constructs TOFU as a new synthetic dataset of 200 author profiles with associated QA pairs, defines a suite of evaluation metrics for unlearning efficacy, and reports baseline results from existing algorithms on this data. No equations, derivations, or fitted parameters are present that could reduce to the paper's own inputs by construction. Claims about baseline ineffectiveness are direct empirical observations on the provided forget set versus a retrained reference model, with no load-bearing self-citations or ansatzes that collapse the result. The work is self-contained as benchmark introduction and evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper introduces no new free parameters or invented entities, and no mathematical axioms beyond standard machine-learning evaluation practice; it constructs an empirical testbed on top of existing unlearning concepts. Its single ledger entry is a domain assumption.

axioms (1)
  • domain assumption: Synthetic author profiles can serve as a valid proxy for evaluating the difficulty of unlearning real sensitive information.
    The benchmark's claim to practical relevance rests on this unstated premise about generalization from fictitious to real data.

pith-pipeline@v0.9.0 · 5528 in / 1277 out tokens · 46135 ms · 2026-05-16T11:03:31.010641+00:00 · methodology

discussion (0)


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Defenses at Odds: Measuring and Explaining Defense Conflicts in Large Language Models

    cs.CR 2026-05 conditional novelty 8.0

    Sequential LLM defense deployment leads to risk exacerbation in 38.9% of cases due to anti-aligned updates in shared critical layers, addressed by conflict-guided layer freezing.

  2. Inference-Time Machine Unlearning via Gated Activation Redirection

    cs.LG 2026-05 conditional novelty 8.0

    GUARD-IT performs machine unlearning in LLMs via inference-time gated activation redirection, matching or exceeding gradient-based baselines on TOFU and MUSE while preserving utility and working under quantization.

  3. Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning

    cs.CV 2026-04 conditional novelty 8.0

    VLM-UnBench demonstrates that prompt-based training-free unlearning in VLMs leaves forget accuracy near the no-instruction baseline except under oracle conditions that reveal the target concept.

  4. Knowledge Beyond Language: Bridging the Gap in Multilingual Machine Unlearning Evaluation

    cs.CL 2026-05 unverdicted novelty 7.0

    New metrics KSS and KPS are introduced to evaluate multilingual machine unlearning quality and cross-language consistency in LLMs, addressing limitations of single-language evaluation protocols.

  5. PPU-Bench: Real World Benchmark for Personalized Partial Unlearning in Vision Language Models

    cs.CV 2026-05 unverdicted novelty 7.0

    PPU-Bench is a real-world benchmark exposing forget-retain trade-offs in MLLM unlearning and motivating Boundary-Aware Optimization to enforce intra-subject factual boundaries.

  6. ICU-Bench: Benchmarking Continual Unlearning in Multimodal Large Language Models

    cs.AI 2026-05 unverdicted novelty 7.0

    ICU-Bench is a new continual unlearning benchmark for MLLMs using 1000 privacy profiles, 9500 images, and 100 forget tasks, showing existing methods fail to balance forgetting, utility, and scalability.

  7. Erase Persona, Forget Lore: Benchmarking Multimodal Copyright Unlearning in Large Vision Language Models

    cs.CV 2026-05 unverdicted novelty 7.0

    CoVUBench is the first benchmark framework for evaluating multimodal copyright unlearning in LVLMs via synthetic data, systematic variations, and a dual protocol for forgetting efficacy and utility preservation.

  8. Metric Unreliability in Multimodal Machine Unlearning: A Systematic Analysis and Principled Unified Score

    cs.CV 2026-05 unverdicted novelty 7.0

    Standard metrics for multimodal machine unlearning conflict in rankings, addressed by a new oracle-correlated composite score that yields stable results.

  9. Is your algorithm unlearning or untraining?

    cs.LG 2026-04 conditional novelty 7.0

    Machine unlearning conflates reversing the influence of specific training examples (untraining) with removing the full underlying distribution or behavior (unlearning).

  10. Early Data Exposure Improves Robustness to Subsequent Fine-Tuning

    cs.LG 2026-05 conditional novelty 6.0

    Early mixing of post-training data into pretraining improves retention of acquired capabilities after subsequent fine-tuning in language models.

  11. Null Space Constrained Contrastive Visual Forgetting for MLLM Unlearning

    cs.AI 2026-05 unverdicted novelty 6.0

    A contrastive visual forgetting technique constrained to the null space of retained knowledge enables targeted unlearning of visual concepts in MLLMs while preserving non-target visual and all textual knowledge.

  12. CAP: Controllable Alignment Prompting for Unlearning in LLMs

    cs.LG 2026-04 unverdicted novelty 6.0

    CAP enables reversible unlearning of targeted knowledge in LLMs through optimized prompts generated via reinforcement learning, without any parameter updates.

  13. CAP: Controllable Alignment Prompting for Unlearning in LLMs

    cs.LG 2026-04 unverdicted novelty 6.0

    CAP optimizes prompts via reinforcement learning to selectively unlearn target knowledge in LLMs while preserving general capabilities, without any parameter updates and with reversible revocation.

  14. From Anchors to Supervision: Memory-Graph Guided Corpus-Free Unlearning for Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    MAGE builds a memory graph from a user anchor to generate its own supervision signals for corpus-free unlearning, matching the effectiveness of methods that use external reference data on TOFU and RWKU benchmarks.

  15. Latent Instruction Representation Alignment: defending against jailbreaks, backdoors and undesired knowledge in LLMs

    cs.LG 2026-04 unverdicted novelty 6.0

    LIRA aligns latent instruction representations in LLMs to defend against jailbreaks, backdoors, and undesired knowledge, blocking over 99% of PEZ attacks and achieving optimal WMDP forgetting.

  16. Efficient machine unlearning with minimax optimality

    stat.ML 2026-04 unverdicted novelty 6.0

    ULS provides minimax-optimal estimation of remaining-data parameters in machine unlearning with limited access and decomposes error into oracle plus unlearning cost terms.

  17. MPU: Towards Secure and Privacy-Preserving Knowledge Unlearning for Large Language Models

    cs.LG 2026-02 unverdicted novelty 6.0

    MPU is a framework that achieves privacy-preserving unlearning for LLMs by distributing perturbed model copies for local client-side unlearning followed by server-side aggregation with harmonic denoising.

  18. Metric Unreliability in Multimodal Machine Unlearning: A Systematic Analysis and Principled Unified Score

    cs.CV 2026-05 unverdicted novelty 5.0

    Standard unlearning metrics disagree in multimodal settings, but a correlation-weighted Unified Quality Score delivers consistent method rankings across benchmarks.

  19. Bridging Perception and Action: A Lightweight Multimodal Meta-Planner Framework for Robust Earth Observation Agents

    cs.MA 2026-05 unverdicted novelty 4.0

    The LMMP framework improves tool-calling accuracy and task success rates for Earth observation agents by grounding plans in multimodal features and remote sensing expert knowledge via a two-stage training process.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · cited by 17 Pith papers · 5 internal anchors

  1. [1]

    Machine unlearning

    Lucas Bourtoule, Varun Chandrasekaran, Christopher A Choquette-Choo, Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, and Nicolas Papernot. Machine unlearning. In 2021 IEEE Symposium on Security and Privacy (SP), pp.\ 141--159. IEEE, 2021

  2. [2]

    Extracting training data from large language models

    Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21), pp.\ 2633--2650, 2021

  3. [3]

    Membership inference attacks from first principles

    Nicholas Carlini, Steve Chien, Milad Nasr, Shuang Song, Andreas Terzis, and Florian Tramer. Membership inference attacks from first principles. In 2022 IEEE Symposium on Security and Privacy (SP), pp.\ 1897--1914. IEEE, 2022

  4. [4]

    Unlearn what you want to forget: Efficient unlearning for llms, 2023

    Jiaao Chen and Diyi Yang. Unlearn what you want to forget: Efficient unlearning for llms, 2023

  5. [5]

    On the properties of neural machine translation: Encoder -- decoder approaches

    Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder -- decoder approaches. In Dekai Wu, Marine Carpuat, Xavier Carreras, and Eva Maria Vecchi (eds.), Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pp.\ 103--111, Doha,...

  6. [6]

    Editing factual knowledge in language models

    Nicola De Cao, Wilker Aziz, and Ivan Titov. Editing factual knowledge in language models. arXiv preprint arXiv:2104.08164, 2021

  7. [7]

    Who's harry potter? approximate unlearning in llms

    Ronen Eldan and Mark Russinovich. Who's harry potter? approximate unlearning in llms. arXiv preprint arXiv:2310.02238, 2023

  8. [8]

    Towards adversarial evaluations for inexact machine unlearning

    Shashwat Goel, Ameya Prabhu, Amartya Sanyal, Ser-Nam Lim, Philip Torr, and Ponnurangam Kumaraguru. Towards adversarial evaluations for inexact machine unlearning. arXiv preprint arXiv:2201.06640, 2022

  9. [9]

    Eternal sunshine of the spotless net: Selective forgetting in deep networks

    Aditya Golatkar, Alessandro Achille, and Stefano Soatto. Eternal sunshine of the spotless net: Selective forgetting in deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 9304--9312, 2020

  10. [10]

    Certified data removal from machine learning models

    Chuan Guo, Tom Goldstein, Awni Hannun, and Laurens Van Der Maaten. Certified data removal from machine learning models. arXiv preprint arXiv:1911.03030, 2019

  11. [11]

    Separate the wheat from the chaff: Model deficiency unlearning via parameter-efficient module operation, 2023

    Xinshuo Hu, Dongfang Li, Zihao Zheng, Zhenyu Liu, Baotian Hu, and Min Zhang. Separate the wheat from the chaff: Model deficiency unlearning via parameter-efficient module operation, 2023

  12. [12]

    Are large pre-trained language models leaking your personal information?

    Jie Huang, Hanyin Shao, and Kevin Chen-Chuan Chang. Are large pre-trained language models leaking your personal information? In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (eds.), Findings of the Association for Computational Linguistics: EMNLP 2022, pp.\ 2038--2047, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguis...

  13. [13]

    Auditing differentially private machine learning: How private is private sgd?

    Matthew Jagielski, Jonathan Ullman, and Alina Oprea. Auditing differentially private machine learning: How private is private sgd? Advances in Neural Information Processing Systems, 33:22205--22216, 2020

  14. [14]

    Knowledge unlearning for mitigating privacy risks in language models

    Joel Jang, Dongkeun Yoon, Sohee Yang, Sungmin Cha, Moontae Lee, Lajanugen Logeswaran, and Minjoon Seo. Knowledge unlearning for mitigating privacy risks in language models. arXiv preprint arXiv:2210.01504, 2022

  15. [15]

    Evaluating differentially private machine learning in practice

    Bargav Jayaraman and David Evans. Evaluating differentially private machine learning in practice. In 28th USENIX Security Symposium (USENIX Security 19), pp.\ 1895--1912, 2019

  16. [16]

    Propile: Probing privacy leakage in large language models

    Siwon Kim, Sangdoo Yun, Hwaran Lee, Martin Gubri, Sungroh Yoon, and Seong Joon Oh. Propile: Probing privacy leakage in large language models. arXiv preprint arXiv:2307.01881, 2023

  17. [17]

    The brainy student: Scalable unlearning by selectively disobeying the teacher, 2023a

    Meghdad Kurmanji, Peter Triantafillou, and Eleni Triantafillou. The brainy student: Scalable unlearning by selectively disobeying the teacher, 2023a. URL https://openreview.net/forum?id=f9eHl5mKx5i

  18. [18]

    Towards unbounded machine unlearning

    Meghdad Kurmanji, Peter Triantafillou, and Eleni Triantafillou. Towards unbounded machine unlearning. arXiv preprint arXiv:2302.09880, 2023b

  19. [19]

    Textbooks Are All You Need II: phi-1.5 technical report

    Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat Lee. Textbooks are all you need ii: phi-1.5 technical report. arXiv preprint arXiv:2309.05463, 2023

  20. [20]

    Rouge: A package for automatic evaluation of summaries

    Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pp.\ 74--81, 2004

  21. [21]

    Continual learning and private unlearning

    Bo Liu, Qiang Liu, and Peter Stone. Continual learning and private unlearning. In Conference on Lifelong Learning Agents, pp.\ 243--254. PMLR, 2022

  22. [22]

    Quark: Controllable text generation with reinforced unlearning

    Ximing Lu, Sean Welleck, Jack Hessel, Liwei Jiang, Lianhui Qin, Peter West, Prithviraj Ammanabrolu, and Yejin Choi. Quark: Controllable text generation with reinforced unlearning. Advances in neural information processing systems, 35:27591--27609, 2022

  23. [23]

    Dataset inference: Ownership resolution in machine learning

    Pratyush Maini, Mohammad Yaghini, and Nicolas Papernot. Dataset inference: Ownership resolution in machine learning. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=hvdKKV2yt7T

  24. [24]

    Catastrophic interference in connectionist networks: The sequential learning problem

    Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation, volume 24, pp.\ 109--165. Elsevier, 1989

  25. [25]

    Locating and editing factual associations in gpt

    Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in gpt. Advances in Neural Information Processing Systems, 35:17359--17372, 2022

  26. [26]

    Adversary instantiation: Lower bounds for differentially private machine learning

    Milad Nasr, Shuang Song, Abhradeep Thakurta, Nicolas Papernot, and Nicholas Carlini. Adversary instantiation: Lower bounds for differentially private machine learning. In 2021 IEEE Symposium on Security and Privacy (SP), pp.\ 866--882. IEEE, 2021

  27. [27]

    Ccpa regulations: Final regulation text

    CA OAG. Ccpa regulations: Final regulation text. Office of the Attorney General, California Department of Justice, 2021

  28. [28]

    Can sensitive information be deleted from llms? objectives for defending against extraction attacks

    Vaidehi Patil, Peter Hase, and Mohit Bansal. Can sensitive information be deleted from llms? objectives for defending against extraction attacks. arXiv preprint arXiv:2309.17410, 2023

  29. [29]

    In-context unlearning: Language models as few shot unlearners

    Martin Pawelczyk, Seth Neel, and Himabindu Lakkaraju. In-context unlearning: Language models as few shot unlearners. arXiv preprint arXiv:2310.07579, 2023

  30. [30]

    Direct Preference Optimization: Your Language Model is Secretly a Reward Model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290, 2023

  31. [31]

    Remember what you want to forget: Algorithms for machine unlearning

    Ayush Sekhari, Jayadev Acharya, Gautam Kamath, and Ananda Theertha Suresh. Remember what you want to forget: Algorithms for machine unlearning. Advances in Neural Information Processing Systems, 34:18075--18086, 2021

  32. [32]

    Detecting pretraining data from large language models

    Weijia Shi, Anirudh Ajith, Mengzhou Xia, Yangsibo Huang, Daogao Liu, Terra Blevins, Danqi Chen, and Luke Zettlemoyer. Detecting pretraining data from large language models. arXiv preprint arXiv:2310.16789, 2023

  33. [33]

    Membership inference attacks against machine learning models

    Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. Membership inference attacks against machine learning models. In 2017 IEEE symposium on security and privacy (SP), pp.\ 3--18. IEEE, 2017

  34. [34]

    Privacy auditing with one (1) training run

    Thomas Steinke, Milad Nasr, and Matthew Jagielski. Privacy auditing with one (1) training run. arXiv preprint arXiv:2305.08846, 2023

  35. [35]

    On the necessity of auditable algorithmic definitions for machine unlearning

    Anvith Thudi, Hengrui Jia, Ilia Shumailov, and Nicolas Papernot. On the necessity of auditable algorithmic definitions for machine unlearning. In 31st USENIX Security Symposium (USENIX Security 22), pp.\ 4007--4022, 2022

  36. [36]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

  37. [37]

    Regulation (eu) 2016/679 of the european parliament and of the council

    European Union. Regulation (eu) 2016/679 of the european parliament and of the council. Official Journal of the European Union, 2016

  38. [38]

    The eu general data protection regulation (gdpr)

    Paul Voigt and Axel Von dem Bussche. The eu general data protection regulation (gdpr). A Practical Guide, 1st Ed., Cham: Springer International Publishing, 10:3152676, 2017

  39. [39]

    Kga: A general machine unlearning framework based on knowledge gap alignment

    Lingzhi Wang, Tong Chen, Wei Yuan, Xingshan Zeng, Kam-Fai Wong, and Hongzhi Yin. Kga: A general machine unlearning framework based on knowledge gap alignment. arXiv preprint arXiv:2305.06535, 2023

  40. [40]

    Jailbroken: How Does LLM Safety Training Fail?

    Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail? arXiv preprint arXiv:2307.02483, 2023

  41. [41]

    Large language model unlearning, 2023

    Yuanshun Yao, Xiaojun Xu, and Yang Liu. Large language model unlearning, 2023

  42. [42]

    Right to be forgotten in the era of large language models: Implications, challenges, and solutions

    Dawen Zhang, Pamela Finckenberg-Broman, Thong Hoang, Shidong Pan, Zhenchang Xing, Mark Staples, and Xiwei Xu. Right to be forgotten in the era of large language models: Implications, challenges, and solutions. arXiv preprint arXiv:2307.03941, 2023

  43. [43]

    A comprehensive study of knowledge editing for large language models, 2024

    Ningyu Zhang, Yunzhi Yao, Bozhong Tian, Peng Wang, Shumin Deng, Mengru Wang, Zekun Xi, Shengyu Mao, Jintian Zhang, Yuansheng Ni, Siyuan Cheng, Ziwen Xu, Xin Xu, Jia-Chen Gu, Yong Jiang, Pengjun Xie, Fei Huang, Lei Liang, Zhiqiang Zhang, Xiaowei Zhu, Jun Zhou, and Huajun Chen. A comprehensive study of knowledge editing for large language models, 2024

  44. [44]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    Andy Zou, Zifan Wang, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023