Recognition: no theorem link
Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning
Pith reviewed 2026-05-16 22:23 UTC · model grok-4.3
The pith
Negative Preference Optimization unlearns large portions of LLM training data without catastrophic collapse.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Negative Preference Optimization achieves effective unlearning by minimizing a loss that treats undesirable data as negative preferences, resulting in exponentially slower progression toward catastrophic collapse than gradient ascent methods. On the TOFU dataset, NPO-based methods successfully forget 50 percent or more of the training data while preserving model utilities, unlike existing methods that struggle beyond 10 percent.
What carries the argument
Negative Preference Optimization (NPO), an alignment-inspired loss that down-weights the probability of undesirable sequences in a preference-optimization framework.
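A minimal sketch of the objective this line describes, next to the gradient-ascent baseline it replaces. It assumes per-token logits for a batch of forget-set sequences are available from both the current model and a frozen reference model; function names and tensor shapes are illustrative, not taken from the paper's code release.

```python
import torch
import torch.nn.functional as F

def sequence_logprob(logits, labels):
    """Sum of per-token log-probabilities of `labels` under `logits`.
    logits: (batch, seq_len, vocab); labels: (batch, seq_len)."""
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    return token_logp.sum(dim=-1)  # (batch,)

def ga_unlearning_loss(model_logits, labels):
    """Gradient-ascent baseline: maximize the forget-set NLL, i.e. minimize the
    log-likelihood. The per-example push away from the data never weakens."""
    return sequence_logprob(model_logits, labels).mean()

def npo_unlearning_loss(model_logits, ref_logits, labels, beta=0.1):
    """NPO loss on forget data: (2 / beta) * E[log(1 + (pi_theta / pi_ref)^beta)].
    softplus(beta * log_ratio) == log(1 + exp(beta * log_ratio)); the log-ratio
    falls as the model unlearns, so the gradient is automatically down-weighted
    instead of growing without bound."""
    log_ratio = sequence_logprob(model_logits, labels) - sequence_logprob(ref_logits, labels)
    return (2.0 / beta) * F.softplus(beta * log_ratio).mean()
```

The softplus form makes the contrast explicit: once the model's log-probability on forget data drops below the reference model's, the NPO term saturates and its gradient shrinks, whereas the GA objective keeps pushing with undiminished weight.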
If this is right
- NPO enables forgetting of much larger data fractions than gradient-ascent baselines while keeping outputs sensible.
- Unlearning tasks become practical at scales where previous methods produce unusable models.
- The exponential slowdown in collapse rate allows longer optimization runs without utility loss.
- Alignment-style objectives can be repurposed for removal rather than addition of knowledge.
Where Pith is reading between the lines
- NPO may support post-training removal of private or copyrighted material in deployed models without full retraining.
- The same negative-preference framing could extend to unlearning in other modalities or architectures.
- Hybrid pipelines that alternate NPO with utility-preserving steps might further widen the forgetting-utility trade-off.
- Scalability tests on models beyond TOFU sizes would clarify whether the collapse-resistance property holds at frontier scale.
Load-bearing premise
Results observed on synthetic data and the TOFU benchmark will generalize to unlearning sensitive data in large production LLMs without introducing new failure modes.
What would settle it
Train a production-scale LLM, apply NPO to remove 50 percent of its pre-training data, then measure whether utility on held-out tasks remains stable or drops sharply into incoherent output.
original abstract
Large Language Models (LLMs) often memorize sensitive, private, or copyrighted data during pre-training. LLM unlearning aims to eliminate the influence of undesirable data from the pre-trained model while preserving the model's utilities on other tasks. Several practical methods have recently been proposed for LLM unlearning, mostly based on gradient ascent (GA) on the loss of undesirable data. However, on certain unlearning tasks, these methods either fail to effectively unlearn the target data or suffer from catastrophic collapse -- a drastic degradation of the model's utilities. In this paper, we propose Negative Preference Optimization (NPO), a simple alignment-inspired method that could efficiently and effectively unlearn a target dataset. We theoretically show that the progression toward catastrophic collapse by minimizing the NPO loss is exponentially slower than GA. Through experiments on synthetic data and the benchmark TOFU dataset, we demonstrate that NPO-based methods achieve a better balance between unlearning the undesirable data and maintaining the model's utilities. We also observe that NPO-based methods generate more sensible outputs than GA-based methods, whose outputs are often gibberish. Remarkably, on TOFU, NPO-based methods are the first to achieve reasonable unlearning results in forgetting 50% (or more) of the training data, whereas existing methods already struggle with forgetting 10% of training data.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Negative Preference Optimization (NPO), an alignment-inspired objective for LLM unlearning that replaces gradient ascent on forget-set loss. It claims a theoretical result that NPO drives the model toward catastrophic collapse exponentially more slowly than GA, and reports that NPO-based methods are the first to achieve reasonable unlearning on the TOFU benchmark when 50% or more of the training data must be forgotten while preserving utility on retain sets.
Significance. If the slower-collapse analysis and the TOFU results hold under scrutiny, the work supplies a practical, low-overhead alternative to GA that materially expands the feasible regime for machine unlearning in LLMs. The explicit comparison of collapse dynamics and the demonstration that coherent outputs can be retained at high forget ratios are the most load-bearing contributions.
major comments (3)
- [§3, theoretical analysis] The claim that NPO yields an exponentially slower progression to collapse is central to the paper's motivation, yet the derivation steps that produce the exponential factor are only sketched; the full expansion from the NPO loss to the stated rate bound is missing, preventing direct verification.
- [§4.2, Tables 2–4] Results for the 50% forget-set regime in the TOFU experiments are presented without error bars, without multiple random seeds, and with only a minimal ablation of the single free parameter β; because the headline claim rests on these numbers, the absence of statistical characterization weakens the empirical support.
- [§4.3, §5] The manuscript acknowledges that TOFU uses synthetic, low-entanglement QA pairs, but provides no additional experiment or analysis testing whether the observed stability persists when forget-set facts share representations with retain-set facts, which is the regime that would determine practical utility.
minor comments (3)
- [Eq. (3)] Notation for the NPO loss re-uses symbols already defined for the GA baseline; a short clarifying sentence would avoid reader confusion.
- [Figure 2] The loss curves would benefit from explicit annotation of the point at which utility collapse begins for each method.
- [Abstract, §2] The abstract states that NPO is “the first” to succeed at 50% forgetting; the related-work section should explicitly cite the strongest prior GA variants that were tested on the same TOFU splits.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below. Where revisions are feasible, we will incorporate them in the next version of the manuscript to strengthen the theoretical clarity and empirical robustness.
point-by-point responses
- Referee: [§3, theoretical analysis] The claim that NPO yields an exponentially slower progression to collapse is central to the paper's motivation, yet the derivation steps that produce the exponential factor are only sketched; the full expansion from the NPO loss to the stated rate bound is missing, preventing direct verification.
  Authors: We agree that the derivation in §3 was presented too concisely and should be expanded for direct verification. In the revised manuscript we will include the complete step-by-step expansion. Starting from the NPO loss L_NPO = (2/β) E[log(1 + exp(β(ℓ_r − ℓ_f)))], where ℓ_f is the forget-set loss (negative log-likelihood) under the current model and ℓ_r the corresponding loss under the frozen reference model, the chain rule shows that the gradient carries a sigmoid weight bounded by exp(−β Δℓ) with Δℓ = ℓ_f − ℓ_r. This produces the stated exponential slowdown relative to gradient ascent, whose per-example update weight does not shrink as the forget-set loss grows. The full algebraic expansion and all intermediate inequalities will be added to §3. revision: yes
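A compressed version of the argument this response sketches, written under the assumption that ℓ_f and ℓ_r denote negative log-likelihoods on a forget example under the current and frozen reference models; this is an editorial reconstruction, not the paper's full proof.

```latex
\mathcal{L}_{\mathrm{NPO}}(\theta)
  = \frac{2}{\beta}\,\mathbb{E}\Bigl[\log\bigl(1 + e^{\beta(\ell_r - \ell_f)}\bigr)\Bigr],
\qquad
-\nabla_\theta \mathcal{L}_{\mathrm{NPO}}(\theta)
  = 2\,\mathbb{E}\Bigl[\sigma\bigl(\beta(\ell_r - \ell_f)\bigr)\,\nabla_\theta \ell_f\Bigr],
\qquad
\sigma\bigl(\beta(\ell_r - \ell_f)\bigr) \le e^{-\beta\,\Delta\ell},
\quad \Delta\ell = \ell_f - \ell_r .
```

The same ascent direction ∇_θ ℓ_f that GA applies with constant weight is here damped by a factor that decays exponentially in the loss gap Δℓ, which is the slowdown the response appeals to.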
- Referee: [§4.2, Tables 2–4] Results for the 50% forget-set regime in the TOFU experiments are presented without error bars, without multiple random seeds, and with only a minimal ablation of the single free parameter β; because the headline claim rests on these numbers, the absence of statistical characterization weakens the empirical support.
  Authors: We accept that the current presentation lacks statistical characterization. In the revision we will rerun the 50% forget-set experiments across at least three random seeds, report means and standard deviations as error bars in the updated Tables 2–4, and expand the β ablation to a wider grid (0.1 to 10) with the corresponding utility and unlearning metrics. These additions will be included in §4.2. revision: yes
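A small sketch of the statistical reporting this response commits to, assuming per-run metric values (for example a TOFU-style forget-quality and model-utility score) have already been computed for each (β, seed) pair; the grid, seed count, and metric names below are placeholders, not the paper's exact protocol.

```python
import numpy as np

# Hypothetical container: results[beta][seed] -> {"forget_quality": ..., "model_utility": ...}
betas = [0.1, 0.3, 1.0, 3.0, 10.0]   # proposed wider ablation grid
seeds = [0, 1, 2]                     # at least three seeds per configuration

def summarize(results, metric):
    """Return {beta: (mean, std)} for one metric across seeds."""
    summary = {}
    for beta in betas:
        vals = np.array([results[beta][s][metric] for s in seeds])
        summary[beta] = (vals.mean(), vals.std(ddof=1))
    return summary

# summarize(results, "forget_quality") and summarize(results, "model_utility")
# give the mean +/- std entries promised for the revised Tables 2-4.
```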
- Referee: [§4.3, §5] The manuscript acknowledges that TOFU uses synthetic, low-entanglement QA pairs, but provides no additional experiment or analysis testing whether the observed stability persists when forget-set facts share representations with retain-set facts, which is the regime that would determine practical utility.
  Authors: We recognize that TOFU's synthetic construction limits entanglement and that this is a genuine limitation for assessing practical utility. While a comprehensive new benchmark with high entanglement is beyond the scope of the current work, we will add a controlled preliminary experiment on a small synthetic dataset with tunable fact overlap and include the results in the revision. We will also expand the discussion in §5 to explicitly flag entanglement as an important open direction and to note that the current low-entanglement results should be read as a best-case estimate of stability. revision: partial
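One hypothetical way to realize the promised "tunable fact overlap": represent facts as subject-relation-object triples and let a single knob control how many forget-set subjects also contribute facts to the retain set. Everything below (names, knob semantics) is an illustrative assumption, not the authors' protocol.

```python
import random

def make_overlap_splits(n_subjects=100, overlap=0.5, seed=0):
    """Build forget/retain fact splits in which `overlap` controls the fraction of
    forget-set subjects that also contribute facts to the retain set, so forgotten
    and retained facts share (entangled) subject representations."""
    rng = random.Random(seed)
    subjects = [f"Entity_{i}" for i in range(n_subjects)]
    relations = ["born_in", "works_at", "invented", "married_to", "lives_in"]

    forget_subjects = set(subjects[: n_subjects // 2])
    shared = set(rng.sample(sorted(forget_subjects), int(overlap * len(forget_subjects))))

    forget, retain = [], []
    for subj in subjects:
        for rel in relations:
            fact = (subj, rel, f"{rel}_value_of_{subj}")
            if subj in forget_subjects:
                # Shared subjects send roughly half their facts to the retain set,
                # which is what creates the forget/retain entanglement.
                (retain if subj in shared and rng.random() < 0.5 else forget).append(fact)
            else:
                retain.append(fact)
    return forget, retain

# overlap=0.0 reproduces a TOFU-like disjoint setting; overlap near 1.0 forces most
# forgotten subjects to also appear, via other facts, in the retain set.
```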
Circularity Check
NPO derivation is self-contained with no reduction to fitted inputs or self-citations
full rationale
The paper introduces Negative Preference Optimization as a distinct alignment-inspired objective and derives a theoretical comparison showing exponentially slower collapse relative to gradient ascent. No equations reduce the NPO loss or its predictions to a reparameterization of the input data or to a fitted parameter renamed as output. The TOFU and synthetic experiments are presented as independent empirical validation rather than forced by construction. Any self-citations are peripheral and not load-bearing for the central claim of slower collapse or improved unlearning balance.
Axiom & Free-Parameter Ledger
free parameters (1)
- β
axioms (1)
- domain assumption: The NPO loss produces exponentially slower progression toward collapse than gradient ascent on the undesirable-data loss.
Forward citations
Cited by 19 Pith papers
- Defenses at Odds: Measuring and Explaining Defense Conflicts in Large Language Models
  Sequential LLM defense deployment leads to risk exacerbation in 38.9% of cases due to anti-aligned updates in shared critical layers, addressed by conflict-guided layer freezing.
- DurableUn: Quantization-Induced Recovery Attacks in Machine Unlearning
  INT4 quantization recovers up to 22 times more forgotten training data in unlearned LLMs, and the proposed DURABLEUN-SAF method is the first to maintain forgetting across BF16, INT8, and INT4 precisions.
- Knowledge Beyond Language: Bridging the Gap in Multilingual Machine Unlearning Evaluation
  New metrics KSS and KPS are introduced to evaluate multilingual machine unlearning quality and cross-language consistency in LLMs, addressing limitations of single-language evaluation protocols.
- Inducing Artificial Uncertainty in Language Models
  Inducing artificial uncertainty on trivial tasks allows training probes that achieve higher calibration on hard data than standard approaches while retaining performance on easy data.
- PPU-Bench: Real World Benchmark for Personalized Partial Unlearning in Vision Language Models
  PPU-Bench is a real-world benchmark exposing forget-retain trade-offs in MLLM unlearning and motivating Boundary-Aware Optimization to enforce intra-subject factual boundaries.
- MEMOREPAIR: Barrier-First Cascade Repair in Agentic Memory
  MemoRepair formalizes the cascade update problem in agentic memory and solves it via a min-cut reduction that eliminates invalidated memory exposure to 0% while recovering 91-94% of valid successors at 57-76% of basel...
- ICU-Bench: Benchmarking Continual Unlearning in Multimodal Large Language Models
  ICU-Bench is a new continual unlearning benchmark for MLLMs using 1000 privacy profiles, 9500 images, and 100 forget tasks, showing existing methods fail to balance forgetting, utility, and scalability.
- DurableUn: Quantization-Induced Recovery Attacks in Machine Unlearning
  INT4 quantization recovers forgotten data in unlearned LLMs up to 22x, exposing a trilemma with no existing method solving forgetting, utility, and robustness together; a new sharpness-aware method achieves cross-prec...
- Revisiting Privacy Leakage in Machine Unlearning: Membership Inference Beyond the Forgotten Set
  Unlearning increases privacy leakage for the retain set, and a new tri-class membership inference attack distinguishes forget, retain, and unseen data using pre- and post-unlearning model outputs.
- Self-Improving Tabular Language Models via Iterative Group Alignment
  TabGRAA enables self-improving tabular language models through iterative group-relative advantage alignment using modular automated quality signals like distinguishability classifiers.
- Is your algorithm unlearning or untraining?
  Machine unlearning conflates reversing the influence of specific training examples (untraining) with removing the full underlying distribution or behavior (unlearning).
- Unlearners Can Lie: Evaluating and Improving Honesty in LLM Unlearning
  Existing LLM unlearning methods fail honesty standards by hallucinating on forgotten knowledge; ReVa improves rejection rates nearly twofold while enhancing retained honesty.
- Null Space Constrained Contrastive Visual Forgetting for MLLM Unlearning
  A contrastive visual forgetting technique constrained to the null space of retained knowledge enables targeted unlearning of visual concepts in MLLMs while preserving non-target visual and all textual knowledge.
- Threshold-Guided Optimization for Visual Generative Models
  A threshold-guided alignment method lets visual generative models be optimized directly from scalar human ratings instead of requiring paired preference data.
- PrivUn: Unveiling Latent Ripple Effects and Shallow Forgetting in Privacy Unlearning
  PrivUn shows privacy unlearning in LLMs produces gradient-driven ripple effects and only shallow forgetting across layers, with new strategies proposed for deeper removal.
- Separable Expert Architecture: Toward Privacy-Preserving LLM Personalization via Composable Adapters and Deletable User Proxies
  A separable expert architecture uses base models, LoRA adapters, and deletable per-user proxies to enable privacy-preserving personalization and deterministic unlearning in LLMs.
- MPU: Towards Secure and Privacy-Preserving Knowledge Unlearning for Large Language Models
  MPU is a framework that achieves privacy-preserving unlearning for LLMs by distributing perturbed model copies for local client-side unlearning followed by server-side aggregation with harmonic denoising.
- Sparse Concept Anchoring for Interpretable and Controllable Neural Representations
  Sparse Concept Anchoring biases neural latent spaces toward targeted concepts using under 0.1% labels per concept, enabling reversible steering via projection and permanent removal via weight ablation with minimal sid...
- Hard Negative Sample-Augmented DPO Post-Training for Small Language Models
  A six-dimensional MathVerifier supplies hard negatives and per-sample weights that improve DPO performance on math reasoning for a 1.5B Qwen2.5 model over standard SFT and unweighted DPO.
Reference graph
Works this paper leans on
- [1] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
- [2] Lucas Bourtoule, Varun Chandrasekaran, Christopher A. Choquette-Choo, Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, and Nicolas Papernot. Machine unlearning. In 2021 IEEE Symposium on Security and Privacy (SP), pp. 141–159. IEEE, 2021.
- [3] Yinzhi Cao and Junfeng Yang. Towards making systems forget with machine unlearning. In 2015 IEEE Symposium on Security and Privacy, pp. 463–480. IEEE, 2015.
- [4] Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. Quantifying memorization across neural language models. arXiv preprint arXiv:2202.07646, 2022.
- [5] Jiaao Chen and Diyi Yang. Unlearn what you want to forget: Efficient unlearning for LLMs. arXiv preprint arXiv:2310.20150, 2023.
- [6] Shitong Duan, Xiaoyuan Yi, Peng Zhang, Tun Lu, Xing Xie, and Ning Gu. Negating negatives: Alignment without human positive samples via distributional dispreference optimization. arXiv preprint arXiv:2403.03419, 2024.
- [7] Ronen Eldan and Mark Russinovich. Who's Harry Potter? Approximate unlearning in LLMs. arXiv preprint arXiv:2310.02238, 2023.
- [8] Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. KTO: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306, 2024.
- [9] Chuan Guo, Tom Goldstein, Awni Hannun, and Laurens van der Maaten. Certified data removal from machine learning models. arXiv preprint arXiv:1911.03030, 2019.
- [10] Jie Huang, Hanyin Shao, and Kevin Chen-Chuan Chang. Are large pre-trained language models leaking your personal information? arXiv preprint arXiv:2205.12628, 2022.
- [11] Zachary Izzo, Mary Anne Smart, Kamalika Chaudhuri, and James Zou. Approximate data deletion from machine learning models. In International Conference on Artificial Intelligence and Statistics, pp. 2008–2016. PMLR, 2021.
- [12] Joel Jang, Dongkeun Yoon, Sohee Yang, Sungmin Cha, Moontae Lee, Lajanugen Logeswaran, and Minjoon Seo. Knowledge unlearning for mitigating privacy risks in language models. arXiv preprint arXiv:2210.01504, 2022.
- [13] Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Long Phan, et al. The WMDP benchmark: Measuring and reducing malicious use with unlearning. arXiv preprint arXiv:2403.03218, 2024.
- [14] Sijia Liu, Yuanshun Yao, Jinghan Jia, Stephen Casper, Nathalie Baracaldo, Peter Hase, Xiaojun Xu, Yuguang Yao, Hang Li, Kush R. Varshney, et al. Rethinking machine unlearning for large language models. arXiv preprint arXiv:2402.08787, 2024a. Zheyuan Liu, Guangyao Dou, Zhaoxuan Tan, Yijun Tian, and Meng Jiang. Towards safer large language models through mac...
- [15] Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary C. Lipton, and J. Zico Kolter. TOFU: A task of fictitious unlearning for LLMs. arXiv preprint arXiv:2401.06121, 2024.
- [16] Thanh Tam Nguyen, Thanh Trung Huynh, Phi Le Nguyen, Alan Wee-Chung Liew, Hongzhi Yin, and Quoc Viet Hung Nguyen. A survey of machine unlearning. arXiv preprint arXiv:2209.02299, 2022.
- [17] Vaidehi Patil, Peter Hase, and Mohit Bansal. Can sensitive information be deleted from LLMs? Objectives for defending against extraction attacks. arXiv preprint arXiv:2309.17410, 2023.
- [18] Martin Pawelczyk, Seth Neel, and Himabindu Lakkaraju. In-context unlearning: Language models as few shot unlearners. arXiv preprint arXiv:2310.07579, 2023.
- [19] Jonas B. Sandbrink. Artificial intelligence and biological misuse: Differentiating risks of language models and biological design tools. arXiv preprint arXiv:2306.13952, 2023.
- [20] Weijia Shi, Anirudh Ajith, Mengzhou Xia, Yangsibo Huang, Daogao Liu, Terra Blevins, Danqi Chen, and Luke Zettlemoyer. Detecting pretraining data from large language models. arXiv preprint arXiv:2310.16789, 2023.
- [21] Nianwen Si, Hao Zhang, Heyu Chang, Wenlin Zhang, Dan Qu, and Weiqiang Zhang. Knowledge unlearning for LLMs: Tasks, methods, and challenges. arXiv preprint arXiv:2311.15766, 2023.
- [22] Anvith Thudi, Gabriel Deza, Varun Chandrasekaran, and Nicolas Papernot. Unrolling SGD: Understanding factors influencing machine unlearning. In 2022 IEEE 7th European Symposium on Security and Privacy (EuroS&P), pp. 303–319. IEEE, 2022.
- [23] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- [24] Lingzhi Wang, Tong Chen, Wei Yuan, Xingshan Zeng, Kam-Fai Wong, and Hongzhi Yin. KGA: A general machine unlearning framework based on knowledge gap alignment. arXiv preprint arXiv:2305.06535, 2023.
- [25] Jin Yao, Eli Chien, Minxin Du, Xinyao Niu, Tianhao Wang, Zezhou Cheng, and Xiang Yue. Machine unlearning of pre-trained large language models. arXiv preprint arXiv:2402.15159, 2024.
- [26] Yuanshun Yao, Xiaojun Xu, and Yang Liu. Large language model unlearning. arXiv preprint arXiv:2310.10683, 2023.