pith. machine review for the scientific record.

arxiv: 2404.05868 · v2 · submitted 2024-04-08 · 💻 cs.LG · cs.AI · cs.CL · stat.ML

Recognition: no theorem link

Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 22:23 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL · stat.ML
keywords LLM unlearning · Negative Preference Optimization · catastrophic collapse · gradient ascent · TOFU benchmark · machine unlearning · alignment methods · model forgetting

The pith

Negative Preference Optimization unlearns large portions of LLM training data without catastrophic collapse.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Negative Preference Optimization (NPO) as an alignment-inspired method to remove the influence of undesirable data from large language models. It proves that minimizing the NPO loss slows progression toward model collapse exponentially compared to gradient ascent on the undesirable loss. Experiments on synthetic data and the TOFU benchmark show NPO maintains higher model utility while achieving stronger forgetting. On TOFU, NPO succeeds at forgetting 50 percent or more of the data, where prior methods already fail at 10 percent. The resulting outputs remain coherent rather than collapsing into gibberish.

Core claim

Negative Preference Optimization achieves effective unlearning by minimizing a loss that treats undesirable data as negative preferences, resulting in exponentially slower progression toward catastrophic collapse than gradient ascent methods. On the TOFU dataset, NPO-based methods successfully forget 50 percent or more of the training data while preserving model utilities, unlike existing methods that struggle beyond 10 percent.

What carries the argument

Negative Preference Optimization (NPO), an alignment-inspired loss that down-weights the probability of undesirable sequences in a preference-optimization framework.
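In the form sketched in the simulated rebuttal below, L_NPO = E[log(1 + exp(β(ℓ_f − ℓ_r)))], the loss and its gradient weight fit in a few lines. A minimal sketch, assuming per-example scalar losses ℓ_f (forget set) and ℓ_r (reference); the paper's exact normalization, e.g. any prefactor in β, may differ:

```python
import math

def npo_loss(l_f, l_r, beta=0.1):
    # NPO-style per-example loss in the form the rebuttal sketches:
    # log(1 + exp(beta * (l_f - l_r))). beta is the single free parameter.
    return math.log1p(math.exp(beta * (l_f - l_r)))

def npo_grad_weight(l_f, l_r, beta=0.1):
    # Chain rule: dL/d(l_f) = beta * sigmoid(beta * (l_f - l_r)).
    # Gradient ascent applies a constant weight, so its updates never damp;
    # here the weight decays like exp(-beta * (l_r - l_f)) once l_r >> l_f.
    z = beta * (l_f - l_r)
    return beta / (1.0 + math.exp(-z))
```

As the loss gap widens during unlearning, the NPO update weight shrinks toward zero (npo_grad_weight(1.0, 51.0) is far below npo_grad_weight(1.0, 1.0)), which is the mechanism behind the claimed exponentially slower collapse.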

If this is right

  • NPO enables forgetting of much larger data fractions than gradient-ascent baselines while keeping outputs sensible.
  • Unlearning tasks become practical at scales where previous methods produce unusable models.
  • The exponential slowdown in collapse rate allows longer optimization runs without utility loss.
  • Alignment-style objectives can be repurposed for removal rather than addition of knowledge.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • NPO may support post-training removal of private or copyrighted material in deployed models without full retraining.
  • The same negative-preference framing could extend to unlearning in other modalities or architectures.
  • Hybrid pipelines that alternate NPO with utility-preserving steps might further widen the forgetting-utility trade-off.
  • Scalability tests on models beyond TOFU sizes would clarify whether the collapse-resistance property holds at frontier scale.

Load-bearing premise

Results observed on synthetic data and the TOFU benchmark will generalize to unlearning sensitive data in large production LLMs without introducing new failure modes.

What would settle it

Train a production-scale LLM, apply NPO to remove 50 percent of its pre-training data, then measure whether utility on held-out tasks remains stable or drops sharply into incoherent output.

original abstract

Large Language Models (LLMs) often memorize sensitive, private, or copyrighted data during pre-training. LLM unlearning aims to eliminate the influence of undesirable data from the pre-trained model while preserving the model's utilities on other tasks. Several practical methods have recently been proposed for LLM unlearning, mostly based on gradient ascent (GA) on the loss of undesirable data. However, on certain unlearning tasks, these methods either fail to effectively unlearn the target data or suffer from catastrophic collapse -- a drastic degradation of the model's utilities. In this paper, we propose Negative Preference Optimization (NPO), a simple alignment-inspired method that could efficiently and effectively unlearn a target dataset. We theoretically show that the progression toward catastrophic collapse by minimizing the NPO loss is exponentially slower than GA. Through experiments on synthetic data and the benchmark TOFU dataset, we demonstrate that NPO-based methods achieve a better balance between unlearning the undesirable data and maintaining the model's utilities. We also observe that NPO-based methods generate more sensible outputs than GA-based methods, whose outputs are often gibberish. Remarkably, on TOFU, NPO-based methods are the first to achieve reasonable unlearning results in forgetting 50% (or more) of the training data, whereas existing methods already struggle with forgetting 10% of training data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper proposes Negative Preference Optimization (NPO), an alignment-inspired objective for LLM unlearning that replaces gradient ascent on forget-set loss. It claims a theoretical result that NPO drives the model toward catastrophic collapse exponentially more slowly than GA, and reports that NPO-based methods are the first to achieve reasonable unlearning on the TOFU benchmark when 50% or more of the training data must be forgotten while preserving utility on retain sets.

Significance. If the slower-collapse analysis and the TOFU results hold under scrutiny, the work supplies a practical, low-overhead alternative to GA that materially expands the feasible regime for machine unlearning in LLMs. The explicit comparison of collapse dynamics and the demonstration that coherent outputs can be retained at high forget ratios are the most load-bearing contributions.

major comments (3)
  1. [§3] §3 (theoretical analysis): the claim that NPO yields an exponentially slower progression to collapse is central to the paper’s motivation, yet the derivation steps that produce the exponential factor are only sketched; the full expansion from the NPO loss to the stated rate bound is missing, preventing direct verification.
  2. [§4.2, Tables 2–4] TOFU experiments (Tables 2–4 and §4.2): results for the 50% forget-set regime are presented without error bars, without multiple random seeds, and with only minimal ablation on the single free parameter β; because the headline claim rests on these numbers, the absence of statistical characterization weakens the empirical support.
  3. [§4.3, §5] §4.3 and §5: the manuscript acknowledges that TOFU uses synthetic, low-entanglement QA pairs, but provides no additional experiment or analysis that tests whether the observed stability persists when forget-set facts share representations with retain-set facts, which is the regime that would determine practical utility.
minor comments (3)
  1. [Eq. (3)] Notation for the NPO loss (Eq. 3) re-uses symbols already defined for the GA baseline; a short clarifying sentence would avoid reader confusion.
  2. [Figure 2] Figure 2 (loss curves) would benefit from explicit annotation of the point at which utility collapse begins for each method.
  3. [Abstract, §2] The abstract states that NPO is “the first” to succeed at 50% forgetting; the related-work section should explicitly cite the strongest prior GA variants that were tested on the same TOFU splits.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below. Where revisions are feasible, we will incorporate them in the next version of the manuscript to strengthen the theoretical clarity and empirical robustness.

point-by-point responses
  1. Referee: [§3] §3 (theoretical analysis): the claim that NPO yields an exponentially slower progression to collapse is central to the paper’s motivation, yet the derivation steps that produce the exponential factor are only sketched; the full expansion from the NPO loss to the stated rate bound is missing, preventing direct verification.

    Authors: We agree that the derivation in §3 was presented too concisely and should be expanded for direct verification. In the revised manuscript we will include the complete step-by-step expansion. Starting from the NPO loss L_NPO = E[log(1 + exp(β (ℓ_f - ℓ_r)))] where ℓ_f is the forget-set loss, the gradient with respect to model parameters yields an update whose magnitude is bounded by a term that decays exponentially in the loss gap (specifically, the factor exp(-β Δℓ) appears after applying the chain rule and bounding the sigmoid derivative). This produces the stated exponential slowdown relative to gradient ascent, whose update magnitude remains linear in the loss. The full algebraic expansion and all intermediate inequalities will be added to §3. revision: yes

  2. Referee: [§4.2, Tables 2–4] TOFU experiments (Tables 2–4 and §4.2): results for the 50% forget-set regime are presented without error bars, without multiple random seeds, and with only minimal ablation on the single free parameter β; because the headline claim rests on these numbers, the absence of statistical characterization weakens the empirical support.

    Authors: We accept that the current presentation lacks statistical characterization. In the revision we will rerun the 50% forget-set experiments across at least three random seeds, report means and standard deviations as error bars in the updated Tables 2–4, and expand the β ablation to a wider grid (0.1 to 10) with corresponding utility and unlearning metrics. These additions will be included in §4.2. revision: yes

  3. Referee: [§4.3, §5] §4.3 and §5: the manuscript acknowledges that TOFU uses synthetic, low-entanglement QA pairs, but provides no additional experiment or analysis that tests whether the observed stability persists when forget-set facts share representations with retain-set facts, which is the regime that would determine practical utility.

    Authors: We recognize that TOFU’s synthetic construction limits entanglement and that this is a genuine limitation for assessing practical utility. While a comprehensive new benchmark with high entanglement is beyond the scope of the current work, we will add a controlled preliminary experiment on a small synthetic dataset with tunable fact overlap and include the results in the revision. We will also expand the discussion in §5 to explicitly flag entanglement as an important open direction and note the current results as a lower bound on stability. revision: partial
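The gradient bound promised in response 1 can be written out explicitly. A hedged reconstruction under that response's notation (the paper's own derivation may differ in constants):

```latex
\nabla_\theta L_{\mathrm{NPO}}
  = \mathbb{E}\!\left[\beta\,\sigma\!\big(\beta(\ell_f-\ell_r)\big)\,
      \nabla_\theta(\ell_f-\ell_r)\right],
\qquad \sigma(z)=\frac{1}{1+e^{-z}}.
```

Since $\sigma(z)\le e^{z}$, the update magnitude obeys $\|\nabla_\theta L_{\mathrm{NPO}}\| \le \beta\, e^{-\beta\,\Delta\ell}\,\|\nabla_\theta(\ell_f-\ell_r)\|$ with $\Delta\ell = \ell_r-\ell_f$, so the step size decays exponentially in the loss gap, whereas gradient ascent applies $\nabla_\theta \ell_f$ with no damping.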

Circularity Check

0 steps flagged

NPO derivation is self-contained with no reduction to fitted inputs or self-citations

full rationale

The paper introduces Negative Preference Optimization as a distinct alignment-inspired objective and derives a theoretical comparison showing exponentially slower collapse relative to gradient ascent. No equations reduce the NPO loss or its predictions to a reparameterization of the input data or to a fitted parameter renamed as output. The TOFU and synthetic experiments are presented as independent empirical validation rather than forced by construction. Any self-citations are peripheral and not load-bearing for the central claim of slower collapse or improved unlearning balance.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the definition of the NPO loss function and the theoretical analysis of its dynamics; no new physical entities are postulated.

free parameters (1)
  • beta
    Hyperparameter controlling the strength of the negative preference term in the NPO objective.
axioms (1)
  • domain assumption — The NPO loss produces exponentially slower progression to collapse than gradient ascent on the undesirable-data loss.
    Invoked in the theoretical section comparing the dynamics of the two objectives.

pith-pipeline@v0.9.0 · 5544 in / 1197 out tokens · 36246 ms · 2026-05-16T22:23:06.857645+00:00 · methodology

discussion (0)


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Defenses at Odds: Measuring and Explaining Defense Conflicts in Large Language Models

    cs.CR 2026-05 conditional novelty 8.0

    Sequential LLM defense deployment leads to risk exacerbation in 38.9% of cases due to anti-aligned updates in shared critical layers, addressed by conflict-guided layer freezing.

  2. DurableUn: Quantization-Induced Recovery Attacks in Machine Unlearning

    cs.LG 2026-05 conditional novelty 8.0

    INT4 quantization recovers up to 22 times more forgotten training data in unlearned LLMs, and the proposed DURABLEUN-SAF method is the first to maintain forgetting across BF16, INT8, and INT4 precisions.

  3. Knowledge Beyond Language: Bridging the Gap in Multilingual Machine Unlearning Evaluation

    cs.CL 2026-05 unverdicted novelty 7.0

    New metrics KSS and KPS are introduced to evaluate multilingual machine unlearning quality and cross-language consistency in LLMs, addressing limitations of single-language evaluation protocols.

  4. Inducing Artificial Uncertainty in Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    Inducing artificial uncertainty on trivial tasks allows training probes that achieve higher calibration on hard data than standard approaches while retaining performance on easy data.

  5. PPU-Bench: Real World Benchmark for Personalized Partial Unlearning in Vision Language Models

    cs.CV 2026-05 unverdicted novelty 7.0

    PPU-Bench is a real-world benchmark exposing forget-retain trade-offs in MLLM unlearning and motivating Boundary-Aware Optimization to enforce intra-subject factual boundaries.

  6. MEMOREPAIR: Barrier-First Cascade Repair in Agentic Memory

    cs.AI 2026-05 unverdicted novelty 7.0

    MemoRepair formalizes the cascade update problem in agentic memory and solves it via a min-cut reduction that eliminates invalidated memory exposure to 0% while recovering 91-94% of valid successors at 57-76% of basel...

  7. ICU-Bench: Benchmarking Continual Unlearning in Multimodal Large Language Models

    cs.AI 2026-05 unverdicted novelty 7.0

    ICU-Bench is a new continual unlearning benchmark for MLLMs using 1000 privacy profiles, 9500 images, and 100 forget tasks, showing existing methods fail to balance forgetting, utility, and scalability.

  8. DurableUn: Quantization-Induced Recovery Attacks in Machine Unlearning

    cs.LG 2026-05 unverdicted novelty 7.0

    INT4 quantization recovers forgotten data in unlearned LLMs up to 22x, exposing a trilemma with no existing method solving forgetting, utility, and robustness together; a new sharpness-aware method achieves cross-prec...

  9. Revisiting Privacy Leakage in Machine Unlearning: Membership Inference Beyond the Forgotten Set

    cs.CR 2026-05 unverdicted novelty 7.0

    Unlearning increases privacy leakage for the retain set, and a new tri-class membership inference attack distinguishes forget, retain, and unseen data using pre- and post-unlearning model outputs.

  10. Self-Improving Tabular Language Models via Iterative Group Alignment

    cs.LG 2026-04 unverdicted novelty 7.0

    TabGRAA enables self-improving tabular language models through iterative group-relative advantage alignment using modular automated quality signals like distinguishability classifiers.

  11. Is your algorithm unlearning or untraining?

    cs.LG 2026-04 conditional novelty 7.0

    Machine unlearning conflates reversing the influence of specific training examples (untraining) with removing the full underlying distribution or behavior (unlearning).

  12. Unlearners Can Lie: Evaluating and Improving Honesty in LLM Unlearning

    cs.LG 2026-05 conditional novelty 6.0

    Existing LLM unlearning methods fail honesty standards by hallucinating on forgotten knowledge; ReVa improves rejection rates nearly twofold while enhancing retained honesty.

  13. Null Space Constrained Contrastive Visual Forgetting for MLLM Unlearning

    cs.AI 2026-05 unverdicted novelty 6.0

    A contrastive visual forgetting technique constrained to the null space of retained knowledge enables targeted unlearning of visual concepts in MLLMs while preserving non-target visual and all textual knowledge.

  14. Threshold-Guided Optimization for Visual Generative Models

    cs.LG 2026-05 unverdicted novelty 6.0

    A threshold-guided alignment method lets visual generative models be optimized directly from scalar human ratings instead of requiring paired preference data.

  15. PrivUn: Unveiling Latent Ripple Effects and Shallow Forgetting in Privacy Unlearning

    cs.LG 2026-04 unverdicted novelty 6.0

    PrivUn shows privacy unlearning in LLMs produces gradient-driven ripple effects and only shallow forgetting across layers, with new strategies proposed for deeper removal.

  16. Separable Expert Architecture: Toward Privacy-Preserving LLM Personalization via Composable Adapters and Deletable User Proxies

    cs.AI 2026-04 unverdicted novelty 6.0

    A separable expert architecture uses base models, LoRA adapters, and deletable per-user proxies to enable privacy-preserving personalization and deterministic unlearning in LLMs.

  17. MPU: Towards Secure and Privacy-Preserving Knowledge Unlearning for Large Language Models

    cs.LG 2026-02 unverdicted novelty 6.0

    MPU is a framework that achieves privacy-preserving unlearning for LLMs by distributing perturbed model copies for local client-side unlearning followed by server-side aggregation with harmonic denoising.

  18. Sparse Concept Anchoring for Interpretable and Controllable Neural Representations

    cs.LG 2025-12 unverdicted novelty 6.0

    Sparse Concept Anchoring biases neural latent spaces toward targeted concepts using under 0.1% labels per concept, enabling reversible steering via projection and permanent removal via weight ablation with minimal sid...

  19. Hard Negative Sample-Augmented DPO Post-Training for Small Language Models

    cs.LG 2025-12 unverdicted novelty 5.0

    A six-dimensional MathVerifier supplies hard negatives and per-sample weights that improve DPO performance on math reasoning for a 1.5B Qwen2.5 model over standard SFT and unweighted DPO.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · cited by 18 Pith papers · 6 internal anchors

  1. [1]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862,

  2. [2]

    Machine unlearning

    Lucas Bourtoule, Varun Chandrasekaran, Christopher A Choquette-Choo, Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, and Nicolas Papernot. Machine unlearning. In 2021 IEEE Symposium on Security and Privacy (SP), pp. 141–159. IEEE,

  3. [3]

    Towards making systems forget with machine unlearning

    Yinzhi Cao and Junfeng Yang. Towards making systems forget with machine unlearning. In 2015 IEEE symposium on security and privacy, pp. 463–480. IEEE,

  4. [4]

    Quantifying Memorization Across Neural Language Models

    Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. Quantifying memorization across neural language models. arXiv preprint arXiv:2202.07646,

  5. [5]

    Unlearn what you want to forget: Efficient unlearning for llms

    Jiaao Chen and Diyi Yang. Unlearn what you want to forget: Efficient unlearning for llms. arXiv preprint arXiv:2310.20150,

  6. [6]

    Negating negatives: Alignment without human positive samples via distributional dispreference optimization

    Shitong Duan, Xiaoyuan Yi, Peng Zhang, Tun Lu, Xing Xie, and Ning Gu. Negating negatives: Alignment without human positive samples via distributional dispreference optimization. arXiv preprint arXiv:2403.03419,

  7. [7]

    Who’s harry potter? approximate unlearning in llms

    Ronen Eldan and Mark Russinovich. Who’s harry potter? approximate unlearning in llms. arXiv preprint arXiv:2310.02238,

  8. [8]

    KTO: Model Alignment as Prospect Theoretic Optimization

    Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Kto: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306,

  9. [9]

    Certified data removal from machine learning models

    Chuan Guo, Tom Goldstein, Awni Hannun, and Laurens Van Der Maaten. Certified data removal from machine learning models. arXiv preprint arXiv:1911.03030,

  10. [10]

    Are large pre-trained language models leaking your personal information?

    Jie Huang, Hanyin Shao, and Kevin Chen-Chuan Chang. Are large pre-trained language models leaking your personal information? arXiv preprint arXiv:2205.12628,

  11. [11]

    Approximate data deletion from machine learning models

    Zachary Izzo, Mary Anne Smart, Kamalika Chaudhuri, and James Zou. Approximate data deletion from machine learning models. In International Conference on Artificial Intelligence and Statistics, pp. 2008–2016. PMLR,

  12. [12]

    Knowledge unlearning for mitigating privacy risks in language models

    Joel Jang, Dongkeun Yoon, Sohee Yang, Sungmin Cha, Moontae Lee, Lajanugen Logeswaran, and Minjoon Seo. Knowledge unlearning for mitigating privacy risks in language models. arXiv preprint arXiv:2210.01504,

  13. [13]

    The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning

    Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D Li, Ann-Kathrin Dombrowski, Shashwat Goel, Long Phan, et al. The wmdp benchmark: Measuring and reducing malicious use with unlearning. arXiv preprint arXiv:2403.03218,

  14. [14]

    Rethinking machine unlearning for large language models

    Sijia Liu, Yuanshun Yao, Jinghan Jia, Stephen Casper, Nathalie Baracaldo, Peter Hase, Xiaojun Xu, Yuguang Yao, Hang Li, Kush R Varshney, et al. Rethinking machine unlearning for large language models. arXiv preprint arXiv:2402.08787, 2024a. Zheyuan Liu, Guangyao Dou, Zhaoxuan Tan, Yijun Tian, and Meng Jiang. Towards safer large language models through mac...

  15. [15]

    TOFU: A Task of Fictitious Unlearning for LLMs

    Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary C Lipton, and J Zico Kolter. Tofu: A task of fictitious unlearning for llms. arXiv preprint arXiv:2401.06121,

  16. [16]

    A survey of machine unlearning

    Thanh Tam Nguyen, Thanh Trung Huynh, Phi Le Nguyen, Alan Wee-Chung Liew, Hongzhi Yin, and Quoc Viet Hung Nguyen. A survey of machine unlearning. arXiv preprint arXiv:2209.02299,

  17. [17]

    Can sensitive information be deleted from llms? objectives for defending against extraction attacks

    Vaidehi Patil, Peter Hase, and Mohit Bansal. Can sensitive information be deleted from llms? objectives for defending against extraction attacks. arXiv preprint arXiv:2309.17410,

  18. [18]

    In-context unlearning: Language models as few shot unlearners

    Martin Pawelczyk, Seth Neel, and Himabindu Lakkaraju. In-context unlearning: Language models as few shot unlearners. arXiv preprint arXiv:2310.07579,

  19. [19]

    Artificial intelligence and biological misuse: Differentiating risks of language models and biological design tools

    Jonas B Sandbrink. Artificial intelligence and biological misuse: Differentiating risks of language models and biological design tools. arXiv preprint arXiv:2306.13952,

  20. [20]

    Detecting pretraining data from large language models

    Weijia Shi, Anirudh Ajith, Mengzhou Xia, Yangsibo Huang, Daogao Liu, Terra Blevins, Danqi Chen, and Luke Zettlemoyer. Detecting pretraining data from large language models. arXiv preprint arXiv:2310.16789,

  21. [21]

    Knowledge unlearning for llms: Tasks, methods, and challenges

    Nianwen Si, Hao Zhang, Heyu Chang, Wenlin Zhang, Dan Qu, and Weiqiang Zhang. Knowledge unlearning for llms: Tasks, methods, and challenges. arXiv preprint arXiv:2311.15766,

  22. [22]

    Unrolling sgd: Understanding factors influencing machine unlearning

    Anvith Thudi, Gabriel Deza, Varun Chandrasekaran, and Nicolas Papernot. Unrolling sgd: Understanding factors influencing machine unlearning. In 2022 IEEE 7th European Symposium on Security and Privacy (EuroS&P), pp. 303–319. IEEE,

  23. [23]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288,

  24. [24]

    Kga: A general machine unlearning framework based on knowledge gap alignment

    Lingzhi Wang, Tong Chen, Wei Yuan, Xingshan Zeng, Kam-Fai Wong, and Hongzhi Yin. Kga: A general machine unlearning framework based on knowledge gap alignment. arXiv preprint arXiv:2305.06535,

  25. [25]

    Machine unlearning of pre-trained large language models

    Jin Yao, Eli Chien, Minxin Du, Xinyao Niu, Tianhao Wang, Zezhou Cheng, and Xiang Yue. Machine unlearning of pre-trained large language models. arXiv preprint arXiv:2402.15159,

  26. [26]

    Large language model unlearning

    Yuanshun Yao, Xiaojun Xu, and Yang Liu. Large language model unlearning. arXiv preprint arXiv:2310.10683,

  27. [27]

    Eq. (15), internal anchor

    Therefore, under this condition, we have
    $$\mathrm{pred}_i\!\big(b^{(t)}_{\mathrm{NPO},i}\big)
      = \frac{1}{1+\exp\!\big((1-2y_i)\,c_{\mathrm{init},i}+(1-2y_i)\,b^{(t)}_{\mathrm{NPO},i}\big)}
      = \frac{\exp\!\big((2y_i-1)\,b^{(t)}_{\mathrm{NPO},i}\big)}{\exp\!\big((2y_i-1)\,b^{(t)}_{\mathrm{NPO},i}\big)+\exp\!\big((1-2y_i)\,c_{\mathrm{init},i}\big)}
      \in \left[\frac{\exp\!\big((2y_i-1)\,b^{(t)}_{\mathrm{NPO},i}\big)}{1+\exp\!\big((1-2y_i)\,c_{\mathrm{init},i}\big)},\;
                \frac{\exp\!\big((2y_i-1)\,b^{(t)}_{\mathrm{NPO},i}\big)}{\exp\!\big((1-2y_i)\,c_{\mathrm{init},i}\big)}\right]. \tag{15}$$
    Putting Eq. (14) and (15) together and recalling that |c_init,i| ≤ ...

  28. [28]

    Initial model and retrained model. We create the initial model πref and the retrained model πretr via optimizing over θ using the cross-entropy loss on the entire dataset D = DFG ∪ DRT and the retain dataset DRT, respectively. Concretely, initializing at θ = 0128, we run gradient descent for 20000 steps with the learning rate equals 0.05 to obtain the ini...

  29. [29]

    We first present a detailed explanation of the metrics, the baseline methods and the hyperparameters in the experiments

    D Experiments on the TOFU dataset In this section, we provide details of the experiments on the TOFU dataset (Maini et al., 2024). We first present a detailed explanation of the metrics, the baseline methods and the hyperparameters in the experiments. Then, we provide the full results on different levels of the tasks. 23 Method learning rate β α = 1 α = 0...

  30. [30]

    TOFU contains 200 fictitious author profiles, each consisting of 20 question-answering pairs generated by GPT-4 based on a set of predefined attributes

    designed for measuring the unlearning methods for LLMs. TOFU contains 200 fictitious author profiles, each consisting of 20 question-answering pairs generated by GPT-4 based on a set of predefined attributes. These profiles are fictitious and do not exist in the pre-training data, providing a controlled environment for studying unlearning LLMs. TOFU intro...

  31. [31]

    Finally, we compute the averaged Truth Ratio on the three datasets above

    across these datasets, a metric that evaluates the accuracy of the model’s response compared to the reference answers. Finally, we compute the averaged Truth Ratio on the three datasets above. The Truth Ratio defined in Maini et al. (2024) measures how likely the unlearned model will give a correct answer versus an incorrect one. More specifically, given ...

  32. [32]

    Instead of simply averaging them, they test whether the distribution of the Truth Ratio computed from the unlearned and the retrained models are indistinguishable

    on each question-answer pair from the forget set. Instead of simply averaging them, they test whether the distribution of the Truth Ratio computed from the unlearned and the retrained models are indistinguishable. More specifically, they perform the Kolmogorov-Smirnov (KS) test and compute the p-value of the test. A large p-value indicates that the two mo...
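The indistinguishability check described in these anchors (a two-sample KS test on Truth Ratio distributions from the unlearned and retrained models) can be sketched end to end. A minimal pure-Python version; in practice `scipy.stats.ks_2samp` would be used, and the asymptotic p-value below follows the standard Kolmogorov series approximation:

```python
import bisect
import math

def ks_statistic(a, b):
    # Two-sample KS statistic: largest gap between the empirical CDFs
    # of the two Truth Ratio samples (unlearned vs. retrained model).
    a, b = sorted(a), sorted(b)
    d = 0.0
    for v in a + b:
        ca = bisect.bisect_right(a, v) / len(a)
        cb = bisect.bisect_right(b, v) / len(b)
        d = max(d, abs(ca - cb))
    return d

def ks_pvalue(d, n, m):
    # Asymptotic p-value via the Kolmogorov series (Numerical Recipes form).
    # A large p-value means the two distributions look indistinguishable.
    en = n * m / (n + m)
    lam = (math.sqrt(en) + 0.12 + 0.11 / math.sqrt(en)) * d
    if lam < 1e-3:          # series degenerates as lam -> 0; p -> 1
        return 1.0
    s = sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
            for k in range(1, 101))
    return max(0.0, min(1.0, 2.0 * s))
```

Usage would be `ks_pvalue(ks_statistic(u, r), len(u), len(r))` for per-question Truth Ratio samples `u` (unlearned) and `r` (retrained); names here are illustrative.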

  33. [33]

    IDK-based Methods (’I don’t know’)

    Table 2: The weights for different components in GA-based loss functions and the IDK+RT loss.

    Loss    c_GA  c_FG  c_RT  c_FG^KL  c_RT^KL
    GA      1     0     0     0        0
    GA+RT   1     0     1     0        0
    GA+KL   1     0     0     0        1
    IDK+RT  0     1     1     0        0

    IDK-based Methods ('I don't know'). Maini et al. (2024) proposed IDK+RT, which is a supervised loss function comprising the retain loss and the IDK loss term. The IDK loss term ...

  34. [34]

    In the DPO loss, we take ’I don’t know’ or its variants as positive responses and the answers in the forget set as negative responses

    and its variants by adding either the retain loss or the KL divergence on the retain set. In the DPO loss, we take ’I don’t know’ or its variants as positive responses and the answers in the forget set as negative responses. We use β = 0 .1 in all DPO-based experiments, which is commonly recognized as the optimal inverse temperature in most cases. KTO-bas...