
arXiv: 2606.06320 · v1 · pith:AJEG4UMEnew · submitted 2026-06-04 · 💻 cs.LG · cs.AI · cs.CL

Learning What to Forget: Improving LLM Unlearning via Learned Token-Level Importance

Pith reviewed 2026-06-28 02:14 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords machine unlearning · large language models · token importance · joint optimization · forget retain tradeoff · TOFU · RWKU

The pith

Joint optimization of model parameters and token weights recovers oracle forget-specific tokens under a separation condition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that a token's relevance for forgetting is determined by whether minimizing the forget loss on it conflicts with the retain objective. The authors formalize this as a joint optimization over the model parameters and the token weights, and show that under a natural separation condition the objective recovers the true forget-specific tokens. They introduce ATWU (Alternating Token-Weighted Unlearning), an alternating optimization that uses a linear scorer on hidden states to learn these weights without external supervision. On the TOFU and RWKU benchmarks, ATWU achieves better forget-retain trade-offs than sample-level methods, probability-based heuristics, and auxiliary-model approaches. This matters because precise token-level unlearning can remove unwanted knowledge from LLMs more effectively while preserving general performance.

Core claim

Under a natural separation condition, the joint optimization over model parameters and token weights recovers the oracle forget-specific token support. Motivated by this, ATWU learns token forget-specificity via a simple linear scorer over hidden states during unlearning. It achieves state-of-the-art forget-retain trade-offs on TOFU and RWKU, and its learned scores align better with ground-truth forget-specific spans than those of prior methods.

What carries the argument

The joint optimization problem over model parameters and token weights, solved by alternating updates where token weights are produced by a linear scorer on the model's hidden states.
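
As a rough illustration of the mechanism described above, the sketch below maps per-token hidden states to weights with a linear scorer and uses them to reweight per-token forget losses. It is a minimal sketch, not the authors' implementation: the sigmoid squashing, the normalization by total weight, and the names `token_weights`, `weighted_forget_loss`, `scorer_w`, and `scorer_b` are assumptions introduced here. The alternating schedule (interleaving scorer updates with model updates) is not shown.

```python
import math


def token_weights(hidden_states, scorer_w, scorer_b=0.0):
    """Linear scorer over hidden states, squashed to (0, 1) by a sigmoid.

    The sigmoid is an assumption; the source only says "linear scorer".
    """
    weights = []
    for h in hidden_states:
        logit = sum(hi * wi for hi, wi in zip(h, scorer_w)) + scorer_b
        weights.append(1.0 / (1.0 + math.exp(-logit)))
    return weights


def weighted_forget_loss(token_losses, weights):
    """Weighted mean of per-token forget losses (normalization is an assumption)."""
    total = sum(weights)
    return sum(w * l for w, l in zip(weights, token_losses)) / total


# Toy example: 3 tokens with 2-dimensional hidden states (all values made up).
hidden = [[2.0, 0.0], [0.0, 0.0], [-2.0, 0.0]]
w = token_weights(hidden, scorer_w=[1.0, 0.0])
loss = weighted_forget_loss([1.0, 2.0, 3.0], w)
```

In this toy setup the first token receives the largest weight, so its loss dominates the weighted average; in an actual alternating scheme the scorer parameters would themselves be updated between model steps.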

If this is right

  • ATWU outperforms sample-level methods, probability-based heuristics, and auxiliary-model approaches on forget-retain trade-offs.
  • The learned token scores align substantially better with ground truth forget-specific spans.
  • Token-level forget-specificity can be learned unsupervised directly from model representations with minimal computational overhead.
  • Retain conflict provides an effective criterion for identifying what to forget in language models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • This could extend to identifying important tokens in other model adaptation tasks like fine-tuning or editing.
  • If the separation condition is approximately satisfied in practice, it may enable more targeted and efficient unlearning in deployed systems.
  • Similar conflict-based weighting might improve performance in related areas such as differential privacy or continual learning.
  • Future work could test whether the linear scorer generalizes across different model architectures without retraining.

Load-bearing premise

The natural separation condition between forget and retain objectives must hold for the joint optimization to recover the oracle forget-specific token support.

What would settle it

Running ATWU on a synthetic dataset where forget and retain objectives are deliberately entangled and observing that the recovered token support does not match the known oracle would falsify the recovery claim.
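
One way to score such a synthetic experiment is to threshold the learned token scores and compare the result with the known oracle support. The helper below is a hypothetical sketch: the function name `support_recovery`, the 0.5 threshold, and the toy scores are assumptions made for illustration, not values from the paper.

```python
def support_recovery(scores, oracle_mask, threshold=0.5):
    """Compare thresholded token scores with a known oracle support."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and o for p, o in zip(predicted, oracle_mask))
    fp = sum(p and not o for p, o in zip(predicted, oracle_mask))
    fn = sum((not p) and o for p, o in zip(predicted, oracle_mask))
    # Conventions for empty denominators are a choice made in this sketch.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return {"precision": precision, "recall": recall, "exact": fp == 0 and fn == 0}


# Made-up scores for a 5-token sample whose oracle support is tokens 1 and 2.
result = support_recovery(
    [0.1, 0.9, 0.8, 0.6, 0.2],
    [False, True, True, False, False],
)
```

Here one non-oracle token crosses the threshold, so recall is perfect but the support is not recovered exactly; a systematic mismatch of this kind on an entangled dataset is what the proposed test would look for.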

Figures

Figures reproduced from arXiv: 2606.06320 by Giorgos Nikolaou, Gizem Yüce, Nicolas Flammarion.

Figure 1. (a) Token-level scores on a TOFU forget sample. SEUL, SU-LLM, and SATIMP assign substantial weight to both forget-specific and structural tokens, whereas ATWU concentrates on the bold ground-truth forget-specific span. (b) ATWU can be combined with diverse forget losses. For each objective, DPO, NPO, SIMNPO, and SATGA, solid bars denote baseline Unlearning Quality (UQ), and added segments denote the gains …
Figure 2. Per-sample AUROC distributions for token-level forget-specificity on TOFU forget10, scored against the ground-truth token labels of Zhou et al. [2026]. Labels report mean ± std across forget samples. Method-to-scoring-criterion mapping in …
Figure 3. Token-level forget-specificity from SEUL, SU-LLM, SU-NGRAM, FUNDIAL, ETW, WGA, SATIMP, and ATWU on two TOFU forget samples. Shading reflects each method's raw token score; bold spans mark the ground-truth forget-specific tokens; ATWU concentrates most clearly on the answer-bearing spans.
Figure 4. Evolution of ATWU's token scores during training. The scorer starts from uniform scores and gradually concentrates on the ground-truth forget-specific span.
Figure 5. Progression of the per-sample AUROC distribution of ATWU's learned token scores during training on TOFU …
Figure 6. Learned ATWU scores on unseen real-entity examples. The examples use TOFU-style prompts with fictional entities replaced by real people and factual information.
Figure 7. Agreement between the GPT-5.4 mini judge and human annotators on a 440-row sample of TOFU forget01. The judge matches the human label on ∼96% of calls; errors split as 8 false positives and 7 false negatives.
Figure 8. Supervised baseline. A linear scorer trained with binary cross-entropy against the GT labels of Zhou et al. [2026] cleanly recovers the forget span after a fraction of an epoch. Bold tokens mark the ground-truth forget-relevant span.
Figure 9. Unsupervised scorer. Trained against the original target model (top), the ATWU objective fails to localize the GT forget span; trained against a retain model (bottom), the same objective recovers the answer-bearing tokens. Bold tokens mark the ground-truth forget-relevant span.
Figure 10. ROC curves for token-level forget-relevance detection on TOFU forget10. ATWU obtains the highest AUROC among the compared scoring methods; the dashed diagonal denotes random scoring.
Original abstract

Machine unlearning aims to remove targeted knowledge from a trained model while preserving its general capabilities. For autoregressive language models, not all tokens in a forget sample are equally relevant to forgetting. Existing approaches either ignore this heterogeneity or rely on auxiliary models, heuristics, or external annotations to estimate each token's relevance for forgetting. We instead characterize it through the interaction with the retain objective: a token is forget-specific to the extent that minimizing the forget loss on that token does not conflict with retain optimality. We formalize this perspective as a joint optimization problem over the model parameters and the token weights and show that, under a natural separation condition, the resulting objective recovers the oracle forget-specific token support. Motivated by this formulation, we introduce Alternating Token-Weighted Unlearning (ATWU), a lightweight framework that jointly learns token forget-specificity and model parameters during unlearning using a simple linear scorer over the hidden states, without external token level supervision. Across TOFU and RWKU, ATWU achieves state of the art forget-retain trade-offs, outperforming sample-level methods, probability-based token weighting heuristics, and auxiliary-model-based approaches. Moreover, the learned scores align substantially better with ground truth forget-specific spans, indicating that ATWU identifies semantically meaningful token level forgetting signals. Overall, our results suggest that retain conflict provides an effective criterion for identifying what language models should forget, enabling unsupervised learning of token level forget-specificity directly from model representations with minimal computational overhead.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that token forget-specificity in LLM unlearning can be characterized by the degree to which minimizing forget loss on a token conflicts with retain optimality. It formalizes this as a joint optimization over model parameters and token weights, proves that under a natural separation condition the objective recovers the oracle forget-specific token support, and introduces ATWU, which learns a linear scorer over hidden states to estimate token weights without external supervision. Experiments on TOFU and RWKU show ATWU achieves SOTA forget-retain trade-offs over sample-level, heuristic, and auxiliary-model baselines, with learned scores aligning better with ground-truth forget spans.

Significance. If the separation condition holds and the empirical gains are robust, the work supplies a lightweight, unsupervised criterion for token-level unlearning that avoids auxiliary models or annotations. The alignment of learned scores with ground-truth spans and the outperformance on standard benchmarks indicate a practical advance in precise, retain-preserving unlearning for autoregressive models.

major comments (3)
  1. [§3] §3 (formalization and separation condition): The central theoretical claim—that joint optimization recovers the oracle forget-specific support—rests on an unstated or only informally described 'natural separation condition.' No explicit mathematical statement, proof sketch, or empirical check (e.g., gradient-overlap statistics on TOFU/RWKU) is provided; if the condition fails when retain and forget gradients overlap on many tokens, the derivation does not hold and ATWU reduces to an ad-hoc linear scorer whose superiority must be justified purely empirically.
  2. [§4] §4 (ATWU algorithm and linear scorer): The alternating optimization jointly learns the linear scorer and model parameters, yet no ablation isolates the contribution of the learned token weights versus the alternating schedule or the choice of hidden-state features; without these controls it is unclear whether the reported trade-off improvements are attributable to the proposed token-level mechanism.
  3. [Table 1 / §5] Table 1 / §5 (forget-retain trade-offs): The SOTA claims are presented without reported variance across random seeds or statistical significance tests against the strongest baselines (probability-based heuristics and auxiliary-model methods); a single-run comparison is insufficient to establish reliable superiority on TOFU and RWKU.
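
The gradient-overlap check suggested in the first major comment could be approximated along the following lines. This is a hedged sketch only: it assumes per-token forget gradients and a retain gradient are available as flat vectors, and the use of cosine similarity, the 0.5 cutoff, and the names `cosine` and `overlap_stats` are choices made here rather than anything specified by the paper or the report.

```python
import math


def cosine(u, v):
    """Cosine similarity between two flat gradient vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def overlap_stats(token_forget_grads, retain_grad, cutoff=0.5):
    """Per-token cosines with the retain gradient, plus the fraction above the cutoff."""
    cosines = [cosine(g, retain_grad) for g in token_forget_grads]
    overlapping = sum(abs(c) > cutoff for c in cosines)
    return cosines, overlapping / len(cosines)


# Made-up 2-dimensional gradients for three tokens.
grads = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
cosines, fraction = overlap_stats(grads, retain_grad=[1.0, 0.0])
```

A high fraction of tokens whose forget gradient is strongly aligned or anti-aligned with the retain gradient would suggest that the two objectives are entangled on those tokens, which is the regime the referee is concerned about.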
minor comments (2)
  1. Notation for the token-weight vector and the linear scorer parameters is introduced without a consolidated table of symbols, making it difficult to track definitions across the formalization and algorithm sections.
  2. Figure captions for the alignment plots with ground-truth spans should explicitly state the metric used (e.g., precision@K or IoU) and the number of samples averaged.
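
For reference, the Figure 2 caption on this page reports per-sample AUROC against ground-truth token labels. A minimal sketch of that kind of metric is given below; it uses the pairwise (rank-based) definition of AUROC, and the function names and toy inputs are assumptions for illustration, not the paper's evaluation code.

```python
from statistics import mean


def auroc(scores, labels):
    """Chance that a random positive token outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        return None  # undefined when a sample contains only one class
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_per_sample_auroc(samples):
    """Average AUROC over samples, skipping samples where it is undefined."""
    values = [auroc(s, y) for s, y in samples]
    values = [v for v in values if v is not None]
    return mean(values) if values else None


# Two made-up forget samples: (token scores, ground-truth token labels).
samples = [
    ([0.9, 0.2, 0.7, 0.4], [1, 0, 1, 0]),
    ([0.6, 0.4, 0.5, 0.5], [1, 0, 0, 1]),
]
per_sample = [auroc(s, y) for s, y in samples]
average = mean_per_sample_auroc(samples)
```

Unlike precision@K or IoU, this metric needs no threshold, which may be why a ranking-based score is convenient for comparing raw token scores across methods.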

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below, agreeing where revisions are needed to strengthen the manuscript.

Point-by-point responses
  1. Referee: [§3] §3 (formalization and separation condition): The central theoretical claim—that joint optimization recovers the oracle forget-specific support—rests on an unstated or only informally described 'natural separation condition.' No explicit mathematical statement, proof sketch, or empirical check (e.g., gradient-overlap statistics on TOFU/RWKU) is provided; if the condition fails when retain and forget gradients overlap on many tokens, the derivation does not hold and ATWU reduces to an ad-hoc linear scorer whose superiority must be justified purely empirically.

    Authors: We agree that the separation condition requires a more explicit treatment. In the revised manuscript we will add a formal mathematical statement of the condition (including the precise gradient non-overlap requirement), include a concise proof sketch in §3 or the appendix, and report an empirical verification via gradient-overlap statistics computed on the TOFU and RWKU datasets. These additions will clarify the scope of the theoretical guarantee and allow readers to assess when the recovery result applies. revision: yes

  2. Referee: [§4] §4 (ATWU algorithm and linear scorer): The alternating optimization jointly learns the linear scorer and model parameters, yet no ablation isolates the contribution of the learned token weights versus the alternating schedule or the choice of hidden-state features; without these controls it is unclear whether the reported trade-off improvements are attributable to the proposed token-level mechanism.

    Authors: We acknowledge the value of targeted ablations. The revision will include additional experiments that (i) compare ATWU against a uniform-weight baseline (removing the learned scorer), (ii) contrast the alternating schedule with a joint-optimization variant, and (iii) ablate different hidden-state feature choices (last-layer vs. averaged layers). These controls will isolate the contribution of the token-weighting mechanism and the optimization procedure. revision: yes

  3. Referee: [Table 1 / §5] Table 1 / §5 (forget-retain trade-offs): The SOTA claims are presented without reported variance across random seeds or statistical significance tests against the strongest baselines (probability-based heuristics and auxiliary-model methods); a single-run comparison is insufficient to establish reliable superiority on TOFU and RWKU.

    Authors: We agree that single-run results limit the strength of the empirical claims. In the revised manuscript we will rerun all methods with at least three random seeds, report mean and standard deviation for the forget-retain metrics, and include paired statistical significance tests (e.g., t-tests) against the strongest baselines. This will provide a more rigorous assessment of the observed improvements. revision: yes
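
The seed-level reporting described in the third response could be sketched as follows. This is an illustrative sketch under stated assumptions: the helper names and the per-seed numbers are made up, and only the paired t statistic is computed (turning it into a p-value would need a t-distribution table or a statistics library).

```python
import math
import statistics


def summarize(values):
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(values), statistics.stdev(values)


def paired_t_statistic(a, b):
    """Paired t statistic for per-seed scores of two methods (same seeds, same order)."""
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)
    if sd == 0.0:
        raise ValueError("all paired differences are identical; t is undefined")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))


# Hypothetical per-seed scores for two methods (not results from the paper).
method_a = [0.81, 0.84, 0.79]
method_b = [0.75, 0.80, 0.72]
mean_a, std_a = summarize(method_a)
t = paired_t_statistic(method_a, method_b)
```

With only three seeds the test has two degrees of freedom, so even large t values give weak evidence; more seeds would make the comparison more convincing.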

Circularity Check

0 steps flagged

No significant circularity; derivation is conditional and self-contained

Full rationale

The paper's core formalization is a joint optimization over parameters and token weights, with recovery of oracle support shown only under an explicitly invoked 'natural separation condition' between forget and retain objectives. This is a standard conditional theorem rather than a tautology or fit-by-construction. The ATWU linear scorer is learned jointly during unlearning without external supervision, but claims of alignment with ground-truth spans and SOTA trade-offs are presented as empirical outcomes on TOFU/RWKU, not as predictions forced by the inputs. No self-citations appear load-bearing, no ansatz is smuggled, and no known result is merely renamed. The derivation introduces an independent retain-conflict criterion and remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on one key domain assumption (separation condition) and introduces no new free parameters beyond the learned linear scorer weights; no invented entities are postulated.

axioms (1)
  • domain assumption: natural separation condition between forget and retain objectives
    Invoked to guarantee that the joint objective recovers the oracle forget-specific token support.



Reference graph

Works this paper leans on

65 extracted references · 12 canonical work pages

  1. [1]

    Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary Chase Lipton, and J Zico Kolter. 2024.

  2. [2]

    OpenUnlearning: Accelerating …

    Vineeth Dorna, Anmol Reddy Mekala, Wenlong Zhao, Andrew McCallum, J Zico Kolter, Zachary Chase Lipton, and Pratyush Maini. OpenUnlearning: Accelerating …, 2026.

  3. [3]

    Weijia Shi, Jaechan Lee, Yangsibo Huang, Sadhika Malladi, Jieyu Zhao, Ari Holtzman, Daogao Liu, Luke Zettlemoyer, Noah A. Smith, and Chiyuan Zhang. 2025.

  4. [4]

    Rethinking …

    Qizhou Wang, Jin Peng Zhou, Zhanke Zhou, Saebyeol Shin, Bo Han, and Kilian Q Weinberger. Rethinking …, 2025.

  5. [5]

    Unlearning That Lasts: Utility-Preserving, Robust, and Almost Irreversible Forgetting in LLMs

    Unlearning That Lasts: Utility-Preserving, Robust, and Almost Irreversible Forgetting in LLMs, 2025.

  6. [6]

    Measuring Massive Multitask Language Understanding

    Measuring Massive Multitask Language Understanding. In The Ninth International Conference on Learning Representations.

  7. [7]

    Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac'h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, ... doi:10.5281/zenodo.12608602

  8. [8]

    Exploring Criteria of Loss Reweighting to Enhance …

    Puning Yang, Qizhou Wang, Zhuo Huang, Tongliang Liu, Chengqi Zhang, and Bo Han. Exploring Criteria of Loss Reweighting to Enhance …, 2025.

  9. [9]

    Optuna: A next-generation hyperparameter optimization framework

    Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining.

  10. [10]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The …

  11. [12]

    Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

    Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone, 2024.

  12. [13]

    Zhuoran Jin, Pengfei Cao, Chenhao Wang, Zhitao He, Hongbang Yuan, Jiachun Li, Yubo Chen, Kang Liu, and Jun Zhao. 2024.

  13. [14]

    Selective forgetting: Advancing machine unlearning techniques and evaluation in language models

    Selective forgetting: Advancing machine unlearning techniques and evaluation in language models. In Proceedings of the AAAI Conference on Artificial Intelligence.

  14. [16]

    Undial: Self-distillation with adjusted logits for robust unlearning in large language models. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers).

  15. [17]

    Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Gabriel Mukobi, Nathan Helm-Burger, Rassin Lababidi, Lennart Justen, Andrew Bo Liu, Michael Chen, Isabelle Barrass, Oliver Zhang, Xiaoyuan Zhu, Rishub Tamirisa, Bh...

  16. [18]

    Not all tokens are meant to be forgotten

    Not all tokens are meant to be forgotten. In Proceedings of the AAAI Conference on Artificial Intelligence.

  17. [19]

    Knowledge unlearning for mitigating privacy risks in language models

    Knowledge unlearning for mitigating privacy risks in language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

  18. [20]

    Quark: Controllable text generation with reinforced unlearning

    Quark: Controllable text generation with reinforced unlearning. Advances in neural information processing systems.

  19. [21]

    Direct Preference Optimization: Your Language Model is Secretly a Reward Model

    Direct Preference Optimization: Your Language Model is Secretly a Reward Model. In Thirty-seventh Conference on Neural Information Processing Systems.

  20. [22]

    Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning

    Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning. In First Conference on Language Modeling.

  21. [23]

    Simplicity Prevails: Rethinking Negative Preference Optimization for …

    Chongyu Fan, Jiancheng Liu, Licong Lin, Jinghan Jia, Ruiqi Zhang, Song Mei, and Sijia Liu. Simplicity Prevails: Rethinking Negative Preference Optimization for …, 2026.

  22. [24]

    Scalable Extraction of Training Data from Aligned, Production Language Models

    Scalable Extraction of Training Data from Aligned, Production Language Models. In The Thirteenth International Conference on Learning Representations.

  23. [27]

    Quantifying Memorization Across Neural Language Models

    Quantifying Memorization Across Neural Language Models. In The Eleventh International Conference on Learning Representations.

  24. [28]

    Ronen Eldan and Mark Russinovich. 2023.

  25. [29]

    Right to be forgotten in the Era of large language models: implications, challenges, and solutions

    Right to be forgotten in the Era of large language models: implications, challenges, and solutions. AI and Ethics, 2025. doi:10.1007/s43681-024-00573-9

  26. [31]

    Machine Unlearning

    Lucas Bourtoule, Varun Chandrasekaran, Christopher A. Choquette-Choo, Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, and Nicolas Papernot. Machine Unlearning.

  27. [33]

    Forget What Matters, Keep the Rest: Selective Unlearning of Informative Tokens

    Forget What Matters, Keep the Rest: Selective Unlearning of Informative Tokens, 2026.

  28. [37]

    Understanding intermediate layers using linear classifier probes

    Understanding intermediate layers using linear classifier probes, 2017.

  29. [39]

    Phi-3 technical report: A highly capable language model locally on your phone, 2024

    Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, and Others. Phi-3 technical report: A highly capable language model locally on your phone, 2024. URL https://arxiv.org/abs/2404.14219

  30. [40]

    Optuna: A next-generation hyperparameter optimization framework

    Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pages 2623--2631, 2019

  31. [41]

    Understanding intermediate layers using linear classifier probes, 2017

    Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes, 2017. URL https://openreview.net/forum?id=ryF7rTqgl

  32. [42]

    Identifying and mitigating the security risks of generative ai

    Clark Barrett, Brad Boyd, Elie Bursztein, Nicholas Carlini, Brad Chen, Jihye Choi, Amrita Roy Chowdhury, Mihai Christodorescu, Anupam Datta, Soheil Feizi, Kathleen Fisher, Tatsunori Hashimoto, Dan Hendrycks, Somesh Jha, Daniel Kang, Florian Kerschbaum, Eric Mitchell, John Mitchell, Zulfikar Ramzan, Khawaja Shams, Dawn Song, Ankur Taly, and Diyi Yang. Iden...

  33. [43]

    Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James Glass. What do neural machine translation models learn about morphology? In Regina Barzilay and Min-Yen Kan, editors, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 861--872, Vancouver, Canada, July 2017. Associati...

  34. [44]

    Machine unlearning

    Lucas Bourtoule, Varun Chandrasekaran, Christopher A. Choquette-Choo, Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, and Nicolas Papernot. Machine unlearning. In 2021 IEEE Symposium on Security and Privacy (SP), pages 141--159, 2021. doi:10.1109/SP40001.2021.00019

  35. [45]

    Towards making systems forget with machine unlearning

    Yinzhi Cao and Junfeng Yang. Towards making systems forget with machine unlearning. In Proceedings of the 2015 IEEE Symposium on Security and Privacy, SP '15, page 463–480, USA, 2015. IEEE Computer Society. ISBN 9781467369497. doi:10.1109/SP.2015.35. URL https://doi.org/10.1109/SP.2015.35

  36. [46]

    Quantifying memorization across neural language models

    Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. Quantifying memorization across neural language models. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=TatRHT_1cK

  37. [47]

    Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. What does BERT look at? An analysis of BERT's attention. In Tal Linzen, Grzegorz Chrupała, Yonatan Belinkov, and Dieuwke Hupkes, editors, Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276--286, Florence, Italy, August 201...

  38. [48]

    Undial: Self-distillation with adjusted logits for robust unlearning in large language models

    Yijiang River Dong, Hongzhou Lin, Mikhail Belkin, Ramon Huerta, and Ivan Vulić. Undial: Self-distillation with adjusted logits for robust unlearning in large language models. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)...

  39. [49]

    Openunlearning: Accelerating LLM unlearning via unified benchmarking of methods and metrics

    Vineeth Dorna, Anmol Reddy Mekala, Wenlong Zhao, Andrew McCallum, J Zico Kolter, Zachary Chase Lipton, and Pratyush Maini. Openunlearning: Accelerating LLM unlearning via unified benchmarking of methods and metrics. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2026. URL https://openreview.ne...

  40. [50]

    Who's Harry Potter? Approximate unlearning in LLMs

    Ronen Eldan and Mark Russinovich. Who's Harry Potter? Approximate unlearning in LLMs. arXiv, October 2023. URL https://www.microsoft.com/en-us/research/publication/whos-harry-potter-approximate-unlearning-in-llms/

  41. [51]

    Simplicity prevails: Rethinking negative preference optimization for LLM unlearning

    Chongyu Fan, Jiancheng Liu, Licong Lin, Jinghan Jia, Ruiqi Zhang, Song Mei, and Sijia Liu. Simplicity prevails: Rethinking negative preference optimization for LLM unlearning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=JbvSQm5h1l

  42. [52]

    The language model evaluation harness, 07 2024

    Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac'h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The languag...

  43. [53]

    The Llama 3 herd of models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  44. [54]

    Measuring massive multitask language understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In The Ninth International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=d7KBjmI3GmQ

  45. [55]

    John Hewitt and Christopher D. Manning. A structural probe for finding syntax in word representations. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129--...

  46. [56]

    Knowledge unlearning for mitigating privacy risks in language models

    Joel Jang, Dongkeun Yoon, Sohee Yang, Sungmin Cha, Moontae Lee, Lajanugen Logeswaran, and Minjoon Seo. Knowledge unlearning for mitigating privacy risks in language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14389--14408, 2023

  47. [57]

    RWKU: Benchmarking real-world knowledge unlearning for large language models

    Zhuoran Jin, Pengfei Cao, Chenhao Wang, Zhitao He, Hongbang Yuan, Jiachun Li, Yubo Chen, Kang Liu, and Jun Zhao. RWKU: Benchmarking real-world knowledge unlearning for large language models. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URL https://openreview.net/forum?id=wOmtZ5FgMH

  48. [58]

    Copyright violations and large language models

    Antonia Karamolegkou, Jiaang Li, Li Zhou, and Anders Søgaard. Copyright violations and large language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7403--7412, Singapore, December 2023. Association for Computational Linguistics. doi:10.18653/v1/2...

  49. [59]

    Forget what matters, keep the rest: Selective unlearning of informative tokens, 2026

    Seunghee Koh, Sunghyun Baek, Youngdong Kim, and Junmo Kim. Forget what matters, keep the rest: Selective unlearning of informative tokens, 2026. URL https://arxiv.org/abs/2604.17785

  50. [60]

    Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Gabriel Mukobi, Nathan Helm-Burger, Rassin Lababidi, Lennart Justen, Andrew Bo Liu, Michael Chen, Isabelle Barrass, Oliver Zhang, Xiaoyuan Zhu, Rishub Tamirisa, Bhrugu Bharathi, Ariel Herbert-Voss, Cort B Breuer, Andy Zo...

  51. [61]

    Quark: Controllable text generation with reinforced unlearning

    Ximing Lu, Sean Welleck, Jack Hessel, Liwei Jiang, Lianhui Qin, Peter West, Prithviraj Ammanabrolu, and Yejin Choi. Quark: Controllable text generation with reinforced unlearning. Advances in Neural Information Processing Systems, 35:27591--27609, 2022

  52. [62]

    TOFU: A task of fictitious unlearning for LLMs

    Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary Chase Lipton, and J Zico Kolter. TOFU: A task of fictitious unlearning for LLMs. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=B41hNBoWLo

  53. [63]

    Scalable extraction of training data from aligned, production language models

    Milad Nasr, Javier Rando, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A. Feder Cooper, Daphne Ippolito, Christopher A. Choquette-Choo, Florian Tramèr, and Katherine Lee. Scalable extraction of training data from aligned, production language models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://open...

  54. [64]

    A Survey of Machine Unlearning

    Thanh Tam Nguyen, Thanh Trung Huynh, Zhao Ren, Phi Le Nguyen, Alan Wee-Chung Liew, Hongzhi Yin, and Quoc Viet Hung Nguyen. A survey of machine unlearning. ACM Trans. Intell. Syst. Technol., 16(5), September 2025. ISSN 2157-6904. doi:10.1145/3749987. URL https://doi.org/10.1145/3749987

  55. [65]

    GPT-5.4 mini

    OpenAI. GPT-5.4 mini, 2026. URL https://developers.openai.com/api/docs/models/gpt-5.4-mini. OpenAI API documentation

  56. [66]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=HPuSIXJaa9

  57. [67]

    MUSE: Machine unlearning six-way evaluation for language models

    Weijia Shi, Jaechan Lee, Yangsibo Huang, Sadhika Malladi, Jieyu Zhao, Ari Holtzman, Daogao Liu, Luke Zettlemoyer, Noah A. Smith, and Chiyuan Zhang. MUSE: Machine unlearning six-way evaluation for language models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=TArmA033BU

  58. [68]

    Unlearning that lasts: Utility-preserving, robust, and almost irreversible forgetting in llms

    Naman Deep Singh, Maximilian Müller, Francesco Croce, and Matthias Hein. Unlearning that lasts: Utility-preserving, robust, and almost irreversible forgetting in llms, 2025. URL https://arxiv.org/abs/2509.02820

  59. [69]

    BERT rediscovers the classical NLP pipeline

    Ian Tenney, Dipanjan Das, and Ellie Pavlick. BERT rediscovers the classical NLP pipeline. In Anna Korhonen, David Traum, and Lluís Màrquez, editors, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593--4601, Florence, Italy, July 2019. Association for Computational Linguistics. doi:10.18653/v1/P19-14...

  60. [70]

    Not every token needs forgetting: Selective unlearning balancing forgetting and utility in large language models

    Yixin Wan, Anil Ramakrishna, Kai-Wei Chang, Volkan Cevher, and Rahul Gupta. Not every token needs forgetting: Selective unlearning balancing forgetting and utility in large language models. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Findings of the Association for Computational Linguistics: EMNLP 2025, page...

  61. [71]

    Selective forgetting: Advancing machine unlearning techniques and evaluation in language models

    Lingzhi Wang, Xingshan Zeng, Jinsong Guo, Kam-Fai Wong, and Georg Gottlob. Selective forgetting: Advancing machine unlearning techniques and evaluation in language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 843--851, 2025a

  62. [72]

    Rethinking LLM unlearning objectives: A gradient perspective and go beyond

    Qizhou Wang, Jin Peng Zhou, Zhanke Zhou, Saebyeol Shin, Bo Han, and Kilian Q Weinberger. Rethinking LLM unlearning objectives: A gradient perspective and go beyond. In The Thirteenth International Conference on Learning Representations, 2025b. URL https://openreview.net/forum?id=huo8MqVH6t

  63. [73]

    Exploring criteria of loss reweighting to enhance LLM unlearning

    Puning Yang, Qizhou Wang, Zhuo Huang, Tongliang Liu, Chengqi Zhang, and Bo Han. Exploring criteria of loss reweighting to enhance LLM unlearning. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=mGOugCZlAq

  64. [74]

    Negative preference optimization: From catastrophic collapse to effective unlearning

    Ruiqi Zhang, Licong Lin, Yu Bai, and Song Mei. Negative preference optimization: From catastrophic collapse to effective unlearning. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=MXLBXjQkmb

  65. [75]

    Not all tokens are meant to be forgotten

    Xiangyu Zhou, Yao Qiang, Saleh Zare Zade, Douglas Zytko, Prashant Khanduri, and Dongxiao Zhu. Not all tokens are meant to be forgotten. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 38173--38182, 2026