Distinguishable Deletion: Unifying Knowledge Erasure and Refusal for Large Language Model Unlearning

Bo Han; Junchi Yu; Philip Torr; Puning Yang; Qizhou Wang; Xiuying Chen

arxiv: 2605.16776 · v1 · pith:QNLXS4CHnew · submitted 2026-05-16 · 💻 cs.LG · cs.AI

Puning Yang, Junchi Yu, Qizhou Wang, Philip Torr, Bo Han, Xiuying Chen

Pith reviewed 2026-05-19 21:45 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords: LLM unlearning · knowledge erasure · refusal mechanisms · energy-based models · latent representations · safe AI · Distinguishable Deletion

The pith

Distinguishable Deletion unifies knowledge erasure and refusal by restricting response distributions in latent space for LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Distinguishable Deletion (D²) to address two weaknesses of existing unlearning approaches for large language models. Knowledge deletion methods delete in a biased way because they suppress specific token sequences instead of removing the underlying knowledge, while refusal methods leave the knowledge intact and risk its re-emergence. By restricting the response distribution in the latent representation and distinguishing unlearned from retained knowledge, the new paradigm enables both erasure and safe refusal. An energy index measures this separation; it supports alignment during training and refusal at inference, and the paper reports that the resulting method, EUA, outperforms previous methods.

Core claim

Distinguishable Deletion (D²) restricts the response distribution in the latent representation rather than specific tokens to erase undesirable knowledge, while distinguishing it from retained knowledge, enabling a refusal mechanism to handle unlearned inputs safely and coherently. This is implemented using an energy index that quantifies the presence of knowledge and the separation between unlearned and retained content. Mathematical and empirical analyses support that energy is accurate and efficient, allowing Energy-based Unlearning Alignment (EUA) to enforce energy-boundary unlearning during training and apply an energy-based refusal mechanism at inference.
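This page does not reproduce the paper's training objective, so the following is only a sketch of what "energy-boundary unlearning" could look like. It uses a generic squared-hinge margin penalty of the kind used in energy-based out-of-distribution detection, not the authors' loss. The function names, the two boundary values, and the direction of the boundary (forget-set energy pushed up, retain-set energy kept low) are all assumptions made for illustration.

```python
def squared_hinge(value):
    """Squared hinge: zero when value <= 0, quadratic otherwise."""
    return max(0.0, value) ** 2


def energy_boundary_penalty(forget_energies, retain_energies,
                            forget_floor, retain_ceiling):
    """Hypothetical boundary penalty (an assumption, not the paper's loss).

    Penalizes forget-set energies that fall below `forget_floor` and
    retain-set energies that rise above `retain_ceiling`.
    """
    forget_term = sum(squared_hinge(forget_floor - e)
                      for e in forget_energies) / len(forget_energies)
    retain_term = sum(squared_hinge(e - retain_ceiling)
                      for e in retain_energies) / len(retain_energies)
    return forget_term + retain_term


# Hypothetical energy values, chosen only to exercise both terms.
violating_penalty = energy_boundary_penalty(
    forget_energies=[-2.0, -6.0],   # -6.0 is below the floor of -3.0
    retain_energies=[-9.0, -4.0],   # -4.0 is above the ceiling of -5.0
    forget_floor=-3.0,
    retain_ceiling=-5.0,
)
separated_penalty = energy_boundary_penalty(
    forget_energies=[-1.0, -2.0],
    retain_energies=[-8.0, -9.0],
    forget_floor=-3.0,
    retain_ceiling=-5.0,
)
```

With these toy values the penalty is nonzero only for the samples that cross their boundary, and it vanishes once the two groups sit on opposite sides of the gap between the boundaries.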

What carries the argument

The energy index, which quantifies the presence of knowledge and the separation between unlearned and retained content in the latent representations.
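As a rough illustration, the energy definition quoted in the Lean-theorem section of this page, E = -T · log Σ_v exp(z_v / T), can be evaluated from a single next-token logit vector. The sketch below also computes Maximum Softmax Probability (MSP), which the Figure 5 caption mentions alongside energy. The function names and toy logits are hypothetical, and the code assumes the logits are already available as a plain list of floats.

```python
import math


def energy_index(logits, temperature=1.0):
    """Energy of one logit vector: -T * log(sum_v exp(z_v / T))."""
    scaled = [z / temperature for z in logits]
    z_max = max(scaled)  # subtract the max for numerical stability
    log_sum_exp = z_max + math.log(sum(math.exp(z - z_max) for z in scaled))
    return -temperature * log_sum_exp


def max_softmax_probability(logits, temperature=1.0):
    """Largest softmax probability for the same logit vector."""
    scaled = [z / temperature for z in logits]
    z_max = max(scaled)
    exps = [math.exp(z - z_max) for z in scaled]
    return max(exps) / sum(exps)


# Hypothetical toy logits over a 4-token vocabulary.
peaked_logits = [10.0, 0.0, 0.0, 0.0]  # one token strongly preferred
flat_logits = [0.0, 0.0, 0.0, 0.0]     # no token preferred

peaked_energy = energy_index(peaked_logits)
flat_energy = energy_index(flat_logits)
flat_msp = max_softmax_probability(flat_logits)
```

The peaked vector yields a lower (more negative) energy than the flat one, and the computation needs only one forward pass worth of logits, with no sampling.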

Load-bearing premise

The energy index accurately quantifies the presence of knowledge and the separation between unlearned and retained content in latent representations.

What would settle it

An experiment where the energy index does not create a clear boundary, resulting in either incomplete erasure of sensitive knowledge or unintended suppression of retained knowledge.
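One way such an experiment could score the boundary is a pairwise separation statistic (AUROC) over per-sample energies from the forget set and the retain set. This is a sketch under assumptions: the function name and the sample values are hypothetical, and it assumes unlearned inputs are expected to have higher energy than retained ones.

```python
def separation_auroc(forget_energies, retain_energies):
    """Probability that a random forget-set energy exceeds a random
    retain-set energy; ties count as half. 1.0 means no overlap."""
    wins = 0.0
    for f in forget_energies:
        for r in retain_energies:
            if f > r:
                wins += 1.0
            elif f == r:
                wins += 0.5
    return wins / (len(forget_energies) * len(retain_energies))


# Hypothetical energies: a clean boundary versus an overlapping one.
clean_score = separation_auroc([-1.0, -2.0, -1.5], [-8.0, -7.0, -9.0])
overlap_score = separation_auroc([-1.0, -6.0, -8.0], [-7.0, -5.0, -9.0])
```

A score near 1.0 would indicate a clear boundary; a score drifting toward 0.5 would indicate the overlap that this section describes as the failure case.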

Figures

Figures reproduced from arXiv: 2605.16776 by Bo Han, Junchi Yu, Philip Torr, Puning Yang, Qizhou Wang, Xiuying Chen.

**Figure 1.** Motivation and overview of our work. Left: Existing unlearning methods fall short in overall performance and general reliability: KD-based unlearning often produces unstable outputs, while DR-based unlearning is highly vulnerable to adversarial attacks. Right: These limitations in practicality and reliability motivate us to explore a new unlearning paradigm, Distinguishable Deletion (D²), equipped with a…

**Figure 2.** Energy dynamics reveal the instability of KD-based unlearning and motivate EUA. (a) GradDiff reduces targeted logits while unconstrainedly increasing other-label logits. (b) This corresponds to decreasing and increasing negative energy, respectively; the overall energy first decreases and then increases, indicating a transition from under- to over-unlearning. (c)(d) Improved GradDiff-based methods achieve …

**Figure 3.** The robustness evaluations of EUA on TOFU with LLaMA-3.2-3B. Detailed values are shown in …

**Figure 4.** Energy distribution for existing KD-based methods. Results are obtained on TOFU-5% with LLaMA-3.2-3B. Training trajectory for energy. Furthermore, we present additional training trajectories and analyze the training dynamics of KD-based methods through Maximum Softmax Probability (MSP), which provides a more interpretable perspective. As shown in …

**Figure 5.** Trajectories of energy and Maximum Softmax Probability (MSP). Results are obtained on TOFU-5% using LLaMA-3.2-3B. Efficiency of the energy index. The proposed energy index provides an efficient mechanism for estimating knowledge presence. While several recent works pursue similar goals, they rely on sampling-based generation methods, which require multiple samples (Li et al., 2026) or repeated output sampl…

**Figure 6.** Trajectory results during training on TOFU benchmark. Panels: (a) GradDiff, (b) WGA, (c) SatImp, … x-axis: Training Steps; y-axis: Performance; series: VerbMem, KnowMem, UtilPres.

**Figure 7.** Trajectory results during training on MUSE benchmark.

**Figure 8.** Energy changes under different attack strategies. We observe that the relearning attack is substantially more aggressive than prompt-based attacks. To further evaluate robustness, we conduct additional experiments on the MUSE benchmark and compare EUA with prior methods. Results shown in …
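
The Figure 2 caption ties logit changes to changes in negative energy. A short derivation shows why, assuming the energy definition quoted in the Lean-theorem section of this page. Here $z_v$ is shorthand (introduced only for this note) for the logit $z_\theta(v \mid x, y_{<i})$.

```latex
-E_\theta = T \log \sum_{v \in V} e^{z_v / T},
\qquad
\frac{\partial (-E_\theta)}{\partial z_v}
  = \frac{e^{z_v / T}}{\sum_{u \in V} e^{z_u / T}}
  = \mathrm{softmax}(z / T)_v \; > \; 0 .
```

Every partial derivative is strictly positive, so lowering any logit lowers negative energy and raising any logit raises it. The net change therefore depends on which of the two effects dominates, which is consistent with the caption's description of energy first decreasing and then increasing.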

Original abstract

Mitigating sensitive and harmful outputs is fundamental to ensuring safe deployment of LLMs. Existing approaches typically follow two paradigms: Knowledge Deletion (KD), which erases undesirable information during training, and Distinguishable Refusal (DR), which steers models away from using sensitive knowledge during inference. Despite rapid progress, KD-based unlearning struggles with biased deletion due to suppressing specific token sequences as a substitute for complete knowledge removal, whereas DR-based unlearning risks the re-emergence of harmful knowledge because the underlying knowledge remains intact. To address these issues, we propose Distinguishable Deletion ($\mathrm{D^2}$), a paradigm that restricts the response distribution in the latent representation rather than specific tokens to erase undesirable knowledge, while distinguishing it from retained knowledge, enabling a refusal mechanism to handle unlearned inputs safely and coherently. To implement $\mathrm{D^2}$, we introduce an energy index that quantifies the presence of knowledge and the separation between unlearned and retained content. Mathematical and empirical analyses show that energy is both accurate and efficient, enabling Energy-based Unlearning Alignment (EUA) to enforce energy-boundary unlearning during training and apply an energy-based refusal mechanism at inference. Extensive experiments demonstrate that EUA significantly outperforms previous methods, indicating the superiority of $\mathrm{D^2}$. Our code is available at https://github.com/Puning97/EUA-for-LLM-Unlearning.
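The inference-time refusal described in the abstract can be pictured as a threshold gate on energy. The sketch below is an illustration under stated assumptions, not the released EUA code: averaging energy over token positions, the threshold value, the refusal string, and all function names are hypothetical, and it assumes that higher energy signals unlearned content.

```python
import math

# Hypothetical refusal text; the paper's actual wording is not given here.
REFUSAL_MESSAGE = "I'm sorry, but I can't help with that."


def token_energy(logits, temperature=1.0):
    """Energy of one logit vector: -T * log(sum_v exp(z_v / T))."""
    scaled = [z / temperature for z in logits]
    z_max = max(scaled)
    return -temperature * (
        z_max + math.log(sum(math.exp(z - z_max) for z in scaled))
    )


def mean_sequence_energy(per_token_logits, temperature=1.0):
    """Average energy over token positions (aggregation is an assumption)."""
    energies = [token_energy(row, temperature) for row in per_token_logits]
    return sum(energies) / len(energies)


def respond_or_refuse(per_token_logits, draft_response, energy_threshold,
                      temperature=1.0):
    """Return the draft unless mean energy exceeds the threshold."""
    if mean_sequence_energy(per_token_logits, temperature) > energy_threshold:
        return REFUSAL_MESSAGE
    return draft_response


# Hypothetical logits for two short sequences over a 4-token vocabulary.
confident_logits = [[10.0, 0.0, 0.0, 0.0], [9.0, 1.0, 0.0, 0.0]]
uncertain_logits = [[0.0, 0.0, 0.0, 0.0], [0.1, 0.0, 0.0, 0.1]]
threshold = -5.0  # hypothetical; would need calibration on held-out data

answer_kept = respond_or_refuse(confident_logits, "draft answer", threshold)
answer_refused = respond_or_refuse(uncertain_logits, "draft answer", threshold)
```

In this toy setup the confident sequence passes through unchanged, while the flat, high-energy sequence is replaced by the refusal message.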

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Distinguishable Deletion (D²) to unify knowledge deletion (KD) and distinguishable refusal (DR) for LLM unlearning. It introduces an energy index that quantifies knowledge presence and separation in latent representations, enabling Energy-based Unlearning Alignment (EUA) to enforce energy-boundary restrictions during training and an energy-based refusal mechanism at inference. Mathematical and empirical analyses are claimed to show that energy is accurate and efficient, with extensive experiments demonstrating that EUA outperforms prior methods.

Significance. If the energy index reliably separates unlearned from retained knowledge, this could offer a more robust alternative to existing KD and DR approaches by addressing biased deletion and knowledge re-emergence. The code release supports reproducibility, which is a strength for validating the empirical claims.

major comments (2)

[Abstract and §3] The central claim that the energy index enables complete erasure without collateral damage to retained knowledge rests on the untested assumption that unlearned content occupies a distinct energy region with no overlap. No derivation or analysis demonstrates preservation of this separation when a single fact is statistically entangled with retained facts via shared entities or reasoning chains in latent space.

[Empirical Evaluation] Tables and experiments test only isolated deletion tasks. Without ablations on entangled knowledge scenarios, the reported superiority of EUA over baselines does not yet establish generalizability of the D² paradigm.

minor comments (1)

[§3] The definition of the energy index would benefit from an explicit equation number referenced in the text when first introduced, to improve traceability of the mathematical analysis.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments on our paper. We address each of the major comments below and outline the revisions we will make to strengthen the manuscript.

Point-by-point responses

Referee: [Abstract and §3] The central claim that the energy index enables complete erasure without collateral damage to retained knowledge rests on the untested assumption that unlearned content occupies a distinct energy region with no overlap. No derivation or analysis demonstrates preservation of this separation when a single fact is statistically entangled with retained facts via shared entities or reasoning chains in latent space.

Authors: We thank the referee for highlighting this important point. Our analysis in Section 3 derives the energy index based on the separation in latent representations and shows through mathematical formulation that it can quantify and enforce boundaries between unlearned and retained knowledge. However, we acknowledge that the specific case of statistical entanglement through shared entities or reasoning chains is not explicitly analyzed for preservation of separation. In the revised manuscript, we will add a discussion and partial derivation addressing how the energy-based alignment can maintain separation even in entangled scenarios by leveraging the global energy distribution rather than local token dependencies.

Revision: yes

Referee: [Empirical Evaluation] Tables and experiments test only isolated deletion tasks. Without ablations on entangled knowledge scenarios, the reported superiority of EUA over baselines does not yet establish generalizability of the D² paradigm.

Authors: We appreciate this feedback on the empirical evaluation. The current experiments follow the standard benchmarks in the unlearning literature, which primarily use isolated deletion tasks to measure effectiveness. We agree that testing on entangled knowledge scenarios is crucial for demonstrating the generalizability of our D² paradigm. Accordingly, we will include additional ablation studies in the revised version, where we construct entangled knowledge sets (e.g., deleting a specific fact while retaining related facts sharing entities) and evaluate the performance of EUA compared to baselines in these settings.

Revision: yes

Circularity Check

1 step flagged

Self-introduced energy index creates moderate circularity in unlearning claims

specific steps

self-definitional [Abstract]
"To implement D², we introduce an energy index that quantifies the presence of knowledge and the separation between unlearned and retained content. Mathematical and empirical analyses show that energy is both accurate and efficient, enabling Energy-based Unlearning Alignment (EUA) to enforce energy-boundary unlearning during training and apply an energy-based refusal mechanism at inference."

The energy index is introduced by definition to quantify exactly the presence and separation needed for D²; the subsequent claim that analyses show the index is 'accurate' for enforcing unlearning therefore reduces to verifying properties built into the definition itself rather than an independent test of whether such a scalar proxy exists in entangled representations.

full rationale

The paper introduces an energy index specifically to implement the D² paradigm by quantifying knowledge presence and separation in latent space, then uses mathematical and empirical analyses to validate that this index enables accurate boundary enforcement and refusal. This creates a moderate circularity burden because the claimed accuracy and efficiency of the index are evaluated against the separation properties it was defined to produce, rather than against fully independent external benchmarks for entangled knowledge. However, the central derivation still contains independent empirical experiments on deletion tasks and comparisons to prior methods, so the overall result is not fully forced by construction. No self-citation chains or uniqueness theorems are load-bearing here.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the effectiveness of the newly introduced energy index for quantifying and separating knowledge in latent space; this is a domain assumption without independent external validation shown in the abstract.

axioms (1)

domain assumption: Restricting response distribution in latent representation erases undesirable knowledge while preserving distinction from retained knowledge.
Core premise of D² as stated in the abstract.

invented entities (1)

Energy index (no independent evidence)
purpose: Quantifies presence of knowledge and separation between unlearned and retained content.
Introduced to implement D² and EUA; no independent evidence outside the paper's analyses.

pith-pipeline@v0.9.0 · 5792 in / 1321 out tokens · 35579 ms · 2026-05-19T21:45:07.220707+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "we introduce an energy index that quantifies the presence of knowledge and the separation between unlearned and retained content... $E_\theta(x, y_{<i}) = -T \cdot \log \sum_{v \in V} e^{z_\theta(v \mid x, y_{<i})/T}$"

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · echoes

echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Paper passage: "Distinguishable Deletion (D²) paradigm that restricts the response distribution in the latent representation rather than specific tokens"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

130 extracted references · 130 canonical work pages · 21 internal anchors

[1] GPT-4 Technical Report. arXiv preprint arXiv:2303.08774.
[2] DeepSeek-V3 Technical Report. arXiv preprint arXiv:2412.19437.
[3] What Is Preference Optimization Doing, How and Why? arXiv preprint.
[4] Explainable LLM Unlearning through Reasoning. International Conference on Learning Representations.
[5] In-Context Unlearning: Language Models as Few-Shot Unlearners. ICML.
[6] Rethinking machine unlearning for large language models. Nature Machine Intelligence.
[7] Gender bias and stereotypes in large language models. Proceedings of the ACM collective intelligence conference.
[8] Copyright violations and large language models. EMNLP.
[9] More human than human: measuring ChatGPT political bias. Public Choice.
[10] Scalable Extraction of Training Data from (Production) Language Models. arXiv preprint arXiv:2311.17035.
[11] GRU: Mitigating the Trade-off between Unlearning and Retention for LLMs. ICML.
[12] Position: TrustLLM: Trustworthiness in large language models. ICML.
[13] TOFU: A task of fictitious unlearning for LLMs. CoLM.
[14] Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning. CoLM.
[15] Knowledge unlearning for mitigating privacy risks in language models. ACL.
[16] Rethinking LLM Unlearning Objectives: A Gradient Perspective and Go Beyond. ICLR.
[17] Exploring Criteria of Loss Reweighting to Enhance LLM Unlearning. ICML.
[18] LUNAR: LLM unlearning via neural activation redirection. NeurIPS.
[19] Not all data are unlearned equally. arXiv preprint arXiv:2504.05058.
[20] A Probabilistic Perspective on Unlearning and Alignment for Large Language Models. ICLR.
[21] Understanding the Dilemma of Unlearning for Large Language Models. arXiv preprint arXiv:2509.24675.
[22] LLM Unlearning with LLM Beliefs. ICLR.
[23] Leak@k: Unlearning Does Not Make LLMs Forget Under Probabilistic Decoding. arXiv preprint arXiv:2511.04934.
[24] A Fully Probabilistic Perspective on Large Language Model Unlearning: Evaluation and Optimization. EMNLP.
[25] Learn what you want to unlearn: Unlearning inversion attacks against machine unlearning. S&P.
[26] Introducing GPT-5.2. 2025.
[27] "Do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security.
[28] Towards LLM Unlearning Resilient to Relearning Attacks: A Sharpness-Aware Minimization Perspective and Beyond. ICML.
[29] Eight methods to evaluate robust unlearning in LLMs. arXiv preprint arXiv:2402.16835. URL https://openreview.net/forum?id=J5IRyTKZ9s
[30] A tutorial on energy-based learning. Predicting Structured Data.
[31] LLaMA: Open and Efficient Foundation Language Models. Touvron, Hugo and Lavril, Thibaut and Izacard, Gautier and Martinet, Xavier and Lachaux, Marie-Anne and Lacroix, Timoth. arXiv preprint arXiv:2302.13971.
[32] On the Fragility of Latent Knowledge: Layer-wise Influence under Unlearning in Large Language Model. ICML 2025 Workshop MUGen.
[33] Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv preprint arXiv:2307.09288.
[34] Simplicity Prevails: Rethinking Negative Preference Optimization for … Fan, Chongyu and Liu, Jiancheng and Lin, Licong and Jia, Jinghan and Zhang, Ruiqi and Mei, Song and Liu, Sijia.
[35] Differential privacy. ICALP.
[36] Eternal sunshine of the spotless net: Selective forgetting in deep networks. CVPR.
[37] Towards making systems forget with machine unlearning. S&P.
[38] Unsupervised word sense disambiguation rivaling supervised methods. ACL.
[39] Large language model safety: A holistic survey. arXiv preprint arXiv:2412.17686.
[40] A Sober Look at the Robustness of CLIPs to Spurious Features. NeurIPS.
[41] A comprehensive survey of machine unlearning techniques for large language models. arXiv preprint arXiv:2503.01854, 2025.
[42] Learning Dynamics of … Yi Ren and Danica J. Sutherland.
[43] Pareto multi-task learning. NeurIPS.
[44] Gradient episodic memory for continual learning. NeurIPS.
[45] Unrolling … Thudi, Anvith and Deza, Gabriel and Chandrasekaran, Varun and Papernot, Nicolas.
[46] The Platonic Representation Hypothesis. arXiv preprint arXiv:2405.07987.
[47] Textbooks Are All You Need II: phi-1.5 technical report. Li, Yuanzhi and Bubeck, S. arXiv preprint arXiv:2309.05463.
[48] Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347.
[49] ORPO: Monolithic Preference Optimization without Reference Model. arXiv preprint arXiv:2403.07691.
[50] Direct preference optimization: Your language model is secretly a reward model. NeurIPS.
[51] Towards Effective Evaluations and Comparisons for … Qizhou Wang and Bo Han and Puning Yang and Jianing Zhu and Tongliang Liu and Masashi Sugiyama.
[52] Security and privacy challenges of large language models: A survey. arXiv preprint arXiv:2402.00888.
[53] Can sensitive information be deleted from … Patil, Vaidehi and Hase, Peter and Bansal, Mohit.
[54] The … Li, Nathaniel and Pan, Alexander and Gopal, Anjali and Yue, Summer and Berrios, Daniel and Gatti, Alice and Li, Justin D. and Dombrowski, Ann-Kathrin and Goel, Shashwat and Mukobi, Gabriel and others.
[55] A Survey of Large Language Models. arXiv preprint arXiv:2303.18223.
[56] Generating Wikipedia by Summarizing Long Sequences. arXiv preprint arXiv:1801.10198.
[57] Robust fine-tuning of zero-shot models. CVPR.
[58] Language models are few-shot learners. NeurIPS.
[59] BloombergGPT: A Large Language Model for Finance. arXiv preprint arXiv:2303.17564.
[60] Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. ACL.
[61] Quantifying privacy risks of masked language models using membership inference attacks. ACL.
[62] Training language models to follow instructions with human feedback. NeurIPS.
[63] Sharpness-Aware Minimization for Efficiently Improving Generalization. arXiv preprint arXiv:2010.01412.
[64] Quantifying memorization across neural language models. ICLR.
[65] Positive-unlabeled learning with non-negative risk estimator. NeurIPS.
[66] OPT: Open Pre-trained Transformer Language Models. arXiv preprint arXiv:2205.01068.
[67] Task arithmetic in the tangent space: Improved editing of pre-trained models. NeurIPS.
[68] Decoupled weight decay regularization. ICLR.
[69] Knowledge Circuits in Pretrained Transformers. NeurIPS.
[70] Knowledge Conflicts for LLMs: A Survey. EMNLP.
[71] Provably robust … Chowdhury, Sayak Ray and Kini, Anush and Natarajan, Nagarajan.
[72] Editing factual knowledge in language models. arXiv preprint arXiv:2104.08164.
[73] Reinforcement Learning for LLM Post-Training: A Survey (also listed as: A Comprehensive Survey of LLM Alignment Techniques: RLHF, RLAIF, PPO, DPO and More). arXiv preprint arXiv:2407.16216.
[74] Eliciting Latent Predictions from Transformers with the Tuned Lens. arXiv preprint arXiv:2303.08112.
[75] Transformer feed-forward layers are key-value memories. EMNLP.
[76] Extracting training data from large language models. USENIX Security.
[77] Machine unlearning. S&P.
[78] Large Language Model Unlearning. NeurIPS.
[79] Code Llama: Open Foundation Models for Code. arXiv preprint arXiv:2308.12950.
[80] Trustworthy LLMs: A Survey and Guideline for Evaluating Large Language Models' Alignment. arXiv preprint arXiv:2308.05374.

Showing first 80 references.

[1] [1]

GPT-4 Technical Report

Gpt-4 technical report , author=. arXiv preprint arXiv:2303.08774 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

DeepSeek-V3 Technical Report

Deepseek-v3 technical report , author=. arXiv preprint arXiv:2412.19437 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Arxiv Preprint , year=

What Is Preference Optimization Doing, How and Why? , author=. Arxiv Preprint , year=

work page

[4] [4]

International Conference on Learning Representations , year=

Explainable LLM Unlearning through Reasoning , author=. International Conference on Learning Representations , year=

work page

[5] [5]

ICML , year=

In-Context Unlearning: Language Models as Few-Shot Unlearners , author=. ICML , year=

work page

[6] [6]

Nature Machine Intelligence , year=

Rethinking machine unlearning for large language models , author=. Nature Machine Intelligence , year=

work page

[7] [7]

Proceedings of the ACM collective intelligence conference , year=

Gender bias and stereotypes in large language models , author=. Proceedings of the ACM collective intelligence conference , year=

work page

[8] [8]

EMNLP , year=

Copyright violations and large language models , author=. EMNLP , year=

work page

[9] [9]

Public Choice , year=

More human than human: measuring ChatGPT political bias , author=. Public Choice , year=

work page

[10] [10]

Scalable Extraction of Training Data from (Production) Language Models

Scalable extraction of training data from (production) language models , author=. arXiv preprint arXiv:2311.17035 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

ICML , year=

GRU: Mitigating the Trade-off between Unlearning and Retention for LLMs , author=. ICML , year=

work page

[12] [12]

ICML , year=

Position: Trustllm: Trustworthiness in large language models , author=. ICML , year=

work page

[13] [13]

CoLM , year=

Tofu: A task of fictitious unlearning for llms , author=. CoLM , year=

work page

[14] [14]

CoLM , year=

Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning , author=. CoLM , year=

work page

[15] [15]

ACL , year=

Knowledge unlearning for mitigating privacy risks in language models , author=. ACL , year=

work page

[16] [16]

ICLR , year=

Rethinking LLM Unlearning Objectives: A Gradient Perspective and Go Beyond , author=. ICLR , year=

work page

[17] [17]

ICML , year=

Exploring Criteria of Loss Reweighting to Enhance LLM Unlearning , author=. ICML , year=

work page

[18] [18]

NeurIPS , year=

Lunar: Llm unlearning via neural activation redirection , author=. NeurIPS , year=

work page

[19] [19]

arXiv preprint arXiv:2504.05058 , year=

Not all data are unlearned equally , author=. arXiv preprint arXiv:2504.05058 , year=

work page arXiv

[20] A Probabilistic Perspective on Unlearning and Alignment for Large Language Models. ICLR.

[21] Understanding the Dilemma of Unlearning for Large Language Models. arXiv preprint arXiv:2509.24675.

[22] LLM Unlearning with LLM Beliefs. ICLR.

[23] Leak@k: Unlearning Does Not Make LLMs Forget Under Probabilistic Decoding. arXiv preprint arXiv:2511.04934.

[24] A Fully Probabilistic Perspective on Large Language Model Unlearning: Evaluation and Optimization. EMNLP.

[25] Learn what you want to unlearn: Unlearning inversion attacks against machine unlearning. S&P.

[26] Introducing GPT-5.2. 2025.

[27] "Do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security.

[28] Towards LLM Unlearning Resilient to Relearning Attacks: A Sharpness-Aware Minimization Perspective and Beyond. ICML.

[29] Eight methods to evaluate robust unlearning in LLMs. arXiv preprint arXiv:2402.16835. URL https://openreview.net/forum?id=J5IRyTKZ9s

[30] A tutorial on energy-based learning. Predicting Structured Data.

[31] Touvron, Hugo and Lavril, Thibaut and Izacard, Gautier and Martinet, Xavier and Lachaux, Marie-Anne and Lacroix, Timoth. LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971.

[32] On the Fragility of Latent Knowledge: Layer-wise Influence under Unlearning in Large Language Model. ICML 2025 Workshop MUGen.

[33] Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

[34] Fan, Chongyu and Liu, Jiancheng and Lin, Licong and Jia, Jinghan and Zhang, Ruiqi and Mei, Song and Liu, Sijia. Simplicity Prevails: Rethinking Negative Preference Optimization for

[35] Differential privacy. ICALP.

[36] Eternal sunshine of the spotless net: Selective forgetting in deep networks. CVPR.

[37] Towards making systems forget with machine unlearning. S&P.

[38] Unsupervised word sense disambiguation rivaling supervised methods. ACL.

[39] Large language model safety: A holistic survey. arXiv preprint arXiv:2412.17686.

[40] A Sober Look at the Robustness of CLIPs to Spurious Features. NeurIPS.

[41] A comprehensive survey of machine unlearning techniques for large language models. arXiv preprint arXiv:2503.01854, 2025.

[42] Yi Ren and Danica J. Sutherland. Learning Dynamics of

[43] Pareto multi-task learning. NeurIPS.

[44] Gradient episodic memory for continual learning. NeurIPS.

[45] Thudi, Anvith and Deza, Gabriel and Chandrasekaran, Varun and Papernot, Nicolas. Unrolling

[46] The platonic representation hypothesis. arXiv preprint arXiv:2405.07987.

[47] Li, Yuanzhi and Bubeck, S. Textbooks Are All You Need II: phi-1.5 technical report. arXiv preprint arXiv:2309.05463.

[48] Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

[49] ORPO: Monolithic preference optimization without reference model. arXiv preprint arXiv:2403.07691.

[50] Direct preference optimization: Your language model is secretly a reward model. NeurIPS.

[51] Qizhou Wang and Bo Han and Puning Yang and Jianing Zhu and Tongliang Liu and Masashi Sugiyama. Towards Effective Evaluations and Comparisons for

[52] Security and privacy challenges of large language models: A survey. arXiv preprint arXiv:2402.00888.

[53] Patil, Vaidehi and Hase, Peter and Bansal, Mohit. Can sensitive information be deleted from

[54] Li, Nathaniel and Pan, Alexander and Gopal, Anjali and Yue, Summer and Berrios, Daniel and Gatti, Alice and Li, Justin D. and Dombrowski, Ann-Kathrin and Goel, Shashwat and Mukobi, Gabriel and others. The

[55] A survey of large language models. arXiv preprint arXiv:2303.18223.

[56] Generating Wikipedia by summarizing long sequences. arXiv preprint arXiv:1801.10198.

[57] Robust fine-tuning of zero-shot models. CVPR.

[58] Language models are few-shot learners. NeurIPS.

[59] BloombergGPT: A large language model for finance. arXiv preprint arXiv:2303.17564.

[60] Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. ACL.

[61] Quantifying privacy risks of masked language models using membership inference attacks. ACL.

[62] Training language models to follow instructions with human feedback. NeurIPS.

[63] Sharpness-aware minimization for efficiently improving generalization. arXiv preprint arXiv:2010.01412.

[64] Quantifying memorization across neural language models. ICLR.

[65] Positive-unlabeled learning with non-negative risk estimator. NeurIPS.

[66] OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.

[67] Task arithmetic in the tangent space: Improved editing of pre-trained models. NeurIPS.

[68] Decoupled weight decay regularization. ICLR.

[69] Knowledge Circuits in Pretrained Transformers. NeurIPS.

[70] Knowledge Conflicts for LLMs: A Survey. EMNLP.

[71] Chowdhury, Sayak Ray and Kini, Anush and Natarajan, Nagarajan. Provably robust

[72] Editing factual knowledge in language models. arXiv preprint arXiv:2104.08164.

[73] A Comprehensive Survey of LLM Alignment Techniques: RLHF, RLAIF, PPO, DPO and More (also listed as: Reinforcement Learning for LLM Post-Training: A Survey). arXiv preprint arXiv:2407.16216.

[74] Eliciting latent predictions from transformers with the tuned lens. arXiv preprint arXiv:2303.08112.

[75] Transformer feed-forward layers are key-value memories. EMNLP.

[76] Extracting training data from large language models. USENIX Security.

[77] Machine unlearning. S&P.

[78] Large Language Model Unlearning. NeurIPS.

[79] Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950.

[80] Trustworthy LLMs: A Survey and Guideline for Evaluating Large Language Models' Alignment. arXiv preprint arXiv:2308.05374.