Leak@$k$: Unlearning Does Not Make LLMs Forget Under Probabilistic Decoding

Hadi Reisizadeh; Jiajun Ruan; Mingyi Hong; Sijia Liu; Soumyadeep Pal; Yiwei Chen

arxiv: 2511.04934 · v3 · pith:PFK7GKWInew · submitted 2025-11-07 · 💻 cs.LG



keywords: unlearning, leak, under, decoding, metric, knowledge, methods


Unlearning in large language models (LLMs) is critical for regulatory compliance and for building ethical generative AI systems that avoid producing private, toxic, illegal, or copyrighted content. Despite rapid progress, we show in this work that almost all existing unlearning methods fail to achieve true forgetting in practice. Specifically, while evaluations of these 'unlearned' models under deterministic (greedy) decoding on standard benchmarks often suggest successful knowledge removal, we show that sensitive information reliably resurfaces when models are sampled with standard probabilistic decoding. To rigorously capture this vulnerability, we introduce leak@$k$, a new meta-evaluation metric that quantifies the likelihood of forgotten knowledge reappearing when generating $k$ samples from the model under realistic decoding strategies. Using three widely adopted benchmarks, TOFU, MUSE, and WMDP, we conduct the first large-scale, systematic study of unlearning reliability using the leak@$k$ metric. Our findings demonstrate that knowledge leakage persists across methods and tasks, underscoring that current state-of-the-art (SOTA) unlearning techniques provide only limited forgetting. To address this concern, we propose an algorithm termed Robust Unlearning under LEak@$k$ metric (RULE). We demonstrate that, on the TOFU benchmark, RULE yields an unlearned model with no information leakage for a large number of generation samples. On the MUSE benchmark, RULE outperforms SOTA unlearning methods under the leak@$k$ metric across most sampling budgets $k$. Code is available at https://github.com/OptimAI-Lab/Leak-k.
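The abstract does not state the formula for leak@$k$, so the following is only an illustrative sketch of the idea of "at least one leak among $k$ samples", not the paper's definition. It assumes a binary per-sample leak indicator and borrows the standard combinatorial estimator familiar from pass@k-style evaluation; the function names and the example counts are hypothetical.

```python
from math import comb


def leak_probability_iid(p: float, k: int) -> float:
    """Chance that at least one of k independent samples leaks,
    assuming each sample leaks with the same probability p (an assumption)."""
    if not 0.0 <= p <= 1.0 or k < 1:
        raise ValueError("need 0 <= p <= 1 and k >= 1")
    return 1.0 - (1.0 - p) ** k


def leak_at_k_estimate(n: int, c: int, k: int) -> float:
    """Estimate the chance that a random subset of k out of n generated
    samples contains at least one leak, given that c of the n leaked.

    This mirrors the pass@k-style estimator 1 - C(n - c, k) / C(n, k);
    whether the paper uses this exact form is an assumption.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when k > n - c, so the estimate becomes 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical counts: 3 leaking generations out of 200 sampled.
    for k in (1, 8, 64):
        print(k, round(leak_at_k_estimate(200, 3, k), 4))
```

Under these assumptions, a model that looks clean on a single decode can still leak with high probability once $k$ grows, which is the gap between greedy and sampled evaluation that the abstract describes.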


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Distinguishable Deletion: Unifying Knowledge Erasure and Refusal for Large Language Model Unlearning
cs.LG 2026-05 unverdicted novelty 6.0

Distinguishable Deletion unifies knowledge erasure and refusal for LLM unlearning via an energy index that enforces boundaries during training and enables refusal at inference.