Unlearning What Matters: Token-Level Attribution for Precise Language Model Unlearning
Pith reviewed 2026-05-09 19:35 UTC · model grok-4.3
The pith
Token-level attribution lets language models unlearn specific knowledge more precisely by updating only critical tokens.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TokenUnlearn identifies critical tokens for removal through importance scores derived from knowledge-aware masking and entropy-aware signals, then applies unlearning selectively via hard selection or soft weighting; this token-level focus raises the gradient signal-to-noise ratio and delivers better forgetting effectiveness with less utility loss than uniform sequence-level baselines.
What carries the argument
The token importance scoring mechanism that fuses masking-based knowledge signals with entropy-based signals, enabling either hard selection of top tokens or soft modulation of their gradient contributions.
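The review does not reproduce the paper's exact formulas, so the following is a minimal sketch of that mechanism as described: a masking-based knowledge signal fused with an entropy signal, feeding either hard top-k selection or soft gradient weighting. The fusion rule, the mixing weight `lam`, the min-max normalization, and the selection fraction `k` are illustrative assumptions, not the authors' choices.

```python
# Minimal sketch of token-level importance scoring and selective unlearning.
# The fusion rule, `lam`, the normalization, and `k` are assumptions.
import torch
import torch.nn.functional as F

def token_importance(logits, masked_logits, labels, lam=0.5):
    # logits:        (T, V) logits on the original sequence
    # masked_logits: (T, V) logits with the targeted knowledge span masked out
    # labels:        (T,)   next-token labels (int64)
    log_p = F.log_softmax(logits, dim=-1)
    log_p_m = F.log_softmax(masked_logits, dim=-1)
    # Knowledge-aware signal: how much masking the fact shifts label likelihood.
    knowledge = (log_p.gather(-1, labels[:, None])
                 - log_p_m.gather(-1, labels[:, None])).squeeze(-1).abs()
    # Entropy-aware signal: predictive uncertainty at each position.
    entropy = -(log_p.exp() * log_p).sum(dim=-1)
    norm = lambda x: (x - x.min()) / (x.max() - x.min() + 1e-8)
    return lam * norm(knowledge) + (1 - lam) * norm(entropy)

def selective_unlearning_loss(logits, labels, scores, mode="soft", k=0.2):
    nll = F.cross_entropy(logits, labels, reduction="none")  # per-token NLL
    if mode == "hard":
        # Hard selection: gradient ascent only on the top-k fraction of tokens.
        weights = (scores >= torch.quantile(scores, 1.0 - k)).float()
    else:
        # Soft weighting: scale each token's gradient by its importance score.
        weights = scores
    # Negative sign = gradient ascent on the weighted forget objective.
    return -(weights * nll).sum() / weights.sum().clamp_min(1e-8)
```

On this reading, both variants only reweight the per-token terms of an existing forget objective, which is why they can wrap sequence-level methods without altering their core logic.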
If this is right
- Token-level selection reduces gradient noise and improves the forgetting-utility trade-off.
- Both hard and soft strategies can be added to existing sequence-level unlearning algorithms without altering their core logic.
- Theoretical improvement in signal-to-noise ratio holds across the tested model architectures.
- Consistent gains appear on standard benchmarks like TOFU and WMDP for forgetting effectiveness and utility preservation.
Where Pith is reading between the lines
- Similar token-attribution logic could extend to other model-editing tasks such as bias correction or targeted capability addition.
- The approach implies that knowledge in language models is sufficiently localized to tokens that selective removal is feasible.
- Efficient approximations of the masking and entropy computations would be needed to scale the method to much larger models.
- It raises the question of how distributed versus localized the encoding of specific facts actually is within transformer representations.
Load-bearing premise
The combined masking and entropy importance scores reliably identify exactly the tokens that encode the knowledge targeted for removal.
What would settle it
An experiment in which, after unlearning only the selected high-importance tokens, the model either still recalls the targeted knowledge or shows larger-than-expected drops in unrelated task performance.
Original abstract
Machine unlearning has emerged as a critical capability for addressing privacy, safety, and regulatory concerns in large language models (LLMs). Existing methods operate at the sequence level, applying uniform updates across all tokens despite only a subset encoding the knowledge targeted for removal. This introduces gradient noise, degrades utility, and leads to suboptimal forgetting. We propose TokenUnlearn, a token-level attribution framework that identifies and selectively targets critical tokens. Our approach combines knowledge-aware signals via masking, and entropy-aware signals to yield importance scores for precise token selection. We develop two complementary strategies: hard selection, applying unlearning only to high-importance tokens, and soft weighting, modulating gradient contributions based on importance scores. Both extend existing methods to token-level variants. Theoretical analysis shows token-level selection improves gradient signal-to-noise ratio. Experiments on TOFU and WMDP benchmarks across three model architectures demonstrate consistent improvements over sequence-level baselines in both forgetting effectiveness and utility preservation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces TokenUnlearn, a token-level attribution framework for precise LLM unlearning. It computes importance scores by combining knowledge-aware signals from masking with entropy-aware signals, then applies either hard selection (unlearning only on high-importance tokens) or soft weighting (modulating gradients by importance). Theoretical analysis claims this improves gradient signal-to-noise ratio over sequence-level methods. Experiments on the TOFU and WMDP benchmarks across three model architectures report consistent gains in forgetting effectiveness and utility preservation relative to sequence-level baselines.
Significance. If the proposed attribution reliably isolates the exact tokens encoding targeted knowledge, the method could meaningfully advance precise unlearning by reducing gradient noise and utility degradation. The reported cross-benchmark, cross-architecture consistency would then represent a practical advance over uniform sequence-level updates. However, the significance is conditional on evidence that observed gains arise from accurate token targeting rather than incidental effects such as reduced effective update magnitude.
major comments (2)
- [theoretical analysis and importance-score computation] The theoretical SNR improvement does not establish that the masking-plus-entropy scores identify the causal tokens encoding the knowledge to be removed. If the scores systematically overweight high-entropy but non-causal tokens or miss distributed facts, both hard and soft variants reduce to noisy or partial gradient masking; the empirical gains could then be explained by magnitude reduction alone. A direct validation (e.g., intervention tests or comparison against ground-truth token attributions) is required in the method or analysis section.
- [experiments on TOFU and WMDP] The headline experimental claim of consistent improvements on TOFU and WMDP rests on the untested assumption that the combined importance scores correctly surface the relevant tokens. Without ablations that compare against random-token selection or magnitude-matched uniform updates, it remains unclear whether the reported forgetting-utility trade-off stems from precise attribution or from simply applying smaller effective updates. Such controls should be added to the experimental section.
minor comments (2)
- [method] The exact formula combining the masking-based and entropy-based signals into a single importance score should be stated explicitly as an equation rather than described in prose (a hedged illustration of what such an equation might look like follows this list).
- [experiments] Hyper-parameter choices, statistical significance tests, and full implementation details for the three architectures are referenced only at a high level; these must be documented to support reproducibility.
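For concreteness, one hedged form the requested equation could take, consistent with the sketch above; this is the reviewer's illustration, not the paper's stated formula, and $\lambda$, the min-max normalization (tildes), and the masked input $x_{\setminus e}$ are all assumptions:

```latex
% Hypothetical fusion rule; lambda, normalization, and signals are assumed.
s_i = \lambda\,\tilde{\Delta}_i + (1-\lambda)\,\tilde{H}_i, \qquad
\Delta_i = \bigl|\log p_\theta(y_i \mid x) - \log p_\theta(y_i \mid x_{\setminus e})\bigr|, \qquad
H_i = -\sum_{v \in \mathcal{V}} p_\theta(v \mid x_{<i}) \log p_\theta(v \mid x_{<i})
```

where $x_{\setminus e}$ denotes the input with the targeted knowledge span masked and $\mathcal{V}$ the vocabulary.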
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below with clarifications and planned revisions to strengthen the manuscript.
Point-by-point responses
Referee: [theoretical analysis and importance-score computation] The theoretical SNR improvement does not establish that the masking-plus-entropy scores identify the causal tokens encoding the knowledge to be removed. If the scores systematically overweight high-entropy but non-causal tokens or miss distributed facts, both hard and soft variants reduce to noisy or partial gradient masking; the empirical gains could then be explained by magnitude reduction alone. A direct validation (e.g., intervention tests or comparison against ground-truth token attributions) is required in the method or analysis section.
Authors: We agree that the SNR analysis assumes informative importance scores but does not directly prove they isolate causal tokens. We will add a new subsection in the analysis with intervention tests, including token ablation (measuring output change after masking attributed tokens) and comparison to random selection. Where ground-truth attributions are feasible (e.g., synthetic facts in TOFU), we will include direct comparisons. We will also explicitly discuss limitations for distributed or high-entropy non-causal tokens. revision: yes
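A minimal sketch of such an intervention test, assuming a Hugging Face-style causal LM whose forward pass returns `.logits` and an input that ends with the answer tokens; the protocol (mask the attributed tokens, measure the drop in answer log-probability, compare against a random-token baseline) is one reading of the promised experiment, not the authors' published code:

```python
# Hedged sketch of a token-ablation intervention test; the model interface and
# the sequence layout (answer tokens at the end) are assumptions.
import torch
import torch.nn.functional as F

@torch.no_grad()
def ablation_effect(model, input_ids, answer_ids, token_mask, mask_token_id):
    # input_ids: (T,) prompt followed by answer; answer_ids: (A,) == input_ids[-A:]
    # token_mask: (T,) bool, True at the attributed (high-importance) tokens
    def answer_logprob(ids):
        logits = model(ids[None]).logits[0]            # (T, V)
        pred = logits[-len(answer_ids) - 1:-1]         # positions predicting the answer
        return F.log_softmax(pred, -1).gather(-1, answer_ids[:, None]).sum()

    ablated = input_ids.clone()
    ablated[token_mask] = mask_token_id                # knock out attributed tokens
    # Large positive drop => the attributed tokens carry the targeted knowledge.
    return (answer_logprob(input_ids) - answer_logprob(ablated)).item()
```

Running the same function with an equally sized random `token_mask` gives the baseline the comparison needs.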
Referee: [experiments on TOFU and WMDP] The headline experimental claim of consistent improvements on TOFU and WMDP rests on the untested assumption that the combined importance scores correctly surface the relevant tokens. Without ablations that compare against random-token selection or magnitude-matched uniform updates, it remains unclear whether the reported forgetting-utility trade-off stems from precise attribution or from simply applying smaller effective updates. Such controls should be added to the experimental section.
Authors: We concur that these controls are necessary to rule out magnitude-reduction explanations. We will add two ablations to the experimental section: (1) random-token selection matching the count of hard-selected tokens, and (2) magnitude-matched uniform updates (scaling gradients or learning rate to equalize effective update size with soft weighting). Full results on TOFU and WMDP across models will be reported, with statistical comparisons to our method. revision: yes
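The two controls could be instantiated roughly as below; the selection fraction and the magnitude-matching rule (equalizing total per-token weight) are assumptions about what "magnitude-matched" would mean in practice:

```python
# Hedged sketch of the two promised controls; the matching rule is assumed.
import torch

def control_weights(scores, mode, k=0.2):
    T = scores.numel()
    if mode == "random":
        # Control 1: random tokens, same count as hard selection would pick.
        weights = torch.zeros(T)
        weights[torch.randperm(T)[: max(1, int(k * T))]] = 1.0
    elif mode == "magnitude_matched":
        # Control 2: uniform weights with the same total mass as soft weighting,
        # so gains over it cannot stem from a smaller effective update alone.
        weights = torch.full((T,), (scores.sum() / T).item())
    else:
        raise ValueError(mode)
    return weights
```

Substituted for `scores` (or the hard mask) in a selective forget loss, these weights isolate whether precise attribution, rather than update size, drives the reported trade-off.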
Circularity Check
No circularity in derivation chain
Full rationale
The paper introduces a new token-level attribution method (masking + entropy signals) that extends prior sequence-level unlearning techniques, derives importance scores from those signals, and reports empirical gains on public TOFU/WMDP benchmarks across architectures. The SNR improvement is presented as an analysis of the proposed selection mechanism rather than a quantity defined by the inputs themselves. No equations or claims reduce by construction to fitted parameters, self-citations, or renamed known results; the central claims rest on independent empirical validation and the novel combination of signals.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Chen, J. and Yang, D. Unlearn what you want to for- get: Efficient unlearning for llms.arXiv preprint arXiv:2310.20150,
-
[2]
Who’s harry potter? approximate unlearning in llms, 2023.URL https://arxiv
Eldan, R. and Russinovich, M. Who’s harry pot- ter? approximate unlearning in llms, 2023.URL https://arxiv.org/abs/2310.02238, 1(2):8,
-
[3]
Regulation (EU) 2016/679 of the Euro- pean Parliament and of the Council.Official Journal of the European Union,
European Union. Regulation (EU) 2016/679 of the Euro- pean Parliament and of the Council.Official Journal of the European Union,
2016
-
[4]
Grattafiori, A., Dubey, A., Jauhri, A., et al. The Llama 3 herd of models.arXiv preprint arXiv:2407.21783,
work page internal anchor Pith review Pith/arXiv arXiv
-
[5]
Measuring Massive Multitask Language Understanding
Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring mas- sive multitask language understanding.arXiv preprint arXiv:2009.03300,
work page internal anchor Pith review arXiv 2009
-
[6]
arXiv preprint arXiv:2210.01504 , year=
Jang, J., Yoon, D., Yang, S., Cha, S., Lee, M., Logeswaran, L., and Seo, M. Knowledge unlearning for mitigat- ing privacy risks in language models.arXiv preprint arXiv:2210.01504,
-
[7]
Kim, J., Kim, K., Tack, J., Lim, D., and Shin, J. Scalable and robust llm unlearning by correcting responses with retrieved exclusions.arXiv preprint arXiv:2509.25973,
-
[8]
arXiv preprint arXiv:2510.19422 , year=
Li, K., Wang, Q., Wang, Y ., Li, F., Liu, J., Han, B., and Zhou, J. Llm unlearning with llm beliefs.arXiv preprint arXiv:2510.19422, 2025a. Li, N., Pan, A., Gopal, A., Yue, S., Berrios, D., Gatti, A., Li, J. D., Dombrowski, A.-K., Goel, S., Phan, L., et al. The WMDP benchmark: Measuring and reducing malicious use with unlearning.arXiv preprint arXiv:2403.03218,
-
[9]
Li, Y ., Bubeck, S., Eldan, R., Del Giorno, A., Gunasekar, S., and Lee, Y . T. Textbooks are all you need ii: phi-1.5 technical report.arXiv preprint arXiv:2309.05463,
work page internal anchor Pith review arXiv
-
[10]
F., Kurmanji, M., Qiu, X., Cai, D., Wu, C., and Lane, N
Li, Z., Wang, X., Shen, W. F., Kurmanji, M., Qiu, X., Cai, D., Wu, C., and Lane, N. D. Editing as unlearn- ing: Are knowledge editing methods strong baselines for large language model unlearning?arXiv preprint arXiv:2505.19855, 2025b. Liu, C., Wang, Y ., Flanigan, J., and Liu, Y . Large language model unlearning via embedding-corrupted prompts.Ad- vances ...
-
[11]
Nguyen, T. T., Huynh, T. T., Nguyen, P. L., Liew, A. W.-C., Yin, H., and Nguyen, Q. V . H. A survey of machine unlearning.arXiv preprint arXiv:2209.02299,
-
[12]
Patil, V ., Hase, P., and Bansal, M. Can sensitive information be deleted from llms? objectives for defending against ex- traction attacks.arXiv preprint arXiv:2309.17410,
-
[13]
In-context unlearning: Language models as few shot unlearners.arXiv preprint arXiv:2310.07579,
Pawelczyk, M., Neel, S., and Lakkaraju, H. In-context unlearning: Language models as few shot unlearners. arXiv preprint arXiv:2310.07579,
-
[14]
Qwen Team. Qwen2.5 technical report.arXiv preprint arXiv:2412.15115,
work page internal anchor Pith review Pith/arXiv arXiv
-
[15]
Detecting pretraining data from large language models.arXiv preprint arXiv:2310.16789, 2023
Shi, W., Ajith, A., Xia, M., Huang, Y ., Liu, D., Blevins, T., Chen, D., and Zettlemoyer, L. Detecting pretrain- ing data from large language models.arXiv preprint arXiv:2310.16789,
-
[16]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y ., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine- tuned chat models.arXiv preprint arXiv:2307.09288,
work page internal anchor Pith review Pith/arXiv arXiv
-
[17]
Tran, T., Liu, R., and Xiong, L. Tokens for learning, tokens for unlearning: Mitigating membership inference attacks in large language models via dual-purpose training.arXiv preprint arXiv:2502.19726,
-
[18]
Wan, Y ., Ramakrishna, A., Chang, K.-W., Cevher, V ., and Gupta, R. Not every token needs forgetting: Selective un- learning to limit change in utility in large language model unlearning.arXiv preprint arXiv:2506.00876,
-
[19]
P., Zhou, Z., Shin, S., Han, B., and Weinberger, K
Wang, Q., Zhou, J. P., Zhou, Z., Shin, S., Han, B., and Weinberger, K. Q. Rethinking llm unlearning objectives: A gradient perspective and go beyond.arXiv preprint arXiv:2502.19301, 2025a. Wang, S., Yu, L., Gao, C., Zheng, C., Liu, S., Lu, R., Dang, K., Chen, X., Yang, J., Zhang, Z., et al. Beyond the 80/20 rule: High-entropy minority tokens drive effecti...
-
[20]
Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025a. Yang, P., Wang, Q., Huang, Z., Liu, T., Zhang, C., and Han, B. Exploring criteria of loss reweighting to enhance LLM unlearning. InProceedings of the 42nd International Conference on Machine L...
work page internal anchor Pith review Pith/arXiv arXiv