Forget What Matters, Keep the Rest: Selective Unlearning of Informative Tokens
Pith reviewed 2026-05-10 04:28 UTC · model grok-4.3
The pith
Weighting unlearning loss by token entropy lets models forget key facts while preserving general capabilities.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors propose Entropy-guided Token Weighting (ETW), a token-level regularizer for unlearning in LLMs. ETW multiplies each token's contribution to the forgetting loss by the entropy of the model's predictive distribution at that position. They report that informative tokens tend to carry higher entropy because the model is less certain about them, whereas structural tokens tend to stay low-entropy because they are highly predictable. The authors claim that this weighting removes the targeted content more effectively and costs less utility on unrelated tasks than a uniform loss or other token-wise baselines.
What carries the argument
Entropy-guided Token Weighting (ETW), a regularizer that scales each token's unlearning loss by the entropy of the model's softmax distribution over next tokens.
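The weight itself is easy to sketch. The snippet below is a minimal illustration, not the authors' implementation: the toy four-word vocabulary, the logit values, and the function names are all assumptions made here. It computes the Shannon entropy (in nats) of a softmax distribution for one sharply peaked position and one flat position.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical next-token logits over a toy 4-word vocabulary (made-up values).
peaked_logits = [10.0, 0.0, 0.0, 0.0]  # one continuation is near-certain
flat_logits = [1.0, 1.0, 1.0, 1.0]     # several continuations are equally plausible

w_structural = entropy(softmax(peaked_logits))   # small weight
w_informative = entropy(softmax(flat_logits))    # equals log(4), the maximum here

print(f"peaked position weight: {w_structural:.4f}")
print(f"flat position weight:   {w_informative:.4f}")
```

Under this reading, a position where one token dominates gets a weight near zero, while a position with many plausible continuations gets a weight near the maximum, log of the vocabulary size.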
If this is right
- Unlearning concentrates its updates on high-entropy positions rather than penalizing all tokens equally.
- Low-entropy structural tokens experience smaller updates, limiting damage to syntax and fluency.
- The approach needs no external parsers, ground-truth confidence scores, or additional labels.
- Targeted forgetting becomes possible with smaller overall changes to the model's parameters.
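The consequences listed above can be made concrete with a toy calculation. This is a sketch under stated assumptions: the objective (minimize the entropy-weighted sum of token log-likelihoods, with weights treated as constants) is one plausible form of a weighted forgetting loss, not necessarily the one in the paper, and every number below is invented for illustration.

```python
def etw_forget_terms(token_log_probs, token_entropies):
    """Per-token terms of an entropy-weighted forgetting loss.

    Assumed objective (illustrative only): minimize sum_t w_t * log p(x_t),
    where w_t is the predictive entropy at position t, treated as a constant.
    """
    return [h * lp for h, lp in zip(token_entropies, token_log_probs)]

# Hypothetical values for the sequence "The capital is Paris".
tokens = ["The", "capital", "is", "Paris"]
log_probs = [-0.1, -2.0, -0.2, -1.5]  # log p(x_t | x_<t), made up
entropies = [0.05, 2.3, 0.1, 1.9]     # H(p(. | x_<t)) in nats, made up

terms = etw_forget_terms(log_probs, entropies)
etw_loss = sum(terms)
uniform_loss = sum(log_probs)  # the unweighted counterpart

for tok, term in zip(tokens, terms):
    print(f"{tok:>8}: share of ETW loss = {term / etw_loss:.3f}")
```

With these made-up numbers the two function words account for well under one percent of the weighted loss, which is the mechanism behind the claim that syntax and fluency see smaller updates.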
Where Pith is reading between the lines
- The entropy proxy could extend to other selective-update settings such as editing factual knowledge or mitigating bias without full retraining.
- Sequence-aware variants of the weighting might better handle facts that span multiple tokens.
- Direct comparison of model entropy against human annotations of token importance would test how well the proxy generalizes across domains.
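The last point, comparing entropy against human judgments, could be checked with a plain correlation. The sketch below assumes binary human labels (1 for informative, 0 for structural) and uses hand-written Pearson correlation; the entropies and labels are hypothetical placeholders, not data from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (norm_x * norm_y)

# Hypothetical per-token entropies and human importance labels (made up).
token_entropies = [0.05, 2.3, 0.1, 1.9, 0.2, 1.4]
human_labels = [0, 1, 0, 1, 0, 1]  # 1 = annotator marked the token informative

r = pearson(token_entropies, human_labels)
print(f"entropy vs. human importance: r = {r:.3f}")
```

A value of r near zero on real annotations would weaken the proxy; a rank-based coefficient might be a better fit if the relationship is monotone but not linear.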
Load-bearing premise
The entropy of the model's next-token predictions reliably marks which tokens carry the semantic content or behavior that must be removed.
What would settle it
Perform ETW unlearning on a model for specific facts or behaviors, then measure both the continued generation rate of the forgotten material and accuracy on standard benchmarks such as MMLU; if ETW produces weaker forgetting or larger utility drops than uniform baselines, the central benefit does not hold.
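The test described above reduces to a two-sided comparison, sketched here. The `UnlearnResult` type, its field names, and the numbers are hypothetical placeholders; a real evaluation would also need multiple seeds and significance tests.

```python
from dataclasses import dataclass

@dataclass
class UnlearnResult:
    forget_rate: float  # fraction of prompts that still elicit the forgotten material (lower is better)
    utility: float      # benchmark accuracy, e.g. on MMLU (higher is better)

def etw_benefit_holds(etw: UnlearnResult, uniform: UnlearnResult) -> bool:
    """The central benefit fails if ETW forgets less or loses more utility."""
    weaker_forgetting = etw.forget_rate > uniform.forget_rate
    larger_utility_drop = etw.utility < uniform.utility
    return not (weaker_forgetting or larger_utility_drop)

# Placeholder numbers for illustration only; not results from the paper.
uniform_run = UnlearnResult(forget_rate=0.10, utility=0.55)
etw_run = UnlearnResult(forget_rate=0.08, utility=0.60)

print(etw_benefit_holds(etw_run, uniform_run))
```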
Original abstract
Unlearning in large language models (LLMs) has emerged as a promising safeguard against adversarial behaviors. When the forgetting loss is applied uniformly without considering token-level semantic importance, model utility can be unnecessarily degraded. Recent studies have explored token-wise loss regularizers that prioritize informative tokens, but largely rely on ground-truth confidence or external linguistic parsers, which limits their ability to capture contextual information or the model's overall predictive state. Intuitively, function words like "the" primarily serve syntactic roles and are highly predictable with little ambiguity, but informative words admit multiple plausible alternatives with greater uncertainty. Based on this intuition, we propose Entropy-guided Token Weighting (ETW), a token-level unlearning regularizer that uses entropy of the predictive distribution as a proxy for token informativeness. We demonstrate that informative tokens tend to have higher entropy, whereas structural tokens tend to have lower entropy. This behavior enables ETW to achieve more effective unlearning while better preserving model utility than existing token-level approaches.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Entropy-guided Token Weighting (ETW), a token-level regularizer for LLM unlearning that weights the forgetting loss by the entropy of the model's own next-token predictive distribution. The central intuition is that high-entropy tokens are semantically informative while low-entropy tokens are structural/function words; ETW therefore applies stronger unlearning pressure to the former. The authors claim this yields superior unlearning effectiveness and utility preservation compared with prior token-wise methods that rely on ground-truth confidence or external parsers.
Significance. If the empirical claims hold, ETW supplies a simple, fully model-intrinsic, parameter-free mechanism for selective unlearning that avoids external linguistic tools. This could meaningfully improve the practicality of unlearning safeguards in deployed LLMs.
major comments (2)
- [Abstract, §3] Abstract and §3: the central claim that 'informative tokens tend to have higher entropy' and that ETW therefore produces better unlearning-utility trade-offs is asserted without any reported correlation statistics, ablation tables, or quantitative comparison to baselines. The soundness assessment cannot be completed until the experimental sections supply these data with error bars and statistical tests.
- [§4] §4 (method): the weighting formula w_t = H(p(·|x_<t)) is presented as directly following from the entropy-informativeness intuition, yet no derivation or sensitivity analysis shows why this particular functional form is preferred over alternatives (e.g., normalized entropy, mutual information, or variance of the predictive distribution).
minor comments (2)
- [§4] Notation: the manuscript should explicitly define whether entropy is computed over the full vocabulary or a top-k subset, and whether temperature scaling is applied before entropy calculation.
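The notation point matters because both choices change the weight. The sketch below shows how temperature scaling and top-k truncation alter the entropy of the same logits; the logit values and function names are assumptions, and nothing here reflects which convention the manuscript actually uses.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature; higher temperature flattens the distribution."""
    if temperature <= 0.0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def topk_entropy(probs, k):
    """Entropy of the k most likely tokens after renormalizing them to sum to 1."""
    top = sorted(probs, reverse=True)[:k]
    total = sum(top)
    return entropy([p / total for p in top])

logits = [3.0, 1.0, 0.0, -1.0]  # hypothetical logits over a toy vocabulary

full_t1 = entropy(softmax(logits, temperature=1.0))
full_t2 = entropy(softmax(logits, temperature=2.0))
top2_t1 = topk_entropy(softmax(logits, temperature=1.0), k=2)

print(f"full vocabulary, T=1: {full_t1:.4f}")
print(f"full vocabulary, T=2: {full_t2:.4f}")
print(f"top-2 subset,    T=1: {top2_t1:.4f}")
```

Top-k entropy is bounded by log k, so a small k compresses the range of weights, while a higher temperature raises every weight toward the maximum.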
- [§2] Related work: the comparison to prior token-level unlearning regularizers would benefit from a concise table summarizing their information sources (ground-truth, parser, model entropy).
Simulated Author's Rebuttal
We thank the referee for their constructive comments. We address each major point below and indicate the revisions we will make to improve the manuscript's rigor.
- Referee: [Abstract, §3] Abstract and §3: the central claim that 'informative tokens tend to have higher entropy' and that ETW therefore produces better unlearning-utility trade-offs is asserted without any reported correlation statistics, ablation tables, or quantitative comparison to baselines. The soundness assessment cannot be completed until the experimental sections supply these data with error bars and statistical tests.
  Authors: We agree that explicit quantitative support strengthens the central claim. The experimental section (§5) already reports unlearning-utility trade-offs against token-level baselines, but we did not include direct correlation statistics (e.g., between entropy and token informativeness proxies) or statistical significance tests. In revision we will add these: a correlation analysis (Pearson coefficients with error bars), ablation tables, and t-test p-values across multiple random seeds.
  Revision: yes
- Referee: [§4] §4 (method): the weighting formula w_t = H(p(·|x_<t)) is presented as directly following from the entropy-informativeness intuition, yet no derivation or sensitivity analysis shows why this particular functional form is preferred over alternatives (e.g., normalized entropy, mutual information, or variance of the predictive distribution).
  Authors: The raw entropy form follows directly from the definition of predictive uncertainty without introducing scale-dependent normalizers. We acknowledge the absence of sensitivity analysis. We will revise §4 with a short justification paragraph and add an ablation study in the experiments comparing raw entropy against normalized entropy, predictive variance, and mutual information, reporting the resulting unlearning-utility curves.
  Revision: yes
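One of the alternatives named in this exchange, normalized entropy, is simple to compare against the raw form. The sketch below is an illustration only: dividing by log V is a common normalization and is assumed here, since the text does not define it, and the two uniform distributions are toy inputs.

```python
import math

def entropy(probs):
    """Raw Shannon entropy in nats; ranges from 0 to log(len(probs))."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def normalized_entropy(probs):
    """Entropy divided by its maximum, log V, so the result lies in [0, 1]."""
    vocab_size = len(probs)
    if vocab_size < 2:
        return 0.0
    return entropy(probs) / math.log(vocab_size)

# Two maximally uncertain distributions over different vocabulary sizes.
uniform_small = [1.0 / 4] * 4
uniform_large = [1.0 / 1000] * 1000

raw_small, raw_large = entropy(uniform_small), entropy(uniform_large)
norm_small, norm_large = normalized_entropy(uniform_small), normalized_entropy(uniform_large)

print(f"raw:        {raw_small:.3f} vs {raw_large:.3f}")
print(f"normalized: {norm_small:.3f} vs {norm_large:.3f}")
```

The raw weight's upper bound grows with vocabulary size while the normalized weight does not, so the two forms imply different effective learning rates across models. That difference is one thing the promised ablation could quantify.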
Circularity Check
No significant circularity; derivation is self-contained
The paper's central construction is Entropy-guided Token Weighting (ETW), which directly computes per-token weights from the entropy of the model's own next-token predictive distribution. This is an explicit, non-fitted proxy derived from the forward pass rather than from any target unlearning labels or fitted parameters. The abstract states the supporting intuition and the empirical demonstration that informative tokens exhibit higher entropy, but presents neither as a mathematical derivation that reduces to the inputs by construction. No equations, self-citations, uniqueness theorems, or ansatzes are referenced in the provided text that would create a load-bearing loop. The method therefore remains falsifiable via downstream unlearning-utility trade-offs and does not collapse into a renaming or self-definition of its own outputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Entropy of the predictive distribution serves as a reliable proxy for token semantic informativeness.