Wisdom is Knowing What not to Say: Hallucination-Free LLMs Unlearning via Attention Shifting
Pith reviewed 2026-05-18 06:45 UTC · model grok-4.3
The pith
Attention-shifting unlearning lets LLMs forget facts without hallucinations or accuracy loss.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Attention-Shifting achieves selective unlearning by applying importance-aware suppression to attenuate attention on memorized facts in the unlearning set and attention-guided retention enhancement to reinforce semantically important tokens in the retained set. The two components are jointly optimized via a dual-loss objective that localizes changes under representation superposition, preserving linguistic structure and unrelated knowledge.
What carries the argument
Attention-Shifting (AS) framework that performs context-preserving suppression on the unlearning set combined with hallucination-resistant retention enhancement on the retained set through a dual-loss objective.
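One way to read the dual-loss objective is as a weighted sum of a suppression term on the unlearning set and a retention term on the retained set. The form below is an illustrative assumption, not the paper's stated equation: the symbols $\mathcal{L}_{\text{sup}}$, $\mathcal{L}_{\text{ret}}$, $\mathcal{D}_u$, $\mathcal{D}_r$ and the weight $\lambda$ are introduced here only for exposition, and the page does not give the paper's exact terms.

```latex
% Assumed generic form of a dual-loss objective (not taken from the paper):
%   D_u = unlearning set, D_r = retained set, lambda = trade-off weight
\mathcal{L}(\theta)
  = \mathcal{L}_{\text{sup}}(\theta;\, \mathcal{D}_u)
  + \lambda \, \mathcal{L}_{\text{ret}}(\theta;\, \mathcal{D}_r),
  \qquad \lambda > 0
```

Under this reading, a larger $\lambda$ would favor utility preservation and a smaller one would favor forgetting, which is the trade-off the referee's requested ablation on the weighting parameter would probe.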
Load-bearing premise
That jointly optimizing importance-aware suppression on unlearning data and attention-guided enhancement on retained data can localize forgetting without disrupting overall language capabilities or unrelated knowledge.
What would settle it
Running the Attention-Shifting method on the ToFU or TDEC benchmarks and finding accuracy equal to or lower than existing unlearning baselines, or observing persistent hallucinated responses on queries about the unlearned content.
Original abstract
The increase in computing power and the necessity of AI-assisted decision-making boost the growing application of large language models (LLMs). Along with this, the potential retention of sensitive data of LLMs has spurred increasing research into machine unlearning. However, existing unlearning approaches face a critical dilemma: Aggressive unlearning compromises model utility, while conservative strategies preserve utility but risk hallucinated responses. This significantly limits LLMs' reliability in knowledge-intensive applications. To address this, we introduce a novel Attention-Shifting (AS) framework for selective unlearning. AS is driven by two design objectives: (1) context-preserving suppression that attenuates attention to fact-bearing tokens without disrupting LLMs' linguistic structure; and (2) hallucination-resistant response shaping that discourages fabricated completions when queried about unlearning content. AS realizes these objectives through two attention-level interventions, which are importance-aware suppression applied to the unlearning set to reduce reliance on memorized knowledge and attention-guided retention enhancement that reinforces attention toward semantically essential tokens in the retained dataset to mitigate unintended degradation. These two components are jointly optimized via a dual-loss objective, which forms a soft boundary that localizes unlearning while preserving unrelated knowledge under representation superposition. Experimental results show that AS improves performance preservation over the state-of-the-art unlearning methods, achieving up to 15% higher accuracy on the ToFU benchmark and 10% on the TDEC benchmark, while maintaining competitive hallucination-free unlearning effectiveness. Compared to existing methods, AS demonstrates a superior balance between unlearning effectiveness, generalization, and response reliability.
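To make "attenuating attention to fact-bearing tokens" concrete, the sketch below rescales one toy attention row and renormalizes it. This is a simplified, inference-style illustration rather than the paper's training procedure, which optimizes a loss. The token list, the scores, the `fact_mask`, and the `scale` factor are all hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def shift_attention(weights, fact_mask, scale=0.1):
    """Damp the weights of flagged tokens, then renormalize so the row sums to 1."""
    damped = [w * scale if is_fact else w for w, is_fact in zip(weights, fact_mask)]
    total = sum(damped)
    return [d / total for d in damped]

# Hypothetical example: made-up pre-softmax scores for one query position.
tokens = ["The", "author", "was", "born", "in", "Paris"]
scores = [0.2, 1.5, 0.1, 0.8, 0.1, 2.0]
# Assumption: only "Paris" is treated as the fact-bearing token.
fact_mask = [False, False, False, False, False, True]

before = softmax(scores)
after = shift_attention(before, fact_mask, scale=0.1)

for tok, b, a in zip(tokens, before, after):
    print(f"{tok:>7}  before={b:.3f}  after={a:.3f}")
```

Because the row is renormalized, the mass removed from the flagged token is redistributed proportionally to the remaining tokens, so the structural tokens keep their relative ordering.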
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to introduce an Attention-Shifting (AS) framework for selective, hallucination-free unlearning in LLMs. It combines importance-aware suppression on fact-bearing tokens from the unlearning set with attention-guided retention enhancement on semantically essential tokens from the retained set; these interventions are jointly optimized through a dual-loss objective that purportedly forms a soft boundary localizing unlearning effects while preserving linguistic structure and unrelated knowledge under representation superposition. Experiments report up to 15% higher accuracy on the ToFU benchmark and 10% on the TDEC benchmark relative to prior unlearning methods, while maintaining competitive unlearning effectiveness.
Significance. If the empirical claims are robust, the work would meaningfully advance LLM unlearning by offering a practical mechanism that better balances forgetting of sensitive data against preservation of model utility and response reliability, addressing a central practical limitation in knowledge-intensive deployments.
major comments (2)
- [Abstract / Experimental Results] Abstract and Experimental Results: the reported accuracy gains (15% on ToFU, 10% on TDEC) are presented without error bars, statistical tests, or ablations on the dual-loss weighting parameter; this is load-bearing for the central claim of superior performance preservation over SOTA methods.
- [Method (dual-loss objective)] Method description of the dual-loss objective: the claim that importance-aware suppression and attention-guided retention can be jointly optimized to localize unlearning without eroding retained knowledge rests on the untested assumption that attention interventions remain separable despite representation superposition; no analysis or experiment addresses whether the same heads or token representations participate in both memorized facts and general linguistic structure.
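The reporting requested in the first major comment can be sketched with the standard library alone. The example computes a mean with its standard error over seeds and a paired t-statistic (one common choice of significance test) between a method and a baseline evaluated on the same seeds. The accuracy values are placeholders invented for illustration and are not results from the paper. The function names are likewise hypothetical.

```python
import math
import statistics

def mean_and_stderr(values):
    """Return the sample mean and the standard error of the mean."""
    mean = statistics.mean(values)
    stderr = statistics.stdev(values) / math.sqrt(len(values))
    return mean, stderr

def paired_t_statistic(a, b):
    """Paired t-statistic and degrees of freedom for matched samples a and b.

    Assumes the per-pair differences are not all identical
    (otherwise the standard error is zero).
    """
    if len(a) != len(b):
        raise ValueError("paired samples must have the same length")
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    stderr = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff / stderr, len(diffs) - 1

# Placeholder accuracies over five seeds (NOT results from the paper).
method_acc = [0.71, 0.69, 0.73, 0.70, 0.72]
baseline_acc = [0.60, 0.61, 0.63, 0.58, 0.62]

m_mean, m_se = mean_and_stderr(method_acc)
b_mean, b_se = mean_and_stderr(baseline_acc)
t_stat, dof = paired_t_statistic(method_acc, baseline_acc)

print(f"method:   {m_mean:.3f} +/- {m_se:.3f}")
print(f"baseline: {b_mean:.3f} +/- {b_se:.3f}")
print(f"paired t = {t_stat:.2f} with {dof} degrees of freedom")
```

The t-statistic would then be compared against a t-distribution with the printed degrees of freedom to obtain a p-value. Pairing by seed matters because it removes seed-to-seed variance shared by both methods.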
minor comments (2)
- [Abstract] The abstract could more explicitly define the hallucination-free metric used to claim competitive effectiveness.
- [Method] Notation and implementation details for the attention interventions would benefit from a concise pseudocode or equation block to aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the empirical reporting and methodological analysis.
-
Referee: [Abstract / Experimental Results] Abstract and Experimental Results: the reported accuracy gains (15% on ToFU, 10% on TDEC) are presented without error bars, statistical tests, or ablations on the dual-loss weighting parameter; this is load-bearing for the central claim of superior performance preservation over SOTA methods.
Authors: We agree that error bars, statistical tests, and ablations on the dual-loss weighting parameter are necessary to robustly support the performance claims. In the revised manuscript, we will include results averaged over multiple random seeds with standard error bars, apply statistical significance tests (such as paired t-tests) against baseline methods, and add an ablation study varying the dual-loss weighting parameter to demonstrate its effect on the unlearning-utility trade-off.
revision: yes
-
Referee: [Method (dual-loss objective)] Method description of the dual-loss objective: the claim that importance-aware suppression and attention-guided retention can be jointly optimized to localize unlearning without eroding retained knowledge rests on the untested assumption that attention interventions remain separable despite representation superposition; no analysis or experiment addresses whether the same heads or token representations participate in both memorized facts and general linguistic structure.
Authors: The dual-loss objective is intended to create a soft boundary that localizes unlearning effects while preserving unrelated knowledge, as supported by the empirical utility preservation results. We acknowledge that a direct examination of whether the same attention heads or token representations are involved in both unlearning targets and retained linguistic structure would provide additional insight into separability under superposition. We will incorporate such an analysis (e.g., via attention map comparisons or representation probing) in the revised version.
revision: yes
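A simple version of the attention map comparison mentioned in the second response could rank heads by how much attention they place on fact-bearing tokens in the unlearning set and on semantically essential tokens in the retained set, then measure how much the two top-k sets overlap. The sketch below is one assumed way to do this, not the authors' analysis. The per-head scores, the `(layer, head)` keys, and the choice of Jaccard overlap are all hypothetical.

```python
def top_k_heads(head_scores, k):
    """Return the set of k heads with the highest score."""
    ranked = sorted(head_scores, key=head_scores.get, reverse=True)
    return set(ranked[:k])

def jaccard(a, b):
    """Jaccard overlap between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical mean attention mass per (layer, head).
# forget_mass: mass on fact-bearing tokens in the unlearning set.
# retain_mass: mass on semantically essential tokens in the retained set.
forget_mass = {(0, 0): 0.10, (0, 1): 0.62, (1, 0): 0.55, (1, 1): 0.08}
retain_mass = {(0, 0): 0.48, (0, 1): 0.57, (1, 0): 0.12, (1, 1): 0.09}

forget_top = top_k_heads(forget_mass, k=2)
retain_top = top_k_heads(retain_mass, k=2)
overlap = jaccard(forget_top, retain_top)

print("forget-heavy heads:", sorted(forget_top))
print("retain-heavy heads:", sorted(retain_top))
print(f"Jaccard overlap: {overlap:.2f}")
```

A high overlap would suggest that the same heads serve both memorized facts and retained structure, which is the situation in which the referee doubts that the two interventions remain separable.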
Circularity Check
No significant circularity detected
The paper introduces an empirical Attention-Shifting framework consisting of importance-aware suppression and attention-guided retention enhancement, jointly optimized by a dual-loss objective and evaluated on external benchmarks (ToFU, TDEC). No equations, fitted parameters, or self-citations are presented in the provided text that reduce the reported accuracy gains or unlearning effectiveness to quantities defined by construction from the method's own inputs. The central claims rest on experimental outcomes against independent test sets rather than any self-referential renaming, ansatz smuggling, or load-bearing self-citation chain, rendering the derivation self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Transformer attention layers can be intervened upon at inference or fine-tuning time to attenuate fact-bearing tokens while preserving overall linguistic structure.
Forward citations
Cited by 1 Pith paper
-
LLM Ghostbusters: Surgical Hallucination Suppression via Adaptive Unlearning
Adaptive Unlearning suppresses package hallucinations in code-generating LLMs by 81% while preserving benchmark performance, using model-generated data and no human labels.
Reference graph
Works this paper leans on
-
[1]
Snapkv: Llm knows what you are looking for before generation
Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. Snapkv: Llm knows what you are looking for before generation. Advances in Neural Information Processing Systems, 37:22947–22970, 2024
-
[2]
Extracting training data from large language models
Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21), pages 2633–2650, 2021
-
[3]
Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail? Advances in Neural Information Processing Systems, 36, 2024
-
[4]
Paul Voigt and Axel Von dem Bussche. The EU general data protection regulation (GDPR). A practical guide, 1st ed., Cham: Springer International Publishing, 10(3152676):10–5555, 2017
-
[5]
Knowledge unlearning for mitigating privacy risks in language models
Joel Jang, Dongkeun Yoon, Sohee Yang, Sungmin Cha, Moontae Lee, Lajanugen Logeswaran, and Minjoon Seo. Knowledge unlearning for mitigating privacy risks in language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14389–14408, Toronto, Canada, July 2023. Association for Compu...
-
[6]
Unlearn what you want to forget: Efficient unlearning for LLMs
Jiaao Chen and Diyi Yang. Unlearn what you want to forget: Efficient unlearning for LLMs. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023
-
[7]
Rethinking machine unlearning for large language models
Sijia Liu, Yuanshun Yao, Jinghan Jia, Stephen Casper, Nathalie Baracaldo, Peter Hase, Yuguang Yao, Chris Yuhao Liu, Xiaojun Xu, Hang Li, et al. Rethinking machine unlearning for large language models. Nature Machine Intelligence, pages 1–14, 2025
-
[8]
Large language model unlearning via embedding-corrupted prompts
Chris Liu, Yaxuan Wang, Jeffrey Flanigan, and Yang Liu. Large language model unlearning via embedding-corrupted prompts. Advances in Neural Information Processing Systems, 37:118198–118266, 2024
-
[9]
Ulmr: Unlearning large language models via negative response and model parameter average
Shaojie Shi, Xiaoyu Tan, Xihe Qiu, Chao Qu, Kexin Nie, Yuan Cheng, Wei Chu, Xu Yinghui, and Yuan Qi. Ulmr: Unlearning large language models via negative response and model parameter average. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 755–762, 2024
-
[10]
LLM unlearning via loss adjustment with only forget data
Yaxuan Wang, Jiaheng Wei, Chris Yuhao Liu, Jinlong Pang, Quan Liu, Ankit Shah, Yujia Bao, Yang Liu, and Wei Wei. LLM unlearning via loss adjustment with only forget data. In The Thirteenth International Conference on Learning Representations, 2025
-
[11]
Towards robust and parameter-efficient knowledge unlearning for LLMs
Sungmin Cha, Sungjun Cho, Dasol Hwang, and Moontae Lee. Towards robust and parameter-efficient knowledge unlearning for LLMs. In The Thirteenth International Conference on Learning Representations, 2025
-
[12]
Reversing the forget-retain objectives: An efficient llm unlearning framework from logit difference
Jiabao Ji, Yujian Liu, Yang Zhang, Gaowen Liu, Ramana Kompella, Sijia Liu, and Shiyu Chang. Reversing the forget-retain objectives: An efficient llm unlearning framework from logit difference. Advances in Neural Information Processing Systems, 37:12581–12611, 2024
-
[13]
Offset unlearning for large language models
James Y Huang, Wenxuan Zhou, Fei Wang, Fred Morstatter, Sheng Zhang, Hoifung Poon, and Muhao Chen. Offset unlearning for large language models. arXiv preprint arXiv:2404.11045, 2024.
-
[14]
Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2):1–55, 2025
-
[15]
Quantifying the uncertainty of llm hallucination spreading in complex adaptive social networks
Guozhi Hao, Jun Wu, Qianqian Pan, and Rosario Morello. Quantifying the uncertainty of llm hallucination spreading in complex adaptive social networks. Scientific reports, 14(1):16375, 2024
-
[16]
Forget to flourish: Leveraging machine-unlearning on pretrained language models for privacy leakage
Md Rafi Ur Rashid, Jing Liu, Toshiaki Koike-Akino, Ye Wang, and Shagufta Mehnaz. Forget to flourish: Leveraging machine-unlearning on pretrained language models for privacy leakage. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 20139–20147, 2025
-
[17]
To forget or not? towards practical knowledge unlearning for large language models
Bozhong Tian, Xiaozhuan Liang, Siyuan Cheng, Qingbin Liu, Mengru Wang, Dianbo Sui, Xi Chen, Huajun Chen, and Ningyu Zhang. To forget or not? towards practical knowledge unlearning for large language models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 1524–1537, 2024
-
[18]
A closer look at machine unlearning for large language models
Xiaojian Yuan, Tianyu Pang, Chao Du, Kejiang Chen, Weiming Zhang, and Min Lin. A closer look at machine unlearning for large language models. In The Thirteenth International Conference on Learning Representations, 2025
-
[19]
Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, et al. Toy models of superposition. arXiv preprint arXiv:2209.10652, 2022
-
[20]
The rise of parameter specialization for knowledge storage in large language models
Yihuai Hong, Yiran Zhao, Wei Tang, Yang Deng, Yu Rong, and Wenxuan Zhang. The rise of parameter specialization for knowledge storage in large language models. arXiv preprint arXiv:2505.17260, 2025
-
[21]
TOFU: A task of fictitious unlearning for LLMs
Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary Chase Lipton, and J Zico Kolter. TOFU: A task of fictitious unlearning for LLMs. In First Conference on Language Modeling, 2024
-
[22]
Negative preference optimization: From catastrophic collapse to effective unlearning
Ruiqi Zhang, Licong Lin, Yu Bai, and Song Mei. Negative preference optimization: From catastrophic collapse to effective unlearning. In First Conference on Language Modeling, 2024
-
[23]
Large language model unlearning
Yuanshun Yao, Xiaojun Xu, and Yang Liu. Large language model unlearning. Advances in Neural Information Processing Systems, 37:105425–105475, 2024
-
[24]
Selective forgetting: Advancing machine unlearning techniques and evaluation in language models
Lingzhi Wang, Xingshan Zeng, Jinsong Guo, Kam-Fai Wong, and Georg Gottlob. Selective forgetting: Advancing machine unlearning techniques and evaluation in language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 843–851, 2025
-
[25]
Eraser: Jailbreaking defense in large language models via unlearning harmful knowledge
Weikai Lu, Ziqian Zeng, Jianwei Wang, Zhengdong Lu, Zelin Chen, Huiping Zhuang, and Cen Chen. Eraser: Jailbreaking defense in large language models via unlearning harmful knowledge. arXiv preprint arXiv:2404.05880, 2024
-
[26]
Towards safer large language models through machine unlearning
Zheyuan Liu, Guangyao Dou, Zhaoxuan Tan, Yijun Tian, and Meng Jiang. Towards safer large language models through machine unlearning. In Findings of the Association for Computational Linguistics ACL 2024, pages 1817–1829, 2024
-
[27]
Meow: Memory supervised llm unlearning via inverted facts
Tianle Gu, Kexin Huang, Ruilin Luo, Yuanqi Yao, Yujiu Yang, Yan Teng, and Yingchun Wang. Meow: Memory supervised llm unlearning via inverted facts. arXiv preprint arXiv:2409.11844, 2024
-
[28]
Who’s harry potter? approximate unlearning in llms
Ronen Eldan and Mark Russinovich. Who’s harry potter? approximate unlearning in llms. arXiv e-prints, pages arXiv–2310, 2023
-
[29]
Safety alignment should be made more than just a few tokens deep
Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu, Xiao Ma, Subhrajit Roy, Ahmad Beirami, Prateek Mittal, and Peter Henderson. Safety alignment should be made more than just a few tokens deep. In The Thirteenth International Conference on Learning Representations, 2025.
-
[30]
Soft prompting for unlearning in large language models
Karuna Bhaila, Minh-Hao Van, and Xintao Wu. Soft prompting for unlearning in large language models. arXiv preprint arXiv:2406.12038, 2024
-
[31]
Jinhao Duan, Hao Cheng, Shiqi Wang, Alex Zavalny, Chenan Wang, Renjing Xu, Bhavya Kailkhura, and Kaidi Xu. Shifting attention to relevance: Towards the predictive uncertainty quantification of free-form large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5050–5063, 2024
-
[32]
Language Models (Mostly) Know What They Know
Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221, 2022
-
[33]
The LAMBADA dataset: Word prediction requiring a broad discourse context
Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The lambada dataset: Word prediction requiring a broad discourse context. arXiv preprint arXiv:1606.06031, 2016
-
[34]
PubMedQA: A Dataset for Biomedical Research Question Answering
Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W Cohen, and Xinghua Lu. Pubmedqa: A dataset for biomedical research question answering. arXiv preprint arXiv:1909.06146, 2019
-
[35]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023
-
[36]
Gpt-neo: Large scale autoregressive language modeling with mesh-tensorflow, 2021
Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman. Gpt-neo: Large scale autoregressive language modeling with mesh-tensorflow, 2021. 5297715, 2022
-
[37]
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020
-
[38]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023
-
[39]
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.
Supplemental Material
A Theoretical Analysis of Attention Shifting
A.1 Effect of Attention Reweighting on Output Distribution
We begin by analyzing how modifying the attent...