Wisdom is Knowing What not to Say: Hallucination-Free LLMs Unlearning via Attention Shifting

arxiv: 2510.17210 · v3 · submitted 2025-10-20 · 💻 cs.CL

Chenchen Tan, Youyang Qu, Xinghao Li, Hui Zhang, Shujie Cui, Cunjian Chen, Longxiang Gao

Pith reviewed 2026-05-18 06:45 UTC · model grok-4.3

classification 💻 cs.CL

keywords machine unlearning · large language models · attention shifting · hallucination prevention · selective forgetting · knowledge localization · dual-loss optimization

The pith

Attention-shifting unlearning lets LLMs forget facts without hallucinations or accuracy loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an Attention-Shifting framework that removes targeted knowledge from large language models while keeping general performance intact and preventing fabricated answers. It works by reducing attention to fact-bearing tokens in the data to be forgotten and reinforcing attention to essential tokens in the data to be retained. These two interventions are trained together through a dual-loss objective that creates a soft boundary around the unwanted knowledge. On the ToFU and TDEC benchmarks, the authors report up to 15% and 10% higher accuracy, respectively, than prior unlearning methods, while hallucination-free unlearning effectiveness remains competitive.

Core claim

Attention-Shifting achieves selective unlearning by applying importance-aware suppression to attenuate attention on memorized facts in the unlearning set and attention-guided retention enhancement to reinforce semantically important tokens in the retained set. The two components are jointly optimized via a dual-loss objective that localizes changes under representation superposition, preserving linguistic structure and unrelated knowledge.
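The shape of such a dual-loss objective can be illustrated with a toy sketch. This is not the paper's formulation, which this summary does not reproduce. It assumes one attention row per example, a hypothetical index set of fact-bearing tokens (`fact_idx`) for the forget example, a hypothetical index set of essential tokens (`key_idx`) with a reference attention row (`retain_ref`) for the retained example, and a hypothetical weighting `lam`. All function names and both loss forms are illustrative assumptions.

```python
import math


def softmax(scores):
    """Normalise raw attention scores into weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def suppression_loss(attn, fact_idx):
    """Attention mass placed on fact-bearing tokens of a forget example."""
    return sum(attn[i] for i in fact_idx)


def retention_loss(attn, ref_attn, key_idx):
    """Shortfall in attention on essential tokens relative to a reference row."""
    return sum(max(0.0, ref_attn[i] - attn[i]) for i in key_idx)


def dual_loss(forget_scores, fact_idx, retain_scores, retain_ref, key_idx, lam=1.0):
    """Hypothetical joint objective: suppression term plus lam times retention term."""
    forget_attn = softmax(forget_scores)
    retain_attn = softmax(retain_scores)
    return (suppression_loss(forget_attn, fact_idx)
            + lam * retention_loss(retain_attn, retain_ref, key_idx))


# Toy example: token 0 is treated as the fact-bearing token of the forget example.
before = dual_loss([2.0, 0.5, 0.1, 0.3], [0], [1.0, 1.0], [0.5, 0.5], [0])
after = dual_loss([-1.0, 0.5, 0.1, 0.3], [0], [1.0, 1.0], [0.5, 0.5], [0])
print(before > after)  # lowering the score of the fact-bearing token lowers the loss
```

In a real training loop the same two terms would be written as differentiable tensor operations over the model's attention maps; the plain-Python version only shows how the terms pull in different directions on different data.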

What carries the argument

Attention-Shifting (AS) framework that performs context-preserving suppression on the unlearning set combined with hallucination-resistant retention enhancement on the retained set through a dual-loss objective.

Load-bearing premise

That jointly optimizing importance-aware suppression on unlearning data and attention-guided enhancement on retained data can localize forgetting without disrupting overall language capabilities or unrelated knowledge.

What would settle it

Running the Attention-Shifting method on the ToFU or TDEC benchmarks and finding accuracy equal to or lower than existing unlearning baselines, or observing persistent hallucinated responses on queries about the unlearned content.

Figures

Figures reproduced from arXiv: 2510.17210 by Chenchen Tan, Cunjian Chen, Hui Zhang, Longxiang Gao, Shujie Cui, Xinghao Li, Youyang Qu.

**Figure 2.** Illustration of the proposed Attention-Shifting based unlearning in LLMs. An adapter…

**Figure 3.** The performance of unlearning methods with…

**Figure 4.** Outputs hallucination and reproduction rates across different unlearning methods. Accompanying text: "… how often the model regenerates the unlearned content or its paraphrased variants. Both metrics are assessed by GPT-4 [38], comparing the model outputs with the original content."

**Figure 5.** Model utility across unlearning levels under different model performance retaining strategies. Accompanying text (Retaining Method Usage Ablation): "To analyse the contribution of each component in our AS framework, we conduct an ablation study by separating ASP and AKL."

**Figure 6.** Evaluation of model utility accuracy degradation under a fixed unlearning threshold across…

**Figure 7.** Model utility degradation across multiple continue-unlearning requests (4 samples) for…

**Figure 8.** The left heatmap shows the original model’s attention, where fact-bearing tokens such…

**Figure 9.** Gradient cosine similarity between the unlearn and retain losses with respect to the adapter…

**Figure 10.** Evaluation of model utility degradation under a fixed unlearning threshold across varying…
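The Figure 9 caption refers to the gradient cosine similarity between the unlearn and retain losses. A minimal sketch of that diagnostic follows, assuming both gradients have already been flattened into plain lists of floats. The function name `cosine_similarity` and the zero-gradient convention are choices made here, not details taken from the paper.

```python
import math


def cosine_similarity(g_unlearn, g_retain):
    """Cosine of the angle between two flattened gradient vectors."""
    if len(g_unlearn) != len(g_retain):
        raise ValueError("gradients must have the same length")
    dot = sum(a * b for a, b in zip(g_unlearn, g_retain))
    norm_u = math.sqrt(sum(a * a for a in g_unlearn))
    norm_r = math.sqrt(sum(b * b for b in g_retain))
    if norm_u == 0.0 or norm_r == 0.0:
        return 0.0  # convention chosen here: a zero gradient counts as no alignment
    return dot / (norm_u * norm_r)


# Toy vectors (illustrative only): orthogonal, aligned, and opposed gradients.
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))
```

Values near zero would suggest the two objectives update largely independent directions, while strongly negative values would suggest that the unlearn and retain updates work against each other.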

Original abstract

The increase in computing power and the necessity of AI-assisted decision-making boost the growing application of large language models (LLMs). Along with this, the potential retention of sensitive data of LLMs has spurred increasing research into machine unlearning. However, existing unlearning approaches face a critical dilemma: Aggressive unlearning compromises model utility, while conservative strategies preserve utility but risk hallucinated responses. This significantly limits LLMs' reliability in knowledge-intensive applications. To address this, we introduce a novel Attention-Shifting (AS) framework for selective unlearning. AS is driven by two design objectives: (1) context-preserving suppression that attenuates attention to fact-bearing tokens without disrupting LLMs' linguistic structure; and (2) hallucination-resistant response shaping that discourages fabricated completions when queried about unlearning content. AS realizes these objectives through two attention-level interventions, which are importance-aware suppression applied to the unlearning set to reduce reliance on memorized knowledge and attention-guided retention enhancement that reinforces attention toward semantically essential tokens in the retained dataset to mitigate unintended degradation. These two components are jointly optimized via a dual-loss objective, which forms a soft boundary that localizes unlearning while preserving unrelated knowledge under representation superposition. Experimental results show that AS improves performance preservation over the state-of-the-art unlearning methods, achieving up to 15% higher accuracy on the ToFU benchmark and 10% on the TDEC benchmark, while maintaining competitive hallucination-free unlearning effectiveness. Compared to existing methods, AS demonstrates a superior balance between unlearning effectiveness, generalization, and response reliability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's attention-shifting unlearning method reports useful benchmark gains on utility preservation but the localization claim under superposition lacks the ablations needed to confirm it works cleanly.

The main takeaway is that this work shifts unlearning from gradient changes to attention interventions, using importance-aware suppression on unlearning facts and retention enhancement on kept data under a joint dual loss. That combination is the actual novelty here, and it targets the practical problem of deleting knowledge without triggering hallucinations on the forgotten topics while keeping general performance intact. The reported results show up to 15% better accuracy on ToFU and 10% on TDEC versus prior methods, which is a concrete improvement worth noting if the controls hold. The framing around forming a soft boundary that respects representation superposition is a reasonable way to think about the trade-off.

The paper does a solid job laying out why existing unlearning either over-deletes or leaves hallucination risks, and the benchmark comparisons give a clear sense of where the gains appear. The soft spots are the missing details on token selection for the attention masks, the exact dual-loss weighting, and any ablations that would show the two interventions do not interfere. The stress-test point about attention overlap is a real concern here; if heads or representations handle both memorized facts and broader language structure, suppression on unlearning queries could still shift patterns for retained queries and erode the accuracy numbers. The abstract presents the gains without error bars or sensitivity checks, so the central claim that the method localizes effects without unintended degradation needs more evidence from the full experiments.

This is aimed at researchers working on machine unlearning and LLM reliability for data-deletion scenarios. Readers who care about practical fixes for the utility-hallucination tension would get value from the comparisons, even if they want to re-run the ablations themselves. It deserves peer review to get the implementation details and robustness checks properly examined.

Referee Report

2 major / 2 minor

Summary. The paper claims to introduce an Attention-Shifting (AS) framework for selective, hallucination-free unlearning in LLMs. It combines importance-aware suppression on fact-bearing tokens from the unlearning set with attention-guided retention enhancement on semantically essential tokens from the retained set; these interventions are jointly optimized through a dual-loss objective that purportedly forms a soft boundary localizing unlearning effects while preserving linguistic structure and unrelated knowledge under representation superposition. Experiments report up to 15% higher accuracy on the ToFU benchmark and 10% on the TDEC benchmark relative to prior unlearning methods, while maintaining competitive unlearning effectiveness.

Significance. If the empirical claims are robust, the work would meaningfully advance LLM unlearning by offering a practical mechanism that better balances forgetting of sensitive data against preservation of model utility and response reliability, addressing a central practical limitation in knowledge-intensive deployments.

major comments (2)

[Abstract / Experimental Results] Abstract and Experimental Results: the reported accuracy gains (15% on ToFU, 10% on TDEC) are presented without error bars, statistical tests, or ablations on the dual-loss weighting parameter; this is load-bearing for the central claim of superior performance preservation over SOTA methods.
[Method (dual-loss objective)] Method description of the dual-loss objective: the claim that importance-aware suppression and attention-guided retention can be jointly optimized to localize unlearning without eroding retained knowledge rests on the untested assumption that attention interventions remain separable despite representation superposition; no analysis or experiment addresses whether the same heads or token representations participate in both memorized facts and general linguistic structure.

minor comments (2)

[Abstract] The abstract could more explicitly define the hallucination-free metric used to claim competitive effectiveness.
[Method] Notation and implementation details for the attention interventions would benefit from a concise pseudocode or equation block to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the empirical reporting and methodological analysis.

Referee: [Abstract / Experimental Results] Abstract and Experimental Results: the reported accuracy gains (15% on ToFU, 10% on TDEC) are presented without error bars, statistical tests, or ablations on the dual-loss weighting parameter; this is load-bearing for the central claim of superior performance preservation over SOTA methods.

Authors: We agree that error bars, statistical tests, and ablations on the dual-loss weighting parameter are necessary to robustly support the performance claims. In the revised manuscript, we will include results averaged over multiple random seeds with standard error bars, apply statistical significance tests (such as paired t-tests) against baseline methods, and add an ablation study varying the dual-loss weighting parameter to demonstrate its effect on the unlearning-utility trade-off.

revision: yes

Referee: [Method (dual-loss objective)] Method description of the dual-loss objective: the claim that importance-aware suppression and attention-guided retention can be jointly optimized to localize unlearning without eroding retained knowledge rests on the untested assumption that attention interventions remain separable despite representation superposition; no analysis or experiment addresses whether the same heads or token representations participate in both memorized facts and general linguistic structure.

Authors: The dual-loss objective is intended to create a soft boundary that localizes unlearning effects while preserving unrelated knowledge, as supported by the empirical utility preservation results. We acknowledge that a direct examination of whether the same attention heads or token representations are involved in both unlearning targets and retained linguistic structure would provide additional insight into separability under superposition. We will incorporate such an analysis (e.g., via attention map comparisons or representation probing) in the revised version.

revision: yes
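The seed-averaged reporting promised in the first response (standard error bars and a paired t-test against a baseline) can be sketched as follows. The helper names are hypothetical and the scores are made-up placeholders, not results from the paper. The sketch stops at the t statistic and degrees of freedom, because converting them to a p-value needs a t-distribution that the standard library does not provide.

```python
import math
import statistics


def mean_and_stderr(values):
    """Sample mean and standard error of the mean across seeds."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two seeds")
    return statistics.mean(values), statistics.stdev(values) / math.sqrt(n)


def paired_t_statistic(method_scores, baseline_scores):
    """Paired t statistic and degrees of freedom for per-seed score pairs."""
    if len(method_scores) != len(baseline_scores):
        raise ValueError("scores must be paired seed by seed")
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    mean_diff, stderr = mean_and_stderr(diffs)
    if stderr == 0.0:
        raise ValueError("differences have zero variance")
    return mean_diff / stderr, len(diffs) - 1


# Placeholder accuracies for five seeds (illustrative values only).
method = [0.71, 0.69, 0.73, 0.70, 0.72]
baseline = [0.62, 0.60, 0.61, 0.63, 0.59]
print(mean_and_stderr(method))
print(paired_t_statistic(method, baseline))
```

If SciPy is available, `scipy.stats.ttest_rel` performs the same paired test and also returns the p-value.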

Circularity Check

0 steps flagged

No significant circularity detected

The paper introduces an empirical Attention-Shifting framework consisting of importance-aware suppression and attention-guided retention enhancement, jointly optimized by a dual-loss objective and evaluated on external benchmarks (ToFU, TDEC). No equations, fitted parameters, or self-citations are presented in the provided text that reduce the reported accuracy gains or unlearning effectiveness to quantities defined by construction from the method's own inputs. The central claims rest on experimental outcomes against independent test sets rather than any self-referential renaming, ansatz smuggling, or load-bearing self-citation chain, rendering the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on standard transformer attention mechanics and the assumption that attention weights can be selectively modulated without breaking next-token prediction coherence.

axioms (1)

Domain assumption: Transformer attention layers can be intervened upon at inference or fine-tuning time to attenuate attention to fact-bearing tokens while preserving overall linguistic structure.
Invoked in the description of context-preserving suppression.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LLM Ghostbusters: Surgical Hallucination Suppression via Adaptive Unlearning
cs.CR 2026-05 unverdicted novelty 6.0

Adaptive Unlearning suppresses package hallucinations in code-generating LLMs by 81% while preserving benchmark performance, using model-generated data and no human labels.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · cited by 1 Pith paper · 6 internal anchors

[1]

Snapkv: Llm knows what you are looking for before generation

Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. Snapkv: Llm knows what you are looking for before generation. Advances in Neural Information Processing Systems, 37:22947–22970, 2024

work page 2024
[2]

Extracting training data from large language models

Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21), pages 2633–2650, 2021

work page 2021
[3]

Jailbroken: How does llm safety training fail?

Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail? Advances in Neural Information Processing Systems, 36, 2024

work page 2024
[4]

The eu general data protection regulation (gdpr). A practical guide

Paul Voigt and Axel Von dem Bussche. The eu general data protection regulation (gdpr). A practical guide, 1st ed., Cham: Springer International Publishing, 10(3152676):10–5555, 2017

work page 2017
[5]

Knowledge unlearning for mitigating privacy risks in language models

Joel Jang, Dongkeun Yoon, Sohee Yang, Sungmin Cha, Moontae Lee, Lajanugen Logeswaran, and Minjoon Seo. Knowledge unlearning for mitigating privacy risks in language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14389–14408, Toronto, Canada, July 2023. Association for Compu...

work page 2023
[6]

Unlearn what you want to forget: Efficient unlearning for LLMs

Jiaao Chen and Diyi Yang. Unlearn what you want to forget: Efficient unlearning for LLMs. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023

work page 2023
[7]

Rethinking machine unlearning for large language models

Sijia Liu, Yuanshun Yao, Jinghan Jia, Stephen Casper, Nathalie Baracaldo, Peter Hase, Yuguang Yao, Chris Yuhao Liu, Xiaojun Xu, Hang Li, et al. Rethinking machine unlearning for large language models. Nature Machine Intelligence, pages 1–14, 2025

work page 2025
[8]

Large language model unlearning via embedding-corrupted prompts

Chris Liu, Yaxuan Wang, Jeffrey Flanigan, and Yang Liu. Large language model unlearning via embedding-corrupted prompts. Advances in Neural Information Processing Systems, 37:118198–118266, 2024

work page 2024
[9]

Ulmr: Unlearning large language models via negative response and model parameter average

Shaojie Shi, Xiaoyu Tan, Xihe Qiu, Chao Qu, Kexin Nie, Yuan Cheng, Wei Chu, Xu Yinghui, and Yuan Qi. Ulmr: Unlearning large language models via negative response and model parameter average. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 755–762, 2024

work page 2024
[10]

LLM unlearning via loss adjustment with only forget data

Yaxuan Wang, Jiaheng Wei, Chris Yuhao Liu, Jinlong Pang, Quan Liu, Ankit Shah, Yujia Bao, Yang Liu, and Wei Wei. LLM unlearning via loss adjustment with only forget data. In The Thirteenth International Conference on Learning Representations, 2025

work page 2025
[11]

Towards robust and parameter-efficient knowledge unlearning for LLMs

Sungmin Cha, Sungjun Cho, Dasol Hwang, and Moontae Lee. Towards robust and parameter-efficient knowledge unlearning for LLMs. In The Thirteenth International Conference on Learning Representations, 2025

work page 2025
[12]

Reversing the forget-retain objectives: An efficient llm unlearning framework from logit difference

Jiabao Ji, Yujian Liu, Yang Zhang, Gaowen Liu, Ramana Kompella, Sijia Liu, and Shiyu Chang. Reversing the forget-retain objectives: An efficient llm unlearning framework from logit difference. Advances in Neural Information Processing Systems, 37:12581–12611, 2024

work page 2024
[13]

Offset unlearning for large language models

James Y Huang, Wenxuan Zhou, Fei Wang, Fred Morstatter, Sheng Zhang, Hoifung Poon, and Muhao Chen. Offset unlearning for large language models. arXiv preprint arXiv:2404.11045, 2024

work page arXiv 2024
[14]

A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions

Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2):1–55, 2025

work page 2025
[15]

Quantifying the uncertainty of llm hallucination spreading in complex adaptive social networks

Guozhi Hao, Jun Wu, Qianqian Pan, and Rosario Morello. Quantifying the uncertainty of llm hallucination spreading in complex adaptive social networks. Scientific reports, 14(1):16375, 2024

work page 2024
[16]

Forget to flourish: Leveraging machine-unlearning on pretrained language models for privacy leakage

Md Rafi Ur Rashid, Jing Liu, Toshiaki Koike-Akino, Ye Wang, and Shagufta Mehnaz. Forget to flourish: Leveraging machine-unlearning on pretrained language models for privacy leakage. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 20139–20147, 2025

work page 2025
[17]

To forget or not? towards practical knowledge unlearning for large language models

Bozhong Tian, Xiaozhuan Liang, Siyuan Cheng, Qingbin Liu, Mengru Wang, Dianbo Sui, Xi Chen, Huajun Chen, and Ningyu Zhang. To forget or not? towards practical knowledge unlearning for large language models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 1524–1537, 2024

work page 2024
[18]

A closer look at machine unlearning for large language models

Xiaojian Yuan, Tianyu Pang, Chao Du, Kejiang Chen, Weiming Zhang, and Min Lin. A closer look at machine unlearning for large language models. In The Thirteenth International Conference on Learning Representations, 2025

work page 2025
[19]

Toy Models of Superposition

Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, et al. Toy models of superposition. arXiv preprint arXiv:2209.10652, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[20]

The rise of parameter specialization for knowledge storage in large language models

Yihuai Hong, Yiran Zhao, Wei Tang, Yang Deng, Yu Rong, and Wenxuan Zhang. The rise of parameter specialization for knowledge storage in large language models. arXiv preprint arXiv:2505.17260, 2025

work page arXiv 2025
[21]

TOFU: A task of fictitious unlearning for LLMs

Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary Chase Lipton, and J Zico Kolter. TOFU: A task of fictitious unlearning for LLMs. In First Conference on Language Modeling, 2024

work page 2024
[22]

Negative preference optimization: From catastrophic collapse to effective unlearning

Ruiqi Zhang, Licong Lin, Yu Bai, and Song Mei. Negative preference optimization: From catastrophic collapse to effective unlearning. In First Conference on Language Modeling, 2024

work page 2024
[23]

Large language model unlearning

Yuanshun Yao, Xiaojun Xu, and Yang Liu. Large language model unlearning. Advances in Neural Information Processing Systems, 37:105425–105475, 2024

work page 2024
[24]

Selective forgetting: Advancing machine unlearning techniques and evaluation in language models

Lingzhi Wang, Xingshan Zeng, Jinsong Guo, Kam-Fai Wong, and Georg Gottlob. Selective forgetting: Advancing machine unlearning techniques and evaluation in language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 843–851, 2025

work page 2025
[25]

Eraser: Jailbreaking defense in large language models via unlearning harmful knowledge

Weikai Lu, Ziqian Zeng, Jianwei Wang, Zhengdong Lu, Zelin Chen, Huiping Zhuang, and Cen Chen. Eraser: Jailbreaking defense in large language models via unlearning harmful knowledge. arXiv preprint arXiv:2404.05880, 2024

work page arXiv 2024
[26]

Towards safer large language models through machine unlearning

Zheyuan Liu, Guangyao Dou, Zhaoxuan Tan, Yijun Tian, and Meng Jiang. Towards safer large language models through machine unlearning. In Findings of the Association for Computational Linguistics ACL 2024, pages 1817–1829, 2024

work page 2024
[27]

Meow: Memory supervised llm unlearning via inverted facts

Tianle Gu, Kexin Huang, Ruilin Luo, Yuanqi Yao, Yujiu Yang, Yan Teng, and Yingchun Wang. Meow: Memory supervised llm unlearning via inverted facts. arXiv preprint arXiv:2409.11844, 2024

work page arXiv 2024
[28]

Who’s harry potter? approximate unlearning in llms

Ronen Eldan and Mark Russinovich. Who’s harry potter? approximate unlearning in llms. arXiv e-prints, pages arXiv–2310, 2023

work page 2023
[29]

Safety alignment should be made more than just a few tokens deep

Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu, Xiao Ma, Subhrajit Roy, Ahmad Beirami, Prateek Mittal, and Peter Henderson. Safety alignment should be made more than just a few tokens deep. In The Thirteenth International Conference on Learning Representations, 2025

work page 2025
[30]

Soft prompting for unlearning in large language models

Karuna Bhaila, Minh-Hao Van, and Xintao Wu. Soft prompting for unlearning in large language models. arXiv preprint arXiv:2406.12038, 2024

work page arXiv 2024
[31]

Shifting attention to relevance: Towards the predictive uncertainty quantification of free-form large language models

Jinhao Duan, Hao Cheng, Shiqi Wang, Alex Zavalny, Chenan Wang, Renjing Xu, Bhavya Kailkhura, and Kaidi Xu. Shifting attention to relevance: Towards the predictive uncertainty quantification of free-form large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5050–5063, 2024

work page 2024
[32]

Language Models (Mostly) Know What They Know

Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[33]

The LAMBADA dataset: Word prediction requiring a broad discourse context

Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The lambada dataset: Word prediction requiring a broad discourse context. arXiv preprint arXiv:1606.06031, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[34]

PubMedQA: A Dataset for Biomedical Research Question Answering

Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W Cohen, and Xinghua Lu. Pubmedqa: A dataset for biomedical research question answering. arXiv preprint arXiv:1909.06146, 2019

work page arXiv 2019
[35]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[36]

Gpt-neo: Large scale autoregressive language modeling with mesh-tensorflow

Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman. Gpt-neo: Large scale autoregressive language modeling with mesh-tensorflow, 2021. 5297715, 2022

work page 2021
[37]

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2020
[38]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[39]

Lora: Low-rank adaptation of large language models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.

Supplemental Material. A Theoretical Analysis of Attention Shifting. A.1 Effect of Attention Reweighting on Output Distribution. We begin by analyzing how modifying the attent...

work page 2022

[1] [1]

Snapkv: Llm knows what you are looking for before generation

Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. Snapkv: Llm knows what you are looking for before generation. Advances in Neural Information Processing Systems, 37:22947–22970, 2024

work page 2024

[2] [2]

Extracting training data from large language models

Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-V oss, Kather- ine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21), pages 2633–2650, 2021

work page 2021

[3] [3]

Jailbroken: How does llm safety training fail? Advances in Neural Information Processing Systems, 36, 2024

Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail? Advances in Neural Information Processing Systems, 36, 2024

work page 2024

[4] [4]

The eu general data protection regulation (gdpr).A practical guide, 1st ed., Cham: Springer International Publishing, 10(3152676):10–5555, 2017

Paul V oigt and Axel V on dem Bussche. The eu general data protection regulation (gdpr).A practical guide, 1st ed., Cham: Springer International Publishing, 10(3152676):10–5555, 2017

work page 2017

[5] [5]

Knowledge unlearning for mitigating privacy risks in language models

Joel Jang, Dongkeun Yoon, Sohee Yang, Sungmin Cha, Moontae Lee, Lajanugen Logeswaran, and Minjoon Seo. Knowledge unlearning for mitigating privacy risks in language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (V olume1: Long Papers), pages 14389–14408, Toronto, Canada, July 2023. Association for Compu...

work page 2023

[6] [6]

Unlearn what you want to forget: Efficient unlearning for LLMs

Jiaao Chen and Diyi Yang. Unlearn what you want to forget: Efficient unlearning for LLMs. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023

work page 2023

[7] [7]

Rethinking machine unlearning for large language models

Sijia Liu, Yuanshun Yao, Jinghan Jia, Stephen Casper, Nathalie Baracaldo, Peter Hase, Yuguang Yao, Chris Yuhao Liu, Xiaojun Xu, Hang Li, et al. Rethinking machine unlearning for large language models. Nature Machine Intelligence, pages 1–14, 2025

work page 2025

[8] [8]

Large language model unlearn- ing via embedding-corrupted prompts

Chris Liu, Yaxuan Wang, Jeffrey Flanigan, and Yang Liu. Large language model unlearn- ing via embedding-corrupted prompts. Advances in Neural Information Processing Systems, 37:118198–118266, 2024

work page 2024

[9] [9]

Ulmr: Unlearning large language models via negative response and model parameter average

Shaojie Shi, Xiaoyu Tan, Xihe Qiu, Chao Qu, Kexin Nie, Yuan Cheng, Wei Chu, Xu Yinghui, and Yuan Qi. Ulmr: Unlearning large language models via negative response and model parameter average. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 755–762, 2024

work page 2024

[10] [10]

LLM unlearning via loss adjustment with only forget data

Yaxuan Wang, Jiaheng Wei, Chris Yuhao Liu, Jinlong Pang, Quan Liu, Ankit Shah, Yujia Bao, Yang Liu, and Wei Wei. LLM unlearning via loss adjustment with only forget data. In The Thirteenth International Conference on Learning Representations, 2025

[11] [11]

Towards robust and parameter-efficient knowledge unlearning for LLMs

Sungmin Cha, Sungjun Cho, Dasol Hwang, and Moontae Lee. Towards robust and parameter-efficient knowledge unlearning for LLMs. In The Thirteenth International Conference on Learning Representations, 2025

[12] [12]

Reversing the forget-retain objectives: An efficient llm unlearning framework from logit difference

Jiabao Ji, Yujian Liu, Yang Zhang, Gaowen Liu, Ramana Kompella, Sijia Liu, and Shiyu Chang. Reversing the forget-retain objectives: An efficient llm unlearning framework from logit difference. Advances in Neural Information Processing Systems, 37:12581–12611, 2024

[13] [13]

Offset unlearning for large language models

James Y Huang, Wenxuan Zhou, Fei Wang, Fred Morstatter, Sheng Zhang, Hoifung Poon, and Muhao Chen. Offset unlearning for large language models. arXiv preprint arXiv:2404.11045, 2024.

[14] [14]

A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions

Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2):1–55, 2025

[15] [15]

Quantifying the uncertainty of llm hallucination spreading in complex adaptive social networks

Guozhi Hao, Jun Wu, Qianqian Pan, and Rosario Morello. Quantifying the uncertainty of llm hallucination spreading in complex adaptive social networks. Scientific reports, 14(1):16375, 2024

[16] [16]

Forget to flourish: Leveraging machine-unlearning on pretrained language models for privacy leakage

Md Rafi Ur Rashid, Jing Liu, Toshiaki Koike-Akino, Ye Wang, and Shagufta Mehnaz. Forget to flourish: Leveraging machine-unlearning on pretrained language models for privacy leakage. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 20139–20147, 2025

[17] [17]

To forget or not? Towards practical knowledge unlearning for large language models

Bozhong Tian, Xiaozhuan Liang, Siyuan Cheng, Qingbin Liu, Mengru Wang, Dianbo Sui, Xi Chen, Huajun Chen, and Ningyu Zhang. To forget or not? Towards practical knowledge unlearning for large language models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 1524–1537, 2024

[18] [18]

A closer look at machine unlearning for large language models

Xiaojian Yuan, Tianyu Pang, Chao Du, Kejiang Chen, Weiming Zhang, and Min Lin. A closer look at machine unlearning for large language models. In The Thirteenth International Conference on Learning Representations, 2025

[19] [19]

Toy Models of Superposition

Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, et al. Toy models of superposition. arXiv preprint arXiv:2209.10652, 2022

[20] [20]

The rise of parameter specialization for knowledge storage in large language models

Yihuai Hong, Yiran Zhao, Wei Tang, Yang Deng, Yu Rong, and Wenxuan Zhang. The rise of parameter specialization for knowledge storage in large language models. arXiv preprint arXiv:2505.17260, 2025

[21] [21]

TOFU: A task of fictitious unlearning for LLMs

Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary Chase Lipton, and J Zico Kolter. TOFU: A task of fictitious unlearning for LLMs. In First Conference on Language Modeling, 2024

[22] [22]

Negative preference optimization: From catastrophic collapse to effective unlearning

Ruiqi Zhang, Licong Lin, Yu Bai, and Song Mei. Negative preference optimization: From catastrophic collapse to effective unlearning. In First Conference on Language Modeling, 2024

[23] [23]

Large language model unlearning

Yuanshun Yao, Xiaojun Xu, and Yang Liu. Large language model unlearning. Advances in Neural Information Processing Systems, 37:105425–105475, 2024

[24] [24]

Selective forgetting: Advancing machine unlearning techniques and evaluation in language models

Lingzhi Wang, Xingshan Zeng, Jinsong Guo, Kam-Fai Wong, and Georg Gottlob. Selective forgetting: Advancing machine unlearning techniques and evaluation in language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 843–851, 2025

[25] [25]

Eraser: Jailbreaking defense in large language models via unlearning harmful knowledge

Weikai Lu, Ziqian Zeng, Jianwei Wang, Zhengdong Lu, Zelin Chen, Huiping Zhuang, and Cen Chen. Eraser: Jailbreaking defense in large language models via unlearning harmful knowledge. arXiv preprint arXiv:2404.05880, 2024

[26] [26]

Towards safer large language models through machine unlearning

Zheyuan Liu, Guangyao Dou, Zhaoxuan Tan, Yijun Tian, and Meng Jiang. Towards safer large language models through machine unlearning. In Findings of the Association for Computational Linguistics: ACL 2024, pages 1817–1829, 2024

[27] [27]

Meow: Memory supervised llm unlearning via inverted facts

Tianle Gu, Kexin Huang, Ruilin Luo, Yuanqi Yao, Yujiu Yang, Yan Teng, and Yingchun Wang. Meow: Memory supervised llm unlearning via inverted facts. arXiv preprint arXiv:2409.11844, 2024

[28] [28]

Who’s harry potter? approximate unlearning in llms

Ronen Eldan and Mark Russinovich. Who’s harry potter? approximate unlearning in llms. arXiv e-prints, pages arXiv–2310, 2023

[29] [29]

Safety alignment should be made more than just a few tokens deep

Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu, Xiao Ma, Subhrajit Roy, Ahmad Beirami, Prateek Mittal, and Peter Henderson. Safety alignment should be made more than just a few tokens deep. In The Thirteenth International Conference on Learning Representations, 2025.

[30] [30]

Soft prompting for unlearning in large language models

Karuna Bhaila, Minh-Hao Van, and Xintao Wu. Soft prompting for unlearning in large language models. arXiv preprint arXiv:2406.12038, 2024

[31] [31]

Shifting attention to relevance: Towards the predictive uncertainty quantification of free-form large language models

Jinhao Duan, Hao Cheng, Shiqi Wang, Alex Zavalny, Chenan Wang, Renjing Xu, Bhavya Kailkhura, and Kaidi Xu. Shifting attention to relevance: Towards the predictive uncertainty quantification of free-form large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5050–5063, 2024

[32] [32]

Language Models (Mostly) Know What They Know

Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221, 2022

[33] [33]

The LAMBADA dataset: Word prediction requiring a broad discourse context

Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The lambada dataset: Word prediction requiring a broad discourse context. arXiv preprint arXiv:1606.06031, 2016

[34] [34]

PubMedQA: A Dataset for Biomedical Research Question Answering

Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W Cohen, and Xinghua Lu. Pubmedqa: A dataset for biomedical research question answering. arXiv preprint arXiv:1909.06146, 2019

[35] [35]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

[36] [36]

Gpt-neo: Large scale autoregressive language modeling with mesh-tensorflow, 2021

Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman. Gpt-neo: Large scale autoregressive language modeling with mesh-tensorflow, 2021. 5297715, 2022

[37] [37]

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020

[38] [38]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

[39] [39]

Lora: Low-rank adaptation of large language models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.

Supplemental Material

A Theoretical Analysis of Attention Shifting

A.1 Effect of Attention Reweighting on Output Distribution

We begin by analyzing how modifying the attent...
