Inference-Time Machine Unlearning via Gated Activation Redirection

Christian Mattjie; Flavio du Pin Calmon; Joana Pasquali; Jo\~ao Vitor Boer Abitante; Kristen K. Arguello; Lucas S. Kupssinsk\"u; Ot\'avio Parraga; Ramiro N. Barros; Rodrigo C. Barros; Vin\'icius Conte Turani

arxiv: 2605.12765 · v2 · pith:GUD5X3KTnew · submitted 2026-05-12 · 💻 cs.LG

Inference-Time Machine Unlearning via Gated Activation Redirection

Vin\'icius Conte Turani , Ot\'avio Parraga , Jo\~ao Vitor Boer Abitante , Kristen K. Arguello , Joana Pasquali , Ramiro N. Barros , Flavio du Pin Calmon , Christian Mattjie

show 2 more authors

Rodrigo C. Barros Lucas S. Kupssinsk\"u

This is my paper

Pith reviewed 2026-05-20 21:51 UTC · model grok-4.3

classification 💻 cs.LG

keywords machine unlearningactivation engineeringinference timelarge language modelsresidual streamgated redirectionquantizationprivacy

0 comments

The pith

GUARD-IT unlearns specific data from large language models by steering activations at inference time without updating weights.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents GUARD-IT as a way to remove the influence of targeted training data from LLMs at inference time. It does this by using a gating mechanism to create input-specific rotations of activations in the residual stream. The goal is to approximate a model that was never trained on the forget set while keeping performance on other data. This matters because current unlearning methods require expensive retraining or cause side effects, whereas this approach leaves the model weights unchanged and works even after quantization.

Core claim

The paper claims that GUARD-IT, a training- and gradient-free method, achieves machine unlearning by applying input-dependent activation steering as norm-preserving rotations in the residual stream during inference. Experiments on the TOFU and MUSE benchmarks demonstrate that it matches or exceeds twelve gradient-based baselines across three model scales. It is the only method tested that simultaneously preserves utility, suppresses memorization, and avoids catastrophic collapse in all settings. The method also enables continual unlearning without retraining and maintains effectiveness when models are quantized.

What carries the argument

Gated activation redirection, which computes an input-dependent steering vector and applies it as a norm-preserving rotation in the residual stream to steer behavior without weight changes.

If this is right

Unlearning can be performed without any gradient computations or parameter updates.
The approach works on quantized models where weight-editing methods fail.
Continual unlearning is possible by applying successive interventions without retraining.
Utility and memorization suppression hold across small, medium, and large model scales.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Such inference-time methods could make it feasible to comply with data deletion requests in real-time for deployed AI systems.
The technique might be adapted to other goals like reducing hallucinations or enforcing safety constraints dynamically.
If the rotations are truly norm-preserving, they may preserve more of the original model's capabilities than additive steering vectors.

Load-bearing premise

That the gating mechanism can produce rotations removing targeted information without unintended side effects or performance degradation on unrelated tasks.

What would settle it

If the model after GUARD-IT still generates content from the forget set on some inputs or shows reduced accuracy on standard benchmarks, the central claim would be falsified.

read the original abstract

Large Language Models memorize vast amounts of training data, raising concerns regarding privacy, copyright infringement, and safety. Machine unlearning seeks to remove the influence of a targeted forget set while preserving model performance, ideally approximating a model retrained from scratch without the forget set. Existing approaches aim to achieve this by updating model parameters via gradient-based methods. However, these updates are computationally expensive, lead to irreversible weight changes, and degrade when the model is quantized for deployment. A recent alternative to changing model weights is activation engineering, where activations are changed during inference to steer model behavior. Despite circumventing weight editing, naive activation steering introduces its own failure modes, as a single global steering vector applies the same intervention to every input, leading to unintended changes in model behavior. We introduce Inference-Time Unlearning via Gated Activation Redirection (GUARD-IT), a training- and gradient-free method that unlearns via input-dependent activation steering at inference time. The resulting intervention is applied as a norm-preserving rotation in the residual stream, leaving model weights untouched. Experiments on TOFU and MUSE show that GUARD-IT matches or exceeds 12 gradient-based baselines across three model scales, while being the only method to simultaneously preserve utility, suppress memorization, and avoid catastrophic collapse across all settings. GUARD-IT further supports continual unlearning without retraining, and remains effective under quantization, a scenario in which parameter-editing methods degrade.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

GUARD-IT offers a training-free inference-time unlearning method via input-dependent gated rotations that holds up on TOFU and MUSE while staying stable under quantization, though the gate computation itself stays thin on details.

read the letter

The main thing to know is that this paper gives a practical way to handle unlearning requests in deployed models by redirecting activations at inference time instead of retraining or editing weights. The approach uses a gate to make the intervention input-specific and applies it as a norm-preserving rotation in the residual stream, which keeps the model weights untouched and reversible.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces GUARD-IT, a training- and gradient-free method for machine unlearning in LLMs that performs input-dependent activation steering at inference time. It applies gated redirections in the residual stream as norm-preserving rotations without altering model weights. On TOFU and MUSE benchmarks, GUARD-IT is reported to match or exceed 12 gradient-based baselines across three model scales while being the only approach that simultaneously preserves utility, suppresses memorization, and avoids catastrophic collapse; it also supports continual unlearning without retraining and remains effective under quantization.

Significance. If the empirical claims hold with adequate controls, the work would be significant for offering a practical, reversible, and quantization-robust alternative to parameter-editing unlearning methods. The inference-time approach addresses computational and deployment limitations of gradient-based techniques and extends activation engineering to selective unlearning, which could influence future research on efficient post-training interventions.

major comments (2)

[§3.2] §3.2: The gating function and rotation construction are described conceptually as a fixed, non-learned computation that produces input-specific norm-preserving rotations, but no explicit equation or algorithm is given for how the gate is derived from the forget set (e.g., no definition of the similarity measure, projection, or activation threshold). This is load-bearing for the central claim that the method is training-free yet avoids the failure modes of global steering.
[§5.1, Table 1] §5.1, Table 1: The headline claim that GUARD-IT is the only method to simultaneously preserve utility, suppress memorization, and avoid collapse across all settings relies on aggregate performance numbers, but the table reports point estimates without error bars, run counts, or statistical significance tests against the 12 baselines; this weakens the uniqueness assertion.

minor comments (3)

[Abstract] Abstract: The phrase 'three model scales' is used without naming the specific models or parameter counts.
[§4] §4: The description of the norm-preserving property of the rotation would benefit from a short proof sketch or reference to the relevant linear-algebra fact.
[Figure 3] Figure 3: Axis labels and legend entries are too small for readability in the printed version.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments. We address each major comment point by point below and outline the revisions we will make.

read point-by-point responses

Referee: [§3.2] §3.2: The gating function and rotation construction are described conceptually as a fixed, non-learned computation that produces input-specific norm-preserving rotations, but no explicit equation or algorithm is given for how the gate is derived from the forget set (e.g., no definition of the similarity measure, projection, or activation threshold). This is load-bearing for the central claim that the method is training-free yet avoids the failure modes of global steering.

Authors: We agree that an explicit mathematical formulation is necessary to fully substantiate the central claims. In the revised manuscript we will expand §3.2 with the precise equations and algorithm: the gate is computed from a similarity measure between the input activation and a prototype derived from the forget-set activations, followed by a projection step and a threshold that determines whether redirection is applied. The rotation itself is constructed as a norm-preserving operation in the residual stream. These additions will make transparent how the input-dependent mechanism is obtained without any training or gradient steps. revision: yes
Referee: [§5.1, Table 1] §5.1, Table 1: The headline claim that GUARD-IT is the only method to simultaneously preserve utility, suppress memorization, and avoid collapse across all settings relies on aggregate performance numbers, but the table reports point estimates without error bars, run counts, or statistical significance tests against the 12 baselines; this weakens the uniqueness assertion.

Authors: We acknowledge that the current presentation of Table 1 uses point estimates and would be strengthened by error bars and run counts. Each configuration was evaluated once owing to the high computational cost of the large-model experiments and the full set of baselines. In the revision we will add error bars from repeated runs on the smaller models and include a note on consistency across the three model scales reported in the main results and supplement. The uniqueness claim rests on the observed pattern that GUARD-IT is the only method satisfying all three criteria in every setting; we will qualify the language to reflect the empirical scope while retaining the comparative observation. revision: partial

Circularity Check

0 steps flagged

No circularity: method is training-free empirical construction with no derivation reducing to inputs

full rationale

The paper presents GUARD-IT explicitly as a training- and gradient-free inference-time technique that applies input-dependent norm-preserving rotations in the residual stream. No equations, fitting procedures, or self-citations are shown that would make any claimed performance outcome equivalent to its own inputs by construction. The central claims rest on empirical comparisons against 12 baselines on TOFU and MUSE across model scales, with no load-bearing uniqueness theorem, ansatz smuggling, or renaming of known results. The derivation chain is therefore self-contained as a novel algorithmic proposal rather than a tautological reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The method implicitly assumes that activation steering can be made input-dependent and norm-preserving without side effects.

pith-pipeline@v0.9.0 · 5836 in / 1222 out tokens · 82104 ms · 2026-05-20T21:51:21.408844+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We introduce Inference-Time Unlearning via Gated Activation Redirection (GUARD-IT), a training- and gradient-free method that unlearns via input-dependent activation steering at inference time. ... The resulting intervention is applied as a norm-preserving rotation in the residual stream
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

GUARD-IT partitions the forget corpus into semantic clusters ... computes one steering vector per cluster ... routes each user query through a similarity gateway

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 4 internal anchors

[1]

TOFU: A Task of Fictitious Unlearning for LLMs

Pratyush Maini and Zhili Feng and Avi Schwarzschild and Zachary C. Lipton and J. Zico Kolter , year=. 2401.06121 , archivePrefix=

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks , url =

Reimers, Nils and Gurevych, Iryna , biburl =. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks , url =. EMNLP/IJCNLP (1) , crossref =

work page
[3]

Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019

work page 2019
[4]

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , year =

Jang, Joel and Yoon, Dongkeun and Yang, Sohee and Cha, Sungmin and Lee, Moontae and Logeswaran, Lajanugen and Seo, Minjoon , title =. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , year =

work page
[5]

Smith and Chiyuan Zhang , booktitle=

Weijia Shi and Jaechan Lee and Yangsibo Huang and Sadhika Malladi and Jieyu Zhao and Ari Holtzman and Daogao Liu and Luke Zettlemoyer and Noah A. Smith and Chiyuan Zhang , booktitle=. 2025 , url=

work page 2025
[6]

ArXiv , year=

Representation Engineering: A Top-Down Approach to AI Transparency , author=. ArXiv , year=

work page
[7]

Steering Llama 2 via Contrastive Activation Addition , url =

Rimsky, Nina and Gabrieli, Nick and Schulz, Julian and Tong, Meg and Hubinger, Evan and Turner, Alexander. Steering Llama 2 via Contrastive Activation Addition. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.828

work page doi:10.18653/v1/2024.acl-long.828 2024
[8]

ICML 2025 Workshop on Reliable and Responsible Foundation Models , year=

Angular Steering: Behavior Control via Rotation in Activation Space , author=. ICML 2025 Workshop on Reliable and Responsible Foundation Models , year=

work page 2025
[9]

2026 , eprint=

Spherical Steering: Geometry-Aware Activation Rotation for Language Models , author=. 2026 , eprint=

work page 2026
[10]

Thirty-seventh Conference on Neural Information Processing Systems , year=

Inference-Time Intervention: Eliciting Truthful Answers from a Language Model , author=. Thirty-seventh Conference on Neural Information Processing Systems , year=

work page
[11]

The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

Refusal in Language Models Is Mediated by a Single Direction , author=. The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

work page
[12]

Simplicity Prevails: Rethinking Negative Preference Optimization for

Chongyu Fan and Jiancheng Liu and Licong Lin and Jinghan Jia and Ruiqi Zhang and Song Mei and Sijia Liu , booktitle=. Simplicity Prevails: Rethinking Negative Preference Optimization for. 2024 , url=

work page 2024
[13]

Nature Machine Intelligence , volume =

Rethinking machine unlearning for large language models , author =. Nature Machine Intelligence , volume =. 2025 , publisher =

work page 2025
[14]

BLUR : A Bi-Level Optimization Approach for LLM Unlearning

Reisizadeh, Hadi and Jia, Jinghan and Bu, Zhiqi and Vinzamuri, Bhanukiran and Ramakrishna, Anil and Chang, Kai-Wei and Cevher, Volkan and Liu, Sijia and Hong, Mingyi. BLUR : A Bi-Level Optimization Approach for LLM Unlearning. Proceedings of the 19th Conference of the E uropean Chapter of the A ssociation for C omputational L inguistics (Volume 1: Long Pa...

work page doi:10.18653/v1/2026.eacl-long.331 2026
[15]

The Thirteenth International Conference on Learning Representations , year=

Programming Refusal with Conditional Activation Steering , author=. The Thirteenth International Conference on Learning Representations , year=

work page
[16]

Semantics-Adaptive Activation Intervention for

Weixuan Wang and JINGYUAN YANG and Wei Peng , booktitle=. Semantics-Adaptive Activation Intervention for. 2025 , url=

work page 2025
[17]

Proceedings of the 41st International Conference on Machine Learning , articleno =

Park, Kiho and Choe, Yo Joong and Veitch, Victor , title =. Proceedings of the 41st International Conference on Machine Learning , articleno =. 2024 , publisher =

work page 2024
[18]

2022 , eprint=

Toy Models of Superposition , author=. 2022 , eprint=

work page 2022
[19]

Quantization-Robust

Jo. Quantization-Robust. 2026 , eprint=

work page 2026
[20]

Proceedings of The 1st Conference on Lifelong Learning Agents , editor =

Liu, Bo and Liu, Qiang and Stone, Peter , title =. Proceedings of The 1st Conference on Lifelong Learning Agents , editor =. 2022 , url =

work page 2022
[21]

The Probabilistic Relevance Framework:

Robertson, Stephen and Zaragoza, Hugo , year =. The Probabilistic Relevance Framework:. Foundations and Trends in Information Retrieval , doi =

work page
[22]

Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks , year =

Thakur, Nandan and Reimers, Nils and R. Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks , year =

work page
[23]

2025 , issue_date =

Meng, Chuan and Arabzadeh, Negar and Askari, Arian and Aliannejadi, Mohammad and Rijke, Maarten de , title =. 2025 , issue_date =. doi:10.1145/3736402 , journal =

work page doi:10.1145/3736402 2025
[24]

Proceedings of the 34th International Conference on Neural Information Processing Systems , articleno =

Yu, Tianhe and Kumar, Saurabh and Gupta, Abhishek and Levine, Sergey and Hausman, Karol and Finn, Chelsea , title =. Proceedings of the 34th International Conference on Neural Information Processing Systems , articleno =. 2020 , isbn =

work page 2020
[25]

Proceedings of the 35th International Conference on Neural Information Processing Systems , articleno =

Liu, Bo and Liu, Xingchao and Jin, Xiaojie and Stone, Peter and Liu, Qiang , title =. Proceedings of the 35th International Conference on Neural Information Processing Systems , articleno =. 2021 , isbn =

work page 2021
[26]

and Dombrowski, Ann-Kathrin and Goel, Shashwat and Phan, Long and Mukobi, Gabriel and Helm-Burger, Nathan and Laban, Rassin and Hendrycks, Dan , booktitle =

Li, Nathaniel and Pan, Alexander and Gopal, Anjali and Yue, Summer and Berrios, Daniel and Gatti, Alice and Li, Justin D. and Dombrowski, Ann-Kathrin and Goel, Shashwat and Phan, Long and Mukobi, Gabriel and Helm-Burger, Nathan and Laban, Rassin and Hendrycks, Dan , booktitle =. The

work page
[27]

The Eleventh International Conference on Learning Representations , year =

Editing Models with Task Arithmetic , author =. The Eleventh International Conference on Learning Representations , year =

work page
[28]

Proceedings of the 41st International Conference on Machine Learning , year =

In-Context Unlearning: Language Models as Few-Shot Unlearners , author =. Proceedings of the 41st International Conference on Machine Learning , year =

work page
[29]

2024 , eprint=

Steering Language Models With Activation Engineering , author=. 2024 , eprint=

work page 2024
[30]

ACM Computing Surveys , volume =

Fairness in Deep Learning: A survey on vision and language research , author =. ACM Computing Surveys , volume =. 2025 , publisher =

work page 2025
[31]

Model Internal Sleuthing: Finding Lexical Identity and Inflectional Features in Modern Language Models

Michael Li and Nishant Subramani , year=. Echoes of. 2506.02132 , archivePrefix=

work page internal anchor Pith review Pith/arXiv arXiv
[32]

UNDIAL : Self-Distillation with Adjusted Logits for Robust Unlearning in Large Language Models

Dong, Yijiang River and Lin, Hongzhou and Belkin, Mikhail and Huerta, Ramon and Vuli \'c , Ivan. UNDIAL : Self-Distillation with Adjusted Logits for Robust Unlearning in Large Language Models. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: L...

work page doi:10.18653/v1/2025.naacl-long.444 2025
[33]

Thirty-seventh Conference on Neural Information Processing Systems , year=

Direct Preference Optimization: Your Language Model is Secretly a Reward Model , author=. Thirty-seventh Conference on Neural Information Processing Systems , year=

work page
[34]

First Conference on Language Modeling , year=

Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning , author=. First Conference on Language Modeling , year=

work page
[35]

OpenUnlearning: Accelerating

Vineeth Dorna and Anmol Reddy Mekala and Wenlong Zhao and Andrew McCallum and J Zico Kolter and Zachary Chase Lipton and Pratyush Maini , booktitle=. OpenUnlearning: Accelerating. 2026 , url=

work page 2026
[36]

Advances in Neural Information Processing Systems , volume=

Analysing the generalisation and reliability of steering vectors , author=. Advances in Neural Information Processing Systems , volume=

work page
[37]

Layer by Layer: Uncovering Hidden Representations in Language Models

Layer by layer: Uncovering hidden representations in language models , author=. arXiv preprint arXiv:2502.02013 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[38]

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale , url =

Penedo, Guilherme and Kydl\'. The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale , url =. Advances in Neural Information Processing Systems , doi =

work page
[39]

Locating and Editing Factual Associations in

Meng, Kevin and Bau, David and Andonian, Alex and Belinkov, Yonatan , booktitle =. Locating and Editing Factual Associations in. 2022 , publisher =

work page 2022
[40]

arXiv preprint arXiv:2406.01506 , year=

The geometry of categorical and hierarchical concepts in large language models , author=. arXiv preprint arXiv:2406.01506 , year=

work page arXiv
[41]

arXiv preprint arXiv:2410.16454 , year=

Catastrophic failure of llm unlearning via quantization , author=. arXiv preprint arXiv:2410.16454 , year=

work page arXiv
[42]

2025 , eprint=

Gated Integration of Low-Rank Adaptation for Continual Learning of Large Language Models , author=. 2025 , eprint=

work page 2025
[43]

Shen and Xinchi Qiu and Meghdad Kurmanji and Alex Iacob and Lorenzo Sani and Yihong Chen and Nicola Cancedda and Nicholas D

William F. Shen and Xinchi Qiu and Meghdad Kurmanji and Alex Iacob and Lorenzo Sani and Yihong Chen and Nicola Cancedda and Nicholas D. Lane , year=. 2502.07218 , archivePrefix=

work page arXiv
[44]

2023 , eprint=

Machine Unlearning: A Survey , author=. 2023 , eprint=

work page 2023
[45]

2023 , eprint=

Learn to Unlearn: A Survey on Machine Unlearning , author=. 2023 , eprint=

work page 2023
[46]

IEEE Transactions on Emerging Topics in Computational Intelligence , volume =

Machine Unlearning: Solutions and Challenges , author =. IEEE Transactions on Emerging Topics in Computational Intelligence , volume =. 2024 , month = jun, doi =

work page 2024
[47]

The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

Large Language Model Unlearning , author=. The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

work page
[48]

Retrieval-Augmented Generation for Knowledge-Intensive

Patrick Lewis and Ethan Perez and Aleksandra Piktus and Fabio Petroni and Vladimir Karpukhin and Naman Goyal and Heinrich K. Retrieval-Augmented Generation for Knowledge-Intensive. 2020 , eprint=

work page 2020
[49]

Asai, Akari and Wu, Zeqiu and Wang, Yizhong and Sil, Avirup and Hajishirzi, Hannaneh , booktitle =. Self-. 2024 , url =

work page 2024
[50]

2022 , eprint=

Memory-Based Model Editing at Scale , author=. 2022 , eprint=

work page 2022
[51]

2025 , eprint=

A Law of Next-Token Prediction in Large Language Models , author=. 2025 , eprint=

work page 2025
[52]

The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

Constrained Entropic Unlearning: A Primal-Dual Framework for Large Language Models , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

work page
[53]

Exploring Criteria of Loss Reweighting to Enhance

Puning Yang and Qizhou Wang and Zhuo Huang and Tongliang Liu and Chengqi Zhang and Bo Han , booktitle=. Exploring Criteria of Loss Reweighting to Enhance. 2025 , url=

work page 2025
[54]

Rethinking

Qizhou Wang and Jin Peng Zhou and Zhanke Zhou and Saebyeol Shin and Bo Han and Kilian Q Weinberger , booktitle=. Rethinking. 2025 , url=

work page 2025
[55]

2025 , eprint=

A Comprehensive Survey of Machine Unlearning Techniques for Large Language Models , author=. 2025 , eprint=

work page 2025
[56]

2025 , url =

Yang, Bo , journal =. 2025 , url =

work page 2025
[57]

The Llama 3 Herd of Models

The Llama 3 Herd of Models , author=. arXiv preprint arXiv:2407.21783 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[58]

Dettmers, Tim and Lewis, Mike and Belkada, Younes and Zettlemoyer, Luke , booktitle=

work page
[59]

Dettmers, Tim and Pagnoni, Artidoro and Holtzman, Ari and Zettlemoyer, Luke , booktitle=

work page

[1] [1]

TOFU: A Task of Fictitious Unlearning for LLMs

Pratyush Maini and Zhili Feng and Avi Schwarzschild and Zachary C. Lipton and J. Zico Kolter , year=. 2401.06121 , archivePrefix=

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks , url =

Reimers, Nils and Gurevych, Iryna , biburl =. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks , url =. EMNLP/IJCNLP (1) , crossref =

work page

[3] [3]

Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019

work page 2019

[4] [4]

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , year =

Jang, Joel and Yoon, Dongkeun and Yang, Sohee and Cha, Sungmin and Lee, Moontae and Logeswaran, Lajanugen and Seo, Minjoon , title =. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , year =

work page

[5] [5]

Smith and Chiyuan Zhang , booktitle=

Weijia Shi and Jaechan Lee and Yangsibo Huang and Sadhika Malladi and Jieyu Zhao and Ari Holtzman and Daogao Liu and Luke Zettlemoyer and Noah A. Smith and Chiyuan Zhang , booktitle=. 2025 , url=

work page 2025

[6] [6]

ArXiv , year=

Representation Engineering: A Top-Down Approach to AI Transparency , author=. ArXiv , year=

work page

[7] [7]

Steering Llama 2 via Contrastive Activation Addition , url =

Rimsky, Nina and Gabrieli, Nick and Schulz, Julian and Tong, Meg and Hubinger, Evan and Turner, Alexander. Steering Llama 2 via Contrastive Activation Addition. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.828

work page doi:10.18653/v1/2024.acl-long.828 2024

[8] [8]

ICML 2025 Workshop on Reliable and Responsible Foundation Models , year=

Angular Steering: Behavior Control via Rotation in Activation Space , author=. ICML 2025 Workshop on Reliable and Responsible Foundation Models , year=

work page 2025

[9] [9]

2026 , eprint=

Spherical Steering: Geometry-Aware Activation Rotation for Language Models , author=. 2026 , eprint=

work page 2026

[10] [10]

Thirty-seventh Conference on Neural Information Processing Systems , year=

Inference-Time Intervention: Eliciting Truthful Answers from a Language Model , author=. Thirty-seventh Conference on Neural Information Processing Systems , year=

work page

[11] [11]

The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

Refusal in Language Models Is Mediated by a Single Direction , author=. The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

work page

[12] [12]

Simplicity Prevails: Rethinking Negative Preference Optimization for

Chongyu Fan and Jiancheng Liu and Licong Lin and Jinghan Jia and Ruiqi Zhang and Song Mei and Sijia Liu , booktitle=. Simplicity Prevails: Rethinking Negative Preference Optimization for. 2024 , url=

work page 2024

[13] [13]

Nature Machine Intelligence , volume =

Rethinking machine unlearning for large language models , author =. Nature Machine Intelligence , volume =. 2025 , publisher =

work page 2025

[14] [14]

BLUR : A Bi-Level Optimization Approach for LLM Unlearning

Reisizadeh, Hadi and Jia, Jinghan and Bu, Zhiqi and Vinzamuri, Bhanukiran and Ramakrishna, Anil and Chang, Kai-Wei and Cevher, Volkan and Liu, Sijia and Hong, Mingyi. BLUR : A Bi-Level Optimization Approach for LLM Unlearning. Proceedings of the 19th Conference of the E uropean Chapter of the A ssociation for C omputational L inguistics (Volume 1: Long Pa...

work page doi:10.18653/v1/2026.eacl-long.331 2026

[15] [15]

The Thirteenth International Conference on Learning Representations , year=

Programming Refusal with Conditional Activation Steering , author=. The Thirteenth International Conference on Learning Representations , year=

work page

[16] [16]

Semantics-Adaptive Activation Intervention for

Weixuan Wang and JINGYUAN YANG and Wei Peng , booktitle=. Semantics-Adaptive Activation Intervention for. 2025 , url=

work page 2025

[17] [17]

Proceedings of the 41st International Conference on Machine Learning , articleno =

Park, Kiho and Choe, Yo Joong and Veitch, Victor , title =. Proceedings of the 41st International Conference on Machine Learning , articleno =. 2024 , publisher =

work page 2024

[18] [18]

2022 , eprint=

Toy Models of Superposition , author=. 2022 , eprint=

work page 2022

[19] [19]

Quantization-Robust

Jo. Quantization-Robust. 2026 , eprint=

work page 2026

[20] [20]

Proceedings of The 1st Conference on Lifelong Learning Agents , editor =

Liu, Bo and Liu, Qiang and Stone, Peter , title =. Proceedings of The 1st Conference on Lifelong Learning Agents , editor =. 2022 , url =

work page 2022

[21] [21]

The Probabilistic Relevance Framework:

Robertson, Stephen and Zaragoza, Hugo , year =. The Probabilistic Relevance Framework:. Foundations and Trends in Information Retrieval , doi =

work page

[22] [22]

Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks , year =

Thakur, Nandan and Reimers, Nils and R. Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks , year =

work page

[23] [23]

2025 , issue_date =

Meng, Chuan and Arabzadeh, Negar and Askari, Arian and Aliannejadi, Mohammad and Rijke, Maarten de , title =. 2025 , issue_date =. doi:10.1145/3736402 , journal =

work page doi:10.1145/3736402 2025

[24] [24]

Proceedings of the 34th International Conference on Neural Information Processing Systems , articleno =

Yu, Tianhe and Kumar, Saurabh and Gupta, Abhishek and Levine, Sergey and Hausman, Karol and Finn, Chelsea , title =. Proceedings of the 34th International Conference on Neural Information Processing Systems , articleno =. 2020 , isbn =

work page 2020

[25] [25]

Proceedings of the 35th International Conference on Neural Information Processing Systems , articleno =

Liu, Bo and Liu, Xingchao and Jin, Xiaojie and Stone, Peter and Liu, Qiang , title =. Proceedings of the 35th International Conference on Neural Information Processing Systems , articleno =. 2021 , isbn =

work page 2021

[26] [26]

and Dombrowski, Ann-Kathrin and Goel, Shashwat and Phan, Long and Mukobi, Gabriel and Helm-Burger, Nathan and Laban, Rassin and Hendrycks, Dan , booktitle =

Li, Nathaniel and Pan, Alexander and Gopal, Anjali and Yue, Summer and Berrios, Daniel and Gatti, Alice and Li, Justin D. and Dombrowski, Ann-Kathrin and Goel, Shashwat and Phan, Long and Mukobi, Gabriel and Helm-Burger, Nathan and Laban, Rassin and Hendrycks, Dan , booktitle =. The

work page

[27] [27]

The Eleventh International Conference on Learning Representations , year =

Editing Models with Task Arithmetic , author =. The Eleventh International Conference on Learning Representations , year =

work page

[28] [28]

Proceedings of the 41st International Conference on Machine Learning , year =

In-Context Unlearning: Language Models as Few-Shot Unlearners , author =. Proceedings of the 41st International Conference on Machine Learning , year =

work page

[29] [29]

2024 , eprint=

Steering Language Models With Activation Engineering , author=. 2024 , eprint=

work page 2024

[30] [30]

ACM Computing Surveys , volume =

Fairness in Deep Learning: A survey on vision and language research , author =. ACM Computing Surveys , volume =. 2025 , publisher =

work page 2025

[31] [31]

Model Internal Sleuthing: Finding Lexical Identity and Inflectional Features in Modern Language Models

Michael Li and Nishant Subramani , year=. Echoes of. 2506.02132 , archivePrefix=

work page internal anchor Pith review Pith/arXiv arXiv

[32] [32]

UNDIAL : Self-Distillation with Adjusted Logits for Robust Unlearning in Large Language Models

Dong, Yijiang River and Lin, Hongzhou and Belkin, Mikhail and Huerta, Ramon and Vuli \'c , Ivan. UNDIAL : Self-Distillation with Adjusted Logits for Robust Unlearning in Large Language Models. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: L...

work page doi:10.18653/v1/2025.naacl-long.444 2025

[33] [33]

Thirty-seventh Conference on Neural Information Processing Systems , year=

Direct Preference Optimization: Your Language Model is Secretly a Reward Model , author=. Thirty-seventh Conference on Neural Information Processing Systems , year=

work page

[34] [34]

First Conference on Language Modeling , year=

Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning , author=. First Conference on Language Modeling , year=

work page

[35] [35]

OpenUnlearning: Accelerating

Vineeth Dorna and Anmol Reddy Mekala and Wenlong Zhao and Andrew McCallum and J Zico Kolter and Zachary Chase Lipton and Pratyush Maini , booktitle=. OpenUnlearning: Accelerating. 2026 , url=

work page 2026

[36] [36]

Advances in Neural Information Processing Systems , volume=

Analysing the generalisation and reliability of steering vectors , author=. Advances in Neural Information Processing Systems , volume=

work page

[37] [37]

Layer by Layer: Uncovering Hidden Representations in Language Models

Layer by layer: Uncovering hidden representations in language models , author=. arXiv preprint arXiv:2502.02013 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[38] [38]

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale , url =

Penedo, Guilherme and Kydl\'. The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale , url =. Advances in Neural Information Processing Systems , doi =

work page

[39] [39]

Locating and Editing Factual Associations in

Meng, Kevin and Bau, David and Andonian, Alex and Belinkov, Yonatan , booktitle =. Locating and Editing Factual Associations in. 2022 , publisher =

work page 2022

[40] [40]

arXiv preprint arXiv:2406.01506 , year=

The geometry of categorical and hierarchical concepts in large language models , author=. arXiv preprint arXiv:2406.01506 , year=

work page arXiv

[41] [41]

arXiv preprint arXiv:2410.16454 , year=

Catastrophic failure of llm unlearning via quantization , author=. arXiv preprint arXiv:2410.16454 , year=

work page arXiv

[42] [42]

2025 , eprint=

Gated Integration of Low-Rank Adaptation for Continual Learning of Large Language Models , author=. 2025 , eprint=

work page 2025

[43] [43]

Shen and Xinchi Qiu and Meghdad Kurmanji and Alex Iacob and Lorenzo Sani and Yihong Chen and Nicola Cancedda and Nicholas D

William F. Shen and Xinchi Qiu and Meghdad Kurmanji and Alex Iacob and Lorenzo Sani and Yihong Chen and Nicola Cancedda and Nicholas D. Lane , year=. 2502.07218 , archivePrefix=

work page arXiv

[44] [44]

2023 , eprint=

Machine Unlearning: A Survey , author=. 2023 , eprint=

work page 2023

[45] [45]

2023 , eprint=

Learn to Unlearn: A Survey on Machine Unlearning , author=. 2023 , eprint=

work page 2023

[46] [46]

IEEE Transactions on Emerging Topics in Computational Intelligence , volume =

Machine Unlearning: Solutions and Challenges , author =. IEEE Transactions on Emerging Topics in Computational Intelligence , volume =. 2024 , month = jun, doi =

work page 2024

[47] [47]

The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

Large Language Model Unlearning , author=. The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

work page

[48] [48]

Retrieval-Augmented Generation for Knowledge-Intensive

Patrick Lewis and Ethan Perez and Aleksandra Piktus and Fabio Petroni and Vladimir Karpukhin and Naman Goyal and Heinrich K. Retrieval-Augmented Generation for Knowledge-Intensive. 2020 , eprint=

work page 2020

[49] [49]

Asai, Akari and Wu, Zeqiu and Wang, Yizhong and Sil, Avirup and Hajishirzi, Hannaneh , booktitle =. Self-. 2024 , url =

work page 2024

[50] [50]

2022 , eprint=

Memory-Based Model Editing at Scale , author=. 2022 , eprint=

work page 2022

[51] [51]

2025 , eprint=

A Law of Next-Token Prediction in Large Language Models , author=. 2025 , eprint=

work page 2025

[52] [52]

The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

Constrained Entropic Unlearning: A Primal-Dual Framework for Large Language Models , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

work page

[53] [53]

Exploring Criteria of Loss Reweighting to Enhance

Puning Yang and Qizhou Wang and Zhuo Huang and Tongliang Liu and Chengqi Zhang and Bo Han , booktitle=. Exploring Criteria of Loss Reweighting to Enhance. 2025 , url=

work page 2025

[54] [54]

Rethinking

Qizhou Wang and Jin Peng Zhou and Zhanke Zhou and Saebyeol Shin and Bo Han and Kilian Q Weinberger , booktitle=. Rethinking. 2025 , url=

work page 2025

[55] [55]

2025 , eprint=

A Comprehensive Survey of Machine Unlearning Techniques for Large Language Models , author=. 2025 , eprint=

work page 2025

[56] [56]

2025 , url =

Yang, Bo , journal =. 2025 , url =

work page 2025

[57] [57]

The Llama 3 Herd of Models

The Llama 3 Herd of Models , author=. arXiv preprint arXiv:2407.21783 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[58] [58]

Dettmers, Tim and Lewis, Mike and Belkada, Younes and Zettlemoyer, Luke , booktitle=

work page

[59] [59]

Dettmers, Tim and Pagnoni, Artidoro and Holtzman, Ari and Zettlemoyer, Luke , booktitle=

work page