Inference-Time Machine Unlearning via Gated Activation Redirection
Pith reviewed 2026-05-14 20:54 UTC · model grok-4.3
The pith
A gated input-dependent rotation in the residual stream enables inference-time unlearning of specific data in LLMs without altering weights.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GUARD-IT performs machine unlearning by applying an input-dependent, gated activation redirection, realized as a norm-preserving rotation in the residual stream at inference time, with no weight updates. On TOFU and MUSE it matches or exceeds 12 gradient-based baselines across three model scales while preserving utility, suppressing memorization, and avoiding catastrophic collapse.
What carries the argument
Gated Activation Redirection (GUARD-IT): an inference-time intervention that computes an input-dependent norm-preserving rotation in the residual stream to steer model behavior away from a forget set.
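As described, the intervention can be pictured as rotating the residual-stream vector away from a "forget direction", with the rotation angle scaled by a gate value. A minimal numerical sketch, assuming the rotation acts in the plane spanned by the hidden state and the forget direction (the paper's exact construction is not reproduced here):

```python
import numpy as np

def unit(x):
    return x / np.linalg.norm(x)

def gated_rotation(h, v_forget, g, theta=np.pi / 4):
    """Rotate hidden state h by angle g*theta away from the forget
    direction v_forget, in the plane they span, preserving ||h||.
    g in [0, 1] is the input-dependent gate value."""
    e1 = unit(h)
    # Gram-Schmidt: component of v_forget orthogonal to h.
    v_perp = v_forget - (v_forget @ e1) * e1
    if np.linalg.norm(v_perp) < 1e-8:  # degenerate: h parallel to v_forget
        return h
    e2 = unit(v_perp)
    alpha = g * theta
    # Moving along -e2 decreases alignment with v_forget, and the
    # orthonormal basis (e1, e2) keeps the norm unchanged.
    return np.linalg.norm(h) * (np.cos(alpha) * e1 - np.sin(alpha) * e2)
```

With g = 0 the state passes through untouched; with g = 1 it is rotated by the full angle theta. The gate, not shown here, is what makes the intervention input-dependent rather than a single global steering vector.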
If this is right
- Matches or exceeds performance of 12 gradient-based unlearning baselines on TOFU and MUSE across three model scales.
- Preserves model utility on retain data while suppressing memorization of the forget set.
- Avoids catastrophic collapse in all tested settings, unlike some parameter-editing methods.
- Supports continual unlearning without requiring retraining for new forget sets.
- Remains effective when models are quantized for deployment, where weight-editing methods degrade.
Where Pith is reading between the lines
- Deployed models could support user-requested data removal on the fly without retraining or weight access.
- The method might extend to steering for other goals like reducing harmful outputs in safety applications.
- Further work could test if the rotation mechanism scales to very large models or multimodal systems.
Load-bearing premise
That an input-dependent norm-preserving rotation in the residual stream can selectively remove the influence of a forget set without introducing unintended changes to model behavior on unrelated inputs.
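This premise is, in effect, a claim about the gate's false-positive rate on inputs that resemble the forget set without belonging to it. A toy probe of that property, using an assumed sigmoid-over-cosine-similarity gate (the paper's gate is not specified here; the prototype, sharpness, and threshold are illustrative):

```python
import numpy as np

def unit(x):
    return x / np.linalg.norm(x)

def gate(h, prototype, sharpness=8.0, threshold=0.5):
    """Assumed gate: sigmoid of cosine similarity to a forget-set
    prototype, shifted by a firing threshold."""
    sim = unit(h) @ unit(prototype)
    return 1.0 / (1.0 + np.exp(-sharpness * (sim - threshold)))

rng = np.random.default_rng(0)
proto = rng.normal(size=64)
# Forget-like inputs: small perturbations of the prototype.
forget_like = [proto + 0.1 * rng.normal(size=64) for _ in range(200)]
# Unrelated inputs: independent directions, near-orthogonal in high dimension.
unrelated = [rng.normal(size=64) for _ in range(200)]

fire_forget = np.mean([gate(h, proto) for h in forget_like])
fire_unrelated = np.mean([gate(h, proto) for h in unrelated])
```

The hard case this premise glosses over is the middle band: inputs distributionally close to the forget set but not in it, where any fixed threshold trades false negatives against collateral edits.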
What would settle it
Observing that the method either fails to reduce memorization scores on the forget set or causes performance drops on a control set of unrelated prompts after applying the redirection.
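Concretely, this falsifier reduces to a two-sided check: memorization must drop on the forget set while staying flat on an unrelated control set. A sketch with toy models represented as prompt-to-completion callables (all names and the tolerance are illustrative):

```python
def memorization_rate(model, items):
    """Fraction of (prompt, memorized_completion) pairs the model reproduces."""
    return sum(model(p) == target for p, target in items) / len(items)

def settle(base, edited, forget, control, tol=0.05):
    """Apply the falsification criterion to a base model and its
    redirected (edited) counterpart."""
    forget_drop = memorization_rate(base, forget) - memorization_rate(edited, forget)
    control_drop = memorization_rate(base, control) - memorization_rate(edited, control)
    if forget_drop <= 0:
        return "refuted: memorization on the forget set did not fall"
    if control_drop > tol:
        return "refuted: collateral damage on unrelated prompts"
    return "consistent with the claim"

# Toy example: a lookup-table 'model' and an edited copy that forgets two entries.
base_table = {"secret-1": "leaked", "secret-2": "leaked", "trivia": "paris"}
edited_table = {"secret-1": "[redacted]", "secret-2": "[redacted]", "trivia": "paris"}
base, edited = base_table.get, edited_table.get
forget = [("secret-1", "leaked"), ("secret-2", "leaked")]
control = [("trivia", "paris")]
```

Here `settle(base, edited, forget, control)` would come back consistent, while comparing the base model against itself fails the first check, since nothing was forgotten.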
Original abstract
Large Language Models memorize vast amounts of training data, raising concerns regarding privacy, copyright infringement, and safety. Machine unlearning seeks to remove the influence of a targeted forget set while preserving model performance, ideally approximating a model retrained from scratch without the forget set. Existing approaches aim to achieve this by updating model parameters via gradient-based methods. However, these updates are computationally expensive, lead to irreversible weight changes, and degrade when the model is quantized for deployment. A recent alternative to changing model weights is activation engineering, where activations are changed during inference to steer model behavior. Despite circumventing weight editing, naive activation steering introduces its own failure modes, as a single global steering vector applies the same intervention to every input, leading to unintended changes in model behavior. We introduce Inference-Time Unlearning via Gated Activation Redirection (GUARD-IT), a training- and gradient-free method that unlearns via input-dependent activation steering at inference time. The resulting intervention is applied as a norm-preserving rotation in the residual stream, leaving model weights untouched. Experiments on TOFU and MUSE show that GUARD-IT matches or exceeds 12 gradient-based baselines across three model scales, while being the only method to simultaneously preserve utility, suppress memorization, and avoid catastrophic collapse across all settings. GUARD-IT further supports continual unlearning without retraining, and remains effective under quantization, a scenario in which parameter-editing methods degrade.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes GUARD-IT, a training- and gradient-free inference-time unlearning method for LLMs. It applies an input-dependent gated activation redirection, realized as a norm-preserving rotation in the residual stream, to suppress the influence of a targeted forget set. Experiments on the TOFU and MUSE benchmarks across three model scales claim that GUARD-IT matches or exceeds 12 gradient-based baselines while being the only approach that simultaneously preserves utility, suppresses memorization, and avoids catastrophic collapse. The method is further stated to support continual unlearning without retraining and to remain effective under quantization.
Significance. If the selectivity of the gated redirection and the reported comparative results hold under detailed scrutiny, the work would offer a practical advance in machine unlearning. By avoiding any weight updates, it provides efficiency, reversibility, and quantization compatibility that address documented limitations of parameter-editing methods. The input-dependent gating mechanism is positioned as an improvement over global activation steering.
major comments (3)
- [§3] §3 (Method description): No equations, pseudocode, or explicit algorithm is given for constructing the input-dependent gate or the redirection (rotation) vector from activation statistics. Because the central claim rests on the gate selectively triggering only for forget-set-related inputs without false positives on distributionally similar but unrelated inputs, the absence of these details prevents verification that the intervention satisfies the utility-preservation requirement.
- [§4] §4 (Experiments and results): The claim that GUARD-IT matches or exceeds 12 baselines while uniquely satisfying all three criteria is load-bearing, yet the manuscript provides neither exact metric values (e.g., forget accuracy, utility scores), statistical significance tests, nor implementation details for the baselines. This omission makes the comparative evaluation impossible to assess from the reported text.
- [§5] §5 (Continual unlearning and quantization): The assertions that the method supports continual unlearning without retraining and remains effective under quantization lack quantitative results, ablation studies, or before/after comparisons. These scenarios are presented as key advantages, so supporting data are required to substantiate them.
minor comments (2)
- [Abstract] Abstract: The statement that GUARD-IT is 'the only method to simultaneously preserve utility, suppress memorization, and avoid catastrophic collapse' should include a reference to the specific table or section that enumerates all compared methods and their failure modes.
- [Abstract] Notation: The acronym GUARD-IT and the phrase 'Gated Activation Redirection' are introduced without an immediate parenthetical definition or forward reference to the section where the components are defined.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We agree that additional methodological clarity, precise experimental reporting, and supporting quantitative results are needed to strengthen the manuscript. We will revise the paper accordingly and address each major comment below.
Point-by-point responses
Referee: [§3] §3 (Method description): No equations, pseudocode, or explicit algorithm is given for constructing the input-dependent gate or the redirection (rotation) vector from activation statistics. Because the central claim rests on the gate selectively triggering only for forget-set-related inputs without false positives on distributionally similar but unrelated inputs, the absence of these details prevents verification that the intervention satisfies the utility-preservation requirement.
Authors: We agree that the method section requires more explicit detail. In the revised manuscript we will add the full set of equations defining the input-dependent gate (computed from per-layer activation statistics) and the norm-preserving rotation vector, together with pseudocode for the complete inference-time procedure. This will make the selectivity mechanism and its utility-preservation properties directly verifiable. revision: yes
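For concreteness, one plausible shape the promised equations could take, written as a reader's sketch rather than the authors' actual definitions (every symbol below is an assumption): a gate computed from normalized layer-$\ell$ activations against a forget-set prototype $\mu_f$, and a rotation in the plane spanned by the hidden state and a forget direction $v_f$.

```latex
% Assumed input-dependent gate from layer-\ell activation statistics:
g_\ell(x) = \sigma\!\big(\kappa\,\big(\hat h_\ell(x)^{\top}\mu_f - \tau\big)\big),
\qquad
\hat h_\ell(x) = \frac{h_\ell(x)}{\lVert h_\ell(x)\rVert}

% Norm-preserving redirection: with e_1 = \hat h_\ell(x) and e_2 the unit
% component of v_f orthogonal to h_\ell(x), rotate by angle g_\ell(x)\,\theta:
h'_\ell(x) = \lVert h_\ell(x)\rVert
\big(\cos(g_\ell(x)\,\theta)\, e_1 - \sin(g_\ell(x)\,\theta)\, e_2\big)
```

Whatever the actual form, the utility-preservation argument hinges on $g_\ell(x) \approx 0$ for inputs unrelated to the forget set, which is exactly what the referee asks the revision to make verifiable.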
Referee: [§4] §4 (Experiments and results): The claim that GUARD-IT matches or exceeds 12 baselines while uniquely satisfying all three criteria is load-bearing, yet the manuscript provides neither exact metric values (e.g., forget accuracy, utility scores), statistical significance tests, nor implementation details for the baselines. This omission makes the comparative evaluation impossible to assess from the reported text.
Authors: We acknowledge the need for complete numerical transparency. The revised version will include full tables reporting exact metric values (forget accuracy, utility scores, etc.) for all models and benchmarks, the results of statistical significance tests (with p-values), and detailed implementation specifications for each of the 12 baselines, including hyper-parameters and code references. revision: yes
Referee: [§5] §5 (Continual unlearning and quantization): The assertions that the method supports continual unlearning without retraining and remains effective under quantization lack quantitative results, ablation studies, or before/after comparisons. These scenarios are presented as key advantages, so supporting data are required to substantiate them.
Authors: We will expand the experimental section with new quantitative evaluations. For continual unlearning we will report performance after sequential unlearning of multiple forget sets without retraining. For quantization we will add before-and-after comparisons and ablations under 8-bit and 4-bit quantization, demonstrating that GUARD-IT retains its effectiveness where parameter-editing baselines degrade. revision: yes
Circularity Check
No circularity; empirical inference-time method with no self-referential derivations
Full rationale
The paper presents GUARD-IT as a training- and gradient-free method that performs unlearning through input-dependent activation steering realized as a norm-preserving rotation in the residual stream. No equations, parameter fits, or derivation steps are shown that reduce the claimed performance gains to fitted inputs, self-citations, or ansatzes imported from prior work by the same authors. The central claims rest on empirical comparisons against 12 gradient-based baselines on TOFU and MUSE across model scales, with explicit statements that the approach avoids weight updates and supports quantization. Because the method is positioned as an empirical alternative rather than a closed mathematical reduction, and no load-bearing self-citation or self-definitional step is present, the derivation chain is self-contained.
Axiom & Free-Parameter Ledger
invented entities (1)
- Gated Activation Redirection: no independent evidence