Inference-Time Machine Unlearning via Gated Activation Redirection
Pith reviewed 2026-05-20 21:51 UTC · model grok-4.3
The pith
GUARD-IT unlearns specific data from large language models by steering activations at inference time without updating weights.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that GUARD-IT, a training- and gradient-free method, achieves machine unlearning by applying input-dependent activation steering as norm-preserving rotations in the residual stream during inference. Experiments on the TOFU and MUSE benchmarks demonstrate that it matches or exceeds twelve gradient-based baselines across three model scales. It is the only method tested that simultaneously preserves utility, suppresses memorization, and avoids catastrophic collapse in all settings. The method also enables continual unlearning without retraining and maintains effectiveness when models are quantized.
What carries the argument
Gated activation redirection, which computes an input-dependent steering vector and applies it as a norm-preserving rotation in the residual stream to steer behavior without weight changes.
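The paper's equations are not reproduced on this page, so the following is only a minimal sketch of what a gated, norm-preserving redirection could look like. It assumes NumPy, a single residual-stream vector `h`, and a plane rotation toward a hypothetical `target` direction. The names `rotate_toward`, `gated_redirect`, and `max_angle` are illustrative assumptions, not GUARD-IT's actual interface.

```python
import numpy as np

def rotate_toward(h, target, angle):
    """Rotate h by `angle` radians toward `target` in the plane they span.

    Assumption: a single 2-D plane rotation; the output keeps the norm of h.
    """
    norm = np.linalg.norm(h)
    if norm == 0.0:
        return h.copy()
    u = h / norm  # unit vector along h
    # Gram-Schmidt step: part of target orthogonal to h.
    w = target - np.dot(target, u) * u
    w_norm = np.linalg.norm(w)
    if w_norm < 1e-12:  # target is parallel or anti-parallel to h
        return h.copy()
    w = w / w_norm
    # u and w are orthonormal, so cos*u + sin*w is again a unit vector.
    return norm * (np.cos(angle) * u + np.sin(angle) * w)

def gated_redirect(h, target, gate, max_angle=np.pi / 2):
    """Scale the rotation angle by a gate in [0, 1]; gate 0 leaves h as is."""
    return rotate_toward(h, target, gate * max_angle)
```

With a gate of 0 the activation passes through unchanged, which is how an input-dependent gate would avoid touching inputs unrelated to the forget set.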
If this is right
- Unlearning can be performed without any gradient computations or parameter updates.
- The approach remains effective on quantized models, where parameter-editing methods degrade.
- Continual unlearning is possible by applying successive interventions without retraining.
- Utility preservation and memorization suppression hold at all three model scales tested.
Where Pith is reading between the lines
- Such inference-time methods could make it feasible to comply with data deletion requests in real time for deployed AI systems.
- The technique might be adapted to other goals like reducing hallucinations or enforcing safety constraints dynamically.
- If the rotations are truly norm-preserving, they may preserve more of the original model's capabilities than additive steering vectors.
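A short worked calculation, offered as an illustration rather than a result from the paper, shows why additive steering generally changes the activation norm. Here $h$ is a residual-stream activation, $v$ a nonzero steering vector, and $\alpha$ a steering strength; all three symbols are assumptions introduced for this example.

```latex
\[
\|h + \alpha v\|^2 = \|h\|^2 + 2\alpha \langle h, v\rangle + \alpha^2 \|v\|^2 .
\]
% Example: h = (3, 4), v = (1, 0), \alpha = 2
\[
\|h\| = 5, \qquad \|h + 2v\| = \|(5, 4)\| = \sqrt{41} \approx 6.40 .
\]
```

The norm is unchanged only if $\alpha = 0$ or $\alpha = -2\langle h, v\rangle / \|v\|^2$, whereas a rotation leaves it unchanged for every angle.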
Load-bearing premise
The gating mechanism can produce rotations that remove the targeted information without unintended side effects or performance degradation on unrelated tasks.
What would settle it
If a model running GUARD-IT still generates forget-set content on some inputs, or shows reduced accuracy on standard benchmarks, the central claim would be falsified.
Original abstract
Large Language Models memorize vast amounts of training data, raising concerns regarding privacy, copyright infringement, and safety. Machine unlearning seeks to remove the influence of a targeted forget set while preserving model performance, ideally approximating a model retrained from scratch without the forget set. Existing approaches aim to achieve this by updating model parameters via gradient-based methods. However, these updates are computationally expensive, lead to irreversible weight changes, and degrade when the model is quantized for deployment. A recent alternative to changing model weights is activation engineering, where activations are changed during inference to steer model behavior. Despite circumventing weight editing, naive activation steering introduces its own failure modes, as a single global steering vector applies the same intervention to every input, leading to unintended changes in model behavior. We introduce Inference-Time Unlearning via Gated Activation Redirection (GUARD-IT), a training- and gradient-free method that unlearns via input-dependent activation steering at inference time. The resulting intervention is applied as a norm-preserving rotation in the residual stream, leaving model weights untouched. Experiments on TOFU and MUSE show that GUARD-IT matches or exceeds 12 gradient-based baselines across three model scales, while being the only method to simultaneously preserve utility, suppress memorization, and avoid catastrophic collapse across all settings. GUARD-IT further supports continual unlearning without retraining, and remains effective under quantization, a scenario in which parameter-editing methods degrade.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces GUARD-IT, a training- and gradient-free method for machine unlearning in LLMs that performs input-dependent activation steering at inference time. It applies gated redirections in the residual stream as norm-preserving rotations without altering model weights. On TOFU and MUSE benchmarks, GUARD-IT is reported to match or exceed 12 gradient-based baselines across three model scales while being the only approach that simultaneously preserves utility, suppresses memorization, and avoids catastrophic collapse; it also supports continual unlearning without retraining and remains effective under quantization.
Significance. If the empirical claims hold with adequate controls, the work would be significant for offering a practical, reversible, and quantization-robust alternative to parameter-editing unlearning methods. The inference-time approach addresses computational and deployment limitations of gradient-based techniques and extends activation engineering to selective unlearning, which could influence future research on efficient post-training interventions.
Major comments (2)
- [§3.2] The gating function and rotation construction are described conceptually as a fixed, non-learned computation that produces input-specific norm-preserving rotations, but no explicit equation or algorithm is given for how the gate is derived from the forget set (e.g., no definition of the similarity measure, projection, or activation threshold). This is load-bearing for the central claim that the method is training-free yet avoids the failure modes of global steering.
- [§5.1, Table 1] The headline claim that GUARD-IT is the only method to simultaneously preserve utility, suppress memorization, and avoid collapse across all settings relies on aggregate performance numbers, but the table reports point estimates without error bars, run counts, or statistical significance tests against the 12 baselines; this weakens the uniqueness assertion.
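The kind of reporting the second comment asks for can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names and the normal-approximation interval are assumptions, and the overlap check is a crude heuristic rather than a formal significance test.

```python
import math
import statistics

def mean_and_stderr(scores):
    """Sample mean and standard error of the mean over repeated runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.mean(scores)
    stderr = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, stderr

def intervals_overlap(a, b, z=1.96):
    """True if the approximate 95% intervals of two methods overlap.

    Assumption: normal approximation, which is rough for a handful of runs.
    """
    (mean_a, se_a), (mean_b, se_b) = mean_and_stderr(a), mean_and_stderr(b)
    return abs(mean_a - mean_b) <= z * (se_a + se_b)

# Placeholder scores for illustration only; not results from the paper.
method_a = [0.61, 0.63, 0.62]
method_b = [0.60, 0.64, 0.62]
print(mean_and_stderr(method_a), intervals_overlap(method_a, method_b))
```

When intervals overlap, a claim that one method is "the only" one to meet a criterion would need a stronger test or more runs before it can be stated without qualification.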
Minor comments (3)
- [Abstract] The phrase 'three model scales' is used without naming the specific models or parameter counts.
- [§4] The description of the norm-preserving property of the rotation would benefit from a short proof sketch or reference to the relevant linear-algebra fact.
- [Figure 3] Axis labels and legend entries are too small for readability in the printed version.
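The linear-algebra fact requested for §4 can be sketched as follows. This assumes the rotation acts in a single two-dimensional plane spanned by orthonormal vectors $u$ and $w$ and is the identity elsewhere, which is one common construction and not necessarily the one the paper uses; the symbols $P$, $J$, $R$, and $\theta$ are introduced here for illustration.

```latex
\[
P = uu^\top + ww^\top, \qquad J = wu^\top - uw^\top, \qquad
R = I + (\cos\theta - 1)\,P + \sin\theta\,J .
\]
Since $P^\top = P$, $P^2 = P$, $J^\top = -J$, $J^2 = -P$, and $PJ = JP = J$,
\[
R^\top R
= \bigl(I + (\cos\theta - 1)P\bigr)^2 - \sin^2\theta\, J^2
= I + \bigl(\cos^2\theta + \sin^2\theta - 1\bigr) P
= I .
\]
Hence $\|Rh\|^2 = h^\top R^\top R\, h = \|h\|^2$ for every activation $h$.
```

In particular $Ru = \cos\theta\, u + \sin\theta\, w$, so a vector lying in the plane is turned by the angle $\theta$ without any change in length.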
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive comments. We address each major comment point by point below and outline the revisions we will make.
Point-by-point responses
- Referee: [§3.2] The gating function and rotation construction are described conceptually as a fixed, non-learned computation that produces input-specific norm-preserving rotations, but no explicit equation or algorithm is given for how the gate is derived from the forget set (e.g., no definition of the similarity measure, projection, or activation threshold). This is load-bearing for the central claim that the method is training-free yet avoids the failure modes of global steering.
  Authors: We agree that an explicit mathematical formulation is necessary to fully substantiate the central claims. In the revised manuscript we will expand §3.2 with the precise equations and algorithm: the gate is computed from a similarity measure between the input activation and a prototype derived from the forget-set activations, followed by a projection step and a threshold that determines whether redirection is applied. The rotation itself is constructed as a norm-preserving operation in the residual stream. These additions will make transparent how the input-dependent mechanism is obtained without any training or gradient steps.
  Revision: yes
- Referee: [§5.1, Table 1] The headline claim that GUARD-IT is the only method to simultaneously preserve utility, suppress memorization, and avoid collapse across all settings relies on aggregate performance numbers, but the table reports point estimates without error bars, run counts, or statistical significance tests against the 12 baselines; this weakens the uniqueness assertion.
  Authors: We acknowledge that the current presentation of Table 1 uses point estimates and would be strengthened by error bars and run counts. Each configuration was evaluated once owing to the high computational cost of the large-model experiments and the full set of baselines. In the revision we will add error bars from repeated runs on the smaller models and include a note on consistency across the three model scales reported in the main results and supplement. The uniqueness claim rests on the observed pattern that GUARD-IT is the only method satisfying all three criteria in every setting; we will qualify the language to reflect the empirical scope while retaining the comparative observation.
  Revision: partial
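The gate described in the first response (a similarity to a forget-set prototype, a projection, and a threshold) could look roughly like the sketch below. The rebuttal does not give the actual definitions, so everything here is an assumption: the prototype is taken to be the normalized mean of forget-set activations, the projection is read as the scalar projection of the input activation onto that prototype, and `build_prototype`, `gate`, and the default `threshold` are hypothetical names and values.

```python
import numpy as np

def build_prototype(forget_activations):
    """Unit-norm mean of forget-set activations (one row per example).

    Assumption: a single mean prototype; no training or gradients involved.
    """
    mean = forget_activations.mean(axis=0)
    return mean / np.linalg.norm(mean)

def gate(h, prototype, threshold=0.5):
    """Return 0.0 below the threshold, else the cosine similarity.

    The scalar projection of h onto the unit prototype, divided by the norm
    of h, gives the cosine similarity used as the gate strength.
    """
    projection = float(np.dot(h, prototype))
    cos = projection / (float(np.linalg.norm(h)) + 1e-12)
    return cos if cos >= threshold else 0.0
```

An activation that is nearly orthogonal to the prototype gets a gate of 0.0 and would be left untouched, which is the property that separates this from a single global steering vector applied to every input.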
Circularity Check
No circularity: the method is a training-free empirical construction, and no derivation reduces to its inputs.
Full rationale
The paper presents GUARD-IT explicitly as a training- and gradient-free inference-time technique that applies input-dependent norm-preserving rotations in the residual stream. No equations, fitting procedures, or self-citations are shown that would make any claimed performance outcome equivalent to its own inputs by construction. The central claims rest on empirical comparisons against 12 baselines on TOFU and MUSE across model scales, with no load-bearing uniqueness theorem, ansatz smuggling, or renaming of known results. The derivation chain is therefore self-contained as a novel algorithmic proposal rather than a tautological reduction.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: We introduce Inference-Time Unlearning via Gated Activation Redirection (GUARD-IT), a training- and gradient-free method that unlearns via input-dependent activation steering at inference time. ... The resulting intervention is applied as a norm-preserving rotation in the residual stream
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: GUARD-IT partitions the forget corpus into semantic clusters ... computes one steering vector per cluster ... routes each user query through a similarity gateway
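The routing step mentioned in the second passage can be sketched as below. This is an illustration under stated assumptions only: cosine similarity between a query embedding and per-cluster centroids stands in for the "similarity gateway", and `route`, `select_steering_vector`, `centroids`, `steering_vectors`, and the `threshold` value are hypothetical names and settings not taken from the paper.

```python
import numpy as np

def route(query_emb, centroids, threshold=0.6):
    """Index of the most similar cluster, or None if no cluster is close."""
    q = query_emb / np.linalg.norm(query_emb)
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    sims = c @ q  # cosine similarity to every cluster centroid
    best = int(np.argmax(sims))
    return best if sims[best] >= threshold else None

def select_steering_vector(query_emb, centroids, steering_vectors, threshold=0.6):
    """Steering vector of the routed cluster, or None to skip intervention."""
    idx = route(query_emb, centroids, threshold)
    return None if idx is None else steering_vectors[idx]
```

Under this reading, a later forget request would only add one more row to `centroids` and `steering_vectors` (for example with `np.vstack`), which is one way continual unlearning could proceed without retraining.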
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.