pith. sign in

arxiv: 2605.12765 · v2 · pith:GUD5X3KTnew · submitted 2026-05-12 · 💻 cs.LG

Inference-Time Machine Unlearning via Gated Activation Redirection

Pith reviewed 2026-05-20 21:51 UTC · model grok-4.3

classification 💻 cs.LG
keywords machine unlearningactivation engineeringinference timelarge language modelsresidual streamgated redirectionquantizationprivacy
1
0 comments X

The pith

GUARD-IT unlearns specific data from large language models by steering activations at inference time without updating weights.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents GUARD-IT as a way to remove the influence of targeted training data from LLMs at inference time. It does this by using a gating mechanism to create input-specific rotations of activations in the residual stream. The goal is to approximate a model that was never trained on the forget set while keeping performance on other data. This matters because current unlearning methods require expensive retraining or cause side effects, whereas this approach leaves the model weights unchanged and works even after quantization.

Core claim

The paper claims that GUARD-IT, a training- and gradient-free method, achieves machine unlearning by applying input-dependent activation steering as norm-preserving rotations in the residual stream during inference. Experiments on the TOFU and MUSE benchmarks demonstrate that it matches or exceeds twelve gradient-based baselines across three model scales. It is the only method tested that simultaneously preserves utility, suppresses memorization, and avoids catastrophic collapse in all settings. The method also enables continual unlearning without retraining and maintains effectiveness when models are quantized.

What carries the argument

Gated activation redirection, which computes an input-dependent steering vector and applies it as a norm-preserving rotation in the residual stream to steer behavior without weight changes.

If this is right

  • Unlearning can be performed without any gradient computations or parameter updates.
  • The approach works on quantized models where weight-editing methods fail.
  • Continual unlearning is possible by applying successive interventions without retraining.
  • Utility and memorization suppression hold across small, medium, and large model scales.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Such inference-time methods could make it feasible to comply with data deletion requests in real-time for deployed AI systems.
  • The technique might be adapted to other goals like reducing hallucinations or enforcing safety constraints dynamically.
  • If the rotations are truly norm-preserving, they may preserve more of the original model's capabilities than additive steering vectors.

Load-bearing premise

That the gating mechanism can produce rotations removing targeted information without unintended side effects or performance degradation on unrelated tasks.

What would settle it

If the model after GUARD-IT still generates content from the forget set on some inputs or shows reduced accuracy on standard benchmarks, the central claim would be falsified.

read the original abstract

Large Language Models memorize vast amounts of training data, raising concerns regarding privacy, copyright infringement, and safety. Machine unlearning seeks to remove the influence of a targeted forget set while preserving model performance, ideally approximating a model retrained from scratch without the forget set. Existing approaches aim to achieve this by updating model parameters via gradient-based methods. However, these updates are computationally expensive, lead to irreversible weight changes, and degrade when the model is quantized for deployment. A recent alternative to changing model weights is activation engineering, where activations are changed during inference to steer model behavior. Despite circumventing weight editing, naive activation steering introduces its own failure modes, as a single global steering vector applies the same intervention to every input, leading to unintended changes in model behavior. We introduce Inference-Time Unlearning via Gated Activation Redirection (GUARD-IT), a training- and gradient-free method that unlearns via input-dependent activation steering at inference time. The resulting intervention is applied as a norm-preserving rotation in the residual stream, leaving model weights untouched. Experiments on TOFU and MUSE show that GUARD-IT matches or exceeds 12 gradient-based baselines across three model scales, while being the only method to simultaneously preserve utility, suppress memorization, and avoid catastrophic collapse across all settings. GUARD-IT further supports continual unlearning without retraining, and remains effective under quantization, a scenario in which parameter-editing methods degrade.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces GUARD-IT, a training- and gradient-free method for machine unlearning in LLMs that performs input-dependent activation steering at inference time. It applies gated redirections in the residual stream as norm-preserving rotations without altering model weights. On TOFU and MUSE benchmarks, GUARD-IT is reported to match or exceed 12 gradient-based baselines across three model scales while being the only approach that simultaneously preserves utility, suppresses memorization, and avoids catastrophic collapse; it also supports continual unlearning without retraining and remains effective under quantization.

Significance. If the empirical claims hold with adequate controls, the work would be significant for offering a practical, reversible, and quantization-robust alternative to parameter-editing unlearning methods. The inference-time approach addresses computational and deployment limitations of gradient-based techniques and extends activation engineering to selective unlearning, which could influence future research on efficient post-training interventions.

major comments (2)
  1. [§3.2] §3.2: The gating function and rotation construction are described conceptually as a fixed, non-learned computation that produces input-specific norm-preserving rotations, but no explicit equation or algorithm is given for how the gate is derived from the forget set (e.g., no definition of the similarity measure, projection, or activation threshold). This is load-bearing for the central claim that the method is training-free yet avoids the failure modes of global steering.
  2. [§5.1, Table 1] §5.1, Table 1: The headline claim that GUARD-IT is the only method to simultaneously preserve utility, suppress memorization, and avoid collapse across all settings relies on aggregate performance numbers, but the table reports point estimates without error bars, run counts, or statistical significance tests against the 12 baselines; this weakens the uniqueness assertion.
minor comments (3)
  1. [Abstract] Abstract: The phrase 'three model scales' is used without naming the specific models or parameter counts.
  2. [§4] §4: The description of the norm-preserving property of the rotation would benefit from a short proof sketch or reference to the relevant linear-algebra fact.
  3. [Figure 3] Figure 3: Axis labels and legend entries are too small for readability in the printed version.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments. We address each major comment point by point below and outline the revisions we will make.

read point-by-point responses
  1. Referee: [§3.2] §3.2: The gating function and rotation construction are described conceptually as a fixed, non-learned computation that produces input-specific norm-preserving rotations, but no explicit equation or algorithm is given for how the gate is derived from the forget set (e.g., no definition of the similarity measure, projection, or activation threshold). This is load-bearing for the central claim that the method is training-free yet avoids the failure modes of global steering.

    Authors: We agree that an explicit mathematical formulation is necessary to fully substantiate the central claims. In the revised manuscript we will expand §3.2 with the precise equations and algorithm: the gate is computed from a similarity measure between the input activation and a prototype derived from the forget-set activations, followed by a projection step and a threshold that determines whether redirection is applied. The rotation itself is constructed as a norm-preserving operation in the residual stream. These additions will make transparent how the input-dependent mechanism is obtained without any training or gradient steps. revision: yes

  2. Referee: [§5.1, Table 1] §5.1, Table 1: The headline claim that GUARD-IT is the only method to simultaneously preserve utility, suppress memorization, and avoid collapse across all settings relies on aggregate performance numbers, but the table reports point estimates without error bars, run counts, or statistical significance tests against the 12 baselines; this weakens the uniqueness assertion.

    Authors: We acknowledge that the current presentation of Table 1 uses point estimates and would be strengthened by error bars and run counts. Each configuration was evaluated once owing to the high computational cost of the large-model experiments and the full set of baselines. In the revision we will add error bars from repeated runs on the smaller models and include a note on consistency across the three model scales reported in the main results and supplement. The uniqueness claim rests on the observed pattern that GUARD-IT is the only method satisfying all three criteria in every setting; we will qualify the language to reflect the empirical scope while retaining the comparative observation. revision: partial

Circularity Check

0 steps flagged

No circularity: method is training-free empirical construction with no derivation reducing to inputs

full rationale

The paper presents GUARD-IT explicitly as a training- and gradient-free inference-time technique that applies input-dependent norm-preserving rotations in the residual stream. No equations, fitting procedures, or self-citations are shown that would make any claimed performance outcome equivalent to its own inputs by construction. The central claims rest on empirical comparisons against 12 baselines on TOFU and MUSE across model scales, with no load-bearing uniqueness theorem, ansatz smuggling, or renaming of known results. The derivation chain is therefore self-contained as a novel algorithmic proposal rather than a tautological reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The method implicitly assumes that activation steering can be made input-dependent and norm-preserving without side effects.

pith-pipeline@v0.9.0 · 5836 in / 1222 out tokens · 82104 ms · 2026-05-20T21:51:21.408844+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 4 internal anchors

  1. [1]

    TOFU: A Task of Fictitious Unlearning for LLMs

    Pratyush Maini and Zhili Feng and Avi Schwarzschild and Zachary C. Lipton and J. Zico Kolter , year=. 2401.06121 , archivePrefix=

  2. [2]

    Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks , url =

    Reimers, Nils and Gurevych, Iryna , biburl =. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks , url =. EMNLP/IJCNLP (1) , crossref =

  3. [3]

    Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019

  4. [4]

    Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , year =

    Jang, Joel and Yoon, Dongkeun and Yang, Sohee and Cha, Sungmin and Lee, Moontae and Logeswaran, Lajanugen and Seo, Minjoon , title =. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , year =

  5. [5]

    Smith and Chiyuan Zhang , booktitle=

    Weijia Shi and Jaechan Lee and Yangsibo Huang and Sadhika Malladi and Jieyu Zhao and Ari Holtzman and Daogao Liu and Luke Zettlemoyer and Noah A. Smith and Chiyuan Zhang , booktitle=. 2025 , url=

  6. [6]

    ArXiv , year=

    Representation Engineering: A Top-Down Approach to AI Transparency , author=. ArXiv , year=

  7. [7]

    Steering Llama 2 via Contrastive Activation Addition , url =

    Rimsky, Nina and Gabrieli, Nick and Schulz, Julian and Tong, Meg and Hubinger, Evan and Turner, Alexander. Steering Llama 2 via Contrastive Activation Addition. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.828

  8. [8]

    ICML 2025 Workshop on Reliable and Responsible Foundation Models , year=

    Angular Steering: Behavior Control via Rotation in Activation Space , author=. ICML 2025 Workshop on Reliable and Responsible Foundation Models , year=

  9. [9]

    2026 , eprint=

    Spherical Steering: Geometry-Aware Activation Rotation for Language Models , author=. 2026 , eprint=

  10. [10]

    Thirty-seventh Conference on Neural Information Processing Systems , year=

    Inference-Time Intervention: Eliciting Truthful Answers from a Language Model , author=. Thirty-seventh Conference on Neural Information Processing Systems , year=

  11. [11]

    The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

    Refusal in Language Models Is Mediated by a Single Direction , author=. The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

  12. [12]

    Simplicity Prevails: Rethinking Negative Preference Optimization for

    Chongyu Fan and Jiancheng Liu and Licong Lin and Jinghan Jia and Ruiqi Zhang and Song Mei and Sijia Liu , booktitle=. Simplicity Prevails: Rethinking Negative Preference Optimization for. 2024 , url=

  13. [13]

    Nature Machine Intelligence , volume =

    Rethinking machine unlearning for large language models , author =. Nature Machine Intelligence , volume =. 2025 , publisher =

  14. [14]

    BLUR : A Bi-Level Optimization Approach for LLM Unlearning

    Reisizadeh, Hadi and Jia, Jinghan and Bu, Zhiqi and Vinzamuri, Bhanukiran and Ramakrishna, Anil and Chang, Kai-Wei and Cevher, Volkan and Liu, Sijia and Hong, Mingyi. BLUR : A Bi-Level Optimization Approach for LLM Unlearning. Proceedings of the 19th Conference of the E uropean Chapter of the A ssociation for C omputational L inguistics (Volume 1: Long Pa...

  15. [15]

    The Thirteenth International Conference on Learning Representations , year=

    Programming Refusal with Conditional Activation Steering , author=. The Thirteenth International Conference on Learning Representations , year=

  16. [16]

    Semantics-Adaptive Activation Intervention for

    Weixuan Wang and JINGYUAN YANG and Wei Peng , booktitle=. Semantics-Adaptive Activation Intervention for. 2025 , url=

  17. [17]

    Proceedings of the 41st International Conference on Machine Learning , articleno =

    Park, Kiho and Choe, Yo Joong and Veitch, Victor , title =. Proceedings of the 41st International Conference on Machine Learning , articleno =. 2024 , publisher =

  18. [18]

    2022 , eprint=

    Toy Models of Superposition , author=. 2022 , eprint=

  19. [19]

    Quantization-Robust

    Jo. Quantization-Robust. 2026 , eprint=

  20. [20]

    Proceedings of The 1st Conference on Lifelong Learning Agents , editor =

    Liu, Bo and Liu, Qiang and Stone, Peter , title =. Proceedings of The 1st Conference on Lifelong Learning Agents , editor =. 2022 , url =

  21. [21]

    The Probabilistic Relevance Framework:

    Robertson, Stephen and Zaragoza, Hugo , year =. The Probabilistic Relevance Framework:. Foundations and Trends in Information Retrieval , doi =

  22. [22]

    Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks , year =

    Thakur, Nandan and Reimers, Nils and R. Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks , year =

  23. [23]

    2025 , issue_date =

    Meng, Chuan and Arabzadeh, Negar and Askari, Arian and Aliannejadi, Mohammad and Rijke, Maarten de , title =. 2025 , issue_date =. doi:10.1145/3736402 , journal =

  24. [24]

    Proceedings of the 34th International Conference on Neural Information Processing Systems , articleno =

    Yu, Tianhe and Kumar, Saurabh and Gupta, Abhishek and Levine, Sergey and Hausman, Karol and Finn, Chelsea , title =. Proceedings of the 34th International Conference on Neural Information Processing Systems , articleno =. 2020 , isbn =

  25. [25]

    Proceedings of the 35th International Conference on Neural Information Processing Systems , articleno =

    Liu, Bo and Liu, Xingchao and Jin, Xiaojie and Stone, Peter and Liu, Qiang , title =. Proceedings of the 35th International Conference on Neural Information Processing Systems , articleno =. 2021 , isbn =

  26. [26]

    and Dombrowski, Ann-Kathrin and Goel, Shashwat and Phan, Long and Mukobi, Gabriel and Helm-Burger, Nathan and Laban, Rassin and Hendrycks, Dan , booktitle =

    Li, Nathaniel and Pan, Alexander and Gopal, Anjali and Yue, Summer and Berrios, Daniel and Gatti, Alice and Li, Justin D. and Dombrowski, Ann-Kathrin and Goel, Shashwat and Phan, Long and Mukobi, Gabriel and Helm-Burger, Nathan and Laban, Rassin and Hendrycks, Dan , booktitle =. The

  27. [27]

    The Eleventh International Conference on Learning Representations , year =

    Editing Models with Task Arithmetic , author =. The Eleventh International Conference on Learning Representations , year =

  28. [28]

    Proceedings of the 41st International Conference on Machine Learning , year =

    In-Context Unlearning: Language Models as Few-Shot Unlearners , author =. Proceedings of the 41st International Conference on Machine Learning , year =

  29. [29]

    2024 , eprint=

    Steering Language Models With Activation Engineering , author=. 2024 , eprint=

  30. [30]

    ACM Computing Surveys , volume =

    Fairness in Deep Learning: A survey on vision and language research , author =. ACM Computing Surveys , volume =. 2025 , publisher =

  31. [31]

    Model Internal Sleuthing: Finding Lexical Identity and Inflectional Features in Modern Language Models

    Michael Li and Nishant Subramani , year=. Echoes of. 2506.02132 , archivePrefix=

  32. [32]

    UNDIAL : Self-Distillation with Adjusted Logits for Robust Unlearning in Large Language Models

    Dong, Yijiang River and Lin, Hongzhou and Belkin, Mikhail and Huerta, Ramon and Vuli \'c , Ivan. UNDIAL : Self-Distillation with Adjusted Logits for Robust Unlearning in Large Language Models. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: L...

  33. [33]

    Thirty-seventh Conference on Neural Information Processing Systems , year=

    Direct Preference Optimization: Your Language Model is Secretly a Reward Model , author=. Thirty-seventh Conference on Neural Information Processing Systems , year=

  34. [34]

    First Conference on Language Modeling , year=

    Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning , author=. First Conference on Language Modeling , year=

  35. [35]

    OpenUnlearning: Accelerating

    Vineeth Dorna and Anmol Reddy Mekala and Wenlong Zhao and Andrew McCallum and J Zico Kolter and Zachary Chase Lipton and Pratyush Maini , booktitle=. OpenUnlearning: Accelerating. 2026 , url=

  36. [36]

    Advances in Neural Information Processing Systems , volume=

    Analysing the generalisation and reliability of steering vectors , author=. Advances in Neural Information Processing Systems , volume=

  37. [37]

    Layer by Layer: Uncovering Hidden Representations in Language Models

    Layer by layer: Uncovering hidden representations in language models , author=. arXiv preprint arXiv:2502.02013 , year=

  38. [38]

    The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale , url =

    Penedo, Guilherme and Kydl\'. The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale , url =. Advances in Neural Information Processing Systems , doi =

  39. [39]

    Locating and Editing Factual Associations in

    Meng, Kevin and Bau, David and Andonian, Alex and Belinkov, Yonatan , booktitle =. Locating and Editing Factual Associations in. 2022 , publisher =

  40. [40]

    arXiv preprint arXiv:2406.01506 , year=

    The geometry of categorical and hierarchical concepts in large language models , author=. arXiv preprint arXiv:2406.01506 , year=

  41. [41]

    arXiv preprint arXiv:2410.16454 , year=

    Catastrophic failure of llm unlearning via quantization , author=. arXiv preprint arXiv:2410.16454 , year=

  42. [42]

    2025 , eprint=

    Gated Integration of Low-Rank Adaptation for Continual Learning of Large Language Models , author=. 2025 , eprint=

  43. [43]

    Shen and Xinchi Qiu and Meghdad Kurmanji and Alex Iacob and Lorenzo Sani and Yihong Chen and Nicola Cancedda and Nicholas D

    William F. Shen and Xinchi Qiu and Meghdad Kurmanji and Alex Iacob and Lorenzo Sani and Yihong Chen and Nicola Cancedda and Nicholas D. Lane , year=. 2502.07218 , archivePrefix=

  44. [44]

    2023 , eprint=

    Machine Unlearning: A Survey , author=. 2023 , eprint=

  45. [45]

    2023 , eprint=

    Learn to Unlearn: A Survey on Machine Unlearning , author=. 2023 , eprint=

  46. [46]

    IEEE Transactions on Emerging Topics in Computational Intelligence , volume =

    Machine Unlearning: Solutions and Challenges , author =. IEEE Transactions on Emerging Topics in Computational Intelligence , volume =. 2024 , month = jun, doi =

  47. [47]

    The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

    Large Language Model Unlearning , author=. The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

  48. [48]

    Retrieval-Augmented Generation for Knowledge-Intensive

    Patrick Lewis and Ethan Perez and Aleksandra Piktus and Fabio Petroni and Vladimir Karpukhin and Naman Goyal and Heinrich K. Retrieval-Augmented Generation for Knowledge-Intensive. 2020 , eprint=

  49. [49]

    Asai, Akari and Wu, Zeqiu and Wang, Yizhong and Sil, Avirup and Hajishirzi, Hannaneh , booktitle =. Self-. 2024 , url =

  50. [50]

    2022 , eprint=

    Memory-Based Model Editing at Scale , author=. 2022 , eprint=

  51. [51]

    2025 , eprint=

    A Law of Next-Token Prediction in Large Language Models , author=. 2025 , eprint=

  52. [52]

    The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

    Constrained Entropic Unlearning: A Primal-Dual Framework for Large Language Models , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

  53. [53]

    Exploring Criteria of Loss Reweighting to Enhance

    Puning Yang and Qizhou Wang and Zhuo Huang and Tongliang Liu and Chengqi Zhang and Bo Han , booktitle=. Exploring Criteria of Loss Reweighting to Enhance. 2025 , url=

  54. [54]

    Rethinking

    Qizhou Wang and Jin Peng Zhou and Zhanke Zhou and Saebyeol Shin and Bo Han and Kilian Q Weinberger , booktitle=. Rethinking. 2025 , url=

  55. [55]

    2025 , eprint=

    A Comprehensive Survey of Machine Unlearning Techniques for Large Language Models , author=. 2025 , eprint=

  56. [56]

    2025 , url =

    Yang, Bo , journal =. 2025 , url =

  57. [57]

    The Llama 3 Herd of Models

    The Llama 3 Herd of Models , author=. arXiv preprint arXiv:2407.21783 , year=

  58. [58]

    Dettmers, Tim and Lewis, Mike and Belkada, Younes and Zettlemoyer, Luke , booktitle=

  59. [59]

    Dettmers, Tim and Pagnoni, Artidoro and Holtzman, Ari and Zettlemoyer, Luke , booktitle=