pith. machine review for the scientific record.

arxiv: 2604.12820 · v1 · submitted 2026-04-14 · 💻 cs.AI · cs.CL

Recognition: unknown

RePAIR: Interactive Machine Unlearning through Prompt-Aware Model Repair

Pith reviewed 2026-05-10 16:08 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords interactive machine unlearning · prompt-aware model repair · activation manipulation · LLM editing · on-device unlearning · refusal subspace · pseudoinverse update

The pith

RePAIR lets users tell large language models to forget specific knowledge through natural language prompts at inference time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a new paradigm of Interactive Machine Unlearning in which end users can direct LLMs to remove targeted information such as harmful facts, misinformation, or personal data using ordinary prompts. It presents RePAIR, a three-part framework consisting of a watchdog that detects unlearning requests, a surgeon that generates repair steps, and a patient model that applies autonomous parameter updates. The core technique, STAMP, performs training-free unlearning by shifting MLP activations toward a refusal direction through a closed-form pseudoinverse calculation on a single example. This approach matters because existing unlearning methods require model providers to run full retraining pipelines, leaving ordinary users without any direct control over what their models remember. If the method works as described, it opens the door to on-device, user-initiated editing that preserves overall model capability while driving forget metrics to zero.

Core claim

RePAIR achieves near-zero forget scores on targeted knowledge while retaining high utility on unrelated tasks by redirecting model activations toward refusal subspaces with single-sample pseudoinverse updates, outperforming six prior unlearning baselines across harmful knowledge suppression, misinformation correction, and personal data erasure tasks.

What carries the argument

STAMP (Steering Through Activation Manipulation with PseudoInverse), a training-free method that uses closed-form pseudoinverse updates on MLP activations to steer outputs into a refusal subspace for a targeted prompt.
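
As a rough illustration of what a closed-form, single-sample pseudoinverse update can look like, the sketch below applies a rank-one, minimum-norm correction to one linear map so that a chosen input activation lands on a chosen target vector. This is a generic linear-algebra construction, not the paper's STAMP formula, which this page does not reproduce. The names `W`, `k`, `v_refusal`, and `single_sample_update` are hypothetical, and the sizes are toy values.

```python
import numpy as np

def single_sample_update(W, k, v_target):
    """Return W' = W + (v_target - W k) k^+, the minimum-Frobenius-norm
    change to W that maps the single input k exactly onto v_target.

    For a nonzero vector k, the Moore-Penrose pseudoinverse is
    k^+ = k^T / (k^T k), so the correction is a rank-one outer product.
    """
    k = np.asarray(k, dtype=float)
    residual = v_target - W @ k          # gap between current output and target
    k_pinv = k / (k @ k)                 # pseudoinverse of k, stored as a 1-D array
    return W + np.outer(residual, k_pinv)

rng = np.random.default_rng(0)
d_in, d_out = 8, 6                        # toy sizes, not real model dimensions
W = rng.normal(size=(d_out, d_in))        # stand-in for one MLP projection matrix
k = rng.normal(size=d_in)                 # hypothetical activation for the prompt to forget
v_refusal = rng.normal(size=d_out)        # hypothetical refusal-direction output

W_new = single_sample_update(W, k, v_refusal)

# The edited map sends k to the target ...
assert np.allclose(W_new @ k, v_refusal)
# ... and leaves any input orthogonal to k untouched.
u = rng.normal(size=d_in)
u_orth = u - (u @ k) / (k @ k) * k
assert np.allclose(W_new @ u_orth, W @ u_orth)
```

The second assertion shows that inputs orthogonal to `k` are unaffected by such an update. Inputs that overlap with `k` are shifted in proportion to that overlap, which is the side-effect question the load-bearing premise turns on.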

If this is right

  • Users gain direct, inference-time control over model knowledge without needing provider intervention or retain datasets.
  • The low-rank STAMP variant reduces complexity from cubic to near-linear in dimension, supporting efficient on-device operation with up to 3x speedup.
  • The same mechanism works across harmful knowledge removal, misinformation correction, and personal data erasure while keeping retain accuracy high.
  • Model updates remain autonomous once the repair procedure is generated, enabling transparent user-driven editing.
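
To give a sense of scale for the cubic versus low-rank costs, the worked arithmetic below takes $d = 4096$ (the hidden size of Llama-3-8B) and a hypothetical rank $r = 16$. Both choices are assumptions for illustration: this page does not say which dimension $d$ denotes in the paper or which rank was used.

```latex
\begin{align*}
d^3 &= 4096^3 = 2^{36} \approx 6.9 \times 10^{10},\\
r^3 + r^2 d &= 16^3 + 16^2 \cdot 4096 = 4096 + 1{,}048{,}576 = 1{,}052{,}672 \approx 1.1 \times 10^{6},\\
\frac{d^3}{r^3 + r^2 d} &= \frac{2^{36}}{2^{12} \cdot 257} = \frac{2^{24}}{257} \approx 6.5 \times 10^{4}.
\end{align*}
```

These are asymptotic operation counts with constants dropped, so the ratio is not a wall-clock prediction. The reported speedup of up to 3x is stated relative to training-based baselines, not relative to the full-rank variant.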

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The framework could support conversational, multi-turn unlearning where users refine what to forget over several prompts.
  • Similar activation-steering ideas might apply to selective knowledge addition or editing rather than only removal.
  • Widespread adoption would reduce dependence on centralized providers for correcting model behavior.

Load-bearing premise

Redirecting activations via a single-sample pseudoinverse update will suppress only the intended knowledge without side effects on unrelated capabilities or future inputs.

What would settle it

Apply RePAIR to a concrete fact such as a specific personal detail or false claim, then measure whether the model refuses all queries about that fact while still correctly answering unrelated questions and maintaining benchmark scores.
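
The check described above could be scripted along these lines. This is a minimal sketch with hypothetical names throughout: `model` is any callable from prompt to text, `is_refusal` is a crude keyword test standing in for a proper refusal classifier, and the toy prompts and toy model are invented for illustration. The refusal rate computed here is not the paper's Acc_f or F-RL, whose definitions are not given on this page.

```python
from typing import Callable, Iterable, Tuple

REFUSAL_MARKERS = ("i cannot", "i can't", "i don't know", "unable to")  # assumed markers

def is_refusal(text: str) -> bool:
    """Crude stand-in for a refusal detector: keyword match on the reply."""
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def refusal_rate(model: Callable[[str], str], forget_prompts: Iterable[str]) -> float:
    """Fraction of forget-set prompts (including paraphrases) that are refused."""
    replies = [model(p) for p in forget_prompts]
    return sum(is_refusal(r) for r in replies) / len(replies)

def retain_accuracy(model: Callable[[str], str],
                    retain_items: Iterable[Tuple[str, str]]) -> float:
    """Fraction of unrelated questions whose reply contains the expected answer."""
    items = list(retain_items)
    hits = sum(answer.lower() in model(q).lower() for q, answer in items)
    return hits / len(items)

# Toy stand-in for an edited model: refuses anything mentioning "alice".
def toy_model(prompt: str) -> str:
    if "alice" in prompt.lower():
        return "I cannot help with that."
    return {"What is 2 + 2?": "4", "Capital of France?": "Paris"}.get(prompt, "unsure")

forget = ["Where does Alice live?", "Tell me ALICE's home address."]
retain = [("What is 2 + 2?", "4"), ("Capital of France?", "Paris")]

print(refusal_rate(toy_model, forget))     # 1.0
print(retain_accuracy(toy_model, retain))  # 1.0
```

Passing paraphrases of the target fact in `forget` and benchmark questions in `retain` gives the two numbers the test calls for: one should stay near 1.0 for refusals, the other should not drop after the edit.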

Figures

Figures reproduced from arXiv: 2604.12820 by Amit Shukla, Jagadeesh Rachapudi, Praful Hambarde, Pranav Singh, Ritali Vatsi.

Figure 1: Motivating example for Interactive Machine Un…
Figure 2: Conceptual illustration of RePAIR. M_surgeon repairs M_patient (left) using STAMP, transforming it into M_healed (right). Despite their empirical differences, these methods share a common structural limitation: they are designed for practitioners with deep access to model internals, requiring curated retain datasets and full training pipelines. End users, the very individuals whose data is at stake, are entirel…
Figure 3: Overview of the RePAIR framework. User U interacts with M_patient via prompts p_t and responses r_t. M_watchdog detects unlearning requests from H_t and forwards (p_f, r_f) to M_surgeon, which generates C_t to transform M_patient into M_healed. 4 Method. We propose RePAIR, a framework for interactive machine unlearning with three components: M_patient interacts with user U through prompts and responses, M_watchdog mo…
Figure 4: SwiGLU MLP architecture in Llama-3-8B. STAMP…
Figure 5: Cosine divergence between WMDP and refusal acti…
Original abstract

Large language models (LLMs) inherently absorb harmful knowledge, misinformation, and personal data during pretraining on large-scale web corpora, with no native mechanism for selective removal. While machine unlearning offers a principled solution, existing approaches are provider-centric, requiring retraining pipelines, curated retain datasets, and direct intervention by model service providers (MSPs), thereby excluding end users from controlling their own data. We introduce Interactive Machine Unlearning (IMU), a new paradigm in which users can instruct LLMs to forget targeted knowledge through natural language at inference time. To realize IMU, we propose RePAIR, a prompt-aware model repair framework comprising (i) a watchdog model for unlearning intent detection, (ii) a surgeon model for generating repair procedures, and (iii) a patient model whose parameters are updated autonomously. At the core of RePAIR, we develop Steering Through Activation Manipulation with PseudoInverse (STAMP), a training-free, single-sample unlearning method that redirects MLP activations toward a refusal subspace via closed-form pseudoinverse updates. Its low-rank variant reduces computational complexity from O(d^3) to O(r^3 + r^2 * d), enabling efficient on-device unlearning with up to ~3x speedup over training-based baselines. Extensive experiments across harmful knowledge suppression, misinformation correction, and personal data erasure demonstrate that RePAIR achieves near-zero forget scores (Acc_f = 0.00, F-RL = 0.00) while preserving model utility (Acc_r up to 84.47, R-RL up to 0.88), outperforming six state-of-the-art baselines. These results establish RePAIR as an effective and practical framework for user-driven model editing, advancing transparent and on-device control over learned knowledge, with potential extensions to multimodal foundation models.
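
One standard way a cost of the form O(r^3 + r^2 * d) arises is when the pseudoinverse of a tall d x r matrix with full column rank is computed through its r x r Gram matrix. The sketch below shows that route and checks it against NumPy's SVD-based `pinv`. Whether the paper's low-rank variant is built this way is an assumption, and `K`, `d`, `r`, and `pinv_tall` are illustrative names and sizes.

```python
import numpy as np

def pinv_tall(K):
    """Pseudoinverse of a tall, full-column-rank matrix K (shape d x r).

    K^+ = (K^T K)^{-1} K^T. Forming the Gram matrix K^T K costs O(r^2 d);
    solving against it costs O(r^3) for the factorization plus O(r^2 d)
    for the d right-hand sides.
    """
    gram = K.T @ K                       # r x r
    return np.linalg.solve(gram, K.T)    # r x d, avoids forming an explicit inverse

rng = np.random.default_rng(0)
d, r = 512, 8                            # toy sizes with r much smaller than d
K = rng.normal(size=(d, r))              # Gaussian columns: full column rank with probability 1

K_pinv = pinv_tall(K)

assert K_pinv.shape == (r, d)
assert np.allclose(K_pinv, np.linalg.pinv(K))   # agrees with the SVD-based result
assert np.allclose(K_pinv @ K, np.eye(r))       # acts as a left inverse of K
```

The Gram route squares the condition number of `K`, so an SVD- or QR-based solver is the safer choice when the columns are nearly dependent.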

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Interactive Machine Unlearning (IMU), a paradigm enabling end-users to remove targeted knowledge from LLMs via natural-language prompts at inference time. RePAIR realizes IMU through a watchdog model (intent detection), surgeon model (repair procedure generation), and patient model (autonomous parameter updates). Its core contribution is STAMP, a training-free method that applies closed-form pseudoinverse updates to MLP activations to steer them toward a refusal subspace; a low-rank variant reduces complexity from O(d^3) to O(r^3 + r^2 d). Experiments on harmful-knowledge suppression, misinformation correction, and personal-data erasure report near-zero forget metrics (Acc_f = 0.00, F-RL = 0.00) while retaining utility (Acc_r up to 84.47, R-RL up to 0.88) and outperforming six baselines.

Significance. If the central empirical claims are substantiated, RePAIR would constitute a meaningful advance by shifting machine unlearning from provider-centric retraining pipelines to user-driven, on-device, inference-time editing. The training-free, closed-form character of STAMP together with the explicit low-rank speedup are concrete strengths that could extend to multimodal models. These elements directly address the absence of native selective-removal mechanisms in current LLMs.

major comments (3)
  1. [§3] §3 (STAMP formulation): The single-sample pseudoinverse update that redirects MLP activations to the refusal subspace is presented as sufficient to achieve Acc_f = 0.00 and F-RL = 0.00 across all future prompts. No derivation or bound is supplied showing that this low-rank adjustment isolates the target knowledge from the highly entangled representations typical of LLMs; without such analysis the perfect forget scores rest on an unverified isolation assumption that is load-bearing for the central claim.
  2. [§4] §4 (Experimental results): The reported metrics (Acc_f = 0.00, Acc_r up to 84.47, outperformance over six baselines) are given without any description of the underlying datasets, patient-model architecture, number of evaluation prompts, statistical significance tests, or ablation on the rank-r hyperparameter and subspace stability. This absence prevents verification that the near-zero forget scores generalize beyond the specific single-sample inputs used for the update.
  3. [§4.2] §4.2 (Ablation or OOD evaluation): No experiments test the stability of the refusal subspace under paraphrased prompts, context shifts, or out-of-distribution retain tasks. Because the method operates on a single sample, such tests are required to substantiate that collateral degradation on unrelated capabilities does not occur, directly addressing the weakest link in the generalization argument.
minor comments (2)
  1. [Abstract] Abstract: The quantitative claims are stated without naming the patient LLM, the concrete datasets, or the precise definition of the refusal subspace, reducing immediate readability.
  2. [§2] Notation: The symbols Acc_f, F-RL, Acc_r, and R-RL are used without an explicit equation or table defining them in the main text; a short definitions table would clarify the forget/retain distinction.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. The comments highlight important areas for strengthening the theoretical grounding, experimental transparency, and generalization analysis of RePAIR and STAMP. We address each major comment below and indicate the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: [§3] §3 (STAMP formulation): The single-sample pseudoinverse update that redirects MLP activations to the refusal subspace is presented as sufficient to achieve Acc_f = 0.00 and F-RL = 0.00 across all future prompts. No derivation or bound is supplied showing that this low-rank adjustment isolates the target knowledge from the highly entangled representations typical of LLMs; without such analysis the perfect forget scores rest on an unverified isolation assumption that is load-bearing for the central claim.

    Authors: We agree that a formal derivation or bound demonstrating isolation of the target knowledge would provide stronger theoretical support, particularly given the entangled representations in LLMs. STAMP is presented as an empirical, training-free method that uses the closed-form pseudoinverse to steer activations toward a refusal subspace computed from the single sample; the perfect forget metrics are observed empirically rather than proven via isolation guarantees. In the revision we will expand §3 with an explicit discussion of the method's assumptions, the geometry of the update, and the empirical evidence from our experiments that the refusal subspace effectively suppresses the targeted knowledge without requiring a full theoretical bound. We will also flag the absence of such a bound as an open direction for future work. revision: partial

  2. Referee: [§4] §4 (Experimental results): The reported metrics (Acc_f = 0.00, Acc_r up to 84.47, outperformance over six baselines) are given without any description of the underlying datasets, patient-model architecture, number of evaluation prompts, statistical significance tests, or ablation on the rank-r hyperparameter and subspace stability. This absence prevents verification that the near-zero forget scores generalize beyond the specific single-sample inputs used for the update.

    Authors: The current manuscript contains high-level descriptions of the evaluation domains and patient-model families, but we acknowledge that the level of detail is insufficient for independent verification. In the revised version we will expand §4 to include: (i) precise dataset citations and construction details for the harmful-knowledge, misinformation, and personal-data tasks; (ii) the exact patient-model architectures and sizes; (iii) the total number of evaluation prompts per task; (iv) statistical significance testing (e.g., paired t-tests or bootstrap confidence intervals) on the reported metrics; and (v) a dedicated ablation study on the rank-r hyperparameter together with quantitative measures of subspace stability across runs. revision: yes

  3. Referee: [§4.2] §4.2 (Ablation or OOD evaluation): No experiments test the stability of the refusal subspace under paraphrased prompts, context shifts, or out-of-distribution retain tasks. Because the method operates on a single sample, such tests are required to substantiate that collateral degradation on unrelated capabilities does not occur, directly addressing the weakest link in the generalization argument.

    Authors: We concur that robustness to paraphrases, context shifts, and OOD retain tasks is essential to substantiate generalization, especially for a single-sample method. We will add a new set of experiments (to be placed in an expanded §4.2 or a dedicated robustness subsection) that evaluate STAMP on paraphrased forget prompts, shifted conversational contexts, and unrelated retain tasks drawn from standard benchmarks. These experiments will report both forget and retain metrics to quantify any collateral degradation and will include qualitative examples of subspace behavior under these variations. revision: yes
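
The bootstrap confidence intervals proposed in the second response could be computed roughly as follows. This is a generic percentile-bootstrap sketch: the per-prompt scores are made-up placeholder values, not results from the paper, and `bootstrap_ci` is a hypothetical helper name.

```python
import numpy as np

def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-prompt scores."""
    scores = np.asarray(scores, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample prompts with replacement; each row is one bootstrap replicate.
    idx = rng.integers(0, len(scores), size=(n_resamples, len(scores)))
    means = scores[idx].mean(axis=1)
    lower, upper = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), lower, upper

# Placeholder per-prompt retain correctness (1 = correct, 0 = wrong); invented data.
retain_correct = np.array([1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1])

mean, lo, hi = bootstrap_ci(retain_correct)
print(f"mean={mean:.2f}, 95% CI=[{lo:.2f}, {hi:.2f}]")
```

For a forget metric that is exactly 0.00 on every prompt, every resample also has mean zero and the interval collapses to a point, so the number and diversity of evaluation prompts carry more information than the interval itself.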

Circularity Check

0 steps flagged

No significant circularity; empirical results independent of inputs

full rationale

The paper introduces RePAIR and STAMP as a training-free method using closed-form pseudoinverse updates on MLP activations from a single sample to redirect toward a refusal subspace. Performance claims (near-zero forget scores with preserved utility) are presented strictly as outcomes of experiments comparing against six baselines, with no equations, derivations, or predictions shown that reduce the reported metrics to fitted parameters, self-definitions, or self-citations by construction. No load-bearing self-citations, ansatzes, or renamings of known results appear in the provided text. The derivation chain is self-contained as an empirical proposal.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 3 invented entities

The approach relies on standard linear algebra for pseudoinverse computation and assumes the existence of a separable refusal subspace in model activations; no free parameters are explicitly fitted in the abstract description, but the low-rank approximation introduces a rank hyperparameter.

free parameters (1)
  • rank r in low-rank variant
    Controls the dimensionality reduction for computational speedup; value not specified but affects O(r^3 + r^2 * d) complexity.
axioms (2)
  • [standard math] Closed-form pseudoinverse provides an exact solution for redirecting activations to a target subspace without iterative optimization.
    Invoked as the core of the training-free STAMP update.
  • [domain assumption] MLP activations contain a distinct refusal subspace that can be targeted to suppress specific knowledge.
    Central to the activation manipulation mechanism.
invented entities (3)
  • refusal subspace (no independent evidence)
    purpose: Target direction in activation space for unlearning specific knowledge.
    Postulated as part of STAMP; no independent evidence or external validation provided.
  • watchdog model (no independent evidence)
    purpose: Detects unlearning intent from user prompts.
    New component in the RePAIR architecture.
  • surgeon model (no independent evidence)
    purpose: Generates repair procedures for the patient model.
    New component in the RePAIR architecture.

pith-pipeline@v0.9.0 · 5646 in / 1580 out tokens · 82680 ms · 2026-05-10T16:08:05.477915+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. BackFlush: Knowledge-Free Backdoor Detection and Elimination with Watermark Preservation in Large Language Models

    cs.CR 2026-04 unverdicted novelty 6.0

    BackFlush detects backdoors via susceptibility amplification and eliminates them with RoPE unlearning to reach 1% ASR and 99% clean accuracy while preserving watermarks.

Reference graph

Works this paper leans on

29 extracted references · cited by 1 Pith paper

  [1] Ekin Akyürek, Mehul Damani, Adam Zweiger, Linlu Qiu, Han Guo, Jyothish Pari, Yoon Kim, and Jacob Andreas. 2024. The surprising effectiveness of test-time training for few-shot learning.
  [2] Rachit Bansal, Aston Zhang, Rishabh Tiwari, Lovish Madaan, Sai Surya Duvvuri, Devvrit Khatri, David Brandfonbrener, David Alvarez-Melis, Prajjwal Bhargava, Mihir Sanjay Kale, et al. 2025. Let’s (not) just put things in Context: Test-Time Training for Long-Context LLMs.
  [3] Ali Behrouz, Peilin Zhong, and Vahab Mirrokni. 2024. Titans: Learning to memorize at test time.
  [4] Rob Bonta. 2022. California consumer privacy act (CCPA). 4–40 pages.
  [5] Junhao Chen, Bowen Wang, Zhouqiang Jiang, and Yuta Nakashima. 2025. Putting people in LLMs’ shoes: generating better answers via question rewriter. 23577–23585 pages.
  [6] Kate Crawford and Trevor Paglen. 2021. Excavating AI: The politics of images in machine learning training sets. AI & Society 36, 4 (2021), 1105–1116.
  [7] Ronen Eldan and Yuanzhi Li. 2023. TinyStories: How small can language models be and still speak coherent English?
      [running page header: ACM MM ’26, October 2026, Melbourne, Australia. Rachapudi et al.]
  [8] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. 2024. The Llama 3 herd of models.
  [9] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2020. Measuring massive multitask language understanding.
  [10] Jinwu Hu, Zhitian Zhang, Guohao Chen, Xutao Wen, Chao Shuai, Wei Luo, Bin Xiao, Yuanqing Li, and Mingkui Tan. 2025. Test-time learning for large language models.
  [11] Yuheng Huang, Jiayang Song, Zhijie Wang, Shengming Zhao, Huaming Chen, Felix Juefei-Xu, and Lei Ma. 2023. Look before you leap: An exploratory study of uncertainty measurement for large language models.
  [12] Zhehao Huang, Xinwen Cheng, Jie Zhang, Jinghao Zheng, Haoran Wang, Zhengbao He, Tao Li, and Xiaolin Huang. 2025. A unified gradient-based framework for task-agnostic continual learning-unlearning.
  [13] Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. 2024. Qwen2.5-Coder technical report.
  [14] Albert Qiaochu Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B. https://api...
  [15] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models.
  [16] Pranjal Kumar. 2024. Large language models (LLMs): survey, technical frameworks, and future challenges. Artificial Intelligence Review 57, 10 (2024), 260.
  [17] Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D Li, Ann-Kathrin Dombrowski, Shashwat Goel, Long Phan, et al. 2024. The WMDP benchmark: Measuring and reducing malicious use with unlearning.
  [18] Wangyue Li, Liangzhi Li, Tong Xiang, Xiao Liu, Wei Deng, and Noa Garcia. 2024. Can multiple-choice questions really be useful in detecting the abilities of LLMs? 2819–2834 pages.
  [19] Data Protection. 2018. General data protection regulation.
  [20] Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, et al. 2024. Learning to (learn at test time): RNNs with expressive hidden states.
  [21] Arnuv Tandon, Karan Dalal, Xinhao Li, Daniel Koceja, Marcel Rød, Sam Buchanan, Xiaolong Wang, Jure Leskovec, Sanmi Koyejo, Tatsunori Hashimoto, et al. 2025. End-to-end test-time training for long context.
  [22] Kun Wang, Guibin Zhang, Zhenhong Zhou, Jiahao Wu, Miao Yu, Shiqian Zhao, Chenlong Yin, Jinhu Fu, Yibo Yan, Hanjun Luo, et al. 2025. A comprehensive survey in LLM(-agent) full stack safety: Data, training and deployment.
  [23] Qizhou Wang, Jin Peng Zhou, Zhanke Zhou, Saebyeol Shin, Bo Han, and Kilian Q Weinberger. 2025. Rethinking LLM unlearning objectives: A gradient perspective and go beyond.
  [24] Yaxuan Wang, Jiaheng Wei, Chris Yuhao Liu, Jinlong Pang, Quan Liu, Ankit Parag Shah, Yujia Bao, Yang Liu, and Wei Wei. 2024. LLM unlearning via loss adjustment with only forget data.
  [25] Yuanshun Yao and Xiaojun Xu. 2024. Large language model unlearning. Advances in Neural Information Processing Systems 37 (2024), 105425–105475.
  [26] Huahui Yi, Kun Wang, Qiankun Li, Miao Yu, Liang Lin, Gongli Xi, Hao Wu, Xuming Hu, Kang Li, and Yang Liu. 2025. SaFeR-VLM: Toward Safety-aware Fine-grained Reasoning in Multimodal Models.
  [27] Saleh Zare Zade, Xiangyu Zhou, Sijia Liu, and Dongxiao Zhu. 2026. Attention Smoothing Is All You Need For Unlearning.
  [28] Ruiqi Zhang, Licong Lin, Yu Bai, and Song Mei. 2024. Negative preference optimization: From catastrophic collapse to effective unlearning.
  [29] Yibo Zhang and Liang Lin. 2025. Enj: Optimizing noise with genetic algorithms to jailbreak LSMs.