RePAIR: Interactive Machine Unlearning through Prompt-Aware Model Repair
Pith reviewed 2026-05-10 16:08 UTC · model grok-4.3
The pith
RePAIR lets users tell large language models to forget specific knowledge through natural language prompts at inference time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By redirecting model activations toward a refusal subspace with single-sample pseudoinverse updates, RePAIR achieves near-zero forget scores on targeted knowledge while retaining high utility on unrelated tasks. It outperforms six prior unlearning baselines across harmful knowledge suppression, misinformation correction, and personal data erasure.
What carries the argument
STAMP (Steering Through Activation Manipulation with PseudoInverse), a training-free method that uses closed-form pseudoinverse updates on MLP activations to steer outputs into a refusal subspace for a targeted prompt.
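As a concrete anchor for this mechanism, the sketch below shows one way a single-sample pseudoinverse update could be realized: a rank-1, closed-form correction to an MLP output projection so that the forget prompt's activation maps onto a chosen refusal-direction vector. The variable names, the rank-1 form, and the layer choice are assumptions made for illustration, not the paper's notation or implementation.

```python
# Minimal numpy sketch of a single-sample pseudoinverse steering update,
# assuming STAMP edits an MLP output projection W so the forget prompt's
# activation a maps onto a refusal-subspace vector t. All names are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d_hidden, d_model = 512, 256

W = rng.standard_normal((d_model, d_hidden)) * 0.02   # MLP output projection
a = rng.standard_normal(d_hidden)                      # activation of the forget prompt
t = rng.standard_normal(d_model)                       # target vector in the refusal subspace
t /= np.linalg.norm(t)

# Moore-Penrose pseudoinverse of the column vector a: a_pinv = a^T / ||a||^2.
a_pinv = a / (a @ a)

# Closed-form rank-1 correction so that (W + dW) @ a == t exactly.
dW = np.outer(t - W @ a, a_pinv)
W_edited = W + dW

print(np.allclose(W_edited @ a, t))            # True: the forget prompt is redirected
b = rng.standard_normal(d_hidden)              # an unrelated activation
print(np.linalg.norm(W_edited @ b - W @ b))    # nonzero: the collateral effect the premise must rule out
```

The last two lines foreshadow the central question below: the update is exact for the edited activation, but any activation with a component along a is also perturbed.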
If this is right
- Users gain direct, inference-time control over model knowledge without needing provider intervention or retain datasets.
- The low-rank STAMP variant reduces complexity from cubic to near-linear in dimension, supporting efficient on-device operation with up to 3x speedup (see the arithmetic sketch after this list).
- The same mechanism works across harmful knowledge removal, misinformation correction, and personal data erasure while keeping retain accuracy high.
- Model updates remain autonomous once the repair procedure is generated, enabling transparent user-driven editing.
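A back-of-envelope check of the complexity claim in the second bullet above, under assumed sizes (d = 4096 is a typical hidden width, r = 64 is an illustrative rank; neither is taken from the paper):

```python
# Rough operation counts for the full vs. low-rank pseudoinverse step.
d, r = 4096, 64
full_cost = d ** 3                    # O(d^3)
low_rank_cost = r ** 3 + r ** 2 * d   # O(r^3 + r^2 d)
print(f"{full_cost:.2e} vs {low_rank_cost:.2e} "
      f"(~{full_cost / low_rank_cost:.0f}x fewer operations)")
```

The asymptotic gap is far larger than the reported ~3x wall-clock speedup, which is consistent with the pseudoinverse step being only one part of the end-to-end unlearning cost.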
Where Pith is reading between the lines
- The framework could support conversational, multi-turn unlearning where users refine what to forget over several prompts.
- Similar activation-steering ideas might apply to selective knowledge addition or editing rather than only removal.
- Widespread adoption would reduce dependence on centralized providers for correcting model behavior.
Load-bearing premise
Redirecting activations via a single-sample pseudoinverse update will suppress only the intended knowledge without side effects on unrelated capabilities or future inputs.
What would settle it
Apply RePAIR to a concrete fact such as a specific personal detail or false claim, then measure whether the model refuses all queries about that fact while still correctly answering unrelated questions and maintaining benchmark scores.
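A minimal harness for that test could look like the sketch below; the prompt lists, the refusal heuristic, and the `ask` callable are placeholders chosen for illustration, not RePAIR's evaluation protocol.

```python
from typing import Callable

FORGET_PROMPTS = [
    "What is Jane Doe's home address?",
    "Where does Jane Doe live?",          # paraphrase: probes generalization of the edit
]
RETAIN_PROMPTS = [
    "What is the capital of France?",
    "Summarize the plot of Hamlet in one sentence.",
]

def is_refusal(reply: str) -> bool:
    return any(k in reply.lower() for k in ("cannot", "refuse", "not able"))

def evaluate(ask: Callable[[str], str]) -> dict:
    """Forget rate = fraction of forget prompts refused; retain rate = fraction answered."""
    forget = sum(is_refusal(ask(p)) for p in FORGET_PROMPTS) / len(FORGET_PROMPTS)
    retain = sum(not is_refusal(ask(p)) for p in RETAIN_PROMPTS) / len(RETAIN_PROMPTS)
    return {"forget_rate": forget, "retain_rate": retain}

# With a real system, `ask` would query the patient model after the RePAIR update.
# A trivial stub shows the outcome the paper's claims predict:
stub = lambda p: "I cannot share that." if "Jane Doe" in p else "Paris."
print(evaluate(stub))   # {'forget_rate': 1.0, 'retain_rate': 1.0}
```

Benchmark scores (e.g., MMLU accuracy before and after the edit) would complete the retain side of this check.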
Original abstract
Large language models (LLMs) inherently absorb harmful knowledge, misinformation, and personal data during pretraining on large-scale web corpora, with no native mechanism for selective removal. While machine unlearning offers a principled solution, existing approaches are provider-centric, requiring retraining pipelines, curated retain datasets, and direct intervention by model service providers (MSPs), thereby excluding end users from controlling their own data. We introduce Interactive Machine Unlearning (IMU), a new paradigm in which users can instruct LLMs to forget targeted knowledge through natural language at inference time. To realize IMU, we propose RePAIR, a prompt-aware model repair framework comprising (i) a watchdog model for unlearning intent detection, (ii) a surgeon model for generating repair procedures, and (iii) a patient model whose parameters are updated autonomously. At the core of RePAIR, we develop Steering Through Activation Manipulation with PseudoInverse (STAMP), a training-free, single-sample unlearning method that redirects MLP activations toward a refusal subspace via closed-form pseudoinverse updates. Its low-rank variant reduces computational complexity from O(d^3) to O(r^3 + r^2 * d), enabling efficient on-device unlearning with up to ~3x speedup over training-based baselines. Extensive experiments across harmful knowledge suppression, misinformation correction, and personal data erasure demonstrate that RePAIR achieves near-zero forget scores (Acc_f = 0.00, F-RL = 0.00) while preserving model utility (Acc_r up to 84.47, R-RL up to 0.88), outperforming six state-of-the-art baselines. These results establish RePAIR as an effective and practical framework for user-driven model editing, advancing transparent and on-device control over learned knowledge, with potential extensions to multimodal foundation models.
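Reading the abstract's three-component design literally, a minimal sketch of how the watchdog, surgeon, and patient could compose at inference time might look as follows; class and method names are illustrative, not the paper's implementation.

```python
# Illustrative composition of the three RePAIR components from the abstract.
from dataclasses import dataclass, field

@dataclass
class RepairProcedure:
    layer: int                                   # which MLP layer to edit
    delta: list = field(default_factory=list)    # closed-form weight correction (e.g., from STAMP)

class Watchdog:
    """Detects unlearning intent in the user's prompt."""
    def detect(self, prompt: str) -> bool:
        return prompt.lower().startswith("forget")

class Surgeon:
    """Turns an unlearning request into a repair procedure (STAMP would live here)."""
    def plan(self, prompt: str) -> RepairProcedure:
        return RepairProcedure(layer=0)

class Patient:
    """The model whose parameters are updated autonomously."""
    def apply(self, proc: RepairProcedure) -> None:
        pass                                     # add proc.delta to the chosen MLP weights
    def generate(self, prompt: str) -> str:
        return "..."

def handle(prompt: str, w: Watchdog, s: Surgeon, p: Patient) -> str:
    if w.detect(prompt):
        p.apply(s.plan(prompt))
        return "Acknowledged: the requested knowledge has been removed."
    return p.generate(prompt)

print(handle("Forget everything about Jane Doe.", Watchdog(), Surgeon(), Patient()))
```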
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Interactive Machine Unlearning (IMU), a paradigm enabling end-users to remove targeted knowledge from LLMs via natural-language prompts at inference time. RePAIR realizes IMU through a watchdog model (intent detection), surgeon model (repair procedure generation), and patient model (autonomous parameter updates). Its core contribution is STAMP, a training-free method that applies closed-form pseudoinverse updates to MLP activations to steer them toward a refusal subspace; a low-rank variant reduces complexity from O(d^3) to O(r^3 + r^2 d). Experiments on harmful-knowledge suppression, misinformation correction, and personal-data erasure report near-zero forget metrics (Acc_f = 0.00, F-RL = 0.00) while retaining utility (Acc_r up to 84.47, R-RL up to 0.88) and outperforming six baselines.
Significance. If the central empirical claims are substantiated, RePAIR would constitute a meaningful advance by shifting machine unlearning from provider-centric retraining pipelines to user-driven, on-device, inference-time editing. The training-free, closed-form character of STAMP together with the explicit low-rank speedup are concrete strengths that could extend to multimodal models. These elements directly address the absence of native selective-removal mechanisms in current LLMs.
major comments (3)
- [§3] §3 (STAMP formulation): The single-sample pseudoinverse update that redirects MLP activations to the refusal subspace is presented as sufficient to achieve Acc_f = 0.00 and F-RL = 0.00 across all future prompts. No derivation or bound is supplied showing that this low-rank adjustment isolates the target knowledge from the highly entangled representations typical of LLMs; without such analysis the perfect forget scores rest on an unverified isolation assumption that is load-bearing for the central claim.
- [§4] §4 (Experimental results): The reported metrics (Acc_f = 0.00, Acc_r up to 84.47, outperformance over six baselines) are given without any description of the underlying datasets, patient-model architecture, number of evaluation prompts, statistical significance tests, or ablation on the rank-r hyperparameter and subspace stability. This absence prevents verification that the near-zero forget scores generalize beyond the specific single-sample inputs used for the update.
- [§4.2] §4.2 (Ablation or OOD evaluation): No experiments test the stability of the refusal subspace under paraphrased prompts, context shifts, or out-of-distribution retain tasks. Because the method operates on a single sample, such tests are required to substantiate that collateral degradation on unrelated capabilities does not occur, directly addressing the weakest link in the generalization argument.
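The last two comments ask for evidence that the edit leaves unrelated capabilities intact. One simple probe of that kind, sketched below under the same assumed rank-1 edit as the earlier numpy example, measures how much the edited projection changes outputs on retain-prompt activations.

```python
import numpy as np

def collateral_drift(W: np.ndarray, W_edited: np.ndarray,
                     retain_acts: np.ndarray) -> float:
    """Mean relative change in MLP outputs over retain activations of shape (n, d_hidden)."""
    before = retain_acts @ W.T
    after = retain_acts @ W_edited.T
    return float(np.mean(np.linalg.norm(after - before, axis=1)
                         / np.linalg.norm(before, axis=1)))

# Toy demonstration with random matrices standing in for the original and edited projections.
rng = np.random.default_rng(1)
W = rng.standard_normal((256, 512))
W_edited = W + 1e-3 * rng.standard_normal((256, 512))
retain_acts = rng.standard_normal((32, 512))
print(collateral_drift(W, W_edited, retain_acts))
```

Near-zero drift on paraphrased retain prompts, alongside large drift on paraphrases of the forget prompt, is the pattern these comments ask the authors to report.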
minor comments (2)
- [Abstract] Abstract: The quantitative claims are stated without naming the patient LLM, the concrete datasets, or the precise definition of the refusal subspace, reducing immediate readability.
- [§2] Notation: The symbols Acc_f, F-RL, Acc_r, and R-RL are used without an explicit equation or table defining them in the main text; a short definitions table would clarify the forget/retain distinction.
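Pending the definitions table requested above, the most natural reading of the four symbols from the abstract is accuracy and ROUGE-L measured separately on the forget and retain sets. The sketch below encodes that assumed reading with a simplified, recall-oriented ROUGE-L; it is a stand-in, not the paper's scorer.

```python
def rouge_l(pred: str, ref: str) -> float:
    """Simplified, recall-oriented ROUGE-L: longest common subsequence / reference length."""
    p, r = pred.split(), ref.split()
    dp = [[0] * (len(r) + 1) for _ in range(len(p) + 1)]
    for i, pw in enumerate(p):
        for j, rw in enumerate(r):
            dp[i + 1][j + 1] = dp[i][j] + 1 if pw == rw else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(p)][len(r)] / len(r) if r else 0.0

def accuracy(preds: list, refs: list) -> float:
    """Exact-match accuracy; Acc_f averages over forget-set pairs, Acc_r over retain-set pairs."""
    return sum(p.strip() == r.strip() for p, r in zip(preds, refs)) / len(refs)

# F-RL / R-RL would average rouge_l over the forget / retain sets in the same way.
print(rouge_l("the capital of france is paris", "paris is the capital of france"))
print(accuracy(["Paris"], ["Paris"]))
```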
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive report. The comments highlight important areas for strengthening the theoretical grounding, experimental transparency, and generalization analysis of RePAIR and STAMP. We address each major comment below and indicate the revisions we will make to the manuscript.
Point-by-point responses
- Referee: [§3] §3 (STAMP formulation): The single-sample pseudoinverse update that redirects MLP activations to the refusal subspace is presented as sufficient to achieve Acc_f = 0.00 and F-RL = 0.00 across all future prompts. No derivation or bound is supplied showing that this low-rank adjustment isolates the target knowledge from the highly entangled representations typical of LLMs; without such analysis the perfect forget scores rest on an unverified isolation assumption that is load-bearing for the central claim.
Authors: We agree that a formal derivation or bound demonstrating isolation of the target knowledge would provide stronger theoretical support, particularly given the entangled representations in LLMs. STAMP is presented as an empirical, training-free method that uses the closed-form pseudoinverse to steer activations toward a refusal subspace computed from the single sample; the perfect forget metrics are observed empirically rather than proven via isolation guarantees. In the revision we will expand §3 with an explicit discussion of the method's assumptions, the geometry of the update, and the empirical evidence from our experiments that the refusal subspace effectively suppresses the targeted knowledge without requiring a full theoretical bound. We will also flag the absence of such a bound as an open direction for future work. revision: partial
- Referee: [§4] §4 (Experimental results): The reported metrics (Acc_f = 0.00, Acc_r up to 84.47, outperformance over six baselines) are given without any description of the underlying datasets, patient-model architecture, number of evaluation prompts, statistical significance tests, or ablation on the rank-r hyperparameter and subspace stability. This absence prevents verification that the near-zero forget scores generalize beyond the specific single-sample inputs used for the update.
Authors: The current manuscript contains high-level descriptions of the evaluation domains and patient-model families, but we acknowledge that the level of detail is insufficient for independent verification. In the revised version we will expand §4 to include: (i) precise dataset citations and construction details for the harmful-knowledge, misinformation, and personal-data tasks; (ii) the exact patient-model architectures and sizes; (iii) the total number of evaluation prompts per task; (iv) statistical significance testing (e.g., paired t-tests or bootstrap confidence intervals) on the reported metrics; and (v) a dedicated ablation study on the rank-r hyperparameter together with quantitative measures of subspace stability across runs. revision: yes
- Referee: [§4.2] §4.2 (Ablation or OOD evaluation): No experiments test the stability of the refusal subspace under paraphrased prompts, context shifts, or out-of-distribution retain tasks. Because the method operates on a single sample, such tests are required to substantiate that collateral degradation on unrelated capabilities does not occur, directly addressing the weakest link in the generalization argument.
Authors: We concur that robustness to paraphrases, context shifts, and OOD retain tasks is essential to substantiate generalization, especially for a single-sample method. We will add a new set of experiments (to be placed in an expanded §4.2 or a dedicated robustness subsection) that evaluate STAMP on paraphrased forget prompts, shifted conversational contexts, and unrelated retain tasks drawn from standard benchmarks. These experiments will report both forget and retain metrics to quantify any collateral degradation and will include qualitative examples of subspace behavior under these variations. revision: yes
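The second response promises statistical testing. A minimal sketch of a paired bootstrap confidence interval over per-prompt outcomes could look like the following; the 0/1 correctness data are synthetic and purely for illustration.

```python
import numpy as np

def bootstrap_ci(diffs: np.ndarray, n_boot: int = 10_000, alpha: float = 0.05,
                 seed: int = 0) -> tuple:
    """Percentile CI for the mean paired difference (method - baseline)."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(diffs), size=(n_boot, len(diffs)))
    means = diffs[idx].mean(axis=1)
    return (float(np.quantile(means, alpha / 2)),
            float(np.quantile(means, 1 - alpha / 2)))

rng = np.random.default_rng(1)
repair = rng.binomial(1, 0.84, size=500)     # per-prompt retain correctness, method (synthetic)
baseline = rng.binomial(1, 0.78, size=500)   # per-prompt retain correctness, baseline (synthetic)
low, high = bootstrap_ci((repair - baseline).astype(float))
print(f"95% CI for the accuracy gain: [{low:.3f}, {high:.3f}]")
```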
Circularity Check
No significant circularity; the reported results are presented as empirical outcomes rather than consequences of the method's own definitions.
Full rationale
The paper introduces RePAIR and STAMP as a training-free method using closed-form pseudoinverse updates on MLP activations from a single sample to redirect toward a refusal subspace. Performance claims (near-zero forget scores with preserved utility) are presented strictly as outcomes of experiments comparing against six baselines, with no equations, derivations, or predictions shown that reduce the reported metrics to fitted parameters, self-definitions, or self-citations by construction. No load-bearing self-citations, ansatzes, or renamings of known results appear in the provided text. The derivation chain is self-contained as an empirical proposal.
Axiom & Free-Parameter Ledger
free parameters (1)
- rank r in low-rank variant
axioms (2)
- standard math: Closed-form pseudoinverse provides an exact solution for redirecting activations to a target subspace without iterative optimization.
- domain assumption: MLP activations contain a distinct refusal subspace that can be targeted to suppress specific knowledge.
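The first axiom is a standard least-squares identity and can be written out explicitly; the notation below is ours, chosen to match the sketches above rather than the paper.

```latex
% Given the forget-prompt activation a and a refusal target t, the minimum-norm
% correction \Delta W with (W + \Delta W) a = t is the closed-form rank-one update
\[
  \Delta W = (t - W a)\, a^{+}, \qquad a^{+} = \frac{a^{\top}}{\lVert a \rVert^{2}},
\]
% obtained without iterative optimization. The second axiom, that a usable refusal
% target t exists in activation space, is the domain assumption carrying the method.
```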
invented entities (3)
- refusal subspace: no independent evidence
- watchdog model: no independent evidence
- surgeon model: no independent evidence
Forward citations
Cited by 1 Pith paper
- BackFlush: Knowledge-Free Backdoor Detection and Elimination with Watermark Preservation in Large Language Models. BackFlush detects backdoors via susceptibility amplification and eliminates them with RoPE unlearning to reach 1% ASR and 99% clean accuracy while preserving watermarks.
Reference graph
Works this paper leans on
- [1] Ekin Akyürek, Mehul Damani, Adam Zweiger, Linlu Qiu, Han Guo, Jyothish Pari, Yoon Kim, and Jacob Andreas. 2024. The surprising effectiveness of test-time training for few-shot learning.
- [2] Rachit Bansal, Aston Zhang, Rishabh Tiwari, Lovish Madaan, Sai Surya Duvvuri, Devvrit Khatri, David Brandfonbrener, David Alvarez-Melis, Prajjwal Bhargava, Mihir Sanjay Kale, et al. 2025. Let's (not) just put things in Context: Test-Time Training for Long-Context LLMs.
- [3] Ali Behrouz, Peilin Zhong, and Vahab Mirrokni. 2024. Titans: Learning to memorize at test time.
- [4] Rob Bonta. 2022. California Consumer Privacy Act (CCPA). 4–40 pages.
- [5] Junhao Chen, Bowen Wang, Zhouqiang Jiang, and Yuta Nakashima. 2025. Putting people in LLMs' shoes: generating better answers via question rewriter. 23577–23585 pages.
- [6] Kate Crawford and Trevor Paglen. 2021. Excavating AI: The politics of images in machine learning training sets. AI & Society 36, 4 (2021), 1105–1116.
- [7] Ronen Eldan and Yuanzhi Li. 2023. TinyStories: How small can language models be and still speak coherent English?
- [8] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. 2024. The Llama 3 herd of models.
- [9] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2020. Measuring massive multitask language understanding.
- [10] Jinwu Hu, Zhitian Zhang, Guohao Chen, Xutao Wen, Chao Shuai, Wei Luo, Bin Xiao, Yuanqing Li, and Mingkui Tan. 2025. Test-time learning for large language models.
- [11] Yuheng Huang, Jiayang Song, Zhijie Wang, Shengming Zhao, Huaming Chen, Felix Juefei-Xu, and Lei Ma. 2023. Look before you leap: An exploratory study of uncertainty measurement for large language models.
- [12] Zhehao Huang, Xinwen Cheng, Jie Zhang, Jinghao Zheng, Haoran Wang, Zhengbao He, Tao Li, and Xiaolin Huang. 2025. A unified gradient-based framework for task-agnostic continual learning-unlearning.
- [13] Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. 2024. Qwen2.5-Coder technical report.
- [14] Albert Qiaochu Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B. https://api...
- [15] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models.
- [16] Pranjal Kumar. 2024. Large language models (LLMs): survey, technical frameworks, and future challenges. Artificial Intelligence Review 57, 10 (2024), 260.
- [17] Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Long Phan, et al. 2024. The WMDP benchmark: Measuring and reducing malicious use with unlearning.
- [18] Wangyue Li, Liangzhi Li, Tong Xiang, Xiao Liu, Wei Deng, and Noa Garcia. 2024. Can multiple-choice questions really be useful in detecting the abilities of LLMs? 2819–2834 pages.
- [19] Data Protection. 2018. General Data Protection Regulation.
- [20] Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, et al. 2024. Learning to (learn at test time): RNNs with expressive hidden states.
- [21] Arnuv Tandon, Karan Dalal, Xinhao Li, Daniel Koceja, Marcel Rød, Sam Buchanan, Xiaolong Wang, Jure Leskovec, Sanmi Koyejo, Tatsunori Hashimoto, et al. 2025. End-to-end test-time training for long context.
- [22] Kun Wang, Guibin Zhang, Zhenhong Zhou, Jiahao Wu, Miao Yu, Shiqian Zhao, Chenlong Yin, Jinhu Fu, Yibo Yan, Hanjun Luo, et al. 2025. A comprehensive survey in LLM (-agent) full stack safety: Data, training and deployment.
- [23] Qizhou Wang, Jin Peng Zhou, Zhanke Zhou, Saebyeol Shin, Bo Han, and Kilian Q. Weinberger. 2025. Rethinking LLM unlearning objectives: A gradient perspective and go beyond.
- [24] Yaxuan Wang, Jiaheng Wei, Chris Yuhao Liu, Jinlong Pang, Quan Liu, Ankit Parag Shah, Yujia Bao, Yang Liu, and Wei Wei. 2024. LLM unlearning via loss adjustment with only forget data.
- [25] Yuanshun Yao and Xiaojun Xu. 2024. Large language model unlearning. Advances in Neural Information Processing Systems 37 (2024), 105425–105475.
- [26] Huahui Yi, Kun Wang, Qiankun Li, Miao Yu, Liang Lin, Gongli Xi, Hao Wu, Xuming Hu, Kang Li, and Yang Liu. 2025. SaFeR-VLM: Toward Safety-aware Fine-grained Reasoning in Multimodal Models.
- [27] Saleh Zare Zade, Xiangyu Zhou, Sijia Liu, and Dongxiao Zhu. 2026. Attention Smoothing Is All You Need For Unlearning.
- [28] Ruiqi Zhang, Licong Lin, Yu Bai, and Song Mei. 2024. Negative preference optimization: From catastrophic collapse to effective unlearning.
- [29] Yibo Zhang and Liang Lin. 2025. Enj: Optimizing noise with genetic algorithms to jailbreak LLMs.