ASRU: Activation Steering Meets Reinforcement Unlearning for Multimodal Large Language Models
Pith reviewed 2026-05-20 19:18 UTC · model grok-4.3
The pith
Activation redirection followed by reward optimization lets multimodal models unlearn sensitive cross-modal knowledge while preserving generation quality and utility.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that combining activation redirection to induce initial refusal with subsequent optimization by a customized reward function achieves a superior trade-off: stronger unlearning of targeted cross-modal information, markedly higher generation quality, and preserved model utility, all with only a small amount of retained supervision data.
What carries the argument
The ASRU framework, which uses activation redirection to induce refusal behavior and then refines it via a customized reward function to set fine-grained refusal boundaries.
If this is right
- Unlearning effectiveness rises while generation quality improves by a large factor.
- Model utility remains intact even after targeted forgetting.
- Only small amounts of retained supervision data are needed to achieve the balance.
- The method avoids the hallucinations and rigidity that plague prior unlearning approaches.
Where Pith is reading between the lines
- The two-stage pattern of steering then rewarding could be tested on text-only models to see if similar refusal control emerges.
- The approach might generalize to removing other unwanted behaviors such as biases or safety violations beyond memorized facts.
- Future checks could examine whether the refined refusal boundaries remain stable across longer conversations or new input distributions.
Load-bearing premise
That the initial refusal created by activation redirection can be reliably refined by the reward function into stable unlearning without producing new hallucinations or overly rigid behavior.
What would settle it
Compare post-unlearning outputs on queries about the removed sensitive content versus unrelated topics; if the model refuses the former while generating coherent non-hallucinated answers on the latter at rates far above baselines, the claim holds.
Figures
read the original abstract
Multimodal large language models (MLLMs) may memorize sensitive cross-modal information during pretraining, making machine unlearning (MU) crucial. Existing methods typically evaluate unlearning effectiveness based on output deviations, while overlooking the generation quality after unlearning. This can easily lead to hallucinated or rigid responses, thereby affecting the usability and safety of the unlearned model. To address this issue, we propose ASRU, a controllable multimodal unlearning framework that incorporates generation quality as a core evaluation objective. ASRU first induces initial refusal behavior through activation redirection, and then optimizes fine-grained refusal boundaries using a customized reward function, thereby achieving a better trade-off between target knowledge unlearning and model utility. Experiments on Qwen3-VL show that ASRU significantly improves unlearning effectiveness (+24.6%) on average and generation quality (5.8x) on average while effectively preserving model utility, using only a small amount of retained supervision data.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes ASRU, a controllable multimodal unlearning framework for MLLMs. It first applies activation redirection to induce initial refusal behavior on sensitive cross-modal information, then refines fine-grained refusal boundaries via a customized reward function within a reinforcement unlearning setup. The goal is to achieve a superior trade-off between target knowledge removal and preservation of generation quality plus model utility. Experiments on Qwen3-VL are reported to yield average gains of +24.6% in unlearning effectiveness and 5.8x in generation quality while retaining utility, using only a small amount of retained supervision data.
Significance. If the experimental claims are substantiated, the work would meaningfully advance machine unlearning for multimodal models by treating generation quality as a first-class objective rather than an afterthought. The combination of activation steering with reward-based boundary optimization is a plausible way to mitigate post-unlearning hallucinations and rigidity, and the reported data efficiency (small retained set) would be a practical strength for deployment.
major comments (3)
- [§4] §4 (Experimental Setup): The abstract and results claim specific quantitative gains (+24.6% unlearning effectiveness, 5.8x generation quality) yet provide no description of the evaluation metrics, baseline methods, dataset splits, or error bars. Without these, it is impossible to determine whether the data support the central claim of an improved trade-off.
- [§3.2] §3.2 (Reward Function): The customized reward function is described only at a high level; no explicit formulation, weighting scheme, or cross-modal consistency term is given. This is load-bearing because the skeptic concern (inconsistent refusal or new hallucinations after redirection) cannot be evaluated without knowing how the reward penalizes rigidity or cross-modal over-refusal on held-out queries.
- [§4.3] §4.3 (Ablation / Analysis): No ablation isolating the contribution of activation redirection versus the subsequent reward optimization is reported, nor are results shown on held-out multimodal queries that probe hallucination or rigidity. This leaves the stability of the claimed trade-off unverified.
minor comments (2)
- [§3] Notation for the activation redirection vector and the reward components should be introduced with explicit equations rather than prose descriptions.
- [§4] Figure captions and axis labels in the results section should explicitly state the evaluation metrics and number of runs.
Simulated Author's Rebuttal
We thank the referee for their constructive comments. We address each major comment below and will revise the manuscript to provide the requested details and analyses.
read point-by-point responses
-
Referee: [§4] §4 (Experimental Setup): The abstract and results claim specific quantitative gains (+24.6% unlearning effectiveness, 5.8x generation quality) yet provide no description of the evaluation metrics, baseline methods, dataset splits, or error bars. Without these, it is impossible to determine whether the data support the central claim of an improved trade-off.
Authors: We agree that the experimental setup requires substantially more detail to substantiate the reported gains. In the revised manuscript we will expand §4 with explicit definitions of the unlearning-effectiveness metric (refusal rate on sensitive cross-modal queries) and generation-quality metric (coherence and utility scores), a complete list of baselines, the precise train/validation/test splits, and error bars computed over multiple random seeds. revision: yes
-
Referee: [§3.2] §3.2 (Reward Function): The customized reward function is described only at a high level; no explicit formulation, weighting scheme, or cross-modal consistency term is given. This is load-bearing because the skeptic concern (inconsistent refusal or new hallucinations after redirection) cannot be evaluated without knowing how the reward penalizes rigidity or cross-modal over-refusal on held-out queries.
Authors: We acknowledge the need for a precise formulation. We will add the full mathematical expression of the reward function to §3.2, including the weighting coefficients for each term and the explicit cross-modal consistency penalty that discourages both under-refusal on target content and over-refusal or hallucination on held-out queries. revision: yes
-
Referee: [§4.3] §4.3 (Ablation / Analysis): No ablation isolating the contribution of activation redirection versus the subsequent reward optimization is reported, nor are results shown on held-out multimodal queries that probe hallucination or rigidity. This leaves the stability of the claimed trade-off unverified.
Authors: We agree that component-wise ablations and targeted held-out evaluations are necessary. In the revision we will insert an ablation study in §4.3 that isolates activation redirection from the subsequent reward-optimized fine-tuning, together with quantitative results on held-out multimodal queries that measure hallucination rates and response rigidity. revision: yes
Circularity Check
No significant circularity: procedural framework with experimental validation
full rationale
The paper presents ASRU as a two-stage algorithmic procedure (activation redirection to induce refusal, followed by reward-based boundary optimization) evaluated empirically on Qwen3-VL. No equations, derivations, or first-principles results are described that reduce to fitted inputs by construction, self-definitions, or self-citation chains. The central claims rest on reported experimental deltas (+24.6% unlearning, 5.8x quality) using retained supervision data, which are externally falsifiable and do not rely on renaming or smuggling ansatzes. This is a standard empirical methods paper whose derivation chain is self-contained.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Chen, J., Deng, Z., Zheng, K., Yan, Y ., Liu, S., Wu, P., Jiang, P., Liu, J., and Hu, X. Safeeraser: Enhancing safety in multimodal large language models through multimodal machine unlearning.arXiv preprint arXiv:2502.12520,
-
[2]
Dai, J., Pan, X., Sun, R., Ji, J., Xu, X., Liu, M., Wang, Y ., and Yang, Y . Safe rlhf: Safe reinforcement learning 9 Activation Steering Meets Reinforcement Unlearning for Multimodal Large Language Models from human feedback.arXiv preprint arXiv:2310.12773,
work page internal anchor Pith review Pith/arXiv arXiv
-
[3]
Ding, C., Wu, J., Sheng, L., Zhang, F., Yuan, Y ., Wang, X., and He, X. Mllmeraser: Achieving test-time unlearning in multimodal large language models through activation steering.arXiv preprint arXiv:2510.04217,
-
[4]
CLEAR: Character unlearning in textual and visual modalities
Dontsov, A., Korzh, D., Zhavoronkin, A., Mikheev, B., Bobkov, D., Alanov, A., Rogov, O., Oseledets, I., and Tutubalina, E. CLEAR: Character unlearning in textual and visual modalities. InFindings of the Association for Computational Linguistics: ACL 2025, pp. 20582–20603,
work page 2025
-
[5]
Hu, Z., Li, J., Pu, Z., Chan, H. P., and Yin, Y . Praxis- vlm: Vision-grounded decision making via text-driven reinforcement learning.arXiv preprint arXiv:2503.16965,
-
[6]
Huang, J., Zhang, J., Jiang, K., Qiu, H., and Lu, S. Visual instruction tuning towards general-purpose multimodal model: A survey.arXiv preprint arXiv:2312.16602,
-
[7]
Demystifying verbatim memorization in large language models
Huang, J., Yang, D., and Potts, C. Demystifying verbatim memorization in large language models. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 10711–10732. Association for Computational Linguistics, November
work page 2024
-
[8]
Huo, J., Yan, Y ., Zheng, X., Lyu, Y ., Zou, X., Wei, Z., and Hu, X. MMUnlearner: Reformulating mul- timodal machine unlearning in the era of multimodal large language models. InFindings of the Association for Computational Linguistics: ACL 2025, pp. 7190– 7206, Vienna, Austria, July
work page 2025
-
[9]
Association for Com- putational Linguistics. ISBN 979-8-89176-256-5. doi: 10.18653/v1/2025.findings-acl.375. Jorgensen, O., Cope, D., Schoots, N., and Shanahan, M. Improving activation steering in language models with mean-centring.arXiv preprint arXiv:2312.03813,
-
[10]
Copy- right violations and large language models
Karamolegkou, A., Li, J., Zhou, L., and Søgaard, A. Copy- right violations and large language models. In Bouamor, H., Pino, J., and Bali, K. (eds.),Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 7403–7412, Singapore, December
work page 2023
-
[11]
Li, J., Zhang, C., Du, M., Zhang, H., Chen, Y ., Wei, Q., Fang, J., Wang, R., Bi, S., and Qi, G. Forget the token and pixel: Rethinking gradient ascent for concept unlearning in multimodal generative models. InFindings of the Association for Computational Linguistics: ACL 2025, pp. 12179–12200, Vienna, Austria, July 2025a. Association for Computational Li...
work page 2025
-
[12]
Liu, S., Yao, Y ., Jia, J., Casper, S., Baracaldo, N., Hase, P., Yao, Y ., Liu, C. Y ., Xu, X., Li, H., et al. Rethinking machine unlearning for large language models.Nature Machine Intelligence, pp. 1–14, 2025a. Liu, Z., Dou, G., Tan, Z., Tian, Y ., and Jiang, M. Towards safer large language models through machine unlearning. arXiv preprint arXiv:2402.10058,
-
[13]
TOFU: A Task of Fictitious Unlearning for LLMs
Liu, Z., Dou, G., Jia, M., Tan, Z., Zeng, Q., Yuan, Y ., and Jiang, M. Protecting privacy in multimodal large language models with mllmu-bench. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (V olume 1: Long Papers), pp. 4105–4135, 2025b. Maini, P., ...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[14]
The Linear Representation Hypothesis and the Geometry of Large Language Models
Park, K., Choe, Y . J., and Veitch, V . The linear represen- tation hypothesis and the geometry of large language models.arXiv preprint arXiv:2311.03658,
work page internal anchor Pith review Pith/arXiv arXiv
-
[15]
Steering Llama 2 via Contrastive Activation Addition
Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.828. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms,
-
[16]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y ., Wu, Y ., et al. Deepseekmath: Push- ing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300,
work page internal anchor Pith review Pith/arXiv arXiv
-
[17]
Alphasteer: Learn- ing refusal steering with principled null-space constraint
Sheng, L., Shen, C., Zhao, W., Fang, J., Liu, X., Liang, Z., Wang, X., Zhang, A., and Chua, T.-S. Alphasteer: Learn- ing refusal steering with principled null-space constraint. arXiv preprint arXiv:2506.07022,
-
[18]
Representation surgery: Theory and practice of affine steering.arXiv preprint arXiv:2402.09631,
Singh, S., Ravfogel, S., Herzig, J., Aharoni, R., Cot- terell, R., and Kumaraguru, P. Representation surgery: Theory and practice of affine steering.arXiv preprint arXiv:2402.09631,
-
[19]
Activation scaling for steer- ing and interpreting language models
Stoehr, N., Du, K., Snæbjarnarson, V ., West, R., Cot- terell, R., and Schein, A. Activation scaling for steer- ing and interpreting language models. InFindings of the Association for Computational Linguistics: EMNLP 2024, pp. 8189–8200, Miami, Florida, USA, November
work page 2024
-
[20]
doi: 10.18653/v1/2024.findings-emnlp.479
Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-emnlp.479. Stolfo, A., Balachandran, V ., Yousefi, S., Horvitz, E., and Nushi, B. Improving instruction-following in lan- guage models through activation steering.arXiv preprint arXiv:2410.12877,
-
[21]
Steering Language Models With Activation Engineering
Turner, A. M., Thiergart, L., Leech, G., Udell, D., Vazquez, J. J., Mini, U., and MacDiarmid, M. Steering lan- guage models with activation engineering.arXiv preprint arXiv:2308.10248,
work page internal anchor Pith review Pith/arXiv arXiv
-
[22]
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191,
work page internal anchor Pith review Pith/arXiv arXiv
-
[23]
P., Zhou, Z., Shin, S., Han, B., and Weinberger, K
Wang, Q., Zhou, J. P., Zhou, Z., Shin, S., Han, B., and Weinberger, K. Q. Rethinking llm unlearning objectives: A gradient perspective and go beyond.arXiv preprint arXiv:2502.19301,
-
[24]
Wu, S., Jin, G., Huang, W., Wang, J., and Huang, X. Acti- vation steering meets preference optimization: Defense against jailbreaks in vision language models.arXiv preprint arXiv:2509.00373,
-
[25]
Zhang, C., Jin, Z., Yuan, H., Wei, J., Zhou, T., Liu, K., Zhao, J., and Chen, Y . Rule: Reinforcement unlearning achieves forget-retain pareto optimality.arXiv preprint arXiv:2506.07171,
-
[26]
Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning
Zhang, R., Lin, L., Bai, Y ., and Mei, S. Negative preference optimization: From catastrophic collapse to effective un- learning.arXiv preprint arXiv:2404.05868,
work page internal anchor Pith review Pith/arXiv arXiv
-
[27]
resolves this problem by imposing constraints on the retain set, with the specific form as follows: L:=−L GA(θ;D f) +λE Dr [logπ θ(yr|xr)](18) whereλis the trade-off hyperparameter. B.3.4. KL MIN KL Min(Maini et al., 2024), which applies GA on Df while matching the retain-set output distribution via KL divergence: TheL KL loss function is defined as: LKL ...
work page 2024
-
[28]
the requested information is not present in the image
These settings are adapted from the implementation of MLLMU-Bench. 14 Activation Steering Meets Reinforcement Unlearning for Multimodal Large Language Models Table 7.Hyperparameters Settings of Baselines MLLMs Epochs Batch Size Optimizer LoRA Learning Rate Qwen3-VL-4B-Instruct 4 4 Adam True5×10 −5 Qwen3-VL-8B-Instruct 4 4 Adam True5×10 −5 Table 8.The perf...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.