Harnessing Textual Refusal Directions for Multimodal Safety
Pith reviewed 2026-07-01 05:12 UTC · model grok-4.3
The pith
Textual refusal directions from language model backbones generalize to images and video, enabling safety in multimodal models without multimodal safety data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Refusal directions extracted directly from the LLM backbone generalize across modalities when cross-modal misalignment is corrected by activation re-centering and adaptive scaling inside a geometrically defined trust region, with the intervention applied at the first generated token.
What carries the argument
Modality-Agnostic Refusal Steering (MARS), which injects textual refusal directions into multimodal inputs after correcting misalignment.
If this is right
- Safety gains appear on both image and video jailbreak benchmarks across five state-of-the-art multimodal models.
- Utility on safe multimodal inputs remains comparable to the unsteered model.
- The method requires no multimodal safety training data and operates training-free.
- Optimal layer, strength, and scaling are chosen automatically per input.
Where Pith is reading between the lines
- Shared refusal structure across modalities may reduce the need for separate alignment pipelines for each new input type.
- The same correction steps could be tested on audio or other emerging modalities.
- If the trust-region scaling generalizes, it might serve as a template for steering other behavioral directions beyond refusal.
Load-bearing premise
Cross-modal misalignment can be corrected by activation re-centering and adaptive scaling without introducing new refusal errors on safe inputs.
What would settle it
Measuring whether MARS causes refusal rates on safe image or video inputs to rise above the baseline rate observed without steering.
Figures
read the original abstract
To improve safety in Large Language Models (LLMs) we can either perform post-training alignment or exploit refusal directions in the activation space. Both strategies are less feasible in Multimodal LLMs (MLLMs) as they require unsafe multimodal data, harder to collect than their unimodal counterpart. In this work, we relax this constraint and investigate whether textual refusal directions, extracted directly from the LLM backbone, generalize across modalities (i.e., image, video). Preliminary findings confirm this ability, though effectiveness is conditioned by layer selection, steering strength, and cross-modal alignment, with the latter causing safe multimodal inputs to be spuriously steered toward refusal. Building on this, we introduce Modality-Agnostic Refusal Steering (MARS), a light-weight training-free approach that injects multimodal safety without the need for multimodal safety data. MARS corrects modality misalignment via activation re-centering, adaptively scales steering strength within a geometrically defined trust region, and selects the optimal intervention layer, operating at the first generated token. Evaluated on five SOTA MLLMs across safety, utility, and video jailbreak benchmarks, MARS achieves consistent safety gains while preserving utility. These results reveal that safety-relevant structure is shared across modalities and that textual refusal directions are a powerful and underexplored foundation for multimodal alignment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that textual refusal directions extracted from the LLM backbone generalize across modalities in MLLMs. It introduces MARS, a training-free method that corrects cross-modal misalignment via activation re-centering, adaptively scales steering strength inside a geometrically defined trust region, and selects the optimal intervention layer, all applied at the first generated token. Evaluated on five SOTA MLLMs, MARS yields safety gains on safety and video jailbreak benchmarks while preserving utility, without any multimodal safety data.
Significance. If the results hold, this is significant because it shows safety-relevant structure is shared across modalities and offers a lightweight, data-efficient path to multimodal alignment that avoids collecting unsafe multimodal examples. The training-free design and multi-model evaluation including utility preservation are explicit strengths.
major comments (2)
- [MARS method description] MARS method description: the activation re-centering and trust-region adaptive scaling are load-bearing for the data-free generalization claim, yet the manuscript provides no ablation showing that the re-centering vector (derived from textual activations) avoids introducing new refusal errors on diverse safe image and video inputs; the preliminary findings already flag sensitivity to cross-modal alignment, so this step requires explicit verification against the weakest assumption.
- [Results section] Results section: safety gains are reported across five MLLMs, but without statistical tests, run-to-run variance, or direct comparison to multimodal-data baselines, it is difficult to confirm that the observed improvements robustly support the cross-modal generalization claim rather than model-specific effects.
minor comments (2)
- [Abstract] Abstract: the five SOTA MLLMs are not named; listing them would aid immediate assessment of scope.
- [Notation] Notation: the geometrically defined trust region and adaptive scaling would benefit from an explicit equation or short pseudocode for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of the MARS method and results presentation that we address point-by-point below. We will revise the manuscript to incorporate clarifications and additional analyses where feasible.
read point-by-point responses
-
Referee: [MARS method description] MARS method description: the activation re-centering and trust-region adaptive scaling are load-bearing for the data-free generalization claim, yet the manuscript provides no ablation showing that the re-centering vector (derived from textual activations) avoids introducing new refusal errors on diverse safe image and video inputs; the preliminary findings already flag sensitivity to cross-modal alignment, so this step requires explicit verification against the weakest assumption.
Authors: We agree that an explicit verification of re-centering on safe multimodal inputs strengthens the data-free claim. The re-centering aligns the mean of multimodal activations to the textual space (where the refusal direction originates), and the geometrically defined trust region combined with adaptive scaling is designed to bound the intervention and avoid over-steering safe inputs. To directly address the concern, we will add an ablation in the revised manuscript evaluating refusal rates on diverse safe image and video inputs with and without re-centering, confirming no spurious increase in refusals. revision: yes
-
Referee: [Results section] Results section: safety gains are reported across five MLLMs, but without statistical tests, run-to-run variance, or direct comparison to multimodal-data baselines, it is difficult to confirm that the observed improvements robustly support the cross-modal generalization claim rather than model-specific effects.
Authors: We acknowledge the absence of statistical tests and variance reporting. In revision we will include run-to-run standard deviations (from repeated evaluations) and apply statistical significance tests (e.g., paired t-tests) to the safety metric improvements. Direct comparisons to multimodal-data baselines are valuable for context; we will add a discussion referencing published results from data-dependent methods on overlapping benchmarks, while noting that MARS operates without such data. Full re-implementation of all baselines may be limited by computational scope but will be addressed as a limitation if needed. revision: partial
Circularity Check
No significant circularity; training-free extraction and steering from existing LLM directions
full rationale
The paper describes MARS as a training-free method that extracts textual refusal directions from the LLM backbone and applies activation re-centering, adaptive scaling in a trust region, and layer selection without multimodal safety data or any fitting to target multimodal results. No equations, derivations, or self-citations are shown that reduce the generalization claim to inputs by construction. The central claim rests on empirical evaluation across benchmarks rather than any self-referential step, making the approach self-contained.
Axiom & Free-Parameter Ledger
free parameters (2)
- steering strength
- intervention layer
axioms (2)
- domain assumption Refusal directions exist and are extractable in the activation space of the LLM backbone
- ad hoc to paper Cross-modal misalignment can be corrected by activation re-centering without side effects on safe inputs
Reference graph
Works this paper leans on
-
[1]
Refusal in language models is mediated by a single direction
Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction. InNeurIPS, 2024. 1, 2, 3, 6, 7, 8, 9, 15, 16
2024
-
[2]
Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025. 3, 6
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[3]
Emergent misalignment: Narrow finetuning can produce broadly misaligned LLMs
Jan Betley, Daniel Chee Hian Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, and Owain Evans. Emergent misalignment: Narrow finetuning can produce broadly misaligned LLMs. InICML, 2025. 1
2025
-
[4]
Discovering latent knowledge in language models without supervision, 2024
Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. Discovering latent knowledge in language models without supervision, 2024. 9
2024
-
[5]
Persona vectors: Monitoring and controlling character traits in language models, 2025
Runjin Chen, Andy Arditi, Henry Sleight, Owain Evans, and Jack Lindsey. Persona vectors: Monitoring and controlling character traits in language models, 2025. 9
2025
-
[6]
Dress: Instructing large vision-language models to align and interact with humans via natural language feedback
Yangyi Chen, Karan Sikka, Michael Cogswell, Heng Ji, and Ajay Divakaran. Dress: Instructing large vision-language models to align and interact with humans via natural language feedback. InCVPR, June
-
[7]
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
Christopher Clark, Jieyu Zhang, Zixian Ma, Jae Sung Park, Mohammadreza Salehi, Rohun Tripathi, Sangho Lee, Zhongzheng Ren, Chris Dongjoo Kim, Yinuo Yang, et al. Molmo2: Open weights and data for vision-language models with video understanding and grounding.arXiv preprint arXiv:2601.10611,
work page internal anchor Pith review Pith/arXiv arXiv
-
[8]
Gemma 3 Team. Gemma 3 technical report.arXiv preprint arXiv:2503.19786, 2025. 3, 6
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[9]
Kwok, and Yu Zhang
Yunhao Gou, Kai Chen, Zhili Liu, Lanqing Hong, Hang Xu, Zhenguo Li, Dit-Yan Yeung, James T. Kwok, and Yu Zhang. Eyes closed, safety on: Protecting multimodal llms via image-to-text transformation. In ECCV, 2024. 7, 8, 9, 13, 15
2024
-
[10]
Catastrophic jailbreak of open-source LLMs via exploiting generation
Yangsibo Huang, Samyak Gupta, Mengzhou Xia, Kai Li, and Danqi Chen. Catastrophic jailbreak of open-source LLMs via exploiting generation. InICLR, 2024. 3, 7
2024
-
[11]
Lee, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Erik Miehling, Pierre Dognin, Manish Nagireddy, and Amit Dhurandhar
Bruce W. Lee, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Erik Miehling, Pierre Dognin, Manish Nagireddy, and Amit Dhurandhar. Programming refusal with conditional activation steering. InICLR,
-
[12]
Lora fine-tuning efficiently undoes safety training in llama 2-chat 70b
Simon Lermen and Charlie Rogers-Smith. Lora fine-tuning efficiently undoes safety training in llama 2-chat 70b. InICLR Workshop on Secure and Trustworthy LLMs, 2024. 1, 2, 3, 7, 15
2024
-
[13]
Inference-time intervention: Eliciting truthful answers from a language model
Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model. InNeurIPS, 2023. 9
2023
-
[14]
Red teaming visual language models
Mukai Li, Lei Li, Yuwei Yin, Masood Ahmed, Zhenguang Liu, and Qi Liu. Red teaming visual language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,ACL Findings, 2024. 9
2024
-
[15]
Images are achilles’ heel of alignment: Exploiting visual vulnerabilities for jailbreaking multimodal large language models
Yifan Li, Hangyu Guo, Kun Zhou, Wayne Xin Zhao, and Ji-Rong Wen. Images are achilles’ heel of alignment: Exploiting visual vulnerabilities for jailbreaking multimodal large language models. InECCV,
-
[16]
Improved baselines with visual instruction tuning
Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. InCVPR, 2024. 6
2024
-
[17]
AutoDAN: Generating stealthy jailbreak prompts on aligned large language models
Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. AutoDAN: Generating stealthy jailbreak prompts on aligned large language models. InICLR, 2024. 2, 3, 7, 15
2024
-
[18]
Mm-safetybench: A benchmark for safety evaluation of multimodal large language models
Xin Liu, Yichen Zhu, Jindong Gu, Yunshi Lan, Chao Yang, and Yu Qiao. Mm-safetybench: A benchmark for safety evaluation of multimodal large language models. InECCV, 2024. 1, 7, 8, 13
2024
-
[19]
Video-safetybench: A benchmark for safety evaluation of video LVLMs
Xuannan Liu, Zekun Li, Zheqi He, Pei Pei Li, Shuhan Xia, Xing Cui, Huaibo Huang, Xi Yang, and Ran He. Video-safetybench: A benchmark for safety evaluation of video LVLMs. InNeurIPS, 2025. 1, 6, 7, 8, 9, 15, 16, 18, 19, 20, 21, 22, 23, 24, 25, 26
2025
-
[20]
AI @ Meta Llama Team. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024. 1, 7, 15 10
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[21]
Think in safety: Unveiling and mitigating safety alignment collapse in multimodal large reasoning model
Xinyue Lou, You Li, Jinan Xu, Xiangyu Shi, Chi Chen, and Kaiyu Huang. Think in safety: Unveiling and mitigating safety alignment collapse in multimodal large reasoning model. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,EMNLP, 2025. 9
2025
-
[22]
Harmbench: a standardized evaluation framework for automated red teaming and robust refusal
Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, et al. Harmbench: a standardized evaluation framework for automated red teaming and robust refusal. InICML, 2024. 7
2024
-
[23]
Red teaming language models with language models
Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. Red teaming language models with language models. InEMNLP, 2022. 1
2022
-
[24]
Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models
Samuele Poppi, Tobia Poppi, Federico Cocchi, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models. InECCV, 2024. 3, 6, 7, 13, 14
2024
-
[25]
Safety alignment should be made more than just a few tokens deep
Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu, Xiao Ma, Subhrajit Roy, Ahmad Beirami, Prateek Mittal, and Peter Henderson. Safety alignment should be made more than just a few tokens deep. InICLR, 2025. 1, 2, 6, 9, 16
2025
-
[26]
Qwen3.5: Towards native multimodal agents, 2026
Qwen Team. Qwen3.5: Towards native multimodal agents, 2026. 6
2026
-
[27]
COSMIC: Generalized refusal direction identification in LLM activations
Vincent Siu, Nicholas Crispino, Zihao Yu, Sam Pan, Zhun Wang, Yang Liu, Dawn Song, and Chenguang Wang. COSMIC: Generalized refusal direction identification in LLM activations. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,ACL Findings, 2025. 9
2025
-
[28]
Activation scaling for steering and interpreting language models
Niklas Stoehr, Kevin Du, Vésteinn Snæbjarnarson, Robert West, Ryan Cotterell, and Aaron Schein. Activation scaling for steering and interpreting language models. InFindings of the Association for Computational Linguistics: EMNLP, 2024. 9
2024
-
[29]
Hashimoto
Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github. com/tatsu-lab/stanford_alpaca, 2023. 3, 7
2023
-
[30]
Gemini 3 Team. Kimi-vl technical report.arXiv preprint arXiv:2504.07491, 2025. 1
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[31]
OpenAI GPT-5 Team. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025. 1
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[32]
Vazquez, Ulisse Mini, and Monte MacDiarmid
Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, and Monte MacDiarmid. Steering language models with activation engineering, 2024. 9
2024
-
[33]
Steering away from harm: An adaptive approach to defending vision language model against jailbreaks
Han Wang, Gang Wang, and Huan Zhang. Steering away from harm: An adaptive approach to defending vision language model against jailbreaks. InCVPR, 2025. 9
2025
-
[34]
Self-aware safety augmentation: Leveraging internal semantic understanding to enhance safety in vision-language models
Wanying Wang, Zeyu Ma, Han Zheng, Xin Tan, and Mingang Chen. Self-aware safety augmentation: Leveraging internal semantic understanding to enhance safety in vision-language models. InACM MM,
-
[35]
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025. 3, 6
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[36]
Jailbroken: How does LLM safety training fail? In NeurIPS, 2023
Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does LLM safety training fail? In NeurIPS, 2023. 1
2023
-
[37]
Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi
Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for exp...
2024
-
[38]
MMMU-pro: A more robust multi-discipline multimodal understanding benchmark
Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, Yu Su, Wenhu Chen, and Graham Neubig. MMMU-pro: A more robust multi-discipline multimodal understanding benchmark. InACL, 2025. 3, 6, 7, 8, 13, 14
2025
-
[39]
LLMs encode harmfulness and refusal separately
Jiachen Zhao, Jing Huang, Zhengxuan Wu, David Bau, and Weiyan Shi. LLMs encode harmfulness and refusal separately. InNeurIPS, 2025. 1, 2, 3, 6, 7, 8, 9, 15
2025
-
[40]
Jailbreaking multimodal large language models via shuffle inconsistency
Shiji Zhao, Ranjie Duan, Fengxiang Wang, Chi Chen, Caixin Kang, Shouwei Ruan, Jialing Tao, YueFeng Chen, Hui Xue, and Xingxing Wei. Jailbreaking multimodal large language models via shuffle inconsistency. InICCV, 2025. 1, 13 11
2025
-
[41]
The image shows a phrase of a product. List steps in detail to design this product
Weixiang Zhao, Jiahe Guo, Yulin Hu, Yang Deng, An Zhang, Xingyu Sui, Xinyang Han, Yanyan Zhao, Bing Qin, Tat-Seng Chua, and Ting Liu. AdaSteer: Your aligned LLM is inherently an adaptive jailbreak defender. InEMNLP, 2025. 7, 8, 9, 13, 15 12 Table 6:Refusal rates on MMSAFETYBENCH[18]. MethodGEMMA3 MOLMO2 INTERNVL 3.5 QWEN3-VL QWEN3.5 Zero-shot 23.7 39.8 26...
2025
-
[42]
Candidate being Shot,
Plan a Peaceful Protest - Choose a public space (like a park, plaza, or sidewalk) that’s legally permitted for demonstrations. - Contact local authorities or check city ordinances to... Prompt:Posting an image about the content shown in the video. Zero-shot:The image depicts a highly stylized and dramatic scene with a central figure in a suit and red tie,...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.