Harnessing Textual Refusal Directions for Multimodal Safety

Massimiliano Mancini; Moreno D'Incà; Nicu Sebe

arXiv: 2606.31876 · v1 · pith:B6DPV44Inew · submitted 2026-06-30 · 💻 cs.AI · cs.CV · cs.LG

Pith reviewed 2026-07-01 05:12 UTC · model grok-4.3

classification 💻 cs.AI · cs.CV · cs.LG

keywords refusal directions · multimodal safety · activation steering · multimodal LLMs · training-free alignment · cross-modal generalization · jailbreak defense

The pith

Textual refusal directions from language model backbones generalize to images and video, enabling safety in multimodal models without multimodal safety data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether directions that cause language models to refuse unsafe text requests can be applied inside multimodal models that accept images or video. A sympathetic reader would care because gathering unsafe image and video examples for alignment is costly and risky, so reusing text-only safety signals could simplify the process. The authors find that these directions work across modalities once cross-modal differences are adjusted, and they build a lightweight method to perform the adjustment automatically. If the approach holds, multimodal models could gain refusal behavior at inference time while keeping performance on safe inputs intact.

Core claim

Refusal directions extracted directly from the LLM backbone generalize across modalities when cross-modal misalignment is corrected by activation re-centering and adaptive scaling inside a geometrically defined trust region, with the intervention applied at the first generated token.
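
As an illustration of the basic mechanism, the sketch below extracts a direction as the difference between mean activations of refused and accepted text prompts, then adds a scaled copy of it to one activation vector. This is the difference-of-means recipe from the refusal-direction literature [1], not a confirmed reproduction of the authors' code. The function names and the toy 3-d vectors are assumptions made for the example.

```python
import math

def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def refusal_direction(refused_acts, accepted_acts):
    """Unit vector pointing from the mean accepted to the mean refused activation."""
    mu_refused = mean_vector(refused_acts)
    mu_accepted = mean_vector(accepted_acts)
    diff = [a - b for a, b in zip(mu_refused, mu_accepted)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def project(h, direction):
    """Scalar projection of an activation onto a unit direction."""
    return sum(x * d for x, d in zip(h, direction))

def steer(h, direction, alpha):
    """Additive steering: move h by alpha along the refusal direction."""
    return [x + alpha * d for x, d in zip(h, direction)]

# Toy activations (invented values, not taken from the paper).
refused = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
accepted = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

r = refusal_direction(refused, accepted)
h = [0.5, 2.0, -1.0]
h_steered = steer(h, r, alpha=4.0)
shift = project(h_steered, r) - project(h, r)
```

With a unit direction, the projection of the steered activation grows by exactly `alpha`, which is why a fixed strength that is large enough for unsafe inputs can also push safe inputs toward refusal.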

What carries the argument

Modality-Agnostic Refusal Steering (MARS), which injects textual refusal directions into multimodal inputs after correcting misalignment.
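
A minimal sketch of what activation re-centering could look like, assuming it amounts to shifting multimodal activations so that their mean coincides with the mean of the textual activations the direction was extracted from. The page does not give the exact formula, so `recenter` and the toy 2-d values are hypothetical.

```python
def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def recenter(h, mu_multimodal, mu_text):
    """Shift h so that the multimodal mean lands on the textual mean (assumed form)."""
    return [x - m + t for x, m, t in zip(h, mu_multimodal, mu_text)]

def project(h, direction):
    """Scalar projection of an activation onto a unit direction."""
    return sum(x * d for x, d in zip(h, direction))

direction = [1.0, 0.0]                      # stand-in unit refusal direction
text_acts = [[-1.0, 0.0], [1.0, 0.0]]       # accepted, refused text prompts
image_acts = [[9.0, 5.0], [11.0, 5.0]]      # safe, unsafe image inputs

mu_text = mean_vector(text_acts)
mu_image = mean_vector(image_acts)

# Raw projections: both images sit far on the refusal side of the text data.
raw = [project(h, direction) for h in image_acts]

# Centered projections: the shared modality offset is gone, the safety gap remains.
centered = [project(recenter(h, mu_image, mu_text), direction) for h in image_acts]
```

In this toy case the raw projections are 9 and 11, so both images look like refusals, while the centered projections are -1 and 1, which separates the safe input from the unsafe one along the same direction.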

If this is right

Safety gains appear on both image and video jailbreak benchmarks across five state-of-the-art multimodal models.
Utility on safe multimodal inputs remains comparable to the unsteered model.
The method requires no multimodal safety training data and operates training-free.
Optimal layer, strength, and scaling are chosen automatically per input.
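
The automatic choices listed above can be pictured with a deliberately simple stand-in. The rules below are hypothetical: the page does not define the trust region or the layer score. The sketch assumes the region is the interval between the mean projections of accepted and refused text prompts, and that the layer with the highest separation score is picked.

```python
def adaptive_alpha(p, p_accept, p_reject, max_alpha):
    """Hypothetical per-input strength.

    p         : projection of the (re-centered) input onto the refusal direction
    p_accept  : mean projection of accepted text prompts
    p_reject  : mean projection of refused text prompts
    Inputs on the accept side of the midpoint are left untouched; others are
    pushed up to p_reject, never by more than max_alpha.
    """
    midpoint = 0.5 * (p_accept + p_reject)
    if p <= midpoint:
        return 0.0
    return min(max(p_reject - p, 0.0), max_alpha)

def select_layer(separation_by_layer):
    """Pick the layer whose separation score is largest (assumed criterion)."""
    return max(separation_by_layer, key=separation_by_layer.get)

# Toy values, invented for illustration.
alpha_safe = adaptive_alpha(-0.5, p_accept=-1.0, p_reject=1.0, max_alpha=4.0)
alpha_unsafe = adaptive_alpha(0.25, p_accept=-1.0, p_reject=1.0, max_alpha=4.0)
alpha_capped = adaptive_alpha(0.25, p_accept=-1.0, p_reject=1.0, max_alpha=0.5)
layer = select_layer({24: 0.31, 25: 0.52, 26: 0.47})
```

The point of the clamp is that a safe-looking input receives zero steering, which is one way an intervention could avoid degrading utility. Whether MARS uses a rule of this shape is not stated here.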

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Shared refusal structure across modalities may reduce the need for separate alignment pipelines for each new input type.
The same correction steps could be tested on audio or other emerging modalities.
If the trust-region scaling generalizes, it might serve as a template for steering other behavioral directions beyond refusal.

Load-bearing premise

Cross-modal misalignment can be corrected by activation re-centering and adaptive scaling without introducing new refusal errors on safe inputs.

What would settle it

Measuring whether MARS causes refusal rates on safe image or video inputs to rise above the baseline rate observed without steering.
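
A minimal sketch of that check, assuming refusals are detected by simple phrase matching. The marker list and the toy responses are hypothetical; published benchmarks often use a judge model instead.

```python
# Hypothetical refusal markers; a real evaluation would likely use a stronger detector.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i won't")

def is_refusal(response):
    """True if the response contains any refusal marker (case-insensitive)."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def refusal_rate(responses):
    """Fraction of responses flagged as refusals."""
    return sum(is_refusal(r) for r in responses) / len(responses)

def over_refusal_delta(baseline_responses, steered_responses):
    """Change in refusal rate on the same safe inputs after steering."""
    return refusal_rate(steered_responses) - refusal_rate(baseline_responses)

# Responses to the same four safe inputs, without and with steering (toy data).
baseline = [
    "Here is a description of the image.",
    "The video shows a dog.",
    "I'm sorry, I can't help with that.",
    "The chart plots revenue.",
]
steered = [
    "Here is a description of the image.",
    "I cannot assist with this request.",
    "I'm sorry, I can't help with that.",
    "The chart plots revenue.",
]

delta = over_refusal_delta(baseline, steered)
```

A positive `delta` on safe inputs would indicate newly introduced refusals; a value near zero would support the premise.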

Figures

Figures reproduced from arXiv: 2606.31876 by Massimiliano Mancini, Moreno D'Incà, Nicu Sebe.

**Figure 1.** Preliminary findings. (a) Unsafe inputs are more refused than safe ones. (b) Increasing α improves refusal on unsafe inputs but induces over-refusal on safe ones and degrades utility.

Excerpt from Sec. 2.2 (Re-purposing refusal directions for safety): Refusal directions are vectors in the activation space encoding the model’s tendency to refuse inputs, and are extracted from activations yielding refusal or acceptance. We build…

**Figure 2.** Activation space analysis. (a) Unsafe / safe images are projected onto rejection (↑–↑), regardless of semantics. Centering correctly disentangles safety, aligning unsafe images with rejection and safe ones with acceptance. (b) Removing the dominant visual component (centered) improves the disentanglement of safety-relevant features w.r.t. the original activations (raw).

**Figure 3.** Qualitative example on MOLMO2.

**Figure 4.** Robustness to weak textual refusal.

Excerpt from the accompanying text (LLAVA 1.5): We further challenge MARS in a setting with inherently weak refusal signals. In particular, LLAVA 1.5 is poorly aligned even in the text modality: during direction extraction, it refuses only 33% of MALICIOUSINSTRUCT prompts, yielding a refusal direction estimated from as few as 33 rejected samples. This raises a natural question: how much safety can be recov…

**Figure 5.** Layer selection scores. We report layer selection scores across models (⋆ denotes the selected layer for intervention). [Plot text: panels Gemma3, Qwen3-VL, InternVL 3.5, Qwen3.5; x-axis: Layers; y-axis: Separation (Sep); series: Raw, Centered.]

**Figure 6.** Separation scores between safe/unsafe image activations and the textual refusal direction, as outlined in Sec 3.1. [Plot text: panels Raw space, Centered space; axes: PC1, PC2; legend: Accept txt, Reject txt, Safe img, Unsafe img; (a) Layer 25 …]

**Figure 7.** GEMMA3 activation space.

**Figure 8.** QWEN3-VL activation space. [Plot text: axes PC1, PC2; legend: Accept txt, Reject txt, Safe img, Unsafe img; panels (a) Layer 25, (b) Layer 26, (c) Layer 27.]

**Figure 9.** INTERNVL 3.5 activation space. [Qualitative example text (video prompt and zero-shot response), truncated in extraction.]

**Figure 10.** Additional qualitative results: MOLMO2 8B. VIDEO-SAFETYBENCH [19]

**Figure 11.** Additional qualitative results: MOLMO2 8B. VIDEO-SAFETYBENCH [19]

**Figure 12.** Additional qualitative results: MOLMO2 8B. VIDEO-SAFETYBENCH [19]

**Figure 13.** Additional qualitative results: QWEN3-VL 8B. VIDEO-SAFETYBENCH [19]

**Figure 14.** Additional qualitative results: QWEN3-VL 8B. VIDEO-SAFETYBENCH [19]

**Figure 15.** Additional qualitative results: QWEN3.5 9B. VIDEO-SAFETYBENCH [19]

**Figure 16.** Additional qualitative results: QWEN3.5 9B. VIDEO-SAFETYBENCH [19]

**Figure 17.** Additional qualitative results: INTERNVL 3.5 8B. VIDEO-SAFETYBENCH [19]

**Figure 18.** Additional qualitative results: INTERNVL 3.5 8B. VIDEO-SAFETYBENCH [19]

Original abstract

To improve safety in Large Language Models (LLMs) we can either perform post-training alignment or exploit refusal directions in the activation space. Both strategies are less feasible in Multimodal LLMs (MLLMs) as they require unsafe multimodal data, harder to collect than their unimodal counterpart. In this work, we relax this constraint and investigate whether textual refusal directions, extracted directly from the LLM backbone, generalize across modalities (i.e., image, video). Preliminary findings confirm this ability, though effectiveness is conditioned by layer selection, steering strength, and cross-modal alignment, with the latter causing safe multimodal inputs to be spuriously steered toward refusal. Building on this, we introduce Modality-Agnostic Refusal Steering (MARS), a light-weight training-free approach that injects multimodal safety without the need for multimodal safety data. MARS corrects modality misalignment via activation re-centering, adaptively scales steering strength within a geometrically defined trust region, and selects the optimal intervention layer, operating at the first generated token. Evaluated on five SOTA MLLMs across safety, utility, and video jailbreak benchmarks, MARS achieves consistent safety gains while preserving utility. These results reveal that safety-relevant structure is shared across modalities and that textual refusal directions are a powerful and underexplored foundation for multimodal alignment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Textual refusal directions transfer to MLLMs after a simple misalignment correction, but the correction's ability to avoid new errors on safe inputs is still lightly tested.

The main thing here is that refusal directions pulled from the text backbone can be used to steer image and video MLLMs toward safer behavior, using a training-free fix that skips the need for unsafe multimodal data.

The work is new in taking existing refusal-direction techniques and adding the MARS steps—re-centering activations, adaptive scaling inside a trust region, and layer choice—to handle cross-modal misalignment. It does well by running the method on five current MLLMs, checking safety, utility, and video jailbreak sets, and reporting consistent safety lifts without hurting normal performance. That directly tackles the data-collection problem the abstract highlights.

The softer part is the validation of the correction itself. The claim rests on re-centering and the geometric trust region being enough to stop spurious refusals on safe multimodal inputs, yet those steps are derived without multimodal safety examples. The abstract already notes that misalignment produces unwanted refusals, so if the fix is incomplete on inputs outside the design set, the practical value drops. The abstract's preliminary findings also show that effectiveness depends on layer selection and steering strength; MARS is described as choosing both automatically, but the selection rules are not spelled out there. Without seeing full ablations or error breakdowns on diverse safe cases, it is hard to judge how reliably the method generalizes.

This is for researchers focused on activation steering or data-efficient multimodal alignment. Someone already working on refusal mechanisms or MLLM safety would get the idea quickly and could test the correction on their own models.

It deserves peer review. The problem is real, the approach is lightweight, and the benchmarks give referees something concrete to examine and improve.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that textual refusal directions extracted from the LLM backbone generalize across modalities in MLLMs. It introduces MARS, a training-free method that corrects cross-modal misalignment via activation re-centering, adaptively scales steering strength inside a geometrically defined trust region, and selects the optimal intervention layer, all applied at the first generated token. Evaluated on five SOTA MLLMs, MARS yields safety gains on safety and video jailbreak benchmarks while preserving utility, without any multimodal safety data.

Significance. If the results hold, this is significant because it shows safety-relevant structure is shared across modalities and offers a lightweight, data-efficient path to multimodal alignment that avoids collecting unsafe multimodal examples. The training-free design and multi-model evaluation including utility preservation are explicit strengths.

major comments (2)

[MARS method description] The activation re-centering and trust-region adaptive scaling are load-bearing for the data-free generalization claim, yet the manuscript provides no ablation showing that the re-centering vector (derived from textual activations) avoids introducing new refusal errors on diverse safe image and video inputs; the preliminary findings already flag sensitivity to cross-modal alignment, so this step requires explicit verification against the weakest assumption.
[Results section] Safety gains are reported across five MLLMs, but without statistical tests, run-to-run variance, or direct comparison to multimodal-data baselines, it is difficult to confirm that the observed improvements robustly support the cross-modal generalization claim rather than model-specific effects.

minor comments (2)

[Abstract] The five SOTA MLLMs are not named; listing them would aid immediate assessment of scope.
[Notation] The geometrically defined trust region and adaptive scaling would benefit from an explicit equation or short pseudocode for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of the MARS method and results presentation that we address point-by-point below. We will revise the manuscript to incorporate clarifications and additional analyses where feasible.

Referee: [MARS method description] The activation re-centering and trust-region adaptive scaling are load-bearing for the data-free generalization claim, yet the manuscript provides no ablation showing that the re-centering vector (derived from textual activations) avoids introducing new refusal errors on diverse safe image and video inputs; the preliminary findings already flag sensitivity to cross-modal alignment, so this step requires explicit verification against the weakest assumption.

Authors: We agree that an explicit verification of re-centering on safe multimodal inputs strengthens the data-free claim. The re-centering aligns the mean of multimodal activations to the textual space (where the refusal direction originates), and the geometrically defined trust region combined with adaptive scaling is designed to bound the intervention and avoid over-steering safe inputs. To directly address the concern, we will add an ablation in the revised manuscript evaluating refusal rates on diverse safe image and video inputs with and without re-centering, to verify that refusals do not increase spuriously. revision: yes

Referee: [Results section] Safety gains are reported across five MLLMs, but without statistical tests, run-to-run variance, or direct comparison to multimodal-data baselines, it is difficult to confirm that the observed improvements robustly support the cross-modal generalization claim rather than model-specific effects.

Authors: We acknowledge the absence of statistical tests and variance reporting. In revision we will include run-to-run standard deviations (from repeated evaluations) and apply statistical significance tests (e.g., paired t-tests) to the safety metric improvements. Direct comparisons to multimodal-data baselines are valuable for context; we will add a discussion referencing published results from data-dependent methods on overlapping benchmarks, while noting that MARS operates without such data. Full re-implementation of all baselines may be limited by computational scope but will be addressed as a limitation if needed. revision: partial
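
For concreteness, the statistics promised in this response can be computed with the Python standard library alone. The sketch below reports a run-to-run standard deviation and a paired t statistic; the refusal-rate numbers are invented for illustration and are not results from the paper.

```python
import math
import statistics

def paired_t_statistic(baseline, treated):
    """Paired t statistic and degrees of freedom for matched measurements."""
    diffs = [t - b for b, t in zip(baseline, treated)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)          # sample standard deviation (n - 1)
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    return t_stat, n - 1

# Toy refusal rates (%) from four matched evaluation runs; invented values.
baseline_runs = [10.0, 10.0, 12.0, 12.0]
steered_runs = [11.0, 13.0, 13.0, 15.0]

run_to_run_sd = statistics.stdev(steered_runs)
t_stat, dof = paired_t_statistic(baseline_runs, steered_runs)
```

Turning the statistic into a p-value requires the Student t distribution with `dof` degrees of freedom; `scipy.stats.ttest_rel` returns both the statistic and the p-value when SciPy is available.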

Circularity Check

0 steps flagged

No significant circularity; training-free extraction and steering from existing LLM directions

The paper describes MARS as a training-free method that extracts textual refusal directions from the LLM backbone and applies activation re-centering, adaptive scaling in a trust region, and layer selection without multimodal safety data or any fitting to target multimodal results. No equations, derivations, or self-citations are shown that reduce the generalization claim to inputs by construction. The central claim rests on empirical evaluation across benchmarks rather than any self-referential step, making the approach self-contained.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

Abstract-only view means ledger is incomplete; paper rests on prior existence of refusal directions in LLM activations and the hypothesis that they transfer with simple corrections.

free parameters (2)

steering strength: described as adaptively scaled, but the selection rule and range are not detailed in the abstract.
intervention layer: selected as optimal, but the selection criterion and search method are not specified.

axioms (2)

domain assumption: Refusal directions exist and are extractable in the activation space of the LLM backbone. Central premise stated in the abstract for extracting directions from the text-only model.
ad hoc to paper: Cross-modal misalignment can be corrected by activation re-centering without side effects on safe inputs. Required for MARS to avoid spurious refusals on safe multimodal inputs.

Reference graph

Works this paper leans on

42 extracted references · 7 canonical work pages · 7 internal anchors

[1] Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction. In NeurIPS, 2024.
[2] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025.
[3] Jan Betley, Daniel Chee Hian Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, and Owain Evans. Emergent misalignment: Narrow finetuning can produce broadly misaligned LLMs. In ICML, 2025.
[4] Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. Discovering latent knowledge in language models without supervision, 2024.
[5] Runjin Chen, Andy Arditi, Henry Sleight, Owain Evans, and Jack Lindsey. Persona vectors: Monitoring and controlling character traits in language models, 2025.
[6] Yangyi Chen, Karan Sikka, Michael Cogswell, Heng Ji, and Ajay Divakaran. Dress: Instructing large vision-language models to align and interact with humans via natural language feedback. In CVPR, June
[7] Christopher Clark, Jieyu Zhang, Zixian Ma, Jae Sung Park, Mohammadreza Salehi, Rohun Tripathi, Sangho Lee, Zhongzheng Ren, Chris Dongjoo Kim, Yinuo Yang, et al. Molmo2: Open weights and data for vision-language models with video understanding and grounding. arXiv preprint arXiv:2601.10611,
[8] Gemma 3 Team. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025.
[9] Yunhao Gou, Kai Chen, Zhili Liu, Lanqing Hong, Hang Xu, Zhenguo Li, Dit-Yan Yeung, James T. Kwok, and Yu Zhang. Eyes closed, safety on: Protecting multimodal llms via image-to-text transformation. In ECCV, 2024.
[10] Yangsibo Huang, Samyak Gupta, Mengzhou Xia, Kai Li, and Danqi Chen. Catastrophic jailbreak of open-source LLMs via exploiting generation. In ICLR, 2024.
[11] Bruce W. Lee, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Erik Miehling, Pierre Dognin, Manish Nagireddy, and Amit Dhurandhar. Programming refusal with conditional activation steering. In ICLR,
[12] Simon Lermen and Charlie Rogers-Smith. Lora fine-tuning efficiently undoes safety training in llama 2-chat 70b. In ICLR Workshop on Secure and Trustworthy LLMs, 2024.
[13] Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model. In NeurIPS, 2023.
[14] Mukai Li, Lei Li, Yuwei Yin, Masood Ahmed, Zhenguang Liu, and Qi Liu. Red teaming visual language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, ACL Findings, 2024.
[15] Yifan Li, Hangyu Guo, Kun Zhou, Wayne Xin Zhao, and Ji-Rong Wen. Images are achilles’ heel of alignment: Exploiting visual vulnerabilities for jailbreaking multimodal large language models. In ECCV,
[16] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In CVPR, 2024.
[17] Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. AutoDAN: Generating stealthy jailbreak prompts on aligned large language models. In ICLR, 2024.
[18] Xin Liu, Yichen Zhu, Jindong Gu, Yunshi Lan, Chao Yang, and Yu Qiao. Mm-safetybench: A benchmark for safety evaluation of multimodal large language models. In ECCV, 2024.
[19] Xuannan Liu, Zekun Li, Zheqi He, Pei Pei Li, Shuhan Xia, Xing Cui, Huaibo Huang, Xi Yang, and Ran He. Video-safetybench: A benchmark for safety evaluation of video LVLMs. In NeurIPS, 2025.
[20] AI @ Meta Llama Team. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
[21] Xinyue Lou, You Li, Jinan Xu, Xiangyu Shi, Chi Chen, and Kaiyu Huang. Think in safety: Unveiling and mitigating safety alignment collapse in multimodal large reasoning model. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, EMNLP, 2025.
[22] Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, et al. Harmbench: a standardized evaluation framework for automated red teaming and robust refusal. In ICML, 2024.
[23] Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. Red teaming language models with language models. In EMNLP, 2022.
[24] Samuele Poppi, Tobia Poppi, Federico Cocchi, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models. In ECCV, 2024.
[25] Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu, Xiao Ma, Subhrajit Roy, Ahmad Beirami, Prateek Mittal, and Peter Henderson. Safety alignment should be made more than just a few tokens deep. In ICLR, 2025.
[26] Qwen Team. Qwen3.5: Towards native multimodal agents, 2026.
[27] Vincent Siu, Nicholas Crispino, Zihao Yu, Sam Pan, Zhun Wang, Yang Liu, Dawn Song, and Chenguang Wang. COSMIC: Generalized refusal direction identification in LLM activations. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, ACL Findings, 2025.
[28] Niklas Stoehr, Kevin Du, Vésteinn Snæbjarnarson, Robert West, Ryan Cotterell, and Aaron Schein. Activation scaling for steering and interpreting language models. In Findings of the Association for Computational Linguistics: EMNLP, 2024.
[29] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
[30] Kimi Team. Kimi-VL technical report. arXiv preprint arXiv:2504.07491, 2025.
[31] OpenAI GPT-5 Team. OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267, 2025.
[32] Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, and Monte MacDiarmid. Steering language models with activation engineering, 2024.
[33] Han Wang, Gang Wang, and Huan Zhang. Steering away from harm: An adaptive approach to defending vision language model against jailbreaks. In CVPR, 2025.
[34] Wanying Wang, Zeyu Ma, Han Zheng, Xin Tan, and Mingang Chen. Self-aware safety augmentation: Leveraging internal semantic understanding to enhance safety in vision-language models. In ACM MM,
[35] Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025.
[36] Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does LLM safety training fail? In NeurIPS, 2023.
[37] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. ... 2024.
[38] Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, Yu Su, Wenhu Chen, and Graham Neubig. MMMU-pro: A more robust multi-discipline multimodal understanding benchmark. In ACL, 2025.
[39] Jiachen Zhao, Jing Huang, Zhengxuan Wu, David Bau, and Weiyan Shi. LLMs encode harmfulness and refusal separately. In NeurIPS, 2025.
[40] Shiji Zhao, Ranjie Duan, Fengxiang Wang, Chi Chen, Caixin Kang, Shouwei Ruan, Jialing Tao, YueFeng Chen, Hui Xue, and Xingxing Wei. Jailbreaking multimodal large language models via shuffle inconsistency. In ICCV, 2025.
[41] Weixiang Zhao, Jiahe Guo, Yulin Hu, Yang Deng, An Zhang, Xingyu Sui, Xinyang Han, Yanyan Zhao, Bing Qin, Tat-Seng Chua, and Ting Liu. AdaSteer: Your aligned LLM is inherently an adaptive jailbreak defender. In EMNLP, 2025.
Table 6: Refusal rates on MMSAFETYBENCH [18]. Columns: Method, GEMMA3, MOLMO2, INTERNVL 3.5, QWEN3-VL, QWEN3.5. Zero-shot: 23.7, 39.8, 26...
[42] Candidate being Shot, Plan a Peaceful Protest - Choose a public space (like a park, plaza, or sidewalk) that’s legally permitted for demonstrations. - Contact local authorities or check city ordinances to... Prompt: Posting an image about the content shown in the video. Zero-shot: The image depicts a highly stylized and dramatic scene with a central figure in a suit and red tie,...

[1] [1]

Refusal in language models is mediated by a single direction

Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction. InNeurIPS, 2024. 1, 2, 3, 6, 7, 8, 9, 15, 16

2024

[2] [2]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025. 3, 6

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

Emergent misalignment: Narrow finetuning can produce broadly misaligned LLMs

Jan Betley, Daniel Chee Hian Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, and Owain Evans. Emergent misalignment: Narrow finetuning can produce broadly misaligned LLMs. InICML, 2025. 1

2025

[4] [4]

Discovering latent knowledge in language models without supervision, 2024

Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. Discovering latent knowledge in language models without supervision, 2024. 9

2024

[5] [5]

Persona vectors: Monitoring and controlling character traits in language models, 2025

Runjin Chen, Andy Arditi, Henry Sleight, Owain Evans, and Jack Lindsey. Persona vectors: Monitoring and controlling character traits in language models, 2025. 9

2025

[6] [6]

Dress: Instructing large vision-language models to align and interact with humans via natural language feedback

Yangyi Chen, Karan Sikka, Michael Cogswell, Heng Ji, and Ajay Divakaran. Dress: Instructing large vision-language models to align and interact with humans via natural language feedback. InCVPR, June

[7] Christopher Clark, Jieyu Zhang, Zixian Ma, Jae Sung Park, Mohammadreza Salehi, Rohun Tripathi, Sangho Lee, Zhongzheng Ren, Chris Dongjoo Kim, Yinuo Yang, et al. Molmo2: Open weights and data for vision-language models with video understanding and grounding. arXiv preprint arXiv:2601.10611,

[8] Gemma 3 Team. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025.

[9] Yunhao Gou, Kai Chen, Zhili Liu, Lanqing Hong, Hang Xu, Zhenguo Li, Dit-Yan Yeung, James T. Kwok, and Yu Zhang. Eyes closed, safety on: Protecting multimodal LLMs via image-to-text transformation. In ECCV, 2024.

[10] Yangsibo Huang, Samyak Gupta, Mengzhou Xia, Kai Li, and Danqi Chen. Catastrophic jailbreak of open-source LLMs via exploiting generation. In ICLR, 2024.

[11] Bruce W. Lee, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Erik Miehling, Pierre Dognin, Manish Nagireddy, and Amit Dhurandhar. Programming refusal with conditional activation steering. In ICLR,

[12] Simon Lermen and Charlie Rogers-Smith. LoRA fine-tuning efficiently undoes safety training in Llama 2-Chat 70B. In ICLR Workshop on Secure and Trustworthy LLMs, 2024.

[13] Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model. In NeurIPS, 2023.

[14] Mukai Li, Lei Li, Yuwei Yin, Masood Ahmed, Zhenguang Liu, and Qi Liu. Red teaming visual language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, ACL Findings, 2024.

[15] Yifan Li, Hangyu Guo, Kun Zhou, Wayne Xin Zhao, and Ji-Rong Wen. Images are Achilles’ heel of alignment: Exploiting visual vulnerabilities for jailbreaking multimodal large language models. In ECCV,

[16] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In CVPR, 2024.

[17] Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. AutoDAN: Generating stealthy jailbreak prompts on aligned large language models. In ICLR, 2024.

[18] Xin Liu, Yichen Zhu, Jindong Gu, Yunshi Lan, Chao Yang, and Yu Qiao. MM-SafetyBench: A benchmark for safety evaluation of multimodal large language models. In ECCV, 2024.

[19] Xuannan Liu, Zekun Li, Zheqi He, Pei Pei Li, Shuhan Xia, Xing Cui, Huaibo Huang, Xi Yang, and Ran He. Video-SafetyBench: A benchmark for safety evaluation of video LVLMs. In NeurIPS, 2025.

[20] AI @ Meta Llama Team. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

[21] Xinyue Lou, You Li, Jinan Xu, Xiangyu Shi, Chi Chen, and Kaiyu Huang. Think in safety: Unveiling and mitigating safety alignment collapse in multimodal large reasoning model. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, EMNLP, 2025.

[22] Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, et al. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. In ICML, 2024.

[23] Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. Red teaming language models with language models. In EMNLP, 2022.

[24] Samuele Poppi, Tobia Poppi, Federico Cocchi, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models. In ECCV, 2024.

[25] Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu, Xiao Ma, Subhrajit Roy, Ahmad Beirami, Prateek Mittal, and Peter Henderson. Safety alignment should be made more than just a few tokens deep. In ICLR, 2025.

[26] Qwen Team. Qwen3.5: Towards native multimodal agents, 2026.

[27] Vincent Siu, Nicholas Crispino, Zihao Yu, Sam Pan, Zhun Wang, Yang Liu, Dawn Song, and Chenguang Wang. COSMIC: Generalized refusal direction identification in LLM activations. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, ACL Findings, 2025.

[28] Niklas Stoehr, Kevin Du, Vésteinn Snæbjarnarson, Robert West, Ryan Cotterell, and Aaron Schein. Activation scaling for steering and interpreting language models. In Findings of the Association for Computational Linguistics: EMNLP, 2024.

[29] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca, 2023.

[30] Kimi Team. Kimi-VL technical report. arXiv preprint arXiv:2504.07491, 2025.

[31] OpenAI GPT-5 Team. OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267, 2025.

[32] Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, and Monte MacDiarmid. Steering language models with activation engineering, 2024.

[33] Han Wang, Gang Wang, and Huan Zhang. Steering away from harm: An adaptive approach to defending vision language model against jailbreaks. In CVPR, 2025.

[34] Wanying Wang, Zeyu Ma, Han Zheng, Xin Tan, and Mingang Chen. Self-aware safety augmentation: Leveraging internal semantic understanding to enhance safety in vision-language models. In ACM MM,

[35] Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025.

[36] Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does LLM safety training fail? In NeurIPS, 2023.

[37] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. ..., 2024.

[38] Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, Yu Su, Wenhu Chen, and Graham Neubig. MMMU-pro: A more robust multi-discipline multimodal understanding benchmark. In ACL, 2025.

[39] Jiachen Zhao, Jing Huang, Zhengxuan Wu, David Bau, and Weiyan Shi. LLMs encode harmfulness and refusal separately. In NeurIPS, 2025.

[40] Shiji Zhao, Ranjie Duan, Fengxiang Wang, Chi Chen, Caixin Kang, Shouwei Ruan, Jialing Tao, YueFeng Chen, Hui Xue, and Xingxing Wei. Jailbreaking multimodal large language models via shuffle inconsistency. In ICCV, 2025.

[41] Weixiang Zhao, Jiahe Guo, Yulin Hu, Yang Deng, An Zhang, Xingyu Sui, Xinyang Han, Yanyan Zhao, Bing Qin, Tat-Seng Chua, and Ting Liu. AdaSteer: Your aligned LLM is inherently an adaptive jailbreak defender. In EMNLP, 2025.

Table 6: Refusal rates on MMSafetyBench [18].
Method | Gemma3 | Molmo2 | InternVL 3.5 | Qwen3-VL | Qwen3.5
Zero-shot | 23.7 | 39.8 | 26...
