Safety Geometry Collapse in Multimodal LLMs and Adaptive Drift Correction
Pith reviewed 2026-05-20 10:54 UTC · model grok-4.3
The pith
Multimodal inputs compress separation along the refusal direction, causing safety geometry collapse that drift correction can reverse.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Multimodal inputs induce a drift direction that compresses the projection onto the text-aligned refusal direction, producing safety geometry collapse in which harmful and harmless inputs become harder to separate for refusal. Counteracting the drift through fixed-strength activation intervention restores conditional refusal separability; afterward the model exhibits self-rectification that supplies an internal harmfulness signal, which ReGap uses for adaptive, training-free correction.
What carries the argument
Text-aligned refusal direction together with modality-induced drift direction, where the drift is estimated and subtracted via fixed-strength activation intervention to restore geometric separability.
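As a rough illustration of the fixed-strength intervention described above, the sketch below subtracts a constant multiple of the unit drift direction from a hidden state. The function names, the toy vectors, and the `strength` value are hypothetical and are not taken from the paper; which layers and token positions the authors intervene on is not stated here.

```python
import math

def unit(v):
    """Return v scaled to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def counteract_drift(hidden, drift, strength):
    """Shift a hidden state by `strength` against the unit drift direction."""
    g_hat = unit(drift)
    return [h - strength * g for h, g in zip(hidden, g_hat)]

def project(v, direction):
    """Scalar projection of v onto the unit vector along `direction`."""
    d_hat = unit(direction)
    return sum(a * b for a, b in zip(v, d_hat))

# Toy 3-d example with hypothetical values (assumption, not paper data).
hidden = [1.0, 2.0, 0.5]
drift = [0.0, 3.0, 0.0]
corrected = counteract_drift(hidden, drift, strength=1.5)
print(corrected)                  # [1.0, 0.5, 0.5]
print(project(corrected, drift))  # 0.5
```

The strength is the same for every input in this sketch, which is what distinguishes a fixed-strength intervention from an adaptive one.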
If this is right
- Conditional refusal separability recovers once the estimated drift is counteracted.
- Self-rectification emerges as a reliable internal signal of the model's perceived harmfulness.
- ReGap improves safety scores on multimodal benchmarks while preserving performance on utility tasks.
- Representation-level modality alignment can be performed at inference time without parameter updates.
Where Pith is reading between the lines
- Safety mechanisms may depend more on maintaining geometric alignment across modalities than on the content of training data alone.
- Similar drift-induced collapses could affect other alignment properties such as factual consistency or bias detection in multimodal settings.
- Adaptive correction using internal signals might generalize to dynamic, multi-turn interactions that mix text with other modalities.
Load-bearing premise
A single stable refusal direction aligned with text remains identifiable and meaningful once modality-induced drift appears, and drift is the main factor whose correction restores separability.
What would settle it
Measure refusal rates on harmful multimodal inputs before and after the fixed drift-counteracting intervention; if rates fail to rise or if safe inputs begin to be refused, the claimed causal role of drift in safety geometry collapse is undermined.
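The check above could be scripted roughly as follows. This is a minimal sketch under stated assumptions: refusal outcomes are given as lists of booleans, and the tolerance `max_over_refusal_increase` is a hypothetical threshold, not a value from the paper.

```python
def refusal_rate(refused):
    """Fraction of inputs the model refused (list of booleans)."""
    if not refused:
        raise ValueError("need at least one input")
    return sum(refused) / len(refused)

def drift_claim_survives(harmful_before, harmful_after,
                         harmless_before, harmless_after,
                         max_over_refusal_increase=0.05):
    """Two-sided check: harmful refusals must rise after the intervention,
    and harmless refusals must not rise by more than the tolerance."""
    harmful_gain = refusal_rate(harmful_after) - refusal_rate(harmful_before)
    harmless_rise = refusal_rate(harmless_after) - refusal_rate(harmless_before)
    return harmful_gain > 0 and harmless_rise <= max_over_refusal_increase

# Hypothetical outcomes for four harmful and four harmless multimodal inputs.
supported = drift_claim_survives(
    harmful_before=[False, False, True, False],
    harmful_after=[True, True, True, False],
    harmless_before=[False, False, False, False],
    harmless_after=[False, False, False, False],
)
print(supported)  # True
```

A real evaluation would also need a refusal classifier and enough inputs for the rate difference to be statistically meaningful; neither is modeled here.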
Original abstract
Multimodal large language models (MLLMs) often fail to transfer safety capabilities learned in the text modality to semantically equivalent non-text inputs, revealing a persistent multimodal safety gap. We study this gap from a representation-geometric perspective by analyzing a text-aligned refusal direction and a modality-induced drift direction. We show that multimodal inputs compress the usable separation along the refusal direction, making it no longer reliable for identifying and refusing harmful inputs. We refer to this failure mode as Safety Geometry Collapse. We quantify it through conditional refusal separability and show that stronger modality-induced drift is consistently associated with weaker refusal separability and higher attack success rates. We then validate the causal role of modality-induced drift through a fixed-strength activation intervention: counteracting the estimated drift restores refusal separability and improves multimodal safety. After drift correction, we further observe self-rectification, where the model recovers its ability to recognize and refuse harmful multimodal inputs during forward dynamics. This effect also provides an internal signal of the model's perceived harmfulness of each input. Motivated by this signal, we propose ReGap, a training-free inference-time method that adaptively corrects modality drift using self-rectification. Experiments across multiple multimodal safety benchmarks and utility benchmarks demonstrate the effectiveness of ReGap, which significantly improves the safety of MLLMs without compromising general capabilities. Our findings highlight representation-level modality alignment as a crucial direction for real-time safety improvement and for building safer, more reliable MLLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that multimodal large language models exhibit 'Safety Geometry Collapse,' where multimodal inputs compress the separation along a text-aligned refusal direction, reducing its reliability for detecting and refusing harmful content. This is quantified via conditional refusal separability, shown to correlate with modality-induced drift strength and attack success rates. The causal role of drift is validated using a fixed-strength activation intervention that counteracts the drift, restoring separability and triggering self-rectification. Motivated by this, the authors propose ReGap, a training-free inference-time method for adaptive drift correction, which improves safety on multiple benchmarks without compromising utility.
Significance. If the geometric analysis and causal validation hold, this paper provides a valuable representation-level insight into the multimodal safety gap and a practical mitigation strategy. The identification of self-rectification as an internal signal is a strength, and the training-free nature of ReGap makes it immediately applicable. This could influence future work on aligning representations across modalities for safety.
major comments (1)
- [Causal validation via activation intervention] The fixed-strength activation intervention used to validate the causal role of modality-induced drift lacks specificity controls such as random directions of matched magnitude or zero-strength baselines. Without these, the observed restoration of refusal separability could be due to generic steering effects rather than targeted drift correction, which is central to supporting the primary causal claim and the motivation for ReGap.
minor comments (2)
- [Abstract and experimental results] The abstract and methods description lack details on dataset sizes, statistical tests, and error bars for the reported improvements in separability and attack success rates, which would help assess the robustness of the findings.
- [Representation analysis] Clarify the exact definition and extraction method for the text-aligned refusal direction and the modality-induced drift direction, perhaps with an equation reference, to make the geometric claims more precise.
Simulated Author's Rebuttal
We thank the referee for the positive evaluation of our geometric analysis, the identification of self-rectification, and the practical value of the training-free ReGap method. We address the single major comment below with a commitment to strengthen the causal evidence.
Referee: [Causal validation via activation intervention] The fixed-strength activation intervention used to validate the causal role of modality-induced drift lacks specificity controls such as random directions of matched magnitude or zero-strength baselines. Without these, the observed restoration of refusal separability could be due to generic steering effects rather than targeted drift correction, which is central to supporting the primary causal claim and the motivation for ReGap.
Authors: We agree that explicit specificity controls are necessary to rule out generic steering. The intervention is constructed by estimating the modality-induced drift vector (difference between multimodal and text-aligned representations) and applying a fixed-magnitude correction in the opposing direction; however, we acknowledge that this alone does not fully isolate the effect. In the revised manuscript we will add two controls: (1) a zero-strength baseline (no activation added, corresponding to the original model) and (2) interventions along random unit vectors scaled to the same magnitude as the drift correction. These will be evaluated on the same refusal-separability and safety metrics. Preliminary internal checks indicate that random directions produce negligible restoration compared with the drift-specific correction; we will report the quantitative differences and statistical significance in the revision.
Revision: yes
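The two controls named in the rebuttal could be set up along these lines. This is an illustrative sketch only: `build_conditions`, the number of random directions, and the toy correction vector are hypothetical, and the authors' actual sampling procedure is not described in the source.

```python
import math
import random

def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def random_direction_like(correction, rng):
    """Random Gaussian direction rescaled to the norm of `correction`."""
    raw = [rng.gauss(0.0, 1.0) for _ in correction]
    scale = norm(correction) / norm(raw)
    return [scale * x for x in raw]

def build_conditions(correction, n_random=3, seed=0):
    """Return the activation offsets to compare on the same metrics."""
    rng = random.Random(seed)
    conditions = {
        "zero_strength": [0.0] * len(correction),  # original model
        "drift_correction": list(correction),      # targeted intervention
    }
    for i in range(n_random):
        conditions[f"random_{i}"] = random_direction_like(correction, rng)
    return conditions

# Hypothetical 4-d drift correction with norm 2.5 (assumption, not paper data).
conditions = build_conditions([0.0, -1.5, 0.0, 2.0])
for name, offset in conditions.items():
    print(name, round(norm(offset), 6))
```

Matching the norm across conditions means any difference in restored separability can be attributed to the direction of the offset rather than its size.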
Circularity Check
No significant circularity detected
The paper's derivation proceeds from empirical identification of a text-aligned refusal direction, measurement of its compression under multimodal inputs via conditional refusal separability, observed correlations with modality-induced drift, and a fixed-strength intervention to test causality, followed by motivation of the ReGap method from the resulting self-rectification signal. None of these steps reduce by construction to the inputs: separability and drift are quantified as independent geometric quantities, the intervention is an external manipulation rather than a tautological restatement, and no self-citations or fitted parameters are invoked as load-bearing premises for the central claims. The chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: A text-aligned refusal direction can be identified and remains a meaningful axis once multimodal inputs are introduced.
invented entities (1)
- Safety Geometry Collapse: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We formulate multimodal safety through a two-dimensional diagnostic space defined by a text-aligned refusal direction and a modality-induced drift direction... $\Phi^l(x) = (\phi^l_r(x), \phi^l_g(x))$"
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We estimate the text-aligned refusal direction... $r^l = \mu^l_{\mathrm{ref}} - \mu^l_{\mathrm{comp}}$ ... $g^l = g^l_{\mathrm{raw}} - \frac{(g^l_{\mathrm{raw}})^\top r^l}{\|r^l\|^2}\, r^l$"
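The two quoted formulas can be traced through with a small standard-library sketch. Several things here are assumptions rather than statements from the paper: that the `ref` and `comp` means are taken over hidden states of refused and complied inputs, that the diagnostic coordinates are scalar projections onto the unit directions, and all toy values. How the raw drift `g_raw` is estimated is not shown in the quoted passage, so it is passed in directly.

```python
import math

def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def refusal_direction(ref_states, comp_states):
    """r = mu_ref - mu_comp at one layer."""
    mu_ref, mu_comp = mean(ref_states), mean(comp_states)
    return [a - b for a, b in zip(mu_ref, mu_comp)]

def orthogonalize_drift(g_raw, r):
    """g = g_raw - (g_raw . r / ||r||^2) r, removing the component along r."""
    coeff = dot(g_raw, r) / dot(r, r)
    return [g - coeff * x for g, x in zip(g_raw, r)]

def diagnostic_coords(h, r, g):
    """(phi_r, phi_g) as scalar projections onto unit r and unit g (assumed)."""
    return (dot(h, r) / math.sqrt(dot(r, r)),
            dot(h, g) / math.sqrt(dot(g, g)))

# Toy 2-d hidden states (hypothetical values).
r = refusal_direction([[2.0, 0.0], [4.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]])
g = orthogonalize_drift([1.0, 3.0], r)
print(r)                                    # [2.0, 0.0]
print(g)                                    # [0.0, 3.0]
print(dot(g, r))                            # 0.0
print(diagnostic_coords([4.0, 6.0], r, g))  # (4.0, 6.0)
```

The projection step is what makes the two axes of the diagnostic space orthogonal, so movement along the drift axis does not by construction change the refusal coordinate.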
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.