Safety Geometry Collapse in Multimodal LLMs and Adaptive Drift Correction

Bing Qin; Dandan Tu; Jiahe Guo; Jiaxuan Chen; Qianchao Wang; Weixiang Zhao; Xiangran Guo; Yanyan Zhao; Yutai Hou

arXiv: 2605.18104 · v1 · pith:WVE3KQN5 · submitted 2026-05-18 · cs.AI · cs.CR

Pith reviewed 2026-05-20 10:54 UTC · model grok-4.3

classification 💻 cs.AI cs.CR

keywords: multimodal LLMs, safety geometry collapse, refusal direction, modality drift, drift correction, self-rectification, inference-time intervention, MLLM safety

The pith

Multimodal inputs compress separation along the refusal direction, causing safety geometry collapse that drift correction can reverse.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that multimodal large language models lose reliable refusal behavior on non-text inputs because those inputs induce a drift that shrinks the usable component along a text-aligned refusal direction. This compression is called safety geometry collapse and is tracked through conditional refusal separability, which declines as drift strengthens and attack success rises. The authors show that a fixed-strength activation intervention to cancel the estimated drift restores separability and triggers self-rectification, in which the model regains the capacity to recognize and refuse harmful content during its own forward pass. They then introduce ReGap, a training-free method that reads the self-rectification signal to correct drift adaptively at inference time. If the account is right, existing models can close the multimodal safety gap without retraining or loss of general capability.

Core claim

Multimodal inputs induce a drift direction that compresses the projection onto the text-aligned refusal direction, producing safety geometry collapse in which harmful and harmless inputs become harder to separate for refusal. Counteracting the drift through fixed-strength activation intervention restores conditional refusal separability; afterward the model exhibits self-rectification that supplies an internal harmfulness signal, which ReGap uses for adaptive, training-free correction.

What carries the argument

Text-aligned refusal direction together with modality-induced drift direction, where the drift is estimated and subtracted via fixed-strength activation intervention to restore geometric separability.
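The fixed-strength intervention described above can be illustrated with a minimal NumPy sketch. It assumes the intervention is a plain subtraction of a scaled unit drift vector from the hidden state at one layer; the function name `counteract_drift`, the strength `alpha`, and the synthetic vectors are hypothetical and not taken from the paper.

```python
import numpy as np

def counteract_drift(hidden, drift_dir, alpha):
    """Shift hidden states by a fixed amount against the drift direction.

    hidden:    array of shape (..., d), activations at one layer
    drift_dir: array of shape (d,), estimated modality-induced drift direction
    alpha:     fixed intervention strength (hypothetical hyperparameter)
    """
    g_hat = drift_dir / np.linalg.norm(drift_dir)
    return hidden - alpha * g_hat

rng = np.random.default_rng(0)
d = 16
r = rng.normal(size=d)                       # stand-in refusal direction
g = rng.normal(size=d)
g = g - (g @ r) / (r @ r) * r                # make the drift orthogonal to r
g_hat = g / np.linalg.norm(g)
r_hat = r / np.linalg.norm(r)

# Synthetic states pushed 3 units along the drift direction.
h = rng.normal(size=(4, d)) + 3.0 * g_hat
h_fixed = counteract_drift(h, g, alpha=3.0)

drift_before, drift_after = h @ g_hat, h_fixed @ g_hat
refusal_before, refusal_after = h @ r_hat, h_fixed @ r_hat
```

In this toy setting the drift coordinate drops by exactly `alpha`, while the coordinate along the refusal direction at the same layer is untouched because the two directions are orthogonal by construction. Any recovery of refusal separability would therefore have to appear in later layers, which is consistent with the summary's description of self-rectification during the forward pass.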

If this is right

Conditional refusal separability recovers once the estimated drift is counteracted.
Self-rectification emerges as a reliable internal signal of the model's perceived harmfulness.
ReGap improves safety scores on multimodal benchmarks while preserving performance on utility tasks.
Representation-level modality alignment can be performed at inference time without parameter updates.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Safety mechanisms may depend more on maintaining geometric alignment across modalities than on the content of training data alone.
Similar drift-induced collapses could affect other alignment properties such as factual consistency or bias detection in multimodal settings.
Adaptive correction using internal signals might generalize to dynamic, multi-turn interactions that mix text with other modalities.

Load-bearing premise

A single stable refusal direction aligned with text remains identifiable and meaningful once modality-induced drift appears, and drift is the main factor whose correction restores separability.

What would settle it

Measure refusal rates on harmful multimodal inputs before and after the fixed drift-counteracting intervention; if rates fail to rise or if safe inputs begin to be refused, the claimed causal role of drift in safety geometry collapse is undermined.
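The check described above amounts to comparing refusal rates on harmful and benign inputs before and after the intervention. The following is a small sketch of that bookkeeping; the function names, the tolerance `tol`, and the toy 0/1 refusal flags are illustrative assumptions, not data or code from the paper.

```python
def refusal_rate(refused_flags):
    """Fraction of inputs that were refused (flags are 0 or 1)."""
    return sum(refused_flags) / len(refused_flags)

def drift_claim_check(harm_before, harm_after, benign_before, benign_after, tol=0.0):
    """Compare refusal rates before and after the drift-counteracting intervention.

    The claim is treated as supported only if harmful inputs are refused more
    often and benign inputs are not refused more often (beyond `tol`).
    """
    harm_gain = refusal_rate(harm_after) - refusal_rate(harm_before)
    benign_gain = refusal_rate(benign_after) - refusal_rate(benign_before)
    return {
        "harmful_refusal_gain": harm_gain,
        "benign_refusal_gain": benign_gain,
        "consistent_with_claim": harm_gain > tol and benign_gain <= tol,
    }

# Toy flags, invented for illustration only.
result = drift_claim_check(
    harm_before=[0, 0, 1, 0], harm_after=[1, 1, 1, 0],
    benign_before=[0, 0, 0, 0], benign_after=[0, 0, 0, 0],
)
```

A real evaluation would replace the toy flags with per-example refusal judgments from a benchmark and would likely add confidence intervals, which this sketch omits.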

Figures

Figures reproduced from arXiv: 2605.18104 by Bing Qin, Dandan Tu, Jiahe Guo, Jiaxuan Chen, Qianchao Wang, Weixiang Zhao, Xiangran Guo, Yanyan Zhao, Yutai Hou.

**Figure 1.** Geometric view of multimodal safety. Left: Text-aligned refusal geometry. Middle: Modality-induced drift and Safety Geometry Collapse in MLLMs. Right: Intervention against modality-induced drift, self-rectification dynamics, and ReGap.

**Figure 2.** Visualization of the multimodal safety space at middle layers of MiniCPM-o-4.5 [Yi et al., 2025]. Multimodal inputs share a text-aligned refusal direction, but larger modality-induced drift makes refused and complied harmful inputs increasingly entangled along this direction.

**Figure 3.** Quantification of Safety Geometry Collapse across three models on the calibration split, …

**Figure 4.** Effect of fixed-strength intervention against modality-induced drift. The x-axis denotes …

**Figure 5.** Effect of drift intervention. Left: Representation shift in the multimodal safety space with intervention. Right: Layer-wise self-rectification scores for harmful and benign inputs.

**Figure 6.** Refusal prompt for conditional ASR.

**Figure 7.** Understanding prompt for conditional ASR.

**Figure 8.** Unified rule-based evaluation pipelines for OmniBench, MMMU-Pro, and MMAU.

**Figure 9.** Pairwise cosine similarities between modality-specific refusal directions estimated on Omni…

Original abstract

Multimodal large language models (MLLMs) often fail to transfer safety capabilities learned in the text modality to semantically equivalent non-text inputs, revealing a persistent multimodal safety gap. We study this gap from a representation-geometric perspective by analyzing a text-aligned refusal direction and a modality-induced drift direction. We show that multimodal inputs compress the usable separation along the refusal direction, making it no longer reliable for identifying and refusing harmful inputs. We refer to this failure mode as Safety Geometry Collapse. We quantify it through conditional refusal separability and show that stronger modality-induced drift is consistently associated with weaker refusal separability and higher attack success rates. We then validate the causal role of modality-induced drift through a fixed-strength activation intervention: counteracting the estimated drift restores refusal separability and improves multimodal safety. After drift correction, we further observe self-rectification, where the model recovers its ability to recognize and refuse harmful multimodal inputs during forward dynamics. This effect also provides an internal signal of the model's perceived harmfulness of each input. Motivated by this signal, we propose ReGap, a training-free inference-time method that adaptively corrects modality drift using self-rectification. Experiments across multiple multimodal safety benchmarks and utility benchmarks demonstrate the effectiveness of ReGap, which significantly improves the safety of MLLMs without compromising general capabilities. Our findings highlight representation-level modality alignment as a crucial direction for real-time safety improvement and for building safer, more reliable MLLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper frames multimodal safety failures as compression along a refusal direction and offers a training-free adaptive correction via self-rectification signals.

The core observation is that multimodal inputs shrink the usable separation along a text-aligned refusal direction, which they label Safety Geometry Collapse, and that counteracting the associated drift restores separability and triggers some self-rectification during forward passes. They turn the rectification signal into ReGap, a simple inference-time adjustment that improves safety scores on multimodal benchmarks while leaving utility metrics mostly intact. That combination of geometric diagnosis and a practical, training-free lever is the clearest new piece.

The intervention experiment gives at least correlational support for treating drift as causal, and the fact that they track both safety and capability benchmarks is a plus. The method itself looks lightweight enough to try on other models without much overhead.

The main soft spot is that the causal test uses a fixed-strength correction without the usual specificity checks, such as random directions of similar magnitude or zero-strength baselines. Without those, it is hard to rule out that any consistent activation offset would produce similar restoration. The abstract also skips error bars, exact dataset splits, and statistical tests, so the strength of the separability and attack-success correlations is difficult to judge from the summary alone. The refusal direction is treated as stable and identifiable once drift is present, which may or may not hold under broader testing.

This work is aimed at researchers who already think about representation geometry in safety settings and want inference-time options rather than another fine-tuning run. A reader who cares about practical multimodal guardrails would find the ReGap procedure worth trying. The idea is concrete enough and the experiments are broad enough that it deserves a serious referee rather than a desk reject, even if the causality section needs tightening.

Referee Report

1 major / 2 minor

Summary. The manuscript claims that multimodal large language models exhibit 'Safety Geometry Collapse,' where multimodal inputs compress the separation along a text-aligned refusal direction, reducing its reliability for detecting and refusing harmful content. This is quantified via conditional refusal separability, shown to correlate with modality-induced drift strength and attack success rates. The causal role of drift is validated using a fixed-strength activation intervention that counteracts the drift, restoring separability and triggering self-rectification. Motivated by this, the authors propose ReGap, a training-free inference-time method for adaptive drift correction, which improves safety on multiple benchmarks without compromising utility.

Significance. If the geometric analysis and causal validation hold, this paper provides a valuable representation-level insight into the multimodal safety gap and a practical mitigation strategy. The identification of self-rectification as an internal signal is a strength, and the training-free nature of ReGap makes it immediately applicable. This could influence future work on aligning representations across modalities for safety.

major comments (1)

[Causal validation via activation intervention] The fixed-strength activation intervention used to validate the causal role of modality-induced drift lacks specificity controls such as random directions of matched magnitude or zero-strength baselines. Without these, the observed restoration of refusal separability could be due to generic steering effects rather than targeted drift correction, which is central to supporting the primary causal claim and the motivation for ReGap.

minor comments (2)

[Abstract and experimental results] The abstract and methods description lack details on dataset sizes, statistical tests, and error bars for the reported improvements in separability and attack success rates, which would help assess the robustness of the findings.
[Representation analysis] Clarify the exact definition and extraction method for the text-aligned refusal direction and the modality-induced drift direction, perhaps with an equation reference, to make the geometric claims more precise.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive evaluation of our geometric analysis, the identification of self-rectification, and the practical value of the training-free ReGap method. We address the single major comment below with a commitment to strengthen the causal evidence.

Referee: [Causal validation via activation intervention] The fixed-strength activation intervention used to validate the causal role of modality-induced drift lacks specificity controls such as random directions of matched magnitude or zero-strength baselines. Without these, the observed restoration of refusal separability could be due to generic steering effects rather than targeted drift correction, which is central to supporting the primary causal claim and the motivation for ReGap.

Authors: We agree that explicit specificity controls are necessary to rule out generic steering. The intervention is constructed by estimating the modality-induced drift vector (difference between multimodal and text-aligned representations) and applying a fixed-magnitude correction in the opposing direction; however, we acknowledge that this alone does not fully isolate the effect. In the revised manuscript we will add two controls: (1) a zero-strength baseline (no activation added, corresponding to the original model) and (2) interventions along random unit vectors scaled to the same magnitude as the drift correction. These will be evaluated on the same refusal-separability and safety metrics. Preliminary internal checks indicate that random directions produce negligible restoration compared with the drift-specific correction; we will report the quantitative differences and statistical significance in the revision.

revision: yes
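The two controls named in the rebuttal can be set up in a few lines. The sketch below assumes the intervention is an additive offset on hidden states; `random_control_vectors`, `apply_offset`, the strength `alpha`, and the synthetic activations are hypothetical names and values chosen for illustration.

```python
import numpy as np

def random_control_vectors(dim, magnitude, n_controls, seed=0):
    """Random directions rescaled so each has the given L2 norm."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=(n_controls, dim))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    return magnitude * v

def apply_offset(hidden, offset, strength=1.0):
    """Add a (scaled) offset vector to every hidden state."""
    return hidden + strength * offset

d, alpha = 64, 4.0
rng = np.random.default_rng(1)
g_hat = rng.normal(size=d)
g_hat /= np.linalg.norm(g_hat)               # stand-in unit drift direction
drift_correction = -alpha * g_hat            # targeted correction

hidden = rng.normal(size=(8, d))             # synthetic activations
baseline = apply_offset(hidden, drift_correction, strength=0.0)   # control 1
targeted = apply_offset(hidden, drift_correction)
controls = random_control_vectors(d, np.linalg.norm(drift_correction), 5)
control_runs = [apply_offset(hidden, c) for c in controls]        # control 2
```

Each of `baseline`, `targeted`, and `control_runs` would then be scored with the same separability and safety metrics. Matching the norm of the random offsets to the norm of the drift correction is what makes the comparison about direction rather than magnitude.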

Circularity Check

0 steps flagged

No significant circularity detected

The paper's derivation proceeds from empirical identification of a text-aligned refusal direction, measurement of its compression under multimodal inputs via conditional refusal separability, observed correlations with modality-induced drift, and a fixed-strength intervention to test causality, followed by motivation of the ReGap method from the resulting self-rectification signal. None of these steps reduce by construction to the inputs: separability and drift are quantified as independent geometric quantities, the intervention is an external manipulation rather than a tautological restatement, and no self-citations or fitted parameters are invoked as load-bearing premises for the central claims. The chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the assumption that a stable refusal direction exists in representation space and that modality drift is the dominant cause of collapse; no free parameters or invented physical entities are declared in the abstract.

axioms (1)

domain assumption: A text-aligned refusal direction can be identified and remains a meaningful axis once multimodal inputs are introduced.
Invoked when defining Safety Geometry Collapse and when performing the fixed-strength activation intervention.

invented entities (1)

Safety Geometry Collapse (no independent evidence)
purpose: Label for the observed compression of refusal separability.
New descriptive term for the failure mode; no independent falsifiable prediction supplied in the abstract.

pith-pipeline@v0.9.0 · 5817 in / 1294 out tokens · 38310 ms · 2026-05-20T10:54:34.338561+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.

We formulate multimodal safety through a two-dimensional diagnostic space defined by a text-aligned refusal direction and a modality-induced drift direction... $\Phi^l(x) = (\phi^l_r(x), \phi^l_g(x))$

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.

We estimate the text-aligned refusal direction... $r^l = \mu^l_{\mathrm{ref}} - \mu^l_{\mathrm{comp}}$ ... $g^l = g^l_{\mathrm{raw}} - \frac{(g^l_{\mathrm{raw}})^\top r^l}{\lVert r^l \rVert^2}\, r^l$
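The two quoted passages can be read together as a recipe: take a difference of means for the refusal direction, remove its component from the raw drift, and project hidden states onto the resulting pair. The NumPy sketch below follows the quoted formulas for a single layer. It assumes the raw drift is the multimodal mean minus the text mean (as the simulated rebuttal describes) and that the coordinates are projections onto unit vectors; both points, along with all function names and the synthetic data, are assumptions of this sketch.

```python
import numpy as np

def refusal_direction(h_refused, h_complied):
    """r = mean(refused) - mean(complied); each input has shape (n, d)."""
    return h_refused.mean(axis=0) - h_complied.mean(axis=0)

def drift_direction(h_multimodal, h_text, r):
    """Raw drift (assumed: multimodal mean minus text mean) with its
    component along r projected out: g = g_raw - (g_raw.r / ||r||^2) r."""
    g_raw = h_multimodal.mean(axis=0) - h_text.mean(axis=0)
    return g_raw - (g_raw @ r) / (r @ r) * r

def diagnostic_coords(h, r, g):
    """Map states of shape (n, d) to 2-D coordinates (phi_r, phi_g).
    Projecting onto unit-normalised directions is an assumption here."""
    basis = np.stack([r / np.linalg.norm(r), g / np.linalg.norm(g)], axis=1)
    return h @ basis

rng = np.random.default_rng(0)
d = 32
h_ref = rng.normal(size=(50, d)) + 2.0       # synthetic "refused" text states
h_comp = rng.normal(size=(50, d))            # synthetic "complied" text states
h_text = np.concatenate([h_ref, h_comp])
h_mm = rng.normal(size=(100, d)) + rng.normal(size=d)   # shifted multimodal states

r = refusal_direction(h_ref, h_comp)
g = drift_direction(h_mm, h_text, r)
coords = diagnostic_coords(h_mm, r, g)       # column 0: phi_r, column 1: phi_g
```

Because the projection step removes the component of the raw drift along `r`, the two axes of the diagnostic space are orthogonal, so movement along the drift axis does not by itself change the refusal coordinate at that layer.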

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · 6 internal anchors

[1]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond , author=. arXiv preprint arXiv:2308.12966 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models , author=. arXiv preprint arXiv:2311.07919 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Advances in neural information processing systems , volume=

Visual instruction tuning , author=. Advances in neural information processing systems , volume=

work page
[4]

Deyao Zhu and Jun Chen and Xiaoqian Shen and Xiang Li and Mohamed Elhoseiny , booktitle=. Mini. 2024 , url=

work page 2024
[5]

2025 , eprint=

Step-Audio 2 Technical Report , author=. 2025 , eprint=

work page 2025
[6]

MiniCPM-V: A GPT-4V Level MLLM on Your Phone

MiniCPM-V: A GPT-4V Level MLLM on Your Phone , author=. arXiv preprint arXiv:2408.01800 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[7]

2025 , eprint=

Qwen2.5-Omni Technical Report , author=. 2025 , eprint=

work page 2025
[8]

Qwen3-Omni Technical Report

Qwen3-Omni Technical Report , author=. arXiv preprint arXiv:2509.17765 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[9]

Baichuan-omni-1.5 technical report

Baichuan-Omni-1.5 Technical Report , author=. arXiv preprint arXiv:2501.15368 , year=

work page arXiv
[10]

2025 , eprint=

OmniBench: Towards The Future of Universal Omni-Language Models , author=. 2025 , eprint=

work page 2025
[11]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

work page
[12]

2024 , eprint=

MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark , author=. 2024 , eprint=

work page 2024
[13]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Spa-vl: A comprehensive safety preference alignment dataset for vision language models , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[14]

European Conference on Computer Vision , pages=

Mm-safetybench: A benchmark for safety evaluation of multimodal large language models , author=. European Conference on Computer Vision , pages=. 2024 , organization=

work page 2024
[15]

arXiv preprint arXiv:2508.07173 , year=

Omni-SafetyBench: A benchmark for safety evaluation of audio-visual large language models , author=. arXiv preprint arXiv:2508.07173 , year=

work page arXiv
[16]

Findings of the Association for Computational Linguistics: ACL 2025 , pages=

Unraveling and mitigating safety alignment degradation of vision-language models , author=. Findings of the Association for Computational Linguistics: ACL 2025 , pages=

work page 2025
[17]

arXiv preprint arXiv:2502.13095 , year=

Understanding and rectifying safety perception distortion in vlms , author=. arXiv preprint arXiv:2502.13095 , year=

work page arXiv
[18]

arXiv preprint arXiv:2502.10486 , year=

VLM-Guard: Safeguarding vision-language models via fulfilling safety alignment gap , author=. arXiv preprint arXiv:2502.10486 , year=

work page arXiv
[19]

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

Mllm-protector: Ensuring mllm’s safety without hurting performance , author=. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

work page 2024
[20]

Proceedings of the 41st International Conference on Machine Learning , pages=

Safety fine-tuning at (almost) no cost: a baseline for vision large language models , author=. Proceedings of the 41st International Conference on Machine Learning , pages=

work page
[21]

Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models? , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

work page
[22]

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

Adasteer: Your aligned llm is inherently an adaptive jailbreak defender , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

work page 2025
[23]

Advances in Neural Information Processing Systems , volume=

Refusal in language models is mediated by a single direction , author=. Advances in Neural Information Processing Systems , volume=

work page
[24]

GPT-4 Technical Report

Gpt-4 technical report , author=. arXiv preprint arXiv:2303.08774 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[25]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Figstep: Jailbreaking large vision-language models via typographic visual prompts , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

work page
[26]

Advances in Neural Information Processing Systems , volume=

Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning , author=. Advances in Neural Information Processing Systems , volume=

work page
[27]

European Conference on Computer Vision , pages=

Images are achilles’ heel of alignment: Exploiting visual vulnerabilities for jailbreaking multimodal large language models , author=. European Conference on Computer Vision , pages=. 2024 , organization=

work page 2024
[28]

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

Understanding the Modality Gap: An Empirical Study on the Speech-Text Alignment Mechanism of Large Speech Language Models , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

work page 2025
[29]

arXiv preprint arXiv:2603.09095 , year=

Reading, Not Thinking: Understanding and Bridging the Modality Gap When Text Becomes Pixels in Multimodal LLMs , author=. arXiv preprint arXiv:2603.09095 , year=

work page arXiv
[30]

Forty-first International Conference on Machine Learning , year=

The Linear Representation Hypothesis and the Geometry of Large Language Models , author=. Forty-first International Conference on Machine Learning , year=

work page
[31]

Proceedings of the 2013 conference of the north american chapter of the association for computational linguistics: Human language technologies , pages=

Linguistic regularities in continuous space word representations , author=. Proceedings of the 2013 conference of the north american chapter of the association for computational linguistics: Human language technologies , pages=

work page 2013
[32]

arXiv preprint arXiv:2410.02298 , year=

Jailbreak antidote: Runtime safety-utility balance via sparse representation adjustment in large language models , author=. arXiv preprint arXiv:2410.02298 , year=

work page arXiv
[33]

arXiv preprint arXiv:2501.16727 , year=

xjailbreak: Representation space guided reinforcement learning for interpretable llm jailbreaking , author=. arXiv preprint arXiv:2501.16727 , year=

work page arXiv
[34]

arXiv preprint arXiv:2411.11114 , year=

Jailbreaklens: Interpreting jailbreak mechanism in the lens of representation and circuit , author=. arXiv preprint arXiv:2411.11114 , year=

work page arXiv
[35]

ArXiv preprint, abs/2401.06824 , year=

Rethinking jailbreaking through the lens of representation engineering , author=. ArXiv preprint, abs/2401.06824 , year=

work page arXiv
[36]

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

Towards understanding jailbreak attacks in llms: A representation space analysis , author=. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

work page 2024
[37]

Representation Engineering: A Top-Down Approach to AI Transparency

Representation engineering: A top-down approach to ai transparency , author=. arXiv preprint arXiv:2310.01405 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[38]

2025 , eprint=

Decipher the Modality Gap in Multimodal Contrastive Learning: From Convergent Representations to Pairwise Alignment , author=. 2025 , eprint=

work page 2025
[39]

arXiv preprint arXiv:2410.03415 , year=

Surgical, cheap, and flexible: Mitigating false refusal in language models via single vector ablation , author=. arXiv preprint arXiv:2410.03415 , year=

work page arXiv
[40]

Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles , year=

Efficient Memory Management for Large Language Model Serving with PagedAttention , author=. Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles , year=

work page
[41]

2025 , eprint=

Audio Jailbreak: An Open Comprehensive Benchmark for Jailbreaking Large Audio-Language Models , author=. 2025 , eprint=

work page 2025
[42]

Advances in neural information processing systems , volume=

Training language models to follow instructions with human feedback , author=. Advances in neural information processing systems , volume=

work page
[43]

2022 , eprint=

Constitutional AI: Harmlessness from AI Feedback , author=. 2022 , eprint=

work page 2022
[44]

2022 , eprint=

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback , author=. 2022 , eprint=

work page 2022
[45]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Pku-saferlhf: Towards multi-level safety alignment for llms with human preference , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

work page
[46]

Is dpo superior to ppo for llm alignment? a comprehensive study.arXiv preprint arXiv:2404.10719,

Is dpo superior to ppo for llm alignment? a comprehensive study , author=. arXiv preprint arXiv:2404.10719 , year=

work page arXiv
[47]

Advances in neural information processing systems , volume=

Direct preference optimization: Your language model is secretly a reward model , author=. Advances in neural information processing systems , volume=

work page
[48]

arXiv preprint arXiv:2405.13820 , year=

Towards comprehensive post safety alignment of large language models via safety patching , author=. arXiv preprint arXiv:2405.13820 , year=

work page arXiv
[49]

Findings of the Association for Computational Linguistics: ACL 2025 , pages=

Separate the wheat from the chaff: A post-hoc approach to safety re-alignment for fine-tuned language models , author=. Findings of the Association for Computational Linguistics: ACL 2025 , pages=

work page 2025
[50]

Findings of the Association for Computational Linguistics: NAACL 2024 , pages=

Ethos: Rectifying language models in orthogonal parameter space , author=. Findings of the Association for Computational Linguistics: NAACL 2024 , pages=

work page 2024
[51]

Advances in Neural Information Processing Systems , volume=

Beavertails: Towards improved safety alignment of llm via a human-preference dataset , author=. Advances in Neural Information Processing Systems , volume=

work page
[52]

Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

Attacks, defenses and evaluations for llm conversation safety: A survey , author=. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

work page 2024
[53]

arXiv preprint arXiv:2411.09259 , year=

Jailbreak attacks and defenses against multimodal generative models: A survey , author=. arXiv preprint arXiv:2411.09259 , year=

work page arXiv
[54]

2026 , eprint=

AudioJailbreak: Jailbreak Attacks against End-to-End Large Audio-Language Models , author=. 2026 , eprint=

work page 2026
[55]

Safe rlhf-v: Safe reinforcement learning from multi-modal human feedback.arXiv preprint arXiv:2503.17682, 2025

Safe rlhf-v: Safe reinforcement learning from multi-modal human feedback , author=. arXiv preprint arXiv:2503.17682 , year=

work page arXiv
[56]

Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

Playing the fool: Jailbreaking llms and multimodal llms with out-of-distribution strategy , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

work page
[57]

Jalmbench: Benchmarking jail- break vulnerabilities in audio language models,

Jalmbench: Benchmarking jailbreak vulnerabilities in audio language models , author=. arXiv preprint arXiv:2505.17568 , year=

work page arXiv
[58]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Sea: Low-resource safety alignment for multimodal large language models via synthetic embeddings , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

work page
[59]

Zheng, F

On prompt-driven safeguarding for large language models , author=. arXiv preprint arXiv:2401.18018 , year=

work page arXiv
[60] Revisiting Jailbreaking for Large Language Models: A Representation Engineering Perspective. 2025.
[61] AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint. arXiv preprint arXiv:2506.07022.
[62] Steering to Say No: Configurable Refusal via Activation Steering in Vision Language Models. 2026.
[63] Principled Steering via Null-space Projection for Jailbreak Defense in Vision-Language Models. 2026.
[64] VLM-Guard: Safeguarding Vision-Language Models via Fulfilling Safety Alignment Gap. 2025.
[65] OpenAI GPT-5 System Card. 2026.
[66] Universal and Transferable Adversarial Attacks on Aligned Language Models. 2023.
[67] Thomas Wolf and Lysandre Debut and Victor Sanh and Julien Chaumond and Clement Delangue and Anthony Moi and Pierric Cistac and Tim Rault and Rémi Louf and Morgan Funtowicz and Joe Davison and Sam Shleifer and Patrick von Platen and Clara Ma and Yacine Jernite and Julien Plu and Canwen Xu and Teven Le Scao and Sylvain Gugger and Mariama Drame and Quentin L... (2020)
[68] EasySteer: A Unified Framework for High-Performance and Extensible LLM Steering. arXiv preprint arXiv:2509.25175.

