arxiv: 2605.18104 · v1 · pith:WVE3KQN5new · submitted 2026-05-18 · 💻 cs.AI · cs.CR

Safety Geometry Collapse in Multimodal LLMs and Adaptive Drift Correction

Pith reviewed 2026-05-20 10:54 UTC · model grok-4.3

classification 💻 cs.AI cs.CR
keywords multimodal LLMs · safety geometry collapse · refusal direction · modality drift · drift correction · self-rectification · inference-time intervention · MLLM safety

The pith

Multimodal inputs compress separation along the refusal direction, causing safety geometry collapse that drift correction can reverse.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that multimodal large language models lose reliable refusal behavior on non-text inputs because those inputs induce a drift that shrinks the usable component along a text-aligned refusal direction. This compression is called safety geometry collapse and is tracked through conditional refusal separability, which declines as drift strengthens and attack success rises. The authors show that a fixed-strength activation intervention that cancels the estimated drift restores separability and triggers self-rectification, in which the model regains the capacity to recognize and refuse harmful content during its own forward pass. They then introduce ReGap, a training-free method that reads the self-rectification signal to correct drift adaptively at inference time. If the account is right, existing models can close the multimodal safety gap without retraining or loss of general capability.

Core claim

Multimodal inputs induce a drift direction that compresses the projection onto the text-aligned refusal direction, producing safety geometry collapse in which harmful and harmless inputs become harder to separate for refusal. Counteracting the drift through fixed-strength activation intervention restores conditional refusal separability; afterward the model exhibits self-rectification that supplies an internal harmfulness signal, which ReGap uses for adaptive, training-free correction.
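The geometric quantities named in the core claim can be sketched on toy data. This is a minimal illustration, not the paper's extraction procedure: it assumes the refusal direction is a difference of class means over text activations (the construction used in earlier single-direction refusal work) and that the drift is the gap between the mean multimodal and mean text representations. All function names and the 2-D toy activations are hypothetical, and the compression in the toy data is built in by hand purely to show how the projection gap would be measured.

```python
import math


def mean_vec(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sub(a, b):
    return [x - y for x, y in zip(a, b)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def refusal_direction(harmful_text, harmless_text):
    """ASSUMPTION: difference of class means on text inputs, normalized."""
    return unit(sub(mean_vec(harmful_text), mean_vec(harmless_text)))


def drift_vector(multimodal, text):
    """ASSUMPTION: mean multimodal activation minus mean text activation."""
    return sub(mean_vec(multimodal), mean_vec(text))


def projection_gap(harmful, harmless, direction):
    """Distance between class means after projecting onto `direction`."""
    return dot(mean_vec(harmful), direction) - dot(mean_vec(harmless), direction)


# Hypothetical 2-D activations. Axis 0 plays the role of the refusal axis.
harmful_text = [[2.0, 0.0], [2.2, 0.2]]
harmless_text = [[0.0, 0.0], [-0.2, 0.2]]
# Multimodal versions: shifted along axis 1 and squeezed together along axis 0.
harmful_mm = [[1.2, 3.0], [1.3, 3.2]]
harmless_mm = [[0.8, 3.0], [0.7, 3.2]]

r = refusal_direction(harmful_text, harmless_text)
d = drift_vector(harmful_mm + harmless_mm, harmful_text + harmless_text)
text_gap = projection_gap(harmful_text, harmless_text, r)
mm_gap = projection_gap(harmful_mm, harmless_mm, r)

print("refusal direction:", r)
print("drift vector:", d)
print("gap on text inputs:", text_gap)
print("gap on multimodal inputs:", mm_gap)
```

In this toy setting the multimodal gap along the refusal direction is smaller than the text gap, which is the pattern the page calls compression. Real activations would come from a chosen layer of the model, and the choice of layer is itself an assumption left open here.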

What carries the argument

A text-aligned refusal direction paired with a modality-induced drift direction: the drift is estimated and then subtracted through a fixed-strength activation intervention to restore geometric separability.
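The arithmetic of a fixed-strength intervention can be sketched for a single hidden state. This is an illustrative assumption about the form of the update (a constant step against the normalized drift estimate); the page does not state the paper's exact formula, the layers it applies to, or the strength it uses. `counteract_drift` and `strength` are hypothetical names.

```python
import math


def unit(v):
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("drift vector has zero norm")
    return [x / norm for x in v]


def counteract_drift(hidden, drift, strength):
    """Shift one hidden state by a fixed step against the estimated drift.

    ASSUMPTION: the update is h' = h - strength * drift / ||drift||.
    """
    d_hat = unit(drift)
    return [h - strength * u for h, u in zip(hidden, d_hat)]


hidden = [1.0, 3.0]   # hypothetical multimodal hidden state
drift = [0.0, 3.0]    # hypothetical estimated drift vector
corrected = counteract_drift(hidden, drift, strength=2.0)
unchanged = counteract_drift(hidden, drift, strength=0.0)

print("corrected:", corrected)
print("zero-strength:", unchanged)
```

A constant shift of this kind moves every input by the same vector, so it cannot change separation within the layer where it is applied. Any recovery of separability would have to arise in the nonlinear layers downstream, which this single-layer sketch does not model.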

If this is right

  • Conditional refusal separability recovers once the estimated drift is counteracted.
  • Self-rectification emerges as a reliable internal signal of the model's perceived harmfulness.
  • ReGap improves safety scores on multimodal benchmarks while preserving performance on utility tasks.
  • Representation-level modality alignment can be performed at inference time without parameter updates.
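To make the first point concrete, separability along the refusal direction can be scored in several ways. The page does not give the paper's definition of conditional refusal separability, so the sketch below swaps in a stand-in: the AUROC between projection scores of harmful inputs the model refused and harmful inputs it complied with. The function name and the toy scores are hypothetical.

```python
def auroc(refused_scores, complied_scores):
    """Probability that a refused input scores above a complied one.

    Ties count as half a win. 1.0 means perfectly separable,
    0.5 means the two groups are indistinguishable along this axis.
    """
    wins = 0.0
    for r in refused_scores:
        for c in complied_scores:
            if r > c:
                wins += 1.0
            elif r == c:
                wins += 0.5
    return wins / (len(refused_scores) * len(complied_scores))


# Hypothetical projections onto the refusal direction (harmful inputs only).
low_drift = auroc([2.0, 1.8, 2.1], [0.5, 0.7, 0.4])
high_drift = auroc([1.0, 1.2, 0.9], [1.1, 0.8, 1.0])

print("separability under low drift:", low_drift)
print("separability under high drift:", high_drift)
```

Under this stand-in metric, recovery after drift correction would show up as the high-drift score moving back toward 1.0.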

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Safety mechanisms may depend more on maintaining geometric alignment across modalities than on the content of training data alone.
  • Similar drift-induced collapses could affect other alignment properties such as factual consistency or bias detection in multimodal settings.
  • Adaptive correction using internal signals might generalize to dynamic, multi-turn interactions that mix text with other modalities.

Load-bearing premise

A single stable refusal direction aligned with text remains identifiable and meaningful once modality-induced drift appears, and drift is the main factor whose correction restores separability.

What would settle it

Measure refusal rates on harmful multimodal inputs before and after the fixed drift-counteracting intervention; if rates fail to rise or if safe inputs begin to be refused, the claimed causal role of drift in safety geometry collapse is undermined.
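The check described above reduces to comparing two refusal rates before and after the intervention. The sketch below is one way to tabulate it, assuming per-input boolean refusal labels are already available from some judge; the function names, the `tolerance` parameter, and the toy labels are all hypothetical.

```python
def refusal_rate(refused_flags):
    """Fraction of inputs the model refused."""
    return sum(1 for flag in refused_flags if flag) / len(refused_flags)


def drift_check(harmful_before, harmful_after,
                benign_before, benign_after, tolerance=0.0):
    """Compare refusal rates before and after the drift intervention.

    ASSUMPTION: the drift account is treated as supported only if refusals
    on harmful inputs rise and refusals on benign inputs rise by no more
    than `tolerance`.
    """
    gain = refusal_rate(harmful_after) - refusal_rate(harmful_before)
    rise = refusal_rate(benign_after) - refusal_rate(benign_before)
    return {
        "harmful_refusal_gain": gain,
        "benign_refusal_rise": rise,
        "consistent_with_drift_account": gain > 0.0 and rise <= tolerance,
    }


# Hypothetical labels: True means the model refused that input.
result = drift_check(
    harmful_before=[False, False, True, False],
    harmful_after=[True, True, True, False],
    benign_before=[False, False, False, False],
    benign_after=[False, False, False, False],
)
print(result)
```

A real evaluation would also need confidence intervals on both rates, since small calibration splits can make a modest gain look decisive.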

Figures

Figures reproduced from arXiv: 2605.18104 by Bing Qin, Dandan Tu, Jiahe Guo, Jiaxuan Chen, Qianchao Wang, Weixiang Zhao, Xiangran Guo, Yanyan Zhao, Yutai Hou.

Figure 1: Geometric view of multimodal safety. Left: Text-aligned refusal geometry. Middle: Modality-induced drift and Safety Geometry Collapse in MLLMs. Right: Intervention against modality-induced drift, self-rectification dynamics, and ReGap.
Figure 2: Visualization of the multimodal safety space at middle layers of MiniCPM-o-4.5 [Yi et al., 2025]. Multimodal inputs share a text-aligned refusal direction, but larger modality-induced drift makes refused and complied harmful inputs increasingly entangled along this direction.
Figure 3: Quantification of Safety Geometry Collapse across three models on the calibration split, …
Figure 4: Effect of fixed-strength intervention against modality-induced drift. The x-axis denotes …
Figure 5: Effect of drift intervention. Left: Representation shift in the multimodal safety space with intervention. Right: Layer-wise self-rectification scores for harmful and benign inputs.
Figure 6: Refusal prompt for conditional ASR.
Figure 7: Understanding prompt for conditional ASR.
Figure 8: Unified rule-based evaluation pipelines for OmniBench, MMMU-Pro, and MMAU.
Figure 9: Pairwise cosine similarities between modality-specific refusal directions estimated on Omni …
Original abstract

Multimodal large language models (MLLMs) often fail to transfer safety capabilities learned in the text modality to semantically equivalent non-text inputs, revealing a persistent multimodal safety gap. We study this gap from a representation-geometric perspective by analyzing a text-aligned refusal direction and a modality-induced drift direction. We show that multimodal inputs compress the usable separation along the refusal direction, making it no longer reliable for identifying and refusing harmful inputs. We refer to this failure mode as Safety Geometry Collapse. We quantify it through conditional refusal separability and show that stronger modality-induced drift is consistently associated with weaker refusal separability and higher attack success rates. We then validate the causal role of modality-induced drift through a fixed-strength activation intervention: counteracting the estimated drift restores refusal separability and improves multimodal safety. After drift correction, we further observe self-rectification, where the model recovers its ability to recognize and refuse harmful multimodal inputs during forward dynamics. This effect also provides an internal signal of the model's perceived harmfulness of each input. Motivated by this signal, we propose ReGap, a training-free inference-time method that adaptively corrects modality drift using self-rectification. Experiments across multiple multimodal safety benchmarks and utility benchmarks demonstrate the effectiveness of ReGap, which significantly improves the safety of MLLMs without compromising general capabilities. Our findings highlight representation-level modality alignment as a crucial direction for real-time safety improvement and for building safer, more reliable MLLMs.
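The abstract describes adaptive correction only at a high level, so the following is a loose sketch of what "adaptive" could mean, not ReGap's actual rule. It assumes a per-input self-rectification score in [0, 1] is available and that the correction strength scales linearly with it, so inputs the model perceives as benign are barely shifted. The names `adaptive_strength`, `adaptive_correct`, `signal`, and `max_strength` are hypothetical.

```python
import math


def unit(v):
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("drift vector has zero norm")
    return [x / norm for x in v]


def adaptive_strength(signal, max_strength):
    """ASSUMPTION: strength grows linearly with the harmfulness signal."""
    clipped = min(1.0, max(0.0, signal))
    return max_strength * clipped


def adaptive_correct(hidden, drift, signal, max_strength):
    """Shift a hidden state against the drift by a signal-dependent step."""
    alpha = adaptive_strength(signal, max_strength)
    d_hat = unit(drift)
    return [h - alpha * u for h, u in zip(hidden, d_hat)]


hidden = [1.0, 3.0]   # hypothetical hidden state
drift = [0.0, 3.0]    # hypothetical estimated drift vector

benign_like = adaptive_correct(hidden, drift, signal=0.0, max_strength=2.0)
borderline = adaptive_correct(hidden, drift, signal=0.5, max_strength=2.0)
harmful_like = adaptive_correct(hidden, drift, signal=1.0, max_strength=2.0)

print(benign_like, borderline, harmful_like)
```

The design intent of such a rule would be to leave benign inputs close to untouched, which is one plausible route to the utility preservation the abstract reports. How the paper actually maps the signal to a correction is not stated on this page.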

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript claims that multimodal large language models exhibit 'Safety Geometry Collapse,' where multimodal inputs compress the separation along a text-aligned refusal direction, reducing its reliability for detecting and refusing harmful content. This is quantified via conditional refusal separability, shown to correlate with modality-induced drift strength and attack success rates. The causal role of drift is validated using a fixed-strength activation intervention that counteracts the drift, restoring separability and triggering self-rectification. Motivated by this, the authors propose ReGap, a training-free inference-time method for adaptive drift correction, which improves safety on multiple benchmarks without compromising utility.

Significance. If the geometric analysis and causal validation hold, this paper provides a valuable representation-level insight into the multimodal safety gap and a practical mitigation strategy. The identification of self-rectification as an internal signal is a strength, and the training-free nature of ReGap makes it immediately applicable. This could influence future work on aligning representations across modalities for safety.

major comments (1)
  1. [Causal validation via activation intervention] The fixed-strength activation intervention used to validate the causal role of modality-induced drift lacks specificity controls such as random directions of matched magnitude or zero-strength baselines. Without these, the observed restoration of refusal separability could be due to generic steering effects rather than targeted drift correction, which is central to supporting the primary causal claim and the motivation for ReGap.
minor comments (2)
  1. [Abstract and experimental results] The abstract and methods description lack details on dataset sizes, statistical tests, and error bars for the reported improvements in separability and attack success rates, which would help assess the robustness of the findings.
  2. [Representation analysis] Clarify the exact definition and extraction method for the text-aligned refusal direction and the modality-induced drift direction, perhaps with an equation reference, to make the geometric claims more precise.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive evaluation of our geometric analysis, the identification of self-rectification, and the practical value of the training-free ReGap method. We address the single major comment below with a commitment to strengthen the causal evidence.

point-by-point responses
  1. Referee: [Causal validation via activation intervention] The fixed-strength activation intervention used to validate the causal role of modality-induced drift lacks specificity controls such as random directions of matched magnitude or zero-strength baselines. Without these, the observed restoration of refusal separability could be due to generic steering effects rather than targeted drift correction, which is central to supporting the primary causal claim and the motivation for ReGap.

    Authors: We agree that explicit specificity controls are necessary to rule out generic steering. The intervention is constructed by estimating the modality-induced drift vector (difference between multimodal and text-aligned representations) and applying a fixed-magnitude correction in the opposing direction; however, we acknowledge that this alone does not fully isolate the effect. In the revised manuscript we will add two controls: (1) a zero-strength baseline (no activation added, corresponding to the original model) and (2) interventions along random unit vectors scaled to the same magnitude as the drift correction. These will be evaluated on the same refusal-separability and safety metrics. Preliminary internal checks indicate that random directions produce negligible restoration compared with the drift-specific correction; we will report the quantitative differences and statistical significance in the revision.

    revision: yes
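The two controls named in the response can be generated as follows. This is a sketch under stated assumptions: random directions are drawn by normalizing Gaussian samples, which gives directions uniform on the sphere, and each is scaled to the same step size as the drift correction. The function names, `strength`, and the seed are hypothetical, and nothing here reproduces the authors' preliminary checks.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Uniformly random direction: normalize a standard Gaussian sample."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def control_shifts(dim, strength, n_random, seed=0):
    """Build the zero-strength baseline and matched-magnitude random shifts."""
    rng = random.Random(seed)
    zero_baseline = [0.0] * dim
    random_controls = [
        [strength * x for x in random_unit_vector(dim, rng)]
        for _ in range(n_random)
    ]
    return zero_baseline, random_controls


zero_baseline, random_controls = control_shifts(dim=8, strength=2.0, n_random=5)
norms = [math.sqrt(sum(x * x for x in shift)) for shift in random_controls]

print("zero baseline:", zero_baseline)
print("random control norms:", norms)
```

Each control shift would be subtracted from the hidden state in place of the drift correction and scored with the same separability and safety metrics. If random shifts restored separability about as well as the drift-specific one, the generic-steering explanation raised by the referee would remain open.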

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's derivation proceeds from empirical identification of a text-aligned refusal direction, measurement of its compression under multimodal inputs via conditional refusal separability, observed correlations with modality-induced drift, and a fixed-strength intervention to test causality, followed by motivation of the ReGap method from the resulting self-rectification signal. None of these steps reduce by construction to the inputs: separability and drift are quantified as independent geometric quantities, the intervention is an external manipulation rather than a tautological restatement, and no self-citations or fitted parameters are invoked as load-bearing premises for the central claims. The chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that a stable refusal direction exists in representation space and that modality drift is the dominant cause of collapse; no free parameters or invented physical entities are declared in the abstract.

axioms (1)
  • domain assumption: A text-aligned refusal direction can be identified and remains a meaningful axis once multimodal inputs are introduced.
    Invoked when defining Safety Geometry Collapse and when performing the fixed-strength activation intervention.
invented entities (1)
  • Safety Geometry Collapse (no independent evidence)
    purpose: Label for the observed compression of refusal separability.
    New descriptive term for the failure mode; no independent falsifiable prediction is supplied in the abstract.

pith-pipeline@v0.9.0 · 5817 in / 1294 out tokens · 38310 ms · 2026-05-20T10:54:34.338561+00:00 · methodology

