Same Model, Different Weakness: How Language and Modality Reshape the Jailbreak Attack Surface in Frontier MLLMs

Casey Ford; Emily Dix; Madison Van Doren; Sicheng Jin

arxiv: 2605.23157 · v1 · pith:GKEVWVMInew · submitted 2026-05-22 · 💻 cs.CL

Same Model, Different Weakness: How Language and Modality Reshape the Jailbreak Attack Surface in Frontier MLLMs

Casey Ford , Madison Van Doren , Sicheng Jin , Emily Dix This is my paper

Pith reviewed 2026-05-25 05:01 UTC · model grok-4.3

classification 💻 cs.CL

keywords jailbreak attacksmultimodal LLMscross-lingual evaluationalignment failuresred-teaminglanguage effectsvisual attacksmodel vulnerability

0 comments

The pith

Switching from English to Spanish changes which jailbreak attacks succeed on the same frontier MLLMs

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that jailbreak vulnerability in multimodal large language models is not uniform across languages. Linguistic framing attacks such as role-play lose effectiveness under Spanish prompting while visually explicit multimodal attacks gain effectiveness. A sympathetic reader would care because this dissociation shows that language and vision failures arise through separate mechanisms rather than a single model property. Safety rankings therefore reverse between English and Spanish conditions, and no scalar adjustment of English scores recovers the Spanish outcomes. The result indicates that current evaluation frameworks treating language and modality as independent dimensions misspecify the attack surface of globally deployed models.

Core claim

The attack surface of a multimodal large language model is language-dependent in ways that reveal the mechanistic structure of alignment failures. The central finding is that language does not scale vulnerability uniformly. Bayesian mixed-effects analyses reveal that linguistic framing attacks such as role-play become substantially less effective under Spanish prompting, while visually explicit multimodal attacks become more effective, which directly implicates the prompt-language interface rather than global annotator leniency. This dissociation indicates that linguistic and visual alignment failures operate through distinct mechanisms, and that switching language is sufficient to expose<f|

What carries the argument

Bayesian mixed-effects analyses comparing attack success rates on a fixed 363-prompt benchmark in text-only and multimodal conditions across English and Spanish native-speaker panels

If this is right

Safety rankings are not preserved across languages, with Qwen Omni overtaking Pixtral Large as most vulnerable under Spanish prompting.
Absolute attack success rates decline across model generations without closing gaps between models.
Linguistic and visual alignment failures operate through distinct mechanisms exposed by language switching.
Safety evaluation frameworks must treat language and modality as interacting dimensions rather than independent ones.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers may need to run safety tests in multiple languages to surface language-specific weaknesses before global release.
Alignment training could incorporate language-specific data to address the observed dissociation between linguistic and visual attacks.
The pattern might generalize to other language pairs, suggesting that English-only benchmarks systematically understate certain risks.
Redesigning benchmarks to include matched multimodal conditions across languages could change which models are considered safest.

Load-bearing premise

The fixed 363-prompt benchmark and matched native-speaker annotator panels are sufficient to isolate prompt-language effects from cultural, lexical, or rating-scale differences between the two language groups.

What would settle it

A replication study with a different prompt set or annotator panels that finds uniform attack success rates across English and Spanish conditions would falsify the claim of language-dependent vulnerability.

Figures

Figures reproduced from arXiv: 2605.23157 by Casey Ford, Emily Dix, Madison Van Doren, Sicheng Jin.

**Figure 2.** Figure 2: Annotating responses from LLM models tested [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Distribution of harmfulness rating of multimodal prompt output across languages. [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Annotator disagreement matrices overall and by language group. Rows = assigned rating; [PITH_FULL_IMAGE:figures/full_fig_p013_4.png] view at source ↗

**Figure 5.** Figure 5: Distribution of harmfulness of model responses overall. [PITH_FULL_IMAGE:figures/full_fig_p014_5.png] view at source ↗

**Figure 6.** Figure 6: Pairwise model contrasts for ASR. Error bars = CrI (95%). Grey = Not Credible; Blue = [PITH_FULL_IMAGE:figures/full_fig_p014_6.png] view at source ↗

**Figure 7.** Figure 7: Pairwise model contrasts for harm severity (overall). Error bars = CrI (95%). Grey = Not [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗

**Figure 8.** Figure 8: Descriptive ASR by model and language [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗

**Figure 9.** Figure 9: Pairwise model contrasts for ASR by language. Error bars = CrI (95%). Grey = Not [PITH_FULL_IMAGE:figures/full_fig_p016_9.png] view at source ↗

**Figure 10.** Figure 10: Language contrasts (es-MX vs en-US) within each model for harm severity. Error bars = [PITH_FULL_IMAGE:figures/full_fig_p017_10.png] view at source ↗

**Figure 11.** Figure 11: Pairwise model contrasts for harm severity by language. Error bars = CrI (95%). Grey = [PITH_FULL_IMAGE:figures/full_fig_p017_11.png] view at source ↗

**Figure 12.** Figure 12: Pairwise model contrasts for ASR by modality. Error bars = CrI (95%). Grey = Not [PITH_FULL_IMAGE:figures/full_fig_p018_12.png] view at source ↗

**Figure 13.** Figure 13: Modality contrasts (text-only vs multimodal) within each model for harm severity. Error [PITH_FULL_IMAGE:figures/full_fig_p018_13.png] view at source ↗

**Figure 14.** Figure 14: Descriptive ASR by model, modality, and language. [PITH_FULL_IMAGE:figures/full_fig_p020_14.png] view at source ↗

**Figure 15.** Figure 15: Pairwise model contrasts for ASR by language and modality. Error bars = CrI (95%). [PITH_FULL_IMAGE:figures/full_fig_p020_15.png] view at source ↗

**Figure 16.** Figure 16: Language contrasts within each model and modality for harm severity. Error bars = CrI [PITH_FULL_IMAGE:figures/full_fig_p021_16.png] view at source ↗

**Figure 17.** Figure 17: Pairwise attack technique contrasts for ASR. Error bars = CrI (95%). Grey = Not Credible; [PITH_FULL_IMAGE:figures/full_fig_p022_17.png] view at source ↗

**Figure 18.** Figure 18: Pairwise attack technique contrasts for ASR by language. Error bars = CrI (95%). Grey = [PITH_FULL_IMAGE:figures/full_fig_p023_18.png] view at source ↗

**Figure 19.** Figure 19: Pairwise attack execution contrasts for ASR overall. Error bars = CrI (95%). Grey = Not [PITH_FULL_IMAGE:figures/full_fig_p024_19.png] view at source ↗

**Figure 20.** Figure 20: Pairwise attack execution comparison by language. Error bars = CrI (95%). Faded = Not [PITH_FULL_IMAGE:figures/full_fig_p025_20.png] view at source ↗

**Figure 21.** Figure 21: Pairwise harm category comparison on ASR overall. Error bars = CrI (95%). Faded = Not [PITH_FULL_IMAGE:figures/full_fig_p025_21.png] view at source ↗

**Figure 22.** Figure 22: Pairwise harm category comparison on ASR by language. Error bars = CrI (95%). Faded [PITH_FULL_IMAGE:figures/full_fig_p026_22.png] view at source ↗

read the original abstract

The attack surface of a multimodal large language model (MLLM) is language-dependent in ways that reveal the mechanistic structure of alignment failures. We present the first systematic cross-lingual, multimodal red-teaming study comparing jailbreak vulnerability in US English (en-US) and Mexican Spanish (es-MX) across four frontier MLLMs: Claude Sonnet 4.5, GPT-5, Pixtral Large, and Qwen Omni. Using a fixed adversarial benchmark of 363 diverse prompt scenarios administered in text-only and multimodal conditions, we collected 52,272 harm ratings and binary attack success judgements from matched panels of nine native-speaker annotators per language group. Our central finding is that language does not scale vulnerability uniformly. Bayesian mixed-effects analyses reveal that linguistic framing attacks such as role-play become substantially less effective under Spanish prompting, while visually explicit multimodal attacks become more effective, which directly implicates the prompt-language interface rather than global annotator leniency. This dissociation indicates that linguistic and visual alignment failures operate through distinct mechanisms, and that switching language is sufficient to expose that separation. The practical consequence is that safety rankings are not preserved across languages. Qwen Omni overtakes Pixtral Large as the most vulnerable model among es-MX participants, a rank reversal no scalar correction of English-condition scores could recover, and absolute attack success rates have declined across model generations without closing the gaps between them. These findings demonstrate that safety evaluation frameworks treating language and modality as independent dimensions fundamentally misspecify the attack surface of globally deployed MLLMs, and must be redesigned accordingly.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Language flips which jailbreak types succeed on frontier MLLMs and reverses model safety rankings between Qwen and Pixtral.

read the letter

Language changes the attack surface in these models more than we expected. Role-play attacks lose ground in Spanish while visual ones gain it, and that flips which model looks worse between Qwen Omni and Pixtral Large under es-MX prompting. The work stands out for running the same 363 prompts in English and Mexican Spanish on four current MLLMs, with text-only and multimodal versions. They collected 52k ratings from nine native annotators per language group and applied Bayesian mixed-effects models to test for language-by-attack-type interactions. The rank reversal is a concrete result that English-only tests would miss, and the drop in absolute success rates across model generations is useful context. The analysis tries to separate prompt-language effects from global leniency by modeling the interaction term, which is a reasonable approach on the data they have. The main soft spot is whether cultural or rating-scale differences between the US English and Mexican Spanish panels could generate the same dissociation without any change in the models themselves. Nine matched native speakers is decent but the fixed prompt set and lack of cross-language calibration leave room for that alternative. The paper claims the pattern implicates the prompt-language interface, yet the abstract alone does not show the full model specs or exclusion rules needed to judge how cleanly that holds. This is for teams working on multilingual safety evaluation or red-teaming of deployed MLLMs. The question matters for any system used outside English, even if the evidence needs tighter checks on annotator effects. It deserves peer review so the methods and raw data can be examined directly.

Referee Report

3 major / 2 minor

Summary. The manuscript reports results from a cross-lingual, multimodal red-teaming study on four frontier MLLMs (Claude Sonnet 4.5, GPT-5, Pixtral Large, Qwen Omni). Using a fixed set of 363 adversarial prompts administered in text-only and multimodal conditions, it collects 52,272 harm ratings and binary success judgments from nine native-speaker annotators per language (en-US vs. es-MX). Bayesian mixed-effects models are used to test for language-by-attack-type interactions; the central claim is that linguistic framing attacks (e.g., role-play) become less effective under Spanish prompting while visually explicit multimodal attacks become more effective, producing model rank reversals (Qwen Omni overtakes Pixtral Large in es-MX) that cannot be recovered by scalar correction of English scores. The authors interpret this dissociation as evidence that linguistic and visual alignment failures operate through distinct mechanisms at the prompt-language interface.

Significance. If the reported language-by-attack-type interaction survives controls for rater cultural differences, the work would be significant for AI safety evaluation. It supplies the first large-scale, matched-panel evidence that safety rankings are not preserved across languages and that treating language and modality as independent dimensions misspecifies the attack surface. The scale of the annotation effort (52k ratings) and use of Bayesian mixed-effects modeling to detect interactions are methodological strengths that could support falsifiable claims about alignment mechanisms.

major comments (3)

[Methods] Methods section: the manuscript provides no description of the random-effects structure, factor coding for language and attack type, priors, or convergence diagnostics for the Bayesian mixed-effects models. Because the central claim rests on a language-by-attack-type interaction being attributable to model mechanisms rather than annotator differences, the absence of these specifications prevents verification that the reported dissociation is robust.
[Methods] Methods section: no cross-language calibration procedures (bilingual raters, shared anchoring prompts, or inter-language reliability checks) are described. The fixed 363-prompt benchmark and matched native-speaker panels alone do not rule out systematic cultural differences in harm perception for role-play versus visual explicitness; this alternative directly threatens the interpretation that the dissociation implicates the prompt-language interface.
[Results] Results section: the claim that 'no scalar correction of English-condition scores could recover' the rank reversal is presented without the actual correction procedure, sensitivity analysis, or quantitative comparison of effect sizes across languages. This detail is load-bearing for the practical consequence that safety rankings are language-dependent.

minor comments (2)

[Abstract] Abstract: the sentence reporting the central finding is long and contains multiple clauses; splitting it would improve readability while preserving the dissociation claim.
[Figures] Figure captions: several figures showing attack success rates by language and modality lack error bars or credible intervals from the mixed-effects models, making visual assessment of the interaction difficult.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and indicate planned revisions to improve methodological transparency and interpretive caution.

read point-by-point responses

Referee: [Methods] Methods section: the manuscript provides no description of the random-effects structure, factor coding for language and attack type, priors, or convergence diagnostics for the Bayesian mixed-effects models. Because the central claim rests on a language-by-attack-type interaction being attributable to model mechanisms rather than annotator differences, the absence of these specifications prevents verification that the reported dissociation is robust.

Authors: We agree that these specifications are necessary for verification. In the revised manuscript we will add a dedicated Methods subsection specifying: random intercepts for annotator and prompt (with attack-type slopes by annotator), sum-to-zero coding for the two-level language factor and the attack-type factor, weakly informative normal(0, 1) priors on fixed-effect coefficients, and convergence diagnostics (all R-hat < 1.01, bulk ESS > 1000). This addition will directly address the concern that the interaction could be driven by annotator variance. revision: yes
Referee: [Methods] Methods section: no cross-language calibration procedures (bilingual raters, shared anchoring prompts, or inter-language reliability checks) are described. The fixed 363-prompt benchmark and matched native-speaker panels alone do not rule out systematic cultural differences in harm perception for role-play versus visual explicitness; this alternative directly threatens the interpretation that the dissociation implicates the prompt-language interface.

Authors: The referee correctly identifies a limitation: we did not employ bilingual raters, shared anchoring prompts, or explicit inter-language reliability metrics. The design uses native-speaker panels and annotator-level random effects, but these do not fully isolate cultural differences in harm perception. In revision we will add an explicit Limitations paragraph acknowledging this alternative explanation and stating that the observed dissociation is consistent with, but not definitive proof of, prompt-language interface mechanisms. We cannot retroactively add calibration data, so the revision will be textual only. revision: partial
Referee: [Results] Results section: the claim that 'no scalar correction of English-condition scores could recover' the rank reversal is presented without the actual correction procedure, sensitivity analysis, or quantitative comparison of effect sizes across languages. This detail is load-bearing for the practical consequence that safety rankings are language-dependent.

Authors: We will expand the Results section to describe the scalar correction: English success rates were shifted by the mean language main effect (and separately by attack-type-specific language effects), after which model rankings were recomputed; the Qwen–Pixtral reversal persisted. We will also report sensitivity checks (shifts of ±1 SD of the language effect) and language-specific effect sizes (posterior means and 95% CrIs for the language-by-attack interaction terms). These details will be added to support the claim. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical data collection with standard statistical analysis.

full rationale

This is a new data collection study using external human raters and Bayesian mixed-effects models on collected ratings. No equations, derivations, or self-citations reduce the reported language-by-attack-type interactions to quantities fitted from the same data or prior author work. The central claims derive from the 52,272 new ratings rather than any self-referential construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the validity of the annotation protocol and the Bayesian mixed-effects model; no free parameters, ad-hoc axioms, or new entities are introduced beyond standard statistical practice.

axioms (1)

domain assumption Bayesian mixed-effects models can isolate prompt-language effects from rater leniency when panels are matched by native-speaker status
Invoked to support the claim that differences reflect the language interface rather than global annotator differences

pith-pipeline@v0.9.0 · 5830 in / 1295 out tokens · 26389 ms · 2026-05-25T05:01:33.323171+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Bayesian mixed-effects analyses reveal that linguistic framing attacks such as role-play become substantially less effective under Spanish prompting, while visually explicit multimodal attacks become more effective
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

The attack surface of a multimodal large language model (MLLM) is language-dependent

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages

[1]

Red teaming multimodal language models: Evaluating harm across prompt modalities and models

Madison Van Doren and Casey Ford. Red teaming multimodal language models: Evaluating harm across prompt modalities and models. InAAAI 2026 Workshop on AI Governance and EurIPS 2025 Workshop on Unifying Perspectives on Learning Biases, 2025. arXiv:2509.15478

work page arXiv 2026
[2]

M ¯atr.k¯a: Multilingual jailbreak evaluation of open-source large language models

Murali Emani et al. M ¯atr.k¯a: Multilingual jailbreak evaluation of open-source large language models. In Proceedings of the 1st Workshop on Benchmarks, Harmonization, Annotation, and Standardization for Human-Centric AI in Indian Languages (BHASHA 2025), pages 117–121, 2025

work page 2025
[3]

Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. Realtoxicityprompts: Evaluating neural toxic degeneration in language models. InFindings of the Association for Computational Linguistics: EMNLP, pages 3356–3369, 2020

work page 2020
[4]

Process for adapting language models to society (palms) with values-targeted datasets

Irene Solaiman and Christy Dennison. Process for adapting language models to society (palms) with values-targeted datasets. InAdvances in Neural Information Processing Systems (NeurIPS), 2021

work page 2021
[5]

Ethical and social risks of harm from language models

Laura Weidinger et al. Ethical and social risks of harm from language models. Technical report, Google DeepMind, 2021

work page 2021
[6]

Query-based adversarial prompt generation

Jonathan Hayase, Ema Borevkovic, Nicholas Carlini, Florian Tramèr, and Milad Nasr. Query-based adversarial prompt generation. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

work page 2024
[7]

Generative language models exhibit social identity biases.Nature Computational Science, 5(1):65–75, 2024

Tiancheng Hu, Yara Kyrychenko, Steve Rathje, Nigel Collier, Sander Van Der Linden, and Jon Roozenbeek. Generative language models exhibit social identity biases.Nature Computational Science, 5(1):65–75, 2024

work page 2024
[8]

Realistic evaluation of toxicity in language models

Tinh Luong, Thanh-Thien Le, Linh Ngo, and Thien Nguyen. Realistic evaluation of toxicity in language models. InFindings of the Association for Computational Linguistics (ACL), pages 1038–1047, 2024

work page 2024
[9]

Adversarial attacks and defenses in large language models: Old and new threats.arXiv preprint arXiv:2310.19737, 2023

Leo Schwinn, David Dobre, Stephan Günnemann, and Gauthier Gidel. Adversarial attacks and defenses in large language models: Old and new threats.arXiv preprint arXiv:2310.19737, 2023

work page arXiv 2023
[10]

Survey of vulnerabilities in large language models revealed by adversarial attacks.arXiv preprint arXiv:2310.10844, 2023

Erfan Shayegani, Md Abdullah Al Mamun, Yu Fu, Pedram Zaree, Yue Dong, and Nael Abu-Ghazaleh. Survey of vulnerabilities in large language models revealed by adversarial attacks.arXiv preprint arXiv:2310.10844, 2023

work page arXiv 2023
[11]

Safety is not only about refusal: Reasoning-enhanced fine-tuning for interpretable llm safety

Yuyou Zhang, Miao Li, William Han, Yihang Yao, Zhepeng Cen, and Ding Zhao. Safety is not only about refusal: Reasoning-enhanced fine-tuning for interpretable llm safety. InFindings of the Association for Computational Linguistics (ACL), pages 18727–18746, 2025

work page 2025
[12]

Humorreject: Decoupling llm safety from refusal prefix via a little humor

Zihui Wu, Haichang Gao, Jiacheng Luo, and Zhaoxiang Liu. Humorreject: Decoupling llm safety from refusal prefix via a little humor. InProceedings of the 40th Annual AAAI Conference on Artificial Intelligence, 2025

work page 2025
[13]

Adjacent words, divergent intents: Jailbreaking large language models via task concurrency, 2025

Yukun Jiang, Mingjie Li, Michael Backes, and Yang Zhang. Adjacent words, divergent intents: Jailbreaking large language models via task concurrency, 2025

work page 2025
[14]

Foot-in-the-door: A multi-turn jailbreak for llms

Zixuan Weng, Xiaolong Jin, Jinyuan Jia, and Xiangyu Zhang. Foot-in-the-door: A multi-turn jailbreak for llms. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 1939–1950, 2025

work page 2025
[15]

Large reasoning models are autonomous jailbreak agents

Thilo Hagendorff, Erik Derner, and Nuria Oliver. Large reasoning models are autonomous jailbreak agents. Nature Communications, 2026

work page 2026
[16]

Enhancing jailbreak attacks on llms via persona prompts

Zheng Zhang, Peilin Zhao, Deheng Ye, and Hao Wang. Enhancing jailbreak attacks on llms via persona prompts. InNeurIPS Workshop on LLM Persona Modeling, 2025

work page 2025
[17]

Agentharm: A benchmark for measuring harmfulness of llm agents

Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, Zico Kolter, Matt Fredrikson, Eric Winsor, Jerome Wynne, Yarin Gal, and Xander Davies. Agentharm: A benchmark for measuring harmfulness of llm agents. InProceedings of the International Conference on Learning Representations (ICLR), 2025

work page 2025
[18]

Multilingual jailbreak challenges in large language models.arXiv preprint arXiv:2310.06474, 2023

Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, and Lidong Bing. Multilingual jailbreak challenges in large language models.arXiv preprint arXiv:2310.06474, 2023

work page arXiv 2023
[19]

The language barrier: Dissecting safety challenges of llms in multilingual contexts

Lingfeng Shen, Weiting Tan, Sihao Chen, Yunmo Chen, Jingyu Zhang, Haoran Xu, Boyuan Zheng, Philipp Koehn, and Daniel Khashabi. The language barrier: Dissecting safety challenges of llms in multilingual contexts. InFindings of the Association for Computational Linguistics: ACL 2024, pages 2668–2680, 2024. 10

work page 2024
[20]

Zheng-Xin Yong, Cristina Menghini, and Stephen H. Bach. Low-resource languages jailbreak gpt-4, 2024

work page 2024
[21]

Images are achilles’ heel of alignment: Exploiting visual vulnerabilities for jailbreaking multimodal large language models

Yifan Li, Hangyu Guo, Kun Zhou, Wayne Xin Zhao, and Ji-Rong Wen. Images are achilles’ heel of alignment: Exploiting visual vulnerabilities for jailbreaking multimodal large language models. In Proceedings of the 18th European Conference on Computer Vision (ECCV), 2025

work page 2025
[22]

Jailbreaking attack against multimodal large language model.arXiv preprint arXiv:2402.02309, 2024

Zhenxing Niu, Haodong Ren, Xinbo Gao, Gang Hua, and Rong Jin. Jailbreaking attack against multimodal large language model.arXiv preprint arXiv:2402.02309, 2024

work page arXiv 2024
[23]

Youze Wang, Wenbo Hu, Yinpeng Dong, Jing Liu, Hanwang Zhang, and Richang Hong. Align is not enough: Multimodal universal jailbreak attack against multimodal large language models.IEEE Transactions on Circuits and Systems for Video Technology, 35(6):5475–5488, 2025

work page 2025
[24]

Distraction is all you need for multimodal large language model jailbreaking

Zuopeng Yang, Jiluan Fan, Anli Yan, Erdun Gao, Xin Lin, Tao Li, Kanghua Mo, and Changyu Dong. Distraction is all you need for multimodal large language model jailbreaking. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

work page 2026
[25]

Stylebreak: Revealing alignment vulnerabilities in large audio-language models via style-aware audio jailbreak

Hongyi Li, Chengxuan Zhou, Chu Wang, Sicheng Liang, Yanting Chen, Qinlin Xie, Jiawei Ye, and Jie Wu. Stylebreak: Revealing alignment vulnerabilities in large audio-language models via style-aware audio jailbreak. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 37591–37599, 2026

work page 2026
[26]

Beyond words: Multilingual and multimodal red teaming of mllms

Erik Derner and Kristina Batisti ˇc. Beyond words: Multilingual and multimodal red teaming of mllms. InProceedings of the First Workshop on LLM Security (LLMSEC), pages 198–206. Association for Computational Linguistics, 2025

work page 2025
[27]

Attack via overfitting: 10-shot benign fine-tuning to jailbreak llms

Zhixin Xie, Xurui Song, and Jun Luo. Attack via overfitting: 10-shot benign fine-tuning to jailbreak llms. InAdvances in Neural Information Processing Systems (NeurIPS), 2025

work page 2025
[28]

Claude sonnet 4.5

Anthropic. Claude sonnet 4.5. Technical report, Anthropic, 2025

work page 2025
[29]

Gpt-5 system card

OpenAI. Gpt-5 system card. Technical report, OpenAI, 2025

work page 2025
[30]

Pixtral large

Mistral AI. Pixtral large. 2024

work page 2024
[31]

Qwen2.5 technical report

Qwen Team. Qwen2.5 technical report. 2025

work page 2025
[32]

Claude 3.5 sonnet

Anthropic. Claude 3.5 sonnet. Technical report, Anthropic, 2024

work page 2024
[33]

Gpt-4o system card

OpenAI, Aaron Hurst, Adam Lerer, et al. Gpt-4o system card. Technical report, OpenAI, 2024

work page 2024
[34]

Pixtral 12b

Pravesh Agrawal, Szymon Antoniak, et al. Pixtral 12b. 2024

work page 2024
[35]

Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond

Jinze Bai, Shuai Bai, Shusheng Yang, et al. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. 2023

work page 2023
[36]

Has machine translation achieved human parity? a case for document-level evaluation

Samuel Läubli, Rico Sennrich, and Martin V olk. Has machine translation achieved human parity? a case for document-level evaluation. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii, editors,Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4791–4796, Brussels, Belgium, October-November 2018...

work page 2018
[37]

Machine translationese: Effects of algorithmic bias on linguistic complexity in machine translation

Eva Vanmassenhove, Dimitar Shterionov, and Matthew Gwilliam. Machine translationese: Effects of algorithmic bias on linguistic complexity in machine translation. In Paola Merlo, Jorg Tiedemann, and Reut Tsarfaty, editors,Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 2203–221...

work page 2021
[38]

brms: An R package for bayesian multilevel models using Stan.Journal of Statistical Software, 80(1):1–28, 2017

Paul-Christian Bürkner. brms: An R package for bayesian multilevel models using Stan.Journal of Statistical Software, 80(1):1–28, 2017

work page 2017
[39]

Advanced bayesian multilevel modeling with the R package brms.The R Journal, 10(1):395–411, 2018

Paul-Christian Bürkner. Advanced bayesian multilevel modeling with the R package brms.The R Journal, 10(1):395–411, 2018

work page 2018
[40]

R Foundation for Statistical Computing, Vienna, Austria, 2024

R Core Team.R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2024

work page 2024
[41]

Hoffman and Andrew Gelman

Matthew D. Hoffman and Andrew Gelman. The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo.Journal of Machine Learning Research, 15(1):1593–1623, 2014

work page 2014
[42]

John Wiley & Sons, Hoboken, NJ, 2nd edition, 2010

Alan Agresti.Analysis of Ordinal Categorical Data. John Wiley & Sons, Hoboken, NJ, 2nd edition, 2010. 11

work page 2010
[43]

Lenth.emmeans: Estimated Marginal Means, aka Least-Squares Means, 2024

Russell V . Lenth.emmeans: Estimated Marginal Means, aka Least-Squares Means, 2024. R package version 2.0.0

work page 2024
[44]

Gwet.Handbook of Inter-Rater Reliability

Kilem L. Gwet.Handbook of Inter-Rater Reliability. Advanced Analytics, Gaithersburg, MD, 4 edition, 2014

work page 2014
[45]

Computing Krippendorff’s alpha-reliability.Departmental Papers (ASC), 2011

Klaus Krippendorff. Computing Krippendorff’s alpha-reliability.Departmental Papers (ASC), 2011

work page 2011
[46]

R package version 0.84.1

Matthias Gamer, Jim Lemon, Ian Fellows, and Puspendra Singh.irr: Various Coefficients of Interrater Reliability and Agreement, 2019. R package version 0.84.1

work page 2019
[47]

Gwet.irrCAC: Computing Chance-Corrected Agreement Coefficients, 2022

Kilem L. Gwet.irrCAC: Computing Chance-Corrected Agreement Coefficients, 2022. R package version 1.0

work page 2022
[48]

Safety layers in aligned large language models: The key to llm security

Shen Li, Liuyi Yao, Lan Zhang, and Yaliang Li. Safety layers in aligned large language models: The key to llm security. InProceedings of the International Conference on Learning Representations (ICLR), 2025. A Appendix A - Participant Information and Inter-Rater Reliability A.1 Annotator Demographics Table 3: en-US Annotators Participant ID Age Gender enU...

work page 2025

[1] [1]

Red teaming multimodal language models: Evaluating harm across prompt modalities and models

Madison Van Doren and Casey Ford. Red teaming multimodal language models: Evaluating harm across prompt modalities and models. InAAAI 2026 Workshop on AI Governance and EurIPS 2025 Workshop on Unifying Perspectives on Learning Biases, 2025. arXiv:2509.15478

work page arXiv 2026

[2] [2]

M ¯atr.k¯a: Multilingual jailbreak evaluation of open-source large language models

Murali Emani et al. M ¯atr.k¯a: Multilingual jailbreak evaluation of open-source large language models. In Proceedings of the 1st Workshop on Benchmarks, Harmonization, Annotation, and Standardization for Human-Centric AI in Indian Languages (BHASHA 2025), pages 117–121, 2025

work page 2025

[3] [3]

Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. Realtoxicityprompts: Evaluating neural toxic degeneration in language models. InFindings of the Association for Computational Linguistics: EMNLP, pages 3356–3369, 2020

work page 2020

[4] [4]

Process for adapting language models to society (palms) with values-targeted datasets

Irene Solaiman and Christy Dennison. Process for adapting language models to society (palms) with values-targeted datasets. InAdvances in Neural Information Processing Systems (NeurIPS), 2021

work page 2021

[5] [5]

Ethical and social risks of harm from language models

Laura Weidinger et al. Ethical and social risks of harm from language models. Technical report, Google DeepMind, 2021

work page 2021

[6] [6]

Query-based adversarial prompt generation

Jonathan Hayase, Ema Borevkovic, Nicholas Carlini, Florian Tramèr, and Milad Nasr. Query-based adversarial prompt generation. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

work page 2024

[7] [7]

Generative language models exhibit social identity biases.Nature Computational Science, 5(1):65–75, 2024

Tiancheng Hu, Yara Kyrychenko, Steve Rathje, Nigel Collier, Sander Van Der Linden, and Jon Roozenbeek. Generative language models exhibit social identity biases.Nature Computational Science, 5(1):65–75, 2024

work page 2024

[8] [8]

Realistic evaluation of toxicity in language models

Tinh Luong, Thanh-Thien Le, Linh Ngo, and Thien Nguyen. Realistic evaluation of toxicity in language models. InFindings of the Association for Computational Linguistics (ACL), pages 1038–1047, 2024

work page 2024

[9] [9]

Adversarial attacks and defenses in large language models: Old and new threats.arXiv preprint arXiv:2310.19737, 2023

Leo Schwinn, David Dobre, Stephan Günnemann, and Gauthier Gidel. Adversarial attacks and defenses in large language models: Old and new threats.arXiv preprint arXiv:2310.19737, 2023

work page arXiv 2023

[10] [10]

Survey of vulnerabilities in large language models revealed by adversarial attacks.arXiv preprint arXiv:2310.10844, 2023

Erfan Shayegani, Md Abdullah Al Mamun, Yu Fu, Pedram Zaree, Yue Dong, and Nael Abu-Ghazaleh. Survey of vulnerabilities in large language models revealed by adversarial attacks.arXiv preprint arXiv:2310.10844, 2023

work page arXiv 2023

[11] [11]

Safety is not only about refusal: Reasoning-enhanced fine-tuning for interpretable llm safety

Yuyou Zhang, Miao Li, William Han, Yihang Yao, Zhepeng Cen, and Ding Zhao. Safety is not only about refusal: Reasoning-enhanced fine-tuning for interpretable llm safety. InFindings of the Association for Computational Linguistics (ACL), pages 18727–18746, 2025

work page 2025

[12] [12]

Humorreject: Decoupling llm safety from refusal prefix via a little humor

Zihui Wu, Haichang Gao, Jiacheng Luo, and Zhaoxiang Liu. Humorreject: Decoupling llm safety from refusal prefix via a little humor. InProceedings of the 40th Annual AAAI Conference on Artificial Intelligence, 2025

work page 2025

[13] [13]

Adjacent words, divergent intents: Jailbreaking large language models via task concurrency, 2025

Yukun Jiang, Mingjie Li, Michael Backes, and Yang Zhang. Adjacent words, divergent intents: Jailbreaking large language models via task concurrency, 2025

work page 2025

[14] [14]

Foot-in-the-door: A multi-turn jailbreak for llms

Zixuan Weng, Xiaolong Jin, Jinyuan Jia, and Xiangyu Zhang. Foot-in-the-door: A multi-turn jailbreak for llms. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 1939–1950, 2025

work page 2025

[15] [15]

Large reasoning models are autonomous jailbreak agents

Thilo Hagendorff, Erik Derner, and Nuria Oliver. Large reasoning models are autonomous jailbreak agents. Nature Communications, 2026

work page 2026

[16] [16]

Enhancing jailbreak attacks on llms via persona prompts

Zheng Zhang, Peilin Zhao, Deheng Ye, and Hao Wang. Enhancing jailbreak attacks on llms via persona prompts. InNeurIPS Workshop on LLM Persona Modeling, 2025

work page 2025

[17] [17]

Agentharm: A benchmark for measuring harmfulness of llm agents

Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, Zico Kolter, Matt Fredrikson, Eric Winsor, Jerome Wynne, Yarin Gal, and Xander Davies. Agentharm: A benchmark for measuring harmfulness of llm agents. InProceedings of the International Conference on Learning Representations (ICLR), 2025

work page 2025

[18] [18]

Multilingual jailbreak challenges in large language models.arXiv preprint arXiv:2310.06474, 2023

Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, and Lidong Bing. Multilingual jailbreak challenges in large language models.arXiv preprint arXiv:2310.06474, 2023

work page arXiv 2023

[19] [19]

The language barrier: Dissecting safety challenges of llms in multilingual contexts

Lingfeng Shen, Weiting Tan, Sihao Chen, Yunmo Chen, Jingyu Zhang, Haoran Xu, Boyuan Zheng, Philipp Koehn, and Daniel Khashabi. The language barrier: Dissecting safety challenges of llms in multilingual contexts. InFindings of the Association for Computational Linguistics: ACL 2024, pages 2668–2680, 2024. 10

work page 2024

[20] [20]

Zheng-Xin Yong, Cristina Menghini, and Stephen H. Bach. Low-resource languages jailbreak gpt-4, 2024

work page 2024

[21] [21]

Images are achilles’ heel of alignment: Exploiting visual vulnerabilities for jailbreaking multimodal large language models

Yifan Li, Hangyu Guo, Kun Zhou, Wayne Xin Zhao, and Ji-Rong Wen. Images are achilles’ heel of alignment: Exploiting visual vulnerabilities for jailbreaking multimodal large language models. In Proceedings of the 18th European Conference on Computer Vision (ECCV), 2025

work page 2025

[22] [22]

Jailbreaking attack against multimodal large language model.arXiv preprint arXiv:2402.02309, 2024

Zhenxing Niu, Haodong Ren, Xinbo Gao, Gang Hua, and Rong Jin. Jailbreaking attack against multimodal large language model.arXiv preprint arXiv:2402.02309, 2024

work page arXiv 2024

[23] [23]

Youze Wang, Wenbo Hu, Yinpeng Dong, Jing Liu, Hanwang Zhang, and Richang Hong. Align is not enough: Multimodal universal jailbreak attack against multimodal large language models.IEEE Transactions on Circuits and Systems for Video Technology, 35(6):5475–5488, 2025

work page 2025

[24] [24]

Distraction is all you need for multimodal large language model jailbreaking

Zuopeng Yang, Jiluan Fan, Anli Yan, Erdun Gao, Xin Lin, Tao Li, Kanghua Mo, and Changyu Dong. Distraction is all you need for multimodal large language model jailbreaking. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

work page 2026

[25] [25]

Stylebreak: Revealing alignment vulnerabilities in large audio-language models via style-aware audio jailbreak

Hongyi Li, Chengxuan Zhou, Chu Wang, Sicheng Liang, Yanting Chen, Qinlin Xie, Jiawei Ye, and Jie Wu. Stylebreak: Revealing alignment vulnerabilities in large audio-language models via style-aware audio jailbreak. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 37591–37599, 2026

work page 2026

[26] [26]

Beyond words: Multilingual and multimodal red teaming of mllms

Erik Derner and Kristina Batisti ˇc. Beyond words: Multilingual and multimodal red teaming of mllms. InProceedings of the First Workshop on LLM Security (LLMSEC), pages 198–206. Association for Computational Linguistics, 2025

work page 2025

[27] [27]

Attack via overfitting: 10-shot benign fine-tuning to jailbreak llms

Zhixin Xie, Xurui Song, and Jun Luo. Attack via overfitting: 10-shot benign fine-tuning to jailbreak llms. InAdvances in Neural Information Processing Systems (NeurIPS), 2025

work page 2025

[28] [28]

Claude sonnet 4.5

Anthropic. Claude sonnet 4.5. Technical report, Anthropic, 2025

work page 2025

[29] [29]

Gpt-5 system card

OpenAI. Gpt-5 system card. Technical report, OpenAI, 2025

work page 2025

[30] [30]

Pixtral large

Mistral AI. Pixtral large. 2024

work page 2024

[31] [31]

Qwen2.5 technical report

Qwen Team. Qwen2.5 technical report. 2025

work page 2025

[32] [32]

Claude 3.5 sonnet

Anthropic. Claude 3.5 sonnet. Technical report, Anthropic, 2024

work page 2024

[33] [33]

Gpt-4o system card

OpenAI, Aaron Hurst, Adam Lerer, et al. Gpt-4o system card. Technical report, OpenAI, 2024

work page 2024

[34] [34]

Pixtral 12b

Pravesh Agrawal, Szymon Antoniak, et al. Pixtral 12b. 2024

work page 2024

[35] [35]

Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond

Jinze Bai, Shuai Bai, Shusheng Yang, et al. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. 2023

work page 2023

[36] [36]

Has machine translation achieved human parity? a case for document-level evaluation

Samuel Läubli, Rico Sennrich, and Martin V olk. Has machine translation achieved human parity? a case for document-level evaluation. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii, editors,Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4791–4796, Brussels, Belgium, October-November 2018...

work page 2018

[37] [37]

Machine translationese: Effects of algorithmic bias on linguistic complexity in machine translation

Eva Vanmassenhove, Dimitar Shterionov, and Matthew Gwilliam. Machine translationese: Effects of algorithmic bias on linguistic complexity in machine translation. In Paola Merlo, Jorg Tiedemann, and Reut Tsarfaty, editors,Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 2203–221...

work page 2021

[38] [38]

brms: An R package for bayesian multilevel models using Stan.Journal of Statistical Software, 80(1):1–28, 2017

Paul-Christian Bürkner. brms: An R package for bayesian multilevel models using Stan.Journal of Statistical Software, 80(1):1–28, 2017

work page 2017

[39] [39]

Advanced bayesian multilevel modeling with the R package brms.The R Journal, 10(1):395–411, 2018

Paul-Christian Bürkner. Advanced bayesian multilevel modeling with the R package brms.The R Journal, 10(1):395–411, 2018

work page 2018

[40] [40]

R Foundation for Statistical Computing, Vienna, Austria, 2024

R Core Team.R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2024

work page 2024

[41] [41]

Hoffman and Andrew Gelman

Matthew D. Hoffman and Andrew Gelman. The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo.Journal of Machine Learning Research, 15(1):1593–1623, 2014

work page 2014

[42] [42]

John Wiley & Sons, Hoboken, NJ, 2nd edition, 2010

Alan Agresti.Analysis of Ordinal Categorical Data. John Wiley & Sons, Hoboken, NJ, 2nd edition, 2010. 11

work page 2010

[43] [43]

Lenth.emmeans: Estimated Marginal Means, aka Least-Squares Means, 2024

Russell V . Lenth.emmeans: Estimated Marginal Means, aka Least-Squares Means, 2024. R package version 2.0.0

work page 2024

[44] [44]

Gwet.Handbook of Inter-Rater Reliability

Kilem L. Gwet.Handbook of Inter-Rater Reliability. Advanced Analytics, Gaithersburg, MD, 4 edition, 2014

work page 2014

[45] [45]

Computing Krippendorff’s alpha-reliability.Departmental Papers (ASC), 2011

Klaus Krippendorff. Computing Krippendorff’s alpha-reliability.Departmental Papers (ASC), 2011

work page 2011

[46] [46]

R package version 0.84.1

Matthias Gamer, Jim Lemon, Ian Fellows, and Puspendra Singh.irr: Various Coefficients of Interrater Reliability and Agreement, 2019. R package version 0.84.1

work page 2019

[47] [47]

Gwet.irrCAC: Computing Chance-Corrected Agreement Coefficients, 2022

Kilem L. Gwet.irrCAC: Computing Chance-Corrected Agreement Coefficients, 2022. R package version 1.0

work page 2022

[48] [48]

Safety layers in aligned large language models: The key to llm security

Shen Li, Liuyi Yao, Lan Zhang, and Yaliang Li. Safety layers in aligned large language models: The key to llm security. InProceedings of the International Conference on Learning Representations (ICLR), 2025. A Appendix A - Participant Information and Inter-Rater Reliability A.1 Annotator Demographics Table 3: en-US Annotators Participant ID Age Gender enU...

work page 2025