pith. sign in

arxiv: 2605.23157 · v1 · pith:GKEVWVMInew · submitted 2026-05-22 · 💻 cs.CL

Same Model, Different Weakness: How Language and Modality Reshape the Jailbreak Attack Surface in Frontier MLLMs

Pith reviewed 2026-05-25 05:01 UTC · model grok-4.3

classification 💻 cs.CL
keywords jailbreak attacksmultimodal LLMscross-lingual evaluationalignment failuresred-teaminglanguage effectsvisual attacksmodel vulnerability
3
0 comments X

The pith

Switching from English to Spanish changes which jailbreak attacks succeed on the same frontier MLLMs

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that jailbreak vulnerability in multimodal large language models is not uniform across languages. Linguistic framing attacks such as role-play lose effectiveness under Spanish prompting while visually explicit multimodal attacks gain effectiveness. A sympathetic reader would care because this dissociation shows that language and vision failures arise through separate mechanisms rather than a single model property. Safety rankings therefore reverse between English and Spanish conditions, and no scalar adjustment of English scores recovers the Spanish outcomes. The result indicates that current evaluation frameworks treating language and modality as independent dimensions misspecify the attack surface of globally deployed models.

Core claim

The attack surface of a multimodal large language model is language-dependent in ways that reveal the mechanistic structure of alignment failures. The central finding is that language does not scale vulnerability uniformly. Bayesian mixed-effects analyses reveal that linguistic framing attacks such as role-play become substantially less effective under Spanish prompting, while visually explicit multimodal attacks become more effective, which directly implicates the prompt-language interface rather than global annotator leniency. This dissociation indicates that linguistic and visual alignment failures operate through distinct mechanisms, and that switching language is sufficient to expose<f|

What carries the argument

Bayesian mixed-effects analyses comparing attack success rates on a fixed 363-prompt benchmark in text-only and multimodal conditions across English and Spanish native-speaker panels

If this is right

  • Safety rankings are not preserved across languages, with Qwen Omni overtaking Pixtral Large as most vulnerable under Spanish prompting.
  • Absolute attack success rates decline across model generations without closing gaps between models.
  • Linguistic and visual alignment failures operate through distinct mechanisms exposed by language switching.
  • Safety evaluation frameworks must treat language and modality as interacting dimensions rather than independent ones.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers may need to run safety tests in multiple languages to surface language-specific weaknesses before global release.
  • Alignment training could incorporate language-specific data to address the observed dissociation between linguistic and visual attacks.
  • The pattern might generalize to other language pairs, suggesting that English-only benchmarks systematically understate certain risks.
  • Redesigning benchmarks to include matched multimodal conditions across languages could change which models are considered safest.

Load-bearing premise

The fixed 363-prompt benchmark and matched native-speaker annotator panels are sufficient to isolate prompt-language effects from cultural, lexical, or rating-scale differences between the two language groups.

What would settle it

A replication study with a different prompt set or annotator panels that finds uniform attack success rates across English and Spanish conditions would falsify the claim of language-dependent vulnerability.

Figures

Figures reproduced from arXiv: 2605.23157 by Casey Ford, Emily Dix, Madison Van Doren, Sicheng Jin.

Figure 1
Figure 1. Figure 1: Elicitation of adversarial prompt dataset [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Annotating responses from LLM models tested [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Distribution of harmfulness rating of multimodal prompt output across languages. [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Annotator disagreement matrices overall and by language group. Rows = assigned rating; [PITH_FULL_IMAGE:figures/full_fig_p013_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Distribution of harmfulness of model responses overall. [PITH_FULL_IMAGE:figures/full_fig_p014_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Pairwise model contrasts for ASR. Error bars = CrI (95%). Grey = Not Credible; Blue = [PITH_FULL_IMAGE:figures/full_fig_p014_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Pairwise model contrasts for harm severity (overall). Error bars = CrI (95%). Grey = Not [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Descriptive ASR by model and language [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Pairwise model contrasts for ASR by language. Error bars = CrI (95%). Grey = Not [PITH_FULL_IMAGE:figures/full_fig_p016_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Language contrasts (es-MX vs en-US) within each model for harm severity. Error bars = [PITH_FULL_IMAGE:figures/full_fig_p017_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Pairwise model contrasts for harm severity by language. Error bars = CrI (95%). Grey = [PITH_FULL_IMAGE:figures/full_fig_p017_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Pairwise model contrasts for ASR by modality. Error bars = CrI (95%). Grey = Not [PITH_FULL_IMAGE:figures/full_fig_p018_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Modality contrasts (text-only vs multimodal) within each model for harm severity. Error [PITH_FULL_IMAGE:figures/full_fig_p018_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Descriptive ASR by model, modality, and language. [PITH_FULL_IMAGE:figures/full_fig_p020_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Pairwise model contrasts for ASR by language and modality. Error bars = CrI (95%). [PITH_FULL_IMAGE:figures/full_fig_p020_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Language contrasts within each model and modality for harm severity. Error bars = CrI [PITH_FULL_IMAGE:figures/full_fig_p021_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Pairwise attack technique contrasts for ASR. Error bars = CrI (95%). Grey = Not Credible; [PITH_FULL_IMAGE:figures/full_fig_p022_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Pairwise attack technique contrasts for ASR by language. Error bars = CrI (95%). Grey = [PITH_FULL_IMAGE:figures/full_fig_p023_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: Pairwise attack execution contrasts for ASR overall. Error bars = CrI (95%). Grey = Not [PITH_FULL_IMAGE:figures/full_fig_p024_19.png] view at source ↗
Figure 20
Figure 20. Figure 20: Pairwise attack execution comparison by language. Error bars = CrI (95%). Faded = Not [PITH_FULL_IMAGE:figures/full_fig_p025_20.png] view at source ↗
Figure 21
Figure 21. Figure 21: Pairwise harm category comparison on ASR overall. Error bars = CrI (95%). Faded = Not [PITH_FULL_IMAGE:figures/full_fig_p025_21.png] view at source ↗
Figure 22
Figure 22. Figure 22: Pairwise harm category comparison on ASR by language. Error bars = CrI (95%). Faded [PITH_FULL_IMAGE:figures/full_fig_p026_22.png] view at source ↗
read the original abstract

The attack surface of a multimodal large language model (MLLM) is language-dependent in ways that reveal the mechanistic structure of alignment failures. We present the first systematic cross-lingual, multimodal red-teaming study comparing jailbreak vulnerability in US English (en-US) and Mexican Spanish (es-MX) across four frontier MLLMs: Claude Sonnet 4.5, GPT-5, Pixtral Large, and Qwen Omni. Using a fixed adversarial benchmark of 363 diverse prompt scenarios administered in text-only and multimodal conditions, we collected 52,272 harm ratings and binary attack success judgements from matched panels of nine native-speaker annotators per language group. Our central finding is that language does not scale vulnerability uniformly. Bayesian mixed-effects analyses reveal that linguistic framing attacks such as role-play become substantially less effective under Spanish prompting, while visually explicit multimodal attacks become more effective, which directly implicates the prompt-language interface rather than global annotator leniency. This dissociation indicates that linguistic and visual alignment failures operate through distinct mechanisms, and that switching language is sufficient to expose that separation. The practical consequence is that safety rankings are not preserved across languages. Qwen Omni overtakes Pixtral Large as the most vulnerable model among es-MX participants, a rank reversal no scalar correction of English-condition scores could recover, and absolute attack success rates have declined across model generations without closing the gaps between them. These findings demonstrate that safety evaluation frameworks treating language and modality as independent dimensions fundamentally misspecify the attack surface of globally deployed MLLMs, and must be redesigned accordingly.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript reports results from a cross-lingual, multimodal red-teaming study on four frontier MLLMs (Claude Sonnet 4.5, GPT-5, Pixtral Large, Qwen Omni). Using a fixed set of 363 adversarial prompts administered in text-only and multimodal conditions, it collects 52,272 harm ratings and binary success judgments from nine native-speaker annotators per language (en-US vs. es-MX). Bayesian mixed-effects models are used to test for language-by-attack-type interactions; the central claim is that linguistic framing attacks (e.g., role-play) become less effective under Spanish prompting while visually explicit multimodal attacks become more effective, producing model rank reversals (Qwen Omni overtakes Pixtral Large in es-MX) that cannot be recovered by scalar correction of English scores. The authors interpret this dissociation as evidence that linguistic and visual alignment failures operate through distinct mechanisms at the prompt-language interface.

Significance. If the reported language-by-attack-type interaction survives controls for rater cultural differences, the work would be significant for AI safety evaluation. It supplies the first large-scale, matched-panel evidence that safety rankings are not preserved across languages and that treating language and modality as independent dimensions misspecifies the attack surface. The scale of the annotation effort (52k ratings) and use of Bayesian mixed-effects modeling to detect interactions are methodological strengths that could support falsifiable claims about alignment mechanisms.

major comments (3)
  1. [Methods] Methods section: the manuscript provides no description of the random-effects structure, factor coding for language and attack type, priors, or convergence diagnostics for the Bayesian mixed-effects models. Because the central claim rests on a language-by-attack-type interaction being attributable to model mechanisms rather than annotator differences, the absence of these specifications prevents verification that the reported dissociation is robust.
  2. [Methods] Methods section: no cross-language calibration procedures (bilingual raters, shared anchoring prompts, or inter-language reliability checks) are described. The fixed 363-prompt benchmark and matched native-speaker panels alone do not rule out systematic cultural differences in harm perception for role-play versus visual explicitness; this alternative directly threatens the interpretation that the dissociation implicates the prompt-language interface.
  3. [Results] Results section: the claim that 'no scalar correction of English-condition scores could recover' the rank reversal is presented without the actual correction procedure, sensitivity analysis, or quantitative comparison of effect sizes across languages. This detail is load-bearing for the practical consequence that safety rankings are language-dependent.
minor comments (2)
  1. [Abstract] Abstract: the sentence reporting the central finding is long and contains multiple clauses; splitting it would improve readability while preserving the dissociation claim.
  2. [Figures] Figure captions: several figures showing attack success rates by language and modality lack error bars or credible intervals from the mixed-effects models, making visual assessment of the interaction difficult.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and indicate planned revisions to improve methodological transparency and interpretive caution.

read point-by-point responses
  1. Referee: [Methods] Methods section: the manuscript provides no description of the random-effects structure, factor coding for language and attack type, priors, or convergence diagnostics for the Bayesian mixed-effects models. Because the central claim rests on a language-by-attack-type interaction being attributable to model mechanisms rather than annotator differences, the absence of these specifications prevents verification that the reported dissociation is robust.

    Authors: We agree that these specifications are necessary for verification. In the revised manuscript we will add a dedicated Methods subsection specifying: random intercepts for annotator and prompt (with attack-type slopes by annotator), sum-to-zero coding for the two-level language factor and the attack-type factor, weakly informative normal(0, 1) priors on fixed-effect coefficients, and convergence diagnostics (all R-hat < 1.01, bulk ESS > 1000). This addition will directly address the concern that the interaction could be driven by annotator variance. revision: yes

  2. Referee: [Methods] Methods section: no cross-language calibration procedures (bilingual raters, shared anchoring prompts, or inter-language reliability checks) are described. The fixed 363-prompt benchmark and matched native-speaker panels alone do not rule out systematic cultural differences in harm perception for role-play versus visual explicitness; this alternative directly threatens the interpretation that the dissociation implicates the prompt-language interface.

    Authors: The referee correctly identifies a limitation: we did not employ bilingual raters, shared anchoring prompts, or explicit inter-language reliability metrics. The design uses native-speaker panels and annotator-level random effects, but these do not fully isolate cultural differences in harm perception. In revision we will add an explicit Limitations paragraph acknowledging this alternative explanation and stating that the observed dissociation is consistent with, but not definitive proof of, prompt-language interface mechanisms. We cannot retroactively add calibration data, so the revision will be textual only. revision: partial

  3. Referee: [Results] Results section: the claim that 'no scalar correction of English-condition scores could recover' the rank reversal is presented without the actual correction procedure, sensitivity analysis, or quantitative comparison of effect sizes across languages. This detail is load-bearing for the practical consequence that safety rankings are language-dependent.

    Authors: We will expand the Results section to describe the scalar correction: English success rates were shifted by the mean language main effect (and separately by attack-type-specific language effects), after which model rankings were recomputed; the Qwen–Pixtral reversal persisted. We will also report sensitivity checks (shifts of ±1 SD of the language effect) and language-specific effect sizes (posterior means and 95% CrIs for the language-by-attack interaction terms). These details will be added to support the claim. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical data collection with standard statistical analysis.

full rationale

This is a new data collection study using external human raters and Bayesian mixed-effects models on collected ratings. No equations, derivations, or self-citations reduce the reported language-by-attack-type interactions to quantities fitted from the same data or prior author work. The central claims derive from the 52,272 new ratings rather than any self-referential construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the validity of the annotation protocol and the Bayesian mixed-effects model; no free parameters, ad-hoc axioms, or new entities are introduced beyond standard statistical practice.

axioms (1)
  • domain assumption Bayesian mixed-effects models can isolate prompt-language effects from rater leniency when panels are matched by native-speaker status
    Invoked to support the claim that differences reflect the language interface rather than global annotator differences

pith-pipeline@v0.9.0 · 5830 in / 1295 out tokens · 26389 ms · 2026-05-25T05:01:33.323171+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages

  1. [1]

    Red teaming multimodal language models: Evaluating harm across prompt modalities and models

    Madison Van Doren and Casey Ford. Red teaming multimodal language models: Evaluating harm across prompt modalities and models. InAAAI 2026 Workshop on AI Governance and EurIPS 2025 Workshop on Unifying Perspectives on Learning Biases, 2025. arXiv:2509.15478

  2. [2]

    M ¯atr.k¯a: Multilingual jailbreak evaluation of open-source large language models

    Murali Emani et al. M ¯atr.k¯a: Multilingual jailbreak evaluation of open-source large language models. In Proceedings of the 1st Workshop on Benchmarks, Harmonization, Annotation, and Standardization for Human-Centric AI in Indian Languages (BHASHA 2025), pages 117–121, 2025

  3. [3]

    Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. Realtoxicityprompts: Evaluating neural toxic degeneration in language models. InFindings of the Association for Computational Linguistics: EMNLP, pages 3356–3369, 2020

  4. [4]

    Process for adapting language models to society (palms) with values-targeted datasets

    Irene Solaiman and Christy Dennison. Process for adapting language models to society (palms) with values-targeted datasets. InAdvances in Neural Information Processing Systems (NeurIPS), 2021

  5. [5]

    Ethical and social risks of harm from language models

    Laura Weidinger et al. Ethical and social risks of harm from language models. Technical report, Google DeepMind, 2021

  6. [6]

    Query-based adversarial prompt generation

    Jonathan Hayase, Ema Borevkovic, Nicholas Carlini, Florian Tramèr, and Milad Nasr. Query-based adversarial prompt generation. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

  7. [7]

    Generative language models exhibit social identity biases.Nature Computational Science, 5(1):65–75, 2024

    Tiancheng Hu, Yara Kyrychenko, Steve Rathje, Nigel Collier, Sander Van Der Linden, and Jon Roozenbeek. Generative language models exhibit social identity biases.Nature Computational Science, 5(1):65–75, 2024

  8. [8]

    Realistic evaluation of toxicity in language models

    Tinh Luong, Thanh-Thien Le, Linh Ngo, and Thien Nguyen. Realistic evaluation of toxicity in language models. InFindings of the Association for Computational Linguistics (ACL), pages 1038–1047, 2024

  9. [9]

    Adversarial attacks and defenses in large language models: Old and new threats.arXiv preprint arXiv:2310.19737, 2023

    Leo Schwinn, David Dobre, Stephan Günnemann, and Gauthier Gidel. Adversarial attacks and defenses in large language models: Old and new threats.arXiv preprint arXiv:2310.19737, 2023

  10. [10]

    Survey of vulnerabilities in large language models revealed by adversarial attacks.arXiv preprint arXiv:2310.10844, 2023

    Erfan Shayegani, Md Abdullah Al Mamun, Yu Fu, Pedram Zaree, Yue Dong, and Nael Abu-Ghazaleh. Survey of vulnerabilities in large language models revealed by adversarial attacks.arXiv preprint arXiv:2310.10844, 2023

  11. [11]

    Safety is not only about refusal: Reasoning-enhanced fine-tuning for interpretable llm safety

    Yuyou Zhang, Miao Li, William Han, Yihang Yao, Zhepeng Cen, and Ding Zhao. Safety is not only about refusal: Reasoning-enhanced fine-tuning for interpretable llm safety. InFindings of the Association for Computational Linguistics (ACL), pages 18727–18746, 2025

  12. [12]

    Humorreject: Decoupling llm safety from refusal prefix via a little humor

    Zihui Wu, Haichang Gao, Jiacheng Luo, and Zhaoxiang Liu. Humorreject: Decoupling llm safety from refusal prefix via a little humor. InProceedings of the 40th Annual AAAI Conference on Artificial Intelligence, 2025

  13. [13]

    Adjacent words, divergent intents: Jailbreaking large language models via task concurrency, 2025

    Yukun Jiang, Mingjie Li, Michael Backes, and Yang Zhang. Adjacent words, divergent intents: Jailbreaking large language models via task concurrency, 2025

  14. [14]

    Foot-in-the-door: A multi-turn jailbreak for llms

    Zixuan Weng, Xiaolong Jin, Jinyuan Jia, and Xiangyu Zhang. Foot-in-the-door: A multi-turn jailbreak for llms. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 1939–1950, 2025

  15. [15]

    Large reasoning models are autonomous jailbreak agents

    Thilo Hagendorff, Erik Derner, and Nuria Oliver. Large reasoning models are autonomous jailbreak agents. Nature Communications, 2026

  16. [16]

    Enhancing jailbreak attacks on llms via persona prompts

    Zheng Zhang, Peilin Zhao, Deheng Ye, and Hao Wang. Enhancing jailbreak attacks on llms via persona prompts. InNeurIPS Workshop on LLM Persona Modeling, 2025

  17. [17]

    Agentharm: A benchmark for measuring harmfulness of llm agents

    Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, Zico Kolter, Matt Fredrikson, Eric Winsor, Jerome Wynne, Yarin Gal, and Xander Davies. Agentharm: A benchmark for measuring harmfulness of llm agents. InProceedings of the International Conference on Learning Representations (ICLR), 2025

  18. [18]

    Multilingual jailbreak challenges in large language models.arXiv preprint arXiv:2310.06474, 2023

    Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, and Lidong Bing. Multilingual jailbreak challenges in large language models.arXiv preprint arXiv:2310.06474, 2023

  19. [19]

    The language barrier: Dissecting safety challenges of llms in multilingual contexts

    Lingfeng Shen, Weiting Tan, Sihao Chen, Yunmo Chen, Jingyu Zhang, Haoran Xu, Boyuan Zheng, Philipp Koehn, and Daniel Khashabi. The language barrier: Dissecting safety challenges of llms in multilingual contexts. InFindings of the Association for Computational Linguistics: ACL 2024, pages 2668–2680, 2024. 10

  20. [20]

    Zheng-Xin Yong, Cristina Menghini, and Stephen H. Bach. Low-resource languages jailbreak gpt-4, 2024

  21. [21]

    Images are achilles’ heel of alignment: Exploiting visual vulnerabilities for jailbreaking multimodal large language models

    Yifan Li, Hangyu Guo, Kun Zhou, Wayne Xin Zhao, and Ji-Rong Wen. Images are achilles’ heel of alignment: Exploiting visual vulnerabilities for jailbreaking multimodal large language models. In Proceedings of the 18th European Conference on Computer Vision (ECCV), 2025

  22. [22]

    Jailbreaking attack against multimodal large language model.arXiv preprint arXiv:2402.02309, 2024

    Zhenxing Niu, Haodong Ren, Xinbo Gao, Gang Hua, and Rong Jin. Jailbreaking attack against multimodal large language model.arXiv preprint arXiv:2402.02309, 2024

  23. [23]

    Youze Wang, Wenbo Hu, Yinpeng Dong, Jing Liu, Hanwang Zhang, and Richang Hong. Align is not enough: Multimodal universal jailbreak attack against multimodal large language models.IEEE Transactions on Circuits and Systems for Video Technology, 35(6):5475–5488, 2025

  24. [24]

    Distraction is all you need for multimodal large language model jailbreaking

    Zuopeng Yang, Jiluan Fan, Anli Yan, Erdun Gao, Xin Lin, Tao Li, Kanghua Mo, and Changyu Dong. Distraction is all you need for multimodal large language model jailbreaking. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

  25. [25]

    Stylebreak: Revealing alignment vulnerabilities in large audio-language models via style-aware audio jailbreak

    Hongyi Li, Chengxuan Zhou, Chu Wang, Sicheng Liang, Yanting Chen, Qinlin Xie, Jiawei Ye, and Jie Wu. Stylebreak: Revealing alignment vulnerabilities in large audio-language models via style-aware audio jailbreak. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 37591–37599, 2026

  26. [26]

    Beyond words: Multilingual and multimodal red teaming of mllms

    Erik Derner and Kristina Batisti ˇc. Beyond words: Multilingual and multimodal red teaming of mllms. InProceedings of the First Workshop on LLM Security (LLMSEC), pages 198–206. Association for Computational Linguistics, 2025

  27. [27]

    Attack via overfitting: 10-shot benign fine-tuning to jailbreak llms

    Zhixin Xie, Xurui Song, and Jun Luo. Attack via overfitting: 10-shot benign fine-tuning to jailbreak llms. InAdvances in Neural Information Processing Systems (NeurIPS), 2025

  28. [28]

    Claude sonnet 4.5

    Anthropic. Claude sonnet 4.5. Technical report, Anthropic, 2025

  29. [29]

    Gpt-5 system card

    OpenAI. Gpt-5 system card. Technical report, OpenAI, 2025

  30. [30]

    Pixtral large

    Mistral AI. Pixtral large. 2024

  31. [31]

    Qwen2.5 technical report

    Qwen Team. Qwen2.5 technical report. 2025

  32. [32]

    Claude 3.5 sonnet

    Anthropic. Claude 3.5 sonnet. Technical report, Anthropic, 2024

  33. [33]

    Gpt-4o system card

    OpenAI, Aaron Hurst, Adam Lerer, et al. Gpt-4o system card. Technical report, OpenAI, 2024

  34. [34]

    Pixtral 12b

    Pravesh Agrawal, Szymon Antoniak, et al. Pixtral 12b. 2024

  35. [35]

    Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, et al. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. 2023

  36. [36]

    Has machine translation achieved human parity? a case for document-level evaluation

    Samuel Läubli, Rico Sennrich, and Martin V olk. Has machine translation achieved human parity? a case for document-level evaluation. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii, editors,Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4791–4796, Brussels, Belgium, October-November 2018...

  37. [37]

    Machine translationese: Effects of algorithmic bias on linguistic complexity in machine translation

    Eva Vanmassenhove, Dimitar Shterionov, and Matthew Gwilliam. Machine translationese: Effects of algorithmic bias on linguistic complexity in machine translation. In Paola Merlo, Jorg Tiedemann, and Reut Tsarfaty, editors,Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 2203–221...

  38. [38]

    brms: An R package for bayesian multilevel models using Stan.Journal of Statistical Software, 80(1):1–28, 2017

    Paul-Christian Bürkner. brms: An R package for bayesian multilevel models using Stan.Journal of Statistical Software, 80(1):1–28, 2017

  39. [39]

    Advanced bayesian multilevel modeling with the R package brms.The R Journal, 10(1):395–411, 2018

    Paul-Christian Bürkner. Advanced bayesian multilevel modeling with the R package brms.The R Journal, 10(1):395–411, 2018

  40. [40]

    R Foundation for Statistical Computing, Vienna, Austria, 2024

    R Core Team.R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2024

  41. [41]

    Hoffman and Andrew Gelman

    Matthew D. Hoffman and Andrew Gelman. The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo.Journal of Machine Learning Research, 15(1):1593–1623, 2014

  42. [42]

    John Wiley & Sons, Hoboken, NJ, 2nd edition, 2010

    Alan Agresti.Analysis of Ordinal Categorical Data. John Wiley & Sons, Hoboken, NJ, 2nd edition, 2010. 11

  43. [43]

    Lenth.emmeans: Estimated Marginal Means, aka Least-Squares Means, 2024

    Russell V . Lenth.emmeans: Estimated Marginal Means, aka Least-Squares Means, 2024. R package version 2.0.0

  44. [44]

    Gwet.Handbook of Inter-Rater Reliability

    Kilem L. Gwet.Handbook of Inter-Rater Reliability. Advanced Analytics, Gaithersburg, MD, 4 edition, 2014

  45. [45]

    Computing Krippendorff’s alpha-reliability.Departmental Papers (ASC), 2011

    Klaus Krippendorff. Computing Krippendorff’s alpha-reliability.Departmental Papers (ASC), 2011

  46. [46]

    R package version 0.84.1

    Matthias Gamer, Jim Lemon, Ian Fellows, and Puspendra Singh.irr: Various Coefficients of Interrater Reliability and Agreement, 2019. R package version 0.84.1

  47. [47]

    Gwet.irrCAC: Computing Chance-Corrected Agreement Coefficients, 2022

    Kilem L. Gwet.irrCAC: Computing Chance-Corrected Agreement Coefficients, 2022. R package version 1.0

  48. [48]

    Safety layers in aligned large language models: The key to llm security

    Shen Li, Liuyi Yao, Lan Zhang, and Yaliang Li. Safety layers in aligned large language models: The key to llm security. InProceedings of the International Conference on Learning Representations (ICLR), 2025. A Appendix A - Participant Information and Inter-Rater Reliability A.1 Annotator Demographics Table 3: en-US Annotators Participant ID Age Gender enU...