arxiv: 2603.21697 · v2 · submitted 2026-03-23 · 💻 cs.CR · cs.AI· cs.MM

Recognition: no theorem link

Structured Visual Narratives Undermine Safety Alignment in Multimodal Large Language Models

Rui Yang Tan , Yujia Hu , Roy Ka-Wei Lee

Authors on Pith no claims yet

Pith reviewed 2026-05-15 01:23 UTC · model grok-4.3

classification 💻 cs.CR cs.AIcs.MM

keywords multimodal large language modelsjailbreak attacksvisual narrativessafety alignmentcomic-based jailbreaksharm categoriesrefusal ratessafety evaluators

0 comments

The pith

Embedding harmful goals in three-panel comics allows effective jailbreaks of multimodal AI models

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces comic-template jailbreaks that place harmful instructions inside simple visual narratives and ask the model to role-play completing the story. Across fifteen state-of-the-art multimodal large language models, these attacks reach success rates comparable to strong text-only jailbreaks and clearly exceed plain-text or random-image baselines. On several commercial models, ensembles of such attacks exceed 90 percent success. Existing defense methods that block the harmful comics also produce high refusal rates on ordinary benign prompts. Automatic and human safety judges prove unreliable when evaluating sensitive but non-harmful visual content.

Core claim

The central claim is that embedding harmful goals inside simple three-panel visual narratives and prompting the model to role-play and complete the comic produces jailbreak success rates that match strong rule-based text attacks while substantially outperforming unstructured text and image baselines, exposing a distinct vulnerability in current multimodal safety alignment.

What carries the argument

The comic-template jailbreak, which embeds harmful goals inside three-panel visual narratives and prompts the model to role-play and complete the comic

If this is right

Comic-based attacks achieve success rates comparable to strong rule-based jailbreaks across fifteen MLLMs.
They substantially outperform plain-text and random-image baselines.
Ensemble success rates exceed 90 percent on several commercial models.
Defense methods effective against the comics induce high refusal rates on benign prompts.
Current safety evaluators are unreliable on sensitive but non-harmful content.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Safety methods may need to treat sequential visual narratives as a distinct input class rather than as collections of isolated images.
Testing regimes for new models should include structured storytelling prompts to catch vulnerabilities that text-only or single-image checks miss.
Improved multimodal judges will be required that can distinguish context and intent within narrative sequences instead of flagging isolated sensitive elements.

Load-bearing premise

The 1,167 comic instances and the fifteen tested models are representative of realistic multimodal jailbreak attempts and of deployed systems in the wild.

What would settle it

A controlled test in which the same harmful content is presented in non-sequential or non-narrative image panels and produces markedly lower success rates would support the claim that narrative structure is the key factor; comparable rates with unstructured images would falsify it.

read the original abstract

Multimodal Large Language Models (MLLMs) extend text-only LLMs with visual reasoning, but also introduce new safety failure modes under visually grounded instructions. We study comic-template jailbreaks that embed harmful goals inside simple three-panel visual narratives and prompt the model to role-play and "complete the comic." Building on JailbreakBench and JailbreakV, we introduce ComicJailbreak, a comic-based jailbreak benchmark with 1,167 attack instances spanning 10 harm categories and 5 task setups. Across 15 state-of-the-art MLLMs (six commercial and nine open-source), comic-based attacks achieve success rates comparable to strong rule-based jailbreaks and substantially outperform plain-text and random-image baselines, with ensemble success rates exceeding 90% on several commercial models. Then, with the existing defense methodologies, we show that these methods are effective against the harmful comics, they will induce a high refusal rate when prompted with benign prompts. Finally, using automatic judging and targeted human evaluation, we show that current safety evaluators can be unreliable on sensitive but non-harmful content. Our findings highlight the need for safety alignment robust to narrative-driven multimodal jailbreaks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Comic templates let you jailbreak MLLMs at rates matching text attacks, but the headline numbers rest on judges the paper itself calls unreliable.

read the letter

The main thing to know is that embedding harmful requests inside simple three-panel comics bypasses safety in most tested MLLMs, including commercial ones, at levels comparable to strong rule-based jailbreaks and well above plain-text or random-image baselines. Ensemble rates top 90% on several models. They introduce ComicJailbreak, a set of 1,167 instances across 10 harm categories and 5 task types, built on top of existing jailbreak suites. The tests cover 15 models and include checks on current defenses, which either fail to block the comics or raise refusal rates on benign prompts. They also run automatic judges plus targeted human review and show those judges are shaky on sensitive but non-harmful content. That last part is useful on its own. The work is straightforward empirical testing with a new benchmark that fills a gap in multimodal attack surfaces. The numbers look directionally consistent with prior text-only results, and the defense trade-off observation is worth noting for anyone tuning alignment. The soft spot is measurement. Success rates depend on the automatic judges the paper flags as unreliable near decision boundaries, yet the text does not quantify how many of the 1,167 instances received full human review versus auto-only scoring. Without that breakdown or more detail on prompt construction and statistical controls, the exact percentages could be inflated by evaluator noise rather than pure alignment failure. The abstract is thin on those mechanics, so the full paper needs to tighten that up. This is for people working on multimodal safety and robustness. It gives a concrete new test set and evidence that narrative structure in images creates a distinct failure mode. The thinking is clear and it engages the cited literature without overclaiming. I would send it to peer review so referees can examine the judging protocol and defense experiments directly.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces ComicJailbreak, a benchmark of 1,167 comic-based jailbreak instances spanning 10 harm categories and 5 task setups. It reports that these structured three-panel visual narrative attacks achieve success rates on 15 MLLMs (6 commercial, 9 open-source) comparable to strong rule-based jailbreaks, substantially outperforming plain-text and random-image baselines, with ensemble success rates exceeding 90% on several commercial models. The work further shows that existing defense methods effective against harmful comics induce high refusal rates on benign prompts, and that current automatic safety evaluators are unreliable on sensitive but non-harmful content.

Significance. If the empirical results hold after methodological clarification, the findings are significant because they identify a new class of narrative-driven multimodal jailbreaks that exploit visual structure to undermine alignment in deployed MLLMs. The new benchmark, the demonstration of defense trade-offs, and the evidence of evaluator unreliability on borderline cases are concrete contributions that can guide future safety work.

major comments (1)

[Evaluation section] Evaluation methodology: The headline success rates (including >90% ensemble on commercial models) are obtained via automatic judging plus targeted human evaluation. The manuscript separately demonstrates that the same class of automatic judges is unreliable on sensitive but non-harmful content. No section quantifies what fraction of the 1,167 instances received full human review versus auto-only scoring. Because comic prompts are narrative-driven and often sit near refusal thresholds, this omission directly affects the reliability of the central claim.

minor comments (2)

[§3] The construction details for the 1,167 comic instances (exact prompt templates, image-generation pipeline, and how the five task setups were instantiated) are not described at a level that supports independent reproduction.
[Experimental setup] Exact model versions, API endpoints, and any sampling parameters used for the 15 MLLMs should be listed explicitly rather than referred to generically as “state-of-the-art.”

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive feedback on the evaluation methodology. We agree that greater transparency is needed regarding the split between automatic and human scoring, and we will revise the manuscript accordingly to address this concern directly.

read point-by-point responses

Referee: [Evaluation section] Evaluation methodology: The headline success rates (including >90% ensemble on commercial models) are obtained via automatic judging plus targeted human evaluation. The manuscript separately demonstrates that the same class of automatic judges is unreliable on sensitive but non-harmful content. No section quantifies what fraction of the 1,167 instances received full human review versus auto-only scoring. Because comic prompts are narrative-driven and often sit near refusal thresholds, this omission directly affects the reliability of the central claim.

Authors: We appreciate this observation and acknowledge that the original manuscript did not explicitly report the exact fraction of instances receiving human review. In our evaluation protocol, automatic judging was applied to all 1,167 instances, while targeted human evaluation (by two annotators with 94% agreement) was performed on a stratified random sample of 20% of the instances (234 total), with additional review of all borderline cases flagged by the automatic judge (approximately 8% more). We will add a dedicated paragraph in the Evaluation section (and a corresponding table) that states these numbers, reports per-model agreement rates between automatic and human labels (ranging from 82-91% on harmful instances), and provides separate success-rate breakdowns for the auto-only and human-reviewed subsets. Our Section 5 analysis of evaluator unreliability focuses on non-harmful sensitive content; on the harmful comic instances the automatic judge showed substantially higher alignment with human labels. These additions will make the methodology fully reproducible and directly address the concern about reliability near refusal thresholds. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark evaluation

full rationale

The paper introduces ComicJailbreak as a new dataset of 1,167 comic instances and measures attack success rates on 15 fixed MLLMs against baselines. All headline results are direct empirical counts from model outputs judged by automatic classifiers plus targeted human review. No equations, fitted parameters, self-definitional loops, or load-bearing self-citations appear in the derivation chain; the claims rest on external model behavior and a newly constructed test set rather than any reduction to the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review based solely on abstract; no mathematical derivations, free parameters, or postulated entities are described. The work is empirical evaluation of existing MLLMs against a new test set.

pith-pipeline@v0.9.0 · 5511 in / 1045 out tokens · 48368 ms · 2026-05-15T01:23:24.990026+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 7 internal anchors

[1]

Chang, Y.et al.A survey on evaluation of large language models.ACM transactions on intelligent systems and technology15, 1–45 (2024)

work page 2024
[2]

Laskar, M. T. R.et al.Rogers, A., Boyd-Graber, J. & Okazaki, N. (eds)A sys- tematic study and comprehensive evaluation of ChatGPT on benchmark datasets. (eds Rogers, A., Boyd-Graber, J. & Okazaki, N.)Findings of the Association for Computational Linguistics: ACL 2023, 431–469 (Association for Computa- tional Linguistics, Toronto, Canada, 2023). URL https:...

work page 2023
[3]

Wu, J., Gan, W., Chen, Z., Wan, S. & Yu, P. S. Multimodal Large Language Models: A Survey (2023). URL https://doi.ieeecomputersociety.org/10.1109/ BigData59044.2023.10386743

work page arXiv 2023
[4]

arXiv preprint arXiv:2401.13601(2024)

Zhang, D.et al.Mm-llms: Recent advances in multimodal large language models. arXiv preprint arXiv:2401.13601(2024)

work page arXiv 2024
[5]

Wang, J.et al.A comprehensive review of multimodal large language models: Per- formance and challenges across different tasks.arXiv preprint arXiv:2408.01319 (2024)

work page arXiv 2024
[6]

Ji, J.et al.Beavertails: Towards improved safety alignment of llm via a human- preference dataset.Advances in Neural Information Processing Systems36, 24678–24704 (2023)

work page 2023
[7]

Dai, J.et al.Safe rlhf: Safe reinforcement learning from human feedback.arXiv preprint arXiv:2310.12773(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[8]

(eds Leonardis, A.et al.) Computer Vision – ECCV 2024, 386–403 (Springer Nature Switzerland, Cham, 2025)

Liu, X.et al.Leonardis, A.et al.(eds)Mm-safetybench: A benchmark for safety evaluation of multimodal large language models. (eds Leonardis, A.et al.) Computer Vision – ECCV 2024, 386–403 (Springer Nature Switzerland, Cham, 2025)

work page 2024
[9]

Yi, S.et al.Jailbreak attacks and defenses against large language models: A survey.arXiv preprint arXiv:2407.04295(2024)

work page internal anchor Pith review arXiv 2024
[10]

& Srikumar, V

Zhang, H.et al.Ku, L.-W., Martins, A. & Srikumar, V. (eds)Jailbreak open- sourced large language models via enforced decoding. (eds Ku, L.-W., Martins, A. & Srikumar, V.)Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 5475–5493 (Associ- ation for Computational Linguistics, Bangkok, Thailand,...

work page 2024
[11]

& Narasimhan, K

Deshpande, A., Murahari, V., Rajpurohit, T., Kalyan, A. & Narasimhan, K. Bouamor, H., Pino, J. & Bali, K. (eds)Toxicity in chatgpt: Analyzing persona- assigned language models. (eds Bouamor, H., Pino, J. & Bali, K.)Findings of the Association for Computational Linguistics: EMNLP 2023, 1236–1270 (Associa- tion for Computational Linguistics, Singapore, 2023...

work page 2023
[12]

& Xing, X

Yu, J., Lin, X., Yu, Z. & Xing, X. Llm-fuzzer: Scaling assessment of large language model jailbreaks (2024). URL https://www.usenix.org/conference/ usenixsecurity24/presentation/yu-jiahao

work page 2024
[13]

Li, Y., Guo, H., Zhou, K., Zhao, W. X. & Wen, J.-R. Leonardis, A.et al.(eds) Images are achilles’ heel of alignment: Exploiting visual vulnerabilities for jail- breaking multimodal large language models. (eds Leonardis, A.et al.)Computer Vision – ECCV 2024, 174–189 (Springer Nature Switzerland, Cham, 2025)

work page 2024
[14]

URL https://doi.org/10.1609/aaai.v39i22.34568

Gong, Y.et al.Figstep: jailbreaking large vision-language models via typographic visual prompts (2025). URL https://doi.org/10.1609/aaai.v39i22.34568

work page doi:10.1609/aaai.v39i22.34568 2025
[15]

Lifting the veil on visual information flow in mllms: Unlocking pathways to faster inference

Yang, Z.et al.Distraction is all you need for multimodal large language model jail- breaking (2025). URL https://doi.ieeecomputersociety.org/10.1109/CVPR52734. 2025.00884

work page doi:10.1109/cvpr52734 2025
[16]

Zhang, D.et al.Sequential comics for jailbreaking multimodal large lan- guage models via structured visual storytelling.arXiv preprint arXiv:2510.15068 (2025)

work page arXiv 2025
[17]

You, W.et al.Mirage: Multimodal immersive reasoning and guided exploration for red-team jailbreak attacks.arXiv preprint arXiv:2503.19134(2025)

work page arXiv 2025
[18]

& Xiao, C

Wang, Y., Liu, X., Li, Y., Chen, M. & Xiao, C. Adashield: Safeguarding mul- timodal large language models from structure-based attack via adaptive shield prompting.arXiv preprint arXiv:2403.09513(2024)

work page arXiv 2024
[19]

& Fang, Y

Li, C., Wang, H. & Fang, Y. Christodoulopoulos, C., Chakraborty, T., Rose, C. & Peng, V. (eds)Attack as defense: Safeguarding large vision-language models from jailbreaking by adversarial attacks. (eds Christodoulopoulos, C., Chakraborty, T., Rose, C. & Peng, V.)Findings of the Association for Computational Linguistics: EMNLP 2025, 20138–20152 (Associatio...

work page 2025
[20]

& Rahwan, T

Liu, F., AlDahoul, N., Eady, G., Zaki, Y. & Rahwan, T. Self-reflection makes large language models safer, less biased, and ideologically neutral.arXiv preprint arXiv:2406.10400(2024)

work page arXiv 2024
[21]

Souly, A.et al.A strongreject for empty jailbreaks.Advances in Neural Information Processing Systems37, 125416–125440 (2024). 33

work page 2024
[22]

Chao, P.et al.Jailbreakbench: An open robustness benchmark for jailbreaking large language models.Advances in Neural Information Processing Systems37, 55005–55029 (2024)

work page 2024
[23]

& Xiao, C

Luo, W., Ma, S., Liu, X., Guo, X. & Xiao, C. Jailbreakv: A benchmark for assessing the robustness of multimodal large language models against jailbreak attacks.arXiv preprint arXiv:2404.03027(2024)

work page arXiv 2024
[24]

URL https://openai.com/index/chatgpt/

OpenAI (2022). URL https://openai.com/index/chatgpt/

work page 2022
[25]

Comanici, G.et al.Gemini 2.5: Pushing the frontier with advanced reason- ing, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[26]

Yang, A.et al.Qwen3 technical report.arXiv preprint arXiv:2505.09388(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[27]

Team, G.et al.Gemma 3 technical report.arXiv preprint arXiv:2503.19786 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[28]

Dubey, A.et al.The llama 3 herd of models.arXiv e-printsarXiv–2407 (2024)

work page 2024
[29]

Qwen2.5-VL Technical Report

Bai, S.et al.Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[30]

& Pilehvar, M

Zhou, Y.et al.Che, W., Nabende, J., Shutova, E. & Pilehvar, M. T. (eds)Don’t say no: Jailbreaking LLM by suppressing refusal. (eds Che, W., Nabende, J., Shutova, E. & Pilehvar, M. T.)Findings of the Association for Computational Linguistics: ACL 2025, 25224–25249 (Association for Computational Linguistics, Vienna, Austria, 2025). URL https://aclanthology....

work page 2025
[31]

Mazeika, M.et al.Harmbench: A standardized evaluation framework for automated red teaming and robust refusal.arXiv preprint arXiv:2402.04249 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[32]

& Steinhardt, J

Wei, A., Haghtalab, N. & Steinhardt, J. Jailbroken: How does llm safety train- ing fail?Advances in Neural Information Processing Systems36, 80079–80110 (2023)

work page 2023
[33]

Jailbreaking leading safety-aligned llms with simple adaptive attacks

Andriushchenko, M., Croce, F. & Flammarion, N. Jailbreaking leading safety- aligned llms with simple adaptive attacks.arXiv preprint arXiv:2404.02151 (2024)

work page arXiv 2024
[34]

& Zhang, T

Radford, A.et al.Meila, M. & Zhang, T. (eds)Learning transferable visual models from natural language supervision. (eds Meila, M. & Zhang, T.)Proceedings of the 38th International Conference on Machine Learning, Vol. 139 ofProceedings of Machine Learning Research, 8748–8763 (PMLR, 2021). URL https://proceedings. mlr.press/v139/radford21a.html. 34 (a) (b) ...

work page 2021