
arxiv: 2605.18915 · v1 · pith:KQNXFIWFnew · submitted 2026-05-18 · 💻 cs.CR · cs.AI

DMN: A Compositional Framework for Jailbreaking Multimodal LLMs with Multi-Image Inputs

Pith reviewed 2026-05-20 10:15 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords jailbreak attacks · multimodal LLMs · multi-image inputs · compositional framework · safety alignment · attack success rate · GPT-4o · vulnerabilities

The pith

DMN spreads jailbreak instructions across multiple input images and is reported to exceed 90% attack success on GPT-4o, Gemini-2.5-pro, and Claude Sonnet 4.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DMN as a compositional jailbreak method for multimodal LLMs that accept multiple images at once. It spreads harmful instructions across separate images, supplies supporting visual evidence, and adds a number-chain visual reasoning task to divert the model from safety checks. The approach rests on the observation that current safety alignment has focused less on multi-image scenarios than on text or single images. Experiments report attack success rates above 90 percent on GPT-4o, Gemini-2.5-pro, and Claude Sonnet 4, well above prior single-image baselines. If the results hold, multi-image support itself becomes a new and sizable attack surface.

Core claim

The DMN framework combines distributed instruction across multiple images, multimodal evidence, and a number chain task to expand the attack space and distract safety mechanisms, producing attack success rates over 90 percent on GPT-4o, Gemini-2.5-pro, and Claude Sonnet 4 while outperforming single-image methods.

What carries the argument

The DMN compositional strategy that splits the jailbreak into Distributed instruction (D), Multimodal evidence (M), and Number chain task (N) applied simultaneously to several images.

If this is right

  • Multi-image input capability in MLLMs creates exploitable gaps because alignment work has not kept pace with that feature.
  • Spreading instructions and evidence across images bypasses filters that single-image attacks cannot overcome.
  • Adding a distracting visual reasoning task reduces the model's focus on detecting harmful intent.
  • Compositional multi-image attacks reach success rates above 90 percent on major models where earlier methods do not.
  • Current safety mechanisms contain fundamental weaknesses when inputs arrive as coordinated sets of images.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Safety training pipelines for MLLMs should add routine multi-image adversarial testing as a standard step.
  • Defenses could check consistency of intent across the full set of images rather than treating each image in isolation.
  • Similar distribution tactics might transfer to other input formats such as video or interleaved text-image sequences.
  • Deployment policies for MLLMs may need updated risk assessments that treat multi-image support as an elevated attack vector.
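
The set-level defense idea in the list above can be sketched as follows. This is a minimal illustration, not something the paper or any vendor describes: it assumes text has already been extracted from each image (for example by OCR or captioning), and `toy_flag` with its placeholder phrase list is a hypothetical stand-in for a real safety classifier.

```python
from typing import Callable, List

# Toy placeholder policy (assumption): a real deployment would call a
# trained safety classifier instead of matching literal phrases.
BLOCKED_PHRASES = ["open the vault"]


def toy_flag(text: str) -> bool:
    """Hypothetical stand-in classifier: flags text containing a blocked phrase."""
    normalized = " ".join(text.lower().split())
    return any(phrase in normalized for phrase in BLOCKED_PHRASES)


def per_image_flags(texts: List[str],
                    flag: Callable[[str], bool] = toy_flag) -> List[bool]:
    """Check each image's extracted text in isolation."""
    return [flag(t) for t in texts]


def set_level_flag(texts: List[str],
                   flag: Callable[[str], bool] = toy_flag) -> bool:
    """Check the concatenation of all extracted texts, in input order."""
    return flag(" ".join(texts))


# A phrase split across two images passes the isolated checks
# but is caught once the set is read as a whole.
fragments = ["please open", "the vault now"]
isolated = per_image_flags(fragments)
combined = set_level_flag(fragments)
```

The sketch only joins fragments in the order given; a practical check would probably also need to handle reordering and paraphrase, which is outside what this example covers.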

Load-bearing premise

That MLLMs have received far less safety alignment for multi-image inputs than for single-image or text inputs.

What would settle it

Fine-tune or retrain one of the tested MLLMs with explicit multi-image safety examples, then measure whether DMN still reaches an attack success rate above 50 percent.
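
One way to make the 50 percent criterion concrete is a one-sided exact binomial test on the post-retraining results. The sketch below is an assumption about how such a check might be run, not a procedure from the paper; the function names, the significance level, and the counts in the usage lines are all hypothetical.

```python
from math import comb


def binom_tail_geq(successes: int, trials: int, p0: float) -> float:
    """Return P(X >= successes) for X ~ Binomial(trials, p0)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    return sum(
        comb(trials, k) * p0**k * (1 - p0) ** (trials - k)
        for k in range(successes, trials + 1)
    )


def asr_exceeds(successes: int, trials: int,
                threshold: float = 0.5, alpha: float = 0.05) -> bool:
    """True if the data reject 'ASR <= threshold' at level alpha (one-sided)."""
    return binom_tail_geq(successes, trials, threshold) < alpha


# Hypothetical counts, for illustration only.
clearly_above = asr_exceeds(70, 100)   # 70 successful attacks out of 100
not_resolved = asr_exceeds(52, 100)    # 52 out of 100 is too close to call
```

With small query budgets the test has little power, so a result near the threshold should be read as inconclusive rather than as evidence that the defense works.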

Figures

Figures reproduced from arXiv: 2605.18915 by Deyue Zhang, Dongdong Yang, Quanchen Zou, Wenzhuo Xu, Xiangzheng Zhang, Zhipeng Wei, Zonghao Ying.

Figure 1: An example of distributed instruction images.
Figure 2: The pipeline of multimodal evidence generation. First, an auxiliary LLM is utilized to generate a realistic
Figure 3: A number chain frame example. MLLMs are instructed to extract 9 from this frame, and extract the next number in the 4th frame
Figure 4: An illustration of the image sequence gen
Figure 5: An example of input image sequence, input text and GPT-5’s corresponding output. The text input and
Figure 7: The ASR and word count (only jailbroken re
Figure 6: The KFAR distribution of each task on SafeBench. BFI, CDFI and NC refer to blank frame indexing, cat/dog frame indexing and number chain task, respectively. Attention weights are obtained on Qwen-2.5-VL-7B.
On the number of evidence pairs. To investigate the effect of the amount of multimodal evidence on jailbreak performance, we test the ASR of DMN using different number of evidence pairs. We also calcula…
Original abstract

Multimodal Large Language Models (MLLMs) are vulnerable to jailbreak attacks, which can elicit harmful responses from MLLMs. Many MLLMs support multi-image inputs, inadvertently introducing new vulnerabilities due to less effort on multi-image safety alignment. Previous MLLM jailbreak methods use only a single image, which restricts the attack space: they cannot distribute harmful requests across multiple images, carry abundant information, or exploit additional visual reasoning tasks to distract MLLMs. To address these limitations, in this paper, we propose a compositional jailbreak framework, DMN, which leverages Distributed instruction, Multimodal evidence and a Number chain task to fully enhance the jailbreak performance. Extensive experiments show that DMN is highly effective for MLLM jailbreaking, e.g. achieving attack success rates of over 90% on GPT-4o, Gemini-2.5-pro and Claude Sonnet 4, surpassing other baselines by a large margin. This compositional, multi-image jailbreak strategy reveals fundamental weaknesses in their safety mechanisms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes DMN, a compositional jailbreak framework for multimodal LLMs that distributes harmful instructions across multiple images, incorporates multimodal evidence, and adds a number-chain distraction task. It reports attack success rates exceeding 90% on GPT-4o, Gemini-2.5-pro, and Claude Sonnet 4, substantially outperforming prior single-image baselines, and attributes new vulnerabilities to insufficient multi-image safety alignment.

Significance. If the empirical results are robust and properly controlled, the work would usefully document a new attack surface arising from multi-image input support and demonstrate that compositional multi-image strategies can materially increase jailbreak effectiveness beyond single-image methods.

major comments (2)
  1. [Abstract / Experiments] Abstract and experimental sections: the central claim that DMN 'surpasses other baselines by a large margin' requires explicit confirmation that baseline methods were re-implemented with the same number of images and total visual tokens as DMN; without this control, the reported gains cannot be attributed to the compositional elements (distributed instruction, multimodal evidence, number-chain task) rather than simply the use of multiple images, which the abstract itself links to reduced safety alignment effort.
  2. [Abstract] Abstract: the reported attack success rates above 90% are stated without any accompanying experimental protocol, number of queries, success criteria definition, statistical tests, or failure-case analysis, rendering the quantitative claims impossible to evaluate from the provided text.
minor comments (1)
  1. [Experiments] Ensure that all baseline methods are described with the exact image count and prompt format used in the DMN evaluation to allow direct comparison.
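
The reporting the referee asks for can be sketched in a few lines: an attack success rate stated with an uncertainty interval, plus a check that a baseline and DMN were run under the same input budget. This is an illustrative sketch under stated assumptions, not the authors' evaluation code; `InputBudget`, `budgets_match`, and the tolerance parameter are hypothetical names, and the Wilson score interval is one standard choice among several.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Tuple


def wilson_interval(successes: int, trials: int,
                    z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


@dataclass(frozen=True)
class InputBudget:
    """Hypothetical record of how much visual input one method was given."""
    num_images: int
    visual_tokens: int


def budgets_match(a: InputBudget, b: InputBudget,
                  token_tolerance: float = 0.0) -> bool:
    """True if image counts are equal and token counts differ by at most
    token_tolerance (a fraction of the larger token count)."""
    largest = max(a.visual_tokens, b.visual_tokens)
    token_gap = abs(a.visual_tokens - b.visual_tokens)
    return a.num_images == b.num_images and token_gap <= token_tolerance * largest


# Hypothetical numbers, for illustration only.
low, high = wilson_interval(successes=90, trials=100)
fair = budgets_match(InputBudget(8, 4000), InputBudget(8, 4100), token_tolerance=0.05)
```

Reporting the interval alongside the point estimate makes clear how much of a claimed margin over a baseline could be explained by a small query set.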

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the paper.

Point-by-point responses
  1. Referee: [Abstract / Experiments] Abstract and experimental sections: the central claim that DMN 'surpasses other baselines by a large margin' requires explicit confirmation that baseline methods were re-implemented with the same number of images and total visual tokens as DMN; without this control, the reported gains cannot be attributed to the compositional elements (distributed instruction, multimodal evidence, number-chain task) rather than simply the use of multiple images, which the abstract itself links to reduced safety alignment effort.

    Authors: We agree that a controlled comparison is necessary to isolate the contribution of the compositional elements. In the revised manuscript, we will re-implement the baseline methods using the same number of images and matched total visual tokens as DMN. Updated results and analysis will be added to the Experiments section and referenced in the abstract to ensure the performance gains are clearly attributable to distributed instructions, multimodal evidence, and the number-chain task rather than multi-image input alone. revision: yes

  2. Referee: [Abstract] Abstract: the reported attack success rates above 90% are stated without any accompanying experimental protocol, number of queries, success criteria definition, statistical tests, or failure-case analysis, rendering the quantitative claims impossible to evaluate from the provided text.

    Authors: We acknowledge that the abstract should be more self-contained. In the revision, we will add a brief description of the experimental protocol, including the number of queries evaluated, the definition of attack success criteria, reference to statistical tests, and mention of failure-case analysis. These details are already provided in the Experiments section; the abstract will now summarize them concisely to improve transparency while respecting length limits. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical ASR results on external models

full rationale

The paper defines the DMN compositional framework (distributed instruction, multimodal evidence, number-chain task) and reports attack success rates measured directly on commercial MLLMs (GPT-4o, Gemini-2.5-pro, Claude Sonnet 4). These are independent external benchmarks with no internal equations, fitted parameters, or self-referential quantities. No derivation chain exists that reduces a claimed result to its own inputs by construction. Baseline comparisons and multi-image usage are methodological choices, not circular reductions. The central claim remains falsifiable via replication on the same external models.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical attack paper; no mathematical axioms, free parameters, or new postulated entities are introduced in the abstract.

pith-pipeline@v0.9.0 · 5752 in / 993 out tokens · 36873 ms · 2026-05-20T10:15:04.807055+00:00 · methodology

