DMN: A Compositional Framework for Jailbreaking Multimodal LLMs with Multi-Image Inputs
Pith reviewed 2026-05-20 10:15 UTC · model grok-4.3
The pith
DMN distributes jailbreak instructions across multiple images and reports attack success rates above 90% on GPT-4o, Gemini-2.5-pro, and Claude Sonnet 4.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The DMN framework combines distributed instruction across multiple images, multimodal evidence, and a number chain task to expand the attack space and distract safety mechanisms, producing attack success rates over 90 percent on GPT-4o, Gemini-2.5-pro, and Claude Sonnet 4 while outperforming single-image methods.
What carries the argument
The DMN compositional strategy that splits the jailbreak into Distributed instruction (D), Multimodal evidence (M), and Number chain task (N) applied simultaneously to several images.
If this is right
- Multi-image input capability in MLLMs creates exploitable gaps because alignment work has not kept pace with that feature.
- Spreading instructions and evidence across images bypasses filters that single-image attacks cannot overcome.
- Adding a distracting visual reasoning task reduces the model's focus on detecting harmful intent.
- Compositional multi-image attacks reach success rates above 90 percent on major models where earlier methods do not.
- Current safety mechanisms contain fundamental weaknesses when inputs arrive as coordinated sets of images.
Where Pith is reading between the lines
- Safety training pipelines for MLLMs should add routine multi-image adversarial testing as a standard step.
- Defenses could check consistency of intent across the full set of images rather than treating each image in isolation.
- Similar distribution tactics might transfer to other input formats such as video or interleaved text-image sequences.
- Deployment policies for MLLMs may need updated risk assessments that treat multi-image support as an elevated attack vector.
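The set-level consistency idea in the list above can be sketched as follows. This is a minimal illustration, not a method from the paper: it assumes text has already been extracted from each image upstream (for example by OCR or captioning), and `score` is a hypothetical harm classifier supplied by the caller that returns a value in [0, 1]. The function name and the threshold are also assumptions.

```python
from typing import Callable, Dict, Sequence


def set_level_check(
    image_texts: Sequence[str],
    score: Callable[[str], float],  # hypothetical harm scorer, not a real API
    threshold: float = 0.5,         # assumed decision threshold
) -> Dict[str, object]:
    """Compare per-image harm scores with the score of the whole image set."""
    # Score each image's extracted text in isolation.
    per_image = [score(text) for text in image_texts]
    per_image_max = max(per_image, default=0.0)

    # Score the concatenation, so intent split across images is seen together.
    combined = score(" ".join(image_texts))

    return {
        "per_image_max": per_image_max,
        "combined": combined,
        # Flag the request whenever the full set crosses the threshold.
        "flag": combined >= threshold,
        # True when only the combined view crosses it: a distributed request.
        "distributed": combined >= threshold and per_image_max < threshold,
    }


def toy_score(text: str) -> float:
    """Toy stand-in scorer: fires only when both placeholder tokens co-occur."""
    return 1.0 if ("part_a" in text and "part_b" in text) else 0.0


result = set_level_check(["part_a here", "and part_b there"], toy_score)
```

With the toy scorer, neither image is flagged on its own, but the combined view is. That is the gap an isolation-only filter would miss.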
Load-bearing premise
That MLLMs have received far less safety alignment for multi-image inputs than for single-image or text inputs.
What would settle it
Fine-tune or retrain one of the tested MLLMs with explicit multi-image safety examples and then measure whether DMN still reaches above 50 percent attack success rate.
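One way to make the 50 percent threshold concrete is to report a confidence interval on the measured attack success rate rather than a point estimate. The sketch below uses the Wilson score interval. The function names, the 95% level (`z = 1.96`), and the example counts are illustrative assumptions, not values from the paper.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials + z * z / (4.0 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def exceeds_threshold(successes: int, trials: int, threshold: float = 0.5) -> bool:
    """True only if the lower bound of the interval is above the threshold."""
    low, _ = wilson_interval(successes, trials)
    return low > threshold


# Hypothetical counts for illustration only: 90 successful attacks in 100 queries.
low, high = wilson_interval(90, 100)
still_above_half = exceeds_threshold(90, 100)
```

Requiring the lower bound to clear the threshold is a conservative choice: a small query set with a point estimate just above 50 percent would not count as settling the question.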
Original abstract
Multimodal Large Language Models (MLLMs) are vulnerable to jailbreak attacks, which can elicit harmful responses from them. Many MLLMs support multi-image inputs, inadvertently introducing new vulnerabilities because less effort has gone into multi-image safety alignment. Previous MLLM jailbreak methods use only a single image, which restricts the attack space: they cannot distribute harmful requests across multiple images, carry abundant information, or exploit additional visual reasoning tasks to distract MLLMs. To address these limitations, in this paper, we propose a compositional jailbreak framework, DMN, which leverages Distributed instruction, Multimodal evidence, and a Number chain task to enhance jailbreak performance. Extensive experiments show that DMN is highly effective for MLLM jailbreaking, e.g., achieving attack success rates of over 90% on GPT-4o, Gemini-2.5-pro, and Claude Sonnet 4, surpassing other baselines by a large margin. This compositional, multi-image jailbreak strategy reveals fundamental weaknesses in the safety mechanisms of these models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes DMN, a compositional jailbreak framework for multimodal LLMs that distributes harmful instructions across multiple images, incorporates multimodal evidence, and adds a number-chain distraction task. It reports attack success rates exceeding 90% on GPT-4o, Gemini-2.5-pro, and Claude Sonnet 4, substantially outperforming prior single-image baselines, and attributes new vulnerabilities to insufficient multi-image safety alignment.
Significance. If the empirical results are robust and properly controlled, the work would usefully document a new attack surface arising from multi-image input support and demonstrate that compositional multi-image strategies can materially increase jailbreak effectiveness beyond single-image methods.
major comments (2)
- [Abstract / Experiments] Abstract and experimental sections: the central claim that DMN 'surpasses other baselines by a large margin' requires explicit confirmation that baseline methods were re-implemented with the same number of images and total visual tokens as DMN; without this control, the reported gains cannot be attributed to the compositional elements (distributed instruction, multimodal evidence, number-chain task) rather than simply the use of multiple images, which the abstract itself links to reduced safety alignment effort.
- [Abstract] Abstract: the reported attack success rates above 90% are stated without any accompanying experimental protocol, number of queries, success criteria definition, statistical tests, or failure-case analysis, rendering the quantitative claims impossible to evaluate from the provided text.
minor comments (1)
- [Experiments] Ensure that all baseline methods are described with the exact image count and prompt format used in the DMN evaluation to allow direct comparison.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the paper.
Point-by-point responses
- Referee: [Abstract / Experiments] Abstract and experimental sections: the central claim that DMN 'surpasses other baselines by a large margin' requires explicit confirmation that baseline methods were re-implemented with the same number of images and total visual tokens as DMN; without this control, the reported gains cannot be attributed to the compositional elements (distributed instruction, multimodal evidence, number-chain task) rather than simply the use of multiple images, which the abstract itself links to reduced safety alignment effort.
  Authors: We agree that a controlled comparison is necessary to isolate the contribution of the compositional elements. In the revised manuscript, we will re-implement the baseline methods using the same number of images and matched total visual tokens as DMN. Updated results and analysis will be added to the Experiments section and referenced in the abstract to ensure the performance gains are clearly attributable to distributed instructions, multimodal evidence, and the number-chain task rather than multi-image input alone.
  Revision: yes
- Referee: [Abstract] Abstract: the reported attack success rates above 90% are stated without any accompanying experimental protocol, number of queries, success criteria definition, statistical tests, or failure-case analysis, rendering the quantitative claims impossible to evaluate from the provided text.
  Authors: We acknowledge that the abstract should be more self-contained. In the revision, we will add a brief description of the experimental protocol, including the number of queries evaluated, the definition of attack success criteria, reference to statistical tests, and mention of failure-case analysis. These details are already provided in the Experiments section; the abstract will now summarize them concisely to improve transparency while respecting length limits.
  Revision: yes
Circularity Check
No circularity: empirical ASR results on external models
The paper defines the DMN compositional framework (distributed instruction, multimodal evidence, number-chain task) and reports attack success rates measured directly on commercial MLLMs (GPT-4o, Gemini-2.5-pro, Claude Sonnet 4). These are independent external benchmarks with no internal equations, fitted parameters, or self-referential quantities. No derivation chain exists that reduces a claimed result to its own inputs by construction. Baseline comparisons and multi-image usage are methodological choices, not circular reductions. The central claim remains falsifiable via replication on the same external models.