PRISM: Programmatic Reasoning with Image Sequence Manipulation for LVLM Jailbreaking
Pith reviewed 2026-05-19 03:32 UTC · model grok-4.3
The pith
LVLMs can be jailbroken by splitting harmful requests into sequences of individually benign images that the model assembles during reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Decomposing harmful instructions into sequences of individually benign visual gadgets and directing their integration via textual prompts causes the malicious intent to emerge from the model's compositional reasoning, evading detection from any single component and producing high attack success rates on state-of-the-art LVLMs.
What carries the argument
The PRISM framework that decomposes harmful instructions into sequences of benign visual gadgets and uses a directing textual prompt to force their integration during reasoning, modeled on return-oriented programming chains.
If this is right
- Safety alignments that scan single prompts or images fail against attacks that rely on multi-step composition.
- Attack success rates exceed 0.90 on SafeBench and improve by as much as 0.39 over prior baselines.
- Defenses must monitor the full reasoning chain rather than individual inputs.
- The same compositional vulnerability appears across popular LVLMs tested on MM-SafetyBench.
Where Pith is reading between the lines
- The technique could extend to other sequential reasoning systems if they accept mixed image and text inputs over multiple turns.
- Alignment training may need to include examples of benign components that become harmful only when assembled.
- Developers could add runtime checks that detect when a sequence of safe-looking inputs is being directed toward a single coherent output.
Load-bearing premise
LVLMs will integrate the sequence of benign visual gadgets through their reasoning process to produce a coherent and harmful output that evades detection from any single component.
What would settle it
Run the sequence of benign visual gadgets and directing prompt on a target LVLM and observe whether the model refuses the request, produces unrelated output, or fails to compose the elements into the intended harmful response.
Figures
read the original abstract
The increasing sophistication of large vision-language models (LVLMs) has been accompanied by advances in safety alignment mechanisms designed to prevent harmful content generation. However, these defenses remain vulnerable to sophisticated adversarial attacks. Existing jailbreak methods typically rely on direct and semantically explicit prompts, overlooking subtle vulnerabilities in how LVLMs compose information over multiple reasoning steps. In this paper, we propose a novel and effective jailbreak framework inspired by Return-Oriented Programming (ROP) techniques from software security. Our approach decomposes a harmful instruction into a sequence of individually benign visual gadgets. A carefully engineered textual prompt directs the sequence of inputs, prompting the model to integrate the benign visual gadgets through its reasoning process to produce a coherent and harmful output. This makes the malicious intent emergent and difficult to detect from any single component. We validate our method through extensive experiments on established benchmarks including SafeBench and MM-SafetyBench, targeting popular LVLMs. Results show that our approach consistently and substantially outperforms existing baselines on state-of-the-art models, achieving near-perfect attack success rates (over 0.90 on SafeBench) and improving ASR by up to 0.39. Our findings reveal a critical and underexplored vulnerability that exploits the compositional reasoning abilities of LVLMs, highlighting the urgent need for defenses that secure the entire reasoning process.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes PRISM, an ROP-inspired jailbreak for LVLMs that decomposes harmful instructions into sequences of individually benign visual gadgets steered by a textual prompt so that the model integrates them into coherent harmful outputs. Experiments on SafeBench and MM-SafetyBench report ASR > 0.90 on SafeBench and gains of up to 0.39 over baselines on state-of-the-art LVLMs.
Significance. If the empirical results hold, the work identifies a concrete vulnerability in how LVLMs perform multi-step compositional reasoning over image sequences, showing that safety filters can be evaded when harm emerges only from integration rather than any single input. This could motivate new defenses that audit cumulative context rather than isolated components.
major comments (2)
- [Abstract] Abstract: the central claim that the method produces 'coherent and harmful output' while evading detection rests on the untested assumption that LVLMs integrate the benign gadget sequence exactly as the prompt intends without safety mechanisms operating on cumulative context; no gadget-construction algorithm, per-gadget safety audit, or ablation that removes the integration prompt is supplied.
- [Experiments] The reported ASR > 0.90 and improvements of up to 0.39 are presented without full experimental details, baseline comparisons, or controls for confounds such as prompt length or image selection bias, leaving the strength of evidence for the compositional-reasoning vulnerability only moderately supported.
minor comments (2)
- [Method] Clarify the precise definition of 'benign' for each gadget and how it was verified against the target LVLMs' safety classifiers.
- [Evaluation] Add a table or figure showing per-model ASR with and without the textual integration prompt to isolate its contribution.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. The comments raise important points about the clarity of claims in the abstract and the completeness of experimental reporting. We address each major comment below and indicate the specific revisions we will make to strengthen the paper.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim that the method produces 'coherent and harmful output' while evading detection rests on the untested assumption that LVLMs integrate the benign gadget sequence exactly as the prompt intends without safety mechanisms operating on cumulative context; no gadget-construction algorithm, per-gadget safety audit, or ablation that removes the integration prompt is supplied.
Authors: We agree that the abstract is high-level and that additional supporting evidence would strengthen the central claim. The full manuscript describes the gadget-construction algorithm in Section 3.2 as a programmatic decomposition inspired by ROP, with explicit steps for breaking down harmful instructions into benign visual components. To address the integration assumption and lack of ablation, we will add a dedicated ablation study in the revised Experiments section that removes the integration prompt and reports the resulting drop in ASR. We will also include a per-gadget safety audit with quantitative results showing activation rates of safety filters on individual gadgets versus the full sequence. These additions will provide direct empirical support for the emergence of harm through compositional reasoning. revision: yes
-
Referee: [Experiments] The reported ASR > 0.90 and improvements of up to 0.39 are presented without full experimental details, baseline comparisons, or controls for confounds such as prompt length or image selection bias, leaving the strength of evidence for the compositional-reasoning vulnerability only moderately supported.
Authors: We acknowledge that the current experimental presentation would benefit from greater transparency. In the revised manuscript, we will expand the Experiments section to include full details on the setup, including exact prompt lengths, image selection criteria and randomization procedures, and explicit controls for confounds such as length bias and selection effects. We will also augment the baseline comparisons with additional methods and report results with standard deviations or confidence intervals across repeated trials. These changes will provide stronger, more robust evidence for the identified vulnerability. revision: yes
Circularity Check
No circularity: empirical attack construction evaluated on external benchmarks
full rationale
The paper proposes an empirical jailbreak method (PRISM) that decomposes harmful instructions into sequences of individually benign visual gadgets, then uses a textual prompt to steer LVLMs into integrating them into coherent harmful outputs. Claims rest on experimental results showing ASR >0.90 on SafeBench and gains up to 0.39 over baselines. No mathematical derivations, equations, fitted parameters, or self-citations appear in the provided text that would reduce any reported success rate to a quantity defined by the method itself. Evaluation uses external benchmarks (SafeBench, MM-SafetyBench) and comparisons to prior baselines, keeping the contribution independent of its own inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption LVLMs integrate information across multiple reasoning steps involving sequences of images and text in a manner that can produce emergent harmful outputs from benign components.
Forward citations
Cited by 1 Pith paper
-
SecureWebArena: A Holistic Security Evaluation Benchmark for LVLM-based Web Agents
SecureWebArena is a new benchmark suite for holistic security evaluation of LVLM-based web agents using diverse simulated environments, attack taxonomies, and multi-layered failure analysis across reasoning, behavior,...
Reference graph
Works this paper leans on
-
[1]
, " * write output.state after.block = add.period write newline
ENTRY address archivePrefix author booktitle chapter edition editor eid eprint howpublished institution isbn journal key month note number organization pages publisher school series title type volume year label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block FUNCTION init.state.consts #0 'before.a...
-
[2]
" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...
-
[3]
Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F. L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[4]
Aliyun (Alibaba Cloud) . 2025. How to Use Vision Models in Model Studio. https://help.aliyun.com/zh/model-studio/vision. Accessed: 2025‑07‑15; last updated: 2025‑06‑13
work page 2025
-
[5]
Anthropic . 2024. Claude 3.5 Sonnet: First in the next generation of Claude models. Accessed: 2025-06-30
work page 2024
-
[6]
Bierbaumer, B.; Kirsch, J.; Kittel, T.; Francillon, A.; and Zarras, A. 2018. Smashing the stack protector for fun and profit. In ICT Systems Security and Privacy Protection: 33rd IFIP TC 11 International Conference, SEC 2018, Held at the 24th IFIP World Computer Congress, WCC 2018, Poznan, Poland, September 18-20, 2018, Proceedings 33, 293--306. Springer
work page 2018
-
[7]
ByteDance Seed Team . 2025. Seedream 3.0: Next‑Gen Text‑to‑Image Model. https://seed.bytedance.com/en/tech/seedream3_0. Accessed: 2025‑06‑30
work page 2025
-
[8]
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
Chao, P.; Debenedetti, E.; Robey, A.; Andriushchenko, M.; Croce, F.; Sehwag, V.; Dobriban, E.; Flammarion, N.; Pappas, G. J.; Tramer, F.; et al. 2024. Jailbreakbench: An open robustness benchmark for jailbreaking large language models. arXiv preprint arXiv:2404.01318
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[9]
Dong, Y.; Liu, Z.; Sun, H.-L.; Yang, J.; Hu, W.; Rao, Y.; and Liu, Z. 2025. Insight-v: Exploring long-chain visual reasoning with multimodal large language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, 9062--9072
work page 2025
-
[10]
Esser, P.; Kulal, S.; Blattmann, A.; Entezari, R.; M \"u ller, J.; Saini, H.; Levi, Y.; Lorenz, D.; Sauer, A.; Boesel, F.; et al. 2024. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first international conference on machine learning
work page 2024
-
[11]
Fang, W.; Wu, Q.; Chen, J.; and Xue, Y. 2025. guided MLLM Reasoning: Enhancing MLLM with Knowledge and Visual Notes for Visual Question Answering. In Proceedings of the Computer Vision and Pattern Recognition Conference, 19597--19607
work page 2025
-
[12]
Gong, Y.; Ran, D.; Liu, J.; Wang, C.; Cong, T.; Wang, A.; Duan, S.; and Wang, X. 2025. Figstep: Jailbreaking large vision-language models via typographic visual prompts. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, 23951--23959
work page 2025
-
[13]
Gou, Y.; Chen, K.; Liu, Z.; Hong, L.; Xu, H.; Li, Z.; Yeung, D.-Y.; Kwok, J. T.; and Zhang, Y. 2024. Eyes closed, safety on: Protecting multimodal llms via image-to-text transformation. In European Conference on Computer Vision, 388--404. Springer
work page 2024
-
[14]
Hurst, A.; Lerer, A.; Goucher, A. P.; Perelman, A.; Ramesh, A.; Clark, A.; Ostrow, A.; Welihinda, A.; Hayes, A.; Radford, A.; et al. 2024. Gpt-4o system card. arXiv preprint arXiv:2410.21276
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[15]
Intel Corporation . 2023. Intel 64 and IA-32 Architectures Software Developer’s Manual . Volume 1: Basic Architecture. Chapter 3: System Architecture Overview
work page 2023
-
[16]
Kuang, J.; Shen, Y.; Xie, J.; Luo, H.; Xu, Z.; Li, R.; Li, Y.; Cheng, X.; Lin, X.; and Han, Y. 2025. Natural language understanding and inference with mllm in visual question answering: A survey. ACM Computing Surveys, 57(8): 1--36
work page 2025
-
[17]
Li, Y.; Guo, H.; Zhou, K.; Zhao, W. X.; and Wen, J.-R. 2024. Images are achilles’ heel of alignment: Exploiting visual vulnerabilities for jailbreaking multimodal large language models. In European Conference on Computer Vision, 174--189. Springer
work page 2024
-
[18]
Liu, H.; Li, C.; Li, Y.; and Lee, Y. J. 2023. Improved Baselines with Visual Instruction Tuning. arXiv:2310.03744
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[19]
Liu, S.; Cheng, H.; Liu, H.; Zhang, H.; Li, F.; Ren, T.; Zou, X.; Yang, J.; Su, H.; Zhu, J.; et al. 2024 a . Llava-plus: Learning to use tools for creating multimodal agents. In European Conference on Computer Vision, 126--142. Springer
work page 2024
-
[20]
Liu, X.; Zhu, Y.; Gu, J.; Lan, Y.; Yang, C.; and Qiao, Y. 2024 b . Mm-safetybench: A benchmark for safety evaluation of multimodal large language models. In European Conference on Computer Vision, 386--403. Springer
work page 2024
-
[21]
Lu, J.; Srivastava, S.; Chen, J.; Shrestha, R.; Acharya, M.; Kafle, K.; and Kanan, C. 2025. Revisiting multi-modal llm evaluation. In Proceedings of the Computer Vision and Pattern Recognition Conference, 555--564
work page 2025
- [22]
-
[23]
Meta AI . 2024. Llama 3.2: Connect 2024 — Vision & Edge Mobile Devices. https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/. Accessed: 2025-06-30
work page 2024
-
[24]
Meta AI . 2025. LLaMA Use Policy. https://ai.meta.com/llama/use-policy/. Accessed: 2025-06-30
work page 2025
-
[25]
Microsoft Corporation . 2006. Data Execution Prevention. https://learn.microsoft.com/en-us/windows/win32/memory/data-execution-prevention. Accessed: 2025-06-30
work page 2006
-
[26]
OpenAI . 2023. DALL·E 3. https://openai.com/index/dall-e-3/. Accessed: 2025-06-30
work page 2023
-
[27]
OpenAI . 2025. Usage Policies. https://openai.com/zh-Hans-CN/policies/usage-policies/. Accessed: 2025-06-30
work page 2025
-
[28]
Pahune, S.; and Rewatkar, N. 2023. Healthcare: A Growing Role for Large Language Models and Generative AI. International Journal for Research in Applied Science and Engineering Technology, 11 (8), 2288--2301
work page 2023
-
[29]
Qi, X.; Huang, K.; Panda, A.; Henderson, P.; Wang, M.; and Mittal, P. 2024. Visual adversarial examples jailbreak aligned large language models. In Proceedings of the AAAI conference on artificial intelligence, volume 38, 21527--21536
work page 2024
- [30]
-
[31]
Shacham, H. 2007. The geometry of innocent flesh on the bone: Return-into-libc without function calls (on the x86). In Proceedings of the 14th ACM conference on Computer and communications security, 552--561
work page 2007
-
[32]
Shacham, H.; Page, M.; Pfaff, B.; Goh, E.-J.; Modadugu, N.; and Boneh, D. 2004. On the effectiveness of address-space randomization. In Proceedings of the 11th ACM conference on Computer and communications security, 298--307
work page 2004
-
[33]
Sun, Y.; Zhu, C.; Zheng, S.; Zhang, K.; Sun, L.; Shui, Z.; Zhang, Y.; Li, H.; and Yang, L. 2024. Pathasst: A generative foundation ai assistant towards artificial general intelligence of pathology. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, 5034--5042
work page 2024
- [34]
-
[35]
Wang, P.; Bai, S.; Tan, S.; Wang, S.; Fan, Z.; Bai, J.; Chen, K.; Liu, X.; Wang, J.; Ge, W.; Fan, Y.; Dang, K.; Du, M.; Ren, X.; Men, R.; Liu, D.; Zhou, C.; Zhou, J.; and Lin, J. 2024 a . Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution. arXiv preprint arXiv:2409.12191
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[36]
Wang, Y.; Chen, W.; Han, X.; Lin, X.; Zhao, H.; Liu, Y.; Zhai, B.; Yuan, J.; You, Q.; and Yang, H. 2024 b . Exploring the reasoning abilities of multimodal large language models (mllms): A comprehensive survey on emerging trends in multimodal reasoning. arXiv preprint arXiv:2401.06805
-
[37]
Wang, Y.; Liu, X.; Li, Y.; Chen, M.; and Xiao, C. 2024 c . Adashield: Safeguarding multimodal large language models from structure-based attack via adaptive shield prompting. In European Conference on Computer Vision, 77--94. Springer
work page 2024
-
[38]
xAI . 2025. Grok. https://grok.com/. Latest release: Grok‑3 (Feb 17, 2025); Accessed: 2025‑06‑30
work page 2025
- [39]
- [40]
- [41]
- [42]
- [43]
- [44]
-
[45]
Yuan, M.; Bao, P.; Yuan, J.; Shen, Y.; Chen, Z.; Xie, Y.; Zhao, J.; Li, Q.; Chen, Y.; Zhang, L.; et al. 2024. Large language models illuminate a progressive pathway to artificial intelligent healthcare assistant. Medicine Plus, 100030
work page 2024
-
[46]
Zhang, X.; Zeng, F.; and Gu, C. 2025. Simignore: Exploring and enhancing multimodal large model complex reasoning via similarity computation. Neural Networks, 184: 107059
work page 2025
- [47]
- [48]
-
[49]
Zhipu AI . 2025. GLM-4V: A Multimodal Vision-Language Model by Zhipu AI. https://open.bigmodel.cn/dev/howuse/glm-4v. Accessed: July 15, 2025
work page 2025
-
[50]
Universal and Transferable Adversarial Attacks on Aligned Language Models
Zou, A.; Wang, Z.; Carlini, N.; Nasr, M.; Kolter, J. Z.; and Fredrikson, M. 2023. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043
work page internal anchor Pith review Pith/arXiv arXiv 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.