PRISM: Programmatic Reasoning with Image Sequence Manipulation for LVLM Jailbreaking

Deyue Zhang; Dongdong Yang; Moyang Chen; Quanchen Zou; Wenzhuo Xu; Xiangzheng Zhang; Yakai Li; Yisong Xiao; Zhao Liu; Zonghao Ying

arXiv: 2507.21540 · v3 · submitted 2025-07-29 · cs.CR · cs.CV

Quanchen Zou, Zonghao Ying, Moyang Chen, Wenzhuo Xu, Yisong Xiao, Yakai Li, Deyue Zhang, Dongdong Yang, Zhao Liu, Xiangzheng Zhang

Pith reviewed 2026-05-19 03:32 UTC · model grok-4.3


keywords: jailbreaking · large vision-language models · adversarial attacks · compositional reasoning · safety alignment · return-oriented programming


The pith

LVLMs can be jailbroken by splitting harmful requests into sequences of individually benign images that the model assembles during reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a jailbreak method that decomposes a harmful instruction into a chain of harmless visual elements. A guiding text prompt then directs the model to combine these elements step by step until a coherent harmful response emerges. This approach avoids explicit malicious content in any single image or prompt, exploiting how LVLMs build answers across multiple reasoning steps. If the method works as described, current safety checks that examine isolated inputs will miss the attack. Readers should care because it reveals that alignment focused on direct prompts leaves models open to attacks that hide intent until the final composition.

Core claim

Decomposing harmful instructions into sequences of individually benign visual gadgets and directing their integration via textual prompts causes the malicious intent to emerge from the model's compositional reasoning, evading detection from any single component and producing high attack success rates on state-of-the-art LVLMs.

What carries the argument

The PRISM framework decomposes a harmful instruction into a sequence of benign visual gadgets and uses a directing textual prompt to force their integration during reasoning. The design is modeled on return-oriented programming (ROP) chains.

If this is right

Safety alignments that scan single prompts or images fail against attacks that rely on multi-step composition.
Attack success rates exceed 0.90 on SafeBench and improve by as much as 0.39 over prior baselines.
Defenses must monitor the full reasoning chain rather than individual inputs.
The same compositional vulnerability appears across popular LVLMs tested on MM-SafetyBench.
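
The two figures quoted above are rates and differences of rates. A minimal sketch of that arithmetic follows, assuming ASR is the fraction of attempts a judge marks as successful and that the 0.39 improvement is an absolute difference between two such fractions. The function name and the verdict lists are made up for illustration and are not data from the paper.

```python
from typing import Sequence


def attack_success_rate(judgements: Sequence[bool]) -> float:
    """Fraction of attempts that a judge marked as successful (True)."""
    if not judgements:
        raise ValueError("need at least one judged attempt")
    return sum(judgements) / len(judgements)


# Made-up judge verdicts for illustration only; not results from the paper.
baseline = [True] * 5 + [False] * 5  # 5 of 10 attempts judged successful
method = [True] * 9 + [False] * 1    # 9 of 10 attempts judged successful

asr_baseline = attack_success_rate(baseline)
asr_method = attack_success_rate(method)

# Absolute gain: a difference of two rates, not a relative percentage.
absolute_gain = asr_method - asr_baseline
print(f"baseline={asr_baseline:.2f} method={asr_method:.2f} gain={absolute_gain:.2f}")
```

Under this reading, a gain of 0.39 is large: it would move a baseline that succeeds about half the time to roughly nine in ten.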

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The technique could extend to other sequential reasoning systems if they accept mixed image and text inputs over multiple turns.
Alignment training may need to include examples of benign components that become harmful only when assembled.
Developers could add runtime checks that detect when a sequence of safe-looking inputs is being directed toward a single coherent output.
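
The runtime check in the last item can be sketched as follows. This is a minimal illustration and not a defense evaluated in the paper: `flags_emergent_risk`, `toy_score`, the 0.5 threshold, and the idea of representing each input by a short text description are all assumptions made here. A real deployment would replace `toy_score` with an actual moderation model.

```python
from typing import Callable, Sequence


def flags_emergent_risk(
    parts: Sequence[str],
    score: Callable[[str], float],
    threshold: float = 0.5,
) -> bool:
    """Return True when every part passes alone but the joined text does not."""
    each_part_passes = all(score(p) < threshold for p in parts)
    combined_fails = score(" ".join(parts)) >= threshold
    return each_part_passes and combined_fails


def toy_score(text: str) -> float:
    """Hypothetical stand-in scorer: fraction of watched terms present."""
    watched = ("alpha", "beta", "gamma")
    hits = sum(term in text.lower() for term in watched)
    return hits / len(watched)


# Placeholder descriptions; each contains only one watched term.
parts = ["step with alpha", "step with beta", "step with gamma"]
print(flags_emergent_risk(parts, toy_score))  # True
```

The point of the design is the comparison itself: scoring the combination separately from the components is what lets a filter notice content that only becomes risky once assembled.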

Load-bearing premise

LVLMs will integrate the sequence of benign visual gadgets through their reasoning process to produce a coherent and harmful output that evades detection from any single component.

What would settle it

Run the sequence of benign visual gadgets and directing prompt on a target LVLM and observe whether the model refuses the request, produces unrelated output, or fails to compose the elements into the intended harmful response.
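
The three outcomes named above can be tallied per model once each response has been labeled. The sketch below assumes the labels already exist, whether from human raters or an automated judge. The `Outcome` names and the sample list are hypothetical.

```python
from collections import Counter
from enum import Enum
from typing import Dict, Sequence


class Outcome(Enum):
    REFUSED = "refused"      # model declined the request
    UNRELATED = "unrelated"  # model answered, but off-target
    COMPOSED = "composed"    # model assembled the intended response


def summarize(outcomes: Sequence[Outcome]) -> Dict[str, float]:
    """Share of each outcome category among the labeled responses."""
    if not outcomes:
        raise ValueError("need at least one labeled response")
    counts = Counter(outcomes)
    return {o.value: counts[o] / len(outcomes) for o in Outcome}


# Made-up labels for illustration only.
labels = [Outcome.REFUSED, Outcome.COMPOSED, Outcome.COMPOSED, Outcome.UNRELATED]
shares = summarize(labels)
print(shares)
```

Only the `composed` share supports the paper's claim. A high `unrelated` share would suggest that the model fails to integrate the gadgets rather than that its safety training holds.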

Figures

Figures reproduced from arXiv: 2507.21540 by Deyue Zhang, Dongdong Yang, Moyang Chen, Quanchen Zou, Wenzhuo Xu, Xiangzheng Zhang, Yakai Li, Yisong Xiao, Zhao Liu, Zonghao Ying.

**Figure 1.** Analogy between ROP in software and PRISM in LVLMs. Code gadgets with control flow in ROP correspond to visual gadgets and prompt-driven reasoning in PRISM.

**Figure 2.** Overview of the PRISM pipeline. An auxiliary LLM decomposes the target into key steps, each described as a textual scene. These are used by a T2I model to generate sub-images, which are composed into a single image. The textual prompt, obtained via generalizable template search, guides the LVLM to extract relevant information and compose an unsafe response.

**Figure 3.** Ablation study on auxiliary models. [PITH_FULL_IMAGE:figures/full_fig_p007_3.png]

(Plot text extracted alongside this caption: x-axis "Number of Visual Gadgets", 1 to 6; y-axis "ASR", 0.0 to 1.0; legend: Qwen2-VL-7B-Instruct, LLaVA-v1.6-Mistral-7B, Llama-3.2-11B-Vision-Instruct, GPT-4o, Claude 3.7 Sonnet, GLM-4V-Plus, Qwen-VL-Plus.)

**Figure 4.** Impact of the number of visual gadgets on the ASR [PITH_FULL_IMAGE:figures/full_fig_p007_4.png]

**Figure 5.** A sample jailbreak attack on Qwen-VL-Plus using the proposed [PITH_FULL_IMAGE:figures/full_fig_p011_5.png]

**Figure 6.** A sample jailbreak attack on GPT-4o using the proposed [PITH_FULL_IMAGE:figures/full_fig_p012_6.png]

**Figure 7.** Comparison of images generated by different T2I [PITH_FULL_IMAGE:figures/full_fig_p013_7.png]

Original abstract

The increasing sophistication of large vision-language models (LVLMs) has been accompanied by advances in safety alignment mechanisms designed to prevent harmful content generation. However, these defenses remain vulnerable to sophisticated adversarial attacks. Existing jailbreak methods typically rely on direct and semantically explicit prompts, overlooking subtle vulnerabilities in how LVLMs compose information over multiple reasoning steps. In this paper, we propose a novel and effective jailbreak framework inspired by Return-Oriented Programming (ROP) techniques from software security. Our approach decomposes a harmful instruction into a sequence of individually benign visual gadgets. A carefully engineered textual prompt directs the sequence of inputs, prompting the model to integrate the benign visual gadgets through its reasoning process to produce a coherent and harmful output. This makes the malicious intent emergent and difficult to detect from any single component. We validate our method through extensive experiments on established benchmarks including SafeBench and MM-SafetyBench, targeting popular LVLMs. Results show that our approach consistently and substantially outperforms existing baselines on state-of-the-art models, achieving near-perfect attack success rates (over 0.90 on SafeBench) and improving ASR by up to 0.39. Our findings reveal a critical and underexplored vulnerability that exploits the compositional reasoning abilities of LVLMs, highlighting the urgent need for defenses that secure the entire reasoning process.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PRISM adapts ROP ideas to split harmful requests across sequences of benign images steered by text, claiming big ASR gains on LVLM benchmarks, but the mechanism details look thin.


The one or two things to know: This paper presents PRISM, a jailbreak method for LVLMs that uses sequences of benign images combined with programmatic text prompts to elicit harmful outputs through the model's own reasoning. It claims near-perfect success rates on standard safety benchmarks.

What is new here is the adaptation of ROP concepts to visual inputs, where harmful intent is split across multiple images that are harmless alone. This goes beyond the direct prompt attacks that the abstract positions as the current state of the art.

The paper does well in running experiments on SafeBench and MM-SafetyBench against several leading models, showing consistent improvements and high attack success rates that suggest the approach taps into real weaknesses in compositional reasoning.

The soft spots are in the lack of transparency around the core components. Without details on the gadget construction process or evidence that each image was independently verified as benign, it is difficult to assess whether the success is due to the sequence manipulation or to other factors. The assumption that the model will integrate the gadgets into a coherent harmful response without safety mechanisms catching the overall pattern is load-bearing, and the abstract provides no ablations or per-component analysis to support it. If the full paper has those, it would strengthen the case considerably.

This work is for researchers focused on the security of multimodal AI systems and those developing defenses against advanced jailbreaks. A reader looking for ideas on how to test reasoning-based vulnerabilities would get something out of it.

I recommend sending this to peer review. The topic is relevant, the method is distinct, and the reported results merit expert evaluation even if revisions are needed for experimental rigor.

Referee Report

2 major / 2 minor

Summary. The paper proposes PRISM, an ROP-inspired jailbreak for LVLMs that decomposes harmful instructions into sequences of individually benign visual gadgets steered by a textual prompt so that the model integrates them into coherent harmful outputs. Experiments on SafeBench and MM-SafetyBench report ASR > 0.90 on SafeBench and gains of up to 0.39 over baselines on state-of-the-art LVLMs.

Significance. If the empirical results hold, the work identifies a concrete vulnerability in how LVLMs perform multi-step compositional reasoning over image sequences, showing that safety filters can be evaded when harm emerges only from integration rather than any single input. This could motivate new defenses that audit cumulative context rather than isolated components.

major comments (2)

[Abstract] The central claim that the method produces 'coherent and harmful output' while evading detection rests on the untested assumption that LVLMs integrate the benign gadget sequence exactly as the prompt intends, without safety mechanisms operating on cumulative context. No gadget-construction algorithm, per-gadget safety audit, or ablation that removes the integration prompt is supplied.
[Experiments] The reported ASR > 0.90 and improvements of up to 0.39 are presented without full experimental details, baseline comparisons, or controls for confounds such as prompt length or image selection bias, leaving the strength of evidence for the compositional-reasoning vulnerability only moderately supported.

minor comments (2)

[Method] Clarify the precise definition of 'benign' for each gadget and how it was verified against the target LVLMs' safety classifiers.
[Evaluation] Add a table or figure showing per-model ASR with and without the textual integration prompt to isolate its contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. The comments raise important points about the clarity of claims in the abstract and the completeness of experimental reporting. We address each major comment below and indicate the specific revisions we will make to strengthen the paper.


Referee: [Abstract] The central claim that the method produces 'coherent and harmful output' while evading detection rests on the untested assumption that LVLMs integrate the benign gadget sequence exactly as the prompt intends, without safety mechanisms operating on cumulative context. No gadget-construction algorithm, per-gadget safety audit, or ablation that removes the integration prompt is supplied.

Authors: We agree that the abstract is high-level and that additional supporting evidence would strengthen the central claim. The full manuscript describes the gadget-construction algorithm in Section 3.2 as a programmatic decomposition inspired by ROP, with explicit steps for breaking down harmful instructions into benign visual components. To address the integration assumption and lack of ablation, we will add a dedicated ablation study in the revised Experiments section that removes the integration prompt and reports the resulting drop in ASR. We will also include a per-gadget safety audit with quantitative results showing activation rates of safety filters on individual gadgets versus the full sequence. These additions will provide direct empirical support for the emergence of harm through compositional reasoning. revision: yes
Referee: [Experiments] The reported ASR > 0.90 and improvements of up to 0.39 are presented without full experimental details, baseline comparisons, or controls for confounds such as prompt length or image selection bias, leaving the strength of evidence for the compositional-reasoning vulnerability only moderately supported.

Authors: We acknowledge that the current experimental presentation would benefit from greater transparency. In the revised manuscript, we will expand the Experiments section to include full details on the setup, including exact prompt lengths, image selection criteria and randomization procedures, and explicit controls for confounds such as length bias and selection effects. We will also augment the baseline comparisons with additional methods and report results with standard deviations or confidence intervals across repeated trials. These changes will provide stronger, more robust evidence for the identified vulnerability. revision: yes
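
One common way to attach the promised confidence intervals to a success rate is the Wilson score interval for a binomial proportion. The sketch below is an illustration of that standard formula, not the authors' procedure. The function name is chosen here, `z = 1.96` corresponds to an approximate 95% interval, and the counts in the example are made up.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (successes / trials)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("require trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half_width, center + half_width


# Made-up counts for illustration only: 45 successes in 50 trials.
low, high = wilson_interval(45, 50)
print(f"ASR = {45 / 50:.2f}, interval = [{low:.3f}, {high:.3f}]")
```

With small per-model sample sizes the interval is wide, which is why a headline rate such as 0.90 is hard to compare across models without it.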

Circularity Check

0 steps flagged

No circularity: empirical attack construction evaluated on external benchmarks


The paper proposes an empirical jailbreak method (PRISM) that decomposes harmful instructions into sequences of individually benign visual gadgets, then uses a textual prompt to steer LVLMs into integrating them into coherent harmful outputs. Claims rest on experimental results showing ASR >0.90 on SafeBench and gains up to 0.39 over baselines. No mathematical derivations, equations, fitted parameters, or self-citations appear in the provided text that would reduce any reported success rate to a quantity defined by the method itself. Evaluation uses external benchmarks (SafeBench, MM-SafetyBench) and comparisons to prior baselines, keeping the contribution independent of its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that LVLMs perform compositional reasoning over sequences of individually benign visual inputs when guided by a textual prompt, allowing malicious intent to emerge without detection in any single step.

axioms (1)

Domain assumption: LVLMs integrate information across multiple reasoning steps involving sequences of images and text in a manner that can produce emergent harmful outputs from benign components.
This assumption underpins the claim that the malicious intent becomes difficult to detect from any single gadget.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SecureWebArena: A Holistic Security Evaluation Benchmark for LVLM-based Web Agents
cs.CR · 2025-10 · unverdicted · novelty 7.0

SecureWebArena is a new benchmark suite for holistic security evaluation of LVLM-based web agents using diverse simulated environments, attack taxonomies, and multi-layered failure analysis across reasoning, behavior,...

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · cited by 1 Pith paper · 6 internal anchors

[3] Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F. L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774
[4] Aliyun (Alibaba Cloud). 2025. How to Use Vision Models in Model Studio. https://help.aliyun.com/zh/model-studio/vision. Accessed: 2025-07-15; last updated: 2025-06-13
[5] Anthropic. 2024. Claude 3.5 Sonnet: First in the next generation of Claude models. Accessed: 2025-06-30
[6] Bierbaumer, B.; Kirsch, J.; Kittel, T.; Francillon, A.; and Zarras, A. 2018. Smashing the stack protector for fun and profit. In ICT Systems Security and Privacy Protection: 33rd IFIP TC 11 International Conference, SEC 2018, Held at the 24th IFIP World Computer Congress, WCC 2018, Poznan, Poland, September 18-20, 2018, Proceedings 33, 293--306. Springer
[7] ByteDance Seed Team. 2025. Seedream 3.0: Next-Gen Text-to-Image Model. https://seed.bytedance.com/en/tech/seedream3_0. Accessed: 2025-06-30
[8] Chao, P.; Debenedetti, E.; Robey, A.; Andriushchenko, M.; Croce, F.; Sehwag, V.; Dobriban, E.; Flammarion, N.; Pappas, G. J.; Tramer, F.; et al. 2024. JailbreakBench: An open robustness benchmark for jailbreaking large language models. arXiv preprint arXiv:2404.01318
[9] Dong, Y.; Liu, Z.; Sun, H.-L.; Yang, J.; Hu, W.; Rao, Y.; and Liu, Z. 2025. Insight-v: Exploring long-chain visual reasoning with multimodal large language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, 9062--9072
[10] Esser, P.; Kulal, S.; Blattmann, A.; Entezari, R.; Müller, J.; Saini, H.; Levi, Y.; Lorenz, D.; Sauer, A.; Boesel, F.; et al. 2024. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first international conference on machine learning
[11] Fang, W.; Wu, Q.; Chen, J.; and Xue, Y. 2025. guided MLLM Reasoning: Enhancing MLLM with Knowledge and Visual Notes for Visual Question Answering. In Proceedings of the Computer Vision and Pattern Recognition Conference, 19597--19607
[12] Gong, Y.; Ran, D.; Liu, J.; Wang, C.; Cong, T.; Wang, A.; Duan, S.; and Wang, X. 2025. Figstep: Jailbreaking large vision-language models via typographic visual prompts. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, 23951--23959
[13] Gou, Y.; Chen, K.; Liu, Z.; Hong, L.; Xu, H.; Li, Z.; Yeung, D.-Y.; Kwok, J. T.; and Zhang, Y. 2024. Eyes closed, safety on: Protecting multimodal llms via image-to-text transformation. In European Conference on Computer Vision, 388--404. Springer
[14] Hurst, A.; Lerer, A.; Goucher, A. P.; Perelman, A.; Ramesh, A.; Clark, A.; Ostrow, A.; Welihinda, A.; Hayes, A.; Radford, A.; et al. 2024. GPT-4o system card. arXiv preprint arXiv:2410.21276
[15] Intel Corporation. 2023. Intel 64 and IA-32 Architectures Software Developer's Manual. Volume 1: Basic Architecture. Chapter 3: System Architecture Overview
[16] Kuang, J.; Shen, Y.; Xie, J.; Luo, H.; Xu, Z.; Li, R.; Li, Y.; Cheng, X.; Lin, X.; and Han, Y. 2025. Natural language understanding and inference with mllm in visual question answering: A survey. ACM Computing Surveys, 57(8): 1--36
[17] Li, Y.; Guo, H.; Zhou, K.; Zhao, W. X.; and Wen, J.-R. 2024. Images are achilles' heel of alignment: Exploiting visual vulnerabilities for jailbreaking multimodal large language models. In European Conference on Computer Vision, 174--189. Springer
[18] Liu, H.; Li, C.; Li, Y.; and Lee, Y. J. 2023. Improved Baselines with Visual Instruction Tuning. arXiv:2310.03744
[19] Liu, S.; Cheng, H.; Liu, H.; Zhang, H.; Li, F.; Ren, T.; Zou, X.; Yang, J.; Su, H.; Zhu, J.; et al. 2024a. Llava-plus: Learning to use tools for creating multimodal agents. In European Conference on Computer Vision, 126--142. Springer
[20] Liu, X.; Zhu, Y.; Gu, J.; Lan, Y.; Yang, C.; and Qiao, Y. 2024b. Mm-safetybench: A benchmark for safety evaluation of multimodal large language models. In European Conference on Computer Vision, 386--403. Springer
[21] Lu, J.; Srivastava, S.; Chen, J.; Shrestha, R.; Acharya, M.; Kafle, K.; and Kanan, C. 2025. Revisiting multi-modal llm evaluation. In Proceedings of the Computer Vision and Pattern Recognition Conference, 555--564
[22] Luo, W.; Ma, S.; Liu, X.; Guo, X.; and Xiao, C. 2024. JailBreakV: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks. arXiv:2404.03027
[23] Meta AI. 2024. Llama 3.2: Connect 2024 - Vision & Edge Mobile Devices. https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/. Accessed: 2025-06-30
[24] Meta AI. 2025. LLaMA Use Policy. https://ai.meta.com/llama/use-policy/. Accessed: 2025-06-30
[25] Microsoft Corporation. 2006. Data Execution Prevention. https://learn.microsoft.com/en-us/windows/win32/memory/data-execution-prevention. Accessed: 2025-06-30
[26] OpenAI. 2023. DALL·E 3. https://openai.com/index/dall-e-3/. Accessed: 2025-06-30
[27] OpenAI. 2025. Usage Policies. https://openai.com/zh-Hans-CN/policies/usage-policies/. Accessed: 2025-06-30
[28] Pahune, S.; and Rewatkar, N. 2023. Healthcare: A Growing Role for Large Language Models and Generative AI. International Journal for Research in Applied Science and Engineering Technology, 11 (8), 2288--2301
[29] Qi, X.; Huang, K.; Panda, A.; Henderson, P.; Wang, M.; and Mittal, P. 2024. Visual adversarial examples jailbreak aligned large language models. In Proceedings of the AAAI conference on artificial intelligence, volume 38, 21527--21536
[30] Ran, D.; Liu, J.; Gong, Y.; Zheng, J.; He, X.; Cong, T.; and Wang, A. 2024. Jailbreakeval: An integrated toolkit for evaluating jailbreak attempts against large language models. arXiv preprint arXiv:2406.09321
[31] Shacham, H. 2007. The geometry of innocent flesh on the bone: Return-into-libc without function calls (on the x86). In Proceedings of the 14th ACM conference on Computer and communications security, 552--561
[32] Shacham, H.; Page, M.; Pfaff, B.; Goh, E.-J.; Modadugu, N.; and Boneh, D. 2004. On the effectiveness of address-space randomization. In Proceedings of the 11th ACM conference on Computer and communications security, 298--307
[33] Sun, Y.; Zhu, C.; Zheng, S.; Zhang, K.; Sun, L.; Shui, Z.; Zhang, Y.; Li, H.; and Yang, L. 2024. Pathasst: A generative foundation ai assistant towards artificial general intelligence of pathology. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, 5034--5042
[34] Teng, M.; Xiaojun, J.; Ranjie, D.; Xinfeng, L.; Yihao, H.; Zhixuan, C.; Yang, L.; and Wenqi, R. 2024. Heuristic-induced multimodal risk distribution jailbreak attack for multimodal large language models. arXiv preprint arXiv:2412.05934
[35] Wang, P.; Bai, S.; Tan, S.; Wang, S.; Fan, Z.; Bai, J.; Chen, K.; Liu, X.; Wang, J.; Ge, W.; Fan, Y.; Dang, K.; Du, M.; Ren, X.; Men, R.; Liu, D.; Zhou, C.; Zhou, J.; and Lin, J. 2024a. Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution. arXiv preprint arXiv:2409.12191
[36] Wang, Y.; Chen, W.; Han, X.; Lin, X.; Zhao, H.; Liu, Y.; Zhai, B.; Yuan, J.; You, Q.; and Yang, H. 2024b. Exploring the reasoning abilities of multimodal large language models (mllms): A comprehensive survey on emerging trends in multimodal reasoning. arXiv preprint arXiv:2401.06805
[37] Wang, Y.; Liu, X.; Li, Y.; Chen, M.; and Xiao, C. 2024c. Adashield: Safeguarding multimodal large language models from structure-based attack via adaptive shield prompting. In European Conference on Computer Vision, 77--94. Springer
[38] xAI. 2025. Grok. https://grok.com/. Latest release: Grok-3 (Feb 17, 2025); Accessed: 2025-06-30
[39] Xu, Y.; Qi, X.; Qin, Z.; and Wang, W. 2024. Cross-modality information check for detecting jailbreaking in multimodal large language models. arXiv preprint arXiv:2407.21659
[40] Ying, Z.; Liu, A.; Liang, S.; Huang, L.; Guo, J.; Zhou, W.; Liu, X.; and Tao, D. 2024a. Safebench: A safety evaluation framework for multimodal large language models. arXiv preprint arXiv:2410.18927
[41] Ying, Z.; Liu, A.; Liu, X.; and Tao, D. 2024b. Unveiling the safety of gpt-4o: An empirical study using jailbreak attacks. arXiv preprint arXiv:2406.06302
[42] Ying, Z.; Liu, A.; Zhang, T.; Yu, Z.; Liang, S.; Liu, X.; and Tao, D. 2024c. Jailbreak vision language models via bi-modal adversarial prompt. arXiv preprint arXiv:2406.04031
[43] Ying, Z.; Zhang, D.; Jing, Z.; Xiao, Y.; Zou, Q.; Liu, A.; Liang, S.; Zhang, X.; Liu, X.; and Tao, D. 2025a. Reasoning-augmented conversation for multi-turn jailbreak attacks on large language models. arXiv preprint arXiv:2502.11054
[44] Ying, Z.; Zheng, G.; Huang, Y.; Zhang, D.; Zhang, W.; Zou, Q.; Liu, A.; Liu, X.; and Tao, D. 2025b. Towards understanding the safety boundaries of deepseek models: Evaluation and findings. arXiv preprint arXiv:2503.15092
[45] Yuan, M.; Bao, P.; Yuan, J.; Shen, Y.; Chen, Z.; Xie, Y.; Zhao, J.; Li, Q.; Chen, Y.; Zhang, L.; et al. 2024. Large language models illuminate a progressive pathway to artificial intelligent healthcare assistant. Medicine Plus, 100030
[46] Zhang, X.; Zeng, F.; and Gu, C. 2025. Simignore: Exploring and enhancing multimodal large model complex reasoning via similarity computation. Neural Networks, 184: 107059
[47] Zhang, X.; Zhang, C.; Li, T.; Huang, Y.; Jia, X.; Hu, M.; Zhang, J.; Liu, Y.; Ma, S.; and Shen, C. 2023. Jailguard: A universal detection framework for llm prompt-based attacks. arXiv preprint arXiv:2312.10766
[48] Zhao, S.; Duan, R.; Wang, F.; Chen, C.; Kang, C.; Tao, J.; Chen, Y.; Xue, H.; and Wei, X. 2025. Jailbreaking multimodal large language models via shuffle inconsistency. arXiv preprint arXiv:2501.04931
[49] Zhipu AI. 2025. GLM-4V: A Multimodal Vision-Language Model by Zhipu AI. https://open.bigmodel.cn/dev/howuse/glm-4v. Accessed: 2025-07-15
[50] Zou, A.; Wang, Z.; Carlini, N.; Nasr, M.; Kolter, J. Z.; and Fredrikson, M. 2023. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043

[1] [1]

, " * write output.state after.block = add.period write newline

ENTRY address archivePrefix author booktitle chapter edition editor eid eprint howpublished institution isbn journal key month note number organization pages publisher school series title type volume year label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block FUNCTION init.state.consts #0 'before.a...

work page

[2] [2]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

work page

[3] [3]

GPT-4 Technical Report

Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F. L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774

work page internal anchor Pith review Pith/arXiv arXiv 2023

[4] [4]

Aliyun (Alibaba Cloud) . 2025. How to Use Vision Models in Model Studio. https://help.aliyun.com/zh/model-studio/vision. Accessed: 2025‑07‑15; last updated: 2025‑06‑13

work page 2025

[5] [5]

Anthropic . 2024. Claude 3.5 Sonnet: First in the next generation of Claude models. Accessed: 2025-06-30

work page 2024

[6] [6]

Bierbaumer, B.; Kirsch, J.; Kittel, T.; Francillon, A.; and Zarras, A. 2018. Smashing the stack protector for fun and profit. In ICT Systems Security and Privacy Protection: 33rd IFIP TC 11 International Conference, SEC 2018, Held at the 24th IFIP World Computer Congress, WCC 2018, Poznan, Poland, September 18-20, 2018, Proceedings 33, 293--306. Springer

work page 2018

[7] [7]

ByteDance Seed Team . 2025. Seedream 3.0: Next‑Gen Text‑to‑Image Model. https://seed.bytedance.com/en/tech/seedream3_0. Accessed: 2025‑06‑30

work page 2025

[8] [8]

JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

Chao, P.; Debenedetti, E.; Robey, A.; Andriushchenko, M.; Croce, F.; Sehwag, V.; Dobriban, E.; Flammarion, N.; Pappas, G. J.; Tramer, F.; et al. 2024. Jailbreakbench: An open robustness benchmark for jailbreaking large language models. arXiv preprint arXiv:2404.01318

work page internal anchor Pith review Pith/arXiv arXiv 2024

[9] [9]

Dong, Y.; Liu, Z.; Sun, H.-L.; Yang, J.; Hu, W.; Rao, Y.; and Liu, Z. 2025. Insight-v: Exploring long-chain visual reasoning with multimodal large language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, 9062--9072

work page 2025

[10] [10]

Esser, P.; Kulal, S.; Blattmann, A.; Entezari, R.; M \"u ller, J.; Saini, H.; Levi, Y.; Lorenz, D.; Sauer, A.; Boesel, F.; et al. 2024. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first international conference on machine learning

work page 2024

[11] [11]

Fang, W.; Wu, Q.; Chen, J.; and Xue, Y. 2025. guided MLLM Reasoning: Enhancing MLLM with Knowledge and Visual Notes for Visual Question Answering. In Proceedings of the Computer Vision and Pattern Recognition Conference, 19597--19607

work page 2025

[12] [12]

Gong, Y.; Ran, D.; Liu, J.; Wang, C.; Cong, T.; Wang, A.; Duan, S.; and Wang, X. 2025. Figstep: Jailbreaking large vision-language models via typographic visual prompts. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, 23951--23959

[13]

Gou, Y.; Chen, K.; Liu, Z.; Hong, L.; Xu, H.; Li, Z.; Yeung, D.-Y.; Kwok, J. T.; and Zhang, Y. 2024. Eyes closed, safety on: Protecting multimodal llms via image-to-text transformation. In European Conference on Computer Vision, 388--404. Springer

[14]

Hurst, A.; Lerer, A.; Goucher, A. P.; Perelman, A.; Ramesh, A.; Clark, A.; Ostrow, A.; Welihinda, A.; Hayes, A.; Radford, A.; et al. 2024. Gpt-4o system card. arXiv preprint arXiv:2410.21276

[15]

Intel Corporation. 2023. Intel 64 and IA-32 Architectures Software Developer’s Manual. Volume 1: Basic Architecture. Chapter 3: System Architecture Overview

[16]

Kuang, J.; Shen, Y.; Xie, J.; Luo, H.; Xu, Z.; Li, R.; Li, Y.; Cheng, X.; Lin, X.; and Han, Y. 2025. Natural language understanding and inference with mllm in visual question answering: A survey. ACM Computing Surveys, 57(8): 1--36

[17]

Li, Y.; Guo, H.; Zhou, K.; Zhao, W. X.; and Wen, J.-R. 2024. Images are achilles’ heel of alignment: Exploiting visual vulnerabilities for jailbreaking multimodal large language models. In European Conference on Computer Vision, 174--189. Springer

[18]

Liu, H.; Li, C.; Li, Y.; and Lee, Y. J. 2023. Improved Baselines with Visual Instruction Tuning. arXiv:2310.03744

[19]

Liu, S.; Cheng, H.; Liu, H.; Zhang, H.; Li, F.; Ren, T.; Zou, X.; Yang, J.; Su, H.; Zhu, J.; et al. 2024a. Llava-plus: Learning to use tools for creating multimodal agents. In European Conference on Computer Vision, 126--142. Springer

[20]

Liu, X.; Zhu, Y.; Gu, J.; Lan, Y.; Yang, C.; and Qiao, Y. 2024b. Mm-safetybench: A benchmark for safety evaluation of multimodal large language models. In European Conference on Computer Vision, 386--403. Springer

[21]

Lu, J.; Srivastava, S.; Chen, J.; Shrestha, R.; Acharya, M.; Kafle, K.; and Kanan, C. 2025. Revisiting multi-modal llm evaluation. In Proceedings of the Computer Vision and Pattern Recognition Conference, 555--564

[22]

Luo, W.; Ma, S.; Liu, X.; Guo, X.; and Xiao, C. 2024. JailBreakV: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks. arXiv:2404.03027

[23]

Meta AI. 2024. Llama 3.2: Connect 2024 - Vision & Edge Mobile Devices. https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/. Accessed: 2025-06-30

[24]

Meta AI. 2025. LLaMA Use Policy. https://ai.meta.com/llama/use-policy/. Accessed: 2025-06-30

[25]

Microsoft Corporation. 2006. Data Execution Prevention. https://learn.microsoft.com/en-us/windows/win32/memory/data-execution-prevention. Accessed: 2025-06-30

[26]

OpenAI. 2023. DALL·E 3. https://openai.com/index/dall-e-3/. Accessed: 2025-06-30

[27]

OpenAI. 2025. Usage Policies. https://openai.com/zh-Hans-CN/policies/usage-policies/. Accessed: 2025-06-30

[28]

Pahune, S.; and Rewatkar, N. 2023. Healthcare: A Growing Role for Large Language Models and Generative AI. International Journal for Research in Applied Science and Engineering Technology, 11(8): 2288--2301

[29]

Qi, X.; Huang, K.; Panda, A.; Henderson, P.; Wang, M.; and Mittal, P. 2024. Visual adversarial examples jailbreak aligned large language models. In Proceedings of the AAAI conference on artificial intelligence, volume 38, 21527--21536

[30]

Ran, D.; Liu, J.; Gong, Y.; Zheng, J.; He, X.; Cong, T.; and Wang, A. 2024. Jailbreakeval: An integrated toolkit for evaluating jailbreak attempts against large language models. arXiv preprint arXiv:2406.09321

[31]

Shacham, H. 2007. The geometry of innocent flesh on the bone: Return-into-libc without function calls (on the x86). In Proceedings of the 14th ACM conference on Computer and communications security, 552--561

[32]

Shacham, H.; Page, M.; Pfaff, B.; Goh, E.-J.; Modadugu, N.; and Boneh, D. 2004. On the effectiveness of address-space randomization. In Proceedings of the 11th ACM conference on Computer and communications security, 298--307

[33]

Sun, Y.; Zhu, C.; Zheng, S.; Zhang, K.; Sun, L.; Shui, Z.; Zhang, Y.; Li, H.; and Yang, L. 2024. Pathasst: A generative foundation ai assistant towards artificial general intelligence of pathology. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, 5034--5042

[34]

Teng, M.; Xiaojun, J.; Ranjie, D.; Xinfeng, L.; Yihao, H.; Zhixuan, C.; Yang, L.; and Wenqi, R. 2024. Heuristic-induced multimodal risk distribution jailbreak attack for multimodal large language models. arXiv preprint arXiv:2412.05934

[35]

Wang, P.; Bai, S.; Tan, S.; Wang, S.; Fan, Z.; Bai, J.; Chen, K.; Liu, X.; Wang, J.; Ge, W.; Fan, Y.; Dang, K.; Du, M.; Ren, X.; Men, R.; Liu, D.; Zhou, C.; Zhou, J.; and Lin, J. 2024a. Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution. arXiv preprint arXiv:2409.12191

[36]

Wang, Y.; Chen, W.; Han, X.; Lin, X.; Zhao, H.; Liu, Y.; Zhai, B.; Yuan, J.; You, Q.; and Yang, H. 2024b. Exploring the reasoning abilities of multimodal large language models (mllms): A comprehensive survey on emerging trends in multimodal reasoning. arXiv preprint arXiv:2401.06805

[37]

Wang, Y.; Liu, X.; Li, Y.; Chen, M.; and Xiao, C. 2024c. Adashield: Safeguarding multimodal large language models from structure-based attack via adaptive shield prompting. In European Conference on Computer Vision, 77--94. Springer

[38]

xAI. 2025. Grok. https://grok.com/. Latest release: Grok-3 (Feb 17, 2025). Accessed: 2025-06-30

[39]

Xu, Y.; Qi, X.; Qin, Z.; and Wang, W. 2024. Cross-modality information check for detecting jailbreaking in multimodal large language models. arXiv preprint arXiv:2407.21659

[40]

Ying, Z.; Liu, A.; Liang, S.; Huang, L.; Guo, J.; Zhou, W.; Liu, X.; and Tao, D. 2024a. Safebench: A safety evaluation framework for multimodal large language models. arXiv preprint arXiv:2410.18927

[41]

Ying, Z.; Liu, A.; Liu, X.; and Tao, D. 2024b. Unveiling the safety of gpt-4o: An empirical study using jailbreak attacks. arXiv preprint arXiv:2406.06302

[42]

Ying, Z.; Liu, A.; Zhang, T.; Yu, Z.; Liang, S.; Liu, X.; and Tao, D. 2024c. Jailbreak vision language models via bi-modal adversarial prompt. arXiv preprint arXiv:2406.04031

[43]

Ying, Z.; Zhang, D.; Jing, Z.; Xiao, Y.; Zou, Q.; Liu, A.; Liang, S.; Zhang, X.; Liu, X.; and Tao, D. 2025a. Reasoning-augmented conversation for multi-turn jailbreak attacks on large language models. arXiv preprint arXiv:2502.11054

[44]

Ying, Z.; Zheng, G.; Huang, Y.; Zhang, D.; Zhang, W.; Zou, Q.; Liu, A.; Liu, X.; and Tao, D. 2025b. Towards understanding the safety boundaries of deepseek models: Evaluation and findings. arXiv preprint arXiv:2503.15092

[45]

Yuan, M.; Bao, P.; Yuan, J.; Shen, Y.; Chen, Z.; Xie, Y.; Zhao, J.; Li, Q.; Chen, Y.; Zhang, L.; et al. 2024. Large language models illuminate a progressive pathway to artificial intelligent healthcare assistant. Medicine Plus, 100030

[46]

Zhang, X.; Zeng, F.; and Gu, C. 2025. Simignore: Exploring and enhancing multimodal large model complex reasoning via similarity computation. Neural Networks, 184: 107059

[47]

Zhang, X.; Zhang, C.; Li, T.; Huang, Y.; Jia, X.; Hu, M.; Zhang, J.; Liu, Y.; Ma, S.; and Shen, C. 2023. Jailguard: A universal detection framework for llm prompt-based attacks. arXiv preprint arXiv:2312.10766

[48]

Zhao, S.; Duan, R.; Wang, F.; Chen, C.; Kang, C.; Tao, J.; Chen, Y.; Xue, H.; and Wei, X. 2025. Jailbreaking multimodal large language models via shuffle inconsistency. arXiv preprint arXiv:2501.04931

[49]

Zhipu AI. 2025. GLM-4V: A Multimodal Vision-Language Model by Zhipu AI. https://open.bigmodel.cn/dev/howuse/glm-4v. Accessed: 2025-07-15

[50]

Zou, A.; Wang, Z.; Carlini, N.; Nasr, M.; Kolter, J. Z.; and Fredrikson, M. 2023. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043
