arxiv: 2507.21540 · v3 · submitted 2025-07-29 · 💻 cs.CR · cs.CV

PRISM: Programmatic Reasoning with Image Sequence Manipulation for LVLM Jailbreaking

Pith reviewed 2026-05-19 03:32 UTC · model grok-4.3

classification 💻 cs.CR cs.CV
keywords jailbreaking · large vision-language models · adversarial attacks · compositional reasoning · safety alignment · return-oriented programming

The pith

LVLMs can be jailbroken by splitting harmful requests into sequences of individually benign images that the model assembles during reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a jailbreak method that decomposes a harmful instruction into a chain of harmless visual elements. A guiding text prompt then directs the model to combine these elements step by step until a coherent harmful response emerges. This approach avoids explicit malicious content in any single image or prompt, exploiting how LVLMs build answers across multiple reasoning steps. If the method works as described, current safety checks that examine isolated inputs will miss the attack. Readers should care because it reveals that alignment focused on direct prompts leaves models open to attacks that hide intent until the final composition.

Core claim

Decomposing harmful instructions into sequences of individually benign visual gadgets and directing their integration via textual prompts causes the malicious intent to emerge from the model's compositional reasoning, evading detection from any single component and producing high attack success rates on state-of-the-art LVLMs.

What carries the argument

The PRISM framework, modeled on return-oriented programming chains, decomposes harmful instructions into sequences of benign visual gadgets and uses a directing textual prompt to force their integration during reasoning.

If this is right

  • Safety alignments that scan single prompts or images fail against attacks that rely on multi-step composition.
  • Attack success rates exceed 0.90 on SafeBench and improve by as much as 0.39 over prior baselines.
  • Defenses must monitor the full reasoning chain rather than individual inputs.
  • The same compositional vulnerability appears across popular LVLMs tested on MM-SafetyBench.
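
The attack success rate (ASR) figures quoted above are fractions of attempts that a judge labels as successful, and the "improvement" is an absolute difference between two such fractions. A minimal sketch of that arithmetic follows, assuming per-attempt boolean labels from some external judge; the function names and the sample labels are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence


def attack_success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of attempts labeled successful (True) by an external judge."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    return sum(outcomes) / len(outcomes)


def asr_gain(method: Sequence[bool], baseline: Sequence[bool]) -> float:
    """Absolute ASR difference between a method and a baseline."""
    return attack_success_rate(method) - attack_success_rate(baseline)


# Illustrative labels only; these are NOT results from the paper.
method_outcomes = [True, True, True, False]
baseline_outcomes = [True, False, False, False]

print(attack_success_rate(method_outcomes))          # 0.75
print(asr_gain(method_outcomes, baseline_outcomes))  # 0.5
```

Read this way, a reported gain of 0.39 is a difference in success fractions on the same benchmark, not a relative percentage increase.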

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The technique could extend to other sequential reasoning systems if they accept mixed image and text inputs over multiple turns.
  • Alignment training may need to include examples of benign components that become harmful only when assembled.
  • Developers could add runtime checks that detect when a sequence of safe-looking inputs is being directed toward a single coherent output.
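
The runtime check suggested in the last bullet can be sketched from the defender's side: score each input on its own, score the inputs taken together, and flag the case where only the combination crosses a threshold. This is a toy sketch under stated assumptions: the inputs are text descriptions of each component (for example, captions), `score` is any caller-supplied risk scorer, and `toy_score` with its watch list is a hypothetical stand-in, not a real safety classifier.

```python
from typing import Callable, Sequence


def flags_only_when_combined(
    parts: Sequence[str],
    score: Callable[[str], float],
    threshold: float = 0.5,
) -> bool:
    """True if no single part crosses the threshold but the joined parts do."""
    any_part_flagged = any(score(p) >= threshold for p in parts)
    combined_flagged = score(" ".join(parts)) >= threshold
    return combined_flagged and not any_part_flagged


# Hypothetical stand-in scorer: fraction of watch-list terms present.
WATCHLIST = {"alpha", "beta", "gamma"}


def toy_score(text: str) -> float:
    words = set(text.lower().split())
    return len(words & WATCHLIST) / len(WATCHLIST)


captions = ["step alpha", "step beta", "step gamma"]
print(flags_only_when_combined(captions, toy_score))  # True
```

The design point is that the scorer is applied to the cumulative context as well as to each component, so a sequence that looks safe piece by piece can still be caught.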

Load-bearing premise

LVLMs will integrate the sequence of benign visual gadgets through their reasoning process to produce a coherent and harmful output that evades detection from any single component.

What would settle it

Run the sequence of benign visual gadgets and directing prompt on a target LVLM and observe whether the model refuses the request, produces unrelated output, or fails to compose the elements into the intended harmful response.
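
The outcomes listed above can be tallied per trial so that refusals, unrelated output, and failed composition are reported separately rather than folded into a single failure count. A minimal sketch, assuming each trial has already been labeled by a human or automated judge; the `Outcome` categories and the sample labels are illustrative assumptions.

```python
from collections import Counter
from enum import Enum
from typing import Dict, Sequence


class Outcome(Enum):
    REFUSED = "refused"            # model declines the request
    UNRELATED = "unrelated"        # output ignores the intended task
    NOT_COMPOSED = "not_composed"  # elements described but never assembled
    COMPOSED = "composed"          # elements assembled into the target response


def summarize(trials: Sequence[Outcome]) -> Dict[str, float]:
    """Share of trials that fall into each outcome category."""
    if not trials:
        raise ValueError("need at least one trial")
    counts = Counter(trials)
    return {o.value: counts[o] / len(trials) for o in Outcome}


# Illustrative labels only; these are NOT results from the paper.
labels = [Outcome.REFUSED, Outcome.COMPOSED, Outcome.COMPOSED, Outcome.UNRELATED]
print(summarize(labels))
```

A high share of `composed` trials would support the premise; a high share in any of the other three categories would count against it.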

Figures

Figures reproduced from arXiv: 2507.21540 by Deyue Zhang, Dongdong Yang, Moyang Chen, Quanchen Zou, Wenzhuo Xu, Xiangzheng Zhang, Yakai Li, Yisong Xiao, Zhao Liu, Zonghao Ying.

Figure 1: Analogy between ROP in software and PRISM in LVLM. Code gadgets with control flow in ROP correspond to visual gadgets and prompt-driven reasoning in PRISM.
Figure 2: Overview of the PRISM pipeline. An auxiliary LLM decomposes the target into key steps, each described as a textual scene. These are used by a T2I model to generate sub-images, which are composed into a single image. The textual prompt, obtained via generalizable template search, guides the LVLM to extract relevant information and compose an unsafe response.
Figure 3: Ablation study on auxiliary models. [Plot residue recovered from extraction: x-axis "Number of Visual Gadgets" (1 to 6); y-axis ASR (0.0 to 1.0); series: Qwen2-VL-7B-Instruct, LlaVA-v1.6-Mistral-7B, Llama-3.2-11B-Vision-Instruct, GPT-4o, Claude 3.7 Sonnet, GLM-4V-Plus, Qwen-Vl-Plus]
Figure 4: Impact of the number of visual gadgets on the ASR
Figure 5: A sample jailbreak attack on Qwen-VL-Plus using the proposed
Figure 6: A sample jailbreak attack on GPT-4o using the proposed
Figure 7: Comparison of images generated by different T2I
Original abstract

The increasing sophistication of large vision-language models (LVLMs) has been accompanied by advances in safety alignment mechanisms designed to prevent harmful content generation. However, these defenses remain vulnerable to sophisticated adversarial attacks. Existing jailbreak methods typically rely on direct and semantically explicit prompts, overlooking subtle vulnerabilities in how LVLMs compose information over multiple reasoning steps. In this paper, we propose a novel and effective jailbreak framework inspired by Return-Oriented Programming (ROP) techniques from software security. Our approach decomposes a harmful instruction into a sequence of individually benign visual gadgets. A carefully engineered textual prompt directs the sequence of inputs, prompting the model to integrate the benign visual gadgets through its reasoning process to produce a coherent and harmful output. This makes the malicious intent emergent and difficult to detect from any single component. We validate our method through extensive experiments on established benchmarks including SafeBench and MM-SafetyBench, targeting popular LVLMs. Results show that our approach consistently and substantially outperforms existing baselines on state-of-the-art models, achieving near-perfect attack success rates (over 0.90 on SafeBench) and improving ASR by up to 0.39. Our findings reveal a critical and underexplored vulnerability that exploits the compositional reasoning abilities of LVLMs, highlighting the urgent need for defenses that secure the entire reasoning process.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes PRISM, an ROP-inspired jailbreak for LVLMs that decomposes harmful instructions into sequences of individually benign visual gadgets steered by a textual prompt so that the model integrates them into coherent harmful outputs. Experiments on SafeBench and MM-SafetyBench report ASR > 0.90 on SafeBench and gains of up to 0.39 over baselines on state-of-the-art LVLMs.

Significance. If the empirical results hold, the work identifies a concrete vulnerability in how LVLMs perform multi-step compositional reasoning over image sequences, showing that safety filters can be evaded when harm emerges only from integration rather than any single input. This could motivate new defenses that audit cumulative context rather than isolated components.

major comments (2)
  1. [Abstract] Abstract: the central claim that the method produces 'coherent and harmful output' while evading detection rests on the untested assumption that LVLMs integrate the benign gadget sequence exactly as the prompt intends without safety mechanisms operating on cumulative context; no gadget-construction algorithm, per-gadget safety audit, or ablation that removes the integration prompt is supplied.
  2. [Experiments] The reported ASR > 0.90 and improvements of up to 0.39 are presented without full experimental details, baseline comparisons, or controls for confounds such as prompt length or image selection bias, leaving the strength of evidence for the compositional-reasoning vulnerability only moderately supported.
minor comments (2)
  1. [Method] Clarify the precise definition of 'benign' for each gadget and how it was verified against the target LVLMs' safety classifiers.
  2. [Evaluation] Add a table or figure showing per-model ASR with and without the textual integration prompt to isolate its contribution.
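
The table requested in the second minor comment could be assembled along the following lines: per-model ASR with and without the integration prompt, plus the difference. This is a sketch under stated assumptions; the data layout, the function names, and the placeholder model name and labels are hypothetical and do not come from the paper.

```python
from typing import Dict, List, Sequence, Tuple

Runs = Dict[str, Sequence[bool]]


def asr(outcomes: Sequence[bool]) -> float:
    """Fraction of attempts labeled successful."""
    return sum(outcomes) / len(outcomes)


def ablation_rows(results: Dict[str, Runs]) -> List[Tuple[str, float, float, float]]:
    """One row per model: (model, ASR with prompt, ASR without prompt, delta)."""
    rows = []
    for model, runs in sorted(results.items()):
        with_p = asr(runs["with_prompt"])
        without_p = asr(runs["without_prompt"])
        rows.append((model, with_p, without_p, with_p - without_p))
    return rows


def format_table(rows: List[Tuple[str, float, float, float]]) -> str:
    """Render the rows as a fixed-width plain-text table."""
    lines = [f"{'model':<10}{'with':>8}{'without':>10}{'delta':>8}"]
    for model, w, wo, d in rows:
        lines.append(f"{model:<10}{w:>8.2f}{wo:>10.2f}{d:>8.2f}")
    return "\n".join(lines)


# Placeholder labels only; these are NOT results from the paper.
placeholder = {
    "model-A": {
        "with_prompt": [True, True, True, False],
        "without_prompt": [True, False, False, False],
    },
}
print(format_table(ablation_rows(placeholder)))
```

A large positive delta would indicate that the integration prompt, rather than the images alone, drives the result.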

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. The comments raise important points about the clarity of claims in the abstract and the completeness of experimental reporting. We address each major comment below and indicate the specific revisions we will make to strengthen the paper.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that the method produces 'coherent and harmful output' while evading detection rests on the untested assumption that LVLMs integrate the benign gadget sequence exactly as the prompt intends without safety mechanisms operating on cumulative context; no gadget-construction algorithm, per-gadget safety audit, or ablation that removes the integration prompt is supplied.

    Authors: We agree that the abstract is high-level and that additional supporting evidence would strengthen the central claim. The full manuscript describes the gadget-construction algorithm in Section 3.2 as a programmatic decomposition inspired by ROP, with explicit steps for breaking down harmful instructions into benign visual components. To address the integration assumption and lack of ablation, we will add a dedicated ablation study in the revised Experiments section that removes the integration prompt and reports the resulting drop in ASR. We will also include a per-gadget safety audit with quantitative results showing activation rates of safety filters on individual gadgets versus the full sequence. These additions will provide direct empirical support for the emergence of harm through compositional reasoning. revision: yes

  2. Referee: [Experiments] The reported ASR > 0.90 and improvements of up to 0.39 are presented without full experimental details, baseline comparisons, or controls for confounds such as prompt length or image selection bias, leaving the strength of evidence for the compositional-reasoning vulnerability only moderately supported.

    Authors: We acknowledge that the current experimental presentation would benefit from greater transparency. In the revised manuscript, we will expand the Experiments section to include full details on the setup, including exact prompt lengths, image selection criteria and randomization procedures, and explicit controls for confounds such as length bias and selection effects. We will also augment the baseline comparisons with additional methods and report results with standard deviations or confidence intervals across repeated trials. These changes will provide stronger, more robust evidence for the identified vulnerability. revision: yes
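
The confidence intervals promised in the response above can be computed in several ways. One common choice for a success proportion such as ASR is the Wilson score interval, sketched below. The rebuttal does not say which interval the authors would use, so this choice, the function name, and the example counts are assumptions.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("require trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical counts for illustration: 90 successes in 100 attempts.
low, high = wilson_interval(90, 100)
print(f"ASR = 0.90, approx. 95% interval = [{low:.3f}, {high:.3f}]")
```

Unlike the simple normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly when the observed rate is close to 1, which is the regime the paper reports.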

Circularity Check

0 steps flagged

No circularity: empirical attack construction evaluated on external benchmarks

full rationale

The paper proposes an empirical jailbreak method (PRISM) that decomposes harmful instructions into sequences of individually benign visual gadgets, then uses a textual prompt to steer LVLMs into integrating them into coherent harmful outputs. Claims rest on experimental results showing ASR >0.90 on SafeBench and gains up to 0.39 over baselines. No mathematical derivations, equations, fitted parameters, or self-citations appear in the provided text that would reduce any reported success rate to a quantity defined by the method itself. Evaluation uses external benchmarks (SafeBench, MM-SafetyBench) and comparisons to prior baselines, keeping the contribution independent of its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that LVLMs perform compositional reasoning over sequences of individually benign visual inputs when guided by a textual prompt, allowing malicious intent to emerge without detection in any single step.

axioms (1)
  • domain assumption LVLMs integrate information across multiple reasoning steps involving sequences of images and text in a manner that can produce emergent harmful outputs from benign components.
    This assumption underpins the claim that the malicious intent becomes difficult to detect from any single gadget.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SecureWebArena: A Holistic Security Evaluation Benchmark for LVLM-based Web Agents

    cs.CR 2025-10 unverdicted novelty 7.0

    SecureWebArena is a new benchmark suite for holistic security evaluation of LVLM-based web agents using diverse simulated environments, attack taxonomies, and multi-layered failure analysis across reasoning, behavior,...

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · cited by 1 Pith paper · 6 internal anchors

  3. [3]

    GPT-4 Technical Report

    Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F. L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774

  4. [4]

    Aliyun (Alibaba Cloud). 2025. How to Use Vision Models in Model Studio. https://help.aliyun.com/zh/model-studio/vision. Accessed: 2025-07-15; last updated: 2025-06-13

  5. [5]

    Anthropic. 2024. Claude 3.5 Sonnet: First in the next generation of Claude models. Accessed: 2025-06-30

  6. [6]

    Bierbaumer, B.; Kirsch, J.; Kittel, T.; Francillon, A.; and Zarras, A. 2018. Smashing the stack protector for fun and profit. In ICT Systems Security and Privacy Protection: 33rd IFIP TC 11 International Conference, SEC 2018, Held at the 24th IFIP World Computer Congress, WCC 2018, Poznan, Poland, September 18-20, 2018, Proceedings 33, 293--306. Springer

  7. [7]

    ByteDance Seed Team. 2025. Seedream 3.0: Next-Gen Text-to-Image Model. https://seed.bytedance.com/en/tech/seedream3_0. Accessed: 2025-06-30

  8. [8]

    JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

    Chao, P.; Debenedetti, E.; Robey, A.; Andriushchenko, M.; Croce, F.; Sehwag, V.; Dobriban, E.; Flammarion, N.; Pappas, G. J.; Tramer, F.; et al. 2024. Jailbreakbench: An open robustness benchmark for jailbreaking large language models. arXiv preprint arXiv:2404.01318

  9. [9]

    Dong, Y.; Liu, Z.; Sun, H.-L.; Yang, J.; Hu, W.; Rao, Y.; and Liu, Z. 2025. Insight-v: Exploring long-chain visual reasoning with multimodal large language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, 9062--9072

  10. [10]

    Esser, P.; Kulal, S.; Blattmann, A.; Entezari, R.; Müller, J.; Saini, H.; Levi, Y.; Lorenz, D.; Sauer, A.; Boesel, F.; et al. 2024. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first international conference on machine learning

  11. [11]

    Fang, W.; Wu, Q.; Chen, J.; and Xue, Y. 2025. Guided MLLM Reasoning: Enhancing MLLM with Knowledge and Visual Notes for Visual Question Answering. In Proceedings of the Computer Vision and Pattern Recognition Conference, 19597--19607

  12. [12]

    Gong, Y.; Ran, D.; Liu, J.; Wang, C.; Cong, T.; Wang, A.; Duan, S.; and Wang, X. 2025. Figstep: Jailbreaking large vision-language models via typographic visual prompts. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, 23951--23959

  13. [13]

    Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation

    Gou, Y.; Chen, K.; Liu, Z.; Hong, L.; Xu, H.; Li, Z.; Yeung, D.-Y.; Kwok, J. T.; and Zhang, Y. 2024. Eyes closed, safety on: Protecting multimodal llms via image-to-text transformation. In European Conference on Computer Vision, 388--404. Springer

  14. [14]

    GPT-4o System Card

    Hurst, A.; Lerer, A.; Goucher, A. P.; Perelman, A.; Ramesh, A.; Clark, A.; Ostrow, A.; Welihinda, A.; Hayes, A.; Radford, A.; et al. 2024. Gpt-4o system card. arXiv preprint arXiv:2410.21276

  15. [15]

    Intel Corporation. 2023. Intel 64 and IA-32 Architectures Software Developer’s Manual. Volume 1: Basic Architecture. Chapter 3: System Architecture Overview

  16. [16]

    Kuang, J.; Shen, Y.; Xie, J.; Luo, H.; Xu, Z.; Li, R.; Li, Y.; Cheng, X.; Lin, X.; and Han, Y. 2025. Natural language understanding and inference with mllm in visual question answering: A survey. ACM Computing Surveys, 57(8): 1--36

  17. [17]

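    Images are Achilles’ Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models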

    Li, Y.; Guo, H.; Zhou, K.; Zhao, W. X.; and Wen, J.-R. 2024. Images are achilles’ heel of alignment: Exploiting visual vulnerabilities for jailbreaking multimodal large language models. In European Conference on Computer Vision, 174--189. Springer

  18. [18]

    Liu, H.; Li, C.; Li, Y.; and Lee, Y. J. 2023. Improved Baselines with Visual Instruction Tuning. arXiv:2310.03744

  19. [19]

    Liu, S.; Cheng, H.; Liu, H.; Zhang, H.; Li, F.; Ren, T.; Zou, X.; Yang, J.; Su, H.; Zhu, J.; et al. 2024a. Llava-plus: Learning to use tools for creating multimodal agents. In European Conference on Computer Vision, 126--142. Springer

  20. [20]

    Liu, X.; Zhu, Y.; Gu, J.; Lan, Y.; Yang, C.; and Qiao, Y. 2024b. Mm-safetybench: A benchmark for safety evaluation of multimodal large language models. In European Conference on Computer Vision, 386--403. Springer

  21. [21]

    Lu, J.; Srivastava, S.; Chen, J.; Shrestha, R.; Acharya, M.; Kafle, K.; and Kanan, C. 2025. Revisiting multi-modal llm evaluation. In Proceedings of the Computer Vision and Pattern Recognition Conference, 555--564

  22. [22]

    Luo, W.; Ma, S.; Liu, X.; Guo, X.; and Xiao, C. 2024. JailBreakV: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks. arXiv:2404.03027

  23. [23]

    Meta AI. 2024. Llama 3.2: Connect 2024 — Vision & Edge Mobile Devices. https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/. Accessed: 2025-06-30

  24. [24]

    Meta AI. 2025. LLaMA Use Policy. https://ai.meta.com/llama/use-policy/. Accessed: 2025-06-30

  25. [25]

    Microsoft Corporation. 2006. Data Execution Prevention. https://learn.microsoft.com/en-us/windows/win32/memory/data-execution-prevention. Accessed: 2025-06-30

  26. [26]

    OpenAI. 2023. DALL·E 3. https://openai.com/index/dall-e-3/. Accessed: 2025-06-30

  27. [27]

    OpenAI. 2025. Usage Policies. https://openai.com/zh-Hans-CN/policies/usage-policies/. Accessed: 2025-06-30

  28. [28]

    Pahune, S.; and Rewatkar, N. 2023. Healthcare: A Growing Role for Large Language Models and Generative AI. International Journal for Research in Applied Science and Engineering Technology, 11 (8), 2288--2301

  29. [29]

    Qi, X.; Huang, K.; Panda, A.; Henderson, P.; Wang, M.; and Mittal, P. 2024. Visual adversarial examples jailbreak aligned large language models. In Proceedings of the AAAI conference on artificial intelligence, volume 38, 21527--21536

  30. [30]

    Ran, D.; Liu, J.; Gong, Y.; Zheng, J.; He, X.; Cong, T.; and Wang, A. 2024. Jailbreakeval: An integrated toolkit for evaluating jailbreak attempts against large language models. arXiv preprint arXiv:2406.09321

  31. [31]

    Shacham, H. 2007. The geometry of innocent flesh on the bone: Return-into-libc without function calls (on the x86). In Proceedings of the 14th ACM conference on Computer and communications security, 552--561

  32. [32]

    Shacham, H.; Page, M.; Pfaff, B.; Goh, E.-J.; Modadugu, N.; and Boneh, D. 2004. On the effectiveness of address-space randomization. In Proceedings of the 11th ACM conference on Computer and communications security, 298--307

  33. [33]

    Sun, Y.; Zhu, C.; Zheng, S.; Zhang, K.; Sun, L.; Shui, Z.; Zhang, Y.; Li, H.; and Yang, L. 2024. Pathasst: A generative foundation ai assistant towards artificial general intelligence of pathology. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, 5034--5042

  34. [34]

    Teng, M.; Xiaojun, J.; Ranjie, D.; Xinfeng, L.; Yihao, H.; Zhixuan, C.; Yang, L.; and Wenqi, R. 2024. Heuristic-induced multimodal risk distribution jailbreak attack for multimodal large language models. arXiv preprint arXiv:2412.05934

  35. [35]

    Wang, P.; Bai, S.; Tan, S.; Wang, S.; Fan, Z.; Bai, J.; Chen, K.; Liu, X.; Wang, J.; Ge, W.; Fan, Y.; Dang, K.; Du, M.; Ren, X.; Men, R.; Liu, D.; Zhou, C.; Zhou, J.; and Lin, J. 2024a. Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution. arXiv preprint arXiv:2409.12191

  36. [36]

    Wang, Y.; Chen, W.; Han, X.; Lin, X.; Zhao, H.; Liu, Y.; Zhai, B.; Yuan, J.; You, Q.; and Yang, H. 2024b. Exploring the reasoning abilities of multimodal large language models (mllms): A comprehensive survey on emerging trends in multimodal reasoning. arXiv preprint arXiv:2401.06805

  37. [37]

    Wang, Y.; Liu, X.; Li, Y.; Chen, M.; and Xiao, C. 2024c. Adashield: Safeguarding multimodal large language models from structure-based attack via adaptive shield prompting. In European Conference on Computer Vision, 77--94. Springer

  38. [38]

    xAI. 2025. Grok. https://grok.com/. Latest release: Grok-3 (Feb 17, 2025); Accessed: 2025-06-30

  39. [39]

    Xu, Y.; Qi, X.; Qin, Z.; and Wang, W. 2024. Cross-modality information check for detecting jailbreaking in multimodal large language models. arXiv preprint arXiv:2407.21659

  40. [40]

    Ying, Z.; Liu, A.; Liang, S.; Huang, L.; Guo, J.; Zhou, W.; Liu, X.; and Tao, D. 2024a. Safebench: A safety evaluation framework for multimodal large language models. arXiv preprint arXiv:2410.18927

  41. [41]

    Ying, Z.; Liu, A.; Liu, X.; and Tao, D. 2024b. Unveiling the safety of gpt-4o: An empirical study using jailbreak attacks. arXiv preprint arXiv:2406.06302

  42. [42]

    Ying, Z.; Liu, A.; Zhang, T.; Yu, Z.; Liang, S.; Liu, X.; and Tao, D. 2024c. Jailbreak vision language models via bi-modal adversarial prompt. arXiv preprint arXiv:2406.04031

  43. [43]

    Ying, Z.; Zhang, D.; Jing, Z.; Xiao, Y.; Zou, Q.; Liu, A.; Liang, S.; Zhang, X.; Liu, X.; and Tao, D. 2025a. Reasoning-augmented conversation for multi-turn jailbreak attacks on large language models. arXiv preprint arXiv:2502.11054

  44. [44]

    Ying, Z.; Zheng, G.; Huang, Y.; Zhang, D.; Zhang, W.; Zou, Q.; Liu, A.; Liu, X.; and Tao, D. 2025b. Towards understanding the safety boundaries of deepseek models: Evaluation and findings. arXiv preprint arXiv:2503.15092

  45. [45]

    Yuan, M.; Bao, P.; Yuan, J.; Shen, Y.; Chen, Z.; Xie, Y.; Zhao, J.; Li, Q.; Chen, Y.; Zhang, L.; et al. 2024. Large language models illuminate a progressive pathway to artificial intelligent healthcare assistant. Medicine Plus, 100030

  46. [46]

    Zhang, X.; Zeng, F.; and Gu, C. 2025. Simignore: Exploring and enhancing multimodal large model complex reasoning via similarity computation. Neural Networks, 184: 107059

  47. [47]

    Zhang, X.; Zhang, C.; Li, T.; Huang, Y.; Jia, X.; Hu, M.; Zhang, J.; Liu, Y.; Ma, S.; and Shen, C. 2023. Jailguard: A universal detection framework for llm prompt-based attacks. arXiv preprint arXiv:2312.10766

  48. [48]

    Zhao, S.; Duan, R.; Wang, F.; Chen, C.; Kang, C.; Tao, J.; Chen, Y.; Xue, H.; and Wei, X. 2025. Jailbreaking multimodal large language models via shuffle inconsistency. arXiv preprint arXiv:2501.04931

  49. [49]

    Zhipu AI. 2025. GLM-4V: A Multimodal Vision-Language Model by Zhipu AI. https://open.bigmodel.cn/dev/howuse/glm-4v. Accessed: July 15, 2025

  50. [50]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    Zou, A.; Wang, Z.; Carlini, N.; Nasr, M.; Kolter, J. Z.; and Fredrikson, M. 2023. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043