pith. machine review for the scientific record.

arxiv: 2604.12616 · v1 · submitted 2026-04-14 · 💻 cs.AI · cs.MM

Recognition: unknown

Every Picture Tells a Dangerous Story: Memory-Augmented Multi-Agent Jailbreak Attacks on VLMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:01 UTC · model grok-4.3

classification 💻 cs.AI cs.MM
keywords jailbreak attacks · vision-language models · multi-agent systems · memory augmentation · adversarial attacks · semantic vulnerabilities · multimodal models · natural image attacks

The pith

MemJack shows that multi-agent systems with shared memory can jailbreak vision-language models on unmodified natural images at 71 percent success.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to prove that the semantic content of ordinary photographs offers an effective and under-explored route for attacking vision-language models. It does this by building a system in which several agents cooperate to connect objects and scenes in an image to harmful requests, generate disguised prompts, and avoid early model refusals. Successful tactics are saved in a persistent memory store so they can be applied to new images without starting over each time. The result is sustained, coherent attacks that achieve substantially higher success rates than methods relying on pixel changes or obvious text tricks. The authors also release a large collection of attack trajectories to help others study defenses.
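Sketched as code, the loop just described looks roughly like the following; the agent internals are stubbed, and every name here (the `planner`, `generator`, `judge` roles and the `memory` interface) is illustrative rather than taken from the paper.

```python
# Illustrative sketch of the memory-augmented multi-agent attack loop
# described above. MemJack's real prompts, judging criteria, and APIs
# are not reproduced here; all agent names are hypothetical stubs.

def run_attack(images, agents, memory, max_turns=5):
    """Drive planner/generator/judge agents over a stream of images,
    reusing tactics from a shared memory of past successes."""
    successes = 0
    for image in images:
        goal = agents["planner"](image)              # map visual entities to a goal
        hints = memory.retrieve(image)               # tactics that worked on similar images
        for turn in range(max_turns):
            prompt = agents["generator"](image, goal, hints, turn)
            reply = agents["target"](image, prompt)  # the VLM under test
            if agents["judge"](goal, reply):         # did the model comply?
                memory.store(image, prompt)          # persist the winning tactic
                successes += 1
                break
    return successes / len(images)                   # attack success rate
```

The key structural point is that `memory` persists across the outer loop over images, which is what lets a tactic discovered on one image seed the first turn on the next.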

Core claim

MemJack orchestrates coordinated multi-agent cooperation to dynamically map visual entities to malicious intents, generates adversarial prompts via multi-angle visual-semantic camouflage, and applies an Iterative Nullspace Projection geometric filter to bypass premature latent-space refusals. By accumulating and transferring successful strategies through a persistent Multimodal Experience Memory, it maintains coherent extended multi-turn jailbreak interactions across different images, thereby improving the attack success rate on new images.

What carries the argument

The Multimodal Experience Memory, which accumulates and transfers successful multi-turn jailbreak strategies across images to enable coherent attacks without model-specific retraining.
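One plausible minimal design for such a store, assumed here rather than taken from the paper, keys each successful tactic by an image embedding and retrieves tactics for a new image by cosine similarity:

```python
import numpy as np

class ExperienceMemory:
    """Minimal sketch of a cross-image strategy store: successful tactics
    are keyed by an image embedding and retrieved for visually similar
    images by cosine similarity. The paper's actual schema and retrieval
    policy are not specified here; this design is an assumption."""

    def __init__(self):
        self.keys = []      # image embeddings of past successes
        self.tactics = []   # the prompt/strategy that worked

    def store(self, embedding, tactic):
        self.keys.append(np.asarray(embedding, dtype=float))
        self.tactics.append(tactic)

    def retrieve(self, embedding, k=3):
        if not self.keys:
            return []
        q = np.asarray(embedding, dtype=float)
        mat = np.stack(self.keys)
        # cosine similarity between the query and every stored key
        sims = mat @ q / (np.linalg.norm(mat, axis=1) * np.linalg.norm(q) + 1e-12)
        order = np.argsort(-sims)[:k]
        return [self.tactics[i] for i in order]
```

A brute-force scan like this suffices for a sketch; at the scale of the released trajectory dataset an approximate-nearest-neighbor index would replace the `np.stack` scan.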

If this is right

  • Attack success reaches 71.48 percent on the full unmodified COCO val2017 set against Qwen3-VL-Plus.
  • Performance scales to 90 percent when additional interaction budget is allowed.
  • Attacks remain coherent across extended multi-turn conversations on varying images.
  • A public dataset of over 113,000 interactive multimodal attack trajectories becomes available for defense research.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Safety training for VLMs may need to address semantic interpretations of visual content rather than surface-level features alone.
  • Memory mechanisms that reuse successful strategies could be studied as a way to improve model robustness instead of only enabling attacks.
  • The vulnerability pattern may extend to other multimodal systems that process both images and text in open-ended ways.
  • Systematic testing on additional model architectures would clarify how broadly the semantic attack surface applies.

Load-bearing premise

The coordinated multi-agent mapping of visual entities to malicious intents and the transfer of strategies via memory will generalize to new models and image sets without requiring specific tuning or selection of successful trajectories.

What would settle it

A large drop in attack success rate when MemJack is run on a different vision-language model family or on a fresh collection of images drawn from a distribution other than COCO val2017.
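That test can be phrased as a small harness that reruns the same attack pipeline over several image sources and compares mean ASRs; `attack` is a hypothetical callable returning a success rate, not an interface from the paper.

```python
def transfer_check(attack, images_by_source, n_trials=3):
    """Run the same attack pipeline on several image sources and report
    mean ASR per source, so a distribution shift shows up as an ASR gap.
    `attack(images)` is assumed to return a success rate in [0, 1]."""
    report = {}
    for source, images in images_by_source.items():
        rates = [attack(images) for _ in range(n_trials)]
        report[source] = sum(rates) / len(rates)
    return report
```

A large gap between, say, a COCO-val2017 entry and a fresh-distribution entry in the report would be the settling evidence described above.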

Figures

Figures reproduced from arXiv: 2604.12616 by Hanjie Zhao, Haoyang Chen, Haozhe Liang, Jianhao Chen, Tieyun Qian.

Figure 1: Comparison of VLM jailbreak paradigms. Exist…
Figure 2: Overview of the MemJack framework. Stage 1: the Vulnerability Planning Agent maps visual anchor to attack goal; …
Figure 3: 2D PCA projection of (image, text prompt) embed…
Figure 4: Progressive attack performance. (a) cumulative ASR…
Figure 5: Memory dynamics over the COCO val2017 cam…
Figure 6: Scene × harmful-category heatmap: cell (s, h) is the percentage of images in scene s with a successful attack whose primary harmful category is h.
Figure 7: Example A: shared workspace with laptops on ta…
Figure 9: Example C: cluttered desk / kitchen scene (knife, …
original abstract

The rapid evolution of Vision-Language Models (VLMs) has catalyzed unprecedented capabilities in artificial intelligence; however, this continuous modal expansion has inadvertently exposed a vastly broadened and unconstrained adversarial attack surface. Current multimodal jailbreak strategies primarily focus on surface-level pixel perturbations and typographic attacks or harmful images; however, they fail to engage with the complex semantic structures intrinsic to visual data. This leaves the vast semantic attack surface of original, natural images largely unscrutinized. Driven by the need to expose these deep-seated semantic vulnerabilities, we introduce MemJack, a MEMory-augmented multi-agent JAilbreak attaCK framework that explicitly leverages visual semantics to orchestrate automated jailbreak attacks. MemJack employs coordinated multi-agent cooperation to dynamically map visual entities to malicious intents, generate adversarial prompts via multi-angle visual-semantic camouflage, and utilize an Iterative Nullspace Projection (INLP) geometric filter to bypass premature latent space refusals. By accumulating and transferring successful strategies through a persistent Multimodal Experience Memory, MemJack maintains highly coherent extended multi-turn jailbreak attack interactions across different images, thereby improving the attack success rate (ASR) on new images. Extensive empirical evaluations across full, unmodified COCO val2017 images demonstrate that MemJack achieves a 71.48% ASR against Qwen3-VL-Plus, scaling to 90% under extended budgets. Furthermore, to catalyze future defensive alignment research, we will release MemJack-Bench, a comprehensive dataset comprising over 113,000 interactive multimodal jailbreak attack trajectories, establishing a vital foundation for developing inherently robust VLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MemJack, a memory-augmented multi-agent jailbreak framework for vision-language models that coordinates agents to map visual entities in natural images to malicious intents, generates multi-angle semantic camouflage prompts, applies an Iterative Nullspace Projection (INLP) geometric filter to evade latent-space refusals, and transfers strategies via a persistent Multimodal Experience Memory for coherent multi-turn attacks. The central empirical claim is a 71.48% attack success rate (ASR) on the full unmodified COCO val2017 set against Qwen3-VL-Plus, scaling to 90% under extended budgets, together with the planned release of the MemJack-Bench dataset of >113k trajectories.

Significance. If the reported results are reproducible with the full pipeline, the work would establish that unmodified natural images contain exploitable semantic structures enabling high-success, memory-driven multi-turn jailbreaks on frontier VLMs, substantially broadening the known attack surface beyond pixel perturbations or typographic methods. The release of a large-scale interactive trajectory dataset would constitute a concrete, reusable resource for defensive alignment research.

major comments (2)
  1. [§3.3, §4.1] INLP component and evaluation protocol: The MemJack pipeline is described as including Iterative Nullspace Projection to bypass premature latent-space refusals, yet the headline 71.48% ASR is reported for the closed-source, API-only Qwen3-VL-Plus model, which provides no public access to activations, embeddings, or nullspace directions. No black-box surrogate is described, and no ablation clarifies whether the geometric filter was simply omitted for this model; this directly undermines attribution of the central ASR number to the complete framework as described.
  2. [§4] Experimental results: The abstract and results section report a concrete 71.48% ASR without error bars, confidence intervals, or details on trajectory selection and filtering criteria, and without baseline comparisons or component ablations (e.g., with the memory module removed). These omissions make it impossible to assess whether the headline figure is robust or sensitive to implementation choices.
minor comments (2)
  1. [Abstract] The acronym expansion and bolding for “MemJack” are typographically inconsistent; the planned dataset release is mentioned, but no license, access method, or exact composition statistics are given.
  2. [§3] The multi-agent coordination protocol, and the precise mechanism by which the Multimodal Experience Memory stores and retrieves visual-semantic strategies, would benefit from a pseudocode listing or an explicit data-flow diagram.
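For context on major comment 1: the INLP operation itself is a simple iterated projection, sketched below with NumPy under the assumption that the "refusal" directions have already been learned as linear probes on internal activations, which is precisely the white-box access a closed API denies.

```python
import numpy as np

def nullspace_project(X, directions):
    """Iteratively remove each learned 'refusal' direction from the
    representations X, in the spirit of INLP (Ravfogel et al., 2020):
    after each step the rows of X have zero component along that
    direction. The directions themselves would come from probes trained
    on model activations, which is not possible over an API."""
    X = np.asarray(X, dtype=float)
    for w in directions:
        w = np.asarray(w, dtype=float)
        w = w / np.linalg.norm(w)        # unit vector for the direction
        X = X - np.outer(X @ w, w)       # project rows onto w's nullspace
    return X
```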

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and detailed comments, which help clarify the scope and attribution of our results. We address each major concern below and commit to revisions that improve transparency and rigor without altering the core claims.

point-by-point responses
  1. Referee: [§3.3, §4.1] INLP component and evaluation protocol: The MemJack pipeline is described as including Iterative Nullspace Projection to bypass premature latent-space refusals, yet the headline 71.48% ASR is reported for the closed-source, API-only Qwen3-VL-Plus model, which provides no public access to activations, embeddings, or nullspace directions. No black-box surrogate is described, and no ablation clarifies whether the geometric filter was simply omitted for this model; this directly undermines attribution of the central ASR number to the complete framework as described.

    Authors: We acknowledge the referee's point on attribution. The INLP filter requires internal activations and was therefore not applied to the closed-source Qwen3-VL-Plus API. The reported 71.48% ASR is achieved using only the multi-agent entity mapping, multi-angle semantic camouflage, and persistent Multimodal Experience Memory components. We will revise §3.3, §4.1, and the abstract to explicitly state which modules are active for each model type. In the revision we will also add an ablation on open-source VLMs (e.g., LLaVA-1.6, Qwen2-VL) that compares ASR with and without INLP, thereby confirming the contribution of the geometric filter where it is applicable. revision: yes

  2. Referee: [§4] Experimental results: The abstract and results section report a concrete 71.48% ASR without error bars, confidence intervals, or details on trajectory selection/filtering criteria, and without baseline comparisons or component ablations (e.g., memory module removed). These omissions make it impossible to assess whether the headline figure is robust or sensitive to implementation choices.

    Authors: We agree that statistical robustness and component analysis are necessary. In the revised manuscript we will: (1) report ASR with standard error or 95% confidence intervals obtained from at least three independent runs on a stratified subset of COCO val2017 (subject to API quotas); (2) document the exact trajectory selection and filtering criteria used to compute the 71.48% figure; (3) add baseline comparisons against prior VLM jailbreak methods; and (4) include full component ablations, including removal of the memory module, removal of multi-agent coordination, and removal of camouflage. These results will appear in §4 and the supplementary material. revision: yes
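A minimal version of the interval the authors commit to in response 2, assuming a normal approximation to the binomial; the authors' actual protocol (multiple runs, stratification) may differ.

```python
import math

def asr_confidence_interval(successes, trials, z=1.96):
    """Normal-approximation 95% confidence interval for an attack
    success rate. A sketch of the statistic promised in the rebuttal,
    not the authors' own procedure."""
    p = successes / trials
    half = z * math.sqrt(p * (1 - p) / trials)
    return max(0.0, p - half), min(1.0, p + half)
```

For a 71.48% ASR over the roughly 5,000 images of COCO val2017, an interval of this kind would span only about a percentage point either side, so the main open question is selection criteria, not sampling noise.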

Circularity Check

0 steps flagged

No circularity: purely empirical attack evaluation on held-out data

full rationale

The paper presents MemJack as a multi-agent framework with memory and an INLP filter, then reports a directly measured ASR of 71.48% on the full unmodified COCO val2017 set against Qwen3-VL-Plus. No equations, fitted parameters, self-referential predictions, or derivation chains appear; the headline result is an experimental outcome on external images, not a quantity constructed from the method's own outputs or prior self-citations. The evaluation is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The framework rests on domain assumptions about how VLMs encode visual semantics in latent space and how multi-agent prompting can reliably bypass refusals; no free parameters are introduced, and the two "invented entities" are named system components rather than physical posits.

axioms (2)
  • domain assumption: VLMs possess identifiable latent-space refusal directions that can be geometrically filtered via iterative nullspace projection without destroying task-relevant information.
    Invoked to justify the INLP component that bypasses premature refusals.
  • domain assumption: Successful attack strategies discovered on one image can be transferred to new images via a shared memory store to improve coherence and ASR.
    Central to the memory-augmented claim.
invented entities (2)
  • MemJack framework (no independent evidence)
    purpose: Coordinated multi-agent jailbreak system
    New named system introduced in the paper.
  • Multimodal Experience Memory (no independent evidence)
    purpose: Persistent store for transferring attack strategies across images
    Core novel component for multi-turn coherence.

pith-pipeline@v0.9.0 · 5617 in / 1510 out tokens · 38236 ms · 2026-05-10T15:01:27.198595+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

63 extracted references · 34 canonical work pages · 10 internal anchors

[1] Anthropic. 2025. Claude Haiku 4.5. https://www.anthropic.com/claude
[2] Cameron B. Browne, Edward Powley, Daniel Whitehouse, Simon M. Lucas, Peter I. Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. 2012. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games 4, 1 (2012), 1–43.
[3] Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. 2025. Jailbreaking Black Box Large Language Models in Twenty Queries. 2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML) (2025), 23–42. doi:10.1109/SaTML64287.2025.00010
[4] Yangyi Chen, Hongcheng Gao, Ganqu Cui, Fanchao Qi, Longtao Huang, Zhiyuan Liu, and Maosong Sun. 2022. Why Should Adversarial Perturbations be Imperceptible? Rethink the Research Paradigm in Adversarial NLP. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (2022), 11222–11237. doi:10.18653/v1/2022.emnlp-main.771
[5] Zhaorun Chen, Zhen Xiang, Chaowei Xiao, Dawn Song, and Bo Li. 2024. AgentPoison: Red-Teaming LLM Agents via Poisoning Memory or Knowledge Bases. Advances in Neural Information Processing Systems 37 (2024). arXiv:2407.12784 https://proceedings.neurips.cc/paper_files/paper/2024/hash/eb113910e9c3f6242541c1652e30dfd6-Abstract-Conference.html
[6] Chenhang Cui, Gelei Deng, An Zhang, Jingnan Zheng, Yicong Li, Lianli Gao, Tianwei Zhang, and Tat-Seng Chua. 2025. Safe + Safe = Unsafe? Exploring How Safe Images Can Be Exploited to Jailbreak Large Vision-Language Models. Advances in Neural Information Processing Systems 38 (2025). arXiv:2411.11496 https://openreview.net/forum?id=jvq8nzOUp8
[7] DeepSeek-AI. 2025. DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models. arXiv:2512.02556 [cs.CL] https://arxiv.org/abs/2512.02556
[8] Yichen Gong, Delong Ran, Jinyuan Liu, Conglei Wang, Tianshuo Cong, Anyu Wang, Sisi Duan, and Xiaoyun Wang. 2025. FigStep: Jailbreaking large vision-language models via typographic visual prompts. Proceedings of the AAAI Conference on Artificial Intelligence 39, 22 (2025), 23951–23959.
[9] Google DeepMind. 2025. Gemini 3 Flash Preview. https://deepmind.google/models/gemini/
[10] Qwen3-VL Group. 2025. Qwen3-VL Technical Report. arXiv preprint arXiv:2511.21631 (2025).
[11] Xiangming Gu, Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Ye Wang, Jing Jiang, and Min Lin. 2024. Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast. Proceedings of the 41st International Conference on Machine Learning 235 (2024), 16647–16672. arXiv:2402.08567 https://proceedings.mlr.press/v235/gu24e.html
[12] Qingyan Guo, Rui Wang, Junliang Guo, Bei Li, Kaitao Song, Xu Tan, Guoqing Liu, Jiang Bian, and Yujiu Yang. 2023. Connecting large language models with evolutionary algorithms yields powerful prompt optimizers. arXiv preprint arXiv:2309.08532 (2023).
[13] Ruohao Guo, Afshin Oroojlooy, Roshan Sridhar, Miguel Ballesteros, Alan Ritter, and Dan Roth. 2026. Tree-based Dialogue Reinforced Policy Optimization for Red-Teaming Attacks. arXiv:2510.02286 [cs.LG] https://arxiv.org/abs/2510.02286
[14] Bernal Jiménez Gutiérrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su. 2024. HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models. Advances in Neural Information Processing Systems 37 (2024). arXiv:2405.14831 https://papers.nips.cc/paper_files/paper/2024/hash/4f7f6528c8594970a98ba7c739b8be3f-Abstract-Conference.html
[15] Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data 7, 3 (2019), 535–547.
[16] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems 33 (2020), 9459–9474. https://proceedings.neuri...
[17] Baiqi Li, Zhiqiu Lin, Wenxuan Peng, Jean de Dieu Nyandwi, Daniel Jiang, Zixian Ma, Simran Khanuja, Ranjay Krishna, Graham Neubig, and Deva Ramanan. 2024. NaturalBench: Evaluating vision-language models on natural adversarial samples. Advances in Neural Information Processing Systems 37 (2024), 17044–17068.
[18] Mingxin Li, Yanzhao Zhang, Dingkun Long, Chen Keqin, Sibo Song, Shuai Bai, Zhibo Yang, Pengjun Xie, An Yang, Dayiheng Liu, Jingren Zhou, and Junyang Lin. 2026. Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking. arXiv preprint arXiv:2601.04720 (2026).
[19] Yifan Li, Hangyu Guo, Kun Zhou, Wayne Xin Zhao, and Ji-Rong Wen. 2024. Images are Achilles' heel of alignment: Exploiting visual vulnerabilities for jailbreaking multimodal large language models. European Conference on Computer Vision (2024), 174–189.
[20] Zongxia Li, Xiyang Wu, Hongyang Du, Fuxiao Liu, Huy Nghiem, and Guangyao Shi. 2025. A survey of state of the art large vision language models: Benchmark evaluations and challenges. Proceedings of the Computer Vision and Pattern Recognition Conference (2025), 1587–1606.
[21] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. Computer Vision – ECCV 2014 (2014), 740–755.
[22] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. 2023. Improved Baselines with Visual Instruction Tuning. arXiv:2310.03744 [cs.CV]
[23] Xiaogeng Liu, Peiran Li, G. Edward Suh, Yevgeniy Vorobeychik, Zhuoqing Mao, Somesh Jha, Patrick McDaniel, Huan Sun, Bo Li, and Chaowei Xiao. 2025. AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs. The Thirteenth International Conference on Learning Representations (ICLR) (2025). arXiv:2410.05295 https://openreview.net/forum?i...
[24] Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. 2024. AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models. The Twelfth International Conference on Learning Representations (2024). https://openreview.net/forum?id=7Jwpw4qKkb
[25] Xin Liu, Yichen Zhu, Jindong Gu, Yunshi Lan, Chao Yang, and Yu Qiao. 2024. MM-SafetyBench: A benchmark for safety evaluation of multimodal large language models. European Conference on Computer Vision (2024), 386–403.
[26] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. 2024. MMBench: Is your multi-modal model an all-around player? European Conference on Computer Vision (2024), 216–233.
[27] Yilian Liu, Xiaojun Jia, Guoshun Nan, Jiuyang Lyu, Zhican Chen, Tao Guan, Shuyuan Luo, Zhongyi Zhai, and Yang Liu. 2026. MIDAS: Multi-Image Dispersion and Semantic Reconstruction for Jailbreaking MLLMs. The Fourteenth International Conference on Learning Representations (2026). https://openreview.net/forum?id=tXsE2wKPvx
[28] Weidi Luo, Siyuan Ma, Xiaogeng Liu, Xiaoyu Guo, and Chaowei Xiao. 2024. JailBreakV: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks. First Conference on Language Modeling (2024). https://openreview.net/forum?id=GC4mXVfquq
[29] Teng Ma, Xiaojun Jia, Ranjie Duan, Xinfeng Li, Yihao Huang, Xiaoshuang Jia, Zhixuan Chu, and Wenqi Ren. 2025. Heuristic-Induced Multimodal Risk Distribution Jailbreak Attack for Multimodal Large Language Models. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2025), 2686–. https://openaccess.thecvf.com/content/ICCV2025/html/Ma_Heuristic-Induced_Multimodal_Risk_Distribution_Jailbreak_Attack_for_Multimodal_Large_Language_ICCV_2025_paper.html
[31] Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum S. Anderson, Yaron Singer, and Amin Karbasi. 2024. Tree of Attacks: Jailbreaking Black-Box LLMs Automatically. Advances in Neural Information Processing Systems 37 (2024). arXiv:2312.02119 doi:10.52202/079017-1952
[32] Meta AI. 2024. Llama-3.2-11B-Vision-Instruct. https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct
[33] MistralAI. 2025. Mistral Medium 3. https://mistral.ai/fr/news/mistral-medium-3
[34] Zhenxing Niu, Haodong Ren, Xinbo Gao, Gang Hua, and Rong Jin. 2024. Jailbreaking attack against multimodal large language model. arXiv preprint arXiv:2402.02309 (2024).
[35] OpenAI. 2025. GPT-5 Mini. https://openai.com/
[36] Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. 2022. Red Teaming Language Models with Language Models. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (2022), 3419–3448. doi:10.18653/v1/2022.emnlp-main.225
[37] Xiangyu Qi, Kaixuan Huang, Ashwinee Panda, Peter Henderson, Mengdi Wang, and Prateek Mittal. 2024. Visual Adversarial Examples Jailbreak Aligned Large Language Models. Proceedings of the AAAI Conference on Artificial Intelligence 38, 19 (2024), 21527–21536. doi:10.1609/AAAI.V38I19.30150
[38] Shauli Ravfogel, Yanai Elazar, Hila Gonen, Michael Twiton, and Yoav Goldberg. 2020. Null It Out: Guarding Protected Attributes by Iterative Nullspace Projection. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (2020), 7237–7256. doi:10.18653/v1/2020.acl-main.647
[40] Daniel Schwartz, Dmitriy Bespalov, Zhe Wang, Ninad Kulkarni, and Yanjun Qi. Graph of Attacks with Pruning: Optimizing Stealthy Jailbreak Prompt Generation for Enhanced LLM Content Moderation. arXiv:2501.18638 [cs.CR] https://arxiv.org/abs/2501.18638
[42] Leheng Sheng, Changshuo Shen, Weixiang Zhao, Junfeng Fang, Xiaohao Liu, Zhenkai Liang, Xiang Wang, An Zhang, and Tat-Seng Chua. 2026. AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint. arXiv:2506.07022 [cs.LG] https://arxiv.org/abs/2506.07022
[43] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems 36 (2023), 8634–8652. https://proceedings.neurips.cc/paper_files/paper/2023/file/1b44b878bb782e6954cd888628510e90-Paper-Conference.pdf
[44] Bingrui Sima, Linhua Cong, Wenxuan Wang, and Kun He. 2025. VisCRA: A Visual Chain Reasoning Attack for Jailbreaking Multimodal Large Language Models. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (2025), 6131–6144. doi:10.18653/v1/2025.emnlp-main.312
[45] Jiaxin Song, Yixu Wang, Jie Li, Xuan Tong, Rui Yu, Yan Teng, Xingjun Ma, and Yingchun Wang. 2025. JailBound: Jailbreaking Internal Safety Boundaries of Vision-Language Models. The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025). https://openreview.net/forum?id=yg1yfaKolw
[46] Kimi Team. 2026. Kimi K2.5: Visual Agentic Intelligence. arXiv:2602.02276 [cs.CL] https://arxiv.org/abs/2602.02276
[47] Qwen Team. 2025. Qwen3 Technical Report. arXiv:2505.09388 [cs.CL] https://arxiv.org/abs/2505.09388
[48] V Team. 2025. GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning. arXiv:2507.01006 [cs.CV] https://arxiv.org/abs/2507.01006
[49] Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. 2024. Voyager: An Open-Ended Embodied Agent with Large Language Models. Transactions on Machine Learning Research (2024). arXiv:2305.16291 https://openreview.net/forum?id=ehfRiF0R3a
[50] Ruofan Wang, Juncheng Li, Yixu Wang, Bo Wang, Xiaosen Wang, Yan Teng, Yingchun Wang, Xingjun Ma, and Yu-Gang Jiang. 2025. IDEATOR: Jailbreaking and benchmarking large vision-language models using themselves. Proceedings of the IEEE/CVF International Conference on Computer Vision (2025), 8875–8884.
[51] Siyin Wang, Xingsong Ye, Qinyuan Cheng, Junwen Duan, Shimin Li, Jinlan Fu, Xipeng Qiu, and Xuan-Jing Huang. 2025. Safe inputs but unsafe output: Benchmarking cross-modality safety alignment of large vision-language models. Findings of the Association for Computational Linguistics: NAACL 2025 (2025), 3563–3605.
[52] Yu Wang, Xiaofei Zhou, Yichen Wang, Geyuan Zhang, and Tianxing He. 2025. Jailbreak Large Vision-Language Models Through Multi-Modal Linkage. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (2025), 1466–1494. doi:10.18653/v1/2025.acl-long.74
[53] Tom Wollschläger, Jannes Elstner, Simon Geisler, Vincent Cohen-Addad, Stephan Günnemann, and Johannes Gasteiger. 2025. The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence. Forty-second International Conference on Machine Learning (2025). https://openreview.net/forum?id=80IwJqlXs8
[54] Jihui Yan, Xiaocui Yang, Daling Wang, Shi Feng, Yifei Zhang, and Yinzhi Zhao. 2025. SemanticCamo: Jailbreaking Large Language Models through Semantic Camouflage. Findings of the Association for Computational Linguistics: ACL 2025 (2025), 14427–14452. doi:10.18653/v1/2025.findings-acl.745
[56] Yu Yan, Sheng Sun, Shengjia Cheng, Teli Liu, Mingfeng Li, and Min Liu. 2026. Red-teaming the Multimodal Reasoning: Jailbreaking Vision-Language Models via Cross-modal Entanglement Attacks. arXiv preprint arXiv:2602.10148 (2026). https://arxiv.org/abs/2602.10148
[57] Zuopeng Yang, Jiluan Fan, Anli Yan, Erdun Gao, Xin Lin, Tao Li, Kanghua Mo, and Changyu Dong. 2025. Distraction is All You Need for Multimodal Large Language Model Jailbreaking. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025), 9467–9476. doi:10.1109/CVPR52734.2025.00884
[58] Zonghao Ying, Aishan Liu, Tianyuan Zhang, Zhengmin Yu, Siyuan Liang, Xianglong Liu, and Dacheng Tao. 2025. Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt. IEEE Transactions on Information Forensics and Security 20 (2025), 7153–7165. doi:10.1109/TIFS.2025.3583249
[59] Jiaming Zhang, Junhong Ye, Xingjun Ma, Yige Li, Yunfan Yang, Yunhao Chen, Jitao Sang, and Dit-Yan Yeung. 2025. AnyAttack: Towards Large-scale Self-supervised Adversarial Attacks on Vision-Language Models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025), 19900–19909. doi:10.1109/CVPR52734.2025.01853
[60] Haiquan Zhao, Chenhan Yuan, Fei Huang, Xiaomeng Hu, Yichang Zhang, An Yang, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin, et al. 2025. Qwen3Guard Technical Report. arXiv preprint arXiv:2510.14276 (2025).
[61] Shiji Zhao, Ranjie Duan, Fengxiang Wang, Chi Chen, Caixin Kang, Shouwei Ruan, Jialing Tao, YueFeng Chen, Hui Xue, and Xingxing Wei. 2025. Jailbreaking multimodal large language models via shuffle inconsistency. Proceedings of the IEEE/CVF International Conference on Computer Vision (2025), 2045–2054.
[62] Miao Ziqi, Yi Ding, Lijun Li, and Jing Shao. 2025. Visual Contextual Attack: Jailbreaking MLLMs with Image-Driven Context Injection. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (2025), 9627–9644. doi:10.18653/v1/2025.emnlp-main.487
[63] Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. 2023. Universal and Transferable Adversarial Attacks on Aligned Language Models. arXiv:2307.15043 [cs.CL] https://arxiv.org/abs/2307.15043