arxiv: 2606.24388 · v1 · pith:XBACKY5Anew · submitted 2026-06-23 · 💻 cs.AI · cs.LG

PHANTOM: A Large-Scale Dataset of Multimodal Adversarial Attacks for Vision-Language Models

Pith reviewed 2026-06-25 23:50 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords adversarial attacks · vision-language models · multimodal dataset · model robustness · safety evaluation · harmful intents · pre-generated attacks · VLM alignment

The pith

PHANTOM releases 47,524 pre-generated adversarial samples for vision-language models to enable accessible robustness testing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a consolidated dataset of multimodal adversarial attacks drawn from recent literature and extended with additional categories. It aggregates 7,826 intents into 10 high-level categories and 55 subcategories, all supplied as ready-to-use samples rather than requiring fresh generation. The core purpose is to remove the computational barrier that prevents many researchers from running large-scale safety evaluations on vision-language models. By making these attacks publicly available, the work supports systematic testing of model robustness, alignment, and guardrail effectiveness under diverse harmful-intent conditions. The dataset is positioned as a practical extension of prior benchmarks, not as a new attack method itself.

Core claim

PHANTOM is a large-scale, open-source collection of 47,524 adversarial samples for vision-language models. The samples were produced by applying published attack strategies to cover 7,826 intents across 10 high-level categories and 55 subcategories of harmful content, with one new category added to broaden coverage. The dataset is released so that researchers can evaluate VLM safety, fine-tune attack generators, and stress-test defenses without incurring the original generation costs.

What carries the argument

The PHANTOM dataset, a consolidated repository of pre-generated multimodal adversarial samples that aggregates and extends existing attack benchmarks into a single accessible resource.

If this is right

  • Researchers can run large-scale robustness evaluations on VLMs without generating attacks themselves.
  • The dataset supports fine-tuning of models that generate new adversarial examples.
  • Defensive guardrails can be stress-tested against a wider range of harmful intents than previous single-source benchmarks allowed.
  • Evaluations of VLM safety become more reproducible because the same fixed set of attacks can be reused across studies.
  • Coverage of harmful intents expands through the addition of one new high-level category.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Comparative studies across multiple VLMs using the same attack set could identify shared failure modes that single-model tests miss.
  • The fixed nature of the dataset makes it suitable for tracking how model updates over time affect vulnerability to the same attacks.
  • Future extensions could add attack variants generated against the latest models to keep the resource current.
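
One way to look for shared failure modes across models is to intersect the sets of samples that succeeded against each one. The sketch below assumes each model's successful attacks are already available as a set of sample IDs; the model names and IDs are hypothetical placeholders, not values from the dataset.

```python
def shared_failures(successes_by_model):
    """Return the sample IDs that succeeded against every model.

    successes_by_model maps a model name to an iterable of sample IDs
    whose responses were judged harmful (names and IDs are hypothetical).
    """
    sets = [set(ids) for ids in successes_by_model.values()]
    if not sets:
        return set()
    return set.intersection(*sets)


# Toy input: three hypothetical models and four hypothetical sample IDs.
successes = {
    "model_a": {"s1", "s2", "s3"},
    "model_b": {"s2", "s3"},
    "model_c": {"s3", "s4"},
}
common = shared_failures(successes)
```

Because the attack set is fixed, the same function can be rerun on a later model version to see whether the shared set shrinks.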

Load-bearing premise

The pre-generated attacks will remain effective and representative when applied by other researchers to new or updated vision-language models.

What would settle it

A direct test that measures the success rate of the supplied attack samples against a current frontier VLM and compares it to the rates reported in the original source papers; a large drop would indicate the dataset has become outdated.
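
A minimal sketch of that comparison, assuming each judged sample is a record with an `attack` name and a boolean `harmful` verdict. Both field names are hypothetical, and the `reported` values are placeholders rather than figures from any source paper.

```python
from collections import defaultdict


def attack_success_rate(judgments):
    """Fraction of responses the judge flagged as harmful."""
    judgments = list(judgments)
    if not judgments:
        raise ValueError("no judgments supplied")
    return sum(1 for j in judgments if j) / len(judgments)


def asr_by_attack(records):
    """Group judged records by attack strategy and compute ASR per group."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["attack"]].append(record["harmful"])
    return {attack: attack_success_rate(v) for attack, v in grouped.items()}


def asr_drop(measured, reported):
    """Reported ASR minus measured ASR, for attacks present in both."""
    return {a: reported[a] - measured[a] for a in measured if a in reported}


# Toy records; real ones would come from running the samples on a target VLM.
records = [
    {"attack": "BAP", "harmful": True},
    {"attack": "BAP", "harmful": False},
    {"attack": "MML", "harmful": False},
    {"attack": "MML", "harmful": False},
]
reported = {"BAP": 0.75, "MML": 0.5}  # placeholder values, not from any paper
measured = asr_by_attack(records)
drops = asr_drop(measured, reported)
```

A large positive drop for an attack would suggest its pre-generated samples no longer transfer to the tested model. How large counts as "large" is a judgment the sketch leaves open.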

Figures

Figures reproduced from arXiv: 2606.24388 by Hossein Khodadadi, Mauro Dore, Mauro Medda, Nicola Franco, Simone Gallivanone.

Figure 1: Overview of PHANTOM. The risk taxonomy (10 categories, 55 subcategories, 7 826 intents) defines the dataset; the multimodal attacks (BAP, IDEATOR, MML, FC ATTACK, CSDJ) turn each intent into a multimodal adversarial sample (harmful text prompt + image), giving 47 524 (prompt, image) pairs. Each pair is given white-box to nine open-source VLMs and transferred to six closed-source, black-box models; the judg…
Figure 2: PHANTOM intents dataset. The horizontal axis shows the distribution of categories across the dataset, while the vertical axis shows the distribution of intents within each category, broken down by subcategory. As a result, our dataset is organized into 10 high-level categories, further divided into a total of 55 subcategories. The main categories are: Ethical and Social Risks, Privacy and Data Risks, Safet…
Figure 3: Evaluation of ASR based on attack strategy and delay per attack. The main motivation for selecting these strategies was their strong empirical performance. Before converging on this subset, however, we experimented with additional attack strategies, namely QR-attack [2], JOOD [45], CS-DJ [21], FigStep [7], HADES [18], HIMRD [46], VISCARA [47], MIDAS [48], ACZ attack [49]. The results of this preliminary ana…
Figure 4: Category coverage: overall and in-category.
Figure 5: Examining model vulnerability against harmful categories.
4 Limitations. PHANTOM has several limitations. First, our evaluation relies primarily on an automated judge, Abel-24-HarmClassifier, which may introduce false positives and false negatives, particularly for responses that are partially harmful, evasive, or context-dependent. Although automated judging enables large-scale evaluation, it cannot fully …
Figure 6: The plot graphically presents the vulnerability of each tested model to harmful categories, enabling a comparison across attacks.
Figure 7: The plot presents the vulnerability of models to harmful categories. It is obtained by averaging results across different attack strategies to mitigate the influence of the specific attack used.
Figure 8: The plot shows the vulnerabilities of the models with respect to the attack strategies, averaged over the categories.
Figure 9: This plot identifies, for each attack strategy and model, the most vulnerable category by selecting the category with the highest ASR score.
D Benchmark overview. D.1 Evolution of Textual and Early Multimodal Benchmarks. The field was pioneered by AdvBench [17], a text-only benchmark containing 520 behaviors. Despite its reliance on primitive string-matching judges and the simple GCG attack, it established t…
Figure 10: Workflow of MML combining both image manipulation and role-playing through text.
F.4 FC ATTACK. Once again, our methodology closely follows the original approach proposed in [20]. The core idea is to start from a harmful intent and generate a sequence of logical steps to address it using an auxiliary abliterated model. In our case, similarly to BAP [8], we employ the Huihui-Qwen3.5-9B-abliterated model [55…
Original abstract

We introduce a large-scale, open-source dataset of pre-generated adversarial attacks for vision-language models (VLMs). The dataset is designed to be diverse, representative, and practical, extending existing benchmarks by covering 10 high-level categories and 55 subcategories of harmful intents. Our primary goal is to make adversarial data accessible to the research community, given the computational cost and complexity of generating large numbers of attacks. The dataset comprises 47 524 adversarial samples, generated using state-of-the-art attack strategies from recent literature. Our work complements existing efforts by consolidating and extending prior benchmarks from multiple established sources, resulting in 7 826 intents, and introduces an additional category to broaden coverage. This provides realistic evaluation resources for studying model robustness and alignment. Our dataset is intended to enable researchers and practitioners to systematically evaluate the robustness and safety of VLMs, fine-tune attack-generation models, and develop or stress-test defensive guardrails under diverse adversarial conditions. By releasing this resource, we aim to lower the barrier to adversarial research and foster more reproducible, comprehensive, and comparable evaluations of VLM safety.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces PHANTOM, a large-scale open-source dataset of 47,524 pre-generated multimodal adversarial attacks for vision-language models. The samples are generated using state-of-the-art attack strategies from recent literature, covering 7,826 intents across 10 high-level categories and 55 subcategories (with one additional category introduced to broaden coverage). The primary contribution is to consolidate and extend prior benchmarks, lowering the barrier for researchers to evaluate VLM robustness, alignment, and safety without incurring the cost of generating attacks themselves.

Significance. If the samples are empirically confirmed as adversarial, the dataset would be a useful community resource for standardized and reproducible VLM safety evaluations. The consolidation of multiple established sources plus expanded subcategory coverage is a practical strength that could support systematic stress-testing of guardrails and fine-tuning of attack models.

major comments (1)
  1. §3 (Dataset Construction / Generation Pipeline): The central claim that the 47,524 samples constitute effective adversarial examples is unsupported by any reported attack success rates (ASR), per-sample verification, or held-out validation on the source VLMs. Effectiveness is assumed to transfer from the cited source papers without direct empirical checks in this work, which is load-bearing for the dataset's advertised utility.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the thorough review and the recommendation for major revision. We address the single major comment below.

Point-by-point responses
  1. Referee: The central claim that the 47,524 samples constitute effective adversarial examples is unsupported by any reported attack success rates (ASR), per-sample verification, or held-out validation on the source VLMs. Effectiveness is assumed to transfer from the cited source papers without direct empirical checks in this work, which is load-bearing for the dataset's advertised utility.

    Authors: We agree that the manuscript does not report new attack success rates or perform fresh validation on the consolidated set of 47,524 samples. The PHANTOM dataset is explicitly constructed by aggregating pre-generated attacks whose effectiveness was demonstrated in the source papers; our contribution is consolidation and extension of coverage rather than re-generation or re-evaluation. In the revised version we will add a new subsection in §3 that tabulates the attack success rates, target models, and evaluation protocols reported in each cited source paper, cross-referenced to the 10 high-level categories. This will make the empirical grounding transparent. We note that re-validating the full set on the original VLMs would require substantial additional compute and model access, which runs counter to the dataset's purpose of lowering such barriers for the community. The revision will therefore clarify the reliance on prior validations without introducing new experiments.

    revision: yes

Circularity Check

0 steps flagged

No circularity: dataset release with no derivations or fitted predictions

This is a data-release paper whose central contribution is the compilation and taxonomy of 47,524 pre-generated adversarial samples drawn from cited external attack methods. No equations, parameter fitting, uniqueness theorems, or predictions appear in the provided text. The generation pipeline is described as using 'state-of-the-art attack strategies from recent literature,' which is standard citation practice rather than a self-referential reduction. No load-bearing self-citation chain or ansatz smuggling is present. The paper is self-contained as a resource contribution; its claims rest on the external validity of the cited generators, not on any internal derivation that collapses to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a dataset construction and release paper. No free parameters, mathematical axioms, or new invented entities are introduced; the work relies on existing attack generation methods from cited literature.

pith-pipeline@v0.9.1-grok · 5734 in / 1095 out tokens · 20269 ms · 2026-06-25T23:50:13.908986+00:00 · methodology


Reference graph

Works this paper leans on

58 extracted references · 2 canonical work pages

  1. [1]

    Harmbench: A standardized evaluation framework for automated red teaming and robust refusal

    Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, et al. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249, 2024

  2. [2]

    Mm-safetybench: A benchmark for safety evaluation of multimodal large language models

    Xin Liu, Yichen Zhu, Jindong Gu, Yunshi Lan, Chao Yang, and Yu Qiao. Mm-safetybench: A benchmark for safety evaluation of multimodal large language models. In European Conference on Computer Vision, pages 386–403. Springer, 2024

  3. [3]

    Mmj-bench: A comprehensive study on jailbreak attacks and defenses for vision language models

    Fenghua Weng, Yue Xu, Chengyan Fu, and Wenjie Wang. Mmj-bench: A comprehensive study on jailbreak attacks and defenses for vision language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 27689–27697, 2025

  4. [4]

    Omnisafebench-mm: A unified benchmark and toolbox for multimodal jailbreak attack-defense evaluation, 2025

    Xiaojun Jia, Jie Liao, Qi Guo, Teng Ma, Simeng Qin, Ranjie Duan, Tianlin Li, Yihao Huang, Zhitao Zeng, Dongxian Wu, Yiming Li, Wenqi Ren, Xiaochun Cao, and Yang Liu. Omnisafebench-mm: A unified benchmark and toolbox for multimodal jailbreak attack-defense evaluation, 2025. URL https://arxiv.org/abs/2512.06589

  5. [5]

    Multibreak: A scalable and diverse multi-turn jailbreak benchmark for evaluating llm safety, 2026

    Jialin Song, Xiaodong Liu, Weiwei Yang, Wuyang Chen, Mingqian Feng, Xuekai Zhu, and Jianfeng Gao. Multibreak: A scalable and diverse multi-turn jailbreak benchmark for evaluating llm safety, 2026. URL https://arxiv.org/abs/2605.01687

  6. [6]

    Ideator: Jailbreaking and benchmarking large vision-language models using themselves, 2025

    Ruofan Wang, Juncheng Li, Yixu Wang, Bo Wang, Xiaosen Wang, Yan Teng, Yingchun Wang, Xingjun Ma, and Yu-Gang Jiang. Ideator: Jailbreaking and benchmarking large vision-language models using themselves, 2025. URL https://arxiv.org/abs/2411.00827

  7. [7]

    Figstep: Jailbreaking large vision-language models via typographic visual prompts

    Yichen Gong, Delong Ran, Jinyuan Liu, Conglei Wang, Tianshuo Cong, Anyu Wang, Sisi Duan, and Xiaoyun Wang. Figstep: Jailbreaking large vision-language models via typographic visual prompts. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 23951–23959, 2025

  8. [8]

    Jailbreak vision language models via bi-modal adversarial prompt

    Zonghao Ying, Aishan Liu, Tianyuan Zhang, Zhengmin Yu, Siyuan Liang, Xianglong Liu, and Dacheng Tao. Jailbreak vision language models via bi-modal adversarial prompt. IEEE Transactions on Information Forensics and Security, 20:7153–7165, 2025. doi:10.1109/TIFS.2025.3583249

  9. [9]

    Jailbreak large vision-language models through multi-modal linkage

    Yu Wang, Xiaofei Zhou, Yichen Wang, Geyuan Zhang, and Tianxing He. Jailbreak large vision-language models through multi-modal linkage. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1466–1494, Vien...

  10. [10]

    Jailbreakbench: An open robustness benchmark for jailbreaking large language models

    Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J Pappas, Florian Tramer, et al. Jailbreakbench: An open robustness benchmark for jailbreaking large language models. Advances in Neural Information Processing Systems, 37:55005–55029, 2024

  11. [11]

    Sorry-bench: Systematically evaluating large language model safety refusal

    Tinghao Xie, Xiangyu Qi, Yi Zeng, Yangsibo Huang, Udari Madhushani Sehwag, Kaixuan Huang, Luxi He, Boyi Wei, Dacheng Li, Ying Sheng, et al. Sorry-bench: Systematically evaluating large language model safety refusal. arXiv preprint arXiv:2406.14598, 2024

  12. [12]

    Hao Zhang, Wenqi Shao, Hong Liu, Yongqiang Ma, Ping Luo, Yu Qiao, Nanning Zheng, and Kaipeng Zhang. B-avibench: Toward evaluating the robustness of large vision-language model on black-box adversarial visual-instructions. IEEE Transactions on Information Forensics and Security, 20:1434–1446, 2024

  13. [13]

    Safebench: A safety evaluation framework for multimodal large language models

    Zonghao Ying, Aishan Liu, Siyuan Liang, Lei Huang, Jinyang Guo, Wenbo Zhou, Xianglong Liu, and Dacheng Tao. Safebench: A safety evaluation framework for multimodal large language models. International Journal of Computer Vision, 134(1):18, 2026

  14. [14]

    Llm defenses are not robust to multi-turn human jailbreaks yet

    Nathaniel Li, Ziwen Han, Ian Steneker, Willow Primack, Riley Goodside, Hugh Zhang, Zifan Wang, Cristina Menghini, and Summer Yue. Llm defenses are not robust to multi-turn human jailbreaks yet. arXiv preprint arXiv:2408.15221, 2024

  15. [15]

    A strongreject for empty jailbreaks

    Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, et al. A strongreject for empty jailbreaks. Advances in Neural Information Processing Systems, 37:125416–125440, 2024

  16. [16]

    Visual adversarial examples jailbreak aligned large language models

    Xiangyu Qi, Kaixuan Huang, Ashwinee Panda, Peter Henderson, Mengdi Wang, and Prateek Mittal. Visual adversarial examples jailbreak aligned large language models. In Proceedings of the AAAI conference on artificial intelligence, volume 38, pages 21527–21536, 2024.

  17. [17]

    Universal and transferable adversarial attacks on aligned language models

    Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023

  18. [18]

    Images are achilles’ heel of alignment: Exploiting visual vulnerabilities for jailbreaking multimodal large language models

    Yifan Li, Hangyu Guo, Kun Zhou, Wayne Xin Zhao, and Ji-Rong Wen. Images are achilles’ heel of alignment: Exploiting visual vulnerabilities for jailbreaking multimodal large language models. In European Conference on Computer Vision, pages 174–189. Springer, 2024

  19. [19]

    Jailbreakv: A benchmark for assessing the robustness of multimodal large language models against jailbreak attacks

    Weidi Luo, Siyuan Ma, Xiaogeng Liu, Xiaoyu Guo, and Chaowei Xiao. Jailbreakv: A benchmark for assessing the robustness of multimodal large language models against jailbreak attacks. arXiv preprint arXiv:2404.03027, 2024

  20. [20]

    Fc-attack: Jailbreaking large vision-language models via auto-generated flowcharts

    Ziyi Zhang, Zhen Sun, Zongmin Zhang, Jihui Guo, and Xinlei He. Fc-attack: Jailbreaking large vision-language models via auto-generated flowcharts. arXiv e-prints, pages arXiv–2502, 2025

  21. [21]

    Distraction is all you need for multimodal large language model jailbreaking

    Zuopeng Yang, Jiluan Fan, Anli Yan, Erdun Gao, Xin Lin, Tao Li, Kanghua Mo, and Changyu Dong. Distraction is all you need for multimodal large language model jailbreaking. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 9467–9476, 2025

  22. [22]

    Qwen3-vl technical report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025

  23. [23]

    Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding

    Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, et al. Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding. arXiv preprint arXiv:2412.10302, 2024

  24. [24]

    Glm-4.5v and glm-4.1v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning, 2025

    V Team, Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, Shuaiqi Duan, Weihan Wang, Yan Wang, Yean Cheng, Zehai He, Zhe Su, Zhen Yang, Ziyang Pan, Aohan Zeng, Baoxu Wang, Bin Chen, Boyan Shi, Changyu Pang, Chenhui Zhang, Da Yin, Fan Yang, Guoqing Chen, Jiazheng Xu, Jiale Zhu, Jiali Chen, J...

  25. [25]

    Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, Congcong Wang, Dehao Zhang, Dikang Du, Dongliang Wang, Enming Yuan, Enzhe Lu, Fang Li, Flood Sung, Guangda Wei, Guokun Lai, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haoning Wu, Haotian Yao, Haoyu Lu, Heng Wang, Hongcheng Gao, Huabi...

  26. [26]

    Qwen3.5: Towards native multimodal agents, February 2026

    Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5

  27. [27]

    Qwen3.6-27b: Flagship-level coding in a 27b dense model, April 2026

    Qwen Team. Qwen3.6-27b: Flagship-level coding in a 27b dense model, April 2026. URL https://qwen.ai/blog?id=qwen3.6-27b

  28. [28]

    How johnny can persuade llms to jailbreak them: Rethinking persuasion to challenge ai safety by humanizing llms

    Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, and Weiyan Shi. How johnny can persuade llms to jailbreak them: Rethinking persuasion to challenge ai safety by humanizing llms. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14322–14350, 2024

  29. [29]

    AutoDAN-turbo: A lifelong agent for strategy self-exploration to jailbreak LLMs

    Xiaogeng Liu, Peiran Li, G. Edward Suh, Yevgeniy Vorobeychik, Zhuoqing Mao, Somesh Jha, Patrick McDaniel, Huan Sun, Bo Li, and Chaowei Xiao. AutoDAN-turbo: A lifelong agent for strategy self-exploration to jailbreak LLMs. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=bhK7U37VW8.

  30. [30]

    Autoprompt: Eliciting knowledge from language models with automatically generated prompts

    Taylor Shin, Yasaman Razeghi, Robert L Logan IV, Eric Wallace, and Sameer Singh. Autoprompt: Eliciting knowledge from language models with automatically generated prompts. In Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), pages 4222–4235, 2020

  31. [31]

    Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts

    Jiahao Yu, Xingwei Lin, Zheng Yu, and Xinyu Xing. Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts. arXiv preprint arXiv:2309.10253, 2023

  32. [32]

    Jailbreaking black box large language models in twenty queries

    Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries. In 2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), pages 23–42. IEEE, 2025

  33. [33]

    Safebench: A safety evaluation framework for multimodal large language models

    Zonghao Ying, Aishan Liu, Siyuan Liang, Lei Huang, Jinyang Guo, Wenbo Zhou, Xianglong Liu, and Dacheng Tao. Safebench: A safety evaluation framework for multimodal large language models. https://safebench-mm.github.io/, 2025. Online resource

  34. [34]

    Does refusal training in llms generalize to the past tense? arXiv preprint arXiv:2407.11969, 2024

    Maksym Andriushchenko and Nicolas Flammarion. Does refusal training in llms generalize to the past tense? arXiv preprint arXiv:2407.11969, 2024

  35. [35]

    Visual adversarial examples jailbreak aligned large language models

    Xiangyu Qi, Kaixuan Huang, Ashwinee Panda, Peter Henderson, Mengdi Wang, and Prateek Mittal. Visual adversarial examples jailbreak aligned large language models. In Proceedings of the AAAI conference on artificial intelligence, volume 38, pages 21527–21536, 2024

  36. [36]

    Jailbreaking attack against multimodal large language model

    Zhenxing Niu, Haodong Ren, Xinbo Gao, Gang Hua, and Rong Jin. Jailbreaking attack against multimodal large language model. arXiv preprint arXiv:2402.02309, 2024

  37. [37]

    On evaluating adversarial robustness of large vision-language models

    Yunqing Zhao, Tianyu Pang, Chao Du, Xiao Yang, Chongxuan Li, Ngai-Man Man Cheung, and Min Lin. On evaluating adversarial robustness of large vision-language models. Advances in Neural Information Processing Systems, 36:54111–54138, 2023

  38. [38]

    Decision-based black-box attack against vision transformers via patch-wise adversarial removal

    Yucheng Shi, Yahong Han, Yu-an Tan, and Xiaohui Kuang. Decision-based black-box attack against vision transformers via patch-wise adversarial removal. Advances in Neural Information Processing Systems, 35:12921–12933, 2022

  39. [39]

    Decision-based adversarial attacks: Reliable attacks against black-box machine learning models

    Wieland Brendel, Jonas Rauber, and Matthias Bethge. Decision-based adversarial attacks: Reliable attacks against black-box machine learning models. arXiv preprint arXiv:1712.04248, 2017

  40. [40]

    Surfree: a fast surrogate-free black-box attack

    Thibault Maho, Teddy Furon, and Erwan Le Merrer. Surfree: a fast surrogate-free black-box attack. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10430–10439, 2021

  41. [41]

    White-box multimodal jailbreaks against large vision-language models

    Ruofan Wang, Xingjun Ma, Hanxu Zhou, Chuanjun Ji, Guangnan Ye, and Yu-Gang Jiang. White-box multimodal jailbreaks against large vision-language models. In ACM Multimedia 2024, 2024. URL https://openreview.net/forum?id=SMOUQtEaAf

  42. [42]

    Adversarial humanities benchmark: Results on stylistic robustness in frontier model safety, 2026

    Marcello Galisai, Susanna Cifani, Francesco Giarrusso, Piercosma Bisconti, Matteo Prandi, Federico Pierucci, Federico Sartore, and Daniele Nardi. Adversarial humanities benchmark: Results on stylistic robustness in frontier model safety, 2026. URL https://arxiv.org/abs/2604.18487

  43. [43]

    Harmmetric eval: Benchmarking metrics and judges for llm harmfulness assessment, 2026

    Langqi Yang, Tianhang Zheng, Yixuan Chen, Kedong Xiu, Hao Zhou, Wangze Ni, Lei Chen, Zhan Qin, and Kui Ren. Harmmetric eval: Benchmarking metrics and judges for llm harmfulness assessment, 2026. URL https://arxiv.org/abs/2509.24384

  44. [44]

    all-minilm-l6-v2

    sentence-transformers. all-minilm-l6-v2. URL https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2. Hugging Face model

  45. [45]

    Playing the fool: Jailbreaking llms and multimodal llms with out-of-distribution strategy

    Joonhyun Jeong, Seyun Bae, Yeonsung Jung, Jaeryong Hwang, and Eunho Yang. Playing the fool: Jailbreaking llms and multimodal llms with out-of-distribution strategy. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 29937–29946, 2025

  46. [46]

    Heuristic-induced multimodal risk distribution jailbreak attack for multimodal large language models

    Teng Ma, Xiaojun Jia, Ranjie Duan, Xinfeng Li, Yihao Huang, Xiaoshuang Jia, Zhixuan Chu, and Wenqi Ren. Heuristic-induced multimodal risk distribution jailbreak attack for multimodal large language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2686–2696, 2025

  47. [47]

    Viscra: A visual chain reasoning attack for jailbreaking multimodal large language models

    Bingrui Sima, Linhua Cong, Wenxuan Wang, and Kun He. Viscra: A visual chain reasoning attack for jailbreaking multimodal large language models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 6142–6155, 2025

  48. [48]

    Midas: Multi-image dispersion and semantic reconstruction for jailbreaking mllms

    Yilian Liu, Xiaojun Jia, Guoshun Nan, Jiuyang Lyu, Zhican Chen, Tao Guan, Shuyuan Luo, Zhongyi Zhai, and Yang Liu. Midas: Multi-image dispersion and semantic reconstruction for jailbreaking mllms. arXiv preprint arXiv:2603.00565, 2026

  49. [49]

    Hard to read, easy to jailbreak: How visual degradation bypasses mllm safety alignment

    Zhixue Song, Boyan Han, Yiwei Wang, and Chi Zhang. Hard to read, easy to jailbreak: How visual degradation bypasses mllm safety alignment. arXiv preprint arXiv:2605.07250, 2026.

  50. [50]

    Comparison requires valid measurement: Rethinking attack success rate comparisons in ai red teaming

    Alex Chouldechova, A. Feder Cooper, Solon Barocas, Abhinav Palia, Dan Vann, and Hanna Wallach. Comparison requires valid measurement: Rethinking attack success rate comparisons in ai red teaming. In D. Belgrave, C. Zhang, H. Lin, R. Pascanu, P. Koniusz, M. Ghassemi, and N. Chen, editors, Advances in Neural Information Processing Systems, volume 38. Curra...

  51. [51]

    A coin flip for safety: Llm judges fail to reliably measure adversarial robustness

    Leo Schwinn, Moritz Ladenburger, Tim Beyer, Mehrnaz Mofakhami, Gauthier Gidel, and Stephan Günnemann. A coin flip for safety: Llm judges fail to reliably measure adversarial robustness. arXiv preprint arXiv:2603.06594, 2026

  52. [52]

    Low-resource languages jailbreak gpt-4

    Zheng-Xin Yong, Cristina Menghini, and Stephen H Bach. Low-resource languages jailbreak gpt-4. arXiv preprint arXiv:2310.02446, 2023

  53. [53]

    Multilingual blending: Large language model safety alignment evaluation with language mixture

    Jiayang Song, Yuheng Huang, Zhehua Zhou, and Lei Ma. Multilingual blending: Large language model safety alignment evaluation with language mixture. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 3433–3449, 2025

  54. [54]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft coco: Common objects in context. In David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars, editors, Computer Vision – ECCV 2014, pages 740–755, Cham, 2014. Springer International Publishing. ISBN 978-3-319-10602-1

  55. [55]

    Huihui-qwen3.5-9b-abliterated

    huihui-ai. Huihui-qwen3.5-9b-abliterated. URL https://huggingface.co/huihui-ai/Huihui-Qwen3.5-9B-abliterated. Hugging Face model

  56. [56]

    Huihui-gemma-4-31b-it-abliterated

    huihui-ai. Huihui-gemma-4-31b-it-abliterated. URL https://huggingface.co/huihui-ai/Huihui-gemma-4-31B-it-abliterated. Hugging Face model

  57. [57]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014

  58. [58]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.