pith. sign in

arxiv: 2605.18868 · v1 · pith:JN7UP5WXnew · submitted 2026-05-15 · 💻 cs.CR · cs.AI· cs.CV· cs.LG

DarkLLM: Learning Language-Driven Adversarial Attacks with Large Language Models

Pith reviewed 2026-05-20 18:24 UTC · model grok-4.3

classification 💻 cs.CR cs.AIcs.CVcs.LG
keywords adversarial attackslarge language modelsmultimodal modelsinstruction tuningvisual perturbationsCLIPSAMsegmentation attacks
0
0 comments X

The pith

A small LLM trained on attack instructions generates flexible adversarial perturbations for vision and multimodal models from natural language.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DarkLLM, a framework that trains an LLM to convert natural-language attack instructions into latent vectors decoded as image perturbations. This unifies targeted, untargeted, segmentation, and multi-model attacks inside one controllable system instead of requiring separate designs for each objective. A sympathetic reader would care because the method shows foundation models can be attacked scalably through language rather than fixed, model-specific objectives. Experiments across many datasets and models indicate that a 1B-parameter LLM can produce effective attacks on systems like CLIP and SAM.

Core claim

DarkLLM trains an LLM through natural-language instruction tuning to map attacker instructions to latent attack vectors that are decoded into visual adversarial perturbations, creating a single framework that supports targeted, untargeted, segmentation, and multi-model attacks while achieving high effectiveness against CLIP, SAM, and frontier large language models.

What carries the argument

The instruction-tuned LLM that translates natural-language attack instructions into latent attack vectors for subsequent decoding into perturbations.

Load-bearing premise

Natural language instructions can be mapped reliably by an LLM to perturbation vectors that induce the desired effects across different models and tasks.

What would settle it

Train the 1B LLM as described and test whether the generated perturbations consistently fail to produce the instructed behaviors on a new held-out model or dataset.

Figures

Figures reproduced from arXiv: 2605.18868 by Henghui Ding, Jiaming Zhang, Qixian Zhang, Xingjun Ma, Xin Wang, Ye Sun, Yifan Ding, Yifeng Gao, Yixu Wang, Yu-Gang Jiang.

Figure 1
Figure 1. Figure 1: (a) An illustration of traditional adversarial attacks. (b) DarkLLM translates natural [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The framework of DarkLLM consists of two main stages. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: A brief illus￾tration of our conditional noise generator. We employ two representative classes of large vision foundation models as surrogates for optimizing DarkLLM: the coarse-grained vision-language model, CLIP [46], and the fine-grained promptable visual segmentation model, SAM [26]. Optimization for Fine-Grained Segmentation. To extend our frame￾work to fine-grained visual segmentation, we incorporate… view at source ↗
Figure 4
Figure 4. Figure 4: Visualization of DarkLLM for language-driven attacks. [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Ablation study on the effects of different factors on the attack performance of DarkLLM. [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Attack effectiveness of DarkLLM under defense mechanisms. In this section, we investigate the different fac￾tors on the attack performance of DarkLLM. Attack Optimization on SAM. We analyze several factors that improve the attack effective￾ness and transferability on SAM from different perspectives. As shown in [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Visualization of DarkLLM for instruction-guided attacks on Commercial MLLMs. [PITH_FULL_IMAGE:figures/full_fig_p021_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Visualization of DarkLLM for instruction-guided attacks on SAM-Base. [PITH_FULL_IMAGE:figures/full_fig_p021_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Visualization of DarkLLM for instruction-guided attacks on SAM-Large. [PITH_FULL_IMAGE:figures/full_fig_p021_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Visualization of DarkLLM for instruction-guided attacks on SAM-Huge. [PITH_FULL_IMAGE:figures/full_fig_p022_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Failed cases of DarkLLM for instruction-guided attacks on Commercial MLLMs. [PITH_FULL_IMAGE:figures/full_fig_p022_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Failed cases of DarkLLM for instruction-guided attacks on SAM-Huge. [PITH_FULL_IMAGE:figures/full_fig_p022_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Details of perturbations generated by DarkLLM from user instructions. [PITH_FULL_IMAGE:figures/full_fig_p023_13.png] view at source ↗
read the original abstract

While vision and multimodal foundation models underpin critical tasks from perception to complex reasoning, they remain highly vulnerable to adversarial attacks. However, traditional adversarial attacks are typically limited to single, predefined objectives, tightly coupling each attack to a specific model or task, which restricts their scalability and flexibility in real-world scenarios. In this work, we present DarkLLM, a novel attack framework that trains an LLM to translate natural-language attack instructions into latent attack vectors, which are then decoded into visual adversarial perturbations. By leveraging natural-language instruction tuning, DarkLLM not only unifies targeted, untargeted, segmentation, and multi-model attacks within a single framework, but also achieves flexible and controllable adversarial generation, enabling each instruction to produce a perturbation that induces desired behaviors across heterogeneous models. Through extensive experiments across 4 tasks, 13 datasets, and 15 models, we demonstrate that DarkLLM with only 1B parameters can follow attacker instructions and generate highly effective attacks against CLIP, SAM, and frontier LLMs, revealing a systemic vulnerability in modern foundation models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces DarkLLM, a framework that trains a 1B-parameter LLM via natural-language instruction tuning to map attacker instructions to latent attack vectors; these vectors are decoded into visual perturbations that unify targeted, untargeted, segmentation, and multi-model adversarial attacks. Experiments are reported across 4 tasks, 13 datasets, and 15 models (including CLIP, SAM, and frontier LLMs), with the central claim that the approach yields highly effective, controllable attacks and exposes systemic vulnerabilities in foundation models.

Significance. If the generalization and effectiveness claims are substantiated with proper controls, this would represent a meaningful advance in adversarial ML by replacing task-specific attack engineering with a single language-driven interface. The unification of attack types and the use of a small LLM for cross-model transfer could influence both attack generation and robustness evaluation practices.

major comments (3)
  1. [Experiments] Experiments section: the central claim of effectiveness across 13 datasets and 15 models is load-bearing, yet the manuscript provides no quantitative metrics (success rates, perturbation norms, or comparisons to baselines), error bars, or statistical tests, preventing evaluation of whether the data support the asserted performance.
  2. [Method and Experiments] Method and Experiments sections: the claim that instruction tuning produces a reliable mapping from arbitrary natural-language instructions to effective perturbations on heterogeneous models requires evidence of generalization; the manuscript does not describe instruction diversity, train/test splits for prompts, or ablations on out-of-distribution instructions, which directly undermines the controllability and unification assertions.
  3. [Experiments] Experiments section (cross-model results): the assertion of transfer to CLIP, SAM, and frontier LLMs without model-specific retraining is central, but lacks controls for distribution shift or overfitting to training-model cues; without such ablations the systemic-vulnerability conclusion does not follow from the reported results.
minor comments (2)
  1. [Method] Notation for the latent-vector decoder and its training objective should be clarified with explicit equations to aid reproducibility.
  2. [Figures] Figure captions for attack visualizations could include quantitative perturbation magnitudes and success rates for each example.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough review and valuable feedback on our manuscript. We address each of the major comments below, indicating where we will make revisions to strengthen the paper.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: the central claim of effectiveness across 13 datasets and 15 models is load-bearing, yet the manuscript provides no quantitative metrics (success rates, perturbation norms, or comparisons to baselines), error bars, or statistical tests, preventing evaluation of whether the data support the asserted performance.

    Authors: We agree that providing explicit quantitative metrics is essential for rigorously supporting our claims. Although the manuscript includes experimental results demonstrating effectiveness, we will revise the Experiments section to include comprehensive tables with success rates for each task and model, perturbation norms (L2 and Linf), direct comparisons to relevant baselines, error bars from repeated trials, and appropriate statistical tests such as t-tests for significance. These additions will be included in the revised manuscript. revision: yes

  2. Referee: [Method and Experiments] Method and Experiments sections: the claim that instruction tuning produces a reliable mapping from arbitrary natural-language instructions to effective perturbations on heterogeneous models requires evidence of generalization; the manuscript does not describe instruction diversity, train/test splits for prompts, or ablations on out-of-distribution instructions, which directly undermines the controllability and unification assertions.

    Authors: The manuscript does describe the use of natural-language instruction tuning to achieve unification, but we acknowledge that more details on generalization are needed. In the revision, we will expand the Method section to detail the diversity of the instruction set (e.g., variations in phrasing for targeted vs. untargeted attacks), the prompt train/test split used during tuning, and add ablation studies on out-of-distribution instructions to demonstrate the mapping's reliability. This will better substantiate the controllability claims. revision: yes

  3. Referee: [Experiments] Experiments section (cross-model results): the assertion of transfer to CLIP, SAM, and frontier LLMs without model-specific retraining is central, but lacks controls for distribution shift or overfitting to training-model cues; without such ablations the systemic-vulnerability conclusion does not follow from the reported results.

    Authors: We maintain that the reported cross-model results, where the trained model is applied directly to unseen architectures like CLIP, SAM, and frontier LLMs, provide evidence of transfer without retraining. However, to address concerns about distribution shift and overfitting, we will add ablations in the revised Experiments section that include training on different model subsets and testing on held-out models, as well as analysis of perturbation patterns to check for model-specific cues. These will strengthen the support for the systemic vulnerability conclusion. revision: partial

Circularity Check

0 steps flagged

No circularity: training framework is self-contained and empirically driven

full rationale

The paper presents DarkLLM as a training procedure that maps natural-language instructions to latent vectors decoded into perturbations. No equations, derivations, or load-bearing claims reduce by construction to fitted parameters, self-definitions, or self-citation chains. The abstract and description emphasize a new framework with experiments across tasks, datasets, and models, without invoking prior author work as a uniqueness theorem or smuggling ansatzes. The result is supported by reported empirical success rather than tautological mappings.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review based on abstract only; no explicit free parameters, axioms, or invented entities are described in the provided text.

pith-pipeline@v0.9.0 · 5752 in / 1077 out tokens · 107150 ms · 2026-05-20T18:24:20.733693+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

82 extracted references · 82 canonical work pages · 13 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    Square attack: a query-efficient black-box adversarial attack via random search

    Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion, and Matthias Hein. Square attack: a query-efficient black-box adversarial attack via random search. InEuropean conference on computer vision, pages 484–501. Springer, 2020

  3. [3]

    Food-101–mining discriminative components with random forests

    Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101–mining discriminative components with random forests. InEuropean conference on computer vision, pages 446–461. Springer, 2014

  4. [4]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

  5. [5]

    Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025

  6. [6]

    Microsoft COCO Captions: Data Collection and Evaluation Server

    Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server.arXiv preprint arXiv:1504.00325, 2015

  7. [7]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling.arXiv preprint arXiv:2412.05271, 2024

  8. [8]

    An analysis of single-layer networks in unsupervised feature learning

    Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. InProceedings of the fourteenth international conference on artificial intelligence and statistics, pages 215–223. JMLR Workshop and Conference Proceedings, 2011

  9. [9]

    Xtuner: A toolkit for efficiently fine-tuning llm.https://github.com/InternLM/ xtuner, 2023

    XTuner Contributors. Xtuner: A toolkit for efficiently fine-tuning llm.https://github.com/InternLM/ xtuner, 2023

  10. [10]

    The cityscapes dataset for semantic urban scene understanding

    Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016

  11. [11]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009

  12. [12]

    How robust is Google’s Bard to adversarial image attacks? arXiv:2309.11751, 2023

    Yinpeng Dong, Huanran Chen, Jiawei Chen, Zhengwei Fang, Xiao Yang, Yichi Zhang, Yu Tian, Hang Su, and Jun Zhu. How robust is google’s bard to adversarial image attacks?arXiv preprint arXiv:2309.11751, 2023

  13. [13]

    Boosting adversarial attacks with momentum

    Yinpeng Dong, Fangzhou Liao, Tianyu Pang, Hang Su, Jun Zhu, Xiaolin Hu, and Jianguo Li. Boosting adversarial attacks with momentum. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 9185–9193, 2018

  14. [14]

    Clip-guided generative networks for transferable targeted adversarial attacks

    Hao Fang, Jiawei Kong, Bin Chen, Tao Dai, Hao Wu, and Shu-Tao Xia. Clip-guided generative networks for transferable targeted adversarial attacks. InEuropean Conference on Computer Vision, pages 1–19. Springer, 2024

  15. [15]

    One perturbation is enough: On generating universal adversarial perturbations against vision-language pre- training models

    Hao Fang, Jiawei Kong, Wenbo Yu, Bin Chen, Jiawei Li, Hao Wu, Shu-Tao Xia, and Ke Xu. One perturbation is enough: On generating universal adversarial perturbations against vision-language pre- training models. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4090–4100, 2025

  16. [16]

    Backdooragent: A unified framework for backdoor attacks on llm-based agents.arXiv preprint arXiv:2601.04566, 2026

    Yunhao Feng, Yige Li, Yutao Wu, Yingshui Tan, Yanming Guo, Yifan Ding, Kun Zhai, Xingjun Ma, and Yu-Gang Jiang. Backdooragent: A unified framework for backdoor attacks on llm-based agents.arXiv preprint arXiv:2601.04566, 2026

  17. [17]

    Generative adversarial nets.Advances in neural information processing systems, 27, 2014

    Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets.Advances in neural information processing systems, 27, 2014. 10

  18. [18]

    Explaining and Harnessing Adversarial Examples

    Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014

  19. [19]

    Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

  20. [20]

    X-transfer attacks: Towards super transferable adversarial attacks on clip.arXiv preprint arXiv:2505.05528, 2025

    Hanxun Huang, Sarah Erfani, Yige Li, Xingjun Ma, and James Bailey. X-transfer attacks: Towards super transferable adversarial attacks on clip.arXiv preprint arXiv:2505.05528, 2025

  21. [21]

    Black-box adversarial attacks with limited queries and information

    Andrew Ilyas, Logan Engstrom, Anish Athalye, and Jessy Lin. Black-box adversarial attacks with limited queries and information. InInternational conference on machine learning, pages 2137–2146. PMLR, 2018

  22. [22]

    Adversarial attacks against closed-source mllms via feature optimal alignment.arXiv preprint arXiv:2505.21494, 2025

    Xiaojun Jia, Sensen Gao, Simeng Qin, Tianyu Pang, Chao Du, Yihao Huang, Xinfeng Li, Yiming Li, Bo Li, and Yang Liu. Adversarial attacks against closed-source mllms via feature optimal alignment.arXiv preprint arXiv:2505.21494, 2025

  23. [23]

    Nips 2017: Defense against adversarial attack

    Alex K, Ben Hamner, and Ian Goodfellow. Nips 2017: Defense against adversarial attack. https:// kaggle.com/competitions/nips-2017-defense-against-adversarial-attack , 2017. Kaggle

  24. [24]

    Deep visual-semantic alignments for generating image descriptions

    Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3128–3137, 2015

  25. [25]

    Segment anything in high quality.Advances in Neural Information Processing Systems, 36:29914–29934, 2023

    Lei Ke, Mingqiao Ye, Martin Danelljan, Yu-Wing Tai, Chi-Keung Tang, Fisher Yu, et al. Segment anything in high quality.Advances in Neural Information Processing Systems, 36:29914–29934, 2023

  26. [26]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. InProceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023

  27. [27]

    Collecting a large-scale dataset of fine-grained cars

    Jonathan Krause, Jia Deng, Michael Stark, and Li Fei-Fei. Collecting a large-scale dataset of fine-grained cars. 2013

  28. [28]

    Learning multiple layers of features from tiny images

    Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009

  29. [29]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational conference on machine learning, pages 19730–19742. PMLR, 2023

  30. [30]

    Autobackdoor: Automating backdoor attacks via llm agents.arXiv preprint arXiv:2511.16709, 2025

    Yige Li, Zhe Li, Wei Zhao, Nay Myat Min, Hanxun Huang, Xingjun Ma, and Jun Sun. Autobackdoor: Automating backdoor attacks via llm agents.arXiv preprint arXiv:2511.16709, 2025

  31. [31]

    A frustratingly simple yet highly effective attack baseline: Over 90% success rate against the strong black-box models of gpt-4.5/4o/o1.arXiv preprint arXiv:2503.10635, 2025

    Zhaoyi Li, Xiaohan Zhao, Dong-Dong Wu, Jiacheng Cui, and Zhiqiang Shen. A frustratingly simple yet highly effective attack baseline: Over 90% success rate against the strong black-box models of gpt-4.5/4o/o1.arXiv preprint arXiv:2503.10635, 2025

  32. [32]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. InEuropean conference on computer vision, pages 740–755. Springer, 2014

  33. [33]

    Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

  34. [34]

    arXiv preprint arXiv:2410.05295 (2024)

    Xiaogeng Liu, Peiran Li, Edward Suh, Yevgeniy V orobeychik, Zhuoqing Mao, Somesh Jha, Patrick McDaniel, Huan Sun, Bo Li, and Chaowei Xiao. Autodan-turbo: A lifelong agent for strategy self- exploration to jailbreak llms.arXiv preprint arXiv:2410.05295, 2024

  35. [35]

    AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

    Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. Autodan: Generating stealthy jailbreak prompts on aligned large language models.arXiv preprint arXiv:2310.04451, 2023

  36. [36]

    Trm-uap: Enhancing the transferability of data-free universal adversarial perturbation via truncated ratio maximization

    Yiran Liu, Xin Feng, Yunlong Wang, Wu Yang, and Di Ming. Trm-uap: Enhancing the transferability of data-free universal adversarial perturbation via truncated ratio maximization. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4762–4771, 2023

  37. [37]

    Unsegment anything by simulating deformation

    Jiahao Lu, Xingyi Yang, and Xinchao Wang. Unsegment anything by simulating deformation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24294– 24304, 2024. 11

  38. [38]

    Jailbreakv-28k: A benchmark for assessing the robustness of multimodal large language models against jailbreak attacks,

    Weidi Luo, Siyuan Ma, Xiaogeng Liu, Xiaoyu Guo, and Chaowei Xiao. Jailbreakv: A benchmark for assessing the robustness of multimodal large language models against jailbreak attacks.arXiv preprint arXiv:2404.03027, 2024

  39. [39]

    Safety at scale: A comprehensive survey of large model and agent safety.Foundations and Trends in Privacy and Security, 8(3-4):1–240, 2026

    Xingjun Ma, Yifeng Gao, Yixu Wang, Ruofan Wang, Xin Wang, Ye Sun, Yifan Ding, Hengyuan Xu, Yunhao Chen, Yunhan Zhao, et al. Safety at scale: A comprehensive survey of large model and agent safety.Foundations and Trends in Privacy and Security, 8(3-4):1–240, 2026

  40. [40]

    Towards Deep Learning Models Resistant to Adversarial Attacks

    Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks.arXiv preprint arXiv:1706.06083, 2017

  41. [41]

    On the robustness of vision transformers to adversarial examples

    Kaleel Mahmood, Rigel Mahmood, and Marten Van Dijk. On the robustness of vision transformers to adversarial examples. InProceedings of the IEEE/CVF international conference on computer vision, pages 7838–7847, 2021

  42. [42]

    Universal adver- sarial perturbations

    Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. Universal adver- sarial perturbations. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 1765–1773, 2017

  43. [43]

    Generalizable data-free objective for crafting universal adversarial perturbations.IEEE transactions on pattern analysis and machine intelligence, 41(10):2452–2465, 2018

    Konda Reddy Mopuri, Aditya Ganeshan, and R Venkatesh Babu. Generalizable data-free objective for crafting universal adversarial perturbations.IEEE transactions on pattern analysis and machine intelligence, 41(10):2452–2465, 2018

  44. [44]

    On generating transferable targeted perturbations

    Muzammal Naseer, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, and Fatih Porikli. On generating transferable targeted perturbations. InProceedings of the IEEE/CVF international conference on computer vision, pages 7708–7717, 2021

  45. [45]

    A study of generative large language model for medical research and healthcare.NPJ digital medicine, 6(1):210, 2023

    Cheng Peng, Xi Yang, Aokun Chen, Kaleb E Smith, Nima PourNejatian, Anthony B Costa, Cheryl Martin, Mona G Flores, Ying Zhang, Tanja Magoc, et al. A study of generative large language model for medical research and healthcare.NPJ digital medicine, 6(1):210, 2023

  46. [46]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

  47. [47]

    SAM 2: Segment Anything in Images and Videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos.arXiv preprint arXiv:2408.00714, 2024

  48. [48]

    On the adversarial robustness of vision transformers.arXiv preprint arXiv:2103.15670, 2021

    Rulin Shao, Zhouxing Shi, Jinfeng Yi, Pin-Yu Chen, and Cho-Jui Hsieh. On the adversarial robustness of vision transformers.arXiv preprint arXiv:2103.15670, 2021

  49. [49]

    Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning

    Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. InProceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018

  50. [50]

    Toward expert-level medical question answering with large language models.Nature Medicine, 31(3):943–950, 2025

    Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Mohamed Amin, Le Hou, Kevin Clark, Stephen R Pfohl, Heather Cole-Lewis, et al. Toward expert-level medical question answering with large language models.Nature Medicine, 31(3):943–950, 2025

  51. [51]

    Johannes Stallkamp, Marc Schlipsing, Jan Salmen, and Christian Igel. Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition.Neural networks, 32:323–332, 2012

  52. [52]

    Sama: Towards multi-turn referential grounded video chat with large language models.Advances in Neural Information Processing Systems, 38:47065–47091, 2026

    Ye Sun, Hao Zhang, Henghui Ding, Tiehua Zhang, Xingjun Ma, and Yu-Gang Jiang. Sama: Towards multi-turn referential grounded video chat with large language models.Advances in Neural Information Processing Systems, 38:47065–47091, 2026

  53. [53]

    Intriguing properties of neural networks

    Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks.arXiv preprint arXiv:1312.6199, 2013

  54. [54]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

  55. [55]

    Visual autoregressive modeling: Scalable image generation via next-scale prediction.Advances in neural information processing systems, 37:84839–84865, 2024

    Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction.Advances in neural information processing systems, 37:84839–84865, 2024. 12

  56. [56]

    Cider: Consensus-based image description evaluation

    Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575, 2015

  57. [57]

    Towards transferable targeted adversarial examples

    Zhibo Wang, Hongshan Yang, Yunhe Feng, Peng Sun, Hengchang Guo, Zhifei Zhang, and Kui Ren. Towards transferable targeted adversarial examples. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20534–20543, 2023

  58. [58]

    Towards transferable adversarial attacks on vision transformers

    Zhipeng Wei, Jingjing Chen, Micah Goldblum, Zuxuan Wu, Tom Goldstein, and Yu-Gang Jiang. Towards transferable adversarial attacks on vision transformers. InProceedings of the AAAI conference on artificial intelligence, volume 36, pages 2668–2676, 2022

  59. [59]

    Enhancing the self-universality for transferable targeted attacks

    Zhipeng Wei, Jingjing Chen, Zuxuan Wu, and Yu-Gang Jiang. Enhancing the self-universality for transferable targeted attacks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12281–12290, 2023

  60. [60]

    Learning transferable targeted universal adversarial perturbations by sequential meta-learning.Computers & Security, 137:103584, 2024

    Juanjuan Weng, Zhiming Luo, Dazhen Lin, and Shaozi Li. Learning transferable targeted universal adversarial perturbations by sequential meta-learning.Computers & Security, 137:103584, 2024

  61. [61]

    Improving transferable targeted adversarial attacks with model self-enhancement

    Han Wu, Guanyan Ou, Weibin Wu, and Zibin Zheng. Improving transferable targeted adversarial attacks with model self-enhancement. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24615–24624, 2024

  62. [62]

    Generating Adversarial Examples with Adversarial Networks

    Chaowei Xiao, Bo Li, Jun-Yan Zhu, Warren He, Mingyan Liu, and Dawn Song. Generating adversarial examples with adversarial networks.arXiv preprint arXiv:1801.02610, 2018

  63. [63]

    Improving transferability of adversarial examples with input diversity

    Cihang Xie, Zhishuai Zhang, Yuyin Zhou, Song Bai, Jianyu Wang, Zhou Ren, and Alan L Yuille. Improving transferability of adversarial examples with input diversity. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2730–2739, 2019

  64. [64]

    Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

    Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation.arXiv preprint arXiv:2408.12528, 2024

  65. [65]

    Safebench: A safety evaluation framework for multimodal large language models

    Zonghao Ying, Aishan Liu, Siyuan Liang, Lei Huang, Jinyang Guo, Wenbo Zhou, Xianglong Liu, and Dacheng Tao. Safebench: A safety evaluation framework for multimodal large language models. International Journal of Computer Vision, 134(1):18, 2026

  66. [66]

    Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions.Transactions of the association for computational linguistics, 2:67–78, 2014

  67. [67]

    Faster Segment Anything: Towards Lightweight SAM for Mobile Applications

    Chaoning Zhang, Dongshen Han, Yu Qiao, Jung Uk Kim, Sung-Ho Bae, Seungkyu Lee, and Choong Seon Hong. Faster segment anything: Towards lightweight sam for mobile applications.arXiv preprint arXiv:2306.14289, 2023

  68. [68]

    Attack-sam: Towards evaluating adversarial robustness of segment anything model,

    Chenshuang Zhang, Chaoning Zhang, Taegoo Kang, Donghun Kim, Sung-Ho Bae, and In So Kweon. Attack-sam: Towards evaluating adversarial robustness of segment anything model.arXiv preprint arXiv:2305.00866, 1(3):5, 2023

  69. [69]

    Anyattack: Towards large-scale self-supervised adversarial attacks on vision-language models

    Jiaming Zhang, Junhong Ye, Xingjun Ma, Yige Li, Yunfan Yang, Yunhao Chen, Jitao Sang, and Dit-Yan Yeung. Anyattack: Towards large-scale self-supervised adversarial attacks on vision-language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19900–19909, 2025

  70. [70]

    Towards adversarial attack on vision-language pre-training models

    Jiaming Zhang, Qi Yi, and Jitao Sang. Towards adversarial attack on vision-language pre-training models. InProceedings of the 30th ACM International Conference on Multimedia, pages 5005–5013, 2022

  71. [71]

    Universal adversarial perturbations for vision-language pre-trained models

    Peng-Fei Zhang, Zi Huang, and Guangdong Bai. Universal adversarial perturbations for vision-language pre-trained models. InProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 862–871, 2024

  72. [72]

    On evaluating adversarial robustness of large vision-language models.Advances in Neural Information Processing Systems, 36:54111–54138, 2023

    Yunqing Zhao, Tianyu Pang, Chao Du, Xiao Yang, Chongxuan Li, Ngai-Man Man Cheung, and Min Lin. On evaluating adversarial robustness of large vision-language models.Advances in Neural Information Processing Systems, 36:54111–54138, 2023

  73. [73]

    Black-box targeted adversarial attack on segment anything (sam).IEEE Transactions on Multimedia, 27:1901–1913, 2024

    Sheng Zheng, Chaoning Zhang, and Xinhong Hao. Black-box targeted adversarial attack on segment anything (sam).IEEE Transactions on Multimedia, 27:1901–1913, 2024. 13

  74. [74]

    Scene parsing through ade20k dataset

    Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 633–641, 2017

  75. [75]

    Advclip: Downstream- agnostic adversarial examples in multimodal contrastive learning

    Ziqi Zhou, Shengshan Hu, Minghui Li, Hangtao Zhang, Yechao Zhang, and Hai Jin. Advclip: Downstream- agnostic adversarial examples in multimodal contrastive learning. InProceedings of the 31st ACM International Conference on Multimedia, pages 6311–6320, 2023

  76. [76]

    Vanish into thin air: Cross-prompt universal adversarial attacks for sam2.arXiv preprint arXiv:2510.24195, 2025

    Ziqi Zhou, Yifan Hu, Yufei Song, Zijing Li, Shengshan Hu, Leo Yu Zhang, Dezhong Yao, Long Zheng, and Hai Jin. Vanish into thin air: Cross-prompt universal adversarial attacks for sam2.arXiv preprint arXiv:2510.24195, 2025

  77. [77]

    Darksam: Fooling segment anything model to segment nothing.Advances in Neural Information Processing Systems, 37:49859–49880, 2024

    Ziqi Zhou, Yufei Song, Minghui Li, Shengshan Hu, Xianlong Wang, Leo Yu Zhang, Dezhong Yao, and Hai Jin. Darksam: Fooling segment anything model to segment nothing.Advances in Neural Information Processing Systems, 37:49859–49880, 2024. 14 A Impact Statements Our DarkLLM imposes several positive broader impacts.1)DarkLLM presents an elegant and versatile f...

  78. [78]

    **Main Subject Consistency:** If both descriptions refer to the same key subject or object (e.g., a person, food, an event), they should receive a higher similarity score

  79. [79]

    **Relevant Description**: If the descriptions are related to the same context or topic, they should also contribute to a higher similarity score

  80. [80]

    Focus on **whether both descriptions fundamentally describe the same thing.**

    **Ignore Fine-Grained Details:** Do not penalize differences in **phrasing, sentence structure, or minor variations in detail**. Focus on **whether both descriptions fundamentally describe the same thing.**

Showing first 80 references.