DarkLLM: Learning Language-Driven Adversarial Attacks with Large Language Models

Henghui Ding; Jiaming Zhang; Qixian Zhang; Xingjun Ma; Xin Wang; Ye Sun; Yifan Ding; Yifeng Gao; Yixu Wang; Yu-Gang Jiang

arxiv: 2605.18868 · v1 · pith:JN7UP5WXnew · submitted 2026-05-15 · 💻 cs.CR · cs.AI· cs.CV· cs.LG

DarkLLM: Learning Language-Driven Adversarial Attacks with Large Language Models

Ye Sun , Xin Wang , Jiaming Zhang , Yifeng Gao , Yixu Wang , Yifan Ding , Qixian Zhang , Henghui Ding

show 2 more authors

Xingjun Ma Yu-Gang Jiang

This is my paper

Pith reviewed 2026-05-20 18:24 UTC · model grok-4.3

classification 💻 cs.CR cs.AIcs.CVcs.LG

keywords adversarial attackslarge language modelsmultimodal modelsinstruction tuningvisual perturbationsCLIPSAMsegmentation attacks

0 comments

The pith

A small LLM trained on attack instructions generates flexible adversarial perturbations for vision and multimodal models from natural language.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DarkLLM, a framework that trains an LLM to convert natural-language attack instructions into latent vectors decoded as image perturbations. This unifies targeted, untargeted, segmentation, and multi-model attacks inside one controllable system instead of requiring separate designs for each objective. A sympathetic reader would care because the method shows foundation models can be attacked scalably through language rather than fixed, model-specific objectives. Experiments across many datasets and models indicate that a 1B-parameter LLM can produce effective attacks on systems like CLIP and SAM.

Core claim

DarkLLM trains an LLM through natural-language instruction tuning to map attacker instructions to latent attack vectors that are decoded into visual adversarial perturbations, creating a single framework that supports targeted, untargeted, segmentation, and multi-model attacks while achieving high effectiveness against CLIP, SAM, and frontier large language models.

What carries the argument

The instruction-tuned LLM that translates natural-language attack instructions into latent attack vectors for subsequent decoding into perturbations.

Load-bearing premise

Natural language instructions can be mapped reliably by an LLM to perturbation vectors that induce the desired effects across different models and tasks.

What would settle it

Train the 1B LLM as described and test whether the generated perturbations consistently fail to produce the instructed behaviors on a new held-out model or dataset.

Figures

Figures reproduced from arXiv: 2605.18868 by Henghui Ding, Jiaming Zhang, Qixian Zhang, Xingjun Ma, Xin Wang, Ye Sun, Yifan Ding, Yifeng Gao, Yixu Wang, Yu-Gang Jiang.

**Figure 2.** Figure 2: The framework of DarkLLM consists of two main stages. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: A brief illustration of our conditional noise generator. We employ two representative classes of large vision foundation models as surrogates for optimizing DarkLLM: the coarse-grained vision-language model, CLIP [46], and the fine-grained promptable visual segmentation model, SAM [26]. Optimization for Fine-Grained Segmentation. To extend our framework to fine-grained visual segmentation, we incorporate… view at source ↗

**Figure 4.** Figure 4: Visualization of DarkLLM for language-driven attacks. [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Ablation study on the effects of different factors on the attack performance of DarkLLM. [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Attack effectiveness of DarkLLM under defense mechanisms. In this section, we investigate the different factors on the attack performance of DarkLLM. Attack Optimization on SAM. We analyze several factors that improve the attack effectiveness and transferability on SAM from different perspectives. As shown in [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗

**Figure 7.** Figure 7: Visualization of DarkLLM for instruction-guided attacks on Commercial MLLMs. [PITH_FULL_IMAGE:figures/full_fig_p021_7.png] view at source ↗

**Figure 8.** Figure 8: Visualization of DarkLLM for instruction-guided attacks on SAM-Base. [PITH_FULL_IMAGE:figures/full_fig_p021_8.png] view at source ↗

**Figure 9.** Figure 9: Visualization of DarkLLM for instruction-guided attacks on SAM-Large. [PITH_FULL_IMAGE:figures/full_fig_p021_9.png] view at source ↗

**Figure 10.** Figure 10: Visualization of DarkLLM for instruction-guided attacks on SAM-Huge. [PITH_FULL_IMAGE:figures/full_fig_p022_10.png] view at source ↗

**Figure 11.** Figure 11: Failed cases of DarkLLM for instruction-guided attacks on Commercial MLLMs. [PITH_FULL_IMAGE:figures/full_fig_p022_11.png] view at source ↗

**Figure 12.** Figure 12: Failed cases of DarkLLM for instruction-guided attacks on SAM-Huge. [PITH_FULL_IMAGE:figures/full_fig_p022_12.png] view at source ↗

**Figure 13.** Figure 13: Details of perturbations generated by DarkLLM from user instructions. [PITH_FULL_IMAGE:figures/full_fig_p023_13.png] view at source ↗

read the original abstract

While vision and multimodal foundation models underpin critical tasks from perception to complex reasoning, they remain highly vulnerable to adversarial attacks. However, traditional adversarial attacks are typically limited to single, predefined objectives, tightly coupling each attack to a specific model or task, which restricts their scalability and flexibility in real-world scenarios. In this work, we present DarkLLM, a novel attack framework that trains an LLM to translate natural-language attack instructions into latent attack vectors, which are then decoded into visual adversarial perturbations. By leveraging natural-language instruction tuning, DarkLLM not only unifies targeted, untargeted, segmentation, and multi-model attacks within a single framework, but also achieves flexible and controllable adversarial generation, enabling each instruction to produce a perturbation that induces desired behaviors across heterogeneous models. Through extensive experiments across 4 tasks, 13 datasets, and 15 models, we demonstrate that DarkLLM with only 1B parameters can follow attacker instructions and generate highly effective attacks against CLIP, SAM, and frontier LLMs, revealing a systemic vulnerability in modern foundation models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DarkLLM tunes a 1B LLM to turn natural-language attack instructions into perturbations across vision and multimodal models, but the generalization claims need tighter controls.

read the letter

DarkLLM trains a small LLM to map attacker instructions in plain English to latent vectors that decode into visual perturbations. The framework unifies targeted, untargeted, segmentation, and multi-model attacks in one setup instead of building separate methods for each case. They test it on 13 datasets and 15 models including CLIP, SAM, and some frontier LLMs, which is a broad scope for this kind of work. The core move of using instruction tuning to get controllable attacks from language is the clearest new element, and showing that a 1B model can produce effective results is worth noting for anyone thinking about practical attack generation. The experiments appear to back the claim that these attacks work across heterogeneous models and tasks. The soft spot is generalization to instructions the model has not seen during tuning. The stress-test note correctly flags the lack of reported splits on the natural-language prompts or ablations that would show performance on out-of-distribution instructions. Without those checks it is difficult to tell whether the model is truly following arbitrary attacker directions or has picked up on patterns tied to the training phrasings or specific model cues. The abstract also gives no numbers or baseline comparisons, so the actual strength of the results depends on the tables and figures in the full paper. This paper is aimed at people working on adversarial robustness for foundation models. A reader who wants to explore language-driven attack methods or test vulnerabilities in multimodal systems could get ideas from it, even if they end up tightening the evaluation themselves. It deserves peer review because the idea is distinct from prior single-objective attacks and the experimental breadth is real, though referees will likely ask for the missing generalization controls and concrete metrics.

Referee Report

3 major / 2 minor

Summary. The paper introduces DarkLLM, a framework that trains a 1B-parameter LLM via natural-language instruction tuning to map attacker instructions to latent attack vectors; these vectors are decoded into visual perturbations that unify targeted, untargeted, segmentation, and multi-model adversarial attacks. Experiments are reported across 4 tasks, 13 datasets, and 15 models (including CLIP, SAM, and frontier LLMs), with the central claim that the approach yields highly effective, controllable attacks and exposes systemic vulnerabilities in foundation models.

Significance. If the generalization and effectiveness claims are substantiated with proper controls, this would represent a meaningful advance in adversarial ML by replacing task-specific attack engineering with a single language-driven interface. The unification of attack types and the use of a small LLM for cross-model transfer could influence both attack generation and robustness evaluation practices.

major comments (3)

[Experiments] Experiments section: the central claim of effectiveness across 13 datasets and 15 models is load-bearing, yet the manuscript provides no quantitative metrics (success rates, perturbation norms, or comparisons to baselines), error bars, or statistical tests, preventing evaluation of whether the data support the asserted performance.
[Method and Experiments] Method and Experiments sections: the claim that instruction tuning produces a reliable mapping from arbitrary natural-language instructions to effective perturbations on heterogeneous models requires evidence of generalization; the manuscript does not describe instruction diversity, train/test splits for prompts, or ablations on out-of-distribution instructions, which directly undermines the controllability and unification assertions.
[Experiments] Experiments section (cross-model results): the assertion of transfer to CLIP, SAM, and frontier LLMs without model-specific retraining is central, but lacks controls for distribution shift or overfitting to training-model cues; without such ablations the systemic-vulnerability conclusion does not follow from the reported results.

minor comments (2)

[Method] Notation for the latent-vector decoder and its training objective should be clarified with explicit equations to aid reproducibility.
[Figures] Figure captions for attack visualizations could include quantitative perturbation magnitudes and success rates for each example.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough review and valuable feedback on our manuscript. We address each of the major comments below, indicating where we will make revisions to strengthen the paper.

read point-by-point responses

Referee: [Experiments] Experiments section: the central claim of effectiveness across 13 datasets and 15 models is load-bearing, yet the manuscript provides no quantitative metrics (success rates, perturbation norms, or comparisons to baselines), error bars, or statistical tests, preventing evaluation of whether the data support the asserted performance.

Authors: We agree that providing explicit quantitative metrics is essential for rigorously supporting our claims. Although the manuscript includes experimental results demonstrating effectiveness, we will revise the Experiments section to include comprehensive tables with success rates for each task and model, perturbation norms (L2 and Linf), direct comparisons to relevant baselines, error bars from repeated trials, and appropriate statistical tests such as t-tests for significance. These additions will be included in the revised manuscript. revision: yes
Referee: [Method and Experiments] Method and Experiments sections: the claim that instruction tuning produces a reliable mapping from arbitrary natural-language instructions to effective perturbations on heterogeneous models requires evidence of generalization; the manuscript does not describe instruction diversity, train/test splits for prompts, or ablations on out-of-distribution instructions, which directly undermines the controllability and unification assertions.

Authors: The manuscript does describe the use of natural-language instruction tuning to achieve unification, but we acknowledge that more details on generalization are needed. In the revision, we will expand the Method section to detail the diversity of the instruction set (e.g., variations in phrasing for targeted vs. untargeted attacks), the prompt train/test split used during tuning, and add ablation studies on out-of-distribution instructions to demonstrate the mapping's reliability. This will better substantiate the controllability claims. revision: yes
Referee: [Experiments] Experiments section (cross-model results): the assertion of transfer to CLIP, SAM, and frontier LLMs without model-specific retraining is central, but lacks controls for distribution shift or overfitting to training-model cues; without such ablations the systemic-vulnerability conclusion does not follow from the reported results.

Authors: We maintain that the reported cross-model results, where the trained model is applied directly to unseen architectures like CLIP, SAM, and frontier LLMs, provide evidence of transfer without retraining. However, to address concerns about distribution shift and overfitting, we will add ablations in the revised Experiments section that include training on different model subsets and testing on held-out models, as well as analysis of perturbation patterns to check for model-specific cues. These will strengthen the support for the systemic vulnerability conclusion. revision: partial

Circularity Check

0 steps flagged

No circularity: training framework is self-contained and empirically driven

full rationale

The paper presents DarkLLM as a training procedure that maps natural-language instructions to latent vectors decoded into perturbations. No equations, derivations, or load-bearing claims reduce by construction to fitted parameters, self-definitions, or self-citation chains. The abstract and description emphasize a new framework with experiments across tasks, datasets, and models, without invoking prior author work as a uniqueness theorem or smuggling ansatzes. The result is supported by reported empirical success rather than tautological mappings.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review based on abstract only; no explicit free parameters, axioms, or invented entities are described in the provided text.

pith-pipeline@v0.9.0 · 5752 in / 1077 out tokens · 107150 ms · 2026-05-20T18:24:20.733693+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

trains an LLM to translate natural-language attack instructions into latent attack vectors, which are then decoded into visual adversarial perturbations
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

unifies targeted, untargeted, segmentation, and multi-model attacks within a single framework

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

82 extracted references · 82 canonical work pages · 13 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Square attack: a query-efficient black-box adversarial attack via random search

Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion, and Matthias Hein. Square attack: a query-efficient black-box adversarial attack via random search. InEuropean conference on computer vision, pages 484–501. Springer, 2020

work page 2020
[3]

Food-101–mining discriminative components with random forests

Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101–mining discriminative components with random forests. InEuropean conference on computer vision, pages 446–461. Springer, 2014

work page 2014
[4]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

work page 1901
[5]

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

Microsoft COCO Captions: Data Collection and Evaluation Server

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server.arXiv preprint arXiv:1504.00325, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[7]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling.arXiv preprint arXiv:2412.05271, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[8]

An analysis of single-layer networks in unsupervised feature learning

Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. InProceedings of the fourteenth international conference on artificial intelligence and statistics, pages 215–223. JMLR Workshop and Conference Proceedings, 2011

work page 2011
[9]

Xtuner: A toolkit for efficiently fine-tuning llm.https://github.com/InternLM/ xtuner, 2023

XTuner Contributors. Xtuner: A toolkit for efficiently fine-tuning llm.https://github.com/InternLM/ xtuner, 2023

work page 2023
[10]

The cityscapes dataset for semantic urban scene understanding

Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016

work page 2016
[11]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009

work page 2009
[12]

How robust is Google’s Bard to adversarial image attacks? arXiv:2309.11751, 2023

Yinpeng Dong, Huanran Chen, Jiawei Chen, Zhengwei Fang, Xiao Yang, Yichi Zhang, Yu Tian, Hang Su, and Jun Zhu. How robust is google’s bard to adversarial image attacks?arXiv preprint arXiv:2309.11751, 2023

work page arXiv 2023
[13]

Boosting adversarial attacks with momentum

Yinpeng Dong, Fangzhou Liao, Tianyu Pang, Hang Su, Jun Zhu, Xiaolin Hu, and Jianguo Li. Boosting adversarial attacks with momentum. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 9185–9193, 2018

work page 2018
[14]

Clip-guided generative networks for transferable targeted adversarial attacks

Hao Fang, Jiawei Kong, Bin Chen, Tao Dai, Hao Wu, and Shu-Tao Xia. Clip-guided generative networks for transferable targeted adversarial attacks. InEuropean Conference on Computer Vision, pages 1–19. Springer, 2024

work page 2024
[15]

One perturbation is enough: On generating universal adversarial perturbations against vision-language pre- training models

Hao Fang, Jiawei Kong, Wenbo Yu, Bin Chen, Jiawei Li, Hao Wu, Shu-Tao Xia, and Ke Xu. One perturbation is enough: On generating universal adversarial perturbations against vision-language pre- training models. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4090–4100, 2025

work page 2025
[16]

Backdooragent: A unified framework for backdoor attacks on llm-based agents.arXiv preprint arXiv:2601.04566, 2026

Yunhao Feng, Yige Li, Yutao Wu, Yingshui Tan, Yanming Guo, Yifan Ding, Kun Zhai, Xingjun Ma, and Yu-Gang Jiang. Backdooragent: A unified framework for backdoor attacks on llm-based agents.arXiv preprint arXiv:2601.04566, 2026

work page arXiv 2026
[17]

Generative adversarial nets.Advances in neural information processing systems, 27, 2014

Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets.Advances in neural information processing systems, 27, 2014. 10

work page 2014
[18]

Explaining and Harnessing Adversarial Examples

Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[19]

Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

work page 2022
[20]

X-transfer attacks: Towards super transferable adversarial attacks on clip.arXiv preprint arXiv:2505.05528, 2025

Hanxun Huang, Sarah Erfani, Yige Li, Xingjun Ma, and James Bailey. X-transfer attacks: Towards super transferable adversarial attacks on clip.arXiv preprint arXiv:2505.05528, 2025

work page arXiv 2025
[21]

Black-box adversarial attacks with limited queries and information

Andrew Ilyas, Logan Engstrom, Anish Athalye, and Jessy Lin. Black-box adversarial attacks with limited queries and information. InInternational conference on machine learning, pages 2137–2146. PMLR, 2018

work page 2018
[22]

Adversarial attacks against closed-source mllms via feature optimal alignment.arXiv preprint arXiv:2505.21494, 2025

Xiaojun Jia, Sensen Gao, Simeng Qin, Tianyu Pang, Chao Du, Yihao Huang, Xinfeng Li, Yiming Li, Bo Li, and Yang Liu. Adversarial attacks against closed-source mllms via feature optimal alignment.arXiv preprint arXiv:2505.21494, 2025

work page arXiv 2025
[23]

Nips 2017: Defense against adversarial attack

Alex K, Ben Hamner, and Ian Goodfellow. Nips 2017: Defense against adversarial attack. https:// kaggle.com/competitions/nips-2017-defense-against-adversarial-attack , 2017. Kaggle

work page 2017
[24]

Deep visual-semantic alignments for generating image descriptions

Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3128–3137, 2015

work page 2015
[25]

Segment anything in high quality.Advances in Neural Information Processing Systems, 36:29914–29934, 2023

Lei Ke, Mingqiao Ye, Martin Danelljan, Yu-Wing Tai, Chi-Keung Tang, Fisher Yu, et al. Segment anything in high quality.Advances in Neural Information Processing Systems, 36:29914–29934, 2023

work page 2023
[26]

Segment anything

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. InProceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023

work page 2023
[27]

Collecting a large-scale dataset of fine-grained cars

Jonathan Krause, Jia Deng, Michael Stark, and Li Fei-Fei. Collecting a large-scale dataset of fine-grained cars. 2013

work page 2013
[28]

Learning multiple layers of features from tiny images

Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009

work page 2009
[29]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational conference on machine learning, pages 19730–19742. PMLR, 2023

work page 2023
[30]

Autobackdoor: Automating backdoor attacks via llm agents.arXiv preprint arXiv:2511.16709, 2025

Yige Li, Zhe Li, Wei Zhao, Nay Myat Min, Hanxun Huang, Xingjun Ma, and Jun Sun. Autobackdoor: Automating backdoor attacks via llm agents.arXiv preprint arXiv:2511.16709, 2025

work page arXiv 2025
[31]

A frustratingly simple yet highly effective attack baseline: Over 90% success rate against the strong black-box models of gpt-4.5/4o/o1.arXiv preprint arXiv:2503.10635, 2025

Zhaoyi Li, Xiaohan Zhao, Dong-Dong Wu, Jiacheng Cui, and Zhiqiang Shen. A frustratingly simple yet highly effective attack baseline: Over 90% success rate against the strong black-box models of gpt-4.5/4o/o1.arXiv preprint arXiv:2503.10635, 2025

work page arXiv 2025
[32]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. InEuropean conference on computer vision, pages 740–755. Springer, 2014

work page 2014
[33]

Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

work page 2023
[34]

arXiv preprint arXiv:2410.05295 (2024)

Xiaogeng Liu, Peiran Li, Edward Suh, Yevgeniy V orobeychik, Zhuoqing Mao, Somesh Jha, Patrick McDaniel, Huan Sun, Bo Li, and Chaowei Xiao. Autodan-turbo: A lifelong agent for strategy self- exploration to jailbreak llms.arXiv preprint arXiv:2410.05295, 2024

work page arXiv 2024
[35]

AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. Autodan: Generating stealthy jailbreak prompts on aligned large language models.arXiv preprint arXiv:2310.04451, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[36]

Trm-uap: Enhancing the transferability of data-free universal adversarial perturbation via truncated ratio maximization

Yiran Liu, Xin Feng, Yunlong Wang, Wu Yang, and Di Ming. Trm-uap: Enhancing the transferability of data-free universal adversarial perturbation via truncated ratio maximization. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4762–4771, 2023

work page 2023
[37]

Unsegment anything by simulating deformation

Jiahao Lu, Xingyi Yang, and Xinchao Wang. Unsegment anything by simulating deformation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24294– 24304, 2024. 11

work page 2024
[38]

Jailbreakv-28k: A benchmark for assessing the robustness of multimodal large language models against jailbreak attacks,

Weidi Luo, Siyuan Ma, Xiaogeng Liu, Xiaoyu Guo, and Chaowei Xiao. Jailbreakv: A benchmark for assessing the robustness of multimodal large language models against jailbreak attacks.arXiv preprint arXiv:2404.03027, 2024

work page arXiv 2024
[39]

Safety at scale: A comprehensive survey of large model and agent safety.Foundations and Trends in Privacy and Security, 8(3-4):1–240, 2026

Xingjun Ma, Yifeng Gao, Yixu Wang, Ruofan Wang, Xin Wang, Ye Sun, Yifan Ding, Hengyuan Xu, Yunhao Chen, Yunhan Zhao, et al. Safety at scale: A comprehensive survey of large model and agent safety.Foundations and Trends in Privacy and Security, 8(3-4):1–240, 2026

work page 2026
[40]

Towards Deep Learning Models Resistant to Adversarial Attacks

Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks.arXiv preprint arXiv:1706.06083, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[41]

On the robustness of vision transformers to adversarial examples

Kaleel Mahmood, Rigel Mahmood, and Marten Van Dijk. On the robustness of vision transformers to adversarial examples. InProceedings of the IEEE/CVF international conference on computer vision, pages 7838–7847, 2021

work page 2021
[42]

Universal adver- sarial perturbations

Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. Universal adver- sarial perturbations. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 1765–1773, 2017

work page 2017
[43]

Generalizable data-free objective for crafting universal adversarial perturbations.IEEE transactions on pattern analysis and machine intelligence, 41(10):2452–2465, 2018

Konda Reddy Mopuri, Aditya Ganeshan, and R Venkatesh Babu. Generalizable data-free objective for crafting universal adversarial perturbations.IEEE transactions on pattern analysis and machine intelligence, 41(10):2452–2465, 2018

work page 2018
[44]

On generating transferable targeted perturbations

Muzammal Naseer, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, and Fatih Porikli. On generating transferable targeted perturbations. InProceedings of the IEEE/CVF international conference on computer vision, pages 7708–7717, 2021

work page 2021
[45]

A study of generative large language model for medical research and healthcare.NPJ digital medicine, 6(1):210, 2023

Cheng Peng, Xi Yang, Aokun Chen, Kaleb E Smith, Nima PourNejatian, Anthony B Costa, Cheryl Martin, Mona G Flores, Ying Zhang, Tanja Magoc, et al. A study of generative large language model for medical research and healthcare.NPJ digital medicine, 6(1):210, 2023

work page 2023
[46]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

work page 2021
[47]

SAM 2: Segment Anything in Images and Videos

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos.arXiv preprint arXiv:2408.00714, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[48]

On the adversarial robustness of vision transformers.arXiv preprint arXiv:2103.15670, 2021

Rulin Shao, Zhouxing Shi, Jinfeng Yi, Pin-Yu Chen, and Cho-Jui Hsieh. On the adversarial robustness of vision transformers.arXiv preprint arXiv:2103.15670, 2021

work page arXiv 2021
[49]

Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning

Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. InProceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018

work page 2018
[50]

Toward expert-level medical question answering with large language models.Nature Medicine, 31(3):943–950, 2025

Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Mohamed Amin, Le Hou, Kevin Clark, Stephen R Pfohl, Heather Cole-Lewis, et al. Toward expert-level medical question answering with large language models.Nature Medicine, 31(3):943–950, 2025

work page 2025
[51]

Johannes Stallkamp, Marc Schlipsing, Jan Salmen, and Christian Igel. Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition.Neural networks, 32:323–332, 2012

work page 2012
[52]

Sama: Towards multi-turn referential grounded video chat with large language models.Advances in Neural Information Processing Systems, 38:47065–47091, 2026

Ye Sun, Hao Zhang, Henghui Ding, Tiehua Zhang, Xingjun Ma, and Yu-Gang Jiang. Sama: Towards multi-turn referential grounded video chat with large language models.Advances in Neural Information Processing Systems, 38:47065–47091, 2026

work page 2026
[53]

Intriguing properties of neural networks

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks.arXiv preprint arXiv:1312.6199, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013
[54]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[55]

Visual autoregressive modeling: Scalable image generation via next-scale prediction.Advances in neural information processing systems, 37:84839–84865, 2024

Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction.Advances in neural information processing systems, 37:84839–84865, 2024. 12

work page 2024
[56]

Cider: Consensus-based image description evaluation

Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575, 2015

work page 2015
[57]

Towards transferable targeted adversarial examples

Zhibo Wang, Hongshan Yang, Yunhe Feng, Peng Sun, Hengchang Guo, Zhifei Zhang, and Kui Ren. Towards transferable targeted adversarial examples. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20534–20543, 2023

work page 2023
[58]

Towards transferable adversarial attacks on vision transformers

Zhipeng Wei, Jingjing Chen, Micah Goldblum, Zuxuan Wu, Tom Goldstein, and Yu-Gang Jiang. Towards transferable adversarial attacks on vision transformers. InProceedings of the AAAI conference on artificial intelligence, volume 36, pages 2668–2676, 2022

work page 2022
[59]

Enhancing the self-universality for transferable targeted attacks

Zhipeng Wei, Jingjing Chen, Zuxuan Wu, and Yu-Gang Jiang. Enhancing the self-universality for transferable targeted attacks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12281–12290, 2023

work page 2023
[60]

Learning transferable targeted universal adversarial perturbations by sequential meta-learning.Computers & Security, 137:103584, 2024

Juanjuan Weng, Zhiming Luo, Dazhen Lin, and Shaozi Li. Learning transferable targeted universal adversarial perturbations by sequential meta-learning.Computers & Security, 137:103584, 2024

work page 2024
[61]

Improving transferable targeted adversarial attacks with model self-enhancement

Han Wu, Guanyan Ou, Weibin Wu, and Zibin Zheng. Improving transferable targeted adversarial attacks with model self-enhancement. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24615–24624, 2024

work page 2024
[62]

Generating Adversarial Examples with Adversarial Networks

Chaowei Xiao, Bo Li, Jun-Yan Zhu, Warren He, Mingyan Liu, and Dawn Song. Generating adversarial examples with adversarial networks.arXiv preprint arXiv:1801.02610, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[63]

Improving transferability of adversarial examples with input diversity

Cihang Xie, Zhishuai Zhang, Yuyin Zhou, Song Bai, Jianyu Wang, Zhou Ren, and Alan L Yuille. Improving transferability of adversarial examples with input diversity. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2730–2739, 2019

work page 2019
[64]

Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation.arXiv preprint arXiv:2408.12528, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[65]

Safebench: A safety evaluation framework for multimodal large language models

Zonghao Ying, Aishan Liu, Siyuan Liang, Lei Huang, Jinyang Guo, Wenbo Zhou, Xianglong Liu, and Dacheng Tao. Safebench: A safety evaluation framework for multimodal large language models. International Journal of Computer Vision, 134(1):18, 2026

work page 2026
[66]

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions.Transactions of the association for computational linguistics, 2:67–78, 2014

work page 2014
[67]

Faster Segment Anything: Towards Lightweight SAM for Mobile Applications

Chaoning Zhang, Dongshen Han, Yu Qiao, Jung Uk Kim, Sung-Ho Bae, Seungkyu Lee, and Choong Seon Hong. Faster segment anything: Towards lightweight sam for mobile applications.arXiv preprint arXiv:2306.14289, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[68]

Attack-sam: Towards evaluating adversarial robustness of segment anything model,

Chenshuang Zhang, Chaoning Zhang, Taegoo Kang, Donghun Kim, Sung-Ho Bae, and In So Kweon. Attack-sam: Towards evaluating adversarial robustness of segment anything model.arXiv preprint arXiv:2305.00866, 1(3):5, 2023

work page arXiv 2023
[69]

Anyattack: Towards large-scale self-supervised adversarial attacks on vision-language models

Jiaming Zhang, Junhong Ye, Xingjun Ma, Yige Li, Yunfan Yang, Yunhao Chen, Jitao Sang, and Dit-Yan Yeung. Anyattack: Towards large-scale self-supervised adversarial attacks on vision-language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19900–19909, 2025

work page 2025
[70]

Towards adversarial attack on vision-language pre-training models

Jiaming Zhang, Qi Yi, and Jitao Sang. Towards adversarial attack on vision-language pre-training models. InProceedings of the 30th ACM International Conference on Multimedia, pages 5005–5013, 2022

work page 2022
[71]

Universal adversarial perturbations for vision-language pre-trained models

Peng-Fei Zhang, Zi Huang, and Guangdong Bai. Universal adversarial perturbations for vision-language pre-trained models. InProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 862–871, 2024

work page 2024
[72]

On evaluating adversarial robustness of large vision-language models.Advances in Neural Information Processing Systems, 36:54111–54138, 2023

Yunqing Zhao, Tianyu Pang, Chao Du, Xiao Yang, Chongxuan Li, Ngai-Man Man Cheung, and Min Lin. On evaluating adversarial robustness of large vision-language models.Advances in Neural Information Processing Systems, 36:54111–54138, 2023

work page 2023
[73]

Black-box targeted adversarial attack on segment anything (sam).IEEE Transactions on Multimedia, 27:1901–1913, 2024

Sheng Zheng, Chaoning Zhang, and Xinhong Hao. Black-box targeted adversarial attack on segment anything (sam).IEEE Transactions on Multimedia, 27:1901–1913, 2024. 13

work page 1901
[74]

Scene parsing through ade20k dataset

Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 633–641, 2017

work page 2017
[75]

Advclip: Downstream- agnostic adversarial examples in multimodal contrastive learning

Ziqi Zhou, Shengshan Hu, Minghui Li, Hangtao Zhang, Yechao Zhang, and Hai Jin. Advclip: Downstream- agnostic adversarial examples in multimodal contrastive learning. InProceedings of the 31st ACM International Conference on Multimedia, pages 6311–6320, 2023

work page 2023
[76]

Vanish into thin air: Cross-prompt universal adversarial attacks for sam2.arXiv preprint arXiv:2510.24195, 2025

Ziqi Zhou, Yifan Hu, Yufei Song, Zijing Li, Shengshan Hu, Leo Yu Zhang, Dezhong Yao, Long Zheng, and Hai Jin. Vanish into thin air: Cross-prompt universal adversarial attacks for sam2.arXiv preprint arXiv:2510.24195, 2025

work page arXiv 2025
[77]

Darksam: Fooling segment anything model to segment nothing.Advances in Neural Information Processing Systems, 37:49859–49880, 2024

Ziqi Zhou, Yufei Song, Minghui Li, Shengshan Hu, Xianlong Wang, Leo Yu Zhang, Dezhong Yao, and Hai Jin. Darksam: Fooling segment anything model to segment nothing.Advances in Neural Information Processing Systems, 37:49859–49880, 2024. 14 A Impact Statements Our DarkLLM imposes several positive broader impacts.1)DarkLLM presents an elegant and versatile f...

work page 2024
[78]

**Main Subject Consistency:** If both descriptions refer to the same key subject or object (e.g., a person, food, an event), they should receive a higher similarity score

work page
[79]

**Relevant Description**: If the descriptions are related to the same context or topic, they should also contribute to a higher similarity score

work page
[80]

Focus on **whether both descriptions fundamentally describe the same thing.**

**Ignore Fine-Grained Details:** Do not penalize differences in **phrasing, sentence structure, or minor variations in detail**. Focus on **whether both descriptions fundamentally describe the same thing.**

work page

Showing first 80 references.

[1] [1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

Square attack: a query-efficient black-box adversarial attack via random search

Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion, and Matthias Hein. Square attack: a query-efficient black-box adversarial attack via random search. InEuropean conference on computer vision, pages 484–501. Springer, 2020

work page 2020

[3] [3]

Food-101–mining discriminative components with random forests

Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101–mining discriminative components with random forests. InEuropean conference on computer vision, pages 446–461. Springer, 2014

work page 2014

[4] [4]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

work page 1901

[5] [5]

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[6] [6]

Microsoft COCO Captions: Data Collection and Evaluation Server

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server.arXiv preprint arXiv:1504.00325, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[7] [7]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling.arXiv preprint arXiv:2412.05271, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[8] [8]

An analysis of single-layer networks in unsupervised feature learning

Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. InProceedings of the fourteenth international conference on artificial intelligence and statistics, pages 215–223. JMLR Workshop and Conference Proceedings, 2011

work page 2011

[9] [9]

Xtuner: A toolkit for efficiently fine-tuning llm.https://github.com/InternLM/ xtuner, 2023

XTuner Contributors. Xtuner: A toolkit for efficiently fine-tuning llm.https://github.com/InternLM/ xtuner, 2023

work page 2023

[10] [10]

The cityscapes dataset for semantic urban scene understanding

Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016

work page 2016

[11] [11]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009

work page 2009

[12] [12]

How robust is Google’s Bard to adversarial image attacks? arXiv:2309.11751, 2023

Yinpeng Dong, Huanran Chen, Jiawei Chen, Zhengwei Fang, Xiao Yang, Yichi Zhang, Yu Tian, Hang Su, and Jun Zhu. How robust is google’s bard to adversarial image attacks?arXiv preprint arXiv:2309.11751, 2023

work page arXiv 2023

[13] [13]

Boosting adversarial attacks with momentum

Yinpeng Dong, Fangzhou Liao, Tianyu Pang, Hang Su, Jun Zhu, Xiaolin Hu, and Jianguo Li. Boosting adversarial attacks with momentum. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 9185–9193, 2018

work page 2018

[14] [14]

Clip-guided generative networks for transferable targeted adversarial attacks

Hao Fang, Jiawei Kong, Bin Chen, Tao Dai, Hao Wu, and Shu-Tao Xia. Clip-guided generative networks for transferable targeted adversarial attacks. InEuropean Conference on Computer Vision, pages 1–19. Springer, 2024

work page 2024

[15] [15]

One perturbation is enough: On generating universal adversarial perturbations against vision-language pre- training models

Hao Fang, Jiawei Kong, Wenbo Yu, Bin Chen, Jiawei Li, Hao Wu, Shu-Tao Xia, and Ke Xu. One perturbation is enough: On generating universal adversarial perturbations against vision-language pre- training models. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4090–4100, 2025

work page 2025

[16] [16]

Backdooragent: A unified framework for backdoor attacks on llm-based agents.arXiv preprint arXiv:2601.04566, 2026

Yunhao Feng, Yige Li, Yutao Wu, Yingshui Tan, Yanming Guo, Yifan Ding, Kun Zhai, Xingjun Ma, and Yu-Gang Jiang. Backdooragent: A unified framework for backdoor attacks on llm-based agents.arXiv preprint arXiv:2601.04566, 2026

work page arXiv 2026

[17] [17]

Generative adversarial nets.Advances in neural information processing systems, 27, 2014

Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets.Advances in neural information processing systems, 27, 2014. 10

work page 2014

[18] [18]

Explaining and Harnessing Adversarial Examples

Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[19] [19]

Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

work page 2022

[20] [20]

X-transfer attacks: Towards super transferable adversarial attacks on clip.arXiv preprint arXiv:2505.05528, 2025

Hanxun Huang, Sarah Erfani, Yige Li, Xingjun Ma, and James Bailey. X-transfer attacks: Towards super transferable adversarial attacks on clip.arXiv preprint arXiv:2505.05528, 2025

work page arXiv 2025

[21] [21]

Black-box adversarial attacks with limited queries and information

Andrew Ilyas, Logan Engstrom, Anish Athalye, and Jessy Lin. Black-box adversarial attacks with limited queries and information. InInternational conference on machine learning, pages 2137–2146. PMLR, 2018

work page 2018

[22] [22]

Adversarial attacks against closed-source mllms via feature optimal alignment.arXiv preprint arXiv:2505.21494, 2025

Xiaojun Jia, Sensen Gao, Simeng Qin, Tianyu Pang, Chao Du, Yihao Huang, Xinfeng Li, Yiming Li, Bo Li, and Yang Liu. Adversarial attacks against closed-source mllms via feature optimal alignment.arXiv preprint arXiv:2505.21494, 2025

work page arXiv 2025

[23] [23]

Nips 2017: Defense against adversarial attack

Alex K, Ben Hamner, and Ian Goodfellow. Nips 2017: Defense against adversarial attack. https:// kaggle.com/competitions/nips-2017-defense-against-adversarial-attack , 2017. Kaggle

work page 2017

[24] [24]

Deep visual-semantic alignments for generating image descriptions

Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3128–3137, 2015

work page 2015

[25] [25]

Segment anything in high quality.Advances in Neural Information Processing Systems, 36:29914–29934, 2023

Lei Ke, Mingqiao Ye, Martin Danelljan, Yu-Wing Tai, Chi-Keung Tang, Fisher Yu, et al. Segment anything in high quality.Advances in Neural Information Processing Systems, 36:29914–29934, 2023

work page 2023

[26] [26]

Segment anything

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. InProceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023

work page 2023

[27] [27]

Collecting a large-scale dataset of fine-grained cars

Jonathan Krause, Jia Deng, Michael Stark, and Li Fei-Fei. Collecting a large-scale dataset of fine-grained cars. 2013

work page 2013

[28] [28]

Learning multiple layers of features from tiny images

Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009

work page 2009

[29] [29]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational conference on machine learning, pages 19730–19742. PMLR, 2023

work page 2023

[30] [30]

Autobackdoor: Automating backdoor attacks via llm agents.arXiv preprint arXiv:2511.16709, 2025

Yige Li, Zhe Li, Wei Zhao, Nay Myat Min, Hanxun Huang, Xingjun Ma, and Jun Sun. Autobackdoor: Automating backdoor attacks via llm agents.arXiv preprint arXiv:2511.16709, 2025

work page arXiv 2025

[31] [31]

A frustratingly simple yet highly effective attack baseline: Over 90% success rate against the strong black-box models of gpt-4.5/4o/o1.arXiv preprint arXiv:2503.10635, 2025

Zhaoyi Li, Xiaohan Zhao, Dong-Dong Wu, Jiacheng Cui, and Zhiqiang Shen. A frustratingly simple yet highly effective attack baseline: Over 90% success rate against the strong black-box models of gpt-4.5/4o/o1.arXiv preprint arXiv:2503.10635, 2025

work page arXiv 2025

[32] [32]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. InEuropean conference on computer vision, pages 740–755. Springer, 2014

work page 2014

[33] [33]

Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

work page 2023

[34] [34]

arXiv preprint arXiv:2410.05295 (2024)

Xiaogeng Liu, Peiran Li, Edward Suh, Yevgeniy V orobeychik, Zhuoqing Mao, Somesh Jha, Patrick McDaniel, Huan Sun, Bo Li, and Chaowei Xiao. Autodan-turbo: A lifelong agent for strategy self- exploration to jailbreak llms.arXiv preprint arXiv:2410.05295, 2024

work page arXiv 2024

[35] [35]

AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. Autodan: Generating stealthy jailbreak prompts on aligned large language models.arXiv preprint arXiv:2310.04451, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[36] [36]

Trm-uap: Enhancing the transferability of data-free universal adversarial perturbation via truncated ratio maximization

Yiran Liu, Xin Feng, Yunlong Wang, Wu Yang, and Di Ming. Trm-uap: Enhancing the transferability of data-free universal adversarial perturbation via truncated ratio maximization. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4762–4771, 2023

work page 2023

[37] [37]

Unsegment anything by simulating deformation

Jiahao Lu, Xingyi Yang, and Xinchao Wang. Unsegment anything by simulating deformation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24294– 24304, 2024. 11

work page 2024

[38] [38]

Jailbreakv-28k: A benchmark for assessing the robustness of multimodal large language models against jailbreak attacks,

Weidi Luo, Siyuan Ma, Xiaogeng Liu, Xiaoyu Guo, and Chaowei Xiao. Jailbreakv: A benchmark for assessing the robustness of multimodal large language models against jailbreak attacks.arXiv preprint arXiv:2404.03027, 2024

work page arXiv 2024

[39] [39]

Safety at scale: A comprehensive survey of large model and agent safety.Foundations and Trends in Privacy and Security, 8(3-4):1–240, 2026

Xingjun Ma, Yifeng Gao, Yixu Wang, Ruofan Wang, Xin Wang, Ye Sun, Yifan Ding, Hengyuan Xu, Yunhao Chen, Yunhan Zhao, et al. Safety at scale: A comprehensive survey of large model and agent safety.Foundations and Trends in Privacy and Security, 8(3-4):1–240, 2026

work page 2026

[40] [40]

Towards Deep Learning Models Resistant to Adversarial Attacks

Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks.arXiv preprint arXiv:1706.06083, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[41] [41]

On the robustness of vision transformers to adversarial examples

Kaleel Mahmood, Rigel Mahmood, and Marten Van Dijk. On the robustness of vision transformers to adversarial examples. InProceedings of the IEEE/CVF international conference on computer vision, pages 7838–7847, 2021

work page 2021

[42] [42]

Universal adver- sarial perturbations

Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. Universal adver- sarial perturbations. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 1765–1773, 2017

work page 2017

[43] [43]

Generalizable data-free objective for crafting universal adversarial perturbations.IEEE transactions on pattern analysis and machine intelligence, 41(10):2452–2465, 2018

Konda Reddy Mopuri, Aditya Ganeshan, and R Venkatesh Babu. Generalizable data-free objective for crafting universal adversarial perturbations.IEEE transactions on pattern analysis and machine intelligence, 41(10):2452–2465, 2018

work page 2018

[44] [44]

On generating transferable targeted perturbations

Muzammal Naseer, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, and Fatih Porikli. On generating transferable targeted perturbations. InProceedings of the IEEE/CVF international conference on computer vision, pages 7708–7717, 2021

work page 2021

[45] [45]

A study of generative large language model for medical research and healthcare.NPJ digital medicine, 6(1):210, 2023

Cheng Peng, Xi Yang, Aokun Chen, Kaleb E Smith, Nima PourNejatian, Anthony B Costa, Cheryl Martin, Mona G Flores, Ying Zhang, Tanja Magoc, et al. A study of generative large language model for medical research and healthcare.NPJ digital medicine, 6(1):210, 2023

work page 2023

[46] [46]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

work page 2021

[47] [47]

SAM 2: Segment Anything in Images and Videos

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos.arXiv preprint arXiv:2408.00714, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[48] [48]

On the adversarial robustness of vision transformers.arXiv preprint arXiv:2103.15670, 2021

Rulin Shao, Zhouxing Shi, Jinfeng Yi, Pin-Yu Chen, and Cho-Jui Hsieh. On the adversarial robustness of vision transformers.arXiv preprint arXiv:2103.15670, 2021

work page arXiv 2021

[49] [49]

Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning

Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. InProceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018

work page 2018

[50] [50]

Toward expert-level medical question answering with large language models.Nature Medicine, 31(3):943–950, 2025

Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Mohamed Amin, Le Hou, Kevin Clark, Stephen R Pfohl, Heather Cole-Lewis, et al. Toward expert-level medical question answering with large language models.Nature Medicine, 31(3):943–950, 2025

work page 2025

[51] [51]

Johannes Stallkamp, Marc Schlipsing, Jan Salmen, and Christian Igel. Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition.Neural networks, 32:323–332, 2012

work page 2012

[52] [52]

Sama: Towards multi-turn referential grounded video chat with large language models.Advances in Neural Information Processing Systems, 38:47065–47091, 2026

Ye Sun, Hao Zhang, Henghui Ding, Tiehua Zhang, Xingjun Ma, and Yu-Gang Jiang. Sama: Towards multi-turn referential grounded video chat with large language models.Advances in Neural Information Processing Systems, 38:47065–47091, 2026

work page 2026

[53] [53]

Intriguing properties of neural networks

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks.arXiv preprint arXiv:1312.6199, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013

[54] [54]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[55] [55]

Visual autoregressive modeling: Scalable image generation via next-scale prediction.Advances in neural information processing systems, 37:84839–84865, 2024

Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction.Advances in neural information processing systems, 37:84839–84865, 2024. 12

work page 2024

[56] [56]

Cider: Consensus-based image description evaluation

Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575, 2015

work page 2015

[57] [57]

Towards transferable targeted adversarial examples

Zhibo Wang, Hongshan Yang, Yunhe Feng, Peng Sun, Hengchang Guo, Zhifei Zhang, and Kui Ren. Towards transferable targeted adversarial examples. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20534–20543, 2023

work page 2023

[58] [58]

Towards transferable adversarial attacks on vision transformers

Zhipeng Wei, Jingjing Chen, Micah Goldblum, Zuxuan Wu, Tom Goldstein, and Yu-Gang Jiang. Towards transferable adversarial attacks on vision transformers. InProceedings of the AAAI conference on artificial intelligence, volume 36, pages 2668–2676, 2022

work page 2022

[59] [59]

Enhancing the self-universality for transferable targeted attacks

Zhipeng Wei, Jingjing Chen, Zuxuan Wu, and Yu-Gang Jiang. Enhancing the self-universality for transferable targeted attacks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12281–12290, 2023

work page 2023

[60] [60]

Learning transferable targeted universal adversarial perturbations by sequential meta-learning.Computers & Security, 137:103584, 2024

Juanjuan Weng, Zhiming Luo, Dazhen Lin, and Shaozi Li. Learning transferable targeted universal adversarial perturbations by sequential meta-learning.Computers & Security, 137:103584, 2024

work page 2024

[61] [61]

Improving transferable targeted adversarial attacks with model self-enhancement

Han Wu, Guanyan Ou, Weibin Wu, and Zibin Zheng. Improving transferable targeted adversarial attacks with model self-enhancement. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24615–24624, 2024

work page 2024

[62] [62]

Generating Adversarial Examples with Adversarial Networks

Chaowei Xiao, Bo Li, Jun-Yan Zhu, Warren He, Mingyan Liu, and Dawn Song. Generating adversarial examples with adversarial networks.arXiv preprint arXiv:1801.02610, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[63] [63]

Improving transferability of adversarial examples with input diversity

Cihang Xie, Zhishuai Zhang, Yuyin Zhou, Song Bai, Jianyu Wang, Zhou Ren, and Alan L Yuille. Improving transferability of adversarial examples with input diversity. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2730–2739, 2019

work page 2019

[64] [64]

Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation.arXiv preprint arXiv:2408.12528, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[65] [65]

Safebench: A safety evaluation framework for multimodal large language models

Zonghao Ying, Aishan Liu, Siyuan Liang, Lei Huang, Jinyang Guo, Wenbo Zhou, Xianglong Liu, and Dacheng Tao. Safebench: A safety evaluation framework for multimodal large language models. International Journal of Computer Vision, 134(1):18, 2026

work page 2026

[66] [66]

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions.Transactions of the association for computational linguistics, 2:67–78, 2014

work page 2014

[67] [67]

Faster Segment Anything: Towards Lightweight SAM for Mobile Applications

Chaoning Zhang, Dongshen Han, Yu Qiao, Jung Uk Kim, Sung-Ho Bae, Seungkyu Lee, and Choong Seon Hong. Faster segment anything: Towards lightweight sam for mobile applications.arXiv preprint arXiv:2306.14289, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[68] [68]

Attack-sam: Towards evaluating adversarial robustness of segment anything model,

Chenshuang Zhang, Chaoning Zhang, Taegoo Kang, Donghun Kim, Sung-Ho Bae, and In So Kweon. Attack-sam: Towards evaluating adversarial robustness of segment anything model.arXiv preprint arXiv:2305.00866, 1(3):5, 2023

work page arXiv 2023

[69] [69]

Anyattack: Towards large-scale self-supervised adversarial attacks on vision-language models

Jiaming Zhang, Junhong Ye, Xingjun Ma, Yige Li, Yunfan Yang, Yunhao Chen, Jitao Sang, and Dit-Yan Yeung. Anyattack: Towards large-scale self-supervised adversarial attacks on vision-language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19900–19909, 2025

work page 2025

[70] [70]

Towards adversarial attack on vision-language pre-training models

Jiaming Zhang, Qi Yi, and Jitao Sang. Towards adversarial attack on vision-language pre-training models. InProceedings of the 30th ACM International Conference on Multimedia, pages 5005–5013, 2022

work page 2022

[71] [71]

Universal adversarial perturbations for vision-language pre-trained models

Peng-Fei Zhang, Zi Huang, and Guangdong Bai. Universal adversarial perturbations for vision-language pre-trained models. InProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 862–871, 2024

work page 2024

[72] [72]

On evaluating adversarial robustness of large vision-language models.Advances in Neural Information Processing Systems, 36:54111–54138, 2023

Yunqing Zhao, Tianyu Pang, Chao Du, Xiao Yang, Chongxuan Li, Ngai-Man Man Cheung, and Min Lin. On evaluating adversarial robustness of large vision-language models.Advances in Neural Information Processing Systems, 36:54111–54138, 2023

work page 2023

[73] [73]

Black-box targeted adversarial attack on segment anything (sam).IEEE Transactions on Multimedia, 27:1901–1913, 2024

Sheng Zheng, Chaoning Zhang, and Xinhong Hao. Black-box targeted adversarial attack on segment anything (sam).IEEE Transactions on Multimedia, 27:1901–1913, 2024. 13

work page 1901

[74] [74]

Scene parsing through ade20k dataset

Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 633–641, 2017

work page 2017

[75] [75]

Advclip: Downstream- agnostic adversarial examples in multimodal contrastive learning

Ziqi Zhou, Shengshan Hu, Minghui Li, Hangtao Zhang, Yechao Zhang, and Hai Jin. Advclip: Downstream- agnostic adversarial examples in multimodal contrastive learning. InProceedings of the 31st ACM International Conference on Multimedia, pages 6311–6320, 2023

work page 2023

[76] [76]

Vanish into thin air: Cross-prompt universal adversarial attacks for sam2.arXiv preprint arXiv:2510.24195, 2025

Ziqi Zhou, Yifan Hu, Yufei Song, Zijing Li, Shengshan Hu, Leo Yu Zhang, Dezhong Yao, Long Zheng, and Hai Jin. Vanish into thin air: Cross-prompt universal adversarial attacks for sam2.arXiv preprint arXiv:2510.24195, 2025

work page arXiv 2025

[77] [77]

Darksam: Fooling segment anything model to segment nothing.Advances in Neural Information Processing Systems, 37:49859–49880, 2024

Ziqi Zhou, Yufei Song, Minghui Li, Shengshan Hu, Xianlong Wang, Leo Yu Zhang, Dezhong Yao, and Hai Jin. Darksam: Fooling segment anything model to segment nothing.Advances in Neural Information Processing Systems, 37:49859–49880, 2024. 14 A Impact Statements Our DarkLLM imposes several positive broader impacts.1)DarkLLM presents an elegant and versatile f...

work page 2024

[78] [78]

**Main Subject Consistency:** If both descriptions refer to the same key subject or object (e.g., a person, food, an event), they should receive a higher similarity score

work page

[79] [79]

**Relevant Description**: If the descriptions are related to the same context or topic, they should also contribute to a higher similarity score

work page

[80] [80]

Focus on **whether both descriptions fundamentally describe the same thing.**

**Ignore Fine-Grained Details:** Do not penalize differences in **phrasing, sentence structure, or minor variations in detail**. Focus on **whether both descriptions fundamentally describe the same thing.**

work page