Investigating Adversarial Robustness of Multi-modal Large Language Models

Hashmat Shadab Malik; Muzammal Naseer; Salman Khan

arxiv: 2606.03713 · v1 · pith:UVML22ZSnew · submitted 2026-06-02 · 💻 cs.CV

Pith reviewed 2026-06-28 11:09 UTC · model grok-4.3

classification 💻 cs.CV

keywords adversarial robustness · multi-modal large language models · vision encoders · CLIP alignment · adversarial pretraining · end-to-end training · visual adversarial attacks · jailbreak attacks

The pith

Large-scale multimodal adversarial pretraining of vision encoders is required for strong robustness transfer into MLLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to determine why some vision encoders confer adversarial robustness to multi-modal large language models while others do not. It shows that scale alone in unimodal adversarial training is insufficient; only encoders that receive large-scale multimodal adversarial pretraining transfer effectively. A diagnostic protocol based on alignment to CLIP space predicts this transfer before any full MLLM training occurs. When such encoders are inserted and the entire model is trained end-to-end on multimodal data, captioning and VQA performance under strong visual attacks rises sharply relative to plug-and-play baselines that keep the original CLIP embedding constraint. The work further establishes that robust visual representations must already exist before adversarial training is applied to the language model itself.

Core claim

Large-scale multimodal adversarial pretraining, rather than unimodal scale alone, is the critical factor for strong robustness transfer. A diagnostic CLIP-alignment protocol predicts which robust vision encoders will transfer effectively. Integrating those encoders into MLLMs via end-to-end multimodal training produces average gains of 28 CIDEr points on captioning and 11.7 percent VQA accuracy under strong adversarial attacks compared with constrained plug-and-play baselines. Adversarial training applied directly to a standard non-robust MLLM degrades both clean and adversarial performance, confirming that robust visual representations are a strict prerequisite. End-to-end adversarial training from a robust backbone delivers additional gains of 1.9 CIDEr points and 4.3 percent VQA accuracy.

What carries the argument

The diagnostic CLIP-alignment protocol that scores how well a robust vision encoder preserves alignment with CLIP's original embedding space and thereby predicts transfer success to MLLMs.

If this is right

Integrating encoders pretrained with large-scale multimodal adversarial data via end-to-end training yields average gains of 28 CIDEr points on captioning and 11.7 percent VQA accuracy under strong attacks relative to plug-and-play baselines.
Adversarial training applied to a non-robust MLLM degrades both clean and adversarial performance, showing robust visual representations are required first.
End-to-end adversarial training starting from a robust backbone supplies an extra 1.9 CIDEr points and 4.3 percent VQA accuracy.
Lightweight test-time visual stochastic transformations raise adversarial performance of non-robust MLLMs from near zero to levels comparable with robust models.
Robust MLLMs substantially reduce toxic generation under white-box visual jailbreak attacks.
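
The test-time defense mentioned in the list above can be pictured with a minimal sketch. It assumes images are float arrays scaled to [0, 1]; the function names and the default noise magnitude of 8/255 are illustrative assumptions, not settings taken from the paper.

```python
import numpy as np


def add_uniform_noise(image, magnitude=8 / 255, rng=None):
    """Return a noisy copy of `image` (floats in [0, 1]).

    Hypothetical helper: noise is drawn uniformly from
    [-magnitude, magnitude] and the result is clipped back to [0, 1].
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.uniform(-magnitude, magnitude, size=image.shape)
    return np.clip(image + noise, 0.0, 1.0)


def add_gaussian_noise(image, std=8 / 255, rng=None):
    """Return a copy of `image` with zero-mean Gaussian noise, clipped to [0, 1].

    Hypothetical helper: `std` is an assumed value chosen for illustration.
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(0.0, std, size=image.shape)
    return np.clip(image + noise, 0.0, 1.0)


# Usage: perturb a (possibly adversarial) image just before it reaches
# the vision encoder. The random image here is a stand-in for real input.
demo_rng = np.random.default_rng(0)
demo_image = demo_rng.random((3, 32, 32))
demo_defended = add_uniform_noise(demo_image, rng=demo_rng)
print(demo_defended.shape)
```

Because fresh noise is sampled on every call, an attacker who optimized a perturbation against the noise-free pipeline no longer sees the exact input the model receives. How much this helps in practice depends on the attack and the noise level.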

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers of future MLLMs may need to invest in large multimodal adversarial datasets rather than relying solely on scaling unimodal robust encoders.
The same pretraining principle could be tested on other input modalities such as audio to check whether multimodal adversarial pretraining remains the decisive factor.
Combining the test-time stochastic transformations with the robust backbone training may produce additive defense without extra training cost.

Load-bearing premise

The diagnostic CLIP-alignment protocol accurately predicts, prior to full MLLM training, which robust vision encoders will transfer effectively to the multimodal setting.

What would settle it

A controlled experiment in which an encoder that scores high on the CLIP-alignment protocol is integrated end-to-end yet produces no robustness gain, or an encoder that scores low yet succeeds after the same training.

Figures

Figures reproduced from arXiv: 2606.03713 by Hashmat Shadab Malik, Muzammal Naseer, Salman Khan.

**Figure 1.** Robust performance of our approach on multimodal tasks under stronger adversarial attacks at perturbation budget ε = 8/255: The original CLIP exhibits severe vulnerability, while robust CLIP baselines such as Sim-CLIP4 and FARE4 achieve only limited robustness. Ours (Align) integrates a large-scale adversarially pretrained vision encoder into the MLLM via standard end-to-end training, and Ours (Align+AT) …

**Figure 3.** Effect of AT on MLLM robustness. Average performance on image captioning (left) and visual question answering (right) under clean inputs, APGD attacks, and stronger APGD-Ensemble attack at ε = 8/255. Ours (Align) integrates a large-scale adversarially pretrained vision encoder into the MLLM via standard MLLM training, and Ours (Align+AT) applies adversarial instruction tuning during end-to-end MLLM train…

**Figure 4.** Stage-wise Analysis of AT in MLLMs. We analyze the effect of applying adversarial training at different stages of the standard two-stage LLaVA pipeline (Stage 2 vs. Stage 1+2), and whether the vision encoder is jointly fine-tuned during Stage 2 adversarial instruction tuning (Vision), where the projection and language model are updated by default.

**Figure 5.** Robustness comparison with AdPO and different attack types. Left two panels: comparison against AdPO [31] under APGD-Ensemble at ε ∈ {4/255, 8/255} on COCO and VQAv2 tasks. Right panel: average robustness (clean and adversarial) of Ours (Align+AT) across different attack families (APGD [11], C&W [7], sC&W [65]) at ε = 8/255.

**Figure 6.** Effect of PGD steps during adversarial MLLM training (left), evaluated under APGD-Ensemble at ε = 8/255. Generalization to OpenFlamingo-7B [2] (right): average of clean and adversarial performance per task under APGD-Ensemble at ε = 8/255.

**Figure 7.** Targeted attack success rate and image captioning performance under APGD

**Figure 8.** Effect of visual test-time transformations on non-robust MLLMs. Applying stochastic noise (uniform or Gaussian) to the image at inference time substantially improves adversarial performance for MLLMs with a non-robust CLIP vision encoder [46].

**Figure 10.** Test-time defense analysis. Average cosine similarity between visual features (after MLP projection) of clean images and adversarial images with uniform noise injection (left). Effect of textual prompt modifications on captioning robustness under APGD-Ensemble at ε = 8/255, averaged across captioning tasks (right).

**Figure 11.** Zero-shot adversarial robustness after CLIP alignment, evaluated across diverse

**Figure 12.** Targeted attack success rate and image captioning performance under white-box APGD attacks at ε = 4/255 and ε = 8/255. Standard CLIP-based LLaVA is fully compromised under targeted attacks, while robust CLIP baselines such as FARE fail at higher perturbation budgets. In contrast, models built on AdvXLCLIP-L effectively suppress targeted attacks while maintaining strong captioning performance. …

**Figure 13.** Hallucination evaluation using POPE (F1-Score)

**Figure 14.** Hallucination evaluation using POPE (Accuracy)

**Figure 15.** Clean and Robust Accuracy of Models on Ensemble-based transfer attack.

**Figure 16.** Visual test-time transformations improve robustness significantly for non-robust vision backbones. Injecting stochastic noise at inference time effectively suppresses adversarial representations in MLLMs built on standard CLIP encoders, leading to large adversarial performance gains with minimal impact on clean accuracy. Robust backbones (FARE4, AdvXLCLIP-L) show smaller improvements due to inherently st…

**Figure 17.** Effect of prompt formatting on image captioning under APGD-Ensemble attack

**Figure 18.** Illustration of untargeted ℓ∞-attacks with APGD-Ensemble at ε = 8/255

**Figure 19.** Illustration of untargeted ℓ∞-attacks with APGD-Ensemble at ε = 8/255. Panels show the original and adversarial images with captions generated by CLIP, FARE4, Ours (Align), Ours (Stage 2 AT), and Ours (Stage 1+2 AT).

**Figure 20.** Illustration of untargeted ℓ∞-attacks with APGD-Ensemble at ε = 8/255

**Figure 21.** Illustration of untargeted ℓ∞-attacks with APGD-Ensemble at ε = 8/255. Panels show the original and adversarial images with captions generated by CLIP, FARE4, Ours (Align), Ours (Stage 2 AT), and Ours (Stage 1+2 AT).

**Figure 22.** Illustration of untargeted ℓ∞-attacks with APGD-Ensemble at ε = 8/255

**Figure 23.** Illustration of untargeted ℓ∞-attacks with ε = 4/255 on LLaVA when using different robust vision encoders on the image captioning task

**Figure 24.** Illustration of untargeted ℓ∞-attacks with ε = 4/255 on LLaVA when using different robust vision encoders on the image captioning task

**Figure 25.** Illustration of untargeted ℓ∞-attacks with ε = 8/255 on LLaVA when using different robust vision encoders on the image captioning task

**Figure 26.** Illustration of targeted ℓ∞-attacks with ε = 8/255 on LLaVA when using different robust vision encoders

**Figure 27.** Illustration of untargeted ℓ∞-attacks with ε = 8/255 on LLaVA when using different robust vision encoders on the visual question answering task

**Figure 28.** Illustration of untargeted ℓ∞-attacks with ε = 8/255 on LLaVA when using different robust vision encoders on the visual question answering task

**Figure 29.** Performance on Common Corruptions. Illustration of behavior on applying several common corruptions, when using different robust vision encoders in LLaVA

**Figure 30.** Performance on Common Corruptions. Illustration of behavior on applying several common corruptions, when using different robust vision encoders in LLaVA

**Figure 31.** Performance Under Increasing Corruption Severity. Visualization of how different robust vision encoders in LLaVA respond to corruption applied at varying severity levels

**Figure 32.** Performance Under Increasing Corruption Severity. Visualization of how different robust vision encoders in LLaVA respond to corruption applied at varying severity levels

**Figure 33.** Performance Under Increasing Corruption Severity. Visualization of how different robust vision encoders in LLaVA respond to corruption applied at varying severity levels

Original abstract

Multi-modal Large Language Models (MLLMs) achieve strong performance on vision-language tasks, but incorporating visual inputs through a vision encoder (e.g., CLIP) substantially expands the attack surface, making these models vulnerable to visual adversarial perturbations. Prior defenses typically preserve compatibility with pretrained MLLMs by enforcing strict alignment to CLIP's original embedding space during adversarial fine-tuning; while practical, this constraint fundamentally limits achievable robustness. We present a systematic investigation of adversarial robustness in MLLMs. We first introduce a diagnostic CLIP-alignment protocol that predicts, prior to full MLLM training, which robust vision encoders will transfer effectively to the multimodal setting, revealing that large-scale multimodal adversarial pretraining, rather than unimodal scale alone, is the critical factor for strong robustness transfer. Integrating such encoders into MLLMs via end-to-end multimodal training yields average gains of 28 CIDEr points on captioning and 11.7% VQA accuracy under strong adversarial attacks compared to constrained plug-and-play baselines. We further show that adversarial training applied directly to a standard non-robust MLLM degrades both clean and adversarial performance, establishing robust visual representations as a strict prerequisite, while end-to-end adversarial training from a robust backbone delivers additional gains of 1.9 CIDEr points and 4.3% VQA accuracy. Beyond training-time defenses, lightweight test-time visual stochastic transformations serve as an effective black-box defense for non-robust MLLMs, elevating adversarial performance from near-zero to levels comparable with robust models. Finally, we show that our robust models substantially reduce toxic generation under white-box visual jailbreak attacks. Code and pretrained weights will be released publicly.
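
To make the threat model concrete, the following is a minimal sketch of an ℓ∞-bounded projected gradient attack at budget ε = 8/255. It uses plain PGD on a toy logistic model rather than the APGD attacks the paper evaluates, and the model, step size, and helper names are assumptions chosen only for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def bce_loss(x, y, w, b):
    """Binary cross-entropy of a toy logistic model on input x with label y."""
    p = np.clip(sigmoid(np.dot(w, x) + b), 1e-12, 1.0 - 1e-12)
    return float(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))


def loss_grad_wrt_input(x, y, w, b):
    """Gradient of the cross-entropy loss with respect to the input x."""
    p = sigmoid(np.dot(w, x) + b)
    return (p - y) * w


def pgd_linf(x, y, w, b, eps=8 / 255, step=2 / 255, n_steps=10):
    """Plain PGD under an l-infinity budget (hypothetical helper).

    Each iteration takes a signed-gradient ascent step on the loss, then
    projects back into the eps-ball around x and into the valid pixel
    range [0, 1].
    """
    x_adv = x.copy()
    for _ in range(n_steps):
        grad = loss_grad_wrt_input(x_adv, y, w, b)
        x_adv = x_adv + step * np.sign(grad)
        x_adv = np.clip(x_adv, x - eps, x + eps)  # stay within the budget
        x_adv = np.clip(x_adv, 0.0, 1.0)          # stay a valid image
    return x_adv


# Usage on random stand-in data (not real images or a real encoder).
demo_rng = np.random.default_rng(0)
demo_x = demo_rng.random(16)
demo_w = demo_rng.normal(size=16)
demo_adv = pgd_linf(demo_x, 1.0, demo_w, 0.0)
print(bce_loss(demo_x, 1.0, demo_w, 0.0), bce_loss(demo_adv, 1.0, demo_w, 0.0))
```

The two clipping steps are what make the perturbation "bounded": no pixel moves by more than ε, and the result remains a valid image. Attacks on an MLLM follow the same pattern but backpropagate through the vision encoder and language model instead of a linear map.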

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's new diagnostic protocol and end-to-end training results point to multimodal adversarial pretraining as the main driver of robustness transfer in MLLMs, with solid reported gains, though the protocol's claimed predictive power before full training is not strongly evidenced.

The core claim is that large-scale multimodal adversarial pretraining on vision encoders, rather than unimodal scale, is what actually transfers robustness to MLLMs, and their diagnostic CLIP-alignment protocol is meant to flag good encoders ahead of time. They report average gains of 28 CIDEr on captioning and 11.7% on VQA under attack when using these encoders with end-to-end training, plus smaller extra gains from adversarial fine-tuning on top. They also show that adversarial training on a standard MLLM hurts clean performance, that test-time stochastic transformations help non-robust models, and that the robust versions cut toxic outputs on jailbreaks.

What the work does well is lay out a clear comparison between constrained plug-and-play baselines and full multimodal training, and it covers both training-time and test-time defenses in one place. The numbers are large enough to be worth checking, and releasing code and weights is the right move.

The soft spot is exactly the one in the stress-test note. The abstract says the protocol predicts transfer prior to full MLLM training and reveals multimodal pretraining as the decisive factor, but there is no mention of prospective ranking metrics, correlation with downstream gains independent of the final runs, or controls for encoder scale. If those scores were only computed after seeing the results, the distinction between pretraining type and other factors is harder to isolate. Without the full methods, attack details, or ablation tables it is also difficult to judge whether the gains hold under different splits or stronger attacks.

This is useful for anyone working on MLLM safety or vision-language robustness. It has enough concrete comparisons and practical angles to deserve a serious referee, even with the protocol validation needing tightening.

Referee Report

2 major / 2 minor

Summary. The paper claims that a diagnostic CLIP-alignment protocol, applied prior to full MLLM training, can predict which robust vision encoders transfer effectively, establishing large-scale multimodal adversarial pretraining (rather than unimodal scale) as the decisive factor for robustness. End-to-end integration of such encoders yields average gains of 28 CIDEr on captioning and 11.7% VQA accuracy under strong adversarial attacks versus constrained plug-and-play baselines; adversarial training on non-robust MLLMs degrades both clean and adversarial performance, while end-to-end training from a robust backbone adds further gains; lightweight test-time stochastic transformations provide an effective black-box defense; and the resulting robust models reduce toxic generation under white-box visual jailbreaks.

Significance. If the protocol is shown to be prospectively validated and the reported gains are robust to experimental details, the work would provide a concrete method for selecting vision encoders and a clear separation between multimodal pretraining and scale effects, with direct implications for building more secure MLLMs. The public release of code and weights would further strengthen reproducibility.

major comments (2)

[Abstract] The central claim that the diagnostic CLIP-alignment protocol 'predicts, prior to full MLLM training, which robust vision encoders will transfer effectively' and thereby isolates multimodal adversarial pretraining as the critical factor is not supported by any reported pre-training prediction metrics, correlation coefficients, or prospective ranking results. Without these, the distinction between multimodal pretraining and scale cannot be isolated from post-hoc selection or retrospective interpretation.
[Abstract] The reported average gains of 28 CIDEr and 11.7% VQA accuracy are presented without reference to attack specifications, dataset splits, number of runs, or statistical significance tests, making it impossible to assess whether the gains are robust or sensitive to post-hoc choices.

minor comments (2)

The manuscript should include an explicit section or table showing the protocol scores versus downstream adversarial CIDEr/VQA gains for each encoder, with controls for encoder scale.
Clarify whether the CLIP-alignment protocol was applied before any MLLM training runs or only after observing the final results.
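
One way the requested score-versus-gain comparison could be summarized is a rank correlation. The sketch below computes Spearman's coefficient in plain Python; the encoder names and all numbers are made-up placeholders, not results from the paper.

```python
def average_ranks(values):
    """Return 1-based ranks of `values`, giving tied entries their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(a), average_ranks(b))


# Placeholder data only: protocol scores computed before MLLM training and
# adversarial CIDEr measured afterwards, for hypothetical encoders A to D.
protocol_scores = {"encoder_A": 0.12, "encoder_B": 0.35, "encoder_C": 0.41, "encoder_D": 0.58}
downstream_cider = {"encoder_A": 9.0, "encoder_B": 31.5, "encoder_C": 28.0, "encoder_D": 52.3}

names = sorted(protocol_scores)
rho = spearman([protocol_scores[n] for n in names],
               [downstream_cider[n] for n in names])
print(f"Spearman rho over {len(names)} encoders: {rho:.2f}")
```

A rank statistic is used because the question is whether the protocol orders encoders correctly, not whether scores and gains are linearly related. With only a handful of encoders the coefficient is noisy, so it would complement rather than replace the per-encoder table.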

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the abstract accordingly to improve clarity and support for the claims.

Referee: [Abstract] The central claim that the diagnostic CLIP-alignment protocol 'predicts, prior to full MLLM training, which robust vision encoders will transfer effectively' and thereby isolates multimodal adversarial pretraining as the critical factor is not supported by any reported pre-training prediction metrics, correlation coefficients, or prospective ranking results. Without these, the distinction between multimodal pretraining and scale cannot be isolated from post-hoc selection or retrospective interpretation.

Authors: The full manuscript validates the protocol through controlled experiments across multiple vision encoders, showing that alignment scores computed prior to MLLM training correlate strongly with downstream adversarial robustness transfer, thereby isolating the effect of multimodal adversarial pretraining from scale. We agree the abstract would be strengthened by explicit reference to these metrics. We will revise the abstract to briefly note the key correlation results and the prospective validation design. revision: yes

Referee: [Abstract] The reported average gains of 28 CIDEr and 11.7% VQA accuracy are presented without reference to attack specifications, dataset splits, number of runs, or statistical significance tests, making it impossible to assess whether the gains are robust or sensitive to post-hoc choices.

Authors: The full paper details the attack configurations (strong PGD-based visual attacks), evaluation datasets and splits, averaging over multiple random seeds, and reports statistical significance. We agree these details should be referenced in the abstract for completeness. We will revise the abstract to include concise qualifiers on attack settings, evaluation protocol, and significance of the reported gains. revision: yes

Circularity Check

0 steps flagged

No circularity; claims rest on independent experimental comparisons

The paper presents no equations, fitted parameters, or derivations. Its core claims rely on empirical comparisons between training regimes (multimodal adversarial pretraining vs. unimodal scale, end-to-end vs. plug-and-play) and a diagnostic protocol evaluated through downstream performance metrics. No self-citation load-bearing steps, self-definitional reductions, or fitted inputs renamed as predictions appear in the provided text. The protocol is described as predictive prior to full training and is validated via the reported gains, keeping the argument self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, mathematical axioms, or invented entities; the work is an empirical investigation of transfer and training strategies.

Reference graph

Works this paper leans on

69 extracted references · 30 canonical work pages · 11 internal anchors

[1] Anish Athalye, Logan Engstrom, Andrew Ilyas, and Kevin Kwok. Synthesizing robust adversarial examples. In International Conference on Machine Learning, pages 284–
[2] Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. OpenFlamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390, 2023
[3] Muhammad Awais, Muzammal Naseer, Salman Khan, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, and Fahad Shahbaz Khan. Foundational models defining a new era in vision: A survey and outlook. arXiv preprint arXiv:2307.13721, 2023
[4] Zechen Bai, Pichao Wang, Tianjun Xiao, Tong He, Zongbo Han, Zheng Zhang, and Mike Zheng Shou. Hallucination of multimodal large language models: A survey. arXiv preprint arXiv:2404.18930, 2024
[5] Rishika Bhagwatkar, Shravan Nayak, Reza Bayat, Alexis Roger, Daniel Z Kaplan, Pouya Bashivan, and Irina Rish. Towards adversarially robust vision-language models: Insights from design choices and prompt formatting techniques. arXiv preprint arXiv:2407.11121, 2024
[6] Davide Caffagni, Federico Cocchi, Luca Barsellotti, Nicholas Moratelli, Sara Sarto, Lorenzo Baraldi, Marcella Cornia, and Rita Cucchiara. The (r)evolution of multimodal large language models: A survey. arXiv preprint arXiv:2402.12451, 2024
[7] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pages 39–57. IEEE, 2017
[8] Nicholas Carlini, Milad Nasr, Christopher A Choquette-Choo, Matthew Jagielski, Irena Gao, Pang Wei W Koh, Daphne Ippolito, Florian Tramer, and Ludwig Schmidt. Are aligned neural networks adversarially aligned? Advances in Neural Information Processing Systems, 36, 2024
[9] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2(3):6, 2023
[10] Jeremy Cohen, Elan Rosenfeld, and Zico Kolter. Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning, pages 1310–
[11] Francesco Croce and Matthias Hein. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In International Conference on Machine Learning, pages 2206–2216. PMLR, 2020
[12] Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. HotFlip: White-box adversarial examples for text classification. arXiv preprint arXiv:1712.06751, 2017
[13] Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, et al. DataComp: In search of the next generation of multimodal datasets. Advances in Neural Information Processing Systems, 36, 2024
[14] Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A Smith. RealToxicityPrompts: Evaluating neural toxic degeneration in language models. arXiv preprint arXiv:2009.11462, 2020
[15] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014
[16] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6904–6913, 2017
[17] Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. VizWiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3608–3617, 2018
[18] Laura Hanu and Unitary team. Detoxify. GitHub. https://github.com/unitaryai/detoxify, 2020
[19] Dan Hendrycks, Kimin Lee, and Mantas Mazeika. Using pre-training can improve model robustness and uncertainty. In International Conference on Machine Learning, pages 2712–2721. PMLR, 2019
[20] Charles Herrmann, Kyle Sargent, Lu Jiang, Ramin Zabih, Huiwen Chang, Ce Liu, Dilip Krishnan, and Deqing Sun. Pyramid adversarial training improves ViT performance. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13419–13429, 2022
[21] Md Zarif Hossain and Ahmed Imteaj. Securing vision-language models with a robust encoder against jailbreak and adversarial attacks. arXiv preprint arXiv:2409.07353, 2024
[22] Md Zarif Hossain and Ahmed Imteaj. Sim-CLIP: Unsupervised siamese adversarial fine-tuning for robust and semantically-rich vision-language models. arXiv preprint arXiv:2407.14971, 2024
[23] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022
[24] Wen Huang, Hongbin Liu, Minxin Guo, and Neil Zhenqiang Gong. Visual hallucinations of multi-modal large language models. arXiv preprint arXiv:2402.14683, 2024
[25]

Jailbreakzoo: Survey, landscapes, and horizons in jailbreaking large language and vision-language models.arXiv preprint arXiv:2407.01599, 2024

Haibo Jin, Leyang Hu, Xinuo Li, Peiyan Zhang, Chonghan Chen, Jun Zhuang, and Haohan Wang. Jailbreakzoo: Survey, landscapes, and horizons in jailbreaking large language and vision-language models.arXiv preprint arXiv:2407.01599, 2024

[26]

Mathias Lecuyer, Vaggelis Atlidakis, Roxana Geambasu, Daniel Hsu, and Suman Jana. Certified robustness to adversarial examples with differential privacy. arXiv preprint arXiv:1802.03471, 2018.

[27]

Chunyuan Li, Zhe Gan, Zhengyuan Yang, Jianwei Yang, Linjie Li, Lijuan Wang, Jianfeng Gao, et al. Multimodal foundation models: From specialists to general-purpose assistants. Foundations and Trends® in Computer Graphics and Vision, 16(1-2):1–214, 2024.

[28]

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023.

[29]

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.

[30]

Chang Liu, Yinpeng Dong, Wenzhao Xiang, Xiao Yang, Hang Su, Jun Zhu, Yuefeng Chen, Yuan He, Hui Xue, and Shibao Zheng. A comprehensive study on robustness of image classification models: Benchmarking and rethinking. International Journal of Computer Vision, 133(2):567–589, 2025.

[31]

Chaohu Liu, Tianyi Gui, Yu Liu, and Linli Xu. Adpo: Enhancing the adversarial robustness of large vision-language models with preference optimization. arXiv preprint arXiv:2504.01735, 2025.

[32]

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024.

[33]

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024.

[34]

Xin Liu, Yichen Zhu, Yunshi Lan, Chao Yang, and Yu Qiao. Safety of multimodal large language models on images and text. arXiv preprint arXiv:2402.00357, 2024.

[35]

Xin Liu, Yichen Zhu, Jindong Gu, Yunshi Lan, Chao Yang, and Yu Qiao. Mm-safetybench: A benchmark for safety evaluation of multimodal large language models. In European Conference on Computer Vision, pages 386–403. Springer, 2025.

[36]

Yanqing Liu, Xianhang Li, Zeyu Wang, Bingchen Zhao, and Cihang Xie. Clips: An enhanced clip framework for learning with synthetic captions. arXiv preprint arXiv:2411.16828, 2024.

[37]

Aleksander Mądry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. stat, 1050(9), 2017.

[38]

Chengzhi Mao, Scott Geng, Junfeng Yang, Xin Wang, and Carl Vondrick. Understanding zero-shot adversarial robustness for large-scale models. arXiv preprint arXiv:2212.07016, 2022.

[39]

Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3195–3204, 2019.

[40]

Mazda Moayeri, Keivan Rezaei, Maziar Sanjabi, and Soheil Feizi. Text-to-concept (and back) via cross-model alignment. In International Conference on Machine Learning, pages 25037–25060. PMLR, 2023.

[41]

MosaicML NLP Team. Introducing mpt-7b: A new standard for open-source, commercially usable llms. https://www.mosaicml.com/blog/mpt-7b, 2023.

[42]

Weili Nie, Brandon Guo, Yujia Huang, Chaowei Xiao, Arash Vahdat, and Anima Anandkumar. Diffusion models for adversarial purification. arXiv preprint arXiv:2205.07460, 2022.

[43]

Zhenxing Niu, Haodong Ren, Xinbo Gao, Gang Hua, and Rong Jin. Jailbreaking attack against multimodal large language model. arXiv preprint arXiv:2402.02309, 2024.

[44]

Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE international conference on computer vision, pages 2641–2649, 2015.

[45]

Xiangyu Qi, Kaixuan Huang, Ashwinee Panda, Peter Henderson, Mengdi Wang, and Prateek Mittal. Visual adversarial examples jailbreak aligned large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 21527–21536, 2024.

[46]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.

[47]

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115:211–252, 2015.

[48]

Christian Schlarmann and Matthias Hein. On the adversarial robustness of multi-modal foundation models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3677–3685, 2023.

[49]

Christian Schlarmann, Naman Deep Singh, Francesco Croce, and Matthias Hein. Robust clip: Unsupervised adversarial fine-tuning of vision embeddings for robust large vision-language models. arXiv preprint arXiv:2402.12336, 2024.

[50]

Christian Schlarmann, Naman Deep Singh, Francesco Croce, and Matthias Hein. Robust clip: Unsupervised adversarial fine-tuning of vision embeddings for robust large vision-language models. In Forty-first International Conference on Machine Learning, 2024.

[51]

Ludwig Schmidt, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, and Aleksander Madry. Adversarially robust generalization requires more data. Advances in neural information processing systems, 31, 2018.

[52]

Lijun Sheng, Jian Liang, Zilei Wang, and Ran He. R-tpt: Improving adversarial robustness of vision-language models through test-time prompt tuning. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 29958–29967, 2025.

[53]

Min Shi, Fuxiao Liu, Shihao Wang, Shijia Liao, Subhashree Radhakrishnan, De-An Huang, Hongxu Yin, Karan Sapra, Yaser Yacoob, Humphrey Shi, et al. Eagle: Exploring the design space for multimodal llms with mixture of encoders. arXiv preprint arXiv:2408.15998, 2024.

[54]

Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326, 2019.

[55]

David Stutz, Matthias Hein, and Bernt Schiele. Disentangling adversarial robustness and generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6976–6987, 2019.

[56]

C Szegedy. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

[57]

Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? exploring the visual shortcomings of multimodal llms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9568–9578, 2024.

[58]

Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575, 2015.

[59]

Cunxiang Wang, Xiaoze Liu, Yuanhao Yue, Xiangru Tang, Tianhang Zhang, Cheng Jiayang, Yunzhi Yao, Wenyang Gao, Xuming Hu, Zehan Qi, et al. Survey on factuality in large language models: Knowledge, retrieval and domain-specificity. arXiv preprint arXiv:2310.07521, 2023.

[60]

Yiqi Wang, Wentao Chen, Xiaotian Han, Xudong Lin, Haiteng Zhao, Yongfei Liu, Bohan Zhai, Jianbo Yuan, Quanzeng You, and Hongxia Yang. Exploring the reasoning abilities of multimodal large language models (mllms): A comprehensive survey on emerging trends in multimodal reasoning. arXiv preprint arXiv:2401.06805, 2024.

[61]

Zeyu Wang, Xianhang Li, Hongru Zhu, and Cihang Xie. Revisiting adversarial training at scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24675–24685, 2024.

[62]

Zeyu Wang, Cihang Xie, Brian Bartoldson, and Bhavya Kailkhura. Double visual defense: Adversarial pre-training and instruction tuning for improving vision-language model robustness. arXiv preprint arXiv:2501.09446, 2025.

[63]

Songlong Xing, Zhengyu Zhao, and Nicu Sebe. Clip is strong enough to fight back: Test-time counterattacks towards zero-shot adversarial robustness of clip. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15172–15182, 2025.

[64]

Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models. arXiv preprint arXiv:2306.13549, 2023.

[65]

Hanwei Zhang, Yannis Avrithis, Teddy Furon, and Laurent Amsaleg. Smooth adversarial examples. EURASIP Journal on Information Security, 2020(1):15, 2020.

[66]

Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric Xing, Laurent El Ghaoui, and Michael Jordan. Theoretically principled trade-off between robustness and accuracy. In International conference on machine learning, pages 7472–7482. PMLR, 2019.

[67]

Shengyu Zhang, Linfeng Dong, Xiaoya Li, Sen Zhang, Xiaofei Sun, Shuhe Wang, Jiwei Li, Runyi Hu, Tianwei Zhang, Fei Wu, et al. Instruction tuning for large language models: A survey. arXiv preprint arXiv:2308.10792, 2023.

[68]

Yichi Zhang, Yao Huang, Yitong Sun, Chang Liu, Zhe Zhao, Zhengwei Fang, Yifan Wang, Huanran Chen, Xiao Yang, Xingxing Wei, et al. Benchmarking trustworthiness of multimodal large language models: A comprehensive study. arXiv preprint arXiv:2406.07057, 2024.

[69]

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.

Appendix

This appendix provides a comprehensive exploration of various aspects of the proposed approach.

• Section...


[19] [19]

Using pre-training can improve model robustness and uncertainty

Dan Hendrycks, Kimin Lee, and Mantas Mazeika. Using pre-training can improve model robustness and uncertainty. InInternational conference on machine learning, pages 2712–2721. PMLR, 2019

2019

[20] [20]

Pyramid adversarial training improves vit performance

Charles Herrmann, Kyle Sargent, Lu Jiang, Ramin Zabih, Huiwen Chang, Ce Liu, Dilip Krishnan, and Deqing Sun. Pyramid adversarial training improves vit performance. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13419–13429, 2022

2022

[21] [21]

Securing vision-language models with a robust encoder against jailbreak and adversarial attacks.arXiv preprint arXiv:2409.07353, 2024

Md Zarif Hossain and Ahmed Imteaj. Securing vision-language models with a robust encoder against jailbreak and adversarial attacks.arXiv preprint arXiv:2409.07353, 2024. MALIK ET.AL: ADVERSARIAL ROBUSTNESS OF MLLMS17

work page arXiv 2024

[22] [22]

Sim-CLIP: Unsupervised Siamese Adversarial Fine-Tuning for Robust and Semantically-Rich Vision-Language Models

Md Zarif Hossain and Ahmed Imteaj. Sim-clip: Unsupervised siamese adversarial fine-tuning for robust and semantically-rich vision-language models.arXiv preprint arXiv:2407.14971, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[23] [23]

Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

2022

[24] [24]

arXiv preprint arXiv:2402.14683 , year=

Wen Huang, Hongbin Liu, Minxin Guo, and Neil Zhenqiang Gong. Visual hallucina- tions of multi-modal large language models.arXiv preprint arXiv:2402.14683, 2024

work page arXiv 2024

[25] [25]

Jailbreakzoo: Survey, landscapes, and horizons in jailbreaking large language and vision-language models.arXiv preprint arXiv:2407.01599, 2024

Haibo Jin, Leyang Hu, Xinuo Li, Peiyan Zhang, Chonghan Chen, Jun Zhuang, and Haohan Wang. Jailbreakzoo: Survey, landscapes, and horizons in jailbreaking large language and vision-language models.arXiv preprint arXiv:2407.01599, 2024

work page arXiv 2024

[26] [26]

Certified Robustness to Adversarial Examples with Differential Privacy

Mathias Lecuyer, Vaggelis Atlidakis, Roxana Geambasu, Daniel Hsu, and Suman Jana. Certified robustness to adversarial examples with differential privacy.arXiv preprint arXiv:1802.03471, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[27] [27]

Multimodal foundation models: From specialists to general-purpose assistants.Foundations and Trends® in Computer Graphics and Vision, 16(1-2):1–214, 2024

Chunyuan Li, Zhe Gan, Zhengyuan Yang, Jianwei Yang, Linjie Li, Lijuan Wang, Jian- feng Gao, et al. Multimodal foundation models: From specialists to general-purpose assistants.Foundations and Trends® in Computer Graphics and Vision, 16(1-2):1–214, 2024

2024

[28] [28]

Evaluating Object Hallucination in Large Vision-Language Models

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models.arXiv preprint arXiv:2305.10355, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[29] [29]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ra- manan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. InComputer Vision–ECCV 2014: 13th European Conference, Zurich, Switzer- land, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014

2014

[30] [30]

A comprehensive study on robustness of image classification models: Benchmarking and rethinking.International Journal of Computer Vision, 133(2):567–589, 2025

Chang Liu, Yinpeng Dong, Wenzhao Xiang, Xiao Yang, Hang Su, Jun Zhu, Yuefeng Chen, Yuan He, Hui Xue, and Shibao Zheng. A comprehensive study on robustness of image classification models: Benchmarking and rethinking.International Journal of Computer Vision, 133(2):567–589, 2025

2025

[31] [31]

Adpo: Enhancing the adversarial ro- bustness of large vision-language models with preference optimization.arXiv preprint arXiv:2504.01735, 2025

Chaohu Liu, Tianyi Gui, Yu Liu, and Linli Xu. Adpo: Enhancing the adversarial ro- bustness of large vision-language models with preference optimization.arXiv preprint arXiv:2504.01735, 2025

work page arXiv 2025

[32] [32]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024

2024

[33] [33]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024

2024

[34] [34]

Safety of multimodal large language models on images and text.arXiv preprint arXiv:2402.00357, 2024

Xin Liu, Yichen Zhu, Yunshi Lan, Chao Yang, and Yu Qiao. Safety of multimodal large language models on images and text.arXiv preprint arXiv:2402.00357, 2024. 18MALIK ET.AL: ADVERSARIAL ROBUSTNESS OF MLLMS

work page arXiv 2024

[35] [35]

Mm- safetybench: A benchmark for safety evaluation of multimodal large language models

Xin Liu, Yichen Zhu, Jindong Gu, Yunshi Lan, Chao Yang, and Yu Qiao. Mm- safetybench: A benchmark for safety evaluation of multimodal large language models. InEuropean Conference on Computer Vision, pages 386–403. Springer, 2025

2025

[36] [36]

Clips: An enhanced clip framework for learning with synthetic captions.arXiv preprint arXiv:2411.16828, 2024

Yanqing Liu, Xianhang Li, Zeyu Wang, Bingchen Zhao, and Cihang Xie. Clips: An enhanced clip framework for learning with synthetic captions.arXiv preprint arXiv:2411.16828, 2024

work page arXiv 2024

[37] [37]

Towards deep learning models resistant to adversarial attacks.stat, 1050(9), 2017

Aleksander M ˛ adry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks.stat, 1050(9), 2017

2017

[38] [38]

Un- derstanding zero-shot adversarial robustness for large-scale models.arXiv preprint arXiv:2212.07016, 2022

Chengzhi Mao, Scott Geng, Junfeng Yang, Xin Wang, and Carl V ondrick. Un- derstanding zero-shot adversarial robustness for large-scale models.arXiv preprint arXiv:2212.07016, 2022

work page arXiv 2022

[39] [39]

Ok-vqa: A visual question answering benchmark requiring external knowledge

Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. InProceedings of the IEEE/cvf conference on computer vision and pattern recognition, pages 3195– 3204, 2019

2019

[40] [40]

Text-to-concept (and back) via cross-model alignment

Mazda Moayeri, Keivan Rezaei, Maziar Sanjabi, and Soheil Feizi. Text-to-concept (and back) via cross-model alignment. InInternational Conference on Machine Learning, pages 25037–25060. PMLR, 2023

2023

[41] [41]

Introducing mpt-7b: A new standard for open-source, com- mercially usable llms.https://www.mosaicml.com/blog/mpt-7b, 2023

MosaicML NLP Team. Introducing mpt-7b: A new standard for open-source, com- mercially usable llms.https://www.mosaicml.com/blog/mpt-7b, 2023

2023

[42] [42]

Diffusion models for adversarial purification.arXiv preprint arXiv:2205.07460, 2022

Weili Nie, Brandon Guo, Yujia Huang, Chaowei Xiao, Arash Vahdat, and An- ima Anandkumar. Diffusion models for adversarial purification.arXiv preprint arXiv:2205.07460, 2022

work page arXiv 2022

[43] [43]

Jailbreaking attack against multimodal large language model.arXiv preprint arXiv:2402.02309, 2024

Zhenxing Niu, Haodong Ren, Xinbo Gao, Gang Hua, and Rong Jin. Jailbreaking attack against multimodal large language model.arXiv preprint arXiv:2402.02309, 2024

work page arXiv 2024

[44] [44]

Flickr30k entities: Collecting region-to-phrase corre- spondences for richer image-to-sentence models

Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hocken- maier, and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase corre- spondences for richer image-to-sentence models. InProceedings of the IEEE interna- tional conference on computer vision, pages 2641–2649, 2015

2015

[45] [45]

Visual adversarial examples jailbreak aligned large language models

Xiangyu Qi, Kaixuan Huang, Ashwinee Panda, Peter Henderson, Mengdi Wang, and Prateek Mittal. Visual adversarial examples jailbreak aligned large language models. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 21527–21536, 2024

2024

[46] [46]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational confer- ence on machine learning, pages 8748–8763. PMLR, 2021. MALIK ET.AL: ADVERSARIAL ROBUSTNESS OF MLLMS19

2021

[47] [47]

Imagenet large scale visual recognition challenge.International journal of computer vision, 115: 211–252, 2015

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge.International journal of computer vision, 115: 211–252, 2015

2015

[48] [48]

On the adversarial robustness of multi-modal foundation models

Christian Schlarmann and Matthias Hein. On the adversarial robustness of multi-modal foundation models. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3677–3685, 2023

2023

[49] [49]

Ro- bust clip: Unsupervised adversarial fine-tuning of vision embeddings for robust large vision-language models.arXiv preprint arXiv:2402.12336, 2024

Christian Schlarmann, Naman Deep Singh, Francesco Croce, and Matthias Hein. Ro- bust clip: Unsupervised adversarial fine-tuning of vision embeddings for robust large vision-language models.arXiv preprint arXiv:2402.12336, 2024

work page arXiv 2024

[50] [50]

Ro- bust clip: Unsupervised adversarial fine-tuning of vision embeddings for robust large vision-language models

Christian Schlarmann, Naman Deep Singh, Francesco Croce, and Matthias Hein. Ro- bust clip: Unsupervised adversarial fine-tuning of vision embeddings for robust large vision-language models. InForty-first International Conference on Machine Learning, 2024

2024

[51] Ludwig Schmidt, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, and Aleksander Madry. Adversarially robust generalization requires more data. Advances in neural information processing systems, 31, 2018.

[52] Lijun Sheng, Jian Liang, Zilei Wang, and Ran He. R-TPT: Improving adversarial robustness of vision-language models through test-time prompt tuning. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 29958–29967, 2025.

[53] Min Shi, Fuxiao Liu, Shihao Wang, Shijia Liao, Subhashree Radhakrishnan, De-An Huang, Hongxu Yin, Karan Sapra, Yaser Yacoob, Humphrey Shi, et al. Eagle: Exploring the design space for multimodal LLMs with mixture of encoders. arXiv preprint arXiv:2408.15998, 2024.

[54] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards VQA models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326, 2019.

[55] David Stutz, Matthias Hein, and Bernt Schiele. Disentangling adversarial robustness and generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6976–6987, 2019.

[56] C Szegedy. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

[57] Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? Exploring the visual shortcomings of multimodal LLMs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9568–9578, 2024.

[58] Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575, 2015.

[59] Cunxiang Wang, Xiaoze Liu, Yuanhao Yue, Xiangru Tang, Tianhang Zhang, Cheng Jiayang, Yunzhi Yao, Wenyang Gao, Xuming Hu, Zehan Qi, et al. Survey on factuality in large language models: Knowledge, retrieval and domain-specificity. arXiv preprint arXiv:2310.07521, 2023.

[60] Yiqi Wang, Wentao Chen, Xiaotian Han, Xudong Lin, Haiteng Zhao, Yongfei Liu, Bohan Zhai, Jianbo Yuan, Quanzeng You, and Hongxia Yang. Exploring the reasoning abilities of multimodal large language models (MLLMs): A comprehensive survey on emerging trends in multimodal reasoning. arXiv preprint arXiv:2401.06805, 2024.

[61] Zeyu Wang, Xianhang Li, Hongru Zhu, and Cihang Xie. Revisiting adversarial training at scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24675–24685, 2024.

[62] Zeyu Wang, Cihang Xie, Brian Bartoldson, and Bhavya Kailkhura. Double visual defense: Adversarial pre-training and instruction tuning for improving vision-language model robustness. arXiv preprint arXiv:2501.09446, 2025.

[63] Songlong Xing, Zhengyu Zhao, and Nicu Sebe. CLIP is strong enough to fight back: Test-time counterattacks towards zero-shot adversarial robustness of CLIP. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15172–15182, 2025.

[64] Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models. arXiv preprint arXiv:2306.13549, 2023.

[65] Hanwei Zhang, Yannis Avrithis, Teddy Furon, and Laurent Amsaleg. Smooth adversarial examples. EURASIP Journal on Information Security, 2020(1):15, 2020.

[66] Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric Xing, Laurent El Ghaoui, and Michael Jordan. Theoretically principled trade-off between robustness and accuracy. In International conference on machine learning, pages 7472–7482. PMLR, 2019.

[67] Shengyu Zhang, Linfeng Dong, Xiaoya Li, Sen Zhang, Xiaofei Sun, Shuhe Wang, Jiwei Li, Runyi Hu, Tianwei Zhang, Fei Wu, et al. Instruction tuning for large language models: A survey. arXiv preprint arXiv:2308.10792, 2023.

[68] Yichi Zhang, Yao Huang, Yitong Sun, Chang Liu, Zhe Zhao, Zhengwei Fang, Yifan Wang, Huanran Chen, Xiao Yang, Xingxing Wei, et al. Benchmarking trustworthiness of multimodal large language models: A comprehensive study. arXiv preprint arXiv:2406.07057, 2024.

[69] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.

Appendix

This appendix provides a comprehensive exploration of various aspects of the proposed approach. •Section...