Robust vision encoders from multimodal adversarial pretraining transfer to MLLMs and deliver large gains in adversarial captioning and VQA performance, while test-time stochastic transformations provide an effective black-box defense.
Towards adversarially robust vision-language mod- els: Insights from design choices and prompt formatting techniques.arXiv preprint arXiv:2407.11121, 2024
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.CV 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Investigating Adversarial Robustness of Multi-modal Large Language Models
Robust vision encoders from multimodal adversarial pretraining transfer to MLLMs and deliver large gains in adversarial captioning and VQA performance, while test-time stochastic transformations provide an effective black-box defense.