Sim-CLIP: Unsupervised Siamese Adversarial Fine-Tuning for Robust and Semantically-Rich Vision-Language Models

· 2024 · cs.CV · arXiv 2407.14971

1 Pith paper cite this work. Polarity classification is still indexing.

1 Pith paper citing it

open full Pith review browse 1 citing papers arXiv PDF

abstract

Vision-Language Models (VLMs) rely heavily on pretrained vision encoders to support downstream tasks such as image captioning, visual question answering, and zero-shot classification. Despite their strong performance, these encoders remain highly vulnerable to imperceptible adversarial perturbations, which can severely degrade both robustness and semantic quality in multimodal reasoning. In this work, we introduce Sim-CLIP, an unsupervised adversarial fine-tuning framework that enhances the robustness of the CLIP vision encoder while preserving overall semantic representations. Sim-CLIP adopts a Siamese training architecture with a cosine similarity objective and a symmetric stop-gradient mechanism to enforce semantic alignment between clean and adversarial views. This design avoids large-batch contrastive learning and additional momentum encoders, enabling robust training with low computational overhead. We evaluate Sim-CLIP across multiple Vision-Language Models and tasks under both targeted and untargeted adversarial attacks. Experimental results demonstrate that Sim-CLIP consistently outperforms state-of-the-art robust CLIP variants, achieving stronger adversarial robustness while maintaining or improving semantic fidelity. These findings highlight the limitations of existing adversarial defenses and establish Sim-CLIP as an effective and scalable solution for robust vision-language representation learning.

representative citing papers

Investigating Adversarial Robustness of Multi-modal Large Language Models

cs.CV · 2026-06-02 · unverdicted · novelty 6.0

Robust vision encoders from multimodal adversarial pretraining transfer to MLLMs and deliver large gains in adversarial captioning and VQA performance, while test-time stochastic transformations provide an effective black-box defense.

citing papers explorer

Showing 1 of 1 citing paper after filters.

Investigating Adversarial Robustness of Multi-modal Large Language Models cs.CV · 2026-06-02 · unverdicted · none · ref 22 · internal anchor
Robust vision encoders from multimodal adversarial pretraining transfer to MLLMs and deliver large gains in adversarial captioning and VQA performance, while test-time stochastic transformations provide an effective black-box defense.

Sim-CLIP: Unsupervised Siamese Adversarial Fine-Tuning for Robust and Semantically-Rich Vision-Language Models

fields

years

verdicts

representative citing papers

citing papers explorer