Sim-CLIP: Unsupervised Siamese Adversarial Fine-Tuning for Robust and Semantically-Rich Vision-Language Models

Ahmed Imteaj; Md Zarif Hossain

arxiv: 2407.14971 · v3 · submitted 2024-07-20 · 💻 cs.CV · cs.AI · cs.CL · cs.LG

Pith reviewed 2026-05-23 22:31 UTC · model grok-4.3

keywords: adversarial robustness · CLIP vision encoder · Siamese architecture · unsupervised fine-tuning · cosine similarity · vision-language models · stop-gradient

The pith

A Siamese cosine-similarity setup fine-tunes CLIP vision encoders to resist adversarial attacks while keeping semantic quality intact.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Sim-CLIP to make the vision encoders inside vision-language models harder to fool with tiny image changes. It trains in an unsupervised way by feeding clean and attacked versions of the same image through two weight-sharing branches of the network and pulling their outputs together with a cosine similarity loss. A stop-gradient, applied symmetrically to each branch in turn, is meant to prevent the representations from collapsing to a trivial solution. The method avoids the usual requirements of huge batches or extra momentum networks, which keeps training cheap. According to the abstract, experiments across several models and tasks show better resistance to both targeted and untargeted attacks than earlier robust CLIP versions, with semantic performance maintained or improved.

Core claim

Sim-CLIP adopts a Siamese training architecture with a cosine similarity objective and a symmetric stop-gradient mechanism to enforce semantic alignment between clean and adversarial views. This unsupervised approach enhances the robustness of the CLIP vision encoder against adversarial perturbations while preserving semantic representations, outperforming existing robust CLIP variants across multiple vision-language tasks and attack types.

What carries the argument

Siamese training architecture with cosine similarity objective and symmetric stop-gradient mechanism that aligns representations of clean and adversarial image views.
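
To make the objective concrete, here is a minimal sketch of the forward value of the symmetric loss, written in plain Python. It assumes that `CosSim` denotes negative cosine similarity (consistent with the quoted statement that the method minimizes negative cosine similarity); the function names `neg_cosine`, `stopgrad`, and `sim_clip_loss` and the toy vectors are illustrative and not taken from the paper. Plain Python has no automatic differentiation, so `stopgrad` is only a marker for where a framework would block gradients.

```python
import math


def neg_cosine(a, b):
    """Negative cosine similarity between two vectors (assumed meaning of CosSim)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return -dot / (norm_a * norm_b)


def stopgrad(v):
    """Identity in the forward pass. In an autograd framework this is where
    gradients would be blocked; here it is only a marker (assumption)."""
    return list(v)


def sim_clip_loss(r_p, r_c):
    """Symmetric loss: each representation is compared against a
    gradient-blocked copy of the other, and the two terms are averaged."""
    return 0.5 * (neg_cosine(r_p, stopgrad(r_c)) + neg_cosine(r_c, stopgrad(r_p)))


# Toy representations of a clean view and a perturbed view (hypothetical values).
clean = [1.0, 2.0, 2.0]
perturbed = [2.0, 1.0, 2.0]

loss_same = sim_clip_loss(clean, clean)      # identical views reach the minimum, -1
loss_pair = sim_clip_loss(perturbed, clean)  # misaligned views give a higher loss
```

The loss is bounded below by -1, reached when the two representations point in the same direction, so training pushes the perturbed view's representation toward the clean one.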

If this is right

The fine-tuned encoder resists both targeted and untargeted attacks more effectively than prior robust CLIP variants.
Semantic performance on downstream tasks such as captioning, visual question answering, and zero-shot classification stays the same or improves.
Training requires only modest compute because it skips large-batch contrastive losses and extra momentum encoders.
The same alignment procedure applies across multiple vision-language models without task-specific labels.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same clean-adversarial alignment pattern could be tried on encoders used for audio or video tasks to check whether the robustness gain generalizes.
If the stop-gradient trick proves stable, it might reduce reliance on supervised adversarial training pipelines that need labeled training data.
Replacing cosine similarity with other distance measures in the Siamese head would show which objectives best trade off robustness against semantic drift.

Load-bearing premise

The cosine similarity objective with symmetric stop-gradient will enforce meaningful semantic alignment between clean and adversarial views without large-batch contrastive learning or momentum encoders.

What would settle it

A direct test in which the fine-tuned encoder shows no gain in accuracy under a fresh set of adversarial perturbations, or performs worse on zero-shot image classification than the original CLIP encoder.
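
The test above can be sketched as a small decision rule. The sketch assumes zero-shot classification is done the usual CLIP way, by picking the class whose text embedding has the highest cosine similarity with the image embedding. The helper names, the `tolerance` parameter, the toy embeddings, and the accuracy figures passed to `claim_is_refuted` are all hypothetical; none of them comes from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def zero_shot_predict(image_emb, class_text_embs):
    """Index of the class text embedding most similar to the image embedding."""
    scores = [cosine(image_emb, t) for t in class_text_embs]
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(image_embs, labels, class_text_embs):
    """Fraction of images whose zero-shot prediction matches the label."""
    hits = sum(zero_shot_predict(e, class_text_embs) == y
               for e, y in zip(image_embs, labels))
    return hits / len(labels)


def claim_is_refuted(orig, tuned, tolerance=0.0):
    """orig / tuned: dicts with 'clean' and 'robust' accuracy.
    The claim fails if robustness does not improve OR clean accuracy drops."""
    no_robust_gain = tuned["robust"] <= orig["robust"] + tolerance
    clean_drop = tuned["clean"] < orig["clean"] - tolerance
    return no_robust_gain or clean_drop


# Toy embeddings: two classes, three images (hypothetical values).
class_text_embs = [[1.0, 0.0], [0.0, 1.0]]
image_embs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
labels = [0, 1, 1]
toy_accuracy = accuracy(image_embs, labels, class_text_embs)  # third image is misclassified

# Made-up accuracy figures, used only to exercise the rule.
verdict_gain = claim_is_refuted({"clean": 0.70, "robust": 0.10},
                                {"clean": 0.70, "robust": 0.40})
verdict_flat = claim_is_refuted({"clean": 0.70, "robust": 0.10},
                                {"clean": 0.70, "robust": 0.10})
```

In a real evaluation the robust accuracy would be measured on perturbations generated after fine-tuning, against the fine-tuned encoder itself, so that the comparison uses a fresh set of attacks.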

Original abstract

Vision-Language Models (VLMs) rely heavily on pretrained vision encoders to support downstream tasks such as image captioning, visual question answering, and zero-shot classification. Despite their strong performance, these encoders remain highly vulnerable to imperceptible adversarial perturbations, which can severely degrade both robustness and semantic quality in multimodal reasoning. In this work, we introduce Sim-CLIP, an unsupervised adversarial fine-tuning framework that enhances the robustness of the CLIP vision encoder while preserving overall semantic representations. Sim-CLIP adopts a Siamese training architecture with a cosine similarity objective and a symmetric stop-gradient mechanism to enforce semantic alignment between clean and adversarial views. This design avoids large-batch contrastive learning and additional momentum encoders, enabling robust training with low computational overhead. We evaluate Sim-CLIP across multiple Vision-Language Models and tasks under both targeted and untargeted adversarial attacks. Experimental results demonstrate that Sim-CLIP consistently outperforms state-of-the-art robust CLIP variants, achieving stronger adversarial robustness while maintaining or improving semantic fidelity. These findings highlight the limitations of existing adversarial defenses and establish Sim-CLIP as an effective and scalable solution for robust vision-language representation learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Sim-CLIP's Siamese cosine objective with stop-gradient is presented as a low-overhead robustness fix for CLIP encoders, but the abstract supplies no numbers or attack details so the outperformance claim cannot be checked.

The paper introduces Sim-CLIP, an unsupervised adversarial fine-tuning method for the CLIP vision encoder. It uses a Siamese architecture with cosine similarity between clean and adversarial views plus a symmetric stop-gradient to align the representations. This setup is meant to avoid large-batch contrastive training and momentum encoders, which keeps the overhead low and could make it easier to apply in practice for vision-language models. That efficiency angle is the clearest practical element, provided the alignment actually produces useful invariance.

The abstract claims consistent outperformance over existing robust CLIP variants on both adversarial robustness and semantic fidelity across multiple models, tasks, and attack types. If the experiments back this up with clear metrics and controls, it would be relevant for people working on reliable multimodal systems.

The main problem is that the abstract contains no quantitative results, no baselines, no dataset sizes, and no attack strengths or success rates. The central claims therefore cannot be evaluated from what is shown. The concern about the objective also applies: with only positive pairs and a simple cosine loss, the optimization could satisfy the objective through collapse to constant embeddings, or by the encoder learning to treat the perturbation as noise without preserving semantics. Nothing in the abstract rules that out or shows ablations against it.

The work targets researchers focused on adversarial defenses inside VLMs. A reader in that narrow area might want to see the full experiments, but the lack of any data makes it hard to extract value or judge whether the mechanism delivers non-trivial gains. This does not look ready for peer review until the results section and validation against trivial solutions are included.

Referee Report

2 major / 1 minor

Summary. The paper introduces Sim-CLIP, an unsupervised adversarial fine-tuning framework for CLIP vision encoders in VLMs. It employs a Siamese architecture with a cosine similarity objective and symmetric stop-gradient mechanism to align clean and adversarial views, avoiding large-batch contrastive learning and momentum encoders. The central claim is that this yields stronger adversarial robustness than prior robust CLIP variants while preserving or improving semantic fidelity, as demonstrated across multiple VLMs, tasks, and both targeted and untargeted attacks.

Significance. If the results hold, the work would provide a low-overhead approach to robustifying vision encoders for downstream VLM tasks such as captioning, VQA, and zero-shot classification. The design choice to forgo large batches and momentum encoders could improve accessibility of adversarial training for VLMs.

major comments (2)

[Methods (Siamese objective and stop-gradient)] The training objective (described in the methods): the cosine similarity loss with symmetric stop-gradient is asserted to enforce non-trivial semantic alignment between clean and adversarial views. However, without negative samples or a momentum encoder, the only pressure is to make the two views close; because the adversarial view is generated to maximize loss, the optimizer can satisfy the objective by ignoring the perturbation or by collapsing representations rather than learning invariance. This directly threatens the claim of both improved robustness and preserved semantic fidelity.
[Abstract and Experiments] Experimental claims (abstract and results section): the manuscript asserts consistent outperformance over SOTA robust CLIP variants with stronger robustness and maintained semantic fidelity, yet the abstract supplies no numerical values, attack strengths (e.g., epsilon), datasets, or baseline numbers. The results section must furnish these quantities with clear tables so that the reported gains can be verified against the data.
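
The collapse concern in the first major comment can be probed with a simple diagnostic of the kind used in the SimSiam line of work: the standard deviation of L2-normalized embeddings, taken per dimension across a batch. A value near zero signals collapse to a constant vector. This is a minimal sketch; the function names and toy batches are hypothetical, and the check says nothing about whether semantics are preserved, only whether outputs have become constant.

```python
import math
import statistics


def l2_normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mean_per_dim_std(embeddings):
    """Average, over dimensions, of the population std of the
    L2-normalized embeddings across the batch."""
    normalized = [l2_normalize(e) for e in embeddings]
    dims = len(normalized[0])
    stds = [statistics.pstdev([row[d] for row in normalized]) for d in range(dims)]
    return sum(stds) / dims


# Collapsed batch: every embedding points in the same direction.
collapsed = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
# Spread-out batch: embeddings cover different directions.
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

collapsed_score = mean_per_dim_std(collapsed)  # close to zero
spread_score = mean_per_dim_std(spread)        # clearly above zero
```

Tracking this value during fine-tuning, on both clean and adversarial batches, would be one way for the results section to show that the objective is not satisfied by constant embeddings.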

minor comments (1)

[Abstract] The abstract would be strengthened by including at least one or two key quantitative metrics (robustness accuracy, semantic similarity scores) to support the outperformance statement.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below with clarifications and indicate planned revisions where appropriate.

Referee: [Methods (Siamese objective and stop-gradient)] The training objective (described in the methods): the cosine similarity loss with symmetric stop-gradient is asserted to enforce non-trivial semantic alignment between clean and adversarial views. However, without negative samples or a momentum encoder, the only pressure is to make the two views close; because the adversarial view is generated to maximize loss, the optimizer can satisfy the objective by ignoring the perturbation or by collapsing representations rather than learning invariance. This directly threatens the claim of both improved robustness and preserved semantic fidelity.

Authors: The symmetric stop-gradient, following the SimSiam design, prevents trivial collapse by stopping gradient flow through one branch, forcing the model to learn non-trivial representations. The adversarial view is generated by maximizing dissimilarity to the clean view via PGD, after which the cosine similarity objective pulls the representations together; this combination encourages invariance specifically to adversarial perturbations rather than ignoring them. Empirical results across VLMs, tasks, and attack types show gains in robustness metrics without loss in semantic fidelity on downstream tasks, supporting that the objective achieves the intended effect.
revision: no

Referee: [Abstract and Experiments] Experimental claims (abstract and results section): the manuscript asserts consistent outperformance over SOTA robust CLIP variants with stronger robustness and maintained semantic fidelity, yet the abstract supplies no numerical values, attack strengths (e.g., epsilon), datasets, or baseline numbers. The results section must furnish these quantities with clear tables so that the reported gains can be verified against the data.

Authors: We agree that the abstract would benefit from key quantitative highlights and that results tables should explicitly list attack parameters, datasets, and baseline comparisons for verifiability. In the revised manuscript we will update the abstract to report specific robustness gains (e.g., under standard epsilon values) and ensure all results tables include complete numerical comparisons with attack strengths and metrics.
revision: yes
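
The first response describes the adversarial view as the result of PGD maximizing dissimilarity to the clean view. A minimal sketch of that step follows, under loudly stated assumptions: a toy linear map stands in for the CLIP vision encoder, the perturbation is bounded in the L-infinity norm, and gradients are estimated by finite differences because plain Python has no autograd. The values of `eps`, `step`, and `iters` are placeholders, not the paper's settings, which the source does not give.

```python
import math
import random


def encode(weights, x):
    """Toy linear encoder standing in for the vision encoder (assumption)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def cosine(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    return dot / (math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b)))


def sign(g):
    return (g > 0) - (g < 0)


def pgd_attack(weights, x, eps=0.1, step=0.025, iters=10, seed=0, h=1e-5):
    """Search the L-inf ball of radius eps for a perturbation that lowers the
    cosine similarity between the perturbed and clean embeddings."""
    rng = random.Random(seed)
    clean_emb = encode(weights, x)

    def similarity(delta):
        x_adv = [xi + di for xi, di in zip(x, delta)]
        return cosine(encode(weights, x_adv), clean_emb)

    # Random start: at delta = 0 the similarity is at its maximum,
    # so its gradient is zero there and a signed step carries no signal.
    delta = [rng.uniform(-eps, eps) for _ in x]
    best_delta, best_sim = list(delta), similarity(delta)
    for _ in range(iters):
        # Central finite-difference estimate of d(similarity)/d(delta).
        grad = []
        for i in range(len(delta)):
            up, down = list(delta), list(delta)
            up[i] += h
            down[i] -= h
            grad.append((similarity(up) - similarity(down)) / (2 * h))
        # Signed step that decreases similarity, then projection onto the ball.
        delta = [max(-eps, min(eps, d - step * sign(g))) for d, g in zip(delta, grad)]
        sim = similarity(delta)
        if sim < best_sim:
            best_delta, best_sim = list(delta), sim
    return best_delta, best_sim


weights = [[1.0, 0.5, -0.3], [0.2, -1.0, 0.8]]  # hypothetical encoder weights
image = [0.5, 0.2, 0.9]                          # hypothetical input
adv_delta, adv_sim = pgd_attack(weights, image)
```

In training, the perturbed input `image + adv_delta` would be encoded and paired with the clean encoding in the cosine objective. A practical implementation would compute gradients with an autograd framework rather than finite differences.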

Circularity Check

0 steps flagged

No circularity: method and claims are self-contained

The paper describes Sim-CLIP via a Siamese cosine-similarity objective with symmetric stop-gradient and reports empirical outperformance on robustness and semantic tasks. No equations, fitted parameters, or self-citations are shown that reduce any claimed result to a definition or prior input by construction. The derivation chain consists of a stated architecture plus experimental evaluation; it does not contain self-definitional steps, fitted-input predictions, or load-bearing self-citation chains.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities; the approach rests on standard assumptions from contrastive learning and adversarial training that are not detailed here.

pith-pipeline@v0.9.0 · 5742 in / 1087 out tokens · 39133 ms · 2026-05-23T22:31:07.745344+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

We minimize negative cosine similarity between the representations $R_p$ and $R_c$ … $L_{\text{simclip}}(R_p, R_c) = \frac{1}{2}\left(\text{CosSim}(R_p, \text{stopgrad}(R_c)) + \text{CosSim}(R_c, \text{stopgrad}(R_p))\right)$

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean embed_injective unclear

Sim-CLIP adopts a Siamese training architecture with a cosine similarity objective … without requiring large batch sizes or additional momentum encoders

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.