What Makes VLMs Robust? Towards Reconciling Robustness and Accuracy in Vision-Language Models

Jie Zhang; Sen Nie; Shiguang Shan; Xilin Chen; Zhaoyang Wei; Zhongqi Wang

arxiv: 2603.12799 · v2 · submitted 2026-03-13 · 💻 cs.CV

Sen Nie, Jie Zhang, Zhongqi Wang, Zhaoyang Wei, Shiguang Shan, Xilin Chen

Pith reviewed 2026-05-15 11:46 UTC · model grok-4.3

keywords: adversarial robustness, vision-language models, shallow layers, robustness-accuracy tradeoff, fine-tuning, spectral bias, attention patterns

The pith

Adversarial robustness in vision-language models concentrates in their shallow layers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates the long-standing trade-off where making vision-language models robust to adversarial attacks reduces their accuracy on clean data. Analysis of adversarially fine-tuned models reveals that robustness mechanisms are not spread evenly but cluster mainly in the shallow layers, supported by a bias toward low-frequency patterns and attention that stays largely insensitive to the specific input. Updates to deeper layers, by contrast, tend to degrade both normal accuracy and robustness under attack. From this finding the authors introduce a framework that freezes the original pre-trained weights and applies only small, targeted changes in the initial layers. This yields strong robustness while keeping clean performance high across many datasets and tasks.
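One generic way to put a number on a "bias toward low-frequency patterns" is to ask what share of a signal's spectral energy sits at low frequencies. The sketch below does this for a 1-D signal with a plain discrete Fourier transform. It is an illustration only: the abstract does not say how the authors measure spectral bias, and the function names and the `cutoff` parameter are assumptions introduced here.

```python
import cmath
import math
from typing import List


def dft_energy(signal: List[float]) -> List[float]:
    """Return |X[k]|^2 for each DFT bin k of a real 1-D signal."""
    n = len(signal)
    energy = []
    for k in range(n):
        coeff = sum(
            value * cmath.exp(-2j * math.pi * k * t / n)
            for t, value in enumerate(signal)
        )
        energy.append(abs(coeff) ** 2)
    return energy


def low_frequency_fraction(signal: List[float], cutoff: int) -> float:
    """Share of non-DC spectral energy at frequencies 1..cutoff (inclusive).

    For a real signal, bin k and bin n-k carry the same frequency, so both
    are counted under frequency min(k, n-k). `cutoff` is a hypothetical
    threshold chosen by the caller, not a value taken from the paper.
    """
    n = len(signal)
    energy = dft_energy(signal)
    total = sum(energy[1:])  # ignore the DC (mean) component
    if total == 0.0:
        return 0.0
    low = sum(e for k, e in enumerate(energy) if 1 <= min(k, n - k) <= cutoff)
    return low / total
```

A slowly varying signal scores close to 1 and a rapidly alternating one close to 0. For image features or perturbations the same idea would be applied with a 2-D transform.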

Core claim

Adversarial robustness is not uniformly distributed across network depth but is primarily localized within the shallow layers, driven by a low-frequency spectral bias and input-insensitive attention patterns, while updates to the deep layers undermine both clean accuracy and robust generalization. This observation enables the R-Adapt framework that freezes all pre-trained weights and introduces minimal adaptations only in the initial layers to reconcile the two goals.
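The "freeze all pre-trained weights, adapt only the initial layers" recipe can be sketched with a toy stack of scalar layers. Everything here is hypothetical: the `Layer` class, the additive `delta` term, and the `num_shallow` cutoff are stand-ins, since the abstract does not describe the form R-Adapt's adaptations actually take.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Layer:
    """Toy layer computing y = (weight + delta) * x."""
    weight: float            # pre-trained value, never modified
    delta: float = 0.0       # hypothetical lightweight adaptation term
    adaptable: bool = False  # True only for layers allowed to change


def mark_shallow_adaptable(layers: List[Layer], num_shallow: int) -> None:
    """Allow adaptation in the first `num_shallow` layers only."""
    for index, layer in enumerate(layers):
        layer.adaptable = index < num_shallow


def apply_update(layers: List[Layer], updates: List[float]) -> None:
    """Apply proposed updates, dropping any aimed at a non-adaptable layer."""
    for layer, update in zip(layers, updates):
        if layer.adaptable:
            layer.delta += update  # the pre-trained `weight` stays untouched


def forward(layers: List[Layer], x: float) -> float:
    """Run the input through every layer in order."""
    for layer in layers:
        x = (layer.weight + layer.delta) * x
    return x
```

Keeping the adaptation in a separate `delta` rather than overwriting `weight` means the original model can be recovered by zeroing the deltas, which is one reason adapter-style designs are convenient.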

What carries the argument

Localization of adversarial robustness to shallow layers through low-frequency spectral bias and input-insensitive attention patterns, realized by the R-Adapt method that adapts only the first layers while freezing the remainder.
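"Input-insensitive attention" can be made concrete by comparing the attention distributions that the same head produces for different inputs. The sketch below scores insensitivity as one minus the mean pairwise total variation distance. This metric is an assumption chosen for illustration; the abstract does not state which measure the authors use.

```python
from itertools import combinations
from typing import List, Sequence


def total_variation(p: Sequence[float], q: Sequence[float]) -> float:
    """Total variation distance between two distributions (range 0..1)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def input_insensitivity(attention_rows: List[Sequence[float]]) -> float:
    """Return 1 minus the mean pairwise TV distance across inputs.

    Each row is one attention distribution (non-negative, summing to 1)
    from the same head and query position, computed on a different input.
    A score of 1.0 means the pattern is identical for every input; lower
    scores mean the pattern depends more strongly on the input.
    """
    pairs = list(combinations(attention_rows, 2))
    if not pairs:
        raise ValueError("need at least two inputs to compare")
    mean_distance = sum(total_variation(p, q) for p, q in pairs) / len(pairs)
    return 1.0 - mean_distance
```

Computing this score per layer would give a depth profile, which is the kind of layer-wise view the localization claim relies on.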

If this is right

Minimal adaptations confined to shallow layers can deliver high adversarial robustness without large losses in clean accuracy.
The same shallow-layer approach works under training-free, model-guided, and data-driven adaptation paradigms.
The method scales directly to large models such as LLaVA and Qwen-VL.
Strong performance is maintained across 18 datasets and multiple attack types.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If robustness lives mostly in shallow layers, full-network adversarial training may be unnecessary and computationally wasteful.
Layer-wise localization analysis could be applied to other vision or multimodal architectures to find similar efficiency gains.
Shallow-layer interventions might also improve robustness properties outside adversarial settings, such as resistance to distribution shifts.

Load-bearing premise

The localization of robustness to shallow layers observed in adversarially fine-tuned models is general enough that minimal adaptations only in the initial layers suffice to balance robustness and accuracy without changes to deeper layers.

What would settle it

An experiment in which adapting only the deep layers of a VLM produces a better robustness-accuracy trade-off than adapting only the shallow layers would contradict the localization claim.
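The comparison described above needs a definition of "better trade-off". One simple reading, assumed here and not taken from the paper, is Pareto dominance over clean and robust accuracy. The names `TradeOff`, `dominates`, and `localization_verdict` are hypothetical.

```python
from typing import NamedTuple


class TradeOff(NamedTuple):
    clean_acc: float   # accuracy on unperturbed inputs
    robust_acc: float  # accuracy under attack


def dominates(a: TradeOff, b: TradeOff) -> bool:
    """True if `a` is at least as good as `b` on both axes and better on one."""
    at_least = a.clean_acc >= b.clean_acc and a.robust_acc >= b.robust_acc
    strictly = a.clean_acc > b.clean_acc or a.robust_acc > b.robust_acc
    return at_least and strictly


def localization_verdict(shallow_only: TradeOff, deep_only: TradeOff) -> str:
    """Classify the outcome of the shallow-only vs deep-only experiment."""
    if dominates(deep_only, shallow_only):
        return "contradicts localization claim"
    if dominates(shallow_only, deep_only):
        return "consistent with localization claim"
    return "inconclusive"
```

When each configuration wins on one axis, neither dominates and the verdict is inconclusive; settling such a case would require agreeing on a scalar criterion beforehand.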

Original abstract

Achieving adversarial robustness in Vision-Language Models (VLMs) inevitably compromises accuracy on clean data, presenting a long-standing and challenging trade-off. In this work, we revisit this trade-off by investigating a fundamental question: What makes VLMs robust? Through a detailed analysis of adversarially fine-tuned models, we examine how robustness mechanisms function internally and how they interact with clean accuracy. Our analysis reveals that adversarial robustness is not uniformly distributed across network depth. Instead, unexpectedly, it is primarily localized within the shallow layers, driven by a low-frequency spectral bias and input-insensitive attention patterns. Meanwhile, updates to the deep layers tend to undermine both clean accuracy and robust generalization. Motivated by these insights, we propose Adversarial Robustness Adaptation (R-Adapt), a simple yet effective framework that freezes all pre-trained weights and introduces minimal, insight-driven adaptations only in the initial layers. This design achieves an exceptional balance between adversarial robustness and clean accuracy. R-Adapt further supports training-free, model-guided, and data-driven paradigms, offering flexible pathways to seamlessly equip standard models with robustness. Extensive evaluations on 18 datasets and diverse tasks demonstrate our state-of-the-art performance under various attacks. Notably, R-Adapt generalizes efficiently to large vision-language models (e.g., LLaVA and Qwen-VL) to enhance their robustness. Our project page is available at https://summu77.github.io/R-Adapt.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper claims robustness in VLMs concentrates in shallow layers from low-frequency bias and input-insensitive attention, so freezing deeper layers and adapting only the start with R-Adapt balances it against clean accuracy.

The main takeaway is that adversarial robustness in these vision-language models sits mostly in the shallow layers rather than being uniform across depth. The analysis of fine-tuned models points to low-frequency spectral bias and attention patterns that ignore input details as the drivers, while changes deeper in the network hurt both clean accuracy and robust generalization. From that they build R-Adapt, which freezes the pre-trained weights and adds only minimal, targeted changes in the initial layers, with support for training-free, model-guided, or data-driven versions. It reportedly scales to large models like LLaVA and Qwen-VL and hits strong numbers across 18 datasets under multiple attacks.

That localization insight and the resulting lightweight framework are the clearest new pieces relative to earlier robustness work on VLMs. The practical angle is useful: a method that avoids full retraining while claiming to keep accuracy intact could matter for anyone deploying these models in settings where both reliability and performance count.

The abstract does not include the actual numbers, controls, or layer-wise breakdowns, so the strength of the spectral bias and attention evidence is hard to judge from what is here. The assumption that minimal early-layer tweaks are enough might be tied to the specific fine-tuning setups they examined, and it would be good to see how far that generalizes.

This is aimed at researchers working on robust multimodal models who need efficient adaptation tricks rather than full retraining. A reader already familiar with spectral analysis or layer-wise studies in vision models would get the most from the internal breakdown. I would send it for peer review. The idea is straightforward enough that referees can test the claims directly against the experiments.

Referee Report

1 major / 0 minor

Summary. The manuscript claims that adversarial robustness in Vision-Language Models is not uniformly distributed but localized primarily in shallow layers, driven by low-frequency spectral bias and input-insensitive attention patterns, while updates to deep layers undermine both clean accuracy and robust generalization. Motivated by this analysis of adversarially fine-tuned models, it proposes R-Adapt, a framework that freezes all pre-trained weights and introduces minimal adaptations only in the initial layers. R-Adapt supports training-free, model-guided, and data-driven paradigms and is shown to achieve state-of-the-art robustness-accuracy balance on 18 datasets across diverse tasks, with efficient generalization to large VLMs such as LLaVA and Qwen-VL.

Significance. If the localization of robustness to shallow layers and the effectiveness of minimal initial-layer adaptations hold, the work would offer a practical and insight-driven approach to reconciling the robustness-accuracy trade-off in VLMs without full retraining, potentially enabling more reliable deployment of large vision-language models.

major comments (1)

Abstract: the central claims rest on an internal analysis of fine-tuned models leading to the method, but the abstract lacks concrete data, error bars, or verification details for the spectral bias and attention claims; full paper required to assess support for the localization and R-Adapt effectiveness assertions.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful review and the opportunity to clarify our manuscript. We address the single major comment below regarding the abstract's content and structure.

Referee: Abstract: the central claims rest on an internal analysis of fine-tuned models leading to the method, but the abstract lacks concrete data, error bars, or verification details for the spectral bias and attention claims; full paper required to assess support for the localization and R-Adapt effectiveness assertions.

Authors: We agree that the abstract is a high-level summary and does not include specific numerical results, error bars, or detailed verification metrics, which is standard practice to maintain brevity and readability. The full manuscript provides the supporting evidence: Section 3 details the internal analysis of adversarially fine-tuned models, including quantitative measurements of low-frequency spectral bias (with explicit metrics and visualizations) and input-insensitive attention patterns across layers; Section 4 reports comprehensive results on 18 datasets with error bars, ablation studies, and comparisons under multiple attacks, directly verifying the localization of robustness to shallow layers and the effectiveness of R-Adapt. These sections substantiate the claims without relying solely on the abstract. If the referee recommends, we can add 1-2 key quantitative highlights (e.g., robustness gains on representative datasets) to the abstract in a revised version.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity identified

The paper's claims rest on empirical observations from adversarially fine-tuned VLMs, identifying robustness localization in shallow layers via spectral bias and attention patterns, followed by the R-Adapt proposal that freezes weights and adapts only initial layers. With only the abstract available, no equations, derivations, or self-citations are present that could reduce any prediction or result to its inputs by construction. The argument is self-contained, driven by analysis rather than definitional or fitted circularity, consistent with a score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract does not introduce or rely on any explicit free parameters, mathematical axioms, or newly invented entities; the approach is empirical and adaptation-based.

pith-pipeline@v0.9.0 · 5543 in / 1086 out tokens · 46066 ms · 2026-05-15T11:46:45.751276+00:00 · methodology
