
arXiv: 2603.12799 · v2 · submitted 2026-03-13 · cs.CV

What Makes VLMs Robust? Towards Reconciling Robustness and Accuracy in Vision-Language Models

Pith reviewed 2026-05-15 11:46 UTC · model grok-4.3

classification: cs.CV
keywords: adversarial robustness, vision-language models, shallow layers, robustness-accuracy tradeoff, fine-tuning, spectral bias, attention patterns

The pith

Adversarial robustness in vision-language models concentrates in their shallow layers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates the long-standing trade-off in which making vision-language models robust to adversarial attacks reduces their accuracy on clean data. Analysis of adversarially fine-tuned models reveals that robustness mechanisms are not spread evenly across depth but cluster mainly in the shallow layers, supported by a bias toward low-frequency patterns and attention that stays largely insensitive to the specific input. Updates to deeper layers, by contrast, tend to degrade both clean accuracy and robustness under attack. From this finding, the authors introduce R-Adapt, a framework that freezes the original pre-trained weights and applies only small, targeted changes in the initial layers. The authors report that this yields strong robustness while keeping clean performance high across 18 datasets and diverse tasks.

Core claim

Adversarial robustness is not uniformly distributed across network depth but is primarily localized within the shallow layers, driven by a low-frequency spectral bias and input-insensitive attention patterns, while updates to the deep layers undermine both clean accuracy and robust generalization. This observation enables the R-Adapt framework that freezes all pre-trained weights and introduces minimal adaptations only in the initial layers to reconcile the two goals.
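
The freeze-and-adapt idea can be sketched in a few lines. This is a toy illustration only, not the authors' implementation: the `Layer` class, the `attach_shallow_adapters` helper, and the `num_shallow` and `adapter_size` parameters are all hypothetical names, and the abstract does not say what form the adaptations take or how many layers count as "initial".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Layer:
    """Toy stand-in for one network block (hypothetical structure)."""
    index: int
    weights: List[float]                   # pre-trained weights
    trainable: bool = True                 # whether `weights` may be updated
    adapter: Optional[List[float]] = None  # small trainable add-on, if any


def attach_shallow_adapters(layers, num_shallow, adapter_size=2):
    """Freeze all pre-trained weights; add zero-initialised adapters
    to the first `num_shallow` layers only (assumed adapter form)."""
    for layer in layers:
        layer.trainable = False
        if layer.index < num_shallow:
            layer.adapter = [0.0] * adapter_size
        else:
            layer.adapter = None
    return layers


def trainable_parameter_count(layers):
    """Count parameters that can still change after freezing."""
    total = 0
    for layer in layers:
        if layer.trainable:
            total += len(layer.weights)
        if layer.adapter is not None:
            total += len(layer.adapter)
    return total
```

Under these assumptions, only the adapter entries in the first few layers remain trainable, so the trainable parameter count is independent of the size of the frozen backbone.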

What carries the argument

Localization of adversarial robustness to shallow layers through low-frequency spectral bias and input-insensitive attention patterns, realized by the R-Adapt method that adapts only the first layers while freezing the remainder.
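
One way to make "low-frequency spectral bias" concrete is to measure what fraction of a signal's spectral energy sits below a cutoff frequency. The sketch below does this for a 1-D signal with a direct DFT. The function names and the `cutoff` parameter are illustrative assumptions; the abstract does not state which spectral metric the paper uses.

```python
import cmath
import math


def dft_power(signal):
    """Power spectrum |X[k]|^2 of a real 1-D signal via a direct DFT."""
    n = len(signal)
    power = []
    for k in range(n):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        power.append(abs(coeff) ** 2)
    return power


def low_frequency_ratio(signal, cutoff):
    """Fraction of non-DC energy in frequency bins 1..cutoff.

    Only bins 1..n//2 are considered, because the spectrum of a
    real signal is symmetric about the Nyquist bin.
    """
    power = dft_power(signal)
    half = power[1:len(signal) // 2 + 1]
    total = sum(half)
    if total == 0:
        return 0.0
    return sum(half[:cutoff]) / total
```

A slowly varying signal should score close to 1 and a rapidly oscillating one close to 0. Applying a ratio of this kind to per-layer features or weight updates would be one possible way to compare shallow and deep layers.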

If this is right

  • Minimal adaptations confined to shallow layers can deliver high adversarial robustness without large losses in clean accuracy.
  • The same shallow-layer approach works under training-free, model-guided, and data-driven adaptation paradigms.
  • The method scales directly to large models such as LLaVA and Qwen-VL.
  • Strong performance is maintained across 18 datasets and multiple attack types.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If robustness lives mostly in shallow layers, full-network adversarial training may be unnecessary and computationally wasteful.
  • Layer-wise localization analysis could be applied to other vision or multimodal architectures to find similar efficiency gains.
  • Shallow-layer interventions might also improve robustness properties outside adversarial settings, such as resistance to distribution shifts.

Load-bearing premise

The localization of robustness to shallow layers observed in adversarially fine-tuned models is general enough that minimal adaptations only in the initial layers suffice to balance robustness and accuracy without changes to deeper layers.

What would settle it

An experiment in which adapting only the deep layers of a VLM produces a better robustness-accuracy trade-off than adapting only the shallow layers would contradict the localization claim.
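
The comparison described above can be phrased as a dominance check on (clean accuracy, robust accuracy) pairs. The sketch below is one hedged reading of "a better robustness-accuracy trade-off": the `TradeOff` type, the `dominates` rule, and the verdict labels are assumptions made for illustration, and the paper may score trade-offs differently.

```python
from typing import NamedTuple


class TradeOff(NamedTuple):
    """Measured outcome of one adaptation strategy (hypothetical type)."""
    clean_acc: float
    robust_acc: float


def dominates(a, b):
    """True if `a` is at least as good as `b` on both metrics
    and strictly better on at least one."""
    at_least_as_good = a.clean_acc >= b.clean_acc and a.robust_acc >= b.robust_acc
    strictly_better = a.clean_acc > b.clean_acc or a.robust_acc > b.robust_acc
    return at_least_as_good and strictly_better


def localization_verdict(shallow_only, deep_only):
    """Classify the outcome of the shallow-vs-deep experiment."""
    if dominates(deep_only, shallow_only):
        return "contradicts"
    if dominates(shallow_only, deep_only):
        return "consistent"
    return "inconclusive"
```

When neither configuration dominates the other, this rule returns "inconclusive", and a scalar score that weights the two metrics would be needed to decide.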

Original abstract

Achieving adversarial robustness in Vision-Language Models (VLMs) inevitably compromises accuracy on clean data, presenting a long-standing and challenging trade-off. In this work, we revisit this trade-off by investigating a fundamental question: What makes VLMs robust? Through a detailed analysis of adversarially fine-tuned models, we examine how robustness mechanisms function internally and how they interact with clean accuracy. Our analysis reveals that adversarial robustness is not uniformly distributed across network depth. Instead, unexpectedly, it is primarily localized within the shallow layers, driven by a low-frequency spectral bias and input-insensitive attention patterns. Meanwhile, updates to the deep layers tend to undermine both clean accuracy and robust generalization. Motivated by these insights, we propose Adversarial Robustness Adaptation (R-Adapt), a simple yet effective framework that freezes all pre-trained weights and introduces minimal, insight-driven adaptations only in the initial layers. This design achieves an exceptional balance between adversarial robustness and clean accuracy. R-Adapt further supports training-free, model-guided, and data-driven paradigms, offering flexible pathways to seamlessly equip standard models with robustness. Extensive evaluations on 18 datasets and diverse tasks demonstrate our state-of-the-art performance under various attacks. Notably, R-Adapt generalizes efficiently to large vision-language models (e.g., LLaVA and Qwen-VL) to enhance their robustness. Our project page is available at https://summu77.github.io/R-Adapt.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript claims that adversarial robustness in Vision-Language Models is not uniformly distributed but localized primarily in shallow layers, driven by low-frequency spectral bias and input-insensitive attention patterns, while updates to deep layers undermine both clean accuracy and robust generalization. Motivated by this analysis of adversarially fine-tuned models, it proposes R-Adapt, a framework that freezes all pre-trained weights and introduces minimal adaptations only in the initial layers. R-Adapt supports training-free, model-guided, and data-driven paradigms and is shown to achieve state-of-the-art robustness-accuracy balance on 18 datasets across diverse tasks, with efficient generalization to large VLMs such as LLaVA and Qwen-VL.

Significance. If the localization of robustness to shallow layers and the effectiveness of minimal initial-layer adaptations hold, the work would offer a practical and insight-driven approach to reconciling the robustness-accuracy trade-off in VLMs without full retraining, potentially enabling more reliable deployment of large vision-language models.

major comments (1)
  1. Abstract: the central claims rest on an internal analysis of fine-tuned models leading to the method, but the abstract lacks concrete data, error bars, or verification details for the spectral bias and attention claims; full paper required to assess support for the localization and R-Adapt effectiveness assertions.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful review and the opportunity to clarify our manuscript. We address the single major comment below regarding the abstract's content and structure.

Point-by-point responses
  1. Referee: Abstract: the central claims rest on an internal analysis of fine-tuned models leading to the method, but the abstract lacks concrete data, error bars, or verification details for the spectral bias and attention claims; full paper required to assess support for the localization and R-Adapt effectiveness assertions.

    Authors: We agree that the abstract is a high-level summary and does not include specific numerical results, error bars, or detailed verification metrics, which is standard practice to maintain brevity and readability. The full manuscript provides the supporting evidence: Section 3 details the internal analysis of adversarially fine-tuned models, including quantitative measurements of low-frequency spectral bias (with explicit metrics and visualizations) and input-insensitive attention patterns across layers; Section 4 reports comprehensive results on 18 datasets with error bars, ablation studies, and comparisons under multiple attacks, directly verifying the localization of robustness to shallow layers and the effectiveness of R-Adapt. These sections substantiate the claims without relying solely on the abstract. If the referee recommends, we can add 1-2 key quantitative highlights (e.g., robustness gains on representative datasets) to the abstract in a revised version.

    Revision: partial

Circularity Check

0 steps flagged

No significant circularity identified

Full rationale

The paper's claims rest on empirical observations from adversarially fine-tuned VLMs, identifying robustness localization in shallow layers via spectral bias and attention patterns, followed by the R-Adapt proposal that freezes weights and adapts only initial layers. With only the abstract available, no equations, derivations, or self-citations are present that could reduce any prediction or result to its inputs by construction. The argument is self-contained, driven by analysis rather than definitional or fitted circularity, consistent with a score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract does not introduce or rely on any explicit free parameters, mathematical axioms, or newly invented entities; the approach is empirical and adaptation-based.

