
arxiv: 2604.14973 · v1 · submitted 2026-04-16 · 💻 cs.CR · cs.CV

Robustness of Vision Foundation Models to Common Perturbations

Pith reviewed 2026-05-10 11:02 UTC · model grok-4.3

classification 💻 cs.CR cs.CV
keywords vision foundation models · robustness metrics · image perturbations · embedding vectors · downstream tasks · fine-tuning · JPEG compression · contrast adjustment

The pith

Vision foundation models are generally non-robust to common perturbations such as JPEG compression and brightness or contrast adjustments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper conducts the first systematic evaluation of six large-scale vision foundation models on how their image embeddings respond to nine everyday editing operations. It introduces three new robustness metrics along with five mathematical properties those metrics should satisfy, then uses the metrics to measure embedding changes and connect them to real drops in downstream task performance. The work shows that these models from major providers are consistently sensitive to the perturbations and that the sensitivity can be reduced through targeted fine-tuning without harming the models' original capabilities. A reader would care because these embeddings power many practical applications, so fragility to routine image changes could make those applications unreliable in normal use.

Core claim

We present the first systematic study on foundation models' robustness to common perturbations that alter embedding vectors. We propose three robustness metrics and formulate five desired mathematical properties for these metrics, analyzing which properties they satisfy or violate. Using these metrics, we evaluate six industry-scale foundation models across nine common perturbation categories, finding them generally non-robust. We also show that common perturbations degrade downstream application performance and that robustness values can predict performance impacts. Finally, we propose a fine-tuning approach to improve robustness without sacrificing utility.

What carries the argument

Three robustness metrics, each checked against five mathematical properties, that quantify changes in embedding vectors caused by image perturbations.
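To make the idea of quantifying an embedding change concrete, here is a minimal sketch of one such measurement: the average cosine similarity between an image's clean embedding and its embeddings under a contrast adjustment. Average cosine similarity appears in the paper's figure captions, but everything else here is an assumption for illustration: `toy_embed` is a hypothetical stand-in for a real foundation model, `adjust_contrast` is a simple hand-written perturbation, and the list of contrast factors is arbitrary.

```python
import math
import random


def adjust_contrast(pixels, factor):
    """Scale each pixel's deviation from the mean intensity, clipped to [0, 1]."""
    mean = sum(pixels) / len(pixels)
    return [min(1.0, max(0.0, mean + factor * (p - mean))) for p in pixels]


def toy_embed(pixels, weights):
    """Hypothetical stand-in encoder: a fixed random linear projection."""
    return [sum(w * p for w, p in zip(row, pixels)) for row in weights]


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


rng = random.Random(0)
pixels = [rng.random() for _ in range(48)]  # flattened toy "image"
weights = [[rng.gauss(0, 1) for _ in pixels] for _ in range(8)]

clean = toy_embed(pixels, weights)
factors = [0.5, 0.75, 1.0, 1.25, 1.5]  # assumed contrast factors; 1.0 is a no-op
sims = [
    cosine_similarity(clean, toy_embed(adjust_contrast(pixels, k), weights))
    for k in factors
]
avg_sim = sum(sims) / len(sims)
print(f"average cosine similarity: {avg_sim:.4f}")
```

A value near 1 means the perturbation barely rotates the embedding; lower values indicate a larger change. With a real model, `toy_embed` would be replaced by the model's image encoder.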

If this is right

  • Downstream tasks experience measurable drops in accuracy when inputs undergo common perturbations.
  • The numerical robustness values can predict the size of those accuracy drops.
  • Fine-tuning on perturbed examples raises robustness scores while leaving original task utility intact.
  • Models from different providers exhibit similar patterns of sensitivity across the nine perturbation types.
  • Embedding-based applications become less reliable unless robustness is explicitly addressed.
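One simple way to probe the claim that robustness values track accuracy is to correlate the two across perturbation types. The sketch below computes a Pearson correlation coefficient by hand. The numbers are placeholders invented for illustration and are not results from the paper, and it is an assumption here that a larger DivergenceRadius means a less robust embedding.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Placeholder values for illustration only; NOT taken from the paper.
# One entry per hypothetical perturbation type.
divergence_radius = [0.05, 0.10, 0.20, 0.35, 0.50]
accuracy_under_perturbation = [0.74, 0.71, 0.63, 0.52, 0.40]

r = pearson_r(divergence_radius, accuracy_under_perturbation)
print(f"Pearson r = {r:.3f}")
```

A strongly negative coefficient on real measurements would be consistent with the predictive claim; a coefficient near zero would count against it.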

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Applications that rely on these embeddings may need to preprocess inputs or adopt the fine-tuning step to maintain consistent behavior.
  • The metrics could serve as a quick benchmark when comparing new training methods or model architectures for sensitivity to everyday image variation.
  • Persistent non-robustness might point to deeper limitations in how current training data and objectives handle natural image variability.

Load-bearing premise

The three proposed robustness metrics accurately capture how perturbations affect performance in actual downstream applications.

What would settle it

A new model that scores high on the proposed robustness metrics but still shows large drops in downstream accuracy when the same perturbations are applied would falsify the claim that the metrics track practical impact.

Figures

Figures reproduced from arXiv: 2604.14973 by Cheng Hong, Hongbin Liu, Neil Zhenqiang Gong, Zhengyuan Jiang.

Figure 1: An example for the worst-robustness property.
Figure 2: An example to illustrate the minimum enclosing ball. The …
Figure 3: Comparing random sampling and equally-spaced sam…
Figure 4: Average DivergenceRadius of ImageNet testing images for different foundation models and perturbation functions.
Figure 5: Accuracy under perturbation ACCp vs. DivergenceRadius of ImageNet testing images for (a) zero-shot classification and (b) linear-probe classification when different perturbation functions are used. Zero-shot classification is based on the CLIP ViT-L/14 foundation model and linear-probe classification is based on the DINO v2 ViT-g/14 foundation model.
Figure 6: Accuracy under perturbation ACCp vs. DivergenceRadius of Food101 testing images for (a) zero-shot classification and (b) linear-probe classification when different perturbation functions are used. Zero-shot classification is based on the CLIP ViT-L/14 foundation model and linear-probe classification is based on the DINO v2 ViT-g/14 foundation model.
Figure 7: (a) Average cosine similarity, (b) average DivergenceRa…
Figure 8: Root mean squared error under perturbation …
Figure 9: Mean squared error of predicting an image’s …
Figure 10: Mean squared error of predicting an image’s …
Figure 12: Average cosine similarity of ImageNet testing images for different foundation models and perturbation functions.
Figure 13: Average DivergenceRadius of Food101 testing images for different foundation models and perturbation functions.
Figure 14: Average cosine similarity of Food101 testing images for different foundation models and perturbation functions.
Figure 15: Average DivergenceRadius of NYU-Depth V2 testing images for different foundation models and perturbation functions.
Figure 16: Average cosine similarity of images in NYU-Depth V2 for different foundation models and perturbation functions.
Figure 17: Accuracy under perturbation ACCp vs. cosine similarity of ImageNet and Food101 testing images for zero-shot classification and linear-probe classification when different perturbation functions are used. Zero-shot classification is based on the CLIP ViT-L/14 foundation model and linear-probe classification is based on the DINO v2 ViT-g/14 foundation model.
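The Figure 2 and Figure 4 captions suggest that DivergenceRadius relates to a minimum enclosing ball around an image's perturbed embeddings. Assuming that reading (the page does not give the definition), the radius can be approximated as sketched below. This sketch swaps in the simple Bădoiu-Clarkson iteration rather than an exact solver, and `enclosing_ball` and `demo_embeddings` are hypothetical names for illustration.

```python
import math


def enclosing_ball(points, iters=2000):
    """Approximate the minimum enclosing ball with Badoiu-Clarkson steps.

    Each step moves the center a shrinking fraction of the way toward the
    currently farthest point. The returned radius always covers every
    point, so it is an upper bound on the true minimum radius.
    """
    center = list(points[0])
    for t in range(1, iters + 1):
        farthest = max(points, key=lambda p: math.dist(p, center))
        center = [c + (f - c) / (t + 1) for c, f in zip(center, farthest)]
    radius = max(math.dist(p, center) for p in points)
    return center, radius


# Toy 2-D "embeddings" of one image under four perturbation strengths.
demo_embeddings = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
center, radius = enclosing_ball(demo_embeddings)
print(f"center ~ {center}, radius ~ {radius:.4f}")
```

For these four points the exact ball is the unit circle around the origin, so the approximation should land close to radius 1. More iterations tighten the estimate at the cost of runtime.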
Original abstract

A vision foundation model outputs an embedding vector for an image, which can be affected by common editing operations (e.g., JPEG compression, brightness, contrast adjustments). These common perturbations alter embedding vectors and may impact the performance of downstream tasks using these embeddings. In this work, we present the first systematic study on foundation models' robustness to such perturbations. We propose three robustness metrics and formulate five desired mathematical properties for these metrics, analyzing which properties they satisfy or violate. Using these metrics, we evaluate six industry-scale foundation models (OpenAI, Meta) across nine common perturbation categories, finding them generally non-robust. We also show that common perturbations degrade downstream application performance (e.g., classification accuracy) and that robustness values can predict performance impacts. Finally, we propose a fine-tuning approach to improve robustness without sacrificing utility.
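The abstract does not describe the fine-tuning objective. As a purely illustrative sketch, and not the authors' method, one generic way to trade off robustness against utility is to penalize the gap between clean and perturbed embeddings while keeping the fine-tuned embeddings close to those of the frozen original model. The function name, the squared-distance form, and the weight `lam` are all assumptions.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def robust_finetune_loss(clean, perturbed, frozen, lam=1.0):
    """Hypothetical objective: consistency term plus utility-preservation term.

    clean     -- fine-tuned model's embeddings of the clean images
    perturbed -- fine-tuned model's embeddings of the perturbed images
    frozen    -- original (frozen) model's embeddings of the clean images
    lam       -- assumed weight on the utility-preservation term
    """
    n = len(clean)
    consistency = sum(sq_dist(p, c) for p, c in zip(perturbed, clean)) / n
    utility = sum(sq_dist(c, f) for c, f in zip(clean, frozen)) / n
    return consistency + lam * utility


# One image, 2-D embeddings: the perturbation shifts the embedding by 1 unit.
loss = robust_finetune_loss([[1.0, 0.0]], [[0.0, 0.0]], [[1.0, 0.0]])
print(loss)
```

In an actual training loop the two terms would be computed on batches with an autodiff framework; the plain-Python version only shows the shape of the trade-off.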

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper claims to perform the first systematic study on the robustness of vision foundation models to common perturbations. It proposes three robustness metrics and analyzes their satisfaction of five mathematical properties. The evaluation covers six industry-scale models from OpenAI and Meta across nine perturbation categories, concluding they are generally non-robust. It demonstrates that these perturbations degrade downstream application performance like classification accuracy, that the robustness metrics can predict such performance impacts, and proposes a fine-tuning approach to improve robustness without sacrificing utility.

Significance. If the results hold, the paper makes a significant contribution by identifying vulnerabilities in widely-used vision foundation models to everyday perturbations, which is important for applications relying on their embeddings. The analysis of mathematical properties for the new metrics and the empirical demonstration of their predictive power for downstream tasks are notable strengths. The proposed fine-tuning method adds practical value. This could encourage the community to prioritize robustness in model development.

minor comments (2)
  1. The abstract outlines the contributions but would benefit from briefly specifying the exact number of models evaluated and perturbation categories to provide a more complete overview at a glance.
  2. In the evaluation results, include error bars, standard deviations, or statistical significance tests for the robustness metric values and downstream performance degradations to better support the 'generally non-robust' conclusion and the predictive claims.
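The second comment asks for uncertainty estimates. As a minimal sketch of one way to supply them, a percentile bootstrap gives a confidence interval for the mean of per-image robustness values. The scores below are deterministic placeholders, not data from the paper, and `bootstrap_ci` is a hypothetical helper name.

```python
import random
import statistics


def bootstrap_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_boot)
    )
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return statistics.fmean(values), lower, upper


# Placeholder per-image robustness scores; NOT taken from the paper.
scores = [0.10 + 0.01 * i for i in range(20)]
mean, lower, upper = bootstrap_ci(scores)
print(f"mean = {mean:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

Reporting the interval alongside each bar or point would let readers judge whether differences between models or perturbation types exceed sampling noise.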

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of our work, recognition of its significance, and recommendation for minor revision. The referee's description accurately reflects the manuscript's contributions regarding robustness metrics, evaluation of vision foundation models, downstream impact analysis, and the proposed fine-tuning method. No specific major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity; empirical study with independent metric validation

full rationale

The paper proposes three robustness metrics and five mathematical properties, then explicitly analyzes which properties each metric satisfies or violates. It evaluates six external foundation models on nine perturbation categories using direct measurements, demonstrates downstream task degradation (e.g., classification accuracy) via separate experiments, and shows observed correlations between robustness scores and performance drops. A fine-tuning method is proposed to improve robustness. No derivation reduces by construction to fitted parameters, self-definitions, or self-citation chains; all load-bearing claims rest on external data and explicit property checks rather than renaming or tautological prediction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no explicit free parameters, axioms, or invented entities are described; the work relies on standard empirical evaluation practices.


