pith. machine review for the scientific record.

arxiv: 2605.11107 · v1 · submitted 2026-05-11 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links · Lean Theorem

Birds of a Feather Flock Together: Background-Invariant Representations via Linear Structure in VLMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 07:21 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords: vision-language models · spurious correlations · background invariance · linear additivity · Waterbirds · synthetic data · robust representations

The pith

Linear additivity in VLM embedding spaces enables background-invariant representations from synthetic data alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models exhibit a property where the embedding of a full scene is close to the sum of separate foreground and background embeddings. The paper exploits this to break foreground-background correlations by training on synthetic composites that pair the same object with many different backgrounds. The resulting models reach the first reported worst-group accuracy above 90 percent on Waterbirds even when every training image shows the object against its correlated background and no counterexamples exist. The method needs no real-world images that already contain the desired invariance.
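The compositing step itself is concrete enough to sketch. Below is a minimal outline of pasting one segmented foreground onto a new background at a random scale, assuming PIL; the scale range, mask threshold, and blur radius echo values quoted in the appendix excerpts at the bottom of this page, but the function and its parameters are illustrative rather than the authors' released code.

```python
# Sketch of one synthetic composite: the same segmented foreground pasted
# onto a different background. Assumes PIL; parameters are illustrative.
import random
from PIL import Image, ImageFilter

def composite(foreground: Image.Image, mask: Image.Image,
              background: Image.Image, size: int = 224,
              scale_range=(0.6, 0.8)) -> Image.Image:
    """Paste a segmented foreground onto a background at a random scale."""
    bg = background.convert("RGB").resize((size, size))
    # Harden the mask, then soften its edge slightly (hard threshold
    # followed by a small Gaussian blur, as described in the appendix).
    hard = mask.convert("L").point(lambda p: 255 if p > 100 else 0)
    soft = hard.filter(ImageFilter.GaussianBlur(radius=1))
    # Scale the foreground to a random fraction of the target resolution.
    s = random.uniform(*scale_range)
    w, h = foreground.size
    ratio = s * size / max(w, h)
    new_size = (max(1, int(w * ratio)), max(1, int(h * ratio)))
    fg = foreground.convert("RGB").resize(new_size, Image.LANCZOS)
    m = soft.resize(new_size, Image.LANCZOS)
    # Centre the foreground on the background and blend through the mask.
    offset = ((size - new_size[0]) // 2, (size - new_size[1]) // 2)
    bg.paste(fg, offset, m)
    return bg
```

Repeating this for many backgrounds per object is what breaks the foreground-background correlation in the synthetic training set.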

Core claim

The high linear additivity of VLM embedding spaces permits reliable decomposition of a scene embedding into foreground and background components. This decomposition supports a pre-training procedure that builds background-invariant representations solely from synthetic data, yielding over 90 percent worst-group accuracy on Waterbirds under 100 percent spurious correlation without any minority-group examples in the training set.
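For reference, the headline metric is worst-group accuracy: the minimum per-group accuracy, where a Waterbirds group is a (bird class, background type) pair and waterbird-on-land and landbird-on-water are the minority groups. A minimal sketch of the standard definition, not code from the paper:

```python
import numpy as np

def worst_group_accuracy(preds: np.ndarray, labels: np.ndarray,
                         groups: np.ndarray) -> float:
    """Minimum per-group accuracy; on Waterbirds a group is a
    (bird class, background type) pair."""
    accs = [(preds[groups == g] == labels[groups == g]).mean()
            for g in np.unique(groups)]
    return float(min(accs))
```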

What carries the argument

Linear additivity in VLM embedding spaces, which supports an additive decomposition of scene embeddings into foreground and background components and thereby lets synthetic composites stand in for real training data.
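Schematically, the additive picture this leans on can be written as below; the rendering is illustrative, and the operators in the paper's own Equations 2 and 6 may differ in detail.

```latex
z(\mathrm{scene}_k) \approx z_{\mathrm{fg}} + z_{\mathrm{bg},k} + \epsilon_k,
\qquad
a \;=\; \frac{1}{K}\sum_{k=1}^{K} z(\mathrm{scene}_k)
\;\approx\; z_{\mathrm{fg}} + \bar{z}_{\mathrm{bg}} + \bar{\epsilon},
\qquad
\operatorname{Var}(\bar{\epsilon}) \propto \frac{1}{K}.
```

Averaging the same object over K randomized backgrounds keeps the foreground term while the background-specific residual shrinks, which is the behaviour the Figure 1 ablation reports.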

If this is right

  • Models reach high worst-group accuracy on spurious-correlation benchmarks without ever seeing real minority-group examples.
  • The learned representations transfer from synthetic pre-training to real images.
  • The approach applies to standard VLMs such as CLIP and SigLIP without further architectural changes.
  • No access to real-world debiased datasets is required for the invariance property.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same additive decomposition could be tested on other spurious factors such as texture or lighting if they also combine linearly in embedding space.
  • Generating more complex synthetic scenes with multiple foreground objects might extend the method beyond single-object classification.
  • Checking whether newer VLMs retain the same degree of linear additivity would indicate how broadly the technique applies.

Load-bearing premise

VLM embedding spaces maintain high linear additivity that allows clean separation of foreground from background components.

What would settle it

Measure whether the embedding of an object image plus the embedding of a background image closely matches the embedding of their composite image across many pairs; large consistent errors would show the decomposition does not hold.
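This check is straightforward to run once encoder embeddings are in hand. A minimal sketch, assuming unnormalized embeddings of an object-only image, a background-only image, and their composite are already computed; the function name and the choice of metrics are illustrative:

```python
import numpy as np

def additivity_residual(z_fg: np.ndarray, z_bg: np.ndarray,
                        z_comp: np.ndarray) -> tuple[float, float]:
    """Compare the composite embedding to the sum of the foreground-only
    and background-only embeddings for one (object, background) pair."""
    z_sum = z_fg + z_bg
    cos = float(np.dot(z_sum, z_comp) /
                (np.linalg.norm(z_sum) * np.linalg.norm(z_comp) + 1e-8))
    rel_err = float(np.linalg.norm(z_sum - z_comp) /
                    (np.linalg.norm(z_comp) + 1e-8))
    return cos, rel_err
```

High cosine similarity and small relative error across many pairs would support the decomposition; large, consistent errors would undermine it.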

Figures

Figures reproduced from arXiv: 2605.11107 by Mark Thomas, Youssef Zaazou.

Figure 1
Figure 1: Ablation on K. Left: Anchor vector cosine similarity to foreground vs. background text prompts, indicating that the application of Equation 2 preserves foreground signals while suppressing background signals. Right: Residual background variance Var(ε). As K increases, Var(ε) exhibits a strict 1/K decay, confirming the relationship in Equation 6.
Figure 2
Figure 2: Overview of the BAP Pipeline. The method consists of two sequential phases. Phase 1 (Top): Anchor Extraction. A foreground-specific anchor a is generated by averaging the embeddings z of a fixed object composited onto K randomized backgrounds using a frozen teacher model f*_θ. Phase 2 (Bottom): Robust Alignment. The student encoder f_θ is optimized via L_align to map the same object across M novel contexts…
Figure 3
Figure 3: Effect of parameter M on BAP using CLIP at low values of N, showing that high performance may be achieved with as few as 50 distinct segmented items. The results of the ablations in…
Figure 4
Figure 5
Figure 5: Performance vs. N Total Segmented Foregrounds Used. Above, we investigate the effect of the number of selected foregrounds (birds) used by BAP on performance on the Waterbirds benchmark. The x-axis indicates how many total birds were sampled for each BAP run and the y-axes indicate average and worst-group accuracy. We observe WGA increases sharply when going from 100 to 500 total birds but then…
Figure 6
Figure 6: Impact of background randomization scaling on model robustness under data-scarce…
Figure 7
Figure 7: Left: progression of the alignment loss during BAP. Right: progression of average and…
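The two-phase pipeline summarized in Figure 2 can be sketched compactly. The outline below assumes PyTorch encoders that map image batches to embedding vectors; the cosine-distance form of the alignment loss is an assumption here, and the paper's exact L_align may differ.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def extract_anchor(teacher, composites: torch.Tensor) -> torch.Tensor:
    """Phase 1 (sketch): average the frozen teacher's embeddings of one
    object composited onto K randomized backgrounds."""
    z = teacher(composites)                     # (K, d)
    return F.normalize(z.mean(dim=0), dim=-1)   # anchor a, shape (d,)

def alignment_loss(student, novel_composites: torch.Tensor,
                   anchor: torch.Tensor) -> torch.Tensor:
    """Phase 2 (sketch): pull the student's embeddings of the same object
    in M novel contexts toward the anchor (assumed cosine-distance form)."""
    z = F.normalize(student(novel_composites), dim=-1)   # (M, d)
    return (1.0 - z @ anchor).mean()
```

Phase 1's averaging is also what drives the Var(ε) decay in Figure 1: the foreground term is common to all K composites while the background-specific residuals average toward zero at a 1/K rate.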
Original abstract

Vision-language models (VLMs), such as CLIP and SigLIP 2, are widely used for image classification, yet their vision encoders remain vulnerable to systematic biases that undermine robustness. In particular, correlations between foreground objects and their backgrounds constitute a salient and practically important class of spurious dependencies. In this work, we revisit the well-known property of high linear additivity in VLM embedding spaces and show that it enables a decomposition of scene representations into foreground and background components. Leveraging this insight, we introduce a pre-training approach that exploits this property to construct background-invariant representations using synthetic data. Our method achieves, to our knowledge, the first worst-group accuracy exceeding $90\%$ on Waterbirds under perfect ($100\%$) spurious correlation (i.e., no minority-group examples in the training data). Furthermore, it demonstrates strong sim-to-real transfer and requires no access to real-world debiased data, making it practical for real-world deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes exploiting the known high linear additivity property of VLM embedding spaces (e.g., CLIP, SigLIP) to decompose scene representations into foreground and background components. It introduces a synthetic-data pre-training procedure that enforces background invariance on the foreground component, claiming the first worst-group accuracy above 90% on Waterbirds under 100% spurious correlation (zero minority-group examples in training) together with strong sim-to-real transfer and no requirement for real debiased data.

Significance. If the decomposition is shown to be sufficiently exact and the invariance transfers reliably, the result would constitute a practically important advance in robust VLM classification, because it removes the need for real-world minority examples or post-hoc debiasing while still reaching high worst-group performance on a canonical spurious-correlation benchmark.

major comments (3)
  1. [Method] Method section (linear decomposition step): the paper treats the foreground/background separation as sufficiently clean for the 100% spurious case, yet provides neither quantitative bounds on residual background leakage nor an ablation measuring how much background signal remains in the extracted foreground vector; this is load-bearing for the central claim because any non-zero residual would allow the classifier to exploit background cues on real test images.
  2. [Experiments] Experiments (Waterbirds 100% spurious setting): the reported >90% worst-group accuracy is presented without error bars across multiple random seeds, without an ablation that varies the quality or diversity of the synthetic backgrounds, and without a direct comparison to a baseline that uses the same synthetic data but omits the linear decomposition; these omissions make it impossible to isolate whether the performance stems from the claimed mechanism.
  3. [Experiments] Section 4.3 (sim-to-real transfer): the transfer results are shown only for the final model; an intermediate result demonstrating that the foreground component alone (before any classifier training) already exhibits reduced background sensitivity on real images would strengthen the mechanistic claim.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'to our knowledge' for the 90% claim should be accompanied by a brief citation to the closest prior numbers on the same 100% spurious split.
  2. [Method] Notation: define the exact linear operator used for decomposition (e.g., the projection matrix or subtraction formula) in a single displayed equation rather than inline text.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and will incorporate the suggested analyses and ablations to strengthen the presentation of the linear decomposition and its empirical validation.

Point-by-point responses
  1. Referee: [Method] Method section (linear decomposition step): the paper treats the foreground/background separation as sufficiently clean for the 100% spurious case, yet provides neither quantitative bounds on residual background leakage nor an ablation measuring how much background signal remains in the extracted foreground vector; this is load-bearing for the central claim because any non-zero residual would allow the classifier to exploit background cues on real test images.

    Authors: We agree that explicit quantification of residual leakage is important for supporting the central claim. In the revised manuscript we will add quantitative bounds on background leakage in the foreground vectors, derived from the known linearity properties of VLM embeddings, together with an ablation that measures residual background signal via cosine similarity to background-only directions and probe classification accuracy on held-out synthetic backgrounds. revision: yes

  2. Referee: [Experiments] Experiments (Waterbirds 100% spurious setting): the reported >90% worst-group accuracy is presented without error bars across multiple random seeds, without an ablation that varies the quality or diversity of the synthetic backgrounds, and without a direct comparison to a baseline that uses the same synthetic data but omits the linear decomposition; these omissions make it impossible to isolate whether the performance stems from the claimed mechanism.

    Authors: We concur that these controls are needed to isolate the contribution of the linear decomposition. The revision will report worst-group accuracy with standard deviations over at least five random seeds, include an ablation varying synthetic background diversity and quality, and add a direct baseline that performs the same synthetic-data pre-training but without the foreground/background decomposition step. revision: yes

  3. Referee: [Experiments] Section 4.3 (sim-to-real transfer): the transfer results are shown only for the final model; an intermediate result demonstrating that the foreground component alone (before any classifier training) already exhibits reduced background sensitivity on real images would strengthen the mechanistic claim.

    Authors: We will strengthen the mechanistic evidence by adding an intermediate analysis in Section 4.3 that evaluates the foreground component in isolation on real images before classifier training. This will include metrics such as the correlation of foreground embeddings with background labels and the accuracy of a linear probe trained to predict background attributes from the foreground vectors on the real Waterbirds test set. revision: yes
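The leakage analysis proposed here is easy to prototype. A minimal sketch using scikit-learn, assuming foreground-component vectors and binary background labels (e.g. land vs. water) have already been extracted; this illustrates the kind of probe described, not the authors' evaluation code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def background_leakage_probe(fg_vectors: np.ndarray,
                             bg_labels: np.ndarray) -> float:
    """Fit a linear probe that tries to predict the background attribute
    from foreground-component vectors; accuracy near chance suggests
    little residual background signal."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        fg_vectors, bg_labels, test_size=0.3,
        random_state=0, stratify=bg_labels)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return float(probe.score(X_te, y_te))
```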

Circularity Check

0 steps flagged

No circularity: derivation relies on external documented VLM property and synthetic data

full rationale

The paper explicitly builds on the 'well-known property of high linear additivity in VLM embedding spaces' as an established external fact rather than deriving or fitting it internally. It then applies this property to enable foreground/background decomposition and enforces invariance through synthetic data generation, which supplies an independent training signal outside the target dataset's spurious correlations. No load-bearing step reduces by construction to fitted parameters, self-citations, or redefinitions within the paper; the >90% worst-group claim under 100% spurious correlation follows from the external synthetic handle and the cited additivity property, making the chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that linear additivity holds sufficiently well in current VLMs to permit clean foreground-background separation and that synthetic data can stand in for real distributions.

axioms (1)
  • domain assumption: High linear additivity holds in VLM embedding spaces and permits decomposition of scene representations into foreground and background components
    Invoked in the abstract as the foundation for the pre-training approach.

pith-pipeline@v0.9.0 · 5464 in / 1268 out tokens · 56975 ms · 2026-05-13T07:21:40.699185+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · 8 internal anchors

  1. [1]

    Zero-shot robustification of zero-shot models.arXiv preprint arXiv:2309.04344, 2023

    Dyah Adila, Changho Shin, Linrong Cai, and Frederic Sala. Zero-shot robustification of zero-shot models.arXiv preprint arXiv:2309.04344, 2023

  2. [2]

    Foreground or background? visual interpretability and robustness analysis of CLIP, 2025

    Aishwarya Agarwal, Srikrishna Karanam, and Vineet Gandhi. Foreground or background? visual interpretability and robustness analysis of CLIP, 2025. URL https://openreview.net/forum?id=K7wkjqLjrt

  3. [3]

    Interpreting clip with sparse linear concept embeddings (splice).Advances in Neural Information Processing Systems, 37:84298–84328, 2024

    Usha Bhalla, Alex Oesterling, Suraj Srinivas, Flavio P Calmon, and Himabindu Lakkaraju. Interpreting clip with sparse linear concept embeddings (splice).Advances in Neural Information Processing Systems, 37:84298–84328, 2024

  4. [4]

    Visual categorization with bags of keypoints

    Gabriella Csurka, Christopher Dance, Lixin Fan, Jutta Willamowski, and Cédric Bray. Visual categorization with bags of keypoints. InWorkshop on statistical learning in computer vision, ECCV, volume 1, pages 1–2. Prague, 2004

  5. [5]

    Biased data, biased ai: deep networks predict the acquisition site of tcga images.Diagnostic Pathology, 18(1):67, 2023

    Taher Dehkharghanian, Azam Asilian Bidgoli, Abtin Riasatian, Pooria Mazaheri, Clinton JV Campbell, Liron Pantanowitz, HR Tizhoosh, and Shahryar Rahnamayan. Biased data, biased ai: deep networks predict the acquisition site of tcga images.Diagnostic Pathology, 18(1):67, 2023

  6. [6]

    Imagenet: A large- scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large- scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009

  7. [7]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding, 2019. URL https://arxiv.org/abs/1810.04805

  8. [8]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale, 2021. URLhttps://arxiv.org/abs/2010.11929

  9. [9]

    Datacomp: In search of the next generation of multimodal datasets.Advances in Neural Information Processing Systems, 36:27092–27112, 2023

    Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, et al. Datacomp: In search of the next generation of multimodal datasets.Advances in Neural Information Processing Systems, 36:27092–27112, 2023

  10. [10]

    Clip-adapter: Better vision-language models with feature adapters.International Journal of Computer Vision, 132(2):581–595, 2024

    Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. Clip-adapter: Better vision-language models with feature adapters.International Journal of Computer Vision, 132(2):581–595, 2024

  11. [11]

    Masked autoencoders are scalable vision learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022

  12. [14]

    beta-VAE: Learning basic visual concepts with a constrained variational framework

    Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations,

  13. [15]

    URL https://openreview.net/forum?id=Sy2fzU9gl

  14. [16]

    Robust context-aware object recognition

    Klara Janouskova, Cristian Gavrus, and Jiri Matas. Robust context-aware object recognition. arXiv preprint arXiv:2510.00618, 2025

  15. [17]

    Transformers in vision: A survey.ACM Computing Surveys (CSUR), 54 (10s):1–41, 2022

    Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah. Transformers in vision: A survey. ACM Computing Surveys (CSUR), 54(10s):1–41, 2022

  16. [18]

    Last layer re-training is sufficient for robustness to spurious correlations

    Polina Kirichenko, Pavel Izmailov, and Andrew Gordon Wilson. Last layer re-training is sufficient for robustness to spurious correlations.arXiv preprint arXiv:2204.02937, 2022

  17. [19]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023

  18. [20]

    Fine-tuning can distort pretrained features and underperform out-of-distribution. arXiv preprint arXiv:2202.10054, 2022

    Ananya Kumar, Aditi Raghunathan, Robbie Jones, Tengyu Ma, and Percy Liang. Fine-tuning can distort pretrained features and underperform out-of-distribution. arXiv preprint arXiv:2202.10054, 2022

  19. [21]

    Out of Spuriousity: Improving Robustness to Spurious Correlations without Group Annotations

    Phuong Quynh Le, Jörg Schlötterer, and Christin Seifert. Out of spuriousity: Improving robustness to spurious correlations without group annotations.arXiv preprint arXiv:2407.14974, 2024

  20. [22]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. InEuropean Conference on Computer Vision, pages 740–755. Springer, 2014

  21. [23]

    Superclass-guided representation disentanglement for spurious correlation mitigation, 2025

    Chenruo Liu, Hongjun Liu, Zeyu Lai, Yiqiu Shen, Chen Zhao, and Qi Lei. Superclass-guided representation disentanglement for spurious correlation mitigation, 2025. URL https://arxiv.org/abs/2508.08570

  22. [24]

    A convnet for the 2020s.arXiv preprint arXiv:2201.03545, 2022

    Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s, 2022. URLhttps://arxiv.org/abs/2201.03545

  23. [25]

    Deep Learning Face Attributes in the Wild

    Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild, 2015. URLhttps://arxiv.org/abs/1411.7766

  24. [26]

    Robustness to spurious correlation: A comprehensive review

    Mohammadjavad Maheronnaghsh and Taha Akbari Alvanagh. Robustness to spurious correlation: A comprehensive review. In European Conference on Computer Vision, pages 361–379. Springer, 2024

  25. [27]

    Efficient Estimation of Word Representations in Vector Space

    Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space, 2013. URLhttps://arxiv.org/abs/1301.3781

  26. [28]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

  27. [29]

    Bridging explainability and embeddings: Bee aware of spuriousness

    Cristian Daniel Paduraru, Antonio Barbalau, Radu Filipescu, Andrei Liviu Nicolicioiu, and Elena Burceanu. Bridging explainability and embeddings: Bee aware of spuriousness. In The Fourteenth International Conference on Learning Representations

  28. [30]

    Interpreting the linear structure of vision-language model embedding spaces. arXiv preprint arXiv:2504.11695, 2025

    Isabel Papadimitriou, Huangyuan Su, Thomas Fel, Sham Kakade, and Stephanie Gil. Interpreting the linear structure of vision-language model embedding spaces. arXiv preprint arXiv:2504.11695, 2025

  29. [31]

    Simple and fast group robustness by automatic feature reweighting, 2023

    Shikai Qiu, Andres Potapczynski, Pavel Izmailov, and Andrew Gordon Wilson. Simple and fast group robustness by automatic feature reweighting, 2023. URL https://arxiv.org/abs/2306.11074

  30. [32]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational Conference on Machine Learning, pages 8748–8763. PmLR, 2021

  31. [33]

    Imagenet-21k pretraining for the masses

    Tal Ridnik, Emanuel Ben-Baruch, Asaf Noy, and Lihi Zelnik-Manor. Imagenet-21k pretraining for the masses. arXiv preprint arXiv:2104.10972, 2021

  32. [34]

    Distributionally Robust Neural Networks for Group Shifts: On the Importance of Regularization for Worst-Case Generalization

    Shiori Sagawa, Pang Wei Koh, Tatsunori B. Hashimoto, and Percy Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization, 2020. URL https://arxiv.org/abs/1911.08731

  33. [35]

    LAION-5B: An open large-scale dataset for training next generation image-text models

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. Laion-5b: An open large-scale dataset for training next generation image-text models,...

  34. [36]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025

  35. [37]

    Robustness may be at odds with accuracy

    Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy.arXiv preprint arXiv:1805.12152, 2018

  36. [38]

    Ravl: Discovering and mitigating spurious correlations in fine-tuned vision-language models

    Maya Varma, Jean-Benoit Delbrouck, Zhihong Chen, Akshay Chaudhari, and Curtis Langlotz. Ravl: Discovering and mitigating spurious correlations in fine-tuned vision-language models. Advances in Neural Information Processing Systems, 37:82235–82264, 2024

  37. [39]

    Detecting and mitigating the clever hans effect in medical imaging: a scoping review.Journal of Imaging Informatics in Medicine, 38(4):2563–2579, 2025

    Constanza Vasquez-Venegas, Chenwei Wu, Saketh Sundar, Renata Proa, Francis Joshua Beloy, Jillian Reeze Medina, Megan Mcnichol, Krishnaveni Parvataneni, Nicholas Kurtzman, Felipe Mirshawka, et al. Detecting and mitigating the clever hans effect in medical imaging: a scoping review.Journal of Imaging Informatics in Medicine, 38(4):2563–2579, 2025

  38. [40]

    The caltech-ucsd birds-200-2011 dataset

    Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The caltech-ucsd birds-200-2011 dataset. 2011

  39. [41]

    A sober look at the robustness of clips to spurious features.Advances in Neural Information Processing Systems, 37:122484–122523, 2024

    Qizhou Wang, Yong Lin, Yongqiang Chen, Ludwig Schmidt, Bo Han, and Tong Zhang. A sober look at the robustness of clips to spurious features.Advances in Neural Information Processing Systems, 37:122484–122523, 2024

  40. [42]

    Robust fine-tuning of zero-shot models

    Mitchell Wortsman, Gabriel Ilharco, Jong Wook Kim, Mike Li, Simon Kornblith, Rebecca Roelofs, Raphael Gontijo Lopes, Hannaneh Hajishirzi, Ali Farhadi, Hongseok Namkoong, et al. Robust fine-tuning of zero-shot models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7959–7971, 2022

  41. [43]

    Demystifying clip data

    Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, and Christoph Feichtenhofer. Demystifying clip data. arXiv preprint arXiv:2309.16671, 2023

  42. [44]

    Label-free mitigation of spurious correlations in vlms using sparse autoencoders

    Bharat Chandra Yalavarthi, Nalini K Ratha, and Venu Govindaraju. Label-free mitigation of spurious correlations in vlms using sparse autoencoders. In The Fourteenth International Conference on Learning Representations

  43. [45]

    When and why vision-language models behave like bags-of-words, and what to do about it? arXiv preprint arXiv:2210.01936, 2022

    Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, and James Zou. When and why vision-language models behave like bags-of-words, and what to do about it?arXiv preprint arXiv:2210.01936, 2022

  44. [46]

    Interpreting clip with hierarchical sparse autoencoders, 2025

    Vladimir Zaigrajew, Hubert Baniecki, and Przemyslaw Biecek. Interpreting clip with hierarchical sparse autoencoders, 2025. URL https://arxiv.org/abs/2502.20578

  45. [47]

    Vision-language models for vision tasks: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8): 5625–5644, 2024

    Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. Vision-language models for vision tasks: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8): 5625–5644, 2024

  46. [48]

    Nico++: Towards better benchmarking for domain generalization

    Xingxuan Zhang, Yue He, Renzhe Xu, Han Yu, Zheyan Shen, and Peng Cui. Nico++: Towards better benchmarking for domain generalization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16036–16047, 2023

  47. [49]

    Places: A 10 million image database for scene recognition

    Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6):1452–1464, 2017

  48. [50]

    Mapping several instances of a given foreground, on randomized backgrounds, to a single point in the embedding space is responsible for the suppression of background signals and produces the observed quality of background invariance.

  49. [51]

    a photo of a {}

    Utilizing a frozen VLM vision encoder as a teacher model to create individualized anchor vectors preserves semantic information which allows the increased robustness to transfer to O.O.D domains. Furthermore, this aspect facilitates a more gentle restructuring of the embedding space, and by extension the internal model representations, such that catastrop...

  50. [52]

    Scaling: The isolated foreground object and its corresponding mask were proportionally downscaled to occupy a maximum of 75% of the target background dimensions (168×168 pixels) using Lanczos resampling

  51. [53]

    Mask Smoothing: To prevent sharp, jagged boundaries between the foreground and the new background, the segmentation mask was strictly thresholded (pixel values > 100 mapped to 255)

  52. [54]

    and subsequently filtered using a Gaussian blur with a radius of σ = 1

  53. [55]

    water background

    Centering: The preprocessed foreground and mask were pasted directly into the center offset of the chosen 224×224 background image. L.5.3 Evaluation Split. While the primary training set enforced a high or perfect spurious correlation, the evaluation required a strictly balanced test set to calculate worst-group accuracy accurately. To achieve this, the tes...

  54. [56]

    Isolation: The foreground bird is isolated using its provided segmentation mask and cropped tightly to its bounding box

  55. [57]

    Scaling: The isolated foreground and its mask are scaled to a random number in the range [0.6−0.8] as a fraction of the target 224×224 resolution using Lanczos resampling

  56. [58]

    Subsequently, a Gaussian blur with a radius of 1 is applied to smooth the composite edges

    Mask Refinement: To prevent background noise from bleeding through transparent or soft edges, the mask is solidified via a hard threshold (pixel values > 100 are set to 255). Subsequently, a Gaussian blur with a radius of 1 is applied to smooth the composite edges.

  57. [59]

    a photo of a {}

    Placement: The resized foreground is centered and pasted onto the target 224×224 background image using the refined mask. M.3.4 BAP Hyperparameters and Optimization Strategy. Table 17 details the specific hyperparameter configuration utilized during Phase 2 (Alignment Pre-training) of BAP. Rather than relying on an exhaustive and computationally expensiv...