pith. machine review for the scientific record.

arxiv: 2605.11107 · v1 · submitted 2026-05-11 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links · Lean Theorem

Birds of a Feather Flock Together: Background-Invariant Representations via Linear Structure in VLMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 07:21 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords: vision-language models · spurious correlations · background invariance · linear additivity · Waterbirds · synthetic data · robust representations

The pith

Linear additivity in VLM embedding spaces enables background-invariant representations from synthetic data alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models exhibit a property where the embedding of a full scene is close to the sum of separate foreground and background embeddings. The paper exploits this to break foreground-background correlations by training on synthetic composites that pair the same object with many different backgrounds. The resulting models reach the first reported worst-group accuracy above 90 percent on Waterbirds even when every training image shows the object against its correlated background and no counterexamples exist. The method needs no real-world images that already contain the desired invariance.
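The compositing step itself is concrete enough to sketch. Below is a minimal outline of pasting one segmented foreground onto a new background at a random scale, assuming PIL; the scale range, mask threshold, and blur radius echo values quoted in the appendix excerpts at the bottom of this page, but the function and its parameters are illustrative rather than the authors' released code.

```python
# Sketch of one synthetic composite: the same segmented foreground pasted
# onto a different background. Assumes PIL; parameters are illustrative.
import random
from PIL import Image, ImageFilter

def composite(foreground: Image.Image, mask: Image.Image,
              background: Image.Image, size: int = 224,
              scale_range=(0.6, 0.8)) -> Image.Image:
    """Paste a segmented foreground onto a background at a random scale."""
    bg = background.convert("RGB").resize((size, size))
    # Harden the mask, then soften its edge slightly (hard threshold
    # followed by a small Gaussian blur, as described in the appendix).
    hard = mask.convert("L").point(lambda p: 255 if p > 100 else 0)
    soft = hard.filter(ImageFilter.GaussianBlur(radius=1))
    # Scale the foreground to a random fraction of the target resolution.
    s = random.uniform(*scale_range)
    w, h = foreground.size
    ratio = s * size / max(w, h)
    new_size = (max(1, int(w * ratio)), max(1, int(h * ratio)))
    fg = foreground.convert("RGB").resize(new_size, Image.LANCZOS)
    m = soft.resize(new_size, Image.LANCZOS)
    # Centre the foreground on the background and blend through the mask.
    offset = ((size - new_size[0]) // 2, (size - new_size[1]) // 2)
    bg.paste(fg, offset, m)
    return bg
```

Repeating this for many backgrounds per object is what breaks the foreground-background correlation in the synthetic training set.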

Core claim

The high linear additivity of VLM embedding spaces permits reliable decomposition of a scene embedding into foreground and background components. This decomposition supports a pre-training procedure that builds background-invariant representations solely from synthetic data, yielding over 90 percent worst-group accuracy on Waterbirds under 100 percent spurious correlation without any minority-group examples in the training set.
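For reference, the headline metric is worst-group accuracy: the minimum per-group accuracy, where a Waterbirds group is a (bird class, background type) pair and waterbird-on-land and landbird-on-water are the minority groups. A minimal sketch of the standard definition, not code from the paper:

```python
import numpy as np

def worst_group_accuracy(preds: np.ndarray, labels: np.ndarray,
                         groups: np.ndarray) -> float:
    """Minimum per-group accuracy; on Waterbirds a group is a
    (bird class, background type) pair."""
    accs = [(preds[groups == g] == labels[groups == g]).mean()
            for g in np.unique(groups)]
    return float(min(accs))
```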

What carries the argument

Linear additivity in VLM embedding spaces, which supports an additive decomposition of scene embeddings into foreground and background components and thereby lets synthetic composites stand in for real training data.
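Schematically, the additive picture this leans on can be written as below; the rendering is illustrative, and the operators in the paper's own Equations 2 and 6 may differ in detail.

```latex
z(\mathrm{scene}_k) \approx z_{\mathrm{fg}} + z_{\mathrm{bg},k} + \epsilon_k,
\qquad
a \;=\; \frac{1}{K}\sum_{k=1}^{K} z(\mathrm{scene}_k)
\;\approx\; z_{\mathrm{fg}} + \bar{z}_{\mathrm{bg}} + \bar{\epsilon},
\qquad
\operatorname{Var}(\bar{\epsilon}) \propto \frac{1}{K}.
```

Averaging the same object over K randomized backgrounds keeps the foreground term while the background-specific residual shrinks, which is the behaviour the Figure 1 ablation reports.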

If this is right

  • Models reach high worst-group accuracy on spurious-correlation benchmarks without ever seeing real minority-group examples.
  • The learned representations transfer from synthetic pre-training to real images.
  • The approach applies to standard VLMs such as CLIP and SigLIP without further architectural changes.
  • No access to real-world debiased datasets is required for the invariance property.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same additive decomposition could be tested on other spurious factors such as texture or lighting if they also combine linearly in embedding space.
  • Generating more complex synthetic scenes with multiple foreground objects might extend the method beyond single-object classification.
  • Checking whether newer VLMs retain the same degree of linear additivity would indicate how broadly the technique applies.

Load-bearing premise

VLM embedding spaces maintain high linear additivity that allows clean separation of foreground from background components.

What would settle it

Measure whether the embedding of an object image plus the embedding of a background image closely matches the embedding of their composite image across many pairs; large consistent errors would show the decomposition does not hold.
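This check is straightforward to run once encoder embeddings are in hand. A minimal sketch, assuming unnormalized embeddings of an object-only image, a background-only image, and their composite are already computed; the function name and the choice of metrics are illustrative:

```python
import numpy as np

def additivity_residual(z_fg: np.ndarray, z_bg: np.ndarray,
                        z_comp: np.ndarray) -> tuple[float, float]:
    """Compare the composite embedding to the sum of the foreground-only
    and background-only embeddings for one (object, background) pair."""
    z_sum = z_fg + z_bg
    cos = float(np.dot(z_sum, z_comp) /
                (np.linalg.norm(z_sum) * np.linalg.norm(z_comp) + 1e-8))
    rel_err = float(np.linalg.norm(z_sum - z_comp) /
                    (np.linalg.norm(z_comp) + 1e-8))
    return cos, rel_err
```

High cosine similarity and small relative error across many pairs would support the decomposition; large, consistent errors would undermine it.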

Figures

Figures reproduced from arXiv: 2605.11107 by Mark Thomas, Youssef Zaazou.

Figure 1
Figure 1: Ablation on K. Left: Anchor vector cosine similarity to foreground vs. background text prompts, indicating that the application of Equation 2 preserves foreground signals while suppressing background signals. Right: Residual background variance Var(ε). As K increases, Var(ε) exhibits a strict 1/K decay, confirming the relationship in Equation 6.
Figure 2
Figure 2: Overview of the BAP Pipeline. The method consists of two sequential phases. Phase 1 (Top): Anchor Extraction. A foreground-specific anchor a is generated by averaging the embeddings z of a fixed object composited onto K randomized backgrounds using a frozen teacher model f*_θ. Phase 2 (Bottom): Robust Alignment. The student encoder f_θ is optimized via L_align to map the same object across M novel contexts…
Figure 3
Figure 3: Effect of parameter M on BAP using CLIP at low values of N, showing that high performance may be achieved with as few as 50 distinct segmented items. The results of the ablations in…
Figure 4
Figure 5
Figure 5: Performance vs. N Total Segmented Foregrounds Used. Above, we investigate the effect of the number of selected foregrounds (birds) used by BAP on performance on the Waterbirds benchmark. The x-axis indicates how many total birds were sampled for each BAP run and the y-axes indicate average and worst-group accuracy. We observe WGA increases sharply when going from 100 to 500 total birds but then…
Figure 6
Figure 6: Impact of background randomization scaling on model robustness under data-scarce…
Figure 7
Figure 7: Left: progression of the alignment loss during BAP. Right: progression of average and…
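The two-phase pipeline summarized in Figure 2 can be sketched compactly. The outline below assumes PyTorch encoders that map image batches to embedding vectors; the cosine-distance form of the alignment loss is an assumption here, and the paper's exact L_align may differ.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def extract_anchor(teacher, composites: torch.Tensor) -> torch.Tensor:
    """Phase 1 (sketch): average the frozen teacher's embeddings of one
    object composited onto K randomized backgrounds."""
    z = teacher(composites)                     # (K, d)
    return F.normalize(z.mean(dim=0), dim=-1)   # anchor a, shape (d,)

def alignment_loss(student, novel_composites: torch.Tensor,
                   anchor: torch.Tensor) -> torch.Tensor:
    """Phase 2 (sketch): pull the student's embeddings of the same object
    in M novel contexts toward the anchor (assumed cosine-distance form)."""
    z = F.normalize(student(novel_composites), dim=-1)   # (M, d)
    return (1.0 - z @ anchor).mean()
```

Phase 1's averaging is also what drives the Var(ε) decay in Figure 1: the foreground term is common to all K composites while the background-specific residuals average toward zero at a 1/K rate.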
Original abstract

Vision-language models (VLMs), such as CLIP and SigLIP 2, are widely used for image classification, yet their vision encoders remain vulnerable to systematic biases that undermine robustness. In particular, correlations between foreground objects and their backgrounds constitute a salient and practically important class of spurious dependencies. In this work, we revisit the well-known property of high linear additivity in VLM embedding spaces and show that it enables a decomposition of scene representations into foreground and background components. Leveraging this insight, we introduce a pre-training approach that exploits this property to construct background-invariant representations using synthetic data. Our method achieves, to our knowledge, the first worst-group accuracy exceeding $90\%$ on Waterbirds under perfect ($100\%$) spurious correlation (i.e., no minority-group examples in the training data). Furthermore, it demonstrates strong sim-to-real transfer and requires no access to real-world debiased data, making it practical for real-world deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes exploiting the known high linear additivity property of VLM embedding spaces (e.g., CLIP, SigLIP) to decompose scene representations into foreground and background components. It introduces a synthetic-data pre-training procedure that enforces background invariance on the foreground component, claiming the first worst-group accuracy above 90% on Waterbirds under 100% spurious correlation (zero minority-group examples in training) together with strong sim-to-real transfer and no requirement for real debiased data.

Significance. If the decomposition is shown to be sufficiently exact and the invariance transfers reliably, the result would constitute a practically important advance in robust VLM classification, because it removes the need for real-world minority examples or post-hoc debiasing while still reaching high worst-group performance on a canonical spurious-correlation benchmark.

major comments (3)
  1. [Method] Method section (linear decomposition step): the paper treats the foreground/background separation as sufficiently clean for the 100% spurious case, yet provides neither quantitative bounds on residual background leakage nor an ablation measuring how much background signal remains in the extracted foreground vector; this is load-bearing for the central claim because any non-zero residual would allow the classifier to exploit background cues on real test images.
  2. [Experiments] Experiments (Waterbirds 100% spurious setting): the reported >90% worst-group accuracy is presented without error bars across multiple random seeds, without an ablation that varies the quality or diversity of the synthetic backgrounds, and without a direct comparison to a baseline that uses the same synthetic data but omits the linear decomposition; these omissions make it impossible to isolate whether the performance stems from the claimed mechanism.
  3. [Experiments] Section 4.3 (sim-to-real transfer): the transfer results are shown only for the final model; an intermediate result demonstrating that the foreground component alone (before any classifier training) already exhibits reduced background sensitivity on real images would strengthen the mechanistic claim.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'to our knowledge' for the 90% claim should be accompanied by a brief citation to the closest prior numbers on the same 100% spurious split.
  2. [Method] Notation: define the exact linear operator used for decomposition (e.g., the projection matrix or subtraction formula) in a single displayed equation rather than inline text.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and will incorporate the suggested analyses and ablations to strengthen the presentation of the linear decomposition and its empirical validation.

Point-by-point responses
  1. Referee: [Method] Method section (linear decomposition step): the paper treats the foreground/background separation as sufficiently clean for the 100% spurious case, yet provides neither quantitative bounds on residual background leakage nor an ablation measuring how much background signal remains in the extracted foreground vector; this is load-bearing for the central claim because any non-zero residual would allow the classifier to exploit background cues on real test images.

    Authors: We agree that explicit quantification of residual leakage is important for supporting the central claim. In the revised manuscript we will add quantitative bounds on background leakage in the foreground vectors, derived from the known linearity properties of VLM embeddings, together with an ablation that measures residual background signal via cosine similarity to background-only directions and probe classification accuracy on held-out synthetic backgrounds. revision: yes

  2. Referee: [Experiments] Experiments (Waterbirds 100% spurious setting): the reported >90% worst-group accuracy is presented without error bars across multiple random seeds, without an ablation that varies the quality or diversity of the synthetic backgrounds, and without a direct comparison to a baseline that uses the same synthetic data but omits the linear decomposition; these omissions make it impossible to isolate whether the performance stems from the claimed mechanism.

    Authors: We concur that these controls are needed to isolate the contribution of the linear decomposition. The revision will report worst-group accuracy with standard deviations over at least five random seeds, include an ablation varying synthetic background diversity and quality, and add a direct baseline that performs the same synthetic-data pre-training but without the foreground/background decomposition step. revision: yes

  3. Referee: [Experiments] Section 4.3 (sim-to-real transfer): the transfer results are shown only for the final model; an intermediate result demonstrating that the foreground component alone (before any classifier training) already exhibits reduced background sensitivity on real images would strengthen the mechanistic claim.

    Authors: We will strengthen the mechanistic evidence by adding an intermediate analysis in Section 4.3 that evaluates the foreground component in isolation on real images before classifier training. This will include metrics such as the correlation of foreground embeddings with background labels and the accuracy of a linear probe trained to predict background attributes from the foreground vectors on the real Waterbirds test set. revision: yes
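The leakage analysis proposed here is easy to prototype. A minimal sketch using scikit-learn, assuming foreground-component vectors and binary background labels (e.g. land vs. water) have already been extracted; this illustrates the kind of probe described, not the authors' evaluation code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def background_leakage_probe(fg_vectors: np.ndarray,
                             bg_labels: np.ndarray) -> float:
    """Fit a linear probe that tries to predict the background attribute
    from foreground-component vectors; accuracy near chance suggests
    little residual background signal."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        fg_vectors, bg_labels, test_size=0.3,
        random_state=0, stratify=bg_labels)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return float(probe.score(X_te, y_te))
```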

Circularity Check

0 steps flagged

No circularity: derivation relies on external documented VLM property and synthetic data

full rationale

The paper explicitly builds on the 'well-known property of high linear additivity in VLM embedding spaces' as an established external fact rather than deriving or fitting it internally. It then applies this property to enable foreground/background decomposition and enforces invariance through synthetic data generation, which supplies an independent training signal outside the target dataset's spurious correlations. No load-bearing step reduces by construction to fitted parameters, self-citations, or redefinitions within the paper; the >90% worst-group claim under 100% spurious correlation follows from the external synthetic handle and the cited additivity property, making the chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that linear additivity holds sufficiently well in current VLMs to permit clean foreground-background separation and that synthetic data can stand in for real distributions.

axioms (1)
  • domain assumption: High linear additivity holds in VLM embedding spaces and permits decomposition of scene representations into foreground and background components
    Invoked in the abstract as the foundation for the pre-training approach.

pith-pipeline@v0.9.0 · 5464 in / 1268 out tokens · 56975 ms · 2026-05-13T07:21:40.699185+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · 8 internal anchors

  1. [1]

    Zero-shot robustification of zero-shot models.arXiv preprint arXiv:2309.04344, 2023

    Dyah Adila, Changho Shin, Linrong Cai, and Frederic Sala. Zero-shot robustification of zero-shot models.arXiv preprint arXiv:2309.04344, 2023

  2. [2]

    Foreground or background? visual interpretability and robustness analysis of CLIP, 2025

    Aishwarya Agarwal, Srikrishna Karanam, and Vineet Gandhi. Foreground or background? visual interpretability and robustness analysis of CLIP, 2025. URL https://openreview.net/forum?id=K7wkjqLjrt

  3. [3]

    Interpreting clip with sparse linear concept embeddings (splice).Advances in Neural Information Processing Systems, 37:84298–84328, 2024

    Usha Bhalla, Alex Oesterling, Suraj Srinivas, Flavio P Calmon, and Himabindu Lakkaraju. Interpreting clip with sparse linear concept embeddings (splice).Advances in Neural Information Processing Systems, 37:84298–84328, 2024

  4. [4]

    Visual categorization with bags of keypoints

    Gabriella Csurka, Christopher Dance, Lixin Fan, Jutta Willamowski, and Cédric Bray. Visual categorization with bags of keypoints. InWorkshop on statistical learning in computer vision, ECCV, volume 1, pages 1–2. Prague, 2004

  5. [5]

    Biased data, biased ai: deep networks predict the acquisition site of tcga images.Diagnostic Pathology, 18(1):67, 2023

    Taher Dehkharghanian, Azam Asilian Bidgoli, Abtin Riasatian, Pooria Mazaheri, Clinton JV Campbell, Liron Pantanowitz, HR Tizhoosh, and Shahryar Rahnamayan. Biased data, biased ai: deep networks predict the acquisition site of tcga images.Diagnostic Pathology, 18(1):67, 2023

  6. [6]

    Imagenet: A large- scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large- scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009

  7. [7]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding, 2019. URL https://arxiv.org/abs/1810.04805

  8. [8]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale, 2021. URLhttps://arxiv.org/abs/2010.11929

  9. [9]

    Datacomp: In search of the next generation of multimodal datasets.Advances in Neural Information Processing Systems, 36:27092–27112, 2023

    Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, et al. Datacomp: In search of the next generation of multimodal datasets.Advances in Neural Information Processing Systems, 36:27092–27112, 2023

  10. [10]

    Clip-adapter: Better vision-language models with feature adapters.International Journal of Computer Vision, 132(2):581–595, 2024

    Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. Clip-adapter: Better vision-language models with feature adapters.International Journal of Computer Vision, 132(2):581–595, 2024

  11. [11]

    Masked autoencoders are scalable vision learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022

  12. [14]

    beta-VAE: Learning basic visual concepts with a constrained variational framework

    Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations,

  13. [15]

    URL https://openreview.net/forum?id=Sy2fzU9gl

  14. [16]

    Robust context-aware object recognition

    Klara Janouskova, Cristian Gavrus, and Jiri Matas. Robust context-aware object recognition. arXiv preprint arXiv:2510.00618, 2025

  15. [17]

    Transformers in vision: A survey.ACM Computing Surveys (CSUR), 54 (10s):1–41, 2022

    Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah. Transformers in vision: A survey. ACM Computing Surveys (CSUR), 54(10s):1–41, 2022

  16. [18]

    Last layer re-training is sufficient for robustness to spurious correlations

    Polina Kirichenko, Pavel Izmailov, and Andrew Gordon Wilson. Last layer re-training is sufficient for robustness to spurious correlations.arXiv preprint arXiv:2204.02937, 2022

  17. [19]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023

  18. [20]

    Fine-tuning can distort pretrained features and underperform out-of-distribution. arXiv preprint arXiv:2202.10054, 2022

    Ananya Kumar, Aditi Raghunathan, Robbie Jones, Tengyu Ma, and Percy Liang. Fine-tuning can distort pretrained features and underperform out-of-distribution. arXiv preprint arXiv:2202.10054, 2022

  19. [21]

    Out of Spuriousity: Improving Robustness to Spurious Correlations without Group Annotations

    Phuong Quynh Le, Jörg Schlötterer, and Christin Seifert. Out of spuriousity: Improving robustness to spurious correlations without group annotations.arXiv preprint arXiv:2407.14974, 2024

  20. [22]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. InEuropean Conference on Computer Vision, pages 740–755. Springer, 2014

  21. [23]

    Superclass-guided representation disentanglement for spurious correlation mitigation, 2025

    Chenruo Liu, Hongjun Liu, Zeyu Lai, Yiqiu Shen, Chen Zhao, and Qi Lei. Superclass-guided representation disentanglement for spurious correlation mitigation, 2025. URL https://arxiv.org/abs/2508.08570

  22. [24]

    A convnet for the 2020s.arXiv preprint arXiv:2201.03545, 2022

    Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s, 2022. URLhttps://arxiv.org/abs/2201.03545

  23. [25]

    Deep Learning Face Attributes in the Wild

    Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild, 2015. URLhttps://arxiv.org/abs/1411.7766

  24. [26]

    Robustness to spurious correlation: A comprehensive review

    Mohammadjavad Maheronnaghsh and Taha Akbari Alvanagh. Robustness to spurious correlation: A comprehensive review. In European Conference on Computer Vision, pages 361–379. Springer, 2024

  25. [27]

    Efficient Estimation of Word Representations in Vector Space

    Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space, 2013. URLhttps://arxiv.org/abs/1301.3781

  26. [28]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

  27. [29]

    Bridging explainability and embeddings: Bee aware of spuriousness

    Cristian Daniel Paduraru, Antonio Barbalau, Radu Filipescu, Andrei Liviu Nicolicioiu, and Elena Burceanu. Bridging explainability and embeddings: Bee aware of spuriousness. In The Fourteenth International Conference on Learning Representations

  28. [30]

    Interpreting the linear structure of vision-language model embedding spaces. arXiv preprint arXiv:2504.11695, 2025

    Isabel Papadimitriou, Huangyuan Su, Thomas Fel, Sham Kakade, and Stephanie Gil. Interpreting the linear structure of vision-language model embedding spaces. arXiv preprint arXiv:2504.11695, 2025

  29. [31]

    Simple and fast group robustness by automatic feature reweighting, 2023

    Shikai Qiu, Andres Potapczynski, Pavel Izmailov, and Andrew Gordon Wilson. Simple and fast group robustness by automatic feature reweighting, 2023. URL https://arxiv.org/abs/2306.11074

  30. [32]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational Conference on Machine Learning, pages 8748–8763. PmLR, 2021

  31. [33]

    Imagenet-21k pretraining for the masses

    Tal Ridnik, Emanuel Ben-Baruch, Asaf Noy, and Lihi Zelnik-Manor. Imagenet-21k pretraining for the masses. arXiv preprint arXiv:2104.10972, 2021

  32. [34]

    Distributionally Robust Neural Networks for Group Shifts: On the Importance of Regularization for Worst-Case Generalization

    Shiori Sagawa, Pang Wei Koh, Tatsunori B. Hashimoto, and Percy Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization, 2020. URL https://arxiv.org/abs/1911.08731

  33. [35]

    LAION-5B: An open large-scale dataset for training next generation image-text models

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. Laion-5b: An open large-scale dataset for training next generation image-text models,...

  34. [36]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025

  35. [37]

    Robustness may be at odds with accuracy

    Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy.arXiv preprint arXiv:1805.12152, 2018

  36. [38]

    Ravl: Discovering and mitigating spurious correlations in fine-tuned vision-language models

    Maya Varma, Jean-Benoit Delbrouck, Zhihong Chen, Akshay Chaudhari, and Curtis Langlotz. Ravl: Discovering and mitigating spurious correlations in fine-tuned vision-language models. Advances in Neural Information Processing Systems, 37:82235–82264, 2024

  37. [39]

    Detecting and mitigating the clever hans effect in medical imaging: a scoping review.Journal of Imaging Informatics in Medicine, 38(4):2563–2579, 2025

    Constanza Vasquez-Venegas, Chenwei Wu, Saketh Sundar, Renata Proa, Francis Joshua Beloy, Jillian Reeze Medina, Megan Mcnichol, Krishnaveni Parvataneni, Nicholas Kurtzman, Felipe Mirshawka, et al. Detecting and mitigating the clever hans effect in medical imaging: a scoping review.Journal of Imaging Informatics in Medicine, 38(4):2563–2579, 2025

  38. [40]

    The caltech-ucsd birds-200-2011 dataset

    Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The caltech-ucsd birds-200-2011 dataset. 2011

  39. [41]

    A sober look at the robustness of clips to spurious features.Advances in Neural Information Processing Systems, 37:122484–122523, 2024

    Qizhou Wang, Yong Lin, Yongqiang Chen, Ludwig Schmidt, Bo Han, and Tong Zhang. A sober look at the robustness of clips to spurious features.Advances in Neural Information Processing Systems, 37:122484–122523, 2024

  40. [42]

    Robust fine-tuning of zero-shot models

    Mitchell Wortsman, Gabriel Ilharco, Jong Wook Kim, Mike Li, Simon Kornblith, Rebecca Roelofs, Raphael Gontijo Lopes, Hannaneh Hajishirzi, Ali Farhadi, Hongseok Namkoong, et al. Robust fine-tuning of zero-shot models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7959–7971, 2022

  41. [43]

    Demystifying clip data

    Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, and Christoph Feichtenhofer. Demystifying clip data. arXiv preprint arXiv:2309.16671, 2023

  42. [44]

    Label-free mitigation of spurious correlations in vlms using sparse autoencoders

    Bharat Chandra Yalavarthi, Nalini K Ratha, and Venu Govindaraju. Label-free mitigation of spurious correlations in vlms using sparse autoencoders. In The Fourteenth International Conference on Learning Representations

  43. [45]

    When and why vision-language models behave like bags-of-words, and what to do about it? arXiv preprint arXiv:2210.01936, 2022

    Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, and James Zou. When and why vision-language models behave like bags-of-words, and what to do about it?arXiv preprint arXiv:2210.01936, 2022

  44. [46]

    Interpreting clip with hierarchical sparse autoencoders, 2025

    Vladimir Zaigrajew, Hubert Baniecki, and Przemyslaw Biecek. Interpreting clip with hierarchical sparse autoencoders, 2025. URL https://arxiv.org/abs/2502.20578

  45. [47]

    Vision-language models for vision tasks: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8): 5625–5644, 2024

    Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. Vision-language models for vision tasks: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8): 5625–5644, 2024

  46. [48]

    Nico++: Towards better benchmarking for domain generalization

    Xingxuan Zhang, Yue He, Renzhe Xu, Han Yu, Zheyan Shen, and Peng Cui. Nico++: Towards better benchmarking for domain generalization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16036–16047, 2023

  47. [49]

    Places: A 10 million image database for scene recognition

    Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6):1452–1464, 2017

  48. [50]

    Mapping several instances of a given foreground, on randomized backgrounds, to a single point in the embedding space is responsible for the suppression of background signals and produces the observed quality of background invariance.

  49. [51]

    a photo of a {}

    Utilizing a frozen VLM vision encoder as a teacher model to create individualized anchor vectors preserves semantic information which allows the increased robustness to transfer to O.O.D domains. Furthermore, this aspect facilitates a more gentle restructuring of the embedding space, and by extension the internal model representations, such that catastrop...

  50. [52]

    Scaling: The isolated foreground object and its corresponding mask were proportionally downscaled to occupy a maximum of 75% of the target background dimensions (168×168 pixels) using Lanczos resampling

  51. [53]

    Mask Smoothing: To prevent sharp, jagged boundaries between the foreground and the new background, the segmentation mask was strictly thresholded (pixel values > 100 mapped to 255)

  52. [54]

    and subsequently filtered using a Gaussian blur with a radius of σ = 1

  53. [55]

    water background

    Centering: The preprocessed foreground and mask were pasted directly into the center offset of the chosen 224×224 background image. L.5.3 Evaluation Split. While the primary training set enforced a high or perfect spurious correlation, the evaluation required a strictly balanced test set to calculate worst-group accuracy accurately. To achieve this, the tes...

  54. [56]

    Isolation: The foreground bird is isolated using its provided segmentation mask and cropped tightly to its bounding box

  55. [57]

    Scaling: The isolated foreground and its mask are scaled to a random number in the range [0.6−0.8] as a fraction of the target 224×224 resolution using Lanczos resampling

  56. [58]

    Subsequently, a Gaussian blur with a radius of 1 is applied to smooth the composite edges

    Mask Refinement: To prevent background noise from bleeding through transparent or soft edges, the mask is solidified via a hard threshold (pixel values > 100 are set to 255). Subsequently, a Gaussian blur with a radius of 1 is applied to smooth the composite edges.

  57. [59]

    a photo of a {}

    Placement: The resized foreground is centered and pasted onto the target 224×224 background image using the refined mask. M.3.4 BAP Hyperparameters and Optimization Strategy. Table 17 details the specific hyperparameter configuration utilized during Phase 2 (Alignment Pre-training) of BAP. Rather than relying on an exhaustive and computationally expensiv...