Recognition: 2 theorem links
· Lean theorem
Birds of a Feather Flock Together: Background-Invariant Representations via Linear Structure in VLMs
Pith reviewed 2026-05-13 07:21 UTC · model grok-4.3
The pith
Linear additivity in VLM embedding spaces enables background-invariant representations from synthetic data alone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The high linear additivity of VLM embedding spaces permits reliable decomposition of a scene embedding into foreground and background components. This decomposition supports a pre-training procedure that builds background-invariant representations solely from synthetic data, yielding over 90 percent worst-group accuracy on Waterbirds under 100 percent spurious correlation without any minority-group examples in the training set.
What carries the argument
Linear additivity in VLM embedding spaces, which supports additive decomposition of foreground and background features to enable synthetic data construction.
If this is right
- Models reach high worst-group accuracy on spurious-correlation benchmarks without ever seeing real minority-group examples.
- The learned representations transfer from synthetic pre-training to real images.
- The approach applies to standard VLMs such as CLIP and SigLIP without further architectural changes.
- No access to real-world debiased datasets is required for the invariance property.
Where Pith is reading between the lines
- The same additive decomposition could be tested on other spurious factors such as texture or lighting if they also combine linearly in embedding space.
- Generating more complex synthetic scenes with multiple foreground objects might extend the method beyond single-object classification.
- Checking whether newer VLMs retain the same degree of linear additivity would indicate how broadly the technique applies.
Load-bearing premise
VLM embedding spaces maintain high linear additivity that allows clean separation of foreground from background components.
What would settle it
Measure whether the embedding of an object image plus the embedding of a background image closely matches the embedding of their composite image across many pairs; large consistent errors would show the decomposition does not hold.
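One way to run this check, sketched with NumPy on synthetic stand-in embeddings; a real test would substitute the image-encoder outputs of CLIP or SigLIP for the random vectors, and `additivity_error` plus the noise model are illustrative assumptions, not the paper's protocol:

```python
import numpy as np

def additivity_error(fg, bg, composite):
    """Cosine distance between (fg + bg) and the composite embedding.

    fg, bg, composite: (N, D) arrays of embeddings for N image pairs.
    Small values across many pairs support linear additivity; large,
    consistent errors would show the decomposition does not hold.
    """
    s = fg + bg
    cos = np.sum(s * composite, axis=1) / (
        np.linalg.norm(s, axis=1) * np.linalg.norm(composite, axis=1)
    )
    return 1.0 - cos

# Synthetic stand-ins: composites built as fg + bg plus small noise,
# mimicking an embedding space with high linear additivity.
rng = np.random.default_rng(0)
fg = rng.normal(size=(256, 64))
bg = rng.normal(size=(256, 64))
composite = fg + bg + 0.05 * rng.normal(size=(256, 64))

errors = additivity_error(fg, bg, composite)
print(errors.mean())  # near zero when additivity holds
```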
original abstract
Vision-language models (VLMs), such as CLIP and SigLIP 2, are widely used for image classification, yet their vision encoders remain vulnerable to systematic biases that undermine robustness. In particular, correlations between foreground objects and their backgrounds constitute a salient and practically important class of spurious dependencies. In this work, we revisit the well-known property of high linear additivity in VLM embedding spaces and show that it enables a decomposition of scene representations into foreground and background components. Leveraging this insight, we introduce a pre-training approach that exploits this property to construct background-invariant representations using synthetic data. Our method achieves, to our knowledge, the first worst-group accuracy exceeding $90\%$ on Waterbirds under perfect ($100\%$) spurious correlation (i.e., no minority-group examples in the training data). Furthermore, it demonstrates strong sim-to-real transfer and requires no access to real-world debiased data, making it practical for real-world deployment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes exploiting the known high linear additivity property of VLM embedding spaces (e.g., CLIP, SigLIP) to decompose scene representations into foreground and background components. It introduces a synthetic-data pre-training procedure that enforces background invariance on the foreground component, claiming the first worst-group accuracy above 90% on Waterbirds under 100% spurious correlation (zero minority-group examples in training) together with strong sim-to-real transfer and no requirement for real debiased data.
Significance. If the decomposition is shown to be sufficiently exact and the invariance transfers reliably, the result would constitute a practically important advance in robust VLM classification, because it removes the need for real-world minority examples or post-hoc debiasing while still reaching high worst-group performance on a canonical spurious-correlation benchmark.
major comments (3)
- [Method] Method section (linear decomposition step): the paper treats the foreground/background separation as sufficiently clean for the 100% spurious case, yet provides neither quantitative bounds on residual background leakage nor an ablation measuring how much background signal remains in the extracted foreground vector; this is load-bearing for the central claim because any non-zero residual would allow the classifier to exploit background cues on real test images.
- [Experiments] Experiments (Waterbirds 100% spurious setting): the reported >90% worst-group accuracy is presented without error bars across multiple random seeds, without an ablation that varies the quality or diversity of the synthetic backgrounds, and without a direct comparison to a baseline that uses the same synthetic data but omits the linear decomposition; these omissions make it impossible to isolate whether the performance stems from the claimed mechanism.
- [Experiments] Section 4.3 (sim-to-real transfer): the transfer results are shown only for the final model; an intermediate result demonstrating that the foreground component alone (before any classifier training) already exhibits reduced background sensitivity on real images would strengthen the mechanistic claim.
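The worst-group metric these comments turn on is simple to state; a minimal sketch of the standard definition follows, with toy arrays and the seed-averaging the referee requests left as a usage pattern — this is generic illustration, not the paper's evaluation code:

```python
import numpy as np

def worst_group_accuracy(preds, labels, groups):
    """Minimum per-group accuracy, the standard robustness metric on
    Waterbirds-style benchmarks where a group is a (class, background)
    pair."""
    accs = [
        float((preds[groups == g] == labels[groups == g]).mean())
        for g in np.unique(groups)
    ]
    return min(accs)

# Toy run with four groups of two examples each; reporting mean and
# std of this value over several seeds gives the requested error bars.
labels = np.array([0, 0, 1, 1, 0, 1, 0, 1])
groups = np.array([0, 0, 1, 1, 2, 2, 3, 3])
preds  = np.array([0, 0, 1, 0, 0, 1, 1, 1])
print(worst_group_accuracy(preds, labels, groups))  # 0.5 (groups 1 and 3)
```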
minor comments (2)
- [Abstract] Abstract: the phrase 'to our knowledge' for the 90% claim should be accompanied by a brief citation to the closest prior numbers on the same 100% spurious split.
- [Method] Notation: define the exact linear operator used for decomposition (e.g., the projection matrix or subtraction formula) in a single displayed equation rather than inline text.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major point below and will incorporate the suggested analyses and ablations to strengthen the presentation of the linear decomposition and its empirical validation.
point-by-point responses
-
Referee: [Method] Method section (linear decomposition step): the paper treats the foreground/background separation as sufficiently clean for the 100% spurious case, yet provides neither quantitative bounds on residual background leakage nor an ablation measuring how much background signal remains in the extracted foreground vector; this is load-bearing for the central claim because any non-zero residual would allow the classifier to exploit background cues on real test images.
Authors: We agree that explicit quantification of residual leakage is important for supporting the central claim. In the revised manuscript we will add quantitative bounds on background leakage in the foreground vectors, derived from the known linearity properties of VLM embeddings, together with an ablation that measures residual background signal via cosine similarity to background-only directions and probe classification accuracy on held-out synthetic backgrounds. revision: yes
-
Referee: [Experiments] Experiments (Waterbirds 100% spurious setting): the reported >90% worst-group accuracy is presented without error bars across multiple random seeds, without an ablation that varies the quality or diversity of the synthetic backgrounds, and without a direct comparison to a baseline that uses the same synthetic data but omits the linear decomposition; these omissions make it impossible to isolate whether the performance stems from the claimed mechanism.
Authors: We concur that these controls are needed to isolate the contribution of the linear decomposition. The revision will report worst-group accuracy with standard deviations over at least five random seeds, include an ablation varying synthetic background diversity and quality, and add a direct baseline that performs the same synthetic-data pre-training but without the foreground/background decomposition step. revision: yes
-
Referee: [Experiments] Section 4.3 (sim-to-real transfer): the transfer results are shown only for the final model; an intermediate result demonstrating that the foreground component alone (before any classifier training) already exhibits reduced background sensitivity on real images would strengthen the mechanistic claim.
Authors: We will strengthen the mechanistic evidence by adding an intermediate analysis in Section 4.3 that evaluates the foreground component in isolation on real images before classifier training. This will include metrics such as the correlation of foreground embeddings with background labels and the accuracy of a linear probe trained to predict background attributes from the foreground vectors on the real Waterbirds test set. revision: yes
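The two diagnostics promised in the responses above — cosine similarity to background-only directions and a linear probe on background labels — might look roughly like this NumPy sketch; the synthetic vectors, `leakage_cosine`, and `probe_accuracy` are hypothetical stand-ins, and the probe accuracy is in-sample for brevity:

```python
import numpy as np

def leakage_cosine(fg_vectors, bg_direction):
    """Mean |cosine| between extracted foreground vectors and a
    background-only direction; values near chance indicate little
    residual background signal."""
    d = bg_direction / np.linalg.norm(bg_direction)
    u = fg_vectors / np.linalg.norm(fg_vectors, axis=1, keepdims=True)
    return float(np.abs(u @ d).mean())

def probe_accuracy(features, bg_labels):
    """In-sample accuracy of a least-squares linear probe predicting a
    binary background label; chance-level (~0.5) accuracy suggests the
    background is not linearly decodable from the features."""
    X = np.hstack([features, np.ones((len(features), 1))])  # bias term
    w, *_ = np.linalg.lstsq(X, 2.0 * bg_labels - 1.0, rcond=None)
    return float(((X @ w > 0).astype(int) == bg_labels).mean())

rng = np.random.default_rng(1)
bg_dir = np.zeros(32)
bg_dir[0] = 1.0                                  # synthetic bg direction
bg_labels = rng.integers(0, 2, size=200)
invariant = rng.normal(size=(200, 32))           # background fully removed
leaky = invariant + 4.0 * bg_labels[:, None] * bg_dir  # bg bleeds in
print(leakage_cosine(invariant, bg_dir), leakage_cosine(leaky, bg_dir))
print(probe_accuracy(invariant, bg_labels), probe_accuracy(leaky, bg_labels))
```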
Circularity Check
No circularity: the derivation relies on an externally documented VLM property and independent synthetic data
full rationale
The paper explicitly builds on the 'well-known property of high linear additivity in VLM embedding spaces' as an established external fact rather than deriving or fitting it internally. It then applies this property to enable foreground/background decomposition and enforces invariance through synthetic data generation, which supplies an independent training signal outside the target dataset's spurious correlations. No load-bearing step reduces by construction to fitted parameters, self-citations, or redefinitions within the paper; the >90% worst-group claim under 100% spurious correlation follows from the external synthetic handle and the cited additivity property, making the chain self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption High linear additivity holds in VLM embedding spaces and permits decomposition of scene representations into foreground and background components
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
We revisit the well-known property of high linear additivity in VLM embedding spaces and show that it enables a decomposition of scene representations into foreground and background components... anchor a by averaging these vectors... L_align = 1 − f_θ(x̂_m)ᵀa / ‖...‖
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
By the Law of Large Numbers, as K increases, the sample mean of the backgrounds converges to its expected value μ_bg... residual noise ε... Var(ε) ∝ 1/K
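The two quoted passages — averaging K composites into an anchor a, and the alignment loss — can be sketched as follows; `anchor`, `l_align`, and the Gaussian noise model are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(3)
D = 64
mu_bg = rng.normal(size=D)  # expected background component

def anchor(fg, K):
    """Average K composite embeddings (same foreground, K random
    backgrounds); the residual background noise shrinks as Var ~ 1/K,
    per the Law of Large Numbers argument quoted above."""
    composites = fg + mu_bg + rng.normal(size=(K, D))  # fg + bg_k
    return composites.mean(axis=0)

fg = rng.normal(size=D)
small_K = np.linalg.norm(anchor(fg, 4)   - (fg + mu_bg))
large_K = np.linalg.norm(anchor(fg, 256) - (fg + mu_bg))
print(small_K, large_K)  # residual shrinks as K grows

def l_align(f_x, a):
    """Alignment loss from the quoted passage: 1 - cosine(f(x), a)."""
    return 1.0 - float(f_x @ a / (np.linalg.norm(f_x) * np.linalg.norm(a)))
```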
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Zero-shot robustification of zero-shot models
Dyah Adila, Changho Shin, Linrong Cai, and Frederic Sala. Zero-shot robustification of zero-shot models. arXiv preprint arXiv:2309.04344, 2023
-
[2]
Foreground or background? visual interpretability and robustness analysis of CLIP, 2025
Aishwarya Agarwal, Srikrishna Karanam, and Vineet Gandhi. Foreground or background? Visual interpretability and robustness analysis of CLIP, 2025. URL https://openreview.net/forum?id=K7wkjqLjrt
-
[3]
Usha Bhalla, Alex Oesterling, Suraj Srinivas, Flavio P Calmon, and Himabindu Lakkaraju. Interpreting CLIP with sparse linear concept embeddings (SpLiCE). Advances in Neural Information Processing Systems, 37:84298–84328, 2024
-
[4]
Visual categorization with bags of keypoints
Gabriella Csurka, Christopher Dance, Lixin Fan, Jutta Willamowski, and Cédric Bray. Visual categorization with bags of keypoints. In Workshop on Statistical Learning in Computer Vision, ECCV, volume 1, pages 1–2. Prague, 2004
-
[5]
Taher Dehkharghanian, Azam Asilian Bidgoli, Abtin Riasatian, Pooria Mazaheri, Clinton JV Campbell, Liron Pantanowitz, HR Tizhoosh, and Shahryar Rahnamayan. Biased data, biased AI: deep networks predict the acquisition site of TCGA images. Diagnostic Pathology, 18(1):67, 2023
-
[6]
ImageNet: A large-scale hierarchical image database
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009
-
[7]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding, 2019. URL https://arxiv.org/abs/1810.04805
-
[8]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale, 2021. URL https://arxiv.org/abs/2010.11929
-
[9]
Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, et al. DataComp: In search of the next generation of multimodal datasets. Advances in Neural Information Processing Systems, 36:27092–27112, 2023
-
[10]
Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. CLIP-Adapter: Better vision-language models with feature adapters. International Journal of Computer Vision, 132(2):581–595, 2024
-
[11]
Masked autoencoders are scalable vision learners
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000–16009, 2022
-
[14]
beta-VAE: Learning basic visual concepts with a constrained variational framework
Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=Sy2fzU9gl
-
[16]
Robust context-aware object recognition
Klara Janouskova, Cristian Gavrus, and Jiri Matas. Robust context-aware object recognition. arXiv preprint arXiv:2510.00618, 2025
-
[17]
Transformers in vision: A survey
Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah. Transformers in vision: A survey. ACM Computing Surveys (CSUR), 54(10s):1–41, 2022
-
[18]
Last layer re-training is sufficient for robustness to spurious correlations
Polina Kirichenko, Pavel Izmailov, and Andrew Gordon Wilson. Last layer re-training is sufficient for robustness to spurious correlations. arXiv preprint arXiv:2204.02937, 2022
-
[19]
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023
-
[20]
Ananya Kumar, Aditi Raghunathan, Robbie Jones, Tengyu Ma, and Percy Liang. Fine-tuning can distort pretrained features and underperform out-of-distribution. arXiv preprint arXiv:2202.10054, 2022
-
[21]
Out of Spuriousity: Improving Robustness to Spurious Correlations without Group Annotations
Phuong Quynh Le, Jörg Schlötterer, and Christin Seifert. Out of spuriousity: Improving robustness to spurious correlations without group annotations. arXiv preprint arXiv:2407.14974, 2024
-
[22]
Microsoft coco: Common objects in context
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014
-
[23]
Superclass-guided representation disentanglement for spurious correlation mitigation, 2025
Chenruo Liu, Hongjun Liu, Zeyu Lai, Yiqiu Shen, Chen Zhao, and Qi Lei. Superclass-guided representation disentanglement for spurious correlation mitigation, 2025. URL https://arxiv.org/abs/2508.08570
-
[24]
A ConvNet for the 2020s
Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A ConvNet for the 2020s, 2022. URL https://arxiv.org/abs/2201.03545
-
[25]
Deep Learning Face Attributes in the Wild
Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild, 2015. URL https://arxiv.org/abs/1411.7766
-
[26]
Robustness to spurious correlation: A comprehensive review
Mohammadjavad Maheronnaghsh and Taha Akbari Alvanagh. Robustness to spurious correlation: A comprehensive review. In European Conference on Computer Vision, pages 361–379. Springer, 2024
-
[27]
Efficient Estimation of Word Representations in Vector Space
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space, 2013. URL https://arxiv.org/abs/1301.3781
-
[28]
DINOv2: Learning Robust Visual Features without Supervision
Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023
-
[29]
Bridging explainability and embeddings: Bee aware of spuriousness
Cristian Daniel Paduraru, Antonio Barbalau, Radu Filipescu, Andrei Liviu Nicolicioiu, and Elena Burceanu. Bridging explainability and embeddings: Bee aware of spuriousness. In The Fourteenth International Conference on Learning Representations
-
[30]
Isabel Papadimitriou, Huangyuan Su, Thomas Fel, Sham Kakade, and Stephanie Gil. Interpreting the linear structure of vision-language model embedding spaces. arXiv preprint arXiv:2504.11695, 2025
-
[31]
Simple and fast group robustness by automatic feature reweighting, 2023
Shikai Qiu, Andres Potapczynski, Pavel Izmailov, and Andrew Gordon Wilson. Simple and fast group robustness by automatic feature reweighting, 2023. URL https://arxiv.org/abs/2306.11074
-
[32]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021
-
[33]
ImageNet-21K pretraining for the masses
Tal Ridnik, Emanuel Ben-Baruch, Asaf Noy, and Lihi Zelnik-Manor. ImageNet-21K pretraining for the masses. arXiv preprint arXiv:2104.10972, 2021
-
[34]
Shiori Sagawa, Pang Wei Koh, Tatsunori B. Hashimoto, and Percy Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization, 2020. URL https://arxiv.org/abs/1911.08731
-
[35]
LAION-5B: An open large-scale dataset for training next generation image-text models
Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5B: An open large-scale dataset for training next generation image-text models,...
-
[36]
Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025
-
[37]
Robustness may be at odds with accuracy
Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy. arXiv preprint arXiv:1805.12152, 2018
-
[38]
RaVL: Discovering and mitigating spurious correlations in fine-tuned vision-language models
Maya Varma, Jean-Benoit Delbrouck, Zhihong Chen, Akshay Chaudhari, and Curtis Langlotz. RaVL: Discovering and mitigating spurious correlations in fine-tuned vision-language models. Advances in Neural Information Processing Systems, 37:82235–82264, 2024
-
[39]
Constanza Vasquez-Venegas, Chenwei Wu, Saketh Sundar, Renata Proa, Francis Joshua Beloy, Jillian Reeze Medina, Megan Mcnichol, Krishnaveni Parvataneni, Nicholas Kurtzman, Felipe Mirshawka, et al. Detecting and mitigating the Clever Hans effect in medical imaging: a scoping review. Journal of Imaging Informatics in Medicine, 38(4):2563–2579, 2025
-
[40]
The caltech-ucsd birds-200-2011 dataset
Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The caltech-ucsd birds-200-2011 dataset. 2011
-
[41]
Qizhou Wang, Yong Lin, Yongqiang Chen, Ludwig Schmidt, Bo Han, and Tong Zhang. A sober look at the robustness of CLIPs to spurious features. Advances in Neural Information Processing Systems, 37:122484–122523, 2024
-
[42]
Robust fine-tuning of zero-shot models
Mitchell Wortsman, Gabriel Ilharco, Jong Wook Kim, Mike Li, Simon Kornblith, Rebecca Roelofs, Raphael Gontijo Lopes, Hannaneh Hajishirzi, Ali Farhadi, Hongseok Namkoong, et al. Robust fine-tuning of zero-shot models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7959–7971, 2022
-
[43]
Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, and Christoph Feichtenhofer. Demystifying CLIP data. arXiv preprint arXiv:2309.16671, 2023
-
[44]
Label-free mitigation of spurious correlations in vlms using sparse autoencoders
Bharat Chandra Yalavarthi, Nalini K Ratha, and Venu Govindaraju. Label-free mitigation of spurious correlations in VLMs using sparse autoencoders. In The Fourteenth International Conference on Learning Representations
-
[45]
Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, and James Zou. When and why vision-language models behave like bags-of-words, and what to do about it? arXiv preprint arXiv:2210.01936, 2022
-
[46]
Interpreting CLIP with hierarchical sparse autoencoders, 2025
Vladimir Zaigrajew, Hubert Baniecki, and Przemyslaw Biecek. Interpreting CLIP with hierarchical sparse autoencoders, 2025. URL https://arxiv.org/abs/2502.20578
-
[47]
Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. Vision-language models for vision tasks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8):5625–5644, 2024
-
[48]
NICO++: Towards better benchmarking for domain generalization
Xingxuan Zhang, Yue He, Renzhe Xu, Han Yu, Zheyan Shen, and Peng Cui. NICO++: Towards better benchmarking for domain generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16036–16047, 2023
-
[49]
Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6):1452–1464, 2017
Appendix overview: Unless otherwise specified, all ablation studies utilize the CLIP ViT-B/16 architecture initialized with LAION-2B...
-
[50]
Mapping several instances of a given foreground, on randomized backgrounds, to a single point in the embedding space is responsible for the suppression of background signals and produces the observed quality of background invariance.
-
[51]
Utilizing a frozen VLM vision encoder as a teacher model to create individualized anchor vectors preserves semantic information, which allows the increased robustness to transfer to O.O.D. domains. Furthermore, this aspect facilitates a more gentle restructuring of the embedding space, and by extension the internal model representations, such that catastrop...
-
[52]
Scaling: The isolated foreground object and its corresponding mask were proportionally downscaled to occupy a maximum of 75% of the target background dimensions (168×168 pixels) using Lanczos resampling
-
[53]
Mask Smoothing: To prevent sharp, jagged boundaries between the foreground and the new background, the segmentation mask was strictly thresholded (pixel values > 100 mapped to 255) and subsequently filtered using a Gaussian blur with a radius of σ = 1
-
[55]
Centering: The preprocessed foreground and mask were pasted directly into the center offset of the chosen 224×224 background image.
L.5.3 Evaluation Split: While the primary training set enforced a high or perfect spurious correlation, the evaluation required a strictly balanced test set to calculate worst-group accuracy accurately. To achieve this, the tes...
-
[56]
Isolation: The foreground bird is isolated using its provided segmentation mask and cropped tightly to its bounding box
-
[57]
Scaling: The isolated foreground and its mask are scaled by a random factor in the range [0.6–0.8] as a fraction of the target 224×224 resolution using Lanczos resampling
-
[58]
Mask Refinement: To prevent background noise from bleeding through transparent or soft edges, the mask is solidified via a hard threshold (pixel values > 100 are set to 255). Subsequently, a Gaussian blur with a radius of 1 is applied to smooth the composite edges.
-
[59]
Placement: The resized foreground is centered and pasted onto the target 224×224 background image using the refined mask.
M.3.4 BAP Hyperparameters and Optimization Strategy: Table 17 details the specific hyperparameter configuration utilized during Phase 2 (Alignment Pre-training) of BAP. Rather than relying on an exhaustive and computationally expensiv...
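The compositing steps quoted in these appendix fragments (hard mask threshold at 100, centered paste onto a 224×224 background) can be sketched in NumPy; this is a simplified stand-in that omits the Lanczos rescaling and the Gaussian edge blur the appendix describes, and `composite` is a hypothetical helper, not the paper's code:

```python
import numpy as np

def composite(fg, mask, background, threshold=100):
    """Paste a foreground crop onto the center of a background.

    fg:         (h, w, 3) uint8 foreground crop
    mask:       (h, w) uint8 soft segmentation mask
    background: (H, W, 3) uint8 background, with H >= h and W >= w
    The mask is hard-thresholded (values > 100 -> 255), mirroring the
    described pipeline; rescaling and edge blur are omitted here.
    """
    hard = np.where(mask > threshold, 255, 0).astype(np.uint8)
    H, W, _ = background.shape
    h, w, _ = fg.shape
    top, left = (H - h) // 2, (W - w) // 2      # centered placement
    out = background.copy()
    region = out[top:top + h, left:left + w]
    alpha = (hard / 255.0)[..., None]           # per-pixel blend weight
    out[top:top + h, left:left + w] = (
        alpha * fg + (1 - alpha) * region
    ).astype(np.uint8)
    return out

bg = np.zeros((224, 224, 3), dtype=np.uint8)
fg = np.full((100, 100, 3), 200, dtype=np.uint8)
mask = np.full((100, 100), 255, dtype=np.uint8)
img = composite(fg, mask, bg)
print(img[112, 112])  # foreground pixel at the center
```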
discussion (0)