SHED: Style-Homogenized Embedding Alignment for Domain Generalization
Pith reviewed 2026-05-19 20:20 UTC · model grok-4.3
The pith
Removing style centroids from CLIP embeddings improves generalization to unseen domains.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
During training, SHED subtracts style centroids from both modalities: image embeddings lose a centroid computed per source domain, and text embeddings are averaged across diverse prompt templates and then stripped of a global centroid. At inference, where no target-domain information is available, SHED projects diverse textual domain centroids into the visual space and aggregates predictions via membership weighting.
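The training-time subtraction described in the core claim can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes each style centroid is a plain mean, and the function names `homogenize_images` and `homogenize_text` and the array shapes are hypothetical. Whether the paper re-normalizes embeddings after subtraction is not stated here, so the sketch leaves them unnormalized.

```python
import numpy as np


def homogenize_images(img_emb, domain_ids):
    """Subtract each source domain's mean embedding (assumed style centroid).

    img_emb: (N, D) image embeddings; domain_ids: (N,) source-domain labels.
    """
    img_emb = np.asarray(img_emb, dtype=float)
    domain_ids = np.asarray(domain_ids)
    out = np.empty_like(img_emb)
    for d in np.unique(domain_ids):
        mask = domain_ids == d
        # Per-domain centroid, computed only from that domain's samples.
        out[mask] = img_emb[mask] - img_emb[mask].mean(axis=0)
    return out


def homogenize_text(txt_emb):
    """Average over prompt templates, then remove the global centroid.

    txt_emb: (C, T, D) embeddings for C classes and T prompt templates.
    """
    txt_emb = np.asarray(txt_emb, dtype=float)
    class_emb = txt_emb.mean(axis=1)            # (C, D): average over templates
    return class_emb - class_emb.mean(axis=0)   # strip the mean over classes
```

After either call, the mean over the relevant group (a source domain for images, all classes for text) is zero by construction, which is the sense in which the embeddings are "homogenized" in this sketch.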
What carries the argument
The style-homogenized embedding alignment process that subtracts estimated style centroids to eliminate domain-specific information from embeddings.
Load-bearing premise
The method assumes that subtracting estimated style centroids from embeddings preserves class semantics while removing only domain-specific information, and that projecting textual domain centroids into visual space enables effective membership-weighted inference without any target-domain data.
What would settle it
Demonstrating that models using style-homogenized embeddings perform worse than raw embedding alignment on unseen domain tests would falsify the central claim.
Original abstract
Domain generalization aims to enhance model robustness against unseen domains with embedding distribution shifts. While large-scale vision-language models like CLIP exhibit strong generalization, their direct image-text embedding alignment suffers from inherent information asymmetry: images encode both class semantics and domain-specific styles, whereas text prompts primarily convey basic class cues. This asymmetry hinders generalization to novel domains in realistic scenarios. To address this, we propose Style-Homogenized Embedding alignment for Domain-generalization (SHED), a novel CLIP-based method that aligns style-homogenized embeddings instead of raw representations from encoders in CLIP. During training, SHED removes domain-specific style centroids from both image embeddings computed per source domains and text embeddings which are averaged across diverse prompt templates and stripped of a global centroid. For inference, considering the lack of target domain information, SHED projects diverse textual domain centroids into the visual space and aggregates predictions via membership weighting. Extensive experiments on five benchmarks show SHED achieves state-of-the-art performance, outperforming prior methods significantly (e.g., +4.0% on DomainNet vs. standard fine-tuning).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes SHED, a CLIP-based method for domain generalization that aligns style-homogenized embeddings rather than raw representations. During training, per-source-domain style centroids are subtracted from image embeddings and a global centroid is removed from averaged text embeddings across prompt templates. For inference without target-domain data, textual domain centroids are projected into visual space and predictions are aggregated via membership weighting. Experiments across five benchmarks report state-of-the-art results, including a +4.0% gain on DomainNet relative to standard fine-tuning.
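The inference step in the summary can be made concrete with a small sketch. The source does not specify how textual domain centroids are projected into visual space or how membership weights are computed, so everything below is an assumption: the centroids are taken as already projected, membership is a softmax over cosine similarity, and `membership_weighted_predict` and `temp` are hypothetical names.

```python
import numpy as np


def _softmax(z, axis=-1):
    z = z - np.max(z, axis=axis, keepdims=True)  # stabilize the exponent
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def _unit(x, eps=1e-12):
    # L2-normalize along the last axis; eps guards against zero vectors.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def membership_weighted_predict(img, domain_centroids, class_text, temp=0.01):
    """Return (C,) class probabilities for one image embedding.

    img: (D,) image embedding.
    domain_centroids: (K, D) candidate style centroids, assumed to be
        already projected into the visual space.
    class_text: (C, D) homogenized class text embeddings.
    """
    img = np.asarray(img, dtype=float)
    centroids = np.asarray(domain_centroids, dtype=float)
    text = _unit(np.asarray(class_text, dtype=float))
    # Membership: how strongly the image resembles each candidate domain.
    weights = _softmax(_unit(centroids) @ _unit(img) / temp)    # (K,)
    # Remove each candidate centroid, then score classes per candidate.
    residual = _unit(img[None, :] - centroids)                   # (K, D)
    probs = _softmax(residual @ text.T / temp, axis=1)           # (K, C)
    # Aggregate per-candidate predictions by membership weight.
    return weights @ probs
```

Because the weights and each row of `probs` sum to one, the returned vector is itself a probability distribution over classes.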
Significance. If the central claims hold, the work provides a concrete mechanism to mitigate information asymmetry between images (which carry style) and text (which carries primarily class cues) in large vision-language models, leading to better robustness on unseen domains. The approach is notable for operating without target data at inference and for its empirical evaluation on multiple standard DG benchmarks. The absence of additional free parameters in the core alignment step and the use of existing CLIP encoders are strengths that support practical adoption if the gains prove reliable.
major comments (2)
- [§3.2] §3.2: The claim that subtracting estimated style centroids removes only domain-specific information while preserving class semantics is load-bearing for the central contribution. No diagnostic is reported (e.g., change in intra-class variance, class-prototype cosine similarity, or decision-boundary shift) to verify that the operation does not distort class-discriminative directions when source domains exhibit class-conditional style biases. This separability assumption directly affects whether the reported gains can be attributed to style homogenization rather than incidental effects.
- [§4.3] §4.3 and Table 2: The +4.0% improvement on DomainNet and consistent gains across the five benchmarks are presented without error bars, statistical significance tests, or full ablations isolating the contribution of centroid removal versus the membership-weighted inference step. These omissions make it difficult to assess the robustness of the SOTA claim and the necessity of each component.
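The diagnostics requested in the first major comment could be computed along the following lines. This is a sketch under assumptions: "class-prototype cosine similarity" is read here as the cosine between each class's mean embedding before and after centroid subtraction, which is one of several reasonable readings, and the function names are hypothetical.

```python
import numpy as np


def class_prototypes(emb, labels):
    """Return sorted class ids and the mean embedding of each class."""
    emb = np.asarray(emb, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    protos = np.stack([emb[labels == c].mean(axis=0) for c in classes])
    return classes, protos


def intra_class_variance(emb, labels):
    """Mean squared distance of each sample to its class prototype."""
    emb = np.asarray(emb, dtype=float)
    labels = np.asarray(labels)
    classes, protos = class_prototypes(emb, labels)
    idx = np.searchsorted(classes, labels)  # row of each sample's prototype
    return float(((emb - protos[idx]) ** 2).sum(axis=1).mean())


def prototype_cosine(before, after, labels):
    """Cosine between each class prototype before and after subtraction."""
    _, p0 = class_prototypes(before, labels)
    _, p1 = class_prototypes(after, labels)
    num = (p0 * p1).sum(axis=1)
    den = np.linalg.norm(p0, axis=1) * np.linalg.norm(p1, axis=1) + 1e-12
    return num / den
```

Cosines near 1 for every class, together with intra-class variance that does not grow after subtraction, would be consistent with the separability assumption; values that drop sharply for some classes would point to class-conditional style bias.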
minor comments (2)
- [§3.1] §3.1: The precise definition and computation of the global text centroid (after averaging across prompt templates) would benefit from an explicit equation to ensure reproducibility.
- [Figure 3] Figure 3: The visualization of homogenized embeddings would be clearer if the caption explicitly stated the projection method and the meaning of the plotted axes.
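For the first minor comment, one plausible way to write the global text centroid is given below. The notation is assumed rather than taken from the paper: $C$ classes, $M$ prompt templates, $g$ the CLIP text encoder, and $p_m(c)$ the $m$-th template filled with class name $c$.

```latex
\bar{t}_c = \frac{1}{M}\sum_{m=1}^{M} g\big(p_m(c)\big), \qquad
\mu_T = \frac{1}{C}\sum_{c=1}^{C} \bar{t}_c, \qquad
\tilde{t}_c = \bar{t}_c - \mu_T .
```

Here $\bar{t}_c$ is the template-averaged embedding of class $c$, $\mu_T$ is the global centroid, and $\tilde{t}_c$ is the homogenized text embedding. Whether the paper normalizes before or after averaging would need to be stated alongside the equation.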
Simulated Author's Rebuttal
We thank the referee for their thoughtful comments and constructive feedback on our manuscript. We have carefully considered each point and made revisions to strengthen the paper accordingly. Below we provide point-by-point responses to the major comments.
- Referee: [§3.2] §3.2: The claim that subtracting estimated style centroids removes only domain-specific information while preserving class semantics is load-bearing for the central contribution. No diagnostic is reported (e.g., change in intra-class variance, class-prototype cosine similarity, or decision-boundary shift) to verify that the operation does not distort class-discriminative directions when source domains exhibit class-conditional style biases. This separability assumption directly affects whether the reported gains can be attributed to style homogenization rather than incidental effects.
  Authors: We agree that verifying the preservation of class semantics after style centroid subtraction is crucial for validating the core mechanism. In the revised manuscript, we have added diagnostic analyses, including measurements of intra-class variance and class-prototype cosine similarity before and after the subtraction operation. These diagnostics demonstrate that class-discriminative information is preserved while domain-specific style information is reduced. We also include decision boundary visualizations to show minimal distortion in class separation.
  Revision: yes
- Referee: [§4.3] §4.3 and Table 2: The +4.0% improvement on DomainNet and consistent gains across the five benchmarks are presented without error bars, statistical significance tests, or full ablations isolating the contribution of centroid removal versus the membership-weighted inference step. These omissions make it difficult to assess the robustness of the SOTA claim and the necessity of each component.
  Authors: We acknowledge the importance of reporting variability and conducting rigorous statistical analysis. In the revised version, we have included error bars representing standard deviations over multiple runs and performed statistical significance tests to confirm the improvements. We have also expanded the ablation studies to better isolate the contributions of the style centroid removal during training and the membership-weighted aggregation during inference, showing that each component is necessary for achieving the reported performance.
  Revision: yes
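The error bars and significance test mentioned in the second response could be computed as sketched below, using only the Python standard library. The rebuttal does not say which test was used; a paired t statistic over per-seed accuracies is assumed here, and the function names are hypothetical.

```python
import math
import statistics


def mean_std(acc):
    """Mean and sample standard deviation of per-run accuracies."""
    return statistics.mean(acc), statistics.stdev(acc)


def paired_t_statistic(acc_a, acc_b):
    """t statistic for paired runs (same seeds and splits for both methods).

    Compare the result against a t distribution with len(acc_a) - 1
    degrees of freedom to obtain a p-value.
    """
    if len(acc_a) != len(acc_b) or len(acc_a) < 2:
        raise ValueError("need two equally long lists with at least 2 runs")
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
```

Pairing by seed matters because run-to-run variance in domain generalization benchmarks can be comparable in size to the gap between methods.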
Circularity Check
No significant circularity in SHED derivation or performance claims
The paper describes an algorithmic procedure for style-homogenized embedding alignment using CLIP encoders: per-domain centroid subtraction on image embeddings, global centroid removal on averaged text embeddings, and inference via projection of textual domain centroids with membership weighting. No equations, derivations, or self-citations are shown that reduce the reported benchmark gains (e.g., +4.0% on DomainNet) to quantities defined by construction from the authors' inputs or prior work. The method is presented as an empirical technique evaluated on standard external benchmarks, with performance arising from the explicit operations rather than tautological redefinition or fitted-parameter renaming. This qualifies as self-contained against external benchmarks, with no load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: CLIP image and text encoders produce embeddings where style and class semantics are linearly separable enough for centroid subtraction to isolate domain effects.
- Domain assumption: Averaging text embeddings across diverse prompt templates yields a representation from which a global centroid can be stripped without losing class information.