
arxiv: 2605.16973 · v1 · pith:EQ5O53GInew · submitted 2026-05-16 · 💻 cs.CV · cs.LG

SHED: Style-Homogenized Embedding Alignment for Domain Generalization

Pith reviewed 2026-05-19 20:20 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords domain generalization · CLIP alignment · style homogenization · embedding distribution shift · vision language models · unseen domains · domain adaptation

The pith

Removing style centroids from CLIP embeddings improves generalization to unseen domains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes SHED to address an information asymmetry in CLIP: images carry both class semantics and domain-specific style, while text prompts mainly provide class cues. During training, SHED subtracts a style centroid from the image embeddings of each source domain, removes a global centroid from text embeddings averaged over prompt templates, and aligns the resulting style-homogenized representations. For inference on unseen domains, it projects textual domain centroids into the visual space and aggregates predictions via membership weighting. The abstract reports state-of-the-art results on five benchmarks (e.g., +4.0% on DomainNet over standard fine-tuning) without using any target-domain data.

Core claim

During training, SHED removes domain-specific style centroids from both image embeddings computed per source domains and text embeddings which are averaged across diverse prompt templates and stripped of a global centroid. For inference, considering the lack of target domain information, SHED projects diverse textual domain centroids into the visual space and aggregates predictions via membership weighting.
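The training-side operation described above can be illustrated with a minimal NumPy sketch. It assumes that a style centroid is the plain mean of a source domain's image embeddings and that the global text centroid is the mean over per-class text embeddings; the page does not give the paper's exact estimators. The function names (`homogenize_images`, `homogenize_text`) and the random stand-in embeddings are hypothetical, and whether the paper re-normalizes after subtraction is not stated here.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def homogenize_images(img_emb, domain_ids):
    """Subtract each source domain's mean embedding (assumed style centroid)."""
    out = np.empty_like(img_emb)
    centroids = {}
    for d in np.unique(domain_ids):
        mask = domain_ids == d
        centroids[int(d)] = img_emb[mask].mean(axis=0)
        out[mask] = img_emb[mask] - centroids[int(d)]
    return out, centroids

def homogenize_text(text_emb):
    """text_emb has shape (num_templates, num_classes, dim).

    Average over prompt templates, then remove the global centroid
    (assumed here to be the mean over classes).
    """
    class_emb = text_emb.mean(axis=0)          # (num_classes, dim)
    global_centroid = class_emb.mean(axis=0)   # (dim,)
    return class_emb - global_centroid, global_centroid

# Random stand-ins for CLIP outputs: 3 source domains, 3 classes, 5 templates.
rng = np.random.default_rng(0)
img_emb = l2_normalize(rng.normal(size=(12, 8)))
domain_ids = np.array([0] * 4 + [1] * 4 + [2] * 4)
text_emb = l2_normalize(rng.normal(size=(5, 3, 8)))

img_sh, img_centroids = homogenize_images(img_emb, domain_ids)
txt_sh, txt_centroid = homogenize_text(text_emb)
```

After this step every source domain has a zero-mean embedding cloud, which is the sense in which the representations are "style-homogenized" in this sketch.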

What carries the argument

The style-homogenized embedding alignment process that subtracts estimated style centroids to eliminate domain-specific information from embeddings.

Load-bearing premise

The argument rests on two assumptions: that subtracting estimated style centroids removes only domain-specific information while preserving class semantics, and that projecting textual domain centroids into visual space supports effective membership-weighted inference without any target-domain data.
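One way to picture the inference-side half of this premise is the sketch below. It is an assumption-heavy illustration, not the paper's procedure: the candidate style centroids are taken as already projected into visual space, membership is a softmax over cosine similarity to each centroid, and the temperatures `tau` and `tau_c` are hypothetical names that may or may not correspond to the τ and τC mentioned in the Figure 7 caption.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def membership_weighted_predict(x, style_centroids, class_text, tau=0.01, tau_c=0.1):
    """Mix per-centroid class predictions by soft domain membership.

    x:               (n, d) test image embeddings
    style_centroids: (k, d) candidate style centroids in visual space (assumed given)
    class_text:      (c, d) style-homogenized class text embeddings
    returns:         (n, c) class probabilities
    """
    x = l2_normalize(x)
    # Soft membership of each image in each candidate style domain.
    sims = x @ l2_normalize(style_centroids).T                       # (n, k)
    weights = softmax(sims / tau_c, axis=1)                          # (n, k)
    # Remove each candidate centroid in turn and classify the result.
    shifted = l2_normalize(x[:, None, :] - style_centroids[None, :, :])  # (n, k, d)
    logits = shifted @ l2_normalize(class_text).T                    # (n, k, c)
    probs = softmax(logits / tau, axis=-1)                           # (n, k, c)
    # Aggregate the k predictions with the membership weights.
    return (weights[:, :, None] * probs).sum(axis=1)                 # (n, c)

# Random stand-ins: 4 test images, 5 candidate centroids, 3 classes.
rng = np.random.default_rng(1)
x_test = rng.normal(size=(4, 8))
cands = 0.3 * rng.normal(size=(5, 8))
classes = rng.normal(size=(3, 8))
pred = membership_weighted_predict(x_test, cands, classes)
```

The design point the sketch tries to capture is that no target-domain statistics are needed: every quantity is computed from the single test embedding and centroids fixed before deployment.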

What would settle it

If models trained with style-homogenized embeddings performed no better than raw-embedding alignment on unseen-domain tests under matched training settings, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2605.16973 by Kai Gan, Tong Wei.

Figure 1: An illustration of the embedding space distribution for …
Figure 2: The heatmaps of cross-modal centroid similarities for the “Clipart” domain in DomainNet Peng …
Figure 3: Performance change of different methods relative to the …
Figure 4: Performance change by incorporating additional centroids
Figure 5: The t-SNE visualization of embeddings on DomainNet. SH is an abbreviation for style-homogenized.
Figure 6: Performance change ∆SH before and after applying style-homogenized inference on DomainNet.
Figure 7: The sensitivity of τ and τC under various settings on DomainNet.

Appendix A, More Implementation Details (excerpt): In this section, we provide additional implementation details of SHED. To ensure a fair comparison, we follow most of the experimental setup of CLIPood. Given the relatively larger data volume of DomainNet, we train SHED for 20 epochs, whereas only 10 epochs are used for the other datasets. We train for 500 iterati…
Original abstract

Domain generalization aims to enhance model robustness against unseen domains with embedding distribution shifts. While large-scale vision-language models like CLIP exhibit strong generalization, their direct image-text embedding alignment suffers from inherent information asymmetry: images encode both class semantics and domain-specific styles, whereas text prompts primarily convey basic class cues. This asymmetry hinders generalization to novel domains in realistic scenarios. To address this, we propose Style-Homogenized Embedding alignment for Domain-generalization (SHED), a novel CLIP-based method that aligns style-homogenized embeddings instead of raw representations from encoders in CLIP. During training, SHED removes domain-specific style centroids from both image embeddings computed per source domains and text embeddings which are averaged across diverse prompt templates and stripped of a global centroid. For inference, considering the lack of target domain information, SHED projects diverse textual domain centroids into the visual space and aggregates predictions via membership weighting. Extensive experiments on five benchmarks show SHED achieves state-of-the-art performance, outperforming prior methods significantly (e.g., +4.0% on DomainNet vs. standard fine-tuning).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes SHED, a CLIP-based method for domain generalization that aligns style-homogenized embeddings rather than raw representations. During training, per-source-domain style centroids are subtracted from image embeddings and a global centroid is removed from averaged text embeddings across prompt templates. For inference without target-domain data, textual domain centroids are projected into visual space and predictions are aggregated via membership weighting. Experiments across five benchmarks report state-of-the-art results, including a +4.0% gain on DomainNet relative to standard fine-tuning.

Significance. If the central claims hold, the work provides a concrete mechanism to mitigate the information asymmetry between images (which carry style) and text prompts (which carry primarily class cues) in large vision-language models, leading to better robustness on unseen domains. The approach is notable for operating without target data at inference and for its empirical evaluation on multiple standard DG benchmarks. The absence of additional free parameters in the core alignment step and the use of existing CLIP encoders are strengths that support practical adoption if the gains prove reliable.

major comments (2)
  1. [§3.2] §3.2: The claim that subtracting estimated style centroids removes only domain-specific information while preserving class semantics is load-bearing for the central contribution. No diagnostic is reported (e.g., change in intra-class variance, class-prototype cosine similarity, or decision-boundary shift) to verify that the operation does not distort class-discriminative directions when source domains exhibit class-conditional style biases. This separability assumption directly affects whether the reported gains can be attributed to style homogenization rather than incidental effects.
  2. [§4.3] §4.3 and Table 2: The +4.0% improvement on DomainNet and consistent gains across the five benchmarks are presented without error bars, statistical significance tests, or full ablations isolating the contribution of centroid removal versus the membership-weighted inference step. These omissions make it difficult to assess the robustness of the SOTA claim and the necessity of each component.
minor comments (2)
  1. [§3.1] §3.1: The precise definition and computation of the global text centroid (after averaging across prompt templates) would benefit from an explicit equation to ensure reproducibility.
  2. [Figure 5] Figure 5: The visualization of homogenized embeddings would be clearer if the caption explicitly stated the projection settings and the meaning of the plotted axes.
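The diagnostics named in major comment 1 (intra-class variance and class-prototype cosine similarity, before and after centroid subtraction) could be computed along the following lines. This is a sketch on synthetic embeddings built as class prototype plus domain offset plus noise; the data-generating model, the helper names, and the balanced class-by-domain design are all assumptions made for illustration, not anything taken from the paper.

```python
import numpy as np

def intra_class_variance(emb, labels):
    """Mean squared distance of samples to their own class mean."""
    classes = np.unique(labels)
    total = 0.0
    for c in classes:
        e = emb[labels == c]
        total += ((e - e.mean(axis=0)) ** 2).sum(axis=1).mean()
    return total / len(classes)

def class_prototypes(emb, labels):
    """Class means after removing the global mean, one row per class."""
    centered = emb - emb.mean(axis=0)
    return np.stack([centered[labels == c].mean(axis=0) for c in np.unique(labels)])

def cosine_rows(a, b):
    """Row-wise cosine similarity between two equally shaped matrices."""
    num = (a * b).sum(axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return num / den

def subtract_domain_centroids(emb, domains):
    """Remove each domain's mean embedding from that domain's samples."""
    out = emb.copy()
    for d in np.unique(domains):
        out[domains == d] -= emb[domains == d].mean(axis=0)
    return out

# Synthetic embeddings: prototype + domain style offset + small noise.
rng = np.random.default_rng(0)
C, D, n, dim = 3, 2, 20, 16
protos = rng.normal(size=(C, dim))
styles = 2.0 * rng.normal(size=(D, dim))
labels = np.repeat(np.arange(C), D * n)
domains = np.tile(np.repeat(np.arange(D), n), C)
emb = protos[labels] + styles[domains] + 0.05 * rng.normal(size=(C * D * n, dim))

emb_sh = subtract_domain_centroids(emb, domains)
var_before = intra_class_variance(emb, labels)
var_after = intra_class_variance(emb_sh, labels)
proto_cos = cosine_rows(class_prototypes(emb, labels), class_prototypes(emb_sh, labels))
print(f"intra-class variance: {var_before:.3f} -> {var_after:.3f}")
print("prototype cosine (before vs. after):", np.round(proto_cos, 4))
```

Because every class appears equally often in every domain here, the subtraction leaves the centered class prototypes unchanged. The case the referee worries about, class-conditional style bias, corresponds to making that class-by-domain table unbalanced, at which point the prototype cosine can drop.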

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful comments and constructive feedback on our manuscript. We have carefully considered each point and made revisions to strengthen the paper accordingly. Below we provide point-by-point responses to the major comments.

Point-by-point responses
  1. Referee: [§3.2] §3.2: The claim that subtracting estimated style centroids removes only domain-specific information while preserving class semantics is load-bearing for the central contribution. No diagnostic is reported (e.g., change in intra-class variance, class-prototype cosine similarity, or decision-boundary shift) to verify that the operation does not distort class-discriminative directions when source domains exhibit class-conditional style biases. This separability assumption directly affects whether the reported gains can be attributed to style homogenization rather than incidental effects.

    Authors: We agree that verifying the preservation of class semantics after style centroid subtraction is crucial for validating the core mechanism. In the revised manuscript, we have added diagnostic analyses, including measurements of intra-class variance and class-prototype cosine similarity before and after the subtraction operation. These diagnostics demonstrate that class-discriminative information is preserved while domain-specific style information is reduced. We also include decision boundary visualizations to show minimal distortion in class separation. revision: yes

  2. Referee: [§4.3] §4.3 and Table 2: The +4.0% improvement on DomainNet and consistent gains across the five benchmarks are presented without error bars, statistical significance tests, or full ablations isolating the contribution of centroid removal versus the membership-weighted inference step. These omissions make it difficult to assess the robustness of the SOTA claim and the necessity of each component.

    Authors: We acknowledge the importance of reporting variability and conducting rigorous statistical analysis. In the revised version, we have included error bars representing standard deviations over multiple runs and performed statistical significance tests to confirm the improvements. We have also expanded the ablation studies to better isolate the contributions of the style centroid removal during training and the membership-weighted aggregation during inference, showing that each component is necessary for achieving the reported performance. revision: yes
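For the error bars and significance tests discussed in the second exchange, a paired comparison over matched random seeds is one common choice. The sketch below computes mean ± standard deviation and a paired t-statistic with NumPy only. The accuracy values are hypothetical placeholders chosen for easy arithmetic; they are not results from the paper, and the paper's actual test procedure is not stated on this page.

```python
import numpy as np

def mean_std(acc):
    """Mean and sample standard deviation (ddof=1) over runs."""
    acc = np.asarray(acc, dtype=float)
    return acc.mean(), acc.std(ddof=1)

def paired_t_statistic(baseline, method):
    """Paired t-statistic and degrees of freedom for matched seeds."""
    diff = np.asarray(method, dtype=float) - np.asarray(baseline, dtype=float)
    n = diff.size
    t = diff.mean() / (diff.std(ddof=1) / np.sqrt(n))
    return t, n - 1

# Hypothetical placeholder accuracies (%), one entry per seed. NOT paper results.
baseline_acc = [60.0, 61.0, 59.0, 60.5, 59.5]
method_acc = [62.0, 62.5, 61.5, 62.0, 61.0]

b_mean, b_std = mean_std(baseline_acc)
m_mean, m_std = mean_std(method_acc)
t_stat, dof = paired_t_statistic(baseline_acc, method_acc)
print(f"baseline: {b_mean:.2f} ± {b_std:.2f}")
print(f"method:   {m_mean:.2f} ± {m_std:.2f}")
print(f"paired t = {t_stat:.2f} with {dof} degrees of freedom")
```

Pairing by seed removes run-to-run variance shared by both methods, so it is usually more sensitive than comparing two independent means. The resulting statistic would then be checked against a t-distribution with `dof` degrees of freedom.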

Circularity Check

0 steps flagged

No significant circularity in SHED derivation or performance claims

Full rationale

The paper describes an algorithmic procedure for style-homogenized embedding alignment using CLIP encoders: per-domain centroid subtraction on image embeddings, global centroid removal on averaged text embeddings, and inference via projection of textual domain centroids with membership weighting. No equations, derivations, or self-citations are shown that reduce the reported benchmark gains (e.g., +4.0% on DomainNet) to quantities defined by construction from the authors' inputs or prior work. The method is presented as an empirical technique evaluated on standard external benchmarks, with performance arising from the explicit operations rather than tautological redefinition or fitted-parameter renaming. This qualifies as self-contained against external benchmarks, with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach assumes CLIP encoders already encode separable style and semantic information and that domain centroids can be reliably estimated from source data alone. No new free parameters beyond standard training choices are explicitly introduced in the abstract, and no new physical or mathematical entities are postulated.

axioms (2)
  • Domain assumption: CLIP image and text encoders produce embeddings in which style and class semantics are close enough to linearly separable for centroid subtraction to isolate domain effects.
    Invoked implicitly in the description of removing domain-specific style centroids from both image and text embeddings.
  • Domain assumption: Averaging text embeddings across diverse prompt templates yields a representation from which a global centroid can be stripped without losing class information.
    Stated in the training procedure for text embeddings.
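Written out, the two assumed operations could take the following form. The notation is introduced here for illustration and is not taken from the paper: $f$ and $g$ denote the CLIP image and text encoders, $p_m(c)$ is the $m$-th of $M$ prompt templates filled with class $c$, $C$ is the number of classes, and $\mathcal{D}_d$ is the set of training images from source domain $d$. Using plain means as the centroid estimators is itself an assumption.

```latex
% Image side: per-domain style centroid and homogenized embedding
\mu_d = \frac{1}{|\mathcal{D}_d|} \sum_{x \in \mathcal{D}_d} f(x),
\qquad
\tilde{z}(x) = f(x) - \mu_{d(x)}

% Text side: template average, global centroid, homogenized class embedding
\bar{t}_c = \frac{1}{M} \sum_{m=1}^{M} g\bigl(p_m(c)\bigr),
\qquad
\mu_T = \frac{1}{C} \sum_{c=1}^{C} \bar{t}_c,
\qquad
\tilde{t}_c = \bar{t}_c - \mu_T
```

Both axioms amount to the claim that $\mu_d$ and $\mu_T$ lie mostly in directions that do not separate classes, so that $\tilde{z}$ and $\tilde{t}_c$ retain the class-discriminative signal.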

pith-pipeline@v0.9.0 · 5716 in / 1430 out tokens · 35642 ms · 2026-05-19T20:20:02.015623+00:00 · methodology

