Dual-Foundation Models for Unsupervised Domain Adaptation
Pith reviewed 2026-05-08 01:29 UTC · model grok-4.3
The pith
Combining SAM with superpixel prompting and DINOv3 for prototypes improves unsupervised domain adaptation for semantic segmentation by addressing limits in pixel coverage and prototype stability.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose a dual-foundation UDA framework that leverages two complementary foundation models. First, we employ the Segment Anything Model (SAM) with superpixel-guided prompting to enable learning from a broader range of target pixels beyond high-confidence predictions. Second, we incorporate DINOv3 to construct stable, domain-invariant class prototypes through its robust representation learning.
What carries the argument
The dual-foundation UDA framework that pairs SAM superpixel-guided prompting for expanded target pixel supervision with DINOv3-derived domain-invariant class prototypes.
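The prototype half of this pairing can be pictured as per-class mean pooling of encoder features followed by nearest-prototype matching. The sketch below is a minimal illustration under stated assumptions, not the paper's construction: plain Python lists stand in for features from a frozen encoder (DINOv3 in the paper), and the names `class_prototypes`, `cosine`, and `nearest_class` are hypothetical.

```python
import math
from collections import defaultdict


def class_prototypes(features, labels):
    """Return one prototype per class: the mean of that class's feature vectors."""
    sums = {}
    counts = defaultdict(int)
    for vec, cls in zip(features, labels):
        if cls not in sums:
            sums[cls] = [0.0] * len(vec)
        sums[cls] = [s + v for s, v in zip(sums[cls], vec)]
        counts[cls] += 1
    return {cls: [s / counts[cls] for s in total] for cls, total in sums.items()}


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def nearest_class(vec, prototypes):
    """Assign a feature vector to the class whose prototype is most similar."""
    return max(prototypes, key=lambda cls: cosine(vec, prototypes[cls]))


# Toy 2-D features standing in for encoder outputs (assumed, not from the paper).
features = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.2, 0.8]]
labels = ["road", "road", "car", "car"]
protos = class_prototypes(features, labels)
print(nearest_class([0.7, 0.3], protos))
```

In a real pipeline the vectors would be dense per-pixel or per-patch embeddings handled by an array library; the lists here only keep the example dependency-free.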
Load-bearing premise
The method rests on the assumption that SAM prompted by superpixels can provide reliable guidance for learning on low-confidence target pixels and that DINOv3 features produce class prototypes that remain unbiased across the source and target domains without additional tuning.
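To make the first half of this premise concrete, here is one plausible way region masks could extend supervision to low-confidence pixels: confident pixels vote inside each region, and the winning label is copied to the region's uncertain pixels. This is a sketch under assumptions, not the paper's algorithm. The `region` IDs are a hypothetical stand-in for masks that SAM would return when prompted from superpixels, and `conf_thresh` and `purity_thresh` are illustrative values.

```python
from collections import Counter, defaultdict

IGNORE = -1  # label for pixels that receive no supervision


def propagate_labels(pseudo, conf, region, conf_thresh=0.9, purity_thresh=0.6):
    """Fill low-confidence pixels with the majority confident label of their region.

    pseudo, conf, region are flat, equally long per-pixel lists.
    """
    # Collect votes from confident pixels only.
    votes = defaultdict(Counter)
    for lab, c, r in zip(pseudo, conf, region):
        if c >= conf_thresh:
            votes[r][lab] += 1

    # Keep a region label only if the majority is pure enough.
    region_label = {}
    for r, counter in votes.items():
        lab, n = counter.most_common(1)[0]
        if n / sum(counter.values()) >= purity_thresh:
            region_label[r] = lab

    # Confident pixels keep their label; others inherit the region label or are ignored.
    return [
        lab if c >= conf_thresh else region_label.get(r, IGNORE)
        for lab, c, r in zip(pseudo, conf, region)
    ]


# Toy example: six pixels, three regions; region 2 has no confident pixel.
pseudo = [0, 0, 1, 1, 1, 0]
conf = [0.95, 0.50, 0.97, 0.92, 0.40, 0.30]
region = [0, 0, 1, 1, 1, 2]
refined = propagate_labels(pseudo, conf, region)
print(refined)
```

The premise fails exactly where this sketch does: if a region straddles two classes, or its confident pixels are themselves wrong, the propagated labels inherit the error.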
What would settle it
An ablation on the GTA-to-Cityscapes task that disables SAM's superpixel-guided prompting, or replaces the DINOv3 prototypes with source-initialized ones, and observes no drop in mIoU relative to the full method would show that these components are not responsible for the reported gains.
Original abstract
Semantic segmentation provides pixel-level scene understanding essential for autonomous driving and fine-grained perception tasks. However, training segmentation models requires costly, labor-intensive annotations on real-world datasets. Unsupervised Domain Adaptation (UDA) addresses this by training models on labeled synthetic data and adapting them to unlabeled real images. While conceptually simple, adaptation is challenging due to the domain gap, i.e., differences in visual appearance and scene structure between synthetic and real data. Prior approaches bridge this gap through pixel-level mixing or feature-level contrastive learning. Yet, these techniques suffer from two major limitations: (1) reliance on high-confidence pseudo-labels restricts learning to a subset of the target domain, and (2) prototype-based contrastive methods initialize class prototypes from source-trained models, yielding biased and unstable anchors during adaptation. To address these issues, we propose a dual-foundation UDA framework that leverages two complementary foundation models. First, we employ the Segment Anything Model (SAM) with superpixel-guided prompting to enable learning from a broader range of target pixels beyond high-confidence predictions. Second, we incorporate DINOv3 to construct stable, domain-invariant class prototypes through its robust representation learning. Our method achieves consistent improvements of +1.3% and +1.4% mIoU over strong UDA baselines on GTA-to-Cityscapes and SYNTHIA-to-Cityscapes, respectively.
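Since the reported gains are stated in mIoU, it may help to recall how that metric is computed: per-class intersection over union, averaged over the classes that occur. The sketch below assumes a confusion matrix whose rows are ground-truth classes and whose columns are predictions; the function name `mean_iou` and the toy matrix are illustrative and unrelated to the paper's numbers.

```python
def mean_iou(confusion):
    """Mean IoU from a square confusion matrix (rows: ground truth, cols: prediction)."""
    n = len(confusion)
    ious = []
    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                       # class-k pixels predicted as something else
        fp = sum(confusion[i][k] for i in range(n)) - tp  # other pixels predicted as class k
        denom = tp + fp + fn
        if denom > 0:  # skip classes absent from both ground truth and prediction
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


# Toy two-class confusion matrix (made up for illustration).
toy = [[8, 2],
       [1, 9]]
score = mean_iou(toy)
print(round(score, 4))
```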
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a dual-foundation model framework for unsupervised domain adaptation (UDA) in semantic segmentation. It employs the Segment Anything Model (SAM) with superpixel-guided prompting to expand pseudo-label learning beyond high-confidence target pixels and incorporates DINOv3 to derive stable, domain-invariant class prototypes for contrastive learning. The approach reports consistent mIoU gains of +1.3% on GTA-to-Cityscapes and +1.4% on SYNTHIA-to-Cityscapes over strong UDA baselines.
Significance. If the empirical gains hold under rigorous validation and the DINOv3 component is shown to deliver genuinely less biased prototypes than source-derived alternatives, the work provides a practical template for leveraging complementary foundation models to mitigate two persistent UDA limitations. The modest but positive improvements indicate incremental utility for driving-scene segmentation, with potential to influence subsequent foundation-model-assisted adaptation research provided ablations confirm complementarity of the two pillars.
Major comments (2)
- §3.2 (DINOv3 prototype construction): The central claim that DINOv3 yields 'stable, domain-invariant class prototypes' without any target-domain adaptation, fine-tuning, or explicit alignment step is load-bearing for the dual-framework contribution. If residual domain shift persists in the DINOv3 embedding space for classes such as vehicle or pedestrian, the resulting anchors remain biased in the same manner as source-initialized prototypes, reducing the method to the SAM superpixel component alone. The manuscript should supply either quantitative invariance metrics (e.g., prototype drift across domains) or an ablation replacing DINOv3 with source-derived prototypes to substantiate the claim.
- Table 1 (quantitative results): The reported +1.3% and +1.4% mIoU improvements are presented without standard deviations, multiple random seeds, or statistical significance tests. In the absence of these, it is impossible to determine whether the gains exceed implementation variance or hyper-parameter sensitivity, weakening the assertion of 'consistent improvements' over strong baselines.
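The "prototype drift" requested in the first major comment could be measured, for instance, as the cosine distance between a class's source-domain and target-domain prototypes. The sketch below is one possible formulation, not a metric defined in the manuscript. It assumes per-class prototypes have already been computed separately on each domain (target prototypes would need ground-truth labels or pseudo-labels for evaluation), and the names `cosine_similarity` and `prototype_drift` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def prototype_drift(source_protos, target_protos):
    """Per-class drift, defined here as 1 - cosine similarity of the two prototypes."""
    shared = sorted(set(source_protos) & set(target_protos))
    return {
        cls: 1.0 - cosine_similarity(source_protos[cls], target_protos[cls])
        for cls in shared
    }


# Toy 2-D prototypes (made up): "vehicle" is aligned, "pedestrian" has shifted.
source_protos = {"vehicle": [1.0, 0.0], "pedestrian": [0.0, 1.0]}
target_protos = {"vehicle": [1.0, 0.0], "pedestrian": [1.0, 1.0]}
drift = prototype_drift(source_protos, target_protos)
for cls, value in drift.items():
    print(cls, round(value, 4))
```

Running the same computation on source-trained features and on DINOv3 features would give the side-by-side comparison the comment asks for: lower drift for DINOv3 would support the invariance claim, comparable drift would not.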
Minor comments (3)
- Abstract: The phrase 'strong UDA baselines' should explicitly name the compared methods (e.g., DAFormer, HRDA) so readers can immediately gauge the strength of the reference points.
- Figure 1 (framework overview): The diagram would be clearer if arrows explicitly labeled the information flow from SAM superpixel prompts into the segmentation loss and from DINOv3 features into prototype computation.
- §4 (experimental protocol): The backbone architecture, training schedule, and hyper-parameter settings for both the segmentation network and the foundation-model components should be stated in a single consolidated table or paragraph for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. The feedback highlights important aspects for strengthening the empirical validation of our dual-foundation approach. We address each major comment below and commit to revisions that directly respond to the concerns raised.
- Referee: §3.2 (DINOv3 prototype construction): The central claim that DINOv3 yields 'stable, domain-invariant class prototypes' without any target-domain adaptation, fine-tuning, or explicit alignment step is load-bearing for the dual-framework contribution. If residual domain shift persists in the DINOv3 embedding space for classes such as vehicle or pedestrian, the resulting anchors remain biased in the same manner as source-initialized prototypes, reducing the method to the SAM superpixel component alone. The manuscript should supply either quantitative invariance metrics (e.g., prototype drift across domains) or an ablation replacing DINOv3 with source-derived prototypes to substantiate the claim.
  Authors: We agree that explicit validation of DINOv3's domain-invariance is necessary to support the dual-framework contribution. The manuscript motivates DINOv3 by its large-scale pretraining on diverse data, which we expect to yield more stable prototypes than source-only initialization. However, to directly address the concern, the revised version will add both an ablation replacing DINOv3 with source-derived prototypes and quantitative metrics (cosine similarity and drift between source/target embeddings for classes such as vehicle and pedestrian). These additions will clarify the incremental benefit of the DINOv3 component.
  Revision: yes
- Referee: Table 1 (quantitative results): The reported +1.3% and +1.4% mIoU improvements are presented without standard deviations, multiple random seeds, or statistical significance tests. In the absence of these, it is impossible to determine whether the gains exceed implementation variance or hyper-parameter sensitivity, weakening the assertion of 'consistent improvements' over strong baselines.
  Authors: We acknowledge that the current Table 1 reports single-run results without variability measures or significance testing. The experiments were performed with a fixed seed for reproducibility. In the revision we will rerun all methods with at least three random seeds, report mean mIoU together with standard deviations, and add paired statistical significance tests to confirm that the observed gains exceed typical implementation variance.
  Revision: yes
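The multi-seed protocol promised in the second response amounts to a paired comparison: for each seed, subtract the baseline mIoU from the method mIoU, then test whether the mean difference is distinguishable from zero. A minimal sketch follows, using only the Python standard library. The per-seed scores are placeholder numbers invented for illustration and are not results from the paper; `paired_t_statistic` is a hypothetical helper name.

```python
import math
import statistics


def paired_t_statistic(baseline, method):
    """Return (mean difference, std of differences, paired t statistic)."""
    if len(baseline) != len(method) or len(baseline) < 2:
        raise ValueError("need two equally long lists with at least two runs")
    diffs = [m - b for b, m in zip(baseline, method)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0.0:
        # Identical differences on every seed: t is unbounded (or 0 if no difference).
        t = 0.0 if mean_diff == 0.0 else math.copysign(math.inf, mean_diff)
        return mean_diff, sd_diff, t
    t = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return mean_diff, sd_diff, t


# Placeholder per-seed mIoU values (invented, NOT from the paper).
baseline_miou = [68.1, 68.4, 67.9]
method_miou = [69.5, 69.6, 69.3]
mean_diff, sd_diff, t_stat = paired_t_statistic(baseline_miou, method_miou)
print(f"mean gain = {mean_diff:.2f}, std = {sd_diff:.2f}, t = {t_stat:.1f}")
```

With three seeds the test has two degrees of freedom, so the two-sided 5% critical value is about 4.303. That high bar is one reason three runs is a minimum rather than a comfortable sample size.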
Circularity Check
No significant circularity; empirical claims rest on benchmarks
The paper introduces a dual-foundation UDA framework for semantic segmentation that combines SAM with superpixel-guided prompting and DINOv3-derived class prototypes. No equations, derivations, parameter fittings, or self-referential constructions appear in the abstract or described method. The reported gains (+1.3% and +1.4% mIoU on GTA-to-Cityscapes and SYNTHIA-to-Cityscapes) are presented as empirical outcomes rather than results forced by definition or prior self-citations. The premise that DINOv3 yields domain-invariant prototypes is an external modeling assumption, not a tautological reduction within the paper's own logic, leaving the derivation chain self-contained.