TCSA-UDA: Text-Driven Cross-Semantic Alignment for Unsupervised Domain Adaptation in Medical Image Segmentation
Pith reviewed 2026-05-17 23:58 UTC · model grok-4.3
The pith
Textual class descriptions can align visual features across CT and MRI to reduce domain shift in medical image segmentation without target labels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TCSA-UDA uses domain-invariant textual class descriptions to guide visual representation learning. A vision-language covariance cosine loss aligns image encoder features with inter-class textual semantic relations, and a prototype alignment module matches class-wise pixel-level feature distributions across domains. Together, these are claimed to reduce residual category-level discrepancies in cross-modality cardiac, abdominal, and brain tumor segmentation.
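As a rough illustration of the prototype alignment idea in the core claim, the sketch below computes one mean feature vector (a prototype) per class in each domain and penalizes the squared distance between matching prototypes. It is a minimal sketch under stated assumptions, not the paper's formulation: the function names, the use of plain class means, the squared Euclidean distance, and the use of pseudo-labels for the unlabeled target domain are all hypothetical choices made for illustration.

```python
def class_prototypes(features, labels, num_classes):
    """Return the mean feature vector per class (None if a class has no pixels).

    features: list of per-pixel feature vectors (lists of floats)
    labels:   list of integer class ids, one per pixel
              (ground truth for the source domain; ASSUMED pseudo-labels
              for the target domain, since it has no annotations)
    """
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for feat, cls in zip(features, labels):
        counts[cls] += 1
        for d in range(dim):
            sums[cls][d] += feat[d]
    return [
        [s / counts[k] for s in sums[k]] if counts[k] else None
        for k in range(num_classes)
    ]


def prototype_alignment_loss(src_protos, tgt_protos):
    """Mean squared Euclidean distance between matching class prototypes.

    Classes missing in either domain are skipped (an assumption).
    """
    dists = []
    for ps, pt in zip(src_protos, tgt_protos):
        if ps is None or pt is None:
            continue
        dists.append(sum((a - b) ** 2 for a, b in zip(ps, pt)))
    return sum(dists) / len(dists) if dists else 0.0
```

In a training loop, a term like this would be minimized alongside the segmentation loss so that same-class features from CT and MRI are pulled toward a shared location in feature space.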
What carries the argument
The vision-language covariance cosine loss that directly aligns image encoder features with inter-class textual semantic relations to produce modality-invariant representations.
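One way to picture a loss of this kind is sketched below. It builds an inter-class relation matrix (a centered Gram matrix, used here as a stand-in for covariance) from class-level visual features and another from class text embeddings, then penalizes one minus the cosine similarity between the two flattened matrices. The page does not give the paper's equation, so everything here is an assumption: the function names, the use of class-level visual prototypes as input, and the exact definition of the relation matrix are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def relation_matrix(class_embeddings):
    """K x K inter-class relation matrix from K class embeddings.

    ASSUMPTION: embeddings are centered across classes and compared by
    dot product, which gives a covariance-like matrix between classes.
    """
    k = len(class_embeddings)
    dim = len(class_embeddings[0])
    mean = [sum(e[d] for e in class_embeddings) / k for d in range(dim)]
    centered = [[e[d] - mean[d] for d in range(dim)] for e in class_embeddings]
    return [
        [sum(a * b for a, b in zip(ci, cj)) / dim for cj in centered]
        for ci in centered
    ]


def covariance_cosine_loss(visual_class_feats, text_class_embeds):
    """1 - cosine similarity between visual and textual relation matrices."""
    rel_v = relation_matrix(visual_class_feats)
    rel_t = relation_matrix(text_class_embeds)
    flat_v = [x for row in rel_v for x in row]
    flat_t = [x for row in rel_t for x in row]
    return 1.0 - cosine(flat_v, flat_t)
```

Because both relation matrices are K x K, the visual and text embeddings can have different dimensions in this sketch; only the number of classes must match.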
If this is right
- Segmentation accuracy on target-domain MRI improves when only CT source labels and text descriptions are available.
- The same text-guided alignment produces gains on abdominal and brain tumor cross-modality tasks.
- Residual category-level domain gaps are reduced beyond what image-only alignment achieves.
- The method consistently exceeds prior state-of-the-art unsupervised domain adaptation performance on the tested benchmarks.
Where Pith is reading between the lines
- The same text-alignment principle could be tested on ultrasound or PET by writing equivalent class descriptions.
- Hospitals might maintain one segmentation model across scanner vendors if text prompts are standardized.
- Automatically deriving the textual descriptions from radiology textbooks or ontologies would remove the need for manual prompt engineering.
Load-bearing premise
Domain-invariant textual class descriptions exist and can reliably guide visual representation learning across modalities.
What would settle it
Replacing the chosen domain-invariant text prompts with random or modality-specific prompts and checking whether the reported gains over baseline UDA methods disappear on the same cross-modality benchmarks.
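The check described above can be scored with a few lines of code. The sketch below computes Dice overlap for binary masks and the mean per-case gain of one run over a baseline; running it once with the chosen prompts and once with random or modality-specific prompts gives the two gains to compare. The function names and the flat 0/1 mask representation are assumptions for illustration, and no numbers from the paper are used.

```python
def dice(pred, truth):
    """Dice overlap between two binary masks given as flat 0/1 sequences.

    Returns 1.0 when both masks are empty (a common convention, assumed here).
    """
    intersection = sum(p * t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    return 1.0 if total == 0 else 2.0 * intersection / total


def mean_gain(method_scores, baseline_scores):
    """Mean per-case Dice gain of a method over a baseline (paired by case)."""
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    return sum(diffs) / len(diffs)
```

If the gain measured with random or modality-specific prompts is close to the gain measured with the chosen prompts, the text semantics are probably not what drives the improvement.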
Original abstract
Unsupervised domain adaptation for medical image segmentation remains a significant challenge due to substantial domain shifts across imaging modalities, such as CT and MRI. While recent vision-language representation learning methods have shown promise, their potential in UDA segmentation tasks remains underexplored. To address this gap, we propose TCSA-UDA, a Text-driven Cross-Semantic Alignment framework that leverages domain-invariant textual class descriptions to guide visual representation learning. Our approach introduces a vision-language covariance cosine loss to directly align image encoder features with inter-class textual semantic relations, encouraging semantically meaningful and modality-invariant feature representations. Additionally, we incorporate a prototype alignment module that aligns class-wise pixel-level feature distributions across domains using high-level semantic prototypes. This mitigates residual category-level discrepancies and enhances cross-modal consistency. Extensive experiments on challenging cross-modality cardiac, abdominal, and brain tumor segmentation benchmarks demonstrate that our TCSA-UDA framework significantly reduces domain shift and consistently outperforms state-of-the-art UDA methods, establishing a new paradigm for integrating language-driven semantics into domain-adaptive medical image analysis.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes TCSA-UDA, a Text-driven Cross-Semantic Alignment framework for unsupervised domain adaptation in medical image segmentation. It uses domain-invariant textual class descriptions to guide visual representation learning through a vision-language covariance cosine loss that aligns image encoder features with inter-class textual semantic relations, plus a prototype alignment module to align class-wise pixel-level feature distributions across domains. The authors report that experiments on cross-modality cardiac, abdominal, and brain tumor segmentation benchmarks show the framework significantly reduces domain shift and consistently outperforms state-of-the-art UDA methods.
Significance. If the claimed outperformance holds with substantial and statistically validated gains, the work would be significant for introducing language-driven semantics into UDA for medical imaging, offering a new direction to handle cross-modality shifts (e.g., CT-MRI) where purely visual methods often fall short. The combination of covariance alignment and prototype matching is a plausible extension of vision-language techniques, and reproducible code or parameter-free elements would strengthen its contribution.
major comments (2)
- [Abstract] Abstract: the central claim of consistent outperformance and significant domain-shift reduction is presented without any quantitative deltas, ablation results, or mention of statistical testing. This absence is load-bearing because the empirical superiority over prior UDA methods cannot be assessed from the given description alone.
- [Method] Method description (vision-language covariance cosine loss and prototype alignment): the approach assumes fixed textual class descriptions encode modality-robust semantics that remain invariant under large intensity and contrast differences between CT and MRI. No sensitivity analysis to prompt choice or physics-based justification is referenced, which directly affects whether the alignment losses can produce the claimed modality-invariant representations.
minor comments (2)
- [Abstract] Abstract: consider including at least one key quantitative result (e.g., Dice improvement on a benchmark) or a pointer to the main results table to make the performance claim concrete.
- Notation: the terms 'vision-language covariance cosine loss' and 'high-level semantic prototypes' would benefit from an explicit equation or diagram reference on first use to improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and describe the revisions we will incorporate to strengthen the presentation of our contributions.
Point-by-point responses
- Referee: [Abstract] Abstract: the central claim of consistent outperformance and significant domain-shift reduction is presented without any quantitative deltas, ablation results, or mention of statistical testing. This absence is load-bearing because the empirical superiority over prior UDA methods cannot be assessed from the given description alone.
  Authors: We agree that the abstract would be strengthened by including concrete quantitative evidence. In the revised manuscript, we will update the abstract to report specific performance deltas (e.g., average Dice score improvements over the strongest baseline across the three benchmarks) and note that statistical significance was assessed via paired tests. This change directly addresses the concern while preserving the abstract's brevity.
  Revision: yes
- Referee: [Method] Method description (vision-language covariance cosine loss and prototype alignment): the approach assumes fixed textual class descriptions encode modality-robust semantics that remain invariant under large intensity and contrast differences between CT and MRI. No sensitivity analysis to prompt choice or physics-based justification is referenced, which directly affects whether the alignment losses can produce the claimed modality-invariant representations.
  Authors: We appreciate the referee's point on the underlying assumption. The class descriptions are drawn from standard medical terminology to emphasize anatomical and pathological properties rather than modality-specific appearance. To address the gap, we will add a sensitivity analysis to prompt variations (reported in the supplement) and a concise justification in the method section explaining the expected robustness to intensity/contrast shifts. These additions will better support the design of the covariance and prototype alignment losses.
  Revision: yes
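The rebuttal mentions paired tests without naming one. As one possible instance, the sketch below runs an exact two-sided sign-flip permutation test on per-case Dice differences between two methods. The choice of this particular test and the function name are assumptions, not something the authors state, and the exhaustive enumeration is only practical for a small number of cases.

```python
from itertools import product


def paired_sign_flip_pvalue(scores_a, scores_b):
    """Exact two-sided sign-flip permutation test on paired differences.

    Under the null hypothesis the sign of each paired difference is
    arbitrary, so every sign assignment is enumerated (2**n of them) and
    the share with a summed difference at least as extreme as the
    observed one is returned as the p-value.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        if stat >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
    return extreme / total
```

With many test cases, a sampled permutation test or a standard paired test from a statistics library would be the more practical route.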
Circularity Check
Derivation is self-contained; new losses and modules defined independently without reduction to inputs
Full rationale
The TCSA-UDA framework defines its core components—the vision-language covariance cosine loss for aligning image features with textual semantic relations and the prototype alignment module for class-wise feature distributions—directly from standard feature extraction and cross-modal alignment principles. These constructions use explicit mathematical formulations based on encoder outputs, inter-class textual descriptions, and high-level prototypes without any fitted parameters from target-domain data being repurposed as predictions, nor any self-definitional loops where outputs are presupposed in the inputs. No load-bearing self-citations or uniqueness theorems from the authors' prior work are invoked to justify the central alignment mechanism; the approach extends existing UDA techniques with independently specified text-driven terms. The experimental claims rest on benchmark comparisons rather than circular derivations, making the chain self-contained against external validation.
Axiom & Free-Parameter Ledger
free parameters (1)
- loss weighting coefficients
axioms (1)
- domain assumption: Textual class descriptions are domain-invariant and semantically meaningful for guiding visual features
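The loss weighting coefficients listed in the ledger would typically enter a combined training objective of the following shape. This is a generic sketch, not the paper's equation: the symbols for the segmentation loss, the vision-language covariance cosine loss, the prototype alignment loss, and the two weights are all assumed names.

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{seg}}
  + \lambda_{\text{vl}} \, \mathcal{L}_{\text{vl}}
  + \lambda_{\text{proto}} \, \mathcal{L}_{\text{proto}},
\qquad \lambda_{\text{vl}},\ \lambda_{\text{proto}} \ge 0
```

Under this reading, the weights are the free parameters: they must be chosen by the authors, and reported results may be sensitive to them.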
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "introduces a vision–language covariance cosine loss to directly align image encoder features with inter-class textual semantic relations"
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean, alpha_pin_under_high_calibration
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "prototype alignment module that aligns class-wise pixel-level feature distributions across domains"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.