arxiv: 2604.10000 · v1 · submitted 2026-04-11 · 💻 cs.CV

Recognition: unknown

SwinTextUNet: Integrating CLIP-Based Text Guidance into Swin Transformer U-Nets for Medical Image Segmentation

Ashfak Yeafi, Parthaw Goswami, Md Khairul Islam, Ashifa Islam Shamme

Pith reviewed 2026-05-10 16:06 UTC · model grok-4.3

classification 💻 cs.CV

keywords medical image segmentation · CLIP · Swin Transformer · U-Net · multimodal fusion · text guidance · COVID-19 · cross-attention

The pith

CLIP text embeddings fused into Swin U-Net raise Dice score to 86.47 percent on QaTaCOV19 COVID-19 chest X-rays.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes SwinTextUNet, a framework that feeds CLIP-derived text embeddings into a Swin Transformer U-Net backbone. Cross-attention and convolutional fusion modules align the semantic text signals with the network's hierarchical visual features. The goal is to give the model extra guidance when visual patterns are ambiguous or low-contrast. On the QaTaCOV19 dataset the four-stage version reaches 86.47 percent Dice and 78.2 percent IoU while keeping complexity moderate. Ablation experiments indicate that removing the text path lowers performance, supporting the claim that multimodal fusion adds value.

Core claim

SwinTextUNet incorporates CLIP text embeddings into the Swin Transformer U-Net by means of cross-attention layers and convolutional fusion blocks; this alignment of textual semantics with multi-scale visual representations produces more accurate segmentation masks than the visual backbone alone.

What carries the argument

Cross-attention module that injects CLIP text embeddings into the encoder-decoder stages of the Swin U-Net to supply semantic guidance.
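A minimal sketch of the mechanism described above, assuming standard single-head scaled dot-product attention with vision tokens as queries and text tokens as keys and values. The function name, the residual connection, the projection matrices, and all dimensions (49 vision tokens of width 96, 8 text tokens of width 512) are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(vision_tokens, text_tokens, w_q, w_k, w_v):
    """Vision tokens (queries) attend to text tokens (keys and values)."""
    q = vision_tokens @ w_q                   # (N_v, d_k)
    k = text_tokens @ w_k                     # (N_t, d_k)
    v = text_tokens @ w_v                     # (N_t, d_v)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (N_v, N_t)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    # Residual connection (assumed): text context is added to the visual features.
    return vision_tokens + weights @ v, weights

# Hypothetical sizes chosen only for the demonstration.
rng = np.random.default_rng(0)
n_v, n_t, d_v, d_t, d_k = 49, 8, 96, 512, 64
vision = rng.standard_normal((n_v, d_v))
text = rng.standard_normal((n_t, d_t))
w_q = rng.standard_normal((d_v, d_k)) * 0.02
w_k = rng.standard_normal((d_t, d_k)) * 0.02
w_v = rng.standard_normal((d_t, d_v)) * 0.02

refined, weights = cross_attention(vision, text, w_q, w_k, w_v)
print(refined.shape, weights.shape)  # (49, 96) (49, 8)
```

The refined tokens keep the shape of the visual input, so a block like this could in principle be dropped between stages of an encoder-decoder without changing the surrounding layers.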

If this is right

The model handles low-contrast and ambiguous regions in medical images more reliably than visual-only networks.
A four-stage architecture strikes a practical trade-off between accuracy and computational cost.
Ablation results show that both text guidance and multimodal fusion contribute measurably to the reported scores.
Vision-language integration offers a route to more clinically useful segmentation tools.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same fusion pattern could be tried on CT or MRI datasets to test whether the gains generalize.
Alternative text encoders or prompt strategies might yield further improvements without changing the visual backbone.
The approach suggests a template for adding language guidance to other transformer segmentation models.

Load-bearing premise

CLIP text embeddings carry useful semantic information that improves segmentation accuracy beyond what the Swin U-Net visual features alone can achieve on the QaTaCOV19 data.

What would settle it

Train an identical four-stage Swin U-Net on QaTaCOV19 without any CLIP text input or cross-attention and measure whether Dice and IoU stay within a few points of 86.47 percent and 78.2 percent.
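Both numbers in that comparison are overlap metrics between a predicted and a ground-truth binary mask. A minimal sketch of how they are commonly computed per image follows; the function name `dice_and_iou`, the `eps` smoothing term, and the toy masks are illustrative assumptions, and the paper's exact averaging protocol is not specified here.

```python
import numpy as np

def dice_and_iou(pred, target, eps=1e-7):
    """Return (Dice, IoU) for two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    # eps avoids division by zero when both masks are empty.
    dice = (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)
    iou = (inter + eps) / (union + eps)
    return float(dice), float(iou)

# Toy 4x4 masks: 4 predicted pixels, 6 true pixels, 4 shared.
pred = np.array([[1, 1, 0, 0],
                 [1, 1, 0, 0],
                 [0, 0, 0, 0],
                 [0, 0, 0, 0]])
target = np.array([[1, 1, 1, 0],
                   [1, 1, 1, 0],
                   [0, 0, 0, 0],
                   [0, 0, 0, 0]])

dice, iou = dice_and_iou(pred, target)
print(round(dice, 4), round(iou, 4))  # 0.8 0.6667
```

Running the same function over the test set for the text-guided model and for the visual-only baseline, then differencing the means, would give the deltas the proposed experiment asks for.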

Figures

Figures reproduced from arXiv: 2604.10000 by Ashfak Yeafi, Ashifa Islam Shamme, Md Khairul Islam, Parthaw Goswami.

**Figure 2.** Cross-Attention block. Vision tokens attend to text tokens, refining …

**Figure 4.** Overall architecture of SwinTextUNet. The model integrates CLIP-based text embeddings with a hierarchical Swin U-Net through cross-attention and …

**Figure 5.** Representative samples from the QaTa-COV19 dataset across training, …

**Figure 7.** Representative qualitative segmentation results. Each triplet shows (a) …

**Figure 6.** Training and validation loss curves of SwinTextUNet.

Original abstract

Precise medical image segmentation is fundamental for enabling computer aided diagnosis and effective treatment planning. Traditional models that rely solely on visual features often struggle when confronted with ambiguous or low contrast patterns. To overcome these limitations, we introduce SwinTextUNet, a multimodal segmentation framework that incorporates Contrastive Language Image Pretraining (CLIP), derived textual embeddings into a Swin Transformer UNet backbone. By integrating cross attention and convolutional fusion, the model effectively aligns semantic text guidance with hierarchical visual representations, enhancing robustness and accuracy. We evaluate our approach on the QaTaCOV19 dataset, where the proposed four stage variant achieves an optimal balance between performance and complexity, yielding Dice and IoU scores of 86.47% and 78.2%, respectively. Ablation studies further validate the importance of text guidance and multimodal fusion. These findings underscore the promise of vision language integration in advancing medical image segmentation and supporting clinically meaningful diagnostic tools.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SwinTextUNet bolts CLIP text embeddings onto a Swin U-Net via cross-attention and reports 86.47 Dice on QaTaCOV19, but the added value of the text branch over the visual backbone alone is not clearly quantified.

The paper's main move is a four-stage Swin U-Net that pulls in CLIP-derived text embeddings through cross-attention and convolutional fusion modules. They test it on QaTaCOV19 and get Dice 86.47% and IoU 78.2% for the best variant, with ablations that they say confirm the text guidance helps. That setup is a direct, workable extension of existing multimodal techniques rather than a new framework. The architecture description is clear and the decision to keep it to four stages for a performance-complexity balance makes sense for practical use. Credit for running ablations at all; too many papers skip that step.

The soft spots sit in the evaluation. The headline numbers are given without error bars, run-to-run variance, or a direct head-to-head against the identical Swin U-Net without the text branch, so it is hard to judge whether the multimodal part is doing real work or just riding along. Training protocol details are light in the abstract, and there is no discussion of how well CLIP's natural-image pretraining transfers to medical text cues. Those gaps do not sink the paper, but they make the central claim rest more on the reported scores than on airtight evidence.

This is the sort of incremental multimodal tweak that might interest someone already working on medical segmentation who wants a concrete starting point for adding language signals. It is not going to shift the field, but a reader looking for practical implementation ideas could pull useful pieces from it.

I would send it to peer review. The results are concrete, the ablations exist, and referees can check whether the text guidance actually moves the needle once the full tables and baselines are in front of them.

Referee Report

2 major / 3 minor

Summary. The paper introduces SwinTextUNet, a multimodal medical image segmentation framework that fuses CLIP-derived text embeddings into a Swin Transformer U-Net backbone via cross-attention and convolutional modules. On the QaTaCOV19 dataset the four-stage variant is reported to achieve Dice 86.47% and IoU 78.2%, with ablation studies presented to demonstrate the value of the text-guidance and multimodal-fusion components.

Significance. If the reported gains are reproducible, the work provides concrete evidence that language-derived semantic guidance can improve segmentation accuracy on low-contrast chest X-ray images beyond a pure visual Swin U-Net backbone. The inclusion of ablation studies is a strength that helps isolate the contribution of the text branch and supports the broader claim that vision-language integration is useful for medical segmentation tasks.

major comments (2)

[§4] Experimental protocol (optimizer, learning-rate schedule, batch size, data splits, and augmentation) is not described in sufficient detail to allow independent verification of the headline Dice/IoU numbers or fair comparison with the cited baselines.
[Table 2] Table 2 (ablation results): while the text states that ablations confirm the utility of text guidance, the table does not report the exact Dice/IoU delta between the full model and the identical Swin U-Net backbone without the CLIP branch, making the marginal benefit of the multimodal fusion difficult to quantify.

minor comments (3)

[Abstract] Abstract: 'Contrastive Language Image Pretraining' should be written with a hyphen ('Pre-training') for standard terminology consistency.
[Figure 1] Figure 1: the architecture diagram would be clearer if the text-embedding pathway and the precise locations of the cross-attention blocks were labeled with arrows or call-outs.
[§3.2] §3.2: the cross-attention equations are described in prose but lack an explicit mathematical formulation (e.g., definitions of Q, K, V projections from text and visual tokens), which would aid reproducibility.
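As an indication of what such a formulation could look like, the standard scaled dot-product form with vision tokens as queries and text tokens as keys and values is given below. The notation is assumed for illustration and is not taken from the paper: X denotes the vision tokens, T the text tokens, and W_Q, W_K, W_V learned projections; the residual connection in the last line is likewise an assumption.

```latex
% Assumed notation (not from the paper):
%   X \in \mathbb{R}^{N_v \times d_v}  vision tokens
%   T \in \mathbb{R}^{N_t \times d_t}  text tokens
%   W_Q \in \mathbb{R}^{d_v \times d_k},\;
%   W_K \in \mathbb{R}^{d_t \times d_k},\;
%   W_V \in \mathbb{R}^{d_t \times d_v}
\begin{aligned}
Q &= X W_Q, \qquad K = T W_K, \qquad V = T W_V, \\
A &= \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) \in \mathbb{R}^{N_v \times N_t}, \\
X' &= X + A V .
\end{aligned}
```

Each row of A sums to one and weights the text tokens for a single vision token, so X' has the same shape as X.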

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive summary, recognition of the ablation studies, and recommendation for minor revision. The comments are constructive and we address each major point below, indicating the revisions we will implement.

Referee: [§4] Experimental protocol (optimizer, learning-rate schedule, batch size, data splits, and augmentation) is not described in sufficient detail to allow independent verification of the headline Dice/IoU numbers or fair comparison with the cited baselines.

Authors: We agree that the experimental protocol in Section 4 lacks sufficient detail for full reproducibility. In the revised manuscript we will expand this section to specify the optimizer (AdamW), learning-rate schedule, batch size, train/validation/test splits, and all augmentation techniques. These additions will enable independent verification of the reported Dice and IoU scores and support fair comparisons with the baselines.

revision: yes

Referee: [Table 2] Table 2 (ablation results): while the text states that ablations confirm the utility of text guidance, the table does not report the exact Dice/IoU delta between the full model and the identical Swin U-Net backbone without the CLIP branch, making the marginal benefit of the multimodal fusion difficult to quantify.

Authors: We acknowledge that the marginal benefit would be clearer if the exact deltas were reported. In the revised manuscript we will update Table 2 to include the performance of the Swin U-Net backbone without the CLIP branch and will explicitly state the Dice and IoU deltas achieved by the full multimodal model. This will better quantify the contribution of the text-guidance and fusion components.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical model evaluation is self-contained

The paper proposes an architectural integration of CLIP-derived text embeddings into a Swin Transformer U-Net via cross-attention and convolutional fusion modules, then reports empirical Dice and IoU scores plus ablation results on the QaTaCOV19 dataset. No derivation chain, uniqueness theorem, or mathematical reduction is claimed; the central assertions rest on experimental performance numbers and design choices that are directly tested rather than defined in terms of their own outputs or fitted parameters. No self-citation load-bearing steps or ansatz smuggling appear in the provided text, so the evaluation remains independent of its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the assumption that CLIP embeddings transfer useful semantic information to medical images and that the chosen fusion mechanism is effective; no explicit free parameters or invented entities are stated in the abstract.

axioms (1)

domain assumption: CLIP text embeddings provide semantically relevant guidance for medical image segmentation tasks
Invoked implicitly when claiming that text guidance enhances robustness on ambiguous patterns.

pith-pipeline@v0.9.0 · 5480 in / 1281 out tokens · 61247 ms · 2026-05-10T16:06:06.491948+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

KIRA: Knowledge-Intensive Image Retrieval and Reasoning Architecture for Specialized Visual Domains
cs.CV · 2026-04 · unverdicted · novelty 7.0

KIRA is a unified architecture for visual RAG that reports 0.97 retrieval precision, 1.0 grounding, and 0.707 domain correctness across medical, circuit, satellite, and histopathology domains via hierarchical chunking...
