arxiv: 2605.15720 · v1 · pith:T65WQ7P3new · submitted 2026-05-15 · 💻 cs.CV · cs.LG

Semi-MedRef: Semi-Supervised Medical Referring Image Segmentation with Cross-Modal Alignment

Pith reviewed 2026-05-20 19:36 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords semi-supervised learning · medical image segmentation · referring expression segmentation · cross-modal alignment · data augmentation · contrastive learning · COVID-19 chest scans

The pith

Semi-MedRef keeps medical images aligned with their text descriptions during semi-supervised training by synchronizing augmentations across modalities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Medical referring image segmentation assigns pixel masks to match free-form text descriptions of anatomical locations, but creating these paired labels is expensive. The paper introduces Semi-MedRef, a teacher-student framework that adds three targeted components to let unlabeled medical scans contribute without breaking the image-text link. T-PatchMix mixes image patches only when the referring text can be adjusted consistently, PosAug alters anatomical phrases to match spatial changes, and ITCL pulls matching image-text pairs closer using position-based pseudo-labels. If these steps preserve alignment, the approach yields higher segmentation accuracy than either fully supervised models or standard semi-supervised baselines when only a small fraction of the data carries labels.
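
The teacher-student loop described above can be illustrated with a minimal sketch. The summary does not say how the teacher is updated, so this assumes the exponential-moving-average (EMA) rule that is common in teacher-student SSL; `ema_update`, `momentum`, and the toy weight dictionaries are hypothetical names, not taken from the paper.

```python
def ema_update(teacher, student, momentum=0.99):
    """Return new teacher weights as an EMA of the student weights.

    ASSUMPTION: EMA teacher updates are a common choice in teacher-student
    SSL; the summary above does not confirm that the paper uses them.
    Weights are plain dicts mapping parameter names to floats.
    """
    if teacher.keys() != student.keys():
        raise ValueError("teacher and student must share parameter names")
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }


# Toy example: the teacher moves 10% of the way toward the student.
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 1.0}
teacher = ema_update(teacher, student, momentum=0.9)
```

With a high momentum the teacher changes slowly, which is why its predictions on unlabeled scans are usually treated as the more stable pseudo-label source.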

Core claim

The central claim is that a teacher-student semi-supervised framework for medical referring image segmentation can maintain reliable cross-modal consistency under strong augmentation by replacing generic perturbations with T-PatchMix (position-constrained patch mixing synchronized to referring expressions), PosAug (position-aware text masking or fuzzing), and ITCL (position-guided image-text contrastive learning that builds soft anatomical positives).
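
As a rough illustration of what "position-aware text masking or fuzzing" could look like, the sketch below masks positional words or drops the vertical qualifier so the phrase becomes coarser. The vocabulary, the `[MASK]` token, the `mask_prob` parameter, and this particular reading of "fuzzing" are all assumptions for illustration; the paper's actual PosAug rules are not given in this summary.

```python
import random
import re

# Hypothetical positional vocabulary; the paper's real phrase list is not shown here.
POSITION_WORDS = ("upper", "middle", "lower", "left", "right")
_POSITION_RE = re.compile(r"\b(" + "|".join(POSITION_WORDS) + r")\b", re.IGNORECASE)
# ASSUMPTION: "fuzzing" is read as removing the vertical qualifier.
_VERTICAL_RE = re.compile(r"\b(upper|middle|lower)\s+", re.IGNORECASE)


def pos_aug(text, mode, rng, mask_prob=0.5, mask_token="[MASK]"):
    """Apply a position-aware text perturbation.

    mode="mask": replace each positional word with mask_token with
                 probability mask_prob.
    mode="fuzz": drop vertical qualifiers ("upper left lung" -> "left lung").
    """
    if mode == "mask":
        return _POSITION_RE.sub(
            lambda m: mask_token if rng.random() < mask_prob else m.group(0),
            text,
        )
    if mode == "fuzz":
        return _VERTICAL_RE.sub("", text)
    raise ValueError("mode must be 'mask' or 'fuzz'")


rng = random.Random(0)
fuzzed = pos_aug("Infection in the upper left lung.", "fuzz", rng)
masked = pos_aug("Infection in the upper left lung.", "mask", rng, mask_prob=1.0)
```

Both perturbations leave the non-positional part of the sentence intact, so the text still refers to the same finding while the location cue is weakened.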

What carries the argument

T-PatchMix, a cross-modal CutMix variant that applies patch mixing to images only under position-constrained and probability-driven rules that keep the referring text coherent.
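
One way to picture "position-constrained and probability-driven" mixing is the toy sketch below: a patch is only ever pasted into the same image region it came from, and the text for that region is swapped along with it. The half-image split, the `left_half`/`right_half` text structure, and `mix_prob` are hypothetical simplifications; the actual T-PatchMix rules are defined in the paper, not here.

```python
import random


def t_patchmix_sketch(img_a, txt_a, img_b, txt_b, rng, mix_prob=0.5):
    """Mix the right half of image B into image A and update the text to match.

    img_*: 2D lists (rows of pixel values) with identical shapes.
    txt_*: dicts with "left_half" and "right_half" phrases (hypothetical).
    With probability 1 - mix_prob the sample is returned unchanged.
    """
    same_rows = len(img_a) == len(img_b)
    if not same_rows or any(len(ra) != len(rb) for ra, rb in zip(img_a, img_b)):
        raise ValueError("images must have the same shape")
    if rng.random() >= mix_prob:
        return img_a, txt_a  # probability-driven: skip mixing this time
    half = len(img_a[0]) // 2
    # Position-constrained: B's right half stays on the right side of the result.
    mixed_img = [ra[:half] + rb[half:] for ra, rb in zip(img_a, img_b)]
    # Keep the text coherent: each half is described by the image it came from.
    mixed_txt = {"left_half": txt_a["left_half"], "right_half": txt_b["right_half"]}
    return mixed_img, mixed_txt


img_a = [[1, 1, 1, 1], [1, 1, 1, 1]]
img_b = [[2, 2, 2, 2], [2, 2, 2, 2]]
txt_a = {"left_half": "upper zone infected", "right_half": "lower zone infected"}
txt_b = {"left_half": "no finding", "right_half": "middle zone infected"}
mixed_img, mixed_txt = t_patchmix_sketch(
    img_a, txt_a, img_b, txt_b, random.Random(0), mix_prob=1.0
)
```

Plain CutMix would paste a patch at an arbitrary location and leave the text untouched, which is the kind of image-text mismatch the synchronized version is meant to avoid.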

If this is right

  • The method outperforms both fully supervised and existing semi-supervised baselines on QaTa-COV19 and MosMedData+ in every tested label regime.
  • CutMix-style mixing becomes usable in multi-modal medical settings once it is synchronized with referring expressions.
  • Position-guided contrastive learning strengthens medically relevant image-text pairs even when most training examples lack masks.
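
The idea of "soft anatomical positives" can be sketched as a contrastive loss whose targets are graded rather than one-hot. The version below is an illustration under stated assumptions: positional pseudo-labels are represented as sets of position words, soft targets come from Jaccard overlap, and only the image-to-text direction is computed. None of these choices is confirmed by the summary, and the function names are hypothetical.

```python
import math


def soft_targets(position_sets):
    """Row-normalised Jaccard overlap between positional pseudo-labels."""
    rows = []
    for i, set_i in enumerate(position_sets):
        row = []
        for j, set_j in enumerate(position_sets):
            if i == j:
                row.append(1.0)  # a pair is always its own positive
                continue
            union = set_i | set_j
            row.append(len(set_i & set_j) / len(union) if union else 0.0)
        total = sum(row)
        rows.append([value / total for value in row])
    return rows


def soft_contrastive_loss(sim, targets, temperature=0.1):
    """Cross-entropy between softmax(sim / temperature) and the soft targets."""
    loss = 0.0
    for sim_row, target_row in zip(sim, targets):
        logits = [s / temperature for s in sim_row]
        peak = max(logits)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        loss -= sum(t * (x - log_z) for t, x in zip(target_row, logits))
    return loss / len(sim)


positions = [{"left", "upper"}, {"left"}, {"right"}]
targets = soft_targets(positions)
sim_aligned = [[1.0, 0.5, 0.0], [0.5, 1.0, 0.0], [0.0, 0.0, 1.0]]
sim_misaligned = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
loss_aligned = soft_contrastive_loss(sim_aligned, targets)
loss_misaligned = soft_contrastive_loss(sim_misaligned, targets)
```

In this toy batch the first two descriptions share "left", so each counts as a partial positive for the other, while the "right" description is a pure negative for both.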

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same alignment-preserving logic could be tested on other paired image-text tasks such as chest X-ray report generation or pathology slide captioning.
  • If the components generalize, annotation budgets for new clinical datasets could shift from dense masks toward cheaper text descriptions plus a small set of labeled examples.
  • Combining T-PatchMix with existing single-modal SSL techniques might further reduce the label requirement without additional architectural changes.

Load-bearing premise

The three alignment components preserve reliable image-text correspondence under strong augmentation and do not create new inconsistencies that would weaken the teacher-student consistency signal.

What would settle it

Measure whether segmentation Dice scores on QaTa-COV19 or MosMedData+ drop below standard semi-supervised baselines when only 10 percent of labels are used, or whether cross-modal alignment scores degrade after T-PatchMix is applied.
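
For reference, the Dice score used in this test is the standard overlap measure 2|P∩T| / (|P| + |T|) between a predicted mask P and a ground-truth mask T. A minimal sketch for flat binary masks follows; returning 1.0 when both masks are empty is a convention chosen here, not something the paper specifies.

```python
def dice_score(pred, target):
    """Dice coefficient for two flat binary masks (sequences of 0/1)."""
    pred = list(pred)
    target = list(target)
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p * t for p, t in zip(pred, target))
    denominator = sum(pred) + sum(target)
    if denominator == 0:
        return 1.0  # both masks empty: treated as perfect agreement
    return 2.0 * intersection / denominator


# One of the two predicted pixels overlaps the single ground-truth pixel.
score = dice_score([1, 1, 0, 0], [1, 0, 0, 0])
```

Here the intersection is 1 pixel and the masks contain 2 and 1 foreground pixels, so the score is 2·1 / (2 + 1) = 2/3.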

Figures

Figures reproduced from arXiv: 2605.15720 by Luping Zhou, Yi Liu, Yuchen Li, Zhen Zhao.

Figure 1: Overview of Semi-MedRef: (a) the full pipeline (refer to Sec. 2.2), (b) the T-PatchMix augmentation, and (c) the teacher/student model architecture. In summary, our contributions are four-fold: – We propose Semi-MedRef, a Semi-supervised Medical Referring image segmentation framework that explicitly preserves cross-modal alignment under strong perturbations through alignment-aware augmentation and cross-mo…
Figure 2: Examples of segmentation results. From left to right are the input image, ground-truth mask, and the segmentation results of the methods in comparison. Across both datasets and all evaluated label ratios, Semi-MedRef consistently improves over its corresponding backbone and existing fully supervised and semi-supervised baselines. In particular, Semi-MedRef (MMI-UNet) achieves Dice scores of 87.25% (2%) and…
Original abstract

Medical referring image segmentation (MRIS) requires pixel-level masks aligned with textual descriptions of anatomical locations, making annotation costly in low-label regimes. Semi-supervised learning (SSL) can mitigate this burden by leveraging unlabeled data, but its success hinges on maintaining reliable image-text alignment under perturbations. Most existing SSL-based referred segmentation methods use either independent or simplistic multi-modal perturbations (e.g., left-right flips), without fully addressing cross-modal alignment under strong augmentation, while CutMix, highly effective in single-modal SSL, remains underexplored in multi-modal settings due to its tendency to disrupt image-text coherence. We propose Semi-MedRef, a teacher-student SSL framework designed to explicitly maintain consistency between medical images and positional language through three alignment-preserving components: T-PatchMix, a cross-modal CutMix-style augmentation that synchronizes patch mixing with referring expressions via position-constrained and probability-driven rules; PosAug, a position-aware text augmentation that masks or fuzzes anatomical phrases; and ITCL, a position-guided image-text contrastive learning module, which leverages positional pseudo-labels to construct soft anatomical positives and strengthen medically grounded cross-modal alignment. Experiments on QaTa-COV19 and MosMedData+ demonstrate that Semi-MedRef consistently outperforms both fully supervised and semi-supervised baselines across all label regimes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript presents Semi-MedRef, a teacher-student semi-supervised framework for medical referring image segmentation. It introduces three alignment-preserving modules—T-PatchMix (cross-modal CutMix with position-constrained mixing rules synchronized to referring expressions), PosAug (position-aware text augmentation via masking or fuzzing of anatomical phrases), and ITCL (position-guided image-text contrastive learning using positional pseudo-labels for soft anatomical positives)—to maintain image-text coherence under strong augmentation. Experiments on QaTa-COV19 and MosMedData+ report consistent outperformance over fully supervised and semi-supervised baselines across label regimes.

Significance. If the reported gains hold under the provided module definitions and pseudo-label construction, the work offers a practical advance in multi-modal SSL for medical imaging by explicitly addressing cross-modal consistency, a known weakness of standard CutMix-style augmentations in referring tasks. The coherent logical chain from the three components to preserved teacher-student signals, combined with the use of position-constrained rules and pseudo-labels, strengthens the contribution relative to prior independent or simplistic perturbation approaches.

minor comments (2)
  1. [Abstract] The claim of 'consistent outperformance' would be more informative if accompanied by one or two representative quantitative margins (e.g., Dice or IoU deltas) and the specific label percentages tested.
  2. [Methods] The description of positional pseudo-label generation and the probability schedules for T-PatchMix should be cross-referenced to the exact equations or algorithms in the methods section for easier verification.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive evaluation and recommendation of minor revision. The referee summary accurately captures the core contributions of Semi-MedRef, including the alignment-preserving modules T-PatchMix, PosAug, and ITCL within the teacher-student framework for medical referring image segmentation.

Circularity Check

0 steps flagged

No significant circularity: novel modules defined independently of fitted results

full rationale

The paper proposes a teacher-student SSL framework whose core consists of three explicitly defined new components (T-PatchMix with position-constrained mixing rules, PosAug for text masking, and ITCL using positional pseudo-labels). These are introduced as design choices to preserve image-text alignment under augmentation, not derived from or fitted to the target performance metrics. No equations reduce a prediction to a fitted input by construction, no self-citation chain justifies the central premise, and the experimental claims rest on direct comparisons rather than internal self-definition. The derivation chain is therefore self-contained and externally testable.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The method introduces three new algorithmic components whose effectiveness is asserted via empirical comparison; no explicit free parameters, axioms, or invented physical entities are stated in the abstract.
