pith. machine review for the scientific record.

arxiv: 2605.13544 · v1 · submitted 2026-05-13 · 💻 cs.CV

Recognition: unknown

CA-GCL: Cross-Anatomy Global-Local Contrastive Learning for Robust 3D Medical Image Understanding

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 19:54 UTC · model grok-4.3

classification 💻 cs.CV
keywords contrastive learning · vision-language pretraining · 3D medical imaging · zero-shot detection · representation collapse · anatomical separation · cross-dataset generalization · text augmentation

The pith

A global contrastive objective separates anatomical categories to stop text embedding collapse in 3D medical vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Fine-grained vision-language pretraining for 3D medical images often collapses distinct anatomical text embeddings into tight clusters, rendering models hypersensitive to small changes in prompt wording. CA-GCL adds a global contrastive loss that pushes embeddings of different body structures apart while preserving local visual-text alignments. It pairs this with text augmentation that respects permutation invariance and partial completeness in clinical descriptions. Experiments on CT-RATE and Rad-ChestCT show higher zero-shot abnormality detection accuracy, stronger cross-dataset transfer, and markedly lower performance variance across prompt templates.
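
The augmentation idea is concrete enough to sketch. The paper's implementation details are not given here, so the following is a minimal reading of "permutation invariance and partial completeness" as sentence-level operations on report text; the function, probabilities, and example sentences are illustrative, not the authors' code.

```python
import random

def augment_report(sentences: list[str],
                   drop_prob: float = 0.3,
                   min_keep: int = 1,
                   seed: int | None = None) -> str:
    """Illustrative clinical-aware text augmentation.

    Permutation invariance: finding sentences in a radiology report
    carry no strict order, so shuffling should not change meaning.
    Partial completeness: real reports often omit findings, so randomly
    dropping sentences simulates descriptive incompleteness.
    """
    rng = random.Random(seed)
    kept = [s for s in sentences if rng.random() > drop_prob]
    if len(kept) < min_keep:                 # never return an empty report
        kept = rng.sample(sentences, k=min_keep)
    rng.shuffle(kept)                        # order-invariant view
    return " ".join(kept)

# Two stochastic views of one report, usable as interchangeable positives
report = [
    "The liver is normal in size and attenuation.",
    "No focal pulmonary consolidation is seen.",
    "Mild cardiomegaly is present.",
]
view_a = augment_report(report, seed=0)
view_b = augment_report(report, seed=1)
```

Two such views of the same report can then be paired against the same CT volume during contrastive training.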

Core claim

CA-GCL establishes that a cross-anatomy global contrastive objective, combined with local alignment and clinical-aware text augmentation, counteracts the aggregation of distinct anatomical text embeddings, producing more stable representations that support reliable zero-shot abnormality detection and generalization across 3D CT datasets.

What carries the argument

CA-GCL, whose global contrastive term enforces separation between anatomical category embeddings to offset the clustering induced by local visual-text alignment.
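
Mechanistically, that reads as a two-term objective. The sketch below is one plausible instantiation, assuming a supervised-contrastive global term over anatomy labels and a balancing weight `lam`; the paper's figures describe aggregating anatomy-specific representations into synthetic global tokens, so treat this as an illustration of the separation principle, not the authors' exact losses.

```python
import torch
import torch.nn.functional as F

def local_alignment_loss(img_tok, txt_tok, tau: float = 0.07):
    """Symmetric InfoNCE over matched anatomy-level pairs.

    img_tok, txt_tok: (N, D) L2-normalized embeddings; row i of each
    tensor describes the same anatomical region of the same scan.
    """
    logits = img_tok @ txt_tok.t() / tau                    # (N, N)
    targets = torch.arange(img_tok.size(0), device=img_tok.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def cross_anatomy_global_loss(txt_tok, anatomy_ids, tau: float = 0.07):
    """Supervised-contrastive-style separation over anatomy categories:
    same-anatomy text embeddings attract, different anatomies repel.
    One plausible reading of the global term, not the paper's equation.
    """
    n = txt_tok.size(0)
    sim = txt_tok @ txt_tok.t() / tau                       # (N, N)
    self_mask = torch.eye(n, dtype=torch.bool, device=sim.device)
    sim = sim.masked_fill(self_mask, float("-inf"))         # no self-pairs
    pos = (anatomy_ids[:, None] == anatomy_ids[None, :]) & ~self_mask
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    per_anchor = log_prob.masked_fill(~pos, 0.0).sum(1) / pos.sum(1).clamp(min=1)
    return -per_anchor.mean()

def ca_gcl_loss(img_tok, txt_tok, anatomy_ids, lam: float = 1.0):
    # lam is an assumed balancing weight between the two terms
    return (local_alignment_loss(img_tok, txt_tok) +
            lam * cross_anatomy_global_loss(txt_tok, anatomy_ids))
```

If the global term is weighted too heavily it could, as the referee worries below, dominate the local alignments; a weight like `lam` is exactly the knob an ablation would sweep.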

If this is right

  • Zero-shot abnormality detection accuracy rises over prior vision-language pretraining methods on the CT-RATE and Rad-ChestCT datasets.
  • Cross-dataset generalization improves when the same model is evaluated on data sources different from training.
  • Performance variance across diverse prompt templates drops sharply compared with collapsed baselines.
  • Textual similarity distributions shift from a collapsed peak near 1.0 to a bell-shaped form (a quick embedding-level check for this is sketched after the list).
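
The last point can be tested from embeddings alone. A minimal collapse diagnostic, assuming only that the text embeddings are available as a matrix (nothing here is tied to the paper's code):

```python
import torch
import torch.nn.functional as F

def text_similarity_stats(txt_emb: torch.Tensor) -> dict:
    """Summarize the pairwise cosine-similarity distribution of text
    embeddings. A collapsed space piles similarities near 1.0; a healthy
    space spreads them into a broader, roughly bell-shaped distribution.
    txt_emb: (N, D), one embedding per anatomical description."""
    z = F.normalize(txt_emb, dim=1)
    sim = z @ z.t()                                        # (N, N) cosines
    n = sim.size(0)
    off_diag = sim[~torch.eye(n, dtype=torch.bool)]        # drop self-similarity
    return {
        "mean": off_diag.mean().item(),
        "std": off_diag.std().item(),
        "frac_above_0.9": (off_diag > 0.9).float().mean().item(),
    }

# Random embeddings (no collapse) should show near-zero mean similarity
print(text_similarity_stats(torch.randn(128, 512)))
```

On collapsed embeddings the mean hovers near 1.0 with a tiny standard deviation; the claimed bell-shaped shift would show up as a lower mean and a larger spread.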

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same global separation principle could be tested on other 3D modalities such as MRI to check whether anatomical category separation generalizes beyond CT.
  • Lower prompt sensitivity may reduce the need for prompt engineering when deploying these models in clinical workflows.
  • The approach could be combined with report-generation tasks to see whether clearer anatomical distinctions improve generated text quality.

Load-bearing premise

That enforcing global separation between anatomical categories via contrastive objectives will counteract local alignment collapse without degrading fine-grained visual-textual correspondences or introducing new instabilities in the latent space.

What would settle it

If models trained with CA-GCL continue to show high performance variance across different prompt templates or fail to outperform baselines on an unseen 3D medical dataset, the claim that global separation prevents collapse and yields robustness would not hold.
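
The prompt-variance half of that test is mechanical to run. A sketch of the protocol, assuming a CLIP-style zero-shot scorer: the templates and the `encode_image`/`encode_text` callables are hypothetical placeholders, and where the paper reports variance of detection metrics across templates, this measures raw score spread, the same idea one level lower.

```python
import torch
import torch.nn.functional as F

# Illustrative stand-ins: the paper evaluates five templates (p0-p4)
# but their exact wording is not given in the material above.
TEMPLATES = [
    "there is {} in this CT scan",
    "findings consistent with {}",
    "evidence of {}",
    "the scan shows {}",
    "{} is present",
]

def prompt_sensitivity(encode_image, encode_text, volumes, finding: str):
    """Score one abnormality under every template and report the spread.

    encode_image: callable mapping a batch of volumes to (N, D) features.
    encode_text: callable mapping a string to a (D,) feature vector.
    Both are hypothetical placeholders for a CLIP-style 3D model.
    """
    img = F.normalize(encode_image(volumes), dim=1)          # (N, D)
    rows = [img @ F.normalize(encode_text(t.format(finding)), dim=0)
            for t in TEMPLATES]                              # cosine per volume
    scores = torch.stack(rows)                               # (T, N)
    # std over the template axis: high values mean prompt hypersensitivity
    return scores, scores.std(dim=0).mean().item()
```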

Figures

Figures reproduced from arXiv: 2605.13544 by Die Dai, Hanwen Zhang, Jiaye Yang, Peng Wang, Qiao Liu, Yao Liu, Yutong Xie.

Figure 1
Figure 1. Comparison of VLP paradigms and their corresponding text similarity distributions. (a) Global VLP aligns global visual and report tokens. (b) Fine-grained VLP performs pairwise anatomical alignment but suffers from severe distributional degeneracy (similarity peak near 1.0). (c) CA-GCL (Ours) aggregates anatomy-specific representations into synthetic global tokens while maintaining local alignment, effec…
Figure 2
Figure 2. The pipeline of our proposed CA-GCL framework. The framework extracts anatomy-level tokens from CT images and radiology reports (Top). It integrates Anatomy-aware Local Contrastive Alignment for organ-specific matching (Bottom-Left) and Cross-anatomy Global Contrastive Alignment (Bottom-Right).
Figure 3
Figure 3. Zero-shot performance stability across five prompt templates (p0–p4) on the CT-RATE dataset. We compare our CA-GCL framework (green) against SOTA baselines fVLM (red) and ViSD-Boost (blue). The variant "Ours w/o" (purple) represents the ablation model using only local contrastive alignment.
Figure 4
Figure 4. t-SNE visualization of image and text embeddings for fVLM (left) and our method (right). While fVLM exhibits significant text embedding collapse, our method effectively separates anatomical clusters in the embedding space.
read the original abstract

Fine-grained Vision-Language Pre-training (FVLP) demonstrates significant potential in 3D medical image understanding by aligning anatomy-level visual representations with corresponding textual descriptions. However, existing FVLP paradigms often suffer from severe representation collapse in the textual embedding space, where text embeddings of distinct anatomical structures become highly clustered and indistinguishable. This distributional degeneracy renders the model hypersensitive to prompt variations, hindering reliable clinical deployment. To address these challenges, we propose a novel Cross-Anatomy Global-Local Contrastive Learning framework (CA-GCL). CA-GCL introduces a global contrastive objective that enforces separation between anatomical categories in the latent space, effectively counteracting the aggregation tendency induced by local alignment. Furthermore, we incorporate a clinical-aware text augmentation strategy based on permutation invariance and partial completeness to enhance robustness against descriptive incompleteness. Extensive evaluations on the CT-RATE and Rad-ChestCT datasets demonstrate that CA-GCL consistently outperforms existing VLP paradigms in zero-shot abnormality detection, achieving superior performance while exhibiting strong cross-dataset generalization. Crucially, CA-GCL reduces performance variance across diverse prompt templates, transforming the collapsed textual similarity distribution into a bell-shaped distribution. These results validate CA-GCL as an effective framework for robust 3D medical image understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes CA-GCL, a Cross-Anatomy Global-Local Contrastive Learning framework for 3D medical image understanding. It augments fine-grained vision-language pre-training with a global contrastive objective that enforces separation between anatomical categories to counteract textual embedding collapse, plus a clinical-aware text augmentation strategy based on permutation invariance and partial completeness. Evaluations on CT-RATE and Rad-ChestCT claim consistent outperformance over existing VLP paradigms in zero-shot abnormality detection, stronger cross-dataset generalization, and reduced performance variance across prompt templates, with textual similarity distributions shifting from collapsed to bell-shaped.

Significance. If the results hold and the global term is shown not to degrade local alignments, the work would be significant for medical VLP by directly addressing prompt hypersensitivity—a practical barrier to deployment. The explicit separation mechanism and augmentation strategy offer a concrete way to stabilize representations while retaining fine-grained correspondences, with potential for broader adoption in 3D imaging tasks.

major comments (2)
  1. [Abstract and Methods] The central claim that the global contrastive term counteracts collapse while leaving intra-category visual-textual correspondences intact lacks quantitative support; no matched-pair cosine similarities, local contrastive loss values, or ablation isolating the global term's effect on fine-grained alignments are referenced in the abstract or methods description.
  2. [Experiments] The reported outperformance and variance reduction are stated without exact metrics, baseline names, statistical tests, or ablation tables, making it impossible to verify the magnitude of gains or rule out that the global term dominates and weakens the local alignments needed for abnormality detection.
minor comments (2)
  1. Clarify implementation details of the clinical-aware text augmentation, including how permutation invariance is enforced and what constitutes partial completeness.
  2. Provide dataset statistics for CT-RATE and Rad-ChestCT (e.g., number of scans, abnormality categories) and list the exact baselines used in comparisons.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the insightful comments, which help clarify the presentation of our quantitative results. We address each major point below and will revise the manuscript to strengthen the evidence for our claims.

read point-by-point responses
  1. Referee: [Abstract and Methods] The central claim that the global contrastive term counteracts collapse while leaving intra-category visual-textual correspondences intact lacks quantitative support; no matched-pair cosine similarities, local contrastive loss values, or ablation isolating the global term's effect on fine-grained alignments are referenced in the abstract or methods description.

    Authors: We agree that the abstract and methods description do not explicitly reference these quantitative metrics. The experiments section (Section 4) includes ablation studies with cosine similarity distributions before/after the global term, local contrastive loss values, and matched-pair analyses showing preserved intra-category alignments. We will revise the abstract to summarize these key results and add explicit cross-references in the methods to the ablation tables in Section 4.3. revision: yes

  2. Referee: [Experiments] The reported outperformance and variance reduction are stated without exact metrics, baseline names, statistical tests, or ablation tables, making it impossible to verify the magnitude of gains or rule out that the global term dominates and weakens the local alignments needed for abnormality detection.

    Authors: The full experiments section reports exact metrics (e.g., AUC improvements on CT-RATE and Rad-ChestCT), baseline names (standard VLP methods including 3D-adapted CLIP variants), ablation tables, and variance statistics across prompt templates. To address the concern about potential weakening of local alignments, our ablations show that adding the global term improves zero-shot abnormality detection performance, confirming that fine-grained correspondences are retained. We will update the abstract and main text to explicitly cite these numbers, tables, and any additional statistical tests. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper presents CA-GCL as an additive framework that augments standard local alignment with an independent global contrastive objective to enforce inter-category separation. No equations, self-definitions, or fitted parameters are shown that reduce the claimed zero-shot gains or distributional improvements to the inputs by construction. Evaluations rely on external datasets (CT-RATE, Rad-ChestCT) and prompt-variance metrics that are not internally forced by the training objective itself. The derivation chain remains self-contained with no load-bearing self-citations or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only view provides no explicit free parameters, axioms, or invented entities; standard contrastive learning assumptions are implicitly relied upon but not detailed.

pith-pipeline@v0.9.0 · 5538 in / 1123 out tokens · 33873 ms · 2026-05-14T19:54:17.227053+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · 3 internal anchors

  1. Bai, F., Du, Y., Huang, T., Meng, M.Q.H., Zhao, B.: M3D: Advancing 3D medical image analysis with multi-modal large language models. arXiv preprint arXiv:2404.00578 (2024)
  2. Blankemeier, L., Cohen, J.P., Kumar, A., Van Veen, D., Gardezi, S.J.S., Paschali, M., Chen, Z., Delbrouck, J.B., Reis, E., Truyts, C., et al.: Merlin: A vision language foundation model for 3D computed tomography. Research Square pp. rs–3 (2024)
  3. Boecking, B., Usuyama, N., Bannur, S., Castro, D.C., Schwaighofer, A., Hyland, S., Wetscherek, M., Naumann, T., Nori, A., Alvarez-Valle, J., et al.: Making the most of text semantics to improve biomedical vision-language processing. In: European Conference on Computer Vision. pp. 1–21. Springer (2022)
  4. Cao, W., Zhang, J., Shui, Z., Wang, S., Chen, Z., Li, X., Lu, L., Ye, X., Zhang, Q., Liang, T., et al.: Boosting vision semantic density with anatomy normality modeling for medical vision-language pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 23041–23050 (2025)
  5. Cao, W., Zhang, J., Xia, Y., Mok, T.C., Li, Z., Ye, X., Lu, L., Zheng, J., Tang, Y., Zhang, L.: Bootstrapping chest CT image understanding by distilling knowledge from X-ray expert models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11238–11247 (2024)
  6. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pp. 4171–4186 (2019)
  7. Dosovitskiy, A.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
  8. Draelos, R.L., Dov, D., Mazurowski, M.A., Lo, J.Y., Henao, R., Rubin, G.D., Carin, L.: Machine-learning-based multiple abnormality prediction with large-scale chest computed tomography volumes. Medical Image Analysis 67, 101857 (2021)
  9. Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al.: DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025)
  10. Hamamci, I.E., Er, S., Almas, F., Simsek, A.G., Esirgun, S.N., Dogan, I., Dasdelen, M.F., Wittmann, B., Simsar, E., Simsar, M., et al.: A foundation model utilizing chest CT volumes and radiology reports for supervised-level zero-shot detection of abnormalities. CoRR (2024)
  11. Huang, S.C., Shen, L., Lungren, M.P., Yeung, S.: GLoRIA: A multimodal global-local representation learning framework for label-efficient medical image recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3942–3951 (2021)
  12. Isensee, F., Jaeger, P.F., Kohl, S.A., Petersen, J., Maier-Hein, K.H.: nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature Methods 18(2), 203–211 (2021)
  13. Lin, J., Xia, Y., Zhang, J., Yan, K., Cao, K., Lu, L., Luo, J., Zhang, L.: CT-GLIP: 3D grounded language-image pretraining with CT scans and radiology reports for full-body scenarios. arXiv preprint arXiv:2404.15272 (2024)
  14. Lin, W., Zhao, Z., Zhang, X., Wu, C., Zhang, Y., Wang, Y., Xie, W.: PMC-CLIP: Contrastive language-image pre-training using biomedical documents. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 525–536. Springer (2023)
  15. Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
  16. Ni, X., Wu, L., Zhuang, J., Wang, Q., Wu, M., Vardhanabhuti, V., Zhang, L., Gao, H., Chen, H.: MG-3D: Multi-grained knowledge-enhanced 3D medical vision-language pre-training. arXiv preprint arXiv:2412.05876 (2024)
  17. Oord, A.v.d., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)
  18. Shui, Z., Zhang, J., Cao, W., Wang, S., Guo, R., Lu, L., Yang, L., Ye, X., Liang, T., Zhang, Q., et al.: Large-scale and fine-grained vision-language pre-training for enhanced CT image understanding. arXiv preprint arXiv:2501.14548 (2025)
  19. Wang, F., Zhou, Y., Wang, S., Vardhanabhuti, V., Yu, L.: Multi-granularity cross-modal alignment for generalized medical visual representation learning. Advances in Neural Information Processing Systems 35, 33536–33549 (2022)
  20. Wang, H., Guo, S., Ye, J., Deng, Z., Cheng, J., Li, T., Chen, J., Su, Y., Huang, Z., Shen, Y., et al.: SAM-Med3D: A vision foundation model for general-purpose segmentation on volumetric medical images. IEEE Transactions on Neural Networks and Learning Systems (2025)
  21. Wasserthal, J., Breit, H.C., Meyer, M.T., Pradella, M., Hinck, D., Sauter, A.W., Heye, T., Boll, D.T., Cyriac, J., Yang, S., et al.: TotalSegmentator: robust segmentation of 104 anatomic structures in CT images. Radiology: Artificial Intelligence 5(5), e230024 (2023)
  22. Wu, J., Wang, Y., Zhong, Z., Liao, W., Trayanova, N., Jiao, Z., Bai, H.X.: Vision-language foundation model for 3D medical imaging. npj Artificial Intelligence 1(1), 17 (2025)
  23. Xie, Y., Zhang, J., Shen, C., Xia, Y.: CoTr: Efficiently bridging CNN and transformer for 3D medical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 171–180. Springer (2021)
  24. Zhang, J., Xie, Y., Wang, Y., Xia, Y.: Inter-slice context residual learning for 3D medical image segmentation. IEEE Transactions on Medical Imaging 40(2), 661–672 (2020)
  25. Zhang, J., Ye, X., Zhang, J., Tang, Y., Xu, M., Guo, J., Chen, X., Liu, Z., Zhou, J., Lu, L., et al.: Parse and recall: Towards accurate lung nodule malignancy prediction like radiologists. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 199–209. Springer (2023)
  26. Zhang, Y., Jiang, H., Miura, Y., Manning, C.D., Langlotz, C.P.: Contrastive learning of medical visual representations from paired images and text. In: Machine Learning for Healthcare Conference. pp. 2–25. PMLR (2022)