CLIP Architecture for Abdominal CT Image-Text Alignment and Zero-Shot Learning: Investigating Batch Composition and Data Scaling
Pith reviewed 2026-05-10 13:54 UTC · model grok-4.3
The pith
Random sampling with alternating anatomical batches outperforms class-balanced sampling for training 3D CT vision-language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that the stochastic diversity of random sampling, combined with alternating batching over anatomical subsections, provides more effective regularization than engineered class ratios at the small batch sizes required by 3D medical volumes. The evidence is twofold: every balanced ratio lowers macro F1 from the 74.45% baseline, with the best balanced variant reaching only 72.02%, and performance on a subset grows sub-linearly from 65.26% to 71.88% across data fractions.
What carries the argument
Merlin's dual-encoder architecture using symmetric InfoNCE loss together with its alternating batching over anatomical subsections.
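A minimal sketch of a symmetric InfoNCE loss, to make the objective concrete. This is not Merlin's implementation: the function names, the NumPy formulation, and the temperature of 0.07 are assumptions chosen for illustration.

```python
import numpy as np

def _log_softmax(logits: np.ndarray, axis: int) -> np.ndarray:
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_info_nce(image_emb: np.ndarray, text_emb: np.ndarray,
                       temperature: float = 0.07) -> float:
    """Average of image-to-text and text-to-image cross-entropy over a batch.

    Row i of image_emb and row i of text_emb are assumed to be a matched pair.
    The temperature value is an illustrative assumption.
    """
    # L2-normalise so the dot product is a cosine similarity.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature      # shape (batch, batch)
    idx = np.arange(logits.shape[0])        # matched pairs sit on the diagonal
    loss_i2t = -_log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2i = -_log_softmax(logits, axis=0)[idx, idx].mean()
    return float(0.5 * (loss_i2t + loss_t2i))
```

Every other item in the batch acts as a negative for a given pair, which is why the composition of a small batch can plausibly change what the model learns.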
If this is right
- All tested balanced ratios (25:75, 50:50, 75:25) reduce zero-shot macro F1 by 2.4–2.8 points relative to the unbalanced baseline.
- Performance on a 4,362-study subset rises sub-linearly from 65.26% at 20% data to 71.88% at 100% data.
- Individual findings vary sharply in how much additional data improves their zero-shot detection.
- Enforcing 50:50 balance on the subset further drops performance to 68.01%, confirming the pattern holds at different scales.
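The gaps behind these points can be recomputed from the macro F1 values quoted on this page. The snippet below is only arithmetic on those reported numbers; the variable names are illustrative.

```python
# Macro F1 values (percent) as quoted in this review.
full_random = 74.45         # full dataset, random sampling
full_best_balanced = 72.02  # full dataset, best balanced ratio (75:25)
subset_20 = 65.26           # subset, 20% of the data
subset_100 = 71.88          # subset, 100% of the data
subset_balanced = 68.01     # subset, 50:50 balanced sampling

smallest_balancing_gap = round(full_random - full_best_balanced, 2)  # 2.43
scaling_gain = round(subset_100 - subset_20, 2)                      # 6.62
subset_balancing_gap = round(subset_100 - subset_balanced, 2)        # 3.87
relative_gain_pct = round(100 * scaling_gain / subset_20, 1)         # 10.1

print(smallest_balancing_gap, scaling_gain, subset_balancing_gap, relative_gain_pct)
```

Going from 20% to 100% of the subset is a fivefold increase in data for roughly a 10% relative gain in macro F1, which is what "sub-linear" means here.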
Where Pith is reading between the lines
- The advantage of random batching may extend to other 3D medical imaging tasks where GPU memory forces small batches.
- The interaction between stochastic batch diversity and anatomical alternation could be tested by ablating the alternation step alone.
- Sub-linear scaling implies that targeted data collection for underperforming findings may be more efficient than uniform dataset growth.
Load-bearing premise
That measured performance gaps arise from the batch composition choices themselves rather than from uncontrolled differences in training dynamics, hyperparameters, or dataset properties.
What would settle it
Re-running the full training protocol multiple times with identical batch strategies but different random seeds to check whether the gap between random and balanced sampling stays consistent.
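One way to summarise such a multi-seed check is sketched below. It assumes each seed yields one macro F1 score for random sampling and one for balanced sampling. The function name and return fields are hypothetical, and no per-seed results exist in the source to plug in.

```python
from statistics import mean, stdev

def gap_summary(random_f1, balanced_f1):
    """Summarise paired per-seed macro F1 gaps (random minus balanced).

    Both arguments are sequences with one score per seed, in the same order.
    """
    if len(random_f1) != len(balanced_f1) or len(random_f1) < 2:
        raise ValueError("need at least two paired runs")
    gaps = [r - b for r, b in zip(random_f1, balanced_f1)]
    return {
        "mean_gap": mean(gaps),
        "std_gap": stdev(gaps),  # sample standard deviation across seeds
        "same_sign": all(g > 0 for g in gaps) or all(g < 0 for g in gaps),
    }
```

A mean gap that is large relative to its standard deviation, with the same sign for every seed, would support attributing the difference to the sampling strategy rather than to run-to-run noise.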
Original abstract
Vision-language models trained with contrastive learning on paired medical images and reports show strong zero-shot diagnostic capabilities, yet the effect of training batch composition on learned representations remains unexplored for 3D medical imaging. We reproduce Merlin, a dual-encoder model that aligns 3D abdominal CT volumes with radiology reports using symmetric InfoNCE loss, achieving a zero-shot macro F1 of 74.45% across 30 findings (original: 73.00%). We then investigate two axes of variation. First, we control the normal-to-abnormal ratio within training batches at 25:75, 50:50, and 75:25 using section-level balanced sampling on the full dataset. All three configurations underperform the unbalanced baseline by 2.4 to 2.8 points, with 75:25 achieving the best result (72.02%) among balanced variants. Second, we conduct data scaling ablations on a 4,362-study subset, training with 20%, 40%, and 100% of the data. Performance scales sub-linearly from 65.26% to 71.88%, with individual findings varying dramatically in data sensitivity. Enforcing 50:50 balanced sampling on the same subset further degrades performance to 68.01%, confirming that explicit class balancing hurts regardless of dataset or balancing granularity. Our results indicate that the stochastic diversity of random sampling, combined with Merlin's alternating batching over anatomical subsections, provides more effective regularization than engineered class ratios at the small batch sizes required by 3D medical volumes.
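The two batch-composition regimes compared in the abstract can be sketched as follows. This is an illustrative assumption about how a fixed normal:abnormal ratio might be enforced, not the paper's section-level sampler. The function names and the use of Python's `random` module are hypothetical choices.

```python
import random

def sample_ratio_batch(normal_ids, abnormal_ids, batch_size, normal_fraction, rng):
    """Draw one batch with a fixed normal:abnormal composition.

    normal_fraction=0.25 corresponds to a 25:75 normal-to-abnormal ratio.
    """
    n_normal = round(batch_size * normal_fraction)
    n_abnormal = batch_size - n_normal
    # Sample without replacement inside each class, then mix the order.
    batch = rng.sample(normal_ids, n_normal) + rng.sample(abnormal_ids, n_abnormal)
    rng.shuffle(batch)
    return batch

def sample_random_batch(all_ids, batch_size, rng):
    """Baseline: draw a batch uniformly, so the class mix varies batch to batch."""
    return rng.sample(all_ids, batch_size)
```

With a ratio-enforcing sampler every batch has the same class mix, whereas the uniform baseline lets the mix fluctuate around the dataset's natural prevalence. That fluctuation is the "stochastic diversity" the abstract refers to.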
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reproduces the Merlin dual-encoder model for contrastive alignment of 3D abdominal CT volumes with radiology reports using symmetric InfoNCE loss, reaching 74.45% zero-shot macro F1 across 30 findings (original 73.00%). It then ablates batch composition on the full dataset by comparing random sampling against section-level balanced sampling at normal:abnormal ratios of 25:75, 50:50, and 75:25, finding all balanced variants underperform the random baseline by 2.4–2.8 F1 points (best balanced: 72.02%). A data-scaling study on a 4,362-study subset shows sub-linear gains from 65.26% (20% data) to 71.88% (100% data), with 50:50 balancing on the subset further dropping to 68.01%. The authors conclude that stochastic diversity from random sampling plus Merlin’s alternating anatomical subsection batching regularizes better than engineered class ratios at the small batch sizes required by 3D volumes.
Significance. If the observed gaps are shown to arise from batch composition rather than uncontrolled training dynamics, the result would be significant for medical vision-language modeling: it would indicate that explicit class balancing is counterproductive for contrastive pre-training on imbalanced 3D CT data and that preserving natural distributions with stochastic sampling plus anatomical alternation is preferable. The reproduction (comparable F1, same directional balancing effect on full and subset data) and the finding that individual findings vary sharply in data sensitivity are concrete contributions.
major comments (2)
- [batch-composition experiments] The central comparison of random versus section-level balanced sampling (abstract and experimental results) does not state whether the balanced runs preserve the same total training steps, the same per-epoch unique volume count, the same subsection alternation schedule, or the same effective learning-rate schedule as the random baseline. Because section-level balancing necessarily changes co-occurrence statistics and sampling frequencies, any of these differences could produce the reported 2.4–2.8 point F1 gaps; the performance delta therefore cannot yet be attributed solely to class ratio.
- [results] No error bars, standard deviations across runs, or statistical tests are reported for the F1 differences between random and balanced configurations (abstract and results). Given that the reproduction reaches only 74.45% and the reader notes limited implementation detail, demonstrating that the gaps are statistically reliable would be required to support the claim that random sampling plus alternation is superior regularization.
minor comments (2)
- [abstract] The abstract refers to “Merlin’s alternating batching over anatomical subsections” without specifying how the alternation is implemented or whether it is held constant across the random and balanced conditions.
- [data-scaling ablation] The 4,362-study subset and the 20%/40%/100% splits are mentioned but not described (e.g., whether splits are stratified by finding prevalence or by patient).
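To illustrate the kind of split description the last comment asks for, here is a sketch of patient-grouped, nested data fractions. Whether the paper's 20%/40%/100% splits were built this way is unknown. The function name, the `study_to_patient` mapping, and the choice to take fractions over patients rather than studies are all assumptions.

```python
import random
from collections import defaultdict

def nested_patient_subsets(study_to_patient, fractions=(0.2, 0.4, 1.0), seed=0):
    """Return {fraction: sorted study ids}, sampling whole patients so that
    no patient is split and each smaller subset is contained in the larger ones."""
    by_patient = defaultdict(list)
    for study, patient in study_to_patient.items():
        by_patient[patient].append(study)
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)  # one fixed order shared by all fractions
    subsets = {}
    for frac in fractions:
        keep = patients[: max(1, round(frac * len(patients)))]
        subsets[frac] = sorted(s for p in keep for s in by_patient[p])
    return subsets
```

Because every fraction is a prefix of the same shuffled patient order, the 20% subset is a strict subset of the 40% subset, so differences between fractions reflect added data rather than a different draw.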
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our reproduction and ablation study. We address each major comment below, indicating planned revisions where appropriate.
- Referee: The central comparison of random versus section-level balanced sampling (abstract and experimental results) does not state whether the balanced runs preserve the same total training steps, the same per-epoch unique volume count, the same subsection alternation schedule, or the same effective learning-rate schedule as the random baseline. Because section-level balancing necessarily changes co-occurrence statistics and sampling frequencies, any of these differences could produce the reported 2.4–2.8 point F1 gaps; the performance delta therefore cannot yet be attributed solely to class ratio.
Authors: We appreciate this clarification request. All runs used the same total number of optimization steps, the same learning-rate schedule, and the identical subsection alternation schedule from the Merlin reproduction. The per-epoch unique volume count necessarily differs under balancing, as this is inherent to the class-ratio ablation. We will revise the experimental setup section to explicitly document these controls, enabling clearer attribution of the observed gaps to batch composition while acknowledging the changes in sampling frequencies.
Revision: yes
- Referee: No error bars, standard deviations across runs, or statistical tests are reported for the F1 differences between random and balanced configurations (abstract and results). Given that the reproduction reaches only 74.45% and the reader notes limited implementation detail, demonstrating that the gaps are statistically reliable would be required to support the claim that random sampling plus alternation is superior regularization.
Authors: We agree that variability metrics would strengthen the claims. Due to the high computational cost of 3D CT contrastive training, results are reported from single runs. The directional effect is consistent across the full dataset and the independent 4,362-study subset. We will add a limitations paragraph discussing single-run results and the cross-experiment consistency, while noting that multiple seeds would be preferable if resources permit.
Revision: partial
Circularity Check
No circularity: purely empirical ablation results
The paper reports direct experimental measurements of zero-shot macro F1 on held-out data after training dual-encoder models under controlled batch-composition and data-scaling regimes. No equations, derivations, or first-principles predictions appear; performance deltas (e.g., 2.4–2.8 F1 points) are obtained by training and evaluating separate runs rather than by fitting a parameter that is then renamed as a prediction. Self-citation to the original Merlin work serves only as a reproducible baseline and does not supply a uniqueness theorem, ansatz, or load-bearing premise that the present results reduce to. The central claim about stochastic diversity plus anatomical alternation therefore rests on observable training outcomes, not on any self-referential construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Symmetric InfoNCE loss aligns image and text embeddings sufficiently for zero-shot classification of radiology findings.
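To make this assumption concrete, the sketch below shows one common way aligned embeddings are turned into zero-shot predictions and scored with macro F1. It assumes each finding has a "present" and an "absent" text prompt embedded by the text encoder. The prompt scheme, function names, and decision rule are illustrative and not confirmed as the paper's evaluation protocol.

```python
import numpy as np

def zero_shot_predict(image_emb, present_emb, absent_emb):
    """Return a (studies, findings) 0/1 matrix: 1 where the 'present' prompt
    is closer to the image than the 'absent' prompt (cosine similarity)."""
    def unit(x):
        x = np.asarray(x, dtype=float)
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    img, pos, neg = unit(image_emb), unit(present_emb), unit(absent_emb)
    return (img @ pos.T > img @ neg.T).astype(int)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-finding F1 scores (columns are findings)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = ((y_true == 1) & (y_pred == 1)).sum(axis=0)
    fp = ((y_true == 0) & (y_pred == 1)).sum(axis=0)
    fn = ((y_true == 1) & (y_pred == 0)).sum(axis=0)
    denom = 2 * tp + fp + fn
    # A finding with no positives and no predicted positives scores 0 here.
    f1 = np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 0.0)
    return float(f1.mean())
```

Because macro F1 weights every finding equally, rare findings count as much as common ones, so per-finding differences in data sensitivity show up directly in the headline number.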
Reference graph
Works this paper leans on
- [1] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” in Proc. ICML, 2021.
- [2] C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. V. Le, Y. Sung, Z. Li, and T. Duerig, “Scaling up visual and vision-language representation learning with noisy text supervision,” in Proc. ICML, 2021.
- [3] Y. Zhang, H. Jiang, Y. Miura, C. D. Manning, and C. P. Langlotz, “Contrastive learning of medical visual representations from paired images and text,” in Proc. MLHC, 2022.
- [4] S. Huang, L. Shen, M. P. Lungren, and S. Yeung, “GLoRIA: A multimodal global-local representation learning framework for label-efficient medical image recognition,” in Proc. ICCV, 2021.
- [5] S. Boecking, N. Usuyama, S. Bannur, D. C. Castro, A. Schwaighofer, S. Hyland, M. Wetscherek, T. Naumann, A. Nori, J. Alvarez-Valle, H. Poon, and O. Oktay, “Making the most of text semantics to improve biomedical vision-language processing,” in Proc. ECCV, 2022.
- [6] L. Blankemeier, J. P. Cohen, A. Kumar, M. Van Veen, S. Gardezi, M. Parekh, S. Shah, A. Chaudhari, and R. Boutin, “Merlin: A vision language foundation model for 3D computed tomography,” Research, 2024.
- [7] J. Zhu, B. Liu, Z. Yang, Y. Yi, H. Mao, J. Wang, C. Cui, and J. Lu, “Balanced contrastive learning for long-tailed visual recognition,” in Proc. CVPR, 2022.
- [8] J. Cui, Z. Zhong, S. Liu, B. Yu, and J. Jia, “Parametric contrastive learning,” in Proc. ICCV, 2021.
- [9] J. Robinson, C.-Y. Chuang, S. Sra, and S. Jegelka, “Contrastive learning with hard negative samples,” in Proc. ICLR, 2021.
- [10] Y. Kalantidis, M. B. Sariyildiz, N. Pion, P. Weinzaepfel, and D. Larlus, “Hard negative mixing for contrastive learning,” in Proc. NeurIPS, 2020.
- [11] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in Proc. ICML, 2020.
- [12] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in Proc. CVPR, 2020.
- [13] C.-H. Yeh, C.-Y. Hong, Y.-C. Hsu, T.-L. Liu, Y. Chen, and Y. LeCun, “Decoupled contrastive learning,” in Proc. ECCV, 2022.
- [14] E. Tiu, E. Taber, P. Langlotz, A. Ng, and P. Rajpurkar, “Expert-level detection of pathologies from unannotated chest X-ray images via self-supervised learning,” Nature Biomedical Engineering, vol. 6, pp. 1399–1406, 2022.
- [15] I. Hamamci, S. Er, F. Almas, A. G. Simsek, S. N. Esirgemez, A. Dogan, M. Dasdelen, B. Ones, H. E. Dogan, M. K. Sezgin, U. Akata, and B. Menze, “Generating CT images from free-form text reports using GANs,” in Proc. ECCV, 2024.
- [16] M. Cherti, R. Beaumont, R. Wightman, M. Wortsman, G. Ilharco, C. Gordon, C. Schuhmann, L. Schmidt, and J. Jitsev, “Reproducible scaling laws for contrastive language–image learning,” in Proc. CVPR, 2023.
- [17] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. CVPR, 2016.
- [18] Y. Li, R. M. Wehbe, F. S. Ahmad, H. Wang, and Y. Luo, “Clinical-Longformer and Clinical-BigBird: Transformers for long clinical sequences,” arXiv preprint arXiv:2201.11838, 2022.
- [19] A. van den Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.