CLIP Architecture for Abdominal CT Image-Text Alignment and Zero-Shot Learning: Investigating Batch Composition and Data Scaling

Kartik Bose; Pankaj Gupta; Shivika

arxiv: 2604.13561 · v1 · submitted 2026-04-15 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-10 13:54 UTC · model grok-4.3

keywords batch composition · zero-shot learning · 3D CT imaging · vision-language models · contrastive learning · abdominal CT · data scaling · class balancing

The pith

Random sampling with alternating anatomical batches outperforms class-balanced sampling for training 3D CT vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reproduces a dual-encoder model that aligns 3D abdominal CT volumes with radiology reports through contrastive learning. It tests controlled normal-to-abnormal ratios in batches and finds every balanced setup underperforms the original random sampling baseline by 2.4 to 2.8 points in zero-shot macro F1. Separate scaling experiments on a data subset show performance rising sub-linearly with added studies, while forcing balance on the subset still lowers results. These patterns indicate that natural stochastic variety in random batches, paired with subsection alternation, supplies stronger regularization than deliberate class ratios when batch sizes stay small due to 3D volume memory limits.

Core claim

The paper argues that the stochastic diversity of random sampling, combined with alternating batching over anatomical subsections, provides more effective regularization than engineered class ratios at the small batch sizes required by 3D medical volumes. The evidence is twofold: every balanced ratio lowers macro F1 from the 74.45% baseline, with the best balanced variant (75:25) reaching only 72.02%, and performance on the data-scaling subset rises sub-linearly from 65.26% to 71.88% across data fractions.

What carries the argument

Merlin's dual-encoder architecture using symmetric InfoNCE loss together with its alternating batching over anatomical subsections.
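
As a rough illustration of the symmetric InfoNCE objective named above, the following pure-Python sketch averages the image-to-text and text-to-image cross-entropy terms over a batch of paired embeddings. It is not Merlin's implementation: the function names, the list-of-lists embedding format, and the temperature of 0.07 are assumptions chosen for illustration.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def symmetric_info_nce(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: row i of image_emb is assumed paired with row i of text_emb.

    temperature=0.07 is an illustrative assumption, not a value from the paper.
    """
    imgs = [l2_normalize(v) for v in image_emb]
    txts = [l2_normalize(v) for v in text_emb]
    n = len(imgs)
    # logits[i][j] = cosine similarity of image i and text j, scaled by temperature
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text term: the matching text for image i sits on the diagonal
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image term: same computation on the transposed logits
    logits_t = [list(col) for col in zip(*logits)]
    t2i = -sum(log_softmax(logits_t[j])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)
```

Every other item in the batch serves as a negative for a given pair, which is why the composition of a small batch can plausibly affect what the loss teaches the encoders.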

If this is right

All tested balanced ratios (25:75, 50:50, 75:25) reduce zero-shot macro F1 by 2.4–2.8 points relative to the unbalanced baseline.
Performance on a 4,362-study subset rises sub-linearly from 65.26% at 20% data to 71.88% at 100% data.
Individual findings vary sharply in how much additional data improves their zero-shot detection.
Enforcing 50:50 balance on the subset further drops performance to 68.01%, confirming the pattern holds at different scales.
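
To make the contrast between the two sampling strategies concrete, here is a minimal sketch of drawing one batch at a fixed normal-to-abnormal ratio versus drawing it uniformly at random. The paper balances at the section level and its batch size is not stated on this page, so the study-level IDs, the batch size, and both function names are hypothetical.

```python
import random

def balanced_batch(normal_ids, abnormal_ids, batch_size, normal_fraction, rng):
    """Draw a batch with a fixed share of normal samples (e.g. 0.75 for 75:25)."""
    n_normal = round(batch_size * normal_fraction)
    n_abnormal = batch_size - n_normal
    batch = rng.sample(normal_ids, n_normal) + rng.sample(abnormal_ids, n_abnormal)
    rng.shuffle(batch)  # avoid a fixed normal-then-abnormal ordering
    return batch

def random_batch(all_ids, batch_size, rng):
    """Baseline: uniform sampling, so the class ratio varies from batch to batch."""
    return rng.sample(all_ids, batch_size)

# Hypothetical pools: IDs below 100 are "normal", 100 and above are "abnormal".
rng = random.Random(0)
normal_pool = list(range(100))
abnormal_pool = list(range(100, 140))
fixed = balanced_batch(normal_pool, abnormal_pool, 8, 0.75, rng)
free = random_batch(normal_pool + abnormal_pool, 8, rng)
```

Under `balanced_batch` every batch has the same class makeup, whereas `random_batch` lets the ratio fluctuate around the dataset prevalence; that fluctuation is the "stochastic diversity" the paper credits.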

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The advantage of random batching may extend to other 3D medical imaging tasks where GPU memory forces small batches.
The interaction between stochastic batch diversity and anatomical alternation could be tested by ablating the alternation step alone.
Sub-linear scaling implies that targeted data collection for underperforming findings may be more efficient than uniform dataset growth.
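
One way to put a number on "sub-linear" is to ask how many F1 points each doubling of data buys, using only the two endpoints reported on this page (65.26% at 20% of the subset, 71.88% at 100%). The sketch below assumes a log-linear trend between those endpoints, which the paper does not claim; the function name is hypothetical.

```python
import math

def gain_per_doubling(f1_small, f1_large, data_multiplier):
    """F1 points gained per doubling of data, assuming a log-linear trend."""
    return (f1_large - f1_small) / math.log2(data_multiplier)

total_gain = 71.88 - 65.26                      # about 6.62 points for 5x the data
per_doubling = gain_per_doubling(65.26, 71.88, 100 / 20)
relative_gain = total_gain / 65.26              # fraction of the 20% score
```

Under that assumption, a fivefold increase in studies yields roughly 2.85 points per doubling, or about a 10% relative improvement overall.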

Load-bearing premise

That measured performance gaps arise from the batch composition choices themselves rather than from uncontrolled differences in training dynamics, hyperparameters, or dataset properties.

What would settle it

Re-running the full training protocol multiple times with identical batch strategies but different random seeds to check whether the gap between random and balanced sampling stays consistent.
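
The seed-repetition check described above could be summarized as follows. This is a sketch under the assumption that each seed is run once per strategy, so results can be paired by seed; the function name and the input format are hypothetical, and no multi-seed results are reported in the paper.

```python
import statistics

def paired_gap_summary(random_f1, balanced_f1):
    """Summarize per-seed F1 gaps (random minus balanced).

    Both lists are assumed to be ordered by seed, one entry per seed.
    Returns (mean gap, sample standard deviation of gaps, whether every gap is positive).
    """
    gaps = [r - b for r, b in zip(random_f1, balanced_f1)]
    mean_gap = statistics.mean(gaps)
    sd_gap = statistics.stdev(gaps)  # requires at least two seeds
    all_positive = all(g > 0 for g in gaps)
    return mean_gap, sd_gap, all_positive
```

If the mean gap stays well above its standard deviation and keeps the same sign for every seed, the 2.4 to 2.8 point difference would be hard to attribute to run-to-run noise.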

Figures

Figures reproduced from arXiv: 2604.13561 by Kartik Bose, Pankaj Gupta, Shivika.

Figure 2: Training and validation loss curves for the baseline and batch composition experiments. Left: ...

Figure 3: Data scaling: zero-shot F1 vs. training data size for ablations 100% NAB, 40% NAB, and 20% ...

Figure 4: Per-finding zero-shot F1 scores across key experiments: original Merlin, our reproduction baseline, ...

Original abstract

Vision-language models trained with contrastive learning on paired medical images and reports show strong zero-shot diagnostic capabilities, yet the effect of training batch composition on learned representations remains unexplored for 3D medical imaging. We reproduce Merlin, a dual-encoder model that aligns 3D abdominal CT volumes with radiology reports using symmetric InfoNCE loss, achieving a zero-shot macro F1 of 74.45% across 30 findings (original: 73.00%). We then investigate two axes of variation. First, we control the normal-to-abnormal ratio within training batches at 25:75, 50:50, and 75:25 using section-level balanced sampling on the full dataset. All three configurations underperform the unbalanced baseline by 2.4 to 2.8 points, with 75:25 achieving the best result (72.02%) among balanced variants. Second, we conduct data scaling ablations on a 4,362-study subset, training with 20%, 40%, and 100% of the data. Performance scales sub-linearly from 65.26% to 71.88%, with individual findings varying dramatically in data sensitivity. Enforcing 50:50 balanced sampling on the same subset further degrades performance to 68.01%, confirming that explicit class balancing hurts regardless of dataset or balancing granularity. Our results indicate that the stochastic diversity of random sampling, combined with Merlin's alternating batching over anatomical subsections, provides more effective regularization than engineered class ratios at the small batch sizes required by 3D medical volumes.
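
The headline metric, macro F1 across findings, is the unweighted mean of per-finding binary F1 scores. The sketch below shows that computation; the dictionary layout, the function names, and the convention of scoring 0.0 when a finding has no positives and no predicted positives are assumptions, since the paper's evaluation code is not shown here.

```python
def f1_score(y_true, y_pred):
    """Binary F1 for one finding; labels and predictions are 0/1 lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    # Assumed convention: F1 is 0.0 when there is nothing to detect or predict
    return 2 * tp / denom if denom else 0.0

def macro_f1(labels_by_finding, preds_by_finding):
    """Unweighted mean of per-finding F1, so rare findings count as much as common ones."""
    scores = [
        f1_score(labels_by_finding[name], preds_by_finding[name])
        for name in labels_by_finding
    ]
    return sum(scores) / len(scores)
```

Because each finding contributes equally, a large drop on a few data-hungry findings can move the macro score even when most findings are stable.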

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Random sampling with Merlin's alternation beats explicit normal-abnormal balancing by 2.4-2.8 F1 points in this 3D CT CLIP setup, with sublinear scaling curves as a side observation.

The paper reproduces Merlin on abdominal CT volumes and reports, then shows that three section-level balanced batch ratios all lose to the unbalanced random baseline. The same directional hit appears when they repeat the 50:50 test on a 4,362-study subset. They also give scaling curves from 20% to 100% of that subset, with performance moving from 65% to 72% and big per-finding differences in data hunger.

The reproduction lands at 74.45% macro F1, a bit above the original 73% number. The batch-composition ablations and the scaling plots are the new empirical pieces; nothing like this level of ratio testing was in the prior Merlin work. The consistent direction across full and subset data is the cleanest part of the evidence.

The main soft spot is exactly the one the stress test flags: we do not know whether the balanced runs kept the same total steps, the same per-epoch unique volume count, or the exact same subsection alternation schedule as the baseline. Changing the sampling rule necessarily changes which volumes travel together, so the performance gap could come from altered training dynamics rather than the ratio itself. No error bars or statistical tests are reported, and the abstract stays light on hyperparameter matching. That leaves the 2-3 point differences looking plausible but not yet locked down.

This is useful reading for anyone training 3D medical vision-language models who has to decide whether to force class balance at small batch sizes. It is worth sending to peer review because the experiments are straightforward and the result pushes back on a common assumption, but any referee will ask for explicit confirmation that the training schedules were held constant.

Referee Report

2 major / 2 minor

Summary. The manuscript reproduces the Merlin dual-encoder model for contrastive alignment of 3D abdominal CT volumes with radiology reports using symmetric InfoNCE loss, reaching 74.45% zero-shot macro F1 across 30 findings (original 73.00%). It then ablates batch composition on the full dataset by comparing random sampling against section-level balanced sampling at normal:abnormal ratios of 25:75, 50:50, and 75:25, finding all balanced variants underperform the random baseline by 2.4–2.8 F1 points (best balanced: 72.02%). A data-scaling study on a 4,362-study subset shows sub-linear gains from 65.26% (20% data) to 71.88% (100% data), with 50:50 balancing on the subset further dropping to 68.01%. The authors conclude that stochastic diversity from random sampling plus Merlin’s alternating anatomical subsection batching regularizes better than engineered class ratios at the small batch sizes required by 3D volumes.

Significance. If the observed gaps are shown to arise from batch composition rather than uncontrolled training dynamics, the result would be significant for medical vision-language modeling: it would indicate that explicit class balancing is counterproductive for contrastive pre-training on imbalanced 3D CT data and that preserving natural distributions with stochastic sampling plus anatomical alternation is preferable. The reproduction (comparable F1, same directional balancing effect on full and subset data) and the finding that individual findings vary sharply in data sensitivity are concrete contributions.

major comments (2)

[batch-composition experiments] The central comparison of random versus section-level balanced sampling (abstract and experimental results) does not state whether the balanced runs preserve the same total training steps, the same per-epoch unique volume count, the same subsection alternation schedule, or the same effective learning-rate schedule as the random baseline. Because section-level balancing necessarily changes co-occurrence statistics and sampling frequencies, any of these differences could produce the reported 2.4–2.8 point F1 gaps; the performance delta therefore cannot yet be attributed solely to class ratio.
[results] No error bars, standard deviations across runs, or statistical tests are reported for the F1 differences between random and balanced configurations (abstract and results). Given that the reproduction reaches only 74.45% and the reader notes limited implementation detail, demonstrating that the gaps are statistically reliable would be required to support the claim that random sampling plus alternation is superior regularization.
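
The first major comment turns on how balancing changes sampling frequencies. A short sketch makes the effect explicit: if abnormal samples make up a fraction p of the data and a fraction q of every batch, each abnormal sample is drawn q/p times as often as under uniform sampling, and each normal sample (1 - q)/(1 - p) times as often. The function name and the example prevalence below are hypothetical; the dataset's actual prevalence is not given on this page.

```python
def oversampling_factors(abnormal_prevalence, target_abnormal_fraction):
    """Per-sample draw frequency under balancing, relative to uniform sampling.

    Returns (factor for abnormal samples, factor for normal samples).
    """
    p, q = abnormal_prevalence, target_abnormal_fraction
    if not (0.0 < p < 1.0) or not (0.0 <= q <= 1.0):
        raise ValueError("prevalence must be in (0, 1) and target fraction in [0, 1]")
    return q / p, (1.0 - q) / (1.0 - p)

# Hypothetical prevalence of 20% abnormal with a 50:50 target ratio:
abnormal_factor, normal_factor = oversampling_factors(0.2, 0.5)
```

With those illustrative numbers, abnormal samples would be revisited 2.5 times as often and normal samples 0.625 times as often, so a fixed step budget covers fewer unique normal volumes. This is one of the confounds the referee asks the authors to rule out.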

minor comments (2)

[abstract] The abstract refers to “Merlin’s alternating batching over anatomical subsections” without specifying how the alternation is implemented or whether it is held constant across the random and balanced conditions.
[data-scaling ablation] The 4,362-study subset and the 20%/40%/100% splits are mentioned but not described (e.g., whether splits are stratified by finding prevalence or by patient).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our reproduction and ablation study. We address each major comment below, indicating planned revisions where appropriate.

Referee: The central comparison of random versus section-level balanced sampling (abstract and experimental results) does not state whether the balanced runs preserve the same total training steps, the same per-epoch unique volume count, the same subsection alternation schedule, or the same effective learning-rate schedule as the random baseline. Because section-level balancing necessarily changes co-occurrence statistics and sampling frequencies, any of these differences could produce the reported 2.4–2.8 point F1 gaps; the performance delta therefore cannot yet be attributed solely to class ratio.

Authors: We appreciate this clarification request. All runs used the same total number of optimization steps, the same learning-rate schedule, and the identical subsection alternation schedule from the Merlin reproduction. The per-epoch unique volume count necessarily differs under balancing, as this is inherent to the class-ratio ablation. We will revise the experimental setup section to explicitly document these controls, enabling clearer attribution of the observed gaps to batch composition while acknowledging the changes in sampling frequencies.

Revision: yes

Referee: No error bars, standard deviations across runs, or statistical tests are reported for the F1 differences between random and balanced configurations (abstract and results). Given that the reproduction reaches only 74.45% and the reader notes limited implementation detail, demonstrating that the gaps are statistically reliable would be required to support the claim that random sampling plus alternation is superior regularization.

Authors: We agree that variability metrics would strengthen the claims. Due to the high computational cost of 3D CT contrastive training, results are reported from single runs. The directional effect is consistent across the full dataset and the independent 4,362-study subset. We will add a limitations paragraph discussing single-run results and the cross-experiment consistency, while noting that multiple seeds would be preferable if resources permit.

Revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical ablation results

The paper reports direct experimental measurements of zero-shot macro F1 on held-out data after training dual-encoder models under controlled batch-composition and data-scaling regimes. No equations, derivations, or first-principles predictions appear; performance deltas (e.g., 2.4–2.8 F1 points) are obtained by training and evaluating separate runs rather than by fitting a parameter that is then renamed as a prediction. Self-citation to the original Merlin work serves only as a reproducible baseline and does not supply a uniqueness theorem, ansatz, or load-bearing premise that the present results reduce to. The central claim about stochastic diversity plus anatomical alternation therefore rests on observable training outcomes, not on any self-referential construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that contrastive alignment via symmetric InfoNCE produces useful zero-shot representations and that macro F1 on 30 findings is a stable evaluation metric. No new free parameters are fitted to produce the claim; the ratios tested are experimental controls, not model parameters.

axioms (1)

Domain assumption: Symmetric InfoNCE loss aligns image and text embeddings sufficiently for zero-shot classification of radiology findings. Invoked when interpreting the reproduced 74.45% F1 as evidence of successful alignment.
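
The assumption above is what makes zero-shot classification possible: if image and text embeddings share a space, a finding can be scored by comparing an image embedding against text prompts. The sketch below uses one common scheme, a "finding present" prompt versus a "finding absent" prompt, which is an assumption here rather than a confirmed detail of the paper's protocol. The function names and toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_predict(image_emb, present_prompt_emb, absent_prompt_emb):
    """Return 1 if the image is closer to the 'present' prompt than the 'absent' one."""
    present_score = cosine(image_emb, present_prompt_emb)
    absent_score = cosine(image_emb, absent_prompt_emb)
    return 1 if present_score > absent_score else 0

# Toy 2-D embeddings standing in for encoder outputs:
prediction = zero_shot_predict([1.0, 0.1], [1.0, 0.0], [0.0, 1.0])
```

No finding-specific training is involved at this stage, so the quality of the prediction depends entirely on how well contrastive pre-training aligned the two encoders.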
