Universal CT Representations from Anatomy to Disease Phenotype through Agglomerative Pretraining
Pith reviewed 2026-05-25 06:13 UTC · model grok-4.3
The pith
FlexiCT's three-stage agglomerative pretraining on public CT volumes produces representations that match task-specific models across five task families and organize scans by tumor stage gradients.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FlexiCT is obtained by agglomerative pretraining across 2D axial, 3D anatomical, and report-guided stages on a large public CT collection; the resulting representations enable competitive performance on multiple downstream task families at slice, volume, and vision-language levels while also organizing volumes along tumor-stage gradients.
What carries the argument
The three-stage agglomerative pretraining process (2D axial pretraining, 3D anatomical pretraining, report-guided semantic alignment) that builds representations supporting multiple analysis levels from one model.
If this is right
- A single pretrained model can be applied to segmentation, classification, registration, vision-language understanding, and retrieval without per-task retraining.
- CT representations can encode both anatomical structure and disease phenotype features such as tumor stage.
- The training strategy enables analysis at slice, volume, and vision-language levels from the same embedding space.
- Performance on the tested benchmarks reaches or surpasses that of prior task-specific models.
Where Pith is reading between the lines
- The public dataset collection and staged training could serve as a base for extending the same model to additional CT tasks or modalities without starting from random weights.
- Embedding organization by tumor stage opens the possibility of using the representations for unsupervised progression tracking if the gradient structure holds on new data.
- The approach implies that continual pretraining across increasing levels of structure (slice to volume to report) may reduce fragmentation in medical imaging AI.
Load-bearing premise
The three-stage agglomerative pretraining on the selected 56 public datasets produces representations that generalize across tasks and capture disease phenotype information without additional task-specific adaptation or explicit controls for dataset biases.
What would settle it
Evaluation of FlexiCT embeddings on an independent CT collection with documented tumor-stage labels to test whether the embeddings form consistent, monotonic gradients with stage progression.
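One way to make this test concrete is sketched below. It is a minimal illustration, not the paper's evaluation protocol. It assumes each scan has already been reduced to a single scalar projection of its embedding (how that axis is chosen is left open here) and has an ordinal stage label. The function names `average_ranks`, `spearman`, and `stage_means_monotonic`, and the toy numbers, are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def stage_means_monotonic(projection, stages):
    """True if the per-stage mean projection strictly rises or strictly falls."""
    by_stage = {}
    for p, s in zip(projection, stages):
        by_stage.setdefault(s, []).append(p)
    means = [mean(by_stage[s]) for s in sorted(by_stage)]
    increasing = all(a < b for a, b in zip(means, means[1:]))
    decreasing = all(a > b for a, b in zip(means, means[1:]))
    return increasing or decreasing


# Toy data (hypothetical): one scalar projection per scan and its stage label.
projection = [0.1, 0.2, 0.5, 0.6, 0.9, 1.1]
stages = [1, 1, 2, 2, 3, 3]
rho = spearman(projection, stages)
monotonic = stage_means_monotonic(projection, stages)
```

A high absolute rank correlation together with monotonic per-stage means on held-out data would be consistent with the gradient claim. A weak correlation or a non-monotonic ordering would count against it.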
Original abstract
Computed tomography (CT) is central to three-dimensional medical imaging, yet CT-based artificial intelligence remains fragmented across task-specific models for segmentation, classification, registration, and report analysis. Here we present FlexiCT, a family of CT foundation models trained by agglomerative continual pretraining on 266,227 CT volumes from 56 publicly available datasets, forming a large-scale public resource for CT representation learning. FlexiCT uses agglomerative pretraining across three stages: two-dimensional axial pretraining, three-dimensional anatomical pretraining and report-guided semantic alignment. This training strategy supports slice-level, volume-level and vision-language analysis. Across five downstream task families (segmentation, classification, registration, vision-language understanding and clinical retrieval), FlexiCT matches or exceeds prior task-specific approaches on multiple benchmarks. Its embeddings further organize CT scans along gradients associated with various tumor stages, suggesting that CT foundation models can capture imaging features relevant to disease phenotype characterization. Project page and code are available at: https://ricklisz.github.io/flexict.github.io and https://github.com/ricklisz/FlexiCT.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces FlexiCT, a family of CT foundation models trained via three-stage agglomerative continual pretraining (2D axial, 3D anatomical, report-guided semantic alignment) on 266,227 volumes from 56 public datasets. It claims support for slice-, volume-, and vision-language tasks and states that FlexiCT matches or exceeds prior task-specific models across segmentation, classification, registration, vision-language understanding, and clinical retrieval benchmarks while organizing embeddings along tumor-stage gradients.
Significance. If the reported performance is shown to arise from the agglomerative strategy rather than data leakage or scale alone, the work would provide a substantial public resource for unifying fragmented CT AI tasks and for phenotype-aware representations. The use of only public data and the release of code are strengths that would facilitate community follow-up.
major comments (2)
- [Abstract] Abstract: the central claim that FlexiCT 'matches or exceeds prior task-specific approaches on multiple benchmarks' is presented without any quantitative metrics, error bars, baseline tables, dataset splits, or exclusion criteria, rendering the headline result unverifiable.
- [Experimental evaluation] Experimental evaluation (downstream task families): no decontamination protocol, overlap audit, or ablation isolating the three agglomerative stages from simple data scale is described between the 56 pretraining datasets and the five evaluation families; this directly undermines attribution of any observed parity or superiority to the claimed method.
minor comments (1)
- The manuscript would benefit from an explicit table listing the 56 datasets with their sizes, task types, and any known acquisition characteristics to allow readers to assess potential biases.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We agree that the abstract and experimental sections require strengthening for verifiability and to better isolate the contribution of the agglomerative pretraining strategy. We outline point-by-point revisions below.
Point-by-point responses
- Referee: [Abstract] Abstract: the central claim that FlexiCT 'matches or exceeds prior task-specific approaches on multiple benchmarks' is presented without any quantitative metrics, error bars, baseline tables, dataset splits, or exclusion criteria, rendering the headline result unverifiable.
  Authors: We agree that the abstract would be strengthened by including key quantitative results. In the revised manuscript we will add specific metrics (e.g., mean Dice scores and standard deviations for segmentation tasks, accuracy/F1 for classification, and retrieval mAP) together with brief statements on dataset splits and exclusion criteria. These numbers will be drawn directly from the results tables already present in the main text.
  Revision: yes
- Referee: [Experimental evaluation] Experimental evaluation (downstream task families): no decontamination protocol, overlap audit, or ablation isolating the three agglomerative stages from simple data scale is described between the 56 pretraining datasets and the five evaluation families; this directly undermines attribution of any observed parity or superiority to the claimed method.
  Authors: We acknowledge this limitation. The revised manuscript will add: (1) an explicit decontamination protocol listing all evaluation datasets and confirming zero patient-level overlap with the 56 pretraining collections, (2) a supplementary table auditing dataset overlap, and (3) an ablation that compares the full three-stage agglomerative model against a single-stage baseline trained on the same total volume count. These additions will allow readers to attribute performance differences to the training strategy rather than scale or leakage alone.
  Revision: yes
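The patient-level overlap audit promised in the second response could take roughly the following shape. This is a hedged sketch, not the authors' protocol. It assumes each dataset exposes a list of patient identifiers that are comparable across collections, which is often not true for independently anonymized public data. The dataset names, the IDs, and the functions `normalize_id` and `audit_patient_overlap` are all hypothetical.

```python
def normalize_id(patient_id):
    """Canonicalize an identifier so trivial formatting differences do not hide a match."""
    return str(patient_id).strip().lower()


def audit_patient_overlap(pretrain_ids, eval_ids):
    """For each evaluation dataset, map overlapping patient IDs to the
    pretraining datasets in which they also appear."""
    seen = {}
    for dataset, ids in pretrain_ids.items():
        for pid in ids:
            seen.setdefault(normalize_id(pid), set()).add(dataset)

    report = {}
    for eval_name, ids in eval_ids.items():
        hits = {}
        for pid in ids:
            key = normalize_id(pid)
            if key in seen:
                hits[key] = sorted(seen[key])
        report[eval_name] = hits
    return report


# Hypothetical inputs: dataset name -> list of patient identifiers.
pretrain = {"pretrain_a": ["P001", "P002"], "pretrain_b": ["p003 "]}
evals = {"eval_x": ["P003", "P010"], "eval_y": ["P020"]}

report = audit_patient_overlap(pretrain, evals)
# An empty dict for an evaluation set means no shared IDs were found.
clean = [name for name, hits in report.items() if not hits]
```

An empty report entry only shows that no identifiers matched. Where collections re-anonymize the same scans under different IDs, an audit like this would likely need to be supplemented with content-based checks such as image hashing, which is outside this sketch.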
Circularity Check
No significant circularity; all claims are empirical training and evaluation results.
Full rationale
The paper describes an empirical ML study: agglomerative pretraining of FlexiCT on 266,227 volumes from 56 public datasets, followed by evaluation on downstream tasks. No equations, derivations, or mathematical claims appear in the abstract or description. There are no self-definitional steps, no fitted inputs renamed as predictions, and no load-bearing self-citations that reduce results to inputs by construction. The central claims (matching task-specific models, embeddings organizing by tumor stage) are presented as outcomes of training and benchmarking, not as reductions of prior results. This matches the default expectation for non-circular empirical papers; a score of 0 is appropriate and common.