
arxiv: 2605.21906 · v2 · pith:OIVZEORGnew · submitted 2026-05-21 · 💻 cs.CV

Universal CT Representations from Anatomy to Disease Phenotype through Agglomerative Pretraining

Pith reviewed 2026-05-25 06:13 UTC · model grok-4.3

classification 💻 cs.CV
keywords CT foundation models · agglomerative pretraining · medical image analysis · segmentation · classification · disease phenotype · vision-language models

The pith

FlexiCT's three-stage agglomerative pretraining on public CT volumes produces representations that match task-specific models across five task families and organize scans by tumor stage gradients.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents FlexiCT as a family of CT foundation models built through agglomerative continual pretraining on 266,227 volumes drawn from 56 public datasets. Training proceeds in three stages that begin with two-dimensional axial slices, advance to three-dimensional anatomical structures, and conclude with report-guided semantic alignment. The resulting embeddings support slice-level, volume-level, and vision-language analysis without task-specific retraining. On benchmarks for segmentation, classification, registration, vision-language understanding, and clinical retrieval, FlexiCT matches or exceeds prior specialized approaches. The same embeddings further arrange CT scans along gradients tied to tumor stages, indicating that the representations encode phenotype-relevant information.

Core claim

FlexiCT is obtained by agglomerative pretraining across 2D axial, 3D anatomical, and report-guided stages on a large public CT collection; the resulting representations enable competitive performance on multiple downstream task families at slice, volume, and vision-language levels while also organizing volumes along tumor-stage gradients.

What carries the argument

The three-stage agglomerative pretraining process (2D axial pretraining, 3D anatomical pretraining, report-guided semantic alignment) that builds representations supporting multiple analysis levels from one model.

If this is right

  • A single pretrained model can be applied to segmentation, classification, registration, vision-language understanding, and retrieval without per-task retraining.
  • CT representations can encode both anatomical structure and disease phenotype features such as tumor stage.
  • The training strategy enables analysis at slice, volume, and vision-language levels from the same embedding space.
  • Performance on the tested benchmarks reaches or surpasses that of prior task-specific models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The public dataset collection and staged training could serve as a base for extending the same model to additional CT tasks or modalities without starting from random weights.
  • Embedding organization by tumor stage opens the possibility of using the representations for unsupervised progression tracking if the gradient structure holds on new data.
  • The approach implies that continual pretraining across increasing levels of structure (slice to volume to report) may reduce fragmentation in medical imaging AI.

Load-bearing premise

The three-stage agglomerative pretraining on the selected 56 public datasets produces representations that generalize across tasks and capture disease phenotype information without additional task-specific adaptation or explicit controls for dataset biases.

What would settle it

Evaluation of FlexiCT embeddings on an independent CT collection with documented tumor-stage labels to test whether the embeddings form consistent, monotonic gradients with stage progression.
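
One way to make this test concrete is a rank-correlation check: project the embeddings onto a single axis and ask whether position along that axis rises with stage. The sketch below is a minimal illustration on synthetic data, not the paper's evaluation protocol. Using the first principal component as the axis, using Spearman correlation as the monotonicity measure, and the names `first_pc_scores`, `rankdata_average`, `spearman_rho` and `stage_gradient_report` are all assumptions made here.

```python
import numpy as np

def first_pc_scores(embeddings: np.ndarray) -> np.ndarray:
    """Project embeddings onto their first principal component."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions of the centered data.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[0]

def rankdata_average(x) -> np.ndarray:
    """1-based ranks; tied values share the mean of their ranks."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    sorted_x = x[order]
    ranks = np.empty(len(x), dtype=float)
    i = 0
    while i < len(x):
        j = i
        while j + 1 < len(x) and sorted_x[j + 1] == sorted_x[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks

def spearman_rho(a, b) -> float:
    """Spearman correlation = Pearson correlation of the ranks."""
    return float(np.corrcoef(rankdata_average(a), rankdata_average(b))[0, 1])

def stage_gradient_report(embeddings, stages):
    """Return (rho, per-stage mean score) along the first principal axis."""
    stages = np.asarray(stages)
    scores = first_pc_scores(np.asarray(embeddings, dtype=float))
    rho = spearman_rho(scores, stages)
    if rho < 0:  # the sign of a principal axis is arbitrary; orient it
        scores, rho = -scores, -rho
    means = np.array([scores[stages == s].mean() for s in np.unique(stages)])
    return rho, means

# Synthetic stand-in for real embeddings: stage shifts one latent direction.
rng = np.random.default_rng(0)
toy_stages = np.repeat([1, 2, 3, 4], 50)
toy_axis = np.zeros(16)
toy_axis[0] = 1.0
toy_embeddings = (2.0 * toy_stages[:, None] * toy_axis[None, :]
                  + rng.normal(scale=0.5, size=(200, 16)))
toy_rho, toy_means = stage_gradient_report(toy_embeddings, toy_stages)
is_monotonic = bool(np.all(np.diff(toy_means) > 0))
```

On an independent cohort, a high correlation together with strictly increasing per-stage means would support the gradient claim, while a flat or non-monotonic profile would count against it. The first principal component is only one candidate axis, so a negative result along it would not by itself rule out a gradient in another direction.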

Figures

Figures reproduced from arXiv: 2605.21906 by Haoyu Dong, James E. Baciak, Mojtaba Safari, Shansong Wang, Xiaofeng Yang, Yuan Gao, Yuheng Li, Yuxiang Lai.

Figure 1: Dataset statistics and three-stage pretraining strategy of FlexiCT. a, Composition of the FlexiCT pretraining dataset. Four donut charts summarise body region (top left; n = 266,227 volumes), geographic distribution (top right; n = 266,227), disease family (bottom left; n = 186,700 volumes with case- or cohort-level labels) and anatomical system (bottom right; same n). b, Frequency of the top 20 clinical c…
Figure 2: FlexiCT outperforms foundation models across 3D and 2D segmentation benchmarks. a, Volumetric segmentation Dice coefficient on six abdominal, thoracic and whole-body benchmarks (KiTS23, WORD, MSD Liver, MSD Lung, MSD Pancreas, and AutoPET), comparing nnU-Net, Primus-M, VoCo, CT-FM and FlexiCT-3D (red). b, Slice-level segmentation Dice coefficient on TotalSegmentator (104 anatomical classes partitioned into…
Figure 3: FlexiCT-2D enables training-free intra- and cross-modal abdominal registration. a, Per-organ Dice similarity coefficient on the Learn2Reg abdominal CT–CT task across 13 organs (n = 45 registration pairs across 5-fold cross-validation), comparing VoxelMorph, Curia, DINO-Reg and FlexiCT-2D (red). Curia, DINO-Reg and FlexiCT-2D share the same ConvexAdam optimisation framework and differ only in the feature ba…
Figure 4: FlexiCT-2D enables label-efficient disease classification from frozen features. a–d, Label-efficiency curves for frozen pretrained encoders trained for: renal tumor subtyping (KiTS; a), universal lesion classification (Deep-Lesion; b), pulmonary nodule detection (Luna16; c) and COVID-19 identification (Covidx-CT; d). X-axis labels give training-sample counts; dashed lines mark each model’s full-data AUC (n…
Figure 5: FlexiCT-3D embeddings organize tumors along clinical severity gradients without staging supervision. a, Zero-shot tumor retrieval (Recall@1, Recall@3) for T-stage (NSCLC-Radiogenomics) and ISUP grade (C4KC-KiTS), comparing CT-FM, VoCo, SPECTRE and FlexiCT-3D. b, Linear probing (AUC, balanced accuracy) on frozen embeddings for T-stage and ISUP grade, including a tumor-diameter-only clinical baseline (grey).…
Figure 6: FlexiCT-3D-VLM supports zero-shot disease classification and report retrieval across chest and abdominal CT. a, Zero-shot multi-label disease classification on CT-RATE (left) and Merlin (right), reporting macro-averaged precision, F1, accuracy (ACC) and area under the ROC curve (AUC). Baselines are CT-CLIP, COLIPRI and SPECTRE on CT-RATE; Merlin, COLIPRI and SPECTRE on the Merlin benchmark. b, Semantic rep…
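
The zero-shot retrieval metric named in Figure 5a (Recall@1, Recall@3) can be illustrated with a short sketch. The version below assumes one common definition: a query counts as a hit if any of its k nearest neighbours by cosine similarity, excluding itself, shares its label. The truncated caption does not state the paper's exact protocol, so this definition, the function name `recall_at_k` and the toy data are hypothetical.

```python
import numpy as np

def recall_at_k(embeddings, labels, k: int) -> float:
    """Fraction of queries with a same-label item among their k nearest neighbours.

    Assumes every embedding has non-zero norm and k < number of items.
    """
    x = np.asarray(embeddings, dtype=float)
    y = np.asarray(labels)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)  # unit vectors -> cosine similarity
    sim = x @ x.T
    np.fill_diagonal(sim, -np.inf)                    # a query must not retrieve itself
    topk = np.argsort(-sim, axis=1)[:, :k]            # indices of the k most similar items
    hits = (y[topk] == y[:, None]).any(axis=1)
    return float(hits.mean())

# Toy gallery: two tight clusters in 2D.
toy_embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
clustered_labels = np.array([0, 0, 1, 1])   # labels agree with the clusters
scrambled_labels = np.array([0, 1, 0, 1])   # labels cut across the clusters
r1_clustered = recall_at_k(toy_embeddings, clustered_labels, k=1)
r1_scrambled = recall_at_k(toy_embeddings, scrambled_labels, k=1)
```

Because no classifier is trained, a high Recall@k under this definition reflects how the frozen embedding space is organized rather than how well a probe fits the labels.
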
Original abstract

Computed tomography (CT) is central to three-dimensional medical imaging, yet CT-based artificial intelligence remains fragmented across task-specific models for segmentation, classification, registration, and report analysis. Here we present FlexiCT, a family of CT foundation models trained by agglomerative continual pretraining on 266,227 CT volumes from 56 publicly available datasets, forming a large-scale public resource for CT representation learning. FlexiCT uses agglomerative pretraining across three stages: two-dimensional axial pretraining, three-dimensional anatomical pretraining and report-guided semantic alignment. This training strategy supports slice-level, volume-level and vision-language analysis. Across five downstream task families (segmentation, classification, registration, vision-language understanding and clinical retrieval), FlexiCT matches or exceeds prior task-specific approaches on multiple benchmarks. Its embeddings further organize CT scans along gradients associated with various tumor stages, suggesting that CT foundation models can capture imaging features relevant to disease phenotype characterization. Project page and code are available at: https://ricklisz.github.io/flexict.github.io and https://github.com/ricklisz/FlexiCT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces FlexiCT, a family of CT foundation models trained via three-stage agglomerative continual pretraining (2D axial, 3D anatomical, report-guided semantic alignment) on 266,227 volumes from 56 public datasets. It claims support for slice-, volume-, and vision-language tasks and states that FlexiCT matches or exceeds prior task-specific models across segmentation, classification, registration, vision-language understanding, and clinical retrieval benchmarks while organizing embeddings along tumor-stage gradients.

Significance. If the reported performance is shown to arise from the agglomerative strategy rather than data leakage or scale alone, the work would provide a substantial public resource for unifying fragmented CT AI tasks and for phenotype-aware representations. The use of only public data and the release of code are strengths that would facilitate community follow-up.

major comments (2)
  1. [Abstract] The central claim that FlexiCT 'matches or exceeds prior task-specific approaches on multiple benchmarks' is presented without any quantitative metrics, error bars, baseline tables, dataset splits, or exclusion criteria, rendering the headline result unverifiable.
  2. [Experimental evaluation] (downstream task families) No decontamination protocol or overlap audit between the 56 pretraining datasets and the five evaluation families is described, nor is an ablation that isolates the three agglomerative stages from simple data scale; this directly undermines attribution of any observed parity or superiority to the claimed method.
minor comments (1)
  1. The manuscript would benefit from an explicit table listing the 56 datasets with their sizes, task types, and any known acquisition characteristics to allow readers to assess potential biases.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We agree that the abstract and experimental sections require strengthening for verifiability and to better isolate the contribution of the agglomerative pretraining strategy. We outline point-by-point revisions below.

  1. Referee: [Abstract] The central claim that FlexiCT 'matches or exceeds prior task-specific approaches on multiple benchmarks' is presented without any quantitative metrics, error bars, baseline tables, dataset splits, or exclusion criteria, rendering the headline result unverifiable.

    Authors: We agree that the abstract would be strengthened by including key quantitative results. In the revised manuscript we will add specific metrics (e.g., mean Dice scores and standard deviations for segmentation tasks, accuracy/F1 for classification, and retrieval mAP) together with brief statements on dataset splits and exclusion criteria. These numbers will be drawn directly from the results tables already present in the main text. revision: yes

  2. Referee: [Experimental evaluation] (downstream task families) No decontamination protocol or overlap audit between the 56 pretraining datasets and the five evaluation families is described, nor is an ablation that isolates the three agglomerative stages from simple data scale; this directly undermines attribution of any observed parity or superiority to the claimed method.

    Authors: We acknowledge this limitation. The revised manuscript will add: (1) an explicit decontamination protocol listing all evaluation datasets and confirming zero patient-level overlap with the 56 pretraining collections, (2) a supplementary table auditing dataset overlap, and (3) an ablation that compares the full three-stage agglomerative model against a single-stage baseline trained on the same total volume count. These additions will allow readers to attribute performance differences to the training strategy rather than scale or leakage alone. revision: yes
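
A patient-level overlap audit of the kind promised in this response could be sketched as follows. This is a minimal illustration that assumes each dataset exposes a set of patient identifiers comparable across collections. The function name `audit_overlap`, the dataset names and the identifiers are hypothetical, and a real audit would also have to catch re-released or re-anonymized scans that share no identifier.

```python
from typing import Dict, Set, Tuple

def audit_overlap(
    pretrain: Dict[str, Set[str]],
    evaluation: Dict[str, Set[str]],
) -> Dict[Tuple[str, str], Set[str]]:
    """Return the shared patient IDs for every (pretraining, evaluation) dataset pair.

    Pairs with no shared patients are omitted, so an empty result means
    no identifier-level leakage was found.
    """
    overlaps: Dict[Tuple[str, str], Set[str]] = {}
    for pretrain_name, pretrain_ids in pretrain.items():
        for eval_name, eval_ids in evaluation.items():
            shared = pretrain_ids & eval_ids  # set intersection
            if shared:
                overlaps[(pretrain_name, eval_name)] = shared
    return overlaps

# Hypothetical collections, for illustration only.
pretrain_sets = {"pretrain_A": {"p1", "p2"}, "pretrain_B": {"p3"}}
evaluation_sets = {"eval_X": {"p2", "p9"}, "eval_Y": {"p7"}}
leaks = audit_overlap(pretrain_sets, evaluation_sets)
```

Reporting the non-empty pairs, rather than a single yes/no flag, would let readers see which benchmarks are affected and by how many patients.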

Circularity Check

0 steps flagged

No significant circularity; all claims are empirical training and evaluation results.


The paper describes an empirical ML study: agglomerative pretraining of FlexiCT on 266,227 volumes from 56 public datasets, followed by evaluation on downstream tasks. No equations, derivations, or mathematical claims appear in the abstract or description. There are no self-definitional steps, no fitted inputs renamed as predictions, and no load-bearing self-citations that reduce results to inputs by construction. The central claims (matching task-specific models, embeddings organizing by tumor stage) are presented as outcomes of training and benchmarking, not as reductions of prior results. This matches the default expectation for non-circular empirical papers; a score of 0 is appropriate and common.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review prevents enumeration of specific free parameters, axioms, or invented entities; typical foundation-model training involves many unstated hyperparameters and standard assumptions about data distribution that cannot be audited here.


