
arxiv: 2605.21906 · v1 · pith:OIVZEORGnew · submitted 2026-05-21 · 💻 cs.CV

Universal CT Representations from Anatomy to Disease Phenotype through Agglomerative Pretraining

Pith reviewed 2026-05-22 07:47 UTC · model grok-4.3

classification 💻 cs.CV
keywords CT · foundation model · pretraining · medical imaging · segmentation · classification · vision-language

The pith

A single CT foundation model trained in three agglomerative stages matches or exceeds task-specific models across five clinical task families.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FlexiCT, a family of models pretrained on 266,227 CT volumes drawn from 56 public datasets. It uses a three-stage process that starts with 2D axial slices, moves to 3D anatomical volumes, and finishes with report-guided semantic alignment. The goal is to replace the current patchwork of separate models for different CT tasks with one set of general representations. If the approach holds, clinicians and researchers could use the same embeddings for segmentation, classification, registration, report interpretation, and retrieval while also reading off disease progression signals directly from the learned space.

Core claim

FlexiCT is trained by agglomerative continual pretraining in three stages—two-dimensional axial pretraining, three-dimensional anatomical pretraining, and report-guided semantic alignment—on 266,227 CT volumes from 56 publicly available datasets. The resulting representations match or exceed prior task-specific approaches on benchmarks spanning segmentation, classification, registration, vision-language understanding, and clinical retrieval. The same embeddings further organize scans along gradients linked to tumor stage progression.

What carries the argument

Three-stage agglomerative continual pretraining that progressively builds slice-level, volume-level, and vision-language representations from the same data pool.

If this is right

  • One model family supports slice-level, volume-level, and vision-language analysis without retraining from scratch.
  • Embeddings capture disease phenotype information such as tumor stage gradients even without explicit supervision for those labels.
  • A public checkpoint and code release creates a shared starting point for new CT tasks instead of training each one separately.
  • Clinical retrieval and report alignment become feasible using the same representation space built for imaging tasks.
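
The retrieval and phenotype claims can be made concrete with a small sketch of Recall@k on frozen embeddings. It assumes a leave-one-out protocol with cosine similarity, which the paper summary does not confirm; the function names `cosine` and `recall_at_k` and the toy vectors and stage labels are illustrative only and are not taken from the FlexiCT code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def recall_at_k(embeddings, labels, k):
    """Fraction of queries whose k nearest neighbours (query excluded)
    contain at least one scan with the same label."""
    hits = 0
    for i, query in enumerate(embeddings):
        # Rank every other scan by similarity to the query, most similar first.
        ranked = sorted(
            (j for j in range(len(embeddings)) if j != i),
            key=lambda j: cosine(query, embeddings[j]),
            reverse=True,
        )
        if any(labels[j] == labels[i] for j in ranked[:k]):
            hits += 1
    return hits / len(embeddings)

# Toy 2-D "embeddings" (hypothetical): two clusters of two scans each.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned_labels = ["T1", "T1", "T3", "T3"]   # labels follow the clusters
shuffled_labels = ["T1", "T3", "T3", "T1"]  # labels cut across the clusters

print(recall_at_k(toy_embeddings, aligned_labels, k=1))
print(recall_at_k(toy_embeddings, shuffled_labels, k=1))
```

When labels follow the geometry of the space, the nearest neighbour shares the query's label; when they cut across it, Recall@1 collapses. That contrast is what a "gradient without staging supervision" claim would need to show on real embeddings.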

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same staged pretraining recipe could be tried on MRI or PET to test whether modality-specific foundations emerge without starting from scratch each time.
  • If the tumor-stage organization generalizes, the embeddings might support longitudinal tracking of individual patients across multiple scans.
  • Evaluating the model on private multi-center data with scanner and demographic shifts would directly test whether the public-data pretraining is sufficient for broad deployment.

Load-bearing premise

The 56 public datasets are representative of clinical practice and free of leakage so that the learned representations transfer to new patient populations and scanners.

What would settle it

A clear drop in performance relative to task-specific baselines when the model is tested on CT data from a previously unseen hospital network or scanner vendor would show the representations are not yet universal.
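
One way to read this test is as a per-site comparison of the foundation model against a task-specific baseline. The sketch below assumes each evaluation case carries a site tag and one score per model; the function name `per_site_gap`, the site names, and the numbers are hypothetical and are not results from the paper.

```python
from collections import defaultdict

def per_site_gap(records):
    """records: iterable of (site, foundation_score, baseline_score).
    Returns {site: mean foundation score minus mean baseline score}."""
    totals = defaultdict(lambda: [0.0, 0.0, 0])
    for site, foundation, baseline in records:
        entry = totals[site]
        entry[0] += foundation
        entry[1] += baseline
        entry[2] += 1
    return {site: (f - b) / n for site, (f, b, n) in totals.items()}

# Hypothetical per-case scores (e.g. Dice) from a seen and an unseen site.
toy_records = [
    ("seen_site", 0.90, 0.88),
    ("seen_site", 0.92, 0.90),
    ("unseen_site", 0.70, 0.80),
]
gaps = per_site_gap(toy_records)
for site, gap in gaps.items():
    print(site, round(gap, 3))
```

A negative gap that appears only on the unseen site would be the pattern this section describes; a gap that stays near zero or positive across sites would count in the model's favour.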

Figures

Figures reproduced from arXiv: 2605.21906 by Haoyu Dong, James E. Baciak, Mojtaba Safari, Shansong Wang, Xiaofeng Yang, Yuan Gao, Yuheng Li, Yuxiang Lai.

Figure 1: Dataset statistics and three-stage pretraining strategy of FlexiCT. a, Composition of the FlexiCT pretraining dataset. Four donut charts summarise body region (top left; n = 266,227 volumes), geographic distribution (top right; n = 266,227), disease family (bottom left; n = 186,700 volumes with case- or cohort-level labels) and anatomical system (bottom right; same n). b, Frequency of the top 20 clinical c…

Figure 2: FlexiCT outperforms foundation models across 3D and 2D segmentation benchmarks. a, Volumetric segmentation Dice coefficient on six abdominal, thoracic and whole-body benchmarks (KiTS23, WORD, MSD Liver, MSD Lung, MSD Pancreas, and AutoPET), comparing nnU-Net, Primus-M, VoCo, CT-FM and FlexiCT-3D (red). b, Slice-level segmentation Dice coefficient on TotalSegmentator (104 anatomical classes partitioned into…

Figure 3: FlexiCT-2D enables training-free intra- and cross-modal abdominal registration. a, Per-organ Dice similarity coefficient on the Learn2Reg abdominal CT–CT task across 13 organs (n = 45 registration pairs across 5-fold cross-validation), comparing VoxelMorph, Curia, DINO-Reg and FlexiCT-2D (red). Curia, DINO-Reg and FlexiCT-2D share the same ConvexAdam optimisation framework and differ only in the feature ba…

Figure 4: FlexiCT-2D enables label-efficient disease classification from frozen features. a–d, Label-efficiency curves for frozen pretrained encoders trained for: renal tumor subtyping (KiTS; a), universal lesion classification (Deep-Lesion; b), pulmonary nodule detection (Luna16; c) and COVID-19 identification (Covidx-CT; d). X-axis labels give training-sample counts; dashed lines mark each model’s full-data AUC (n…

Figure 5: FlexiCT-3D embeddings organize tumors along clinical severity gradients without staging supervision. a, Zero-shot tumor retrieval (Recall@1, Recall@3) for T-stage (NSCLC-Radiogenomics) and ISUP grade (C4KC-KiTS), comparing CT-FM, VoCo, SPECTRE and FlexiCT-3D. b, Linear probing (AUC, balanced accuracy) on frozen embeddings for T-stage and ISUP grade, including a tumor-diameter-only clinical baseline (grey).…

Figure 6: FlexiCT-3D-VLM supports zero-shot disease classification and report retrieval across chest and abdominal CT. a, Zero-shot multi-label disease classification on CT-RATE (left) and Merlin (right), reporting macro-averaged precision, F1, accuracy (ACC) and area under the ROC curve (AUC). Baselines are CT-CLIP, COLIPRI and SPECTRE on CT-RATE; Merlin, COLIPRI and SPECTRE on the Merlin benchmark. b, Semantic rep…
Abstract

Computed tomography (CT) is central to three-dimensional medical imaging, yet CT-based artificial intelligence remains fragmented across task-specific models for segmentation, classification, registration, and report analysis. Here we present FlexiCT, a family of CT foundation models trained by agglomerative continual pretraining on 266,227 CT volumes from 56 publicly available datasets, forming a large-scale public resource for CT representation learning. FlexiCT uses agglomerative pretraining across three stages: two-dimensional axial pretraining, three-dimensional anatomical pretraining and report-guided semantic alignment. This training strategy supports slice-level, volume-level and vision-language analysis. Across five downstream task families (segmentation, classification, registration, vision-language understanding and clinical retrieval), FlexiCT matches or exceeds prior task-specific approaches on multiple benchmarks. Its embeddings further organize CT scans along gradients associated with various tumor stages, suggesting that CT foundation models can capture imaging features relevant to disease phenotype characterization. Code is available at https://github.com/ricklisz/FlexiCT

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces FlexiCT, a family of CT foundation models trained via three-stage agglomerative continual pretraining on 266,227 volumes drawn from 56 public datasets. The stages consist of 2D axial pretraining, 3D anatomical pretraining, and report-guided semantic alignment. The resulting embeddings are evaluated across five downstream task families (segmentation, classification, registration, vision-language understanding, and clinical retrieval) and are reported to match or exceed prior task-specific models on multiple benchmarks while also organizing scans along gradients associated with tumor stages.

Significance. If the central claims hold after addressing evaluation details, this would constitute a meaningful contribution to medical imaging by demonstrating that a single set of representations can span anatomy-to-phenotype tasks without task-specific retraining. The scale of the public data collection and the staged pretraining approach are notable strengths, as is the public release of code.

major comments (2)
  1. [§4 and Evaluation protocols] The manuscript does not provide an explicit description or audit of patient/scan overlap checks between the 56 pretraining collections and the downstream test splits used in the five task families. This is load-bearing for the generalization claim in the abstract and §4, because even modest leakage common in public archives could produce the reported transfer performance without demonstrating the claimed universal anatomy-to-phenotype mapping.
  2. [Results section] Quantitative results for the downstream benchmarks (including exact metrics, error bars, statistical tests, and full baseline details) are not reported in sufficient detail to support the claim that FlexiCT 'matches or exceeds' prior approaches. This information is required in the results section to allow assessment of effect sizes and to confirm the central performance claim.
minor comments (2)
  1. [§3] Clarify the precise definition and weighting of the three pretraining stages (e.g., loss functions and data sampling ratios) to improve reproducibility.
  2. [Figures 4-5] Figure captions should explicitly state the number of samples and any exclusion criteria used for the phenotype organization visualizations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have addressed each major comment point by point below, with revisions made to strengthen the presentation of our methods and results.

  1. Referee: [§4 and Evaluation protocols] The manuscript does not provide an explicit description or audit of patient/scan overlap checks between the 56 pretraining collections and the downstream test splits used in the five task families. This is load-bearing for the generalization claim in the abstract and §4, because even modest leakage common in public archives could produce the reported transfer performance without demonstrating the claimed universal anatomy-to-phenotype mapping.

    Authors: We agree that explicit verification of patient and scan overlaps is critical to support the generalization claims. The pretraining corpus was assembled exclusively from public datasets, and downstream evaluations followed the official published splits and protocols for each benchmark. However, the initial submission did not include a dedicated overlap audit. We have now performed this analysis using available metadata (patient identifiers, acquisition dates, and institutional tags where present across the public releases). The audit results, including any minimal overlaps detected and mitigation steps, have been added to §4 along with a new supplementary table. This revision directly bolsters the validity of the reported transfer performance. revision: yes
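
A metadata-based overlap audit of the kind described in this response could look roughly like the sketch below. It assumes each scan is represented as a dictionary with a `patient_id` field, which is an assumption about the metadata rather than something the paper specifies; `find_overlap` and the example identifiers are hypothetical. Public releases are often re-anonymized, so an identifier match of this kind may only give a lower bound on true overlap.

```python
def find_overlap(pretrain_records, test_records, keys=("patient_id",)):
    """Return the test records whose key tuple also occurs in pretraining.
    Records missing any key are skipped, since they cannot be matched."""
    def key_of(record):
        # Normalise case and whitespace so trivially different IDs still match.
        return tuple(str(record[k]).strip().lower() for k in keys)

    def has_keys(record):
        return all(record.get(k) for k in keys)

    seen = {key_of(r) for r in pretrain_records if has_keys(r)}
    return [r for r in test_records if has_keys(r) and key_of(r) in seen]

# Hypothetical metadata rows; the identifiers are made up for illustration.
pretrain = [{"patient_id": "LIDC-0001"}, {"patient_id": "KiTS-0042"}]
test = [{"patient_id": "kits-0042 "}, {"patient_id": "NEW-7"}, {"dataset": "B"}]

leaked = find_overlap(pretrain, test)
print(len(leaked), "of", len(test), "test records also appear in pretraining")
```

Passing a longer `keys` tuple (for example an acquisition date alongside the identifier) tightens the match, at the cost of skipping records that lack either field.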

  2. Referee: [Results section] Quantitative results for the downstream benchmarks (including exact metrics, error bars, statistical tests, and full baseline details) are not reported in sufficient detail to support the claim that FlexiCT 'matches or exceeds' prior approaches. This information is required in the results section to allow assessment of effect sizes and to confirm the central performance claim.

    Authors: We appreciate the need for greater transparency in the quantitative results. The original manuscript summarized key outcomes in the main text while directing readers to supplementary materials for full tables. To address this concern, we have expanded the Results section with comprehensive tables for all five task families. These now report exact metric values, error bars or confidence intervals, statistical significance tests (e.g., paired comparisons against baselines), and complete baseline details with both originally reported and reproduced scores. Updated figures accompany the tables to facilitate assessment of effect sizes. revision: yes
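
For the paired comparisons mentioned in this response, one common option is a paired bootstrap over per-case score differences. The sketch below is an assumption about how such an interval might be computed, not the authors' procedure; `paired_bootstrap_ci`, its defaults, and the example scores are illustrative.

```python
import random

def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean per-case difference (a - b).
    Both lists must be scored on the same cases, in the same order."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample cases with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lower, upper

# Hypothetical per-case Dice scores for a foundation model and a baseline.
model_scores = [0.9, 0.8, 0.7]
baseline_scores = [0.8, 0.7, 0.6]
mean_diff, lower, upper = paired_bootstrap_ci(model_scores, baseline_scores)
print(round(mean_diff, 3), round(lower, 3), round(upper, 3))
```

An interval that excludes zero supports "exceeds"; an interval that straddles zero is closer to "matches", which is why the per-case scores matter more than a single headline number.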

Circularity Check

0 steps flagged

No circularity: empirical pretraining evaluated on external benchmarks


The paper describes an empirical agglomerative pretraining pipeline on 266,227 CT volumes from 56 public datasets, followed by evaluation across standard downstream task families. No equations, fitted parameters, or derivations are presented that reduce reported performance or embeddings to definitional equivalence with the inputs. The approach relies on external public data and benchmarks rather than self-referential steps, self-citation chains, or ansatzes smuggled via prior work. This is a self-contained experimental result against external validation sets.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the quality and representativeness of the aggregated public datasets plus standard deep-learning transfer assumptions; no new physical entities or ad-hoc constants are introduced.

axioms (1)
  • domain assumption: The 56 public datasets collectively provide unbiased coverage of anatomical and pathological variation sufficient for learning universal representations.
    Invoked implicitly by the scale and diversity claims in the abstract; if violated, transfer performance would degrade.

pith-pipeline@v0.9.0 · 5736 in / 1350 out tokens · 50089 ms · 2026-05-22T07:47:38.467584+00:00 · methodology
