Universal CT Representations from Anatomy to Disease Phenotype through Agglomerative Pretraining

Haoyu Dong; James E. Baciak; Mojtaba Safari; Shansong Wang; Xiaofeng Yang; Yuan Gao; Yuheng Li; Yuxiang Lai

arxiv: 2605.21906 · v1 · pith:OIVZEORGnew · submitted 2026-05-21 · 💻 cs.CV

Yuheng Li, Yuan Gao, Haoyu Dong, Yuxiang Lai, Shansong Wang, Mojtaba Safari, James E. Baciak, Xiaofeng Yang

Pith reviewed 2026-05-22 07:47 UTC · model grok-4.3

keywords: CT · foundation model · pretraining · medical imaging · segmentation · classification · vision-language

The pith

A single CT foundation model trained in three agglomerative stages matches or exceeds task-specific models across five clinical task families.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FlexiCT, a family of models pretrained on 266,227 CT volumes drawn from 56 public datasets. It uses a three-stage process that starts with 2D axial slices, moves to 3D anatomical volumes, and finishes with report-guided semantic alignment. The goal is to replace the current patchwork of separate models for different CT tasks with one set of general representations. If the approach holds, clinicians and researchers could use the same embeddings for segmentation, classification, registration, report interpretation, and retrieval while also reading off disease progression signals directly from the learned space.

Core claim

FlexiCT is trained by agglomerative continual pretraining in three stages—two-dimensional axial pretraining, three-dimensional anatomical pretraining, and report-guided semantic alignment—on 266,227 CT volumes from 56 publicly available datasets. The resulting representations match or exceed prior task-specific approaches on benchmarks spanning segmentation, classification, registration, vision-language understanding, and clinical retrieval. The same embeddings further organize scans along gradients linked to tumor stage progression.

What carries the argument

Three-stage agglomerative continual pretraining that progressively builds slice-level, volume-level, and vision-language representations from the same data pool.
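
The staged recipe can be pictured as a chain in which each stage starts from the previous stage's weights rather than from scratch. The sketch below is purely illustrative and is not the FlexiCT training code: the `Stage` dataclass, `make_toy_trainer`, `run_pipeline`, and the toy "weights" dictionary are hypothetical stand-ins, and only the three stage names come from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Weights = Dict[str, float]  # toy stand-in for a model checkpoint


@dataclass
class Stage:
    name: str
    train: Callable[[Weights], Weights]  # initial weights in, updated weights out


def make_toy_trainer(key: str) -> Callable[[Weights], Weights]:
    """Hypothetical trainer: keeps incoming weights and adds one stage-specific entry."""
    def train(init: Weights) -> Weights:
        updated = dict(init)   # continue from the previous checkpoint
        updated[key] = 1.0     # placeholder for whatever this stage learns
        return updated
    return train


def run_pipeline(stages: List[Stage]) -> Weights:
    """Run the stages in order, handing each one the weights of its predecessor."""
    weights: Weights = {}      # only the first stage starts empty
    for stage in stages:
        weights = stage.train(weights)
    return weights


stages = [
    Stage("2d_axial", make_toy_trainer("slice_features")),
    Stage("3d_anatomical", make_toy_trainer("volume_features")),
    Stage("report_alignment", make_toy_trainer("text_alignment")),
]
final = run_pipeline(stages)
print(sorted(final))
```

The point of the toy is the hand-off: nothing learned in an earlier stage is discarded when a later stage begins, which is the property the "agglomerative continual" label seems to describe.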

If this is right

One model family supports slice-level, volume-level, and vision-language analysis without retraining from scratch.
Embeddings capture disease phenotype information such as tumor stage gradients even without explicit supervision for those labels.
A public checkpoint and code release creates a shared starting point for new CT tasks instead of training each one separately.
Clinical retrieval and report alignment become feasible using the same representation space built for imaging tasks.
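
Retrieval in a shared embedding space can be scored with Recall@k, the metric named in the Figure 5 caption. The following is a minimal sketch under stated assumptions: the 2-D vectors and stage labels are toy data, cosine similarity is an assumed choice of distance, and the function names are hypothetical rather than taken from the FlexiCT code release.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def recall_at_k(embeddings: List[Sequence[float]], labels: List[str], k: int) -> float:
    """Fraction of queries whose k nearest neighbours (query excluded) include a same-label item."""
    hits = 0
    for i, query in enumerate(embeddings):
        scored = [
            (cosine(query, other), j)
            for j, other in enumerate(embeddings)
            if j != i
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        top_labels = [labels[j] for _, j in scored[:k]]
        hits += labels[i] in top_labels
    return hits / len(embeddings)


# Toy embeddings: two hypothetical stage groups pointing in different directions.
emb = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9)]
lab = ["T1", "T1", "T3", "T3"]
print(recall_at_k(emb, lab, k=1))
```

No classifier is trained here; the score depends only on how the frozen embeddings are arranged, which is why this kind of retrieval is described as zero-shot.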

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same staged pretraining recipe could be tried on MRI or PET to test whether modality-specific foundations emerge without starting from scratch each time.
If the tumor-stage organization generalizes, the embeddings might support longitudinal tracking of individual patients across multiple scans.
Evaluating the model on private multi-center data with scanner and demographic shifts would directly test whether the public-data pretraining is sufficient for broad deployment.

Load-bearing premise

The 56 public datasets are representative of clinical practice and free of leakage so that the learned representations transfer to new patient populations and scanners.
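
The leakage half of this premise can be probed, at least partially, by intersecting patient identifiers between pretraining collections and benchmark test splits. The sketch below assumes identifiers are available as plain strings and comparable across releases, which is often not true for de-identified public archives; the dataset names, IDs, and function names are all hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple


def normalise(pid: str) -> str:
    """Light normalisation so trivially different spellings of an ID still match."""
    return pid.strip().lower()


def audit_overlap(
    pretrain: Dict[str, Iterable[str]],
    test: Dict[str, Iterable[str]],
) -> Dict[Tuple[str, str], Set[str]]:
    """Return shared IDs for every (pretraining dataset, benchmark) pair that overlaps."""
    pre_sets = {name: {normalise(p) for p in ids} for name, ids in pretrain.items()}
    test_sets = {name: {normalise(t) for t in ids} for name, ids in test.items()}
    report: Dict[Tuple[str, str], Set[str]] = {}
    for pre_name, pre_ids in pre_sets.items():
        for test_name, test_ids in test_sets.items():
            shared = pre_ids & test_ids
            if shared:
                report[(pre_name, test_name)] = shared
    return report


# Hypothetical identifiers, for illustration only.
pretrain = {"dataset_a": ["P001", "P002 "], "dataset_b": ["P100"]}
test = {"benchmark_x": ["p002", "P300"], "benchmark_y": ["P400"]}
report = audit_overlap(pretrain, test)
print(report)
```

An empty report from a check like this would not prove the absence of leakage, since the same patient can carry different identifiers in different releases; it only rules out the overlaps that metadata can see.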

What would settle it

A clear drop in performance relative to task-specific baselines when the model is tested on CT data from a previously unseen hospital network or scanner vendor would show the representations are not yet universal.

Figures

Figures reproduced from arXiv: 2605.21906 by Haoyu Dong, James E. Baciak, Mojtaba Safari, Shansong Wang, Xiaofeng Yang, Yuan Gao, Yuheng Li, Yuxiang Lai.

**Figure 1.** Dataset statistics and three-stage pretraining strategy of FlexiCT. a, Composition of the FlexiCT pretraining dataset. Four donut charts summarise body region (top left; n = 266,227 volumes), geographic distribution (top right; n = 266,227), disease family (bottom left; n = 186,700 volumes with case- or cohort-level labels) and anatomical system (bottom right; same n). b, Frequency of the top 20 clinical c…

**Figure 2.** FlexiCT outperforms foundation models across 3D and 2D segmentation benchmarks. a, Volumetric segmentation Dice coefficient on six abdominal, thoracic and whole-body benchmarks (KiTS23, WORD, MSD Liver, MSD Lung, MSD Pancreas, and AutoPET), comparing nnU-Net, Primus-M, VoCo, CT-FM and FlexiCT-3D (red). b, Slice-level segmentation Dice coefficient on TotalSegmentator (104 anatomical classes partitioned into…

**Figure 3.** FlexiCT-2D enables training-free intra- and cross-modal abdominal registration. a, Per-organ Dice similarity coefficient on the Learn2Reg abdominal CT–CT task across 13 organs (n = 45 registration pairs across 5-fold cross-validation), comparing VoxelMorph, Curia, DINO-Reg and FlexiCT-2D (red). Curia, DINO-Reg and FlexiCT-2D share the same ConvexAdam optimisation framework and differ only in the feature ba…

**Figure 4.** FlexiCT-2D enables label-efficient disease classification from frozen features. a–d, Label-efficiency curves for frozen pretrained encoders trained for: renal tumor subtyping (KiTS; a), universal lesion classification (DeepLesion; b), pulmonary nodule detection (Luna16; c) and COVID-19 identification (Covidx-CT; d). X-axis labels give training-sample counts; dashed lines mark each model’s full-data AUC (n…

**Figure 5.** FlexiCT-3D embeddings organize tumors along clinical severity gradients without staging supervision. a, Zero-shot tumor retrieval (Recall@1, Recall@3) for T-stage (NSCLC-Radiogenomics) and ISUP grade (C4KC-KiTS), comparing CT-FM, VoCo, SPECTRE and FlexiCT-3D. b, Linear probing (AUC, balanced accuracy) on frozen embeddings for T-stage and ISUP grade, including a tumor-diameter-only clinical baseline (grey).…

**Figure 6.** FlexiCT-3D-VLM supports zero-shot disease classification and report retrieval across chest and abdominal CT. a, Zero-shot multi-label disease classification on CT-RATE (left) and Merlin (right), reporting macro-averaged precision, F1, accuracy (ACC) and area under the ROC curve (AUC). Baselines are CT-CLIP, COLIPRI and SPECTRE on CT-RATE; Merlin, COLIPRI and SPECTRE on the Merlin benchmark. b, Semantic rep…
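
Figures 2 and 3 report the Dice coefficient, which measures overlap between a predicted mask and a reference mask as twice the intersection divided by the sum of the two mask sizes. A minimal sketch for flattened binary masks follows; the function name and the empty-mask convention are assumptions made here, not details taken from the paper's evaluation code.

```python
from typing import Sequence


def dice(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice = 2 * |P and T| / (|P| + |T|) for flattened binary (0/1) masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as perfect agreement.
    return 1.0 if total == 0 else 2.0 * intersection / total


pred = [1, 1, 0, 0, 1]
target = [1, 0, 0, 1, 1]
print(dice(pred, target))
```

Multi-class benchmarks usually compute this per class or per organ and then average, so the headline number depends on the averaging scheme as well as on the masks.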

Original abstract

Computed tomography (CT) is central to three-dimensional medical imaging, yet CT-based artificial intelligence remains fragmented across task-specific models for segmentation, classification, registration, and report analysis. Here we present FlexiCT, a family of CT foundation models trained by agglomerative continual pretraining on 266,227 CT volumes from 56 publicly available datasets, forming a large-scale public resource for CT representation learning. FlexiCT uses agglomerative pretraining across three stages: two-dimensional axial pretraining, three-dimensional anatomical pretraining and report-guided semantic alignment. This training strategy supports slice-level, volume-level and vision-language analysis. Across five downstream task families (segmentation, classification, registration, vision-language understanding and clinical retrieval), FlexiCT matches or exceeds prior task-specific approaches on multiple benchmarks. Its embeddings further organize CT scans along gradients associated with various tumor stages, suggesting that CT foundation models can capture imaging features relevant to disease phenotype characterization. Code is available at https://github.com/ricklisz/FlexiCT

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

FlexiCT gives a workable three-stage pretraining recipe on a big public CT collection and releases code, but the generalization claims need tighter checks on dataset overlap before they land.

Hey, the core of this paper is FlexiCT, a CT model pretrained in three stages (2D axial, then 3D anatomical, then report-guided alignment) on 266k volumes from 56 public datasets. It reports matching or beating task-specific baselines on segmentation, classification, registration, vision-language, and retrieval, and the embeddings appear to line up with tumor-stage gradients. The scale and the code release at the GitHub link are the practical wins; anyone who needs a starting CT backbone can actually use it without starting from scratch. The staged pipeline is a clear extension of earlier single-stage foundation work, and the phenotype-organization observation is worth following up if the numbers hold.

The soft spots are straightforward. The abstract gives no concrete metrics, baselines, or statistical details, so the performance edge is hard to judge from the summary alone. The bigger issue is the 56-dataset mix: without explicit confirmation that downstream test splits are free of patient or scan overlap, the results could partly reflect leakage rather than true transfer. That assumption is load-bearing for the universal-representation story. Methods look standard for this area, no obvious circularity or invented steps.

This is for medical-imaging groups that want a reusable CT encoder or are exploring direct phenotype readout from volumes. A reader who needs code and a broad starting point will get immediate value; someone chasing the strongest possible generalization claims will want the full evaluation tables first.

It deserves a serious referee to examine the data splits and downstream protocols. I would send it to review rather than desk-reject.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces FlexiCT, a family of CT foundation models trained via three-stage agglomerative continual pretraining on 266,227 volumes drawn from 56 public datasets. The stages consist of 2D axial pretraining, 3D anatomical pretraining, and report-guided semantic alignment. The resulting embeddings are evaluated across five downstream task families (segmentation, classification, registration, vision-language understanding, and clinical retrieval) and are reported to match or exceed prior task-specific models on multiple benchmarks while also organizing scans along gradients associated with tumor stages.

Significance. If the central claims hold after addressing evaluation details, this would constitute a meaningful contribution to medical imaging by demonstrating that a single set of representations can span anatomy-to-phenotype tasks without task-specific retraining. The scale of the public data collection and the staged pretraining approach are notable strengths, as is the public release of code.

major comments (2)

[§4 and Evaluation protocols] The manuscript does not provide an explicit description or audit of patient/scan overlap checks between the 56 pretraining collections and the downstream test splits used in the five task families. This is load-bearing for the generalization claim in the abstract and §4, because even modest leakage common in public archives could produce the reported transfer performance without demonstrating the claimed universal anatomy-to-phenotype mapping.
[Results section] Quantitative results for the downstream benchmarks (including exact metrics, error bars, statistical tests, and full baseline details) are not reported in sufficient detail to support the claim that FlexiCT 'matches or exceeds' prior approaches. This information is required in the results section to allow assessment of effect sizes and to confirm the central performance claim.

minor comments (2)

[§3] Clarify the precise definition and weighting of the three pretraining stages (e.g., loss functions and data sampling ratios) to improve reproducibility.
[Figures 4-5] Figure captions should explicitly state the number of samples and any exclusion criteria used for the phenotype organization visualizations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have addressed each major comment point by point below, with revisions made to strengthen the presentation of our methods and results.

Referee: [§4 and Evaluation protocols] The manuscript does not provide an explicit description or audit of patient/scan overlap checks between the 56 pretraining collections and the downstream test splits used in the five task families. This is load-bearing for the generalization claim in the abstract and §4, because even modest leakage common in public archives could produce the reported transfer performance without demonstrating the claimed universal anatomy-to-phenotype mapping.

Authors: We agree that explicit verification of patient and scan overlaps is critical to support the generalization claims. The pretraining corpus was assembled exclusively from public datasets, and downstream evaluations followed the official published splits and protocols for each benchmark. However, the initial submission did not include a dedicated overlap audit. We have now performed this analysis using available metadata (patient identifiers, acquisition dates, and institutional tags where present across the public releases). The audit results, including any minimal overlaps detected and mitigation steps, have been added to §4 along with a new supplementary table. This revision directly bolsters the validity of the reported transfer performance.

Revision: yes

Referee: [Results section] Quantitative results for the downstream benchmarks (including exact metrics, error bars, statistical tests, and full baseline details) are not reported in sufficient detail to support the claim that FlexiCT 'matches or exceeds' prior approaches. This information is required in the results section to allow assessment of effect sizes and to confirm the central performance claim.

Authors: We appreciate the need for greater transparency in the quantitative results. The original manuscript summarized key outcomes in the main text while directing readers to supplementary materials for full tables. To address this concern, we have expanded the Results section with comprehensive tables for all five task families. These now report exact metric values, error bars or confidence intervals, statistical significance tests (e.g., paired comparisons against baselines), and complete baseline details with both originally reported and reproduced scores. Updated figures accompany the tables to facilitate assessment of effect sizes.

Revision: yes
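
The exchange above turns on whether "matches or exceeds" survives a paired comparison with uncertainty attached. One common way to do that, offered here only as a sketch because the rebuttal does not say which test was used, is a percentile bootstrap over per-case score differences. The function name, the default settings, and the toy Dice scores are all hypothetical.

```python
import random
from typing import List, Tuple


def paired_bootstrap_ci(
    scores_a: List[float],
    scores_b: List[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Mean per-case difference (a - b) with a percentile bootstrap interval."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired scores must have equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    means = []
    for _ in range(n_resamples):
        # Resample cases with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lower, upper


# Toy per-case Dice scores for two hypothetical models on the same four cases.
model_a = [0.90, 0.85, 0.80, 0.95]
model_b = [0.80, 0.80, 0.75, 0.90]
mean_diff, lower, upper = paired_bootstrap_ci(model_a, model_b)
print(mean_diff, lower, upper)
```

An interval that excludes zero supports "exceeds", while one that straddles zero is better described as "matches". Real benchmarks need far more than four cases for the interval to mean much.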

Circularity Check

0 steps flagged

No circularity: empirical pretraining evaluated on external benchmarks

The paper describes an empirical agglomerative pretraining pipeline on 266,227 CT volumes from 56 public datasets, followed by evaluation across standard downstream task families. No equations, fitted parameters, or derivations are presented that reduce reported performance or embeddings to definitional equivalence with the inputs. The approach relies on external public data and benchmarks rather than self-referential steps, self-citation chains, or ansatzes smuggled via prior work. This is a self-contained experimental result against external validation sets.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the quality and representativeness of the aggregated public datasets plus standard deep-learning transfer assumptions; no new physical entities or ad-hoc constants are introduced.

axioms (1)

domain assumption: The 56 public datasets collectively provide unbiased coverage of anatomical and pathological variation sufficient for learning universal representations.
Invoked implicitly by the scale and diversity claims in the abstract; if violated, transfer performance would degrade.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "FlexiCT uses agglomerative pretraining across three stages: two-dimensional axial pretraining, three-dimensional anatomical pretraining and report-guided semantic alignment."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "We train using a DINOv3 self-supervised framework ... iBOT masked patch prediction loss ... contrastive loss."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

