Universal CT Representations from Anatomy to Disease Phenotype through Agglomerative Pretraining

Haoyu Dong; James E. Baciak; Mojtaba Safari; Shansong Wang; Xiaofeng Yang; Yuan Gao; Yuheng Li; Yuxiang Lai

arxiv: 2605.21906 · v2 · pith:OIVZEORGnew · submitted 2026-05-21 · 💻 cs.CV

Yuheng Li, Yuan Gao, Haoyu Dong, Yuxiang Lai, Shansong Wang, Mojtaba Safari, James E. Baciak, Xiaofeng Yang

Pith reviewed 2026-05-25 06:13 UTC · model grok-4.3

classification 💻 cs.CV

keywords CT foundation models · agglomerative pretraining · medical image analysis · segmentation · classification · disease phenotype · vision-language models

The pith

FlexiCT's three-stage agglomerative pretraining on public CT volumes produces representations that match task-specific models across five task families and organize scans along tumor-stage gradients.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents FlexiCT as a family of CT foundation models built through agglomerative continual pretraining on 266,227 volumes drawn from 56 public datasets. Training proceeds in three stages that begin with two-dimensional axial slices, advance to three-dimensional anatomical structures, and conclude with report-guided semantic alignment. The resulting embeddings support slice-level, volume-level, and vision-language analysis without task-specific retraining. On benchmarks for segmentation, classification, registration, vision-language understanding, and clinical retrieval, FlexiCT matches or exceeds prior specialized approaches. The same embeddings further arrange CT scans along gradients tied to tumor stages, indicating that the representations encode phenotype-relevant information.

Core claim

FlexiCT is obtained by agglomerative pretraining across 2D axial, 3D anatomical, and report-guided stages on a large public CT collection; the resulting representations enable competitive performance on multiple downstream task families at slice, volume, and vision-language levels while also organizing volumes along tumor-stage gradients.

What carries the argument

The three-stage agglomerative pretraining process (2D axial pretraining, 3D anatomical pretraining, report-guided semantic alignment) that builds representations supporting multiple analysis levels from one model.
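The staged recipe can be pictured as a chain in which each stage starts from the weights the previous stage produced. The following is a minimal sketch of that ordering only. `Stage`, `run_agglomerative`, the toy training functions and the parameter-group names are all hypothetical and do not reflect the authors' code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in for model weights: named parameter groups.
Weights = Dict[str, float]


@dataclass
class Stage:
    name: str
    train: Callable[[Weights], Weights]  # receives incoming weights, returns updated ones


def run_agglomerative(stages: List[Stage], init: Weights) -> Dict[str, Weights]:
    """Run stages in order; each one starts from the previous stage's output."""
    checkpoints: Dict[str, Weights] = {}
    weights = dict(init)
    for stage in stages:
        weights = stage.train(dict(weights))  # pass a copy to the stage
        checkpoints[stage.name] = dict(weights)  # store a copy as this stage's checkpoint
    return checkpoints


# Toy "training" functions (assumptions): each adds one parameter group
# and leaves the groups learned earlier in place.
def train_2d_axial(w: Weights) -> Weights:
    w["slice_encoder"] = 1.0
    return w


def train_3d_anatomical(w: Weights) -> Weights:
    w["volume_encoder"] = 1.0
    return w


def train_report_alignment(w: Weights) -> Weights:
    w["text_projection"] = 1.0
    return w


stages = [
    Stage("2d_axial", train_2d_axial),
    Stage("3d_anatomical", train_3d_anatomical),
    Stage("report_alignment", train_report_alignment),
]
checkpoints = run_agglomerative(stages, init={})
print(sorted(checkpoints["report_alignment"]))
```

Keeping one checkpoint per stage is a design choice of this sketch. It loosely mirrors the way the figure captions refer to FlexiCT-2D, FlexiCT-3D and FlexiCT-3D-VLM variants, although the mapping of variants to stages is an assumption here.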

If this is right

A single pretrained model can be applied to segmentation, classification, registration, vision-language understanding, and retrieval without per-task retraining.
CT representations can encode both anatomical structure and disease phenotype features such as tumor stage.
The training strategy enables analysis at slice, volume, and vision-language levels from the same embedding space.
Performance on the tested benchmarks reaches or surpasses that of prior task-specific models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The public dataset collection and staged training could serve as a base for extending the same model to additional CT tasks or modalities without starting from random weights.
Embedding organization by tumor stage opens the possibility of using the representations for unsupervised progression tracking if the gradient structure holds on new data.
The approach implies that continual pretraining across increasing levels of structure (slice to volume to report) may reduce fragmentation in medical imaging AI.

Load-bearing premise

The three-stage agglomerative pretraining on the selected 56 public datasets produces representations that generalize across tasks and capture disease phenotype information without additional task-specific adaptation or explicit controls for dataset biases.

What would settle it

Evaluation of FlexiCT embeddings on an independent CT collection with documented tumor-stage labels to test whether the embeddings form consistent, monotonic gradients with stage progression.
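One way to make "consistent, monotonic gradients" concrete is a rank correlation between tumor stage and a scalar score derived from each scan's embedding. The sketch below implements Spearman's rank correlation in plain Python. The scalar score (for example, a projection of the embedding onto one direction) and the toy values are assumptions for illustration, not results or methods from the paper.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises if either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data (hypothetical): tumor stage per scan and a scalar embedding score.
stages = [1, 1, 2, 2, 3, 3, 4, 4]
scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9, 1.2, 1.5]
rho = spearman(stages, scores)
print(round(rho, 3))
```

A value near +1 or -1 on an independent cohort would support a monotonic stage gradient; a value near 0 would count against it. A single correlation does not rule out confounding by tumor size or scanner, so it is a starting point rather than a full test.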

Figures

Figures reproduced from arXiv: 2605.21906 by Haoyu Dong, James E. Baciak, Mojtaba Safari, Shansong Wang, Xiaofeng Yang, Yuan Gao, Yuheng Li, Yuxiang Lai.

**Figure 1.** Dataset statistics and three-stage pretraining strategy of FlexiCT. a, Composition of the FlexiCT pretraining dataset. Four donut charts summarise body region (top left; n = 266,227 volumes), geographic distribution (top right; n = 266,227), disease family (bottom left; n = 186,700 volumes with case- or cohort-level labels) and anatomical system (bottom right; same n). b, Frequency of the top 20 clinical c…

**Figure 2.** FlexiCT outperforms foundation models across 3D and 2D segmentation benchmarks. a, Volumetric segmentation Dice coefficient on six abdominal, thoracic and whole-body benchmarks (KiTS23, WORD, MSD Liver, MSD Lung, MSD Pancreas, and AutoPET), comparing nnU-Net, Primus-M, VoCo, CT-FM and FlexiCT-3D (red). b, Slice-level segmentation Dice coefficient on TotalSegmentator (104 anatomical classes partitioned into…

**Figure 3.** FlexiCT-2D enables training-free intra- and cross-modal abdominal registration. a, Per-organ Dice similarity coefficient on the Learn2Reg abdominal CT–CT task across 13 organs (n = 45 registration pairs across 5-fold cross-validation), comparing VoxelMorph, Curia, DINO-Reg and FlexiCT-2D (red). Curia, DINO-Reg and FlexiCT-2D share the same ConvexAdam optimisation framework and differ only in the feature ba…

**Figure 4.** FlexiCT-2D enables label-efficient disease classification from frozen features. a–d, Label-efficiency curves for frozen pretrained encoders trained for: renal tumor subtyping (KiTS; a), universal lesion classification (Deep-Lesion; b), pulmonary nodule detection (Luna16; c) and COVID-19 identification (Covidx-CT; d). X-axis labels give training-sample counts; dashed lines mark each model’s full-data AUC (n…

**Figure 5.** FlexiCT-3D embeddings organize tumors along clinical severity gradients without staging supervision. a, Zero-shot tumor retrieval (Recall@1, Recall@3) for T-stage (NSCLC-Radiogenomics) and ISUP grade (C4KC-KiTS), comparing CT-FM, VoCo, SPECTRE and FlexiCT-3D. b, Linear probing (AUC, balanced accuracy) on frozen embeddings for T-stage and ISUP grade, including a tumor-diameter-only clinical baseline (grey).…

**Figure 6.** FlexiCT-3D-VLM supports zero-shot disease classification and report retrieval across chest and abdominal CT. a, Zero-shot multi-label disease classification on CT-RATE (left) and Merlin (right), reporting macro-averaged precision, F1, accuracy (ACC) and area under the ROC curve (AUC). Baselines are CT-CLIP, COLIPRI and SPECTRE on CT-RATE; Merlin, COLIPRI and SPECTRE on the Merlin benchmark. b, Semantic rep…
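Figure 5 reports zero-shot retrieval as Recall@1 and Recall@3. The sketch below shows one common way such a metric is computed on frozen embeddings: leave-one-out nearest-neighbour search under cosine similarity. The paper's exact retrieval protocol is not given in the captions, so the leave-one-out setup, the function names and the toy vectors are assumptions.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def recall_at_k(embeddings: List[Sequence[float]], labels: List[str], k: int) -> float:
    """Fraction of queries with at least one same-label item among their k nearest neighbours.

    Each item is used once as the query; the query itself is excluded from the gallery.
    """
    hits = 0
    for q in range(len(embeddings)):
        gallery = [i for i in range(len(embeddings)) if i != q]
        gallery.sort(key=lambda i: cosine(embeddings[q], embeddings[i]), reverse=True)
        if any(labels[i] == labels[q] for i in gallery[:k]):
            hits += 1
    return hits / len(embeddings)


# Toy embeddings (hypothetical): two scans per stage, clustered by direction.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_labels = ["T1", "T1", "T3", "T3"]
print(recall_at_k(toy_embeddings, toy_labels, k=1))
```

No training step appears anywhere in the function, which is what "zero-shot" means in this setting: the staging labels are used only to score the neighbours, never to fit the embeddings.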

Original abstract

Computed tomography (CT) is central to three-dimensional medical imaging, yet CT-based artificial intelligence remains fragmented across task-specific models for segmentation, classification, registration, and report analysis. Here we present FlexiCT, a family of CT foundation models trained by agglomerative continual pretraining on 266,227 CT volumes from 56 publicly available datasets, forming a large-scale public resource for CT representation learning. FlexiCT uses agglomerative pretraining across three stages: two-dimensional axial pretraining, three-dimensional anatomical pretraining and report-guided semantic alignment. This training strategy supports slice-level, volume-level and vision-language analysis. Across five downstream task families (segmentation, classification, registration, vision-language understanding and clinical retrieval), FlexiCT matches or exceeds prior task-specific approaches on multiple benchmarks. Its embeddings further organize CT scans along gradients associated with various tumor stages, suggesting that CT foundation models can capture imaging features relevant to disease phenotype characterization. Project page and code are available at: https://ricklisz.github.io/flexict.github.io and https://github.com/ricklisz/FlexiCT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

FlexiCT's three-stage pretraining on 266k public CT volumes is a reasonable multi-task recipe, but the abstract gives no metrics or overlap controls so the matching-or-exceeding claims cannot be judged yet.

The main thing here is FlexiCT, a family of CT models trained with agglomerative continual pretraining in three stages on 266,227 volumes from 56 public datasets. The stages move from 2D axial slices to 3D anatomical volumes to report-guided alignment, with the goal of supporting segmentation, classification, registration, vision-language, and retrieval in one set of embeddings that also sort by tumor stage.

Referee Report

2 major / 1 minor

Summary. The paper introduces FlexiCT, a family of CT foundation models trained via three-stage agglomerative continual pretraining (2D axial, 3D anatomical, report-guided semantic alignment) on 266,227 volumes from 56 public datasets. It claims support for slice-, volume-, and vision-language tasks and states that FlexiCT matches or exceeds prior task-specific models across segmentation, classification, registration, vision-language understanding, and clinical retrieval benchmarks while organizing embeddings along tumor-stage gradients.

Significance. If the reported performance is shown to arise from the agglomerative strategy rather than data leakage or scale alone, the work would provide a substantial public resource for unifying fragmented CT AI tasks and for phenotype-aware representations. The use of only public data and the release of code are strengths that would facilitate community follow-up.

major comments (2)

[Abstract] The central claim that FlexiCT 'matches or exceeds prior task-specific approaches on multiple benchmarks' is presented without any quantitative metrics, error bars, baseline tables, dataset splits, or exclusion criteria, rendering the headline result unverifiable.
[Experimental evaluation] Downstream task families: no decontamination protocol or overlap audit between the 56 pretraining datasets and the five evaluation families is described, nor is an ablation isolating the three agglomerative stages from simple data scale. This directly undermines attribution of any observed parity or superiority to the claimed method.

minor comments (1)

The manuscript would benefit from an explicit table listing the 56 datasets with their sizes, task types, and any known acquisition characteristics to allow readers to assess potential biases.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We agree that the abstract and experimental sections require strengthening for verifiability and to better isolate the contribution of the agglomerative pretraining strategy. We outline point-by-point revisions below.

Referee: [Abstract] The central claim that FlexiCT 'matches or exceeds prior task-specific approaches on multiple benchmarks' is presented without any quantitative metrics, error bars, baseline tables, dataset splits, or exclusion criteria, rendering the headline result unverifiable.

Authors: We agree that the abstract would be strengthened by including key quantitative results. In the revised manuscript we will add specific metrics (e.g., mean Dice scores and standard deviations for segmentation tasks, accuracy/F1 for classification, and retrieval mAP) together with brief statements on dataset splits and exclusion criteria. These numbers will be drawn directly from the results tables already present in the main text.
revision: yes

Referee: [Experimental evaluation] Downstream task families: no decontamination protocol or overlap audit between the 56 pretraining datasets and the five evaluation families is described, nor is an ablation isolating the three agglomerative stages from simple data scale. This directly undermines attribution of any observed parity or superiority to the claimed method.

Authors: We acknowledge this limitation. The revised manuscript will add: (1) an explicit decontamination protocol listing all evaluation datasets and confirming zero patient-level overlap with the 56 pretraining collections, (2) a supplementary table auditing dataset overlap, and (3) an ablation that compares the full three-stage agglomerative model against a single-stage baseline trained on the same total volume count. These additions will allow readers to attribute performance differences to the training strategy rather than scale or leakage alone.
revision: yes
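The patient-level overlap audit promised in the rebuttal can be sketched as a set intersection over patient identifiers. The example below is illustrative only: the dataset names, the patient IDs and the `audit_overlap` helper are hypothetical, and it assumes identifiers are directly comparable across collections.

```python
from typing import Dict, Iterable, Set, Tuple


def audit_overlap(
    pretrain: Dict[str, Iterable[str]],
    evaluation: Dict[str, Iterable[str]],
) -> Dict[Tuple[str, str], Set[str]]:
    """Return shared patient IDs for every (pretraining, evaluation) dataset pair that overlaps."""
    report: Dict[Tuple[str, str], Set[str]] = {}
    for p_name, p_ids in pretrain.items():
        p_set = set(p_ids)
        for e_name, e_ids in evaluation.items():
            shared = p_set & set(e_ids)
            if shared:  # only record pairs that actually share patients
                report[(p_name, e_name)] = shared
    return report


# Hypothetical collections; real audits would load IDs from dataset manifests.
pretrain_sets = {
    "pretrain_A": ["p001", "p002", "p003"],
    "pretrain_B": ["p010", "p011"],
}
evaluation_sets = {
    "eval_X": ["p003", "p020"],
    "eval_Y": ["p030"],
}
report = audit_overlap(pretrain_sets, evaluation_sets)
print(report)
```

An empty report would correspond to the "zero patient-level overlap" statement. In practice public datasets are often redistributed under new identifiers, so an ID-based check like this one is likely to need a complementary content-based comparison.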

Circularity Check

0 steps flagged

No significant circularity; all claims are empirical training and evaluation results.

The paper describes an empirical ML study: agglomerative pretraining of FlexiCT on 266,227 volumes from 56 public datasets, followed by evaluation on downstream tasks. No equations, derivations, or mathematical claims appear in the abstract or description. There are no self-definitional steps, no fitted inputs renamed as predictions, and no load-bearing self-citations that reduce results to inputs by construction. The central claims (matching task-specific models, embeddings organizing by tumor stage) are presented as outcomes of training and benchmarking, not as reductions of prior results. This matches the default expectation for non-circular empirical papers, so a score of 0 is appropriate.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review prevents enumeration of specific free parameters, axioms, or invented entities; typical foundation-model training involves many unstated hyperparameters and standard assumptions about data distribution that cannot be audited here.

pith-pipeline@v0.9.0 · 5752 in / 1105 out tokens · 45326 ms · 2026-05-25T06:13:56.751096+00:00 · methodology

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · 9 internal anchors

[1] Manuel Jorge Cardoso, Wenqi Li, Richard Brown, Nic Ma, Eric Kerfoot, Yiheng Wang, Benjamin Murrey, Andriy Myronenko, Can Zhao, Dong Yang, V. Nath, Yufan He, Ziyue Xu, Ali Hatamizadeh, Wenjie Zhu, Yun Liu, Mingxin Zheng, Yucheng Tang, Isaac Yang, Michael Zephyr, Behrooz Hashemian, Sachidanand Alle, Mohammad Zalbagi Darestani, Charles. Budd, Marc Modat, To...

[2] Cris Claessens, Christiaan Viviers, Giacomo D’Amicantonio, Egor Bondarev, and Fons van der Sommen. Scaling self-supervised and cross-modal pretraining for volumetric CT transformers. arXiv preprint arXiv:2511.17209.

[3] Corentin Dancette, Julien Khlaut, Antoine Saporta, Helene Philippe, Elodie Ferreres, Baptiste Callard, Théo Danielou, Léo Alberge, Léo Machado, Daniel Tordjman, et al. Curia: A multi-modal foundation model for radiology. arXiv preprint arXiv:2509.06830.

[4] Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. arXiv preprint arXiv:2309.16588.

[5] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.

[6] Nicholas Heller, Fabian Isensee, Resha Tejpau, Andrew Wood, Nikolaos Papanikolopoulos, and Christopher Weight. 2023 kidney and kidney tumor segmentation challenge. International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2023).

[7] Fabian Isensee, Paul F Jaeger, Simon AA Kohl, Jens Petersen, and Klaus H Maier-Hein. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature Methods, 18(2):203–211.

[8] Jun Ma, Yao Zhang, Song Gu, Cheng Ge, Ershuai Wang, Qin Zhou, Ziyan Huang, Pengju Lyu, Jian He, and Bo Wang. Automatic organ and pan-cancer segmentation in abdomen CT: the FLARE 2023 challenge. arXiv preprint arXiv:2408.12534.

[9] Kevis-Kokitsi Maninis, Kaifeng Chen, Soham Ghosh, Arjun Karpur, Koert Chen, Ye Xia, Bingyi Cao, Daniel Salz, Guangxing Han, Jan Dlabal, et al. TIPS: Text-image pretraining with spatial awareness. arXiv preprint arXiv:2410.16512.

[10] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193.

[11] Suraj Pai, Ibrahim Hadzic, Dennis Bontempi, Keno Bressem, Benjamin H Kann, Andriy Fedorov, Raymond H Mak, and Hugo JWL Aerts. Vision foundation models for computed tomography. arXiv preprint arXiv:2501.09001.

[12] Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. DINOv3. arXiv preprint arXiv:2508.10104.

[13] Rebecca Smith-Bindman, Marilyn L Kwan, Emily C Marlow, Mary Kay Theis, Wesley Bolch, Stephanie Y Cheng, Erin JA Bowles, James R Duncan, Robert T Greenlee, Lawrence H Kushi, et al. Trends in use of medical imaging in US health care systems and in Ontario, Canada, 2000-2016. JAMA, 322(9):843–856.

[14] Tassilo Wald, Ibrahim Ethem Hamamci, Yuan Gao, Sam Bond-Taylor, Harshita Sharma, Maximilian Ilse, Cynthia Lo, Olesya Melnichenko, Anton Schwaighofer, Noel CF Codella, et al. Comprehensive language-image pre-training for 3D medical image understanding. arXiv preprint arXiv:2510.15042, 2025a. Tassilo Wald, Saikat Roy, Fabian Isensee, Constantin Ulrich, Seb...

[15] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388.

[16] Sheng Zhang, Yanbo Xu, Naoto Usuyama, Hanwen Xu, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, et al. BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. arXiv preprint arXiv:2303.00915.

[17] Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. Qwen3 Embedding: Advancing text embedding and reranking through foundation models. ArXiv, abs/2506.05176.

[18] Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. iBOT: Image BERT pre-training with online tokenizer. arXiv preprint arXiv:2111.07832.
