Benchmarking Foundation Models for Renal Lesion Stratification in CT
Pith reviewed 2026-05-11 03:18 UTC · model grok-4.3
The pith
Medical foundation models match a from-scratch 3D ResNet-50 but fall short of radiomics when classifying six types of renal lesions on CT.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors report that embeddings from generalist medical foundation models, extracted with a frozen feature-probing protocol, reach AUCs of 0.70-0.77 on the six-class renal lesion stratification task. That matches a 3D ResNet-50 trained from scratch (AUC 0.72) while needing only seconds of CPU time after feature extraction, but it falls significantly below a conventional radiomics baseline (AUC 0.88) on the external test set of 234 lesions.
What carries the argument
The frozen feature-probing protocol that extracts fixed embeddings from pre-trained medical foundation models and feeds them to a simple classifier for the renal lesion task.
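The general shape of frozen feature probing can be sketched in a few lines. This is a toy illustration under loud assumptions, not the paper's pipeline: `frozen_encoder` is a hypothetical stand-in for a pre-trained model, the "lesions" are synthetic lists of numbers, and the nearest-centroid probe is chosen here only because it needs no dependencies (the page says only that a simple classifier is used).

```python
import math
import random
from collections import defaultdict


def frozen_encoder(volume):
    """Hypothetical stand-in for a pre-trained foundation model.

    Nothing in this function is ever trained or updated, which is the
    defining property of frozen probing. A real encoder would map a CT
    crop to a high-dimensional embedding; here it returns two statistics.
    """
    mean = sum(volume) / len(volume)
    var = sum((v - mean) ** 2 for v in volume) / len(volume)
    return [mean, math.sqrt(var)]


def fit_centroids(embeddings, labels):
    """Fit the probe: one mean embedding per class (nearest-centroid)."""
    sums, counts = {}, defaultdict(int)
    for emb, lab in zip(embeddings, labels):
        if lab not in sums:
            sums[lab] = [0.0] * len(emb)
        sums[lab] = [s + e for s, e in zip(sums[lab], emb)]
        counts[lab] += 1
    return {lab: [s / counts[lab] for s in vec] for lab, vec in sums.items()}


def predict(centroids, emb):
    """Assign the class whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda lab: math.dist(centroids[lab], emb))


def make_toy_lesion(rng, label):
    """Synthetic data: the two class centres are invented for illustration."""
    center = {"cyst": 10.0, "ccRCC": 40.0}[label]
    return [rng.gauss(center, 5.0) for _ in range(64)]


rng = random.Random(0)
train_labels = ["cyst", "ccRCC"] * 20
test_labels = ["cyst", "ccRCC"] * 10

# Step 1: run the frozen encoder once over every lesion (the costly step).
train_emb = [frozen_encoder(make_toy_lesion(rng, lab)) for lab in train_labels]
test_emb = [frozen_encoder(make_toy_lesion(rng, lab)) for lab in test_labels]

# Step 2: fit and apply a cheap probe on the stored embeddings only.
centroids = fit_centroids(train_emb, train_labels)
preds = [predict(centroids, emb) for emb in test_emb]
accuracy = sum(p == t for p, t in zip(preds, test_labels)) / len(test_labels)
print(f"toy probe accuracy: {accuracy:.2f}")
```

The split into two steps is what makes the reported efficiency plausible: once embeddings are cached, only the small probe is fitted, so that stage can run on a CPU.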
If this is right
- Foundation model embeddings can serve as a low-compute alternative to training networks from scratch in data-scarce medical classification settings.
- Radiomics retains superiority for tasks that hinge on fine-grained texture and shape heterogeneity in histological subtype discrimination.
- Current generalist medical foundation models require further adaptation or richer pre-training data to close the gap with established feature-based methods on this task.
- The efficiency gains of foundation models come at the cost of accuracy relative to radiomics in the current benchmark.
Where Pith is reading between the lines
- Extending the same benchmark protocol to other organs or modalities could reveal whether the performance gap is specific to renal CT texture or more general.
- Combining handcrafted radiomics features with foundation model embeddings might produce hybrid classifiers that exceed either alone.
- If future foundation models incorporate more CT-specific texture data during pre-training, their transfer performance on similar scarce-data tasks could improve without full retraining.
Load-bearing premise
The external test set of 234 lesions and the frozen probing protocol give an unbiased, generalizable measure of each model's capability without selection biases or distribution shifts.
What would settle it
A new foundation model, evaluated with the same frozen probing protocol on the identical external set of 234 lesions, producing an AUC significantly above 0.88.
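One way to make "significantly above" concrete is a paired bootstrap over the same test lesions. The sketch below is an assumption-laden illustration: it uses a percentile bootstrap rather than whichever test the paper applied, it treats a single binary (one-vs-rest) problem instead of the full six-class task, and all scores are synthetic. The names `binary_auc` and `paired_bootstrap_auc_diff` are hypothetical helpers.

```python
import random


def binary_auc(scores, positives):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for s, y in zip(scores, positives) if y]
    neg = [s for s, y in zip(scores, positives) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def paired_bootstrap_auc_diff(scores_a, scores_b, positives,
                              n_boot=400, seed=0):
    """Approximate 95% percentile interval for AUC(a) - AUC(b).

    Both models are scored on the same resampled lesions in each
    replicate, which is what makes the comparison paired.
    """
    rng = random.Random(seed)
    n = len(positives)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [positives[i] for i in idx]
        if all(y) or not any(y):
            continue  # AUC is undefined without both classes present
        auc_a = binary_auc([scores_a[i] for i in idx], y)
        auc_b = binary_auc([scores_b[i] for i in idx], y)
        diffs.append(auc_a - auc_b)
    diffs.sort()
    low = diffs[int(0.025 * n_boot)]
    high = diffs[int(0.975 * n_boot) - 1]
    return low, high


# Synthetic example: one model with real signal, one scoring at random.
rng = random.Random(1)
positives = [True] * 36 + [False] * 84
strong = [rng.gauss(2.5 if y else 0.0, 1.0) for y in positives]
weak = [rng.gauss(0.0, 1.0) for _ in positives]

ci_low, ci_high = paired_bootstrap_auc_diff(strong, weak, positives)
print(f"95% interval for AUC difference: [{ci_low:.3f}, {ci_high:.3f}]")
```

On this reading, a new model would settle the question only if the interval for its AUC minus the radiomics AUC, computed on the identical 234 lesions, lies entirely above zero.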
Original abstract
The rapid proliferation of open-source medical foundation models (FMs) raises a practical question: how well do their pre-trained representations transfer to clinically relevant but data-scarce classification tasks? Particularly in CT-based renal lesion classification, a push toward greater generalizability would be meaningful, as the field is constrained by inherently limited training data. We addressed this through a benchmark of three medical FMs on this specific task. This six-class problem spans common entities like cysts and clear cell renal cell carcinoma, alongside rare subtypes. Using a frozen feature-probing protocol, we compared FM embeddings against a handcrafted radiomics classifier and a 3D ResNet-50 trained from scratch. Models were trained on a composite dataset of 2,854 lesions and evaluated on an external test set of 234 lesions from The Cancer Imaging Archive. Our results reveal two key findings. First, FM performance (AUC 0.70-0.77) matched the from-scratch ResNet (AUC 0.72) while drastically reducing hardware demand, requiring only seconds on a CPU after feature extraction. However, the conventional radiomics baseline significantly outperformed all deep learning approaches, achieving an AUC of 0.88 (all p ≤ 0.002). This suggests that current generalist FM embeddings do not yet capture the fine-grained texture and shape heterogeneity driving histological subtype discrimination. Despite their potential in data-scarce settings, medical FMs did not surpass established models for renal lesion stratification, leaving radiomics as the current state-of-the-art.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper benchmarks three medical foundation models using a frozen feature-probing protocol on a six-class CT-based renal lesion classification task (cysts, ccRCC, and rare subtypes). Models are trained on a composite set of 2,854 lesions and evaluated on an external TCIA test set of 234 lesions, with comparisons to a 3D ResNet-50 trained from scratch and a handcrafted radiomics baseline. Results show FM AUCs of 0.70-0.77 (matching ResNet at 0.72) but significantly lower than radiomics at 0.88 (p ≤ 0.002), leading to the conclusion that current generalist FM embeddings do not capture the fine-grained texture and shape heterogeneity needed for histological subtype discrimination.
Significance. If the results hold after addressing the noted gaps, this provides a useful empirical benchmark showing that conventional radiomics remains superior for this data-scarce clinical task while FM probing offers efficiency gains (CPU seconds post-extraction). The external test set strengthens generalizability claims, and the direct comparison to both DL and radiomics baselines offers practical guidance for FM adoption in medical imaging.
major comments (2)
- [Abstract] The interpretive claim that FM embeddings 'do not yet capture the fine-grained texture and shape heterogeneity driving histological subtype discrimination' rests on the aggregate AUC gap (0.70-0.77 vs. 0.88). Without per-class AUCs, confusion matrices, or class distribution counts for the 234-lesion external test set, the delta cannot be isolated to performance on rare/difficult subtypes rather than majority classes separable by basic features.
- [Methods/Results] The manuscript reports p-values (all p ≤ 0.002) for the radiomics superiority but does not specify the statistical test (e.g., DeLong for AUC comparison), whether multiple-comparison correction was applied, or provide class-wise breakdowns. These details are load-bearing for validating the central claim and ruling out dataset-specific effects.
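The concern about aggregate AUC can be illustrated with a small worked example. The sketch below computes one-vs-rest AUC per class and an unweighted macro average; the averaging scheme is an assumption (the page does not say how the multi-class AUC was aggregated), the three classes and all scores are invented, and `binary_auc` and `per_class_auc` are hypothetical helper names.

```python
def binary_auc(scores, positives):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for s, y in zip(scores, positives) if y]
    neg = [s for s, y in zip(scores, positives) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def per_class_auc(score_rows, labels, classes):
    """One-vs-rest AUC per class; score_rows[i][k] scores classes[k]."""
    return {
        c: binary_auc([row[k] for row in score_rows],
                      [lab == c for lab in labels])
        for k, c in enumerate(classes)
    }


# Invented toy data: two common classes and one rare class.
classes = ["cyst", "ccRCC", "rare"]
labels = ["cyst"] * 4 + ["ccRCC"] * 4 + ["rare"] * 2
score_rows = (
    [[0.8, 0.1, 0.1]] * 4        # cysts, confidently correct
    + [[0.1, 0.8, 0.1]] * 4      # ccRCC, confidently correct
    + [[0.5, 0.45, 0.05],        # rare lesion ranked below every non-rare one
       [0.1, 0.1, 0.8]]          # rare lesion ranked correctly
)

aucs = per_class_auc(score_rows, labels, classes)
macro_auc = sum(aucs.values()) / len(aucs)
print(aucs, round(macro_auc, 3))
```

In this toy case the two common classes are separated perfectly while the rare class sits at chance, yet the macro average still looks respectable. A per-class breakdown is what exposes that pattern.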
minor comments (2)
- [Abstract] The three specific medical foundation models evaluated are not named; listing them would improve clarity for readers.
- [Methods] The training set size (2,854 lesions) and test set (234 lesions) are given, but explicit reporting of class imbalance ratios in both would aid interpretation of the aggregate metrics.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which has improved the clarity and transparency of our work. We address each major comment below and have revised the manuscript to incorporate the requested details.
- Referee: [Abstract] The interpretive claim that FM embeddings 'do not yet capture the fine-grained texture and shape heterogeneity driving histological subtype discrimination' rests on the aggregate AUC gap (0.70-0.77 vs. 0.88). Without per-class AUCs, confusion matrices, or class distribution counts for the 234-lesion external test set, the delta cannot be isolated to performance on rare/difficult subtypes rather than majority classes separable by basic features.
  Authors: We agree that aggregate AUCs alone limit the ability to attribute the performance gap to specific classes. In the revised manuscript we have added the class distribution counts for the 234-lesion external test set, per-class AUC values for all models, and confusion matrices. These additions allow readers to evaluate whether the observed differences are concentrated on the rarer subtypes.
  Revision: yes
- Referee: [Methods/Results] The manuscript reports p-values (all p ≤ 0.002) for the radiomics superiority but does not specify the statistical test (e.g., DeLong for AUC comparison), whether multiple-comparison correction was applied, or provide class-wise breakdowns. These details are load-bearing for validating the central claim and ruling out dataset-specific effects.
  Authors: We have clarified the statistical procedures in the revised Methods section: p-values for AUC comparisons were obtained with the DeLong test and Bonferroni correction was applied to account for multiple pairwise tests. Class-wise performance breakdowns have also been added to the Results to support transparent evaluation of the central claims.
  Revision: yes
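The Bonferroni step mentioned in the response is simple arithmetic and can be checked by hand. The sketch below assumes four pairwise comparisons (radiomics against three FMs and the ResNet); that count is inferred from the abstract and is not stated on this page. It also treats the reported bound of 0.002 as an uncorrected p-value, which is likewise an assumption, and `bonferroni` is a hypothetical helper name.

```python
def bonferroni(p_values, alpha=0.05):
    """Return Bonferroni-adjusted p-values and per-test significance flags.

    Each raw p-value is multiplied by the number of tests and capped at 1.
    """
    m = len(p_values)
    adjusted = [min(1.0, p * m) for p in p_values]
    significant = [p_adj <= alpha for p_adj in adjusted]
    return adjusted, significant


# Assumed: four comparisons, each at the reported upper bound p = 0.002.
raw_p = [0.002] * 4
adjusted_p, significant = bonferroni(raw_p)
print(adjusted_p, significant)
```

Under these assumptions each adjusted value is 4 × 0.002 = 0.008, still below 0.05, so the radiomics advantage would survive correction. If the reported values were already corrected, no further adjustment would apply.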
Circularity Check
No circularity: direct empirical benchmark on held-out data
The paper performs a standard model comparison by training on a composite dataset of 2,854 lesions and evaluating AUC on an independent external test set of 234 lesions. No equations, derivations, fitted parameters renamed as predictions, or self-citations are present in the provided text. The central claim rests on measured performance deltas (radiomics 0.88 vs. FM/ResNet ~0.70-0.77), which are falsifiable against the external data rather than reducing to the inputs by construction. This matches the default case of a self-contained empirical study.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Passage: the conventional radiomics baseline significantly outperformed all deep learning approaches, achieving an AUC of 0.88... current generalist FM embeddings do not yet capture the fine-grained texture and shape heterogeneity
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean: reality_from_one_distinction (unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Passage: Using a frozen feature-probing protocol... 3D ResNet-50 trained from scratch
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.