arxiv: 2605.07749 · v1 · submitted 2026-05-08 · 💻 cs.CV

Benchmarking Foundation Models for Renal Lesion Stratification in CT

Pith reviewed 2026-05-11 03:18 UTC · model grok-4.3

classification 💻 cs.CV
keywords foundation models · renal lesions · CT imaging · radiomics · medical image classification · transfer learning · benchmarking · lesion stratification

The pith

Medical foundation models match a 3D ResNet-50 trained from scratch, but fall well short of radiomics, when classifying six types of renal lesions on CT.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper benchmarks three medical foundation models on a six-class task of distinguishing renal lesions, such as cysts and clear cell renal cell carcinoma, on CT scans. Using a frozen feature-probing protocol, it compares their embeddings against a handcrafted radiomics classifier and a 3D ResNet-50 trained from scratch. All models are trained on 2,854 lesions and evaluated on an external set of 234 lesions. The foundation models reach AUCs of 0.70-0.77, close to the ResNet at 0.72, yet the radiomics baseline reaches 0.88 and outperforms all deep learning methods. This suggests that while foundation models cut hardware needs dramatically, their pre-trained representations miss the texture and shape details that drive subtype discrimination in this setting.

Core claim

The authors establish that generalist medical foundation model embeddings, extracted via frozen feature probing, achieve AUC values of 0.70-0.77 on the six-class renal lesion stratification task, matching the performance of a 3D ResNet-50 trained from scratch at AUC 0.72 while requiring only seconds of CPU time after feature extraction, yet falling significantly below a conventional radiomics baseline at AUC 0.88 on the external test set of 234 lesions.

What carries the argument

The frozen feature-probing protocol that extracts fixed embeddings from pre-trained medical foundation models and feeds them to a simple classifier for the renal lesion task.
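
A minimal sketch of what frozen feature probing means in practice, under loud assumptions: `frozen_encoder` is a hypothetical stand-in for a pre-trained foundation model (here it just computes two summary statistics), and the nearest-centroid probe is a swapped-in simple classifier, not the one used in the paper. The point is only the division of labour: the encoder is never updated, and only the small probe is fitted.

```python
from math import dist  # Python 3.8+

def frozen_encoder(volume):
    """Hypothetical stand-in for a pre-trained model whose weights stay fixed.
    Maps a lesion crop (here a flat list of intensities) to an embedding."""
    mean = sum(volume) / len(volume)
    spread = max(volume) - min(volume)
    return (mean, spread)

def fit_probe(embeddings, labels):
    """Fit a nearest-centroid probe: one mean embedding per class."""
    sums, counts = {}, {}
    for emb, lab in zip(embeddings, labels):
        acc = sums.setdefault(lab, [0.0] * len(emb))
        for i, value in enumerate(emb):
            acc[i] += value
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: tuple(s / counts[lab] for s in acc) for lab, acc in sums.items()}

def predict(centroids, embedding):
    """Assign the class whose centroid is closest in embedding space."""
    return min(centroids, key=lambda lab: dist(centroids[lab], embedding))

# Toy data (illustrative only): embeddings are extracted once, then reused.
train_volumes = [[0, 1, 0, 1], [1, 0, 1, 1], [9, 10, 8, 9], [10, 9, 10, 11]]
train_labels = ["cyst", "cyst", "ccRCC", "ccRCC"]
train_embeddings = [frozen_encoder(v) for v in train_volumes]

probe = fit_probe(train_embeddings, train_labels)
prediction = predict(probe, frozen_encoder([9, 9, 10, 10]))
print(prediction)
```

Because the embeddings are computed once and cached, fitting the probe is cheap, which is consistent with the reported "seconds on a CPU after feature extraction".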

If this is right

  • Foundation model embeddings can serve as a low-compute alternative to training networks from scratch in data-scarce medical classification settings.
  • Radiomics retains superiority for tasks that hinge on fine-grained texture and shape heterogeneity in histological subtype discrimination.
  • Current generalist medical foundation models require further adaptation or richer pre-training data to close the gap with established feature-based methods on this task.
  • The efficiency gains of foundation models come at the cost of accuracy relative to radiomics in the current benchmark.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extending the same benchmark protocol to other organs or modalities could reveal whether the performance gap is specific to renal CT texture or more general.
  • Combining handcrafted radiomics features with foundation model embeddings might produce hybrid classifiers that exceed either alone.
  • If future foundation models incorporate more CT-specific texture data during pre-training, their transfer performance on similar scarce-data tasks could improve without full retraining.
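
The hybrid idea in the second bullet could be prototyped with simple early fusion. The sketch below is an assumption about how one might do it, not something the paper reports: each feature block is z-scored separately so that neither dominates by scale, then the blocks are concatenated per lesion. The function names and toy values are hypothetical.

```python
from statistics import mean, pstdev

def zscore_columns(rows):
    """Standardise each feature column to zero mean and unit variance."""
    columns = list(zip(*rows))
    # Constant columns have zero spread; fall back to 1.0 to avoid dividing by zero.
    stats = [(mean(col), pstdev(col) or 1.0) for col in columns]
    return [[(v - m) / s for v, (m, s) in zip(row, stats)] for row in rows]

def fuse(radiomics_rows, embedding_rows):
    """Early fusion: scale each block on its own, then concatenate per lesion."""
    if len(radiomics_rows) != len(embedding_rows):
        raise ValueError("both feature sets must describe the same lesions")
    scaled_radiomics = zscore_columns(radiomics_rows)
    scaled_embeddings = zscore_columns(embedding_rows)
    return [r + e for r, e in zip(scaled_radiomics, scaled_embeddings)]

# Toy data (illustrative only): two lesions, 2 radiomics features, 3 embedding dims.
radiomics = [[10.0, 0.2], [30.0, 0.4]]
embeddings = [[1.0, 5.0, 5.0], [3.0, 5.0, 7.0]]
fused = fuse(radiomics, embeddings)
print(len(fused[0]))
```

In a real experiment the scaling statistics would be estimated on the training set only and then applied to the external test set, to avoid leaking test information.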

Load-bearing premise

The external test set of 234 lesions and the frozen probing protocol give an unbiased, generalizable measure of each model's capability without selection biases or distribution shifts.

What would settle it

A new foundation model, evaluated with the same frozen probing protocol on the identical external set of 234 lesions, producing an AUC significantly above 0.88.

Original abstract

The rapid proliferation of open-source medical foundation models (FMs) raises a practical question: how well do their pre-trained representations transfer to clinically relevant but data-scarce classification tasks? Particularly in CT-based renal lesion classification, a push toward greater generalizability would be meaningful, as the field is constrained by inherently limited training data. We addressed this through a benchmark of three medical FMs on this specific task. This six-class problem spans common entities like cysts and clear cell renal cell carcinoma, alongside rare subtypes. Using a frozen feature-probing protocol, we compared FM embeddings against a handcrafted radiomics classifier and a 3D ResNet-50 trained from scratch. Models were trained on a composite dataset of 2,854 lesions and evaluated on an external test set of 234 lesions from The Cancer Imaging Archive. Our results reveal two key findings. First, FM performance (AUC 0.70-0.77) matched the from-scratch ResNet (AUC 0.72) while drastically reducing hardware demand, requiring only seconds on a CPU after feature extraction. However, the conventional radiomics baseline significantly outperformed all deep learning approaches, achieving an AUC of 0.88 (all p ≤ 0.002). This suggests that current generalist FM embeddings do not yet capture the fine-grained texture and shape heterogeneity driving histological subtype discrimination. Despite their potential in data-scarce settings, medical FMs did not surpass established models for renal lesion stratification, leaving radiomics as the current state-of-the-art.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper benchmarks three medical foundation models using a frozen feature-probing protocol on a six-class CT-based renal lesion classification task (cysts, ccRCC, and rare subtypes). Models are trained on a composite set of 2,854 lesions and evaluated on an external TCIA test set of 234 lesions, with comparisons to a 3D ResNet-50 trained from scratch and a handcrafted radiomics baseline. Results show FM AUCs of 0.70-0.77 (matching ResNet at 0.72) but significantly lower than radiomics at 0.88 (p ≤ 0.002), leading to the conclusion that current generalist FM embeddings do not capture the fine-grained texture and shape heterogeneity needed for histological subtype discrimination.

Significance. If the results hold after addressing the noted gaps, this provides a useful empirical benchmark showing that conventional radiomics remains superior for this data-scarce clinical task while FM probing offers efficiency gains (CPU seconds post-extraction). The external test set strengthens generalizability claims, and the direct comparison to both DL and radiomics baselines offers practical guidance for FM adoption in medical imaging.

major comments (2)
  1. [Abstract] The interpretive claim that FM embeddings 'do not yet capture the fine-grained texture and shape heterogeneity driving histological subtype discrimination' rests on the aggregate AUC gap (0.70-0.77 vs. 0.88). Without per-class AUCs, confusion matrices, or class distribution counts for the 234-lesion external test set, the delta cannot be isolated to performance on rare/difficult subtypes rather than majority classes separable by basic features.
  2. [Methods/Results] The manuscript reports p-values (all p ≤ 0.002) for the radiomics superiority but does not specify the statistical test (e.g., DeLong for AUC comparison), whether multiple-comparison correction was applied, or provide class-wise breakdowns. These details are load-bearing for validating the central claim and ruling out dataset-specific effects.
minor comments (2)
  1. [Abstract] The three specific medical foundation models evaluated are not named; listing them would improve clarity for readers.
  2. [Methods] The training set size (2,854 lesions) and test set (234 lesions) are given, but explicit reporting of class imbalance ratios in both would aid interpretation of the aggregate metrics.
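
The per-class breakdown requested in major comment 1 could be computed along the following lines. This is a sketch under stated assumptions: one-vs-rest AUC per class is one common choice for multi-class problems, but the paper's actual averaging scheme is not stated in the text above. The label `rare_subtype` and all scores are made-up toy values.

```python
def binary_auc(scores, is_positive):
    """AUC as the chance that a random positive outscores a random negative,
    counting ties as one half (the Mann-Whitney formulation)."""
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    if not pos or not neg:
        return None  # undefined when either group is empty
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def per_class_auc(probabilities, labels, classes):
    """One-vs-rest AUC and case count for every class."""
    report = {}
    for k, cls in enumerate(classes):
        scores = [row[k] for row in probabilities]
        is_pos = [lab == cls for lab in labels]
        report[cls] = {"n": sum(is_pos), "auc": binary_auc(scores, is_pos)}
    return report

def macro_auc(report):
    """Unweighted mean of the defined per-class AUCs."""
    values = [r["auc"] for r in report.values() if r["auc"] is not None]
    return sum(values) / len(values)

# Toy predictions (illustrative only): rows are per-lesion class probabilities.
classes = ["cyst", "ccRCC", "rare_subtype"]
labels = ["cyst", "cyst", "cyst", "ccRCC", "ccRCC", "rare_subtype"]
probabilities = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.2, 0.7, 0.1],
    [0.3, 0.4, 0.3],
    [0.3, 0.5, 0.2],
]
report = per_class_auc(probabilities, labels, classes)
for cls, row in report.items():
    print(cls, row["n"], row["auc"])
print("macro", macro_auc(report))
```

In the toy data the majority class is separated perfectly while the other two are not, which is exactly the pattern an aggregate AUC alone would hide.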

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which has improved the clarity and transparency of our work. We address each major comment below and have revised the manuscript to incorporate the requested details.

Point-by-point responses
  1. Referee: [Abstract] The interpretive claim that FM embeddings 'do not yet capture the fine-grained texture and shape heterogeneity driving histological subtype discrimination' rests on the aggregate AUC gap (0.70-0.77 vs. 0.88). Without per-class AUCs, confusion matrices, or class distribution counts for the 234-lesion external test set, the delta cannot be isolated to performance on rare/difficult subtypes rather than majority classes separable by basic features.

    Authors: We agree that aggregate AUCs alone limit the ability to attribute the performance gap to specific classes. In the revised manuscript we have added the class distribution counts for the 234-lesion external test set, per-class AUC values for all models, and confusion matrices. These additions allow readers to evaluate whether the observed differences are concentrated on the rarer subtypes.

    revision: yes

  2. Referee: [Methods/Results] The manuscript reports p-values (all p ≤ 0.002) for the radiomics superiority but does not specify the statistical test (e.g., DeLong for AUC comparison), whether multiple-comparison correction was applied, or provide class-wise breakdowns. These details are load-bearing for validating the central claim and ruling out dataset-specific effects.

    Authors: We have clarified the statistical procedures in the revised Methods section: p-values for AUC comparisons were obtained with the DeLong test and Bonferroni correction was applied to account for multiple pairwise tests. Class-wise performance breakdowns have also been added to the Results to support transparent evaluation of the central claims.

    revision: yes
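
For readers unfamiliar with the procedures named in this response, the following is a sketch of a DeLong test for two correlated AUCs plus a Bonferroni adjustment. It covers the binary case only; how the test would be applied to the six-class task (for example one-vs-rest per class) is not specified above, so that part remains an assumption. All scores are toy values and the function names are illustrative.

```python
from math import erfc, sqrt

def placements(pos, neg):
    """DeLong structural components: per-case win rates against the other group."""
    def psi(x, y):
        return 1.0 if x > y else 0.5 if x == y else 0.0
    v10 = [sum(psi(x, y) for y in neg) / len(neg) for x in pos]
    v01 = [sum(psi(x, y) for x in pos) / len(pos) for y in neg]
    return v10, v01

def cov(a, b):
    """Unbiased sample covariance of two equally long sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)

def delong_test(scores_a, scores_b, is_positive):
    """Two-sided DeLong test for two AUCs measured on the same cases.
    Returns (auc_a, auc_b, p_value); p_value is None if the variance is zero."""
    def split(scores):
        pos = [s for s, p in zip(scores, is_positive) if p]
        neg = [s for s, p in zip(scores, is_positive) if not p]
        return pos, neg
    v10a, v01a = placements(*split(scores_a))
    v10b, v01b = placements(*split(scores_b))
    m, n = len(v10a), len(v01a)  # number of positives and negatives
    auc_a, auc_b = sum(v10a) / m, sum(v10b) / m
    var = ((cov(v10a, v10a) + cov(v10b, v10b) - 2 * cov(v10a, v10b)) / m
           + (cov(v01a, v01a) + cov(v01b, v01b) - 2 * cov(v01a, v01b)) / n)
    if var <= 0:
        return auc_a, auc_b, None
    z = (auc_a - auc_b) / sqrt(var)
    return auc_a, auc_b, erfc(abs(z) / sqrt(2))  # two-sided normal p-value

def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    return [min(1.0, p * len(p_values)) for p in p_values]

# Toy scores (illustrative only): the same 8 cases scored by two models.
is_positive = [1, 1, 1, 1, 0, 0, 0, 0]
scores_a = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
scores_b = [0.9, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6, 0.1]
auc_a, auc_b, p_value = delong_test(scores_a, scores_b, is_positive)
adjusted = bonferroni([p_value, 0.001, 0.5])
print(auc_a, auc_b, p_value, adjusted)
```

The test needs at least two positives and two negatives to estimate the covariances, and it accounts for both models being scored on the same lesions, which an unpaired comparison would ignore.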

Circularity Check

0 steps flagged

No circularity: direct empirical benchmark on held-out data

The paper performs a standard model comparison by training on a composite dataset of 2,854 lesions and evaluating AUC on an independent external test set of 234 lesions. No equations, derivations, fitted parameters renamed as predictions, or self-citations are present in the provided text. The central claim rests on measured performance deltas (radiomics 0.88 vs. FM/ResNet ~0.70-0.77), which are falsifiable against the external data rather than reducing to the inputs by construction. This matches the default case of a self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This empirical benchmark paper introduces no mathematical axioms, invented physical entities, or new conserved quantities. The only implicit assumptions are standard transfer-learning premises that pre-trained embeddings are useful without fine-tuning and that the chosen radiomics features are representative of clinical texture information.
