Validation of Whole-Slide Foundation Models for Image Retrieval in TCGA Data

H.R. Tizhoosh; Judy C. Boughey; Krishna R. Kalari; Matthew P. Goetz; Parsa Esmaeilkhani; Saghir Alfasly; Tianhao Lei; Wataru Uegami

arxiv: 2605.00902 · v1 · submitted 2026-04-28 · 💻 cs.CV · cs.IR

Tianhao Lei, Parsa Esmaeilkhani, Saghir Alfasly, Wataru Uegami, Judy C. Boughey, Matthew P. Goetz, Krishna R. Kalari, H.R. Tizhoosh

Pith reviewed 2026-05-09 20:02 UTC · model grok-4.3

classification 💻 cs.CV cs.IR

keywords whole-slide image retrieval · foundation models · histopathology · image retrieval · multiple instance learning · patch-based methods · TCGA dataset · benchmark evaluation

The pith

Patch-level features drive whole-slide image retrieval performance more than slide-level aggregation on TCGA data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates ten retrieval pipelines on 9,387 diagnostic slides spanning 17 organs and 60 diagnoses. It compares four pre-trained slide foundation models against a supervised attention-based aggregator and patch-level retrieval at five sampling densities, all assessed with leave-one-patient-out evaluation. Accuracy differed more across organs and diagnoses than across the competing architectures, and the best foundation model led only modestly. Patch representations accounted for most of the observed performance, while aggregation added little in many settings. This suggests that complex whole-slide models may not be required for retrieval and that morphology alone faces clear limits for certain diagnoses.

Core claim

Benchmarking on 9,387 TCGA slides showed that the slide foundation model TITAN achieved the highest overall Top-1 and Top-3 accuracy, yet attention-based multiple instance learning (ABMIL) and patch-level retrieval produced comparable scores, with no method dominant across all cases. Performance varied more by organ and diagnosis than by architecture: morphologically distinctive entities approached ceiling accuracy, while rare or closely related subtypes remained difficult. Misclassifications were concentrated in organs known for high inter-observer variability. The best model reached only about 68% ± 21% retrieval accuracy, and some subtypes scored 0% under every pipeline.
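The comparison "more variation by organ than by architecture" can be made concrete with a small calculation. The sketch below is only an illustration: the method names, organ names, and accuracy values are hypothetical placeholders, not numbers from the paper, and the choice of population standard deviation as the spread measure is an assumption.

```python
from statistics import mean, pstdev

# Hypothetical Top-1 accuracies (fractions), NOT values from the paper.
# Outer keys are methods, inner keys are organs.
accuracy = {
    "slide_model": {"organ_a": 0.95, "organ_b": 0.60, "organ_c": 0.40},
    "abmil":       {"organ_a": 0.93, "organ_b": 0.58, "organ_c": 0.41},
    "patch_only":  {"organ_a": 0.94, "organ_b": 0.55, "organ_c": 0.38},
}

def spread_across_organs(table):
    """Std of organ-level means (each organ averaged over all methods)."""
    organs = next(iter(table.values())).keys()
    organ_means = [mean(table[m][o] for m in table) for o in organs]
    return pstdev(organ_means)

def spread_across_methods(table):
    """Std of method-level means (each method averaged over all organs)."""
    method_means = [mean(row.values()) for row in table.values()]
    return pstdev(method_means)

organ_spread = spread_across_organs(accuracy)
method_spread = spread_across_methods(accuracy)
print(f"spread across organs:  {organ_spread:.3f}")
print(f"spread across methods: {method_spread:.3f}")
```

With a table shaped like this one, the organ-level spread is far larger than the method-level spread, which is the pattern the core claim describes.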

What carries the argument

Leave-one-patient-out retrieval evaluation comparing patch embeddings, attention-based multiple instance learning aggregation, and pre-trained slide foundation models on diagnostic whole-slide images.
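A minimal sketch of leave-one-patient-out retrieval scoring follows. It assumes each slide is already reduced to one embedding vector, uses cosine similarity, and counts a Top-k hit when the query's diagnosis appears among the k nearest slides. The record schema, the similarity measure, and this Top-k definition are assumptions for illustration (a majority vote over the top k is another common convention), and the toy vectors are invented.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def lopo_topk_accuracy(slides, k):
    """slides: list of dicts with 'patient', 'label', 'vec' (hypothetical schema)."""
    hits = 0
    scored = 0
    for query in slides:
        # Leave-one-patient-out: drop every slide from the query's patient,
        # so a match can never come from the same person.
        candidates = [s for s in slides if s["patient"] != query["patient"]]
        if not candidates:
            continue
        ranked = sorted(candidates,
                        key=lambda s: cosine(query["vec"], s["vec"]),
                        reverse=True)
        top_labels = [s["label"] for s in ranked[:k]]
        hits += query["label"] in top_labels
        scored += 1
    return hits / scored if scored else 0.0

# Toy data: two diagnoses, four patients, invented 2-D embeddings.
slides = [
    {"patient": "P1", "label": "LUAD", "vec": [1.0, 0.1]},
    {"patient": "P1", "label": "LUAD", "vec": [0.9, 0.2]},
    {"patient": "P2", "label": "LUAD", "vec": [0.8, 0.3]},
    {"patient": "P3", "label": "LUSC", "vec": [0.1, 1.0]},
    {"patient": "P4", "label": "LUSC", "vec": [0.2, 0.9]},
]
top1 = lopo_topk_accuracy(slides, k=1)
top3 = lopo_topk_accuracy(slides, k=3)
print(top1, top3)
```

Filtering by patient rather than by slide is the important design choice: a patient with several slides would otherwise retrieve their own tissue and inflate accuracy.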

If this is right

No single architecture is universally best, so organ-resolved or diagnosis-aware benchmarking is required instead.
Efforts to strengthen patch-level feature representations are likely to yield larger gains than further refinements to slide-level aggregation.
Morphology-only retrieval has an intrinsic performance ceiling for heterogeneous or closely related diagnoses.
Reliable clinical deployment will need multimodal data or ensemble strategies beyond current image-only approaches.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Clinical systems may still require site-specific fine-tuning because TCGA data do not capture the full range of real-world staining and scanner differences.
Combining patch retrieval with limited aggregation only for ambiguous cases could improve efficiency without sacrificing accuracy.
Future benchmarks should include non-TCGA cohorts to test whether the observed limits are dataset-specific or fundamental to morphology-based methods.

Load-bearing premise

That TCGA diagnostic slides and leave-one-patient-out evaluation serve as a representative proxy for clinical whole-slide retrieval without major confounding from staining, scanner, or demographic variations.

What would settle it

Re-running the same ten pipelines on an external multi-center set of whole-slide images acquired with different scanners and staining protocols, then checking whether overall accuracy falls substantially below 68 percent or whether patch-only methods lose their relative standing.

Original abstract

Foundation models are reshaping computational histopathology, yet their value for whole-slide image retrieval relative to strong patch-based and supervised aggregation baselines remains unclear. We benchmarked ten pipelines on 9,387 diagnostic slides spanning 17 organs and 60 diagnoses from The Cancer Genome Atlas (TCGA) using patient-level leave-one-patient-out evaluation. Methods included four pre-trained slide foundation models, a supervised attention-based multiple instance learning (ABMIL) aggregator on patch embeddings, and patch-level retrieval across five sampling densities. Performance varied more across organs and diagnoses than across architectures. Although the slide foundation model TITAN achieved the strongest overall results, its advantage was modest; ABMIL and patch-based methods reached comparable Top-1 and Top-3 accuracy, with no model consistently dominant. Morphologically distinctive entities approached ceiling performance, while rare, heterogeneous, and closely related subtypes remained challenging. Misclassifications aligned with organs exhibiting known inter-observer variability, suggesting an intrinsic ceiling for morphology-only retrieval. Performance was driven primarily by patch-level feature representations, with limited benefit from slide-level aggregation, indicating aggregation may be unnecessary in many settings. These findings argue against a universally optimal architecture and instead support organ-resolved benchmarking, diagnosis-aware or ensemble strategies, stronger feature representations, and multimodal retrieval frameworks. Notably, even the best model achieved only $\approx 68\% \pm 21\%$ retrieval accuracy on TCGA, and some subtypes showed $0\%$ accuracy across all methods, highlighting fundamental limitations of morphology-based representations and the need for substantial progress before reliable clinical deployment.
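To make "slide-level aggregation" concrete, the sketch below contrasts plain mean pooling with attention pooling in the style of attention-based multiple instance learning, where the slide vector is z = sum_k a_k h_k and a_k = softmax_k(w^T tanh(V h_k)). In a trained ABMIL model V and w are learned from labels; here they are fixed by hand, and the patch embeddings are invented, so this is an illustration of the pooling step only, not the paper's configuration.

```python
import math

def mean_pool(patches):
    """Unweighted average of the patch embeddings."""
    dim = len(patches[0])
    return [sum(h[d] for h in patches) / len(patches) for d in range(dim)]

def abmil_pool(patches, V, w):
    """Attention pooling: z = sum_k a_k h_k, a_k = softmax_k(w^T tanh(V h_k))."""
    scores = []
    for h in patches:
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(wi * ti for wi, ti in zip(w, hidden)))
    top = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(patches[0])
    slide_vec = [sum(a * h[d] for a, h in zip(weights, patches))
                 for d in range(dim)]
    return slide_vec, weights

# Invented 2-D patch embeddings and hand-set (not learned) parameters.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
w = [1.0, -1.0]

slide_vec, weights = abmil_pool(patches, V, w)
print("attention weights:", [round(a, 3) for a in weights])
print("attention-pooled: ", [round(x, 3) for x in slide_vec])
print("mean-pooled:      ", [round(x, 3) for x in mean_pool(patches)])
```

Both functions return a single vector per slide that can be fed to the same nearest-neighbour search, which is why aggregation can be swapped in or out without changing the retrieval step.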

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Patch-level retrieval matches slide foundation models on TCGA, with organ-driven variation and a best accuracy of only about 68%, but the work stays within existing benchmarking territory.

Patch-level features carry most of the load for whole-slide retrieval here, and adding slide-level aggregation brings only small gains at best. The authors ran ten pipelines on 9,387 TCGA diagnostic slides across 17 organs and 60 diagnoses under patient-level leave-one-out evaluation. They compared four pre-trained slide foundation models, supervised ABMIL, and patch retrieval at different densities, then broke results down by organ and diagnosis.

TITAN came out slightly ahead overall, but ABMIL and the patch baselines stayed competitive, and performance tracked how distinctive the morphology was more than which architecture was used. Some subtypes hit zero accuracy across every method, which the paper ties to known inter-observer variability rather than model shortcomings. That gives a realistic picture of current limits for morphology-only retrieval.

The scale and the direct head-to-head on public data are the main strengths; the numbers are concrete and the authors do not over-sell a new framework. The soft spots sit in the missing implementation details. The abstract does not spell out the exact patch sampling strategy, whether the foundation models were frozen or adapted, or how the ±21% variance was calculated across organs. TCGA staining and scanner differences are acknowledged as a ceiling factor, but the evaluation stays on diagnostic slides only, so generalizability to other cohorts is not tested. These are standard issues for a benchmark rather than fatal gaps. The central claim about patch dominance holds up internally from the reported pattern.

This paper is for computational pathology groups that need current baseline numbers before designing new retrieval systems or deciding whether aggregation is worth the compute. It is not required reading for everyone, but the empirical scope is large enough that a serious referee should see it. I would send it for peer review to get the full methods and any additional controls on the table.

Referee Report

2 major / 2 minor

Summary. The manuscript benchmarks ten retrieval pipelines on 9,387 TCGA whole-slide images spanning 17 organs and 60 diagnoses using leave-one-patient-out evaluation. The pipelines consist of four slide foundation models, supervised ABMIL on patch embeddings, and patch-level retrieval at five different sampling densities. The primary claims are that performance differences are larger across organs and diagnoses than across methods, that the best slide model (TITAN) has only modest gains over baselines, that patch-level features drive performance with limited additional benefit from slide-level aggregation, and that morphology-only retrieval has an intrinsic accuracy ceiling around 68% with some subtypes at 0% accuracy.
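The abstract does not say how patches are sampled or how two slides are compared at the patch level, so the following is a sketch under stated assumptions only: patches are subsampled uniformly at random at a given density, and the slide-to-slide distance is the median, over query patches, of the distance to the closest database patch. The function names, the density value, and the toy embeddings are hypothetical.

```python
import random
from statistics import median

def sample_patches(patches, density, seed=0):
    """Keep roughly `density` (0 < density <= 1) of the patches, at least one."""
    rng = random.Random(seed)
    n_keep = max(1, round(len(patches) * density))
    return rng.sample(patches, n_keep)

def sq_dist(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def bag_distance(query_bag, db_bag):
    """Median over query patches of the distance to the closest database patch."""
    return median(min(sq_dist(q, p) for p in db_bag) for q in query_bag)

# Invented 2-D patch embeddings for three slides.
slide_a = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
slide_b = [[0.1, 0.0], [0.0, 0.9], [1.0, 0.1], [0.9, 1.0]]   # similar to slide_a
slide_c = [[5.0, 5.0], [6.0, 5.0], [5.0, 6.0], [6.0, 6.0]]   # dissimilar

query = sample_patches(slide_a, density=0.5)  # hypothetical density
d_ab = bag_distance(query, slide_b)
d_ac = bag_distance(query, slide_c)
print(len(query), d_ab, d_ac)
```

Lower sampling densities shrink the query bag and the search cost; how much accuracy is lost at each density is exactly what a multi-density comparison would measure.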

Significance. This empirical study is significant for computational pathology because it demonstrates through direct comparison that current whole-slide foundation models do not substantially outperform simpler patch-based or supervised aggregation methods for retrieval tasks. The large scale, use of public TCGA data, and patient-level evaluation are strengths that enhance reproducibility and generalizability. If the central claim holds, it implies that resources should be directed toward improving underlying patch feature extractors and developing organ-specific or multimodal strategies rather than new universal slide aggregators. The acknowledgment of performance limits due to morphological similarity is a mature and useful contribution.

major comments (2)

[Methods] Methods section: The manuscript does not provide sufficient detail on slide quality control, exclusion criteria for patients or slides, or the specific hyperparameters and training protocol for the ABMIL model and the patch sampling strategies. These details are necessary to evaluate the robustness of the head-to-head comparisons and to confirm that the reported similarities between methods are not affected by implementation choices or post-hoc selection.
[Results] Results section: The claims of 'modest' advantage for TITAN and 'limited benefit' from slide-level aggregation rest on direct comparisons, but without reported confidence intervals, standard errors, or statistical tests for differences between methods (e.g., TITAN vs. ABMIL vs. patch retrieval), it is unclear whether the observed patterns are statistically distinguishable from noise or variation across organs.

minor comments (2)

[Abstract] Abstract: While the breakdown into ten pipelines is logically 4 + 1 + 5, explicitly stating how the five patch densities constitute distinct pipelines would improve immediate clarity for readers.
[Discussion] Discussion: The alignment of misclassifications with organs known for inter-observer variability is noted, but adding one or two specific citations to prior pathology literature on this point would strengthen the interpretation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for minor revision. We have addressed both major comments by expanding the Methods and Results sections with the requested details and analyses to enhance reproducibility and statistical rigor.

Referee: [Methods] Methods section: The manuscript does not provide sufficient detail on slide quality control, exclusion criteria for patients or slides, or the specific hyperparameters and training protocol for the ABMIL model and the patch sampling strategies. These details are necessary to evaluate the robustness of the head-to-head comparisons and to confirm that the reported similarities between methods are not affected by implementation choices or post-hoc selection.

Authors: We agree that additional methodological transparency is warranted. In the revised manuscript we will add a dedicated subsection detailing: (i) TCGA slide quality control steps and exclusion criteria (e.g., image resolution, staining artifacts, and patient-level filters); (ii) complete ABMIL hyperparameters including learning rate, batch size, epochs, attention pooling configuration, and cross-validation protocol; and (iii) precise patch sampling densities, random seed handling, and feature extraction settings for each of the five densities. These additions will allow full replication and confirm that observed method similarities are not artifacts of implementation choices.

revision: yes

Referee: [Results] Results section: The claims of 'modest' advantage for TITAN and 'limited benefit' from slide-level aggregation rest on direct comparisons, but without reported confidence intervals, standard errors, or statistical tests for differences between methods (e.g., TITAN vs. ABMIL vs. patch retrieval), it is unclear whether the observed patterns are statistically distinguishable from noise or variation across organs.

Authors: We acknowledge that formal statistical comparison strengthens the claims. While the reported ±21% reflects organ-level variation and the large cohort (9,387 slides) supports the observed trends, the revised manuscript will include: bootstrap-derived 95% confidence intervals for all Top-1/Top-3 accuracies; standard errors stratified by organ; and paired statistical tests (McNemar's test for accuracy differences and Wilcoxon signed-rank for organ-level comparisons) between TITAN, ABMIL, and patch baselines. These will be presented in updated tables and text to quantify whether differences exceed sampling variability.

revision: yes
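Two of the analyses named in this response can be sketched with the standard library alone: a percentile bootstrap confidence interval for accuracy, and an exact two-sided McNemar test on paired per-slide outcomes. The per-slide hit vectors below are invented, the resampling unit is the slide (resampling by patient would be the stricter choice), and none of the numbers come from the paper.

```python
import random
from math import comb

def bootstrap_ci(correct, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy from a list of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(correct)
    stats = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_boot))
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

def mcnemar_exact(correct_a, correct_b):
    """Exact two-sided McNemar test on paired 0/1 outcomes."""
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = b + c  # only discordant pairs carry information
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Invented paired outcomes for 100 query slides (1 = correct retrieval).
method_a = [1] * 70 + [0] * 30
method_b = [1] * 68 + [0] * 32

lo, hi = bootstrap_ci(method_a)
p_value = mcnemar_exact(method_a, method_b)
print(f"method A accuracy 0.70, 95% CI [{lo:.2f}, {hi:.2f}]")
print(f"McNemar exact p-value: {p_value}")
```

In this toy case the two methods disagree on only two slides, so the test cannot distinguish them, which is the kind of outcome that would support calling an advantage "modest".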

Circularity Check

0 steps flagged

No significant circularity

The paper is a pure empirical benchmarking study that compares ten retrieval pipelines (four slide foundation models, ABMIL, and five patch-level sampling variants) on 9,387 TCGA slides under patient-level leave-one-patient-out (LOPO) evaluation. No mathematical derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the reported results. Central claims rest on direct head-to-head accuracy measurements that vary by organ and diagnosis rather than by architecture; these measurements are externally falsifiable against the public TCGA data and do not reduce to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on standard machine-learning evaluation practices and public TCGA data without additional free parameters, ad-hoc axioms, or invented entities.

axioms (1)

domain assumption: Leave-one-patient-out split prevents data leakage in patient-level evaluation.
Invoked to ensure test slides come from unseen patients.


[1] [1]

Ocampo, Theodore Sakellaropoulos, Navneet Narula, Matija Snuderl, David Feny¨o, Andre L

Coudray, N., Ocampo, P.S., Sakellaropoulos, T., Narula, N., Snuderl, M., Feny¨ o, D., Moreira, A.L., Razavian, N., Tsirigos, A.: Classification and mutation prediction from non–small cell lung cancer histopathology images using deep learning. Nature Medicine24(10), 1559–1567 (2018) https://doi. org/10.1038/s41591-018-0177-5 . Accessed 2026-03-03

work page doi:10.1038/s41591-018-0177-5 2018

[2] [2]

Nature Biomedical Engineering5(6), 555–570 (2021).https://doi.org/ 10.1038/s41551-020-00682-w

Lu, M.Y., Williamson, D.F.K., Chen, T.Y., Chen, R.J., Barbieri, M., Mahmood, F.: Data-efficient and weakly supervised computational pathology on whole-slide images. Nature Biomedical Engineering 5(6), 555–570 (2021) https://doi.org/10.1038/s41551-020-00682-w . Accessed 2026-03-03

work page doi:10.1038/s41551-020-00682-w 2021

[3] [3]

Nature Medicine (2024).https://doi.org/ 10.1038/s41591-024-02857-3

Chen, R.J., Ding, T., Lu, M.Y., Williamson, D.F.K., Jaume, G., Song, A.H., Chen, B., Zhang, A., Shao, D., Shaban, M., Williams, M., Oldenburg, L., Weishaupt, L.L., Wang, J.J., Vaidya, A., Le, L.P., Gerber, G., Sahai, S., Williams, W., Mahmood, F.: Towards a general-purpose foundation model for computa- tional pathology. Nature Medicine30(3), 850–862 (2024...

work page doi:10.1038/s41591-024-02857-3 2024

[4] [4]

Nature Communications16(1), 3640 (2025) https://doi.org/10.1038/s41467-025-58796-1

Campanella, G., Chen, S., Singh, M., Verma, R., Muehlstedt, S., Zeng, J., Stock, A., Croken, M., Veremis, B., Elmas, A., Shujski, I., Neittaanm¨ aki, N., Huang, K.-l., Kwan, R., Houldsworth, J., Schoen- feld, A.J., Vanderbilt, C.: A clinical benchmark of public self-supervised pathology foundation models. Nature Communications16(1), 3640 (2025) https://do...

work page doi:10.1038/s41467-025-58796-1 2025

[5] [5]

Nature Medicine30(10), 2924–2935 (2024) https://doi.org/10.1038/s41591-024-03141-0

Vorontsov, E., Bozkurt, A., Casson, A., Shaikovski, G., Zelechowski, M., Severson, K., Zimmermann, E., Hall, J., Tenenholtz, N., Fusi, N., Yang, E., Mathieu, P., Eck, A., Lee, D., Viret, J., Robert, E., Wang, Y.K., Kunz, J.D., Lee, M.C.H., Bernhard, J.H., Godrich, R.A., Oakley, G., Millar, E., Hanna, M., Wen, H., Retamero, J.A., Moye, W.A., Yousfi, R., Ka...

work page doi:10.1038/s41591-024-03141-0 2024

[6] [6]

Nature Communications16(1), 11406 (2025) https://doi.org/10.1038/s41467-025-66220-x

Xu, Y., Wang, Y., Zhou, F., Ma, J., Jin, C., Yang, S., Li, J., Zhang, Z., Zhao, C., Zhou, H., Li, Z., Lin, H., Wang, X., Wang, J., Han, A., Chan, R.C.K., Liang, L., Zhang, X., Chen, H.: A multimodal knowledge-enhanced whole-slide pathology foundation model. Nature Communications16(1), 11406 (2025) https://doi.org/10.1038/s41467-025-66220-x . Accessed 2026-01-19

work page doi:10.1038/s41467-025-66220-x 2025

[7] [7]

Nature medicine, 1–13 (2025)

Ding, T., Wagner, S.J., Song, A.H., Chen, R.J., Lu, M.Y., Zhang, A., Vaidya, A.J., Jaume, G., Shaban, M., Kim, A., et al.: A multimodal whole-slide foundation model for pathology. Nature medicine, 1–13 (2025)

2025

[8] [8]

Nature Medicine30(3), 863–874 (2024) https://doi.org/10.1038/ s41591-024-02856-4

Lu, M.Y., Chen, B., Williamson, D.F.K., Chen, R.J., Liang, I., Ding, T., Jaume, G., Odintsov, I., Le, L.P., Gerber, G., Parwani, A.V., Zhang, A., Mahmood, F.: A visual-language foundation model for computational pathology. Nature Medicine30(3), 863–874 (2024) https://doi.org/10.1038/ s41591-024-02856-4 . Accessed 2026-03-03

2024

[9]

Wang, X., Yang, S., Zhang, J., Wang, M., Zhang, J., Yang, W., Huang, J., Han, X.: Transformer-based unsupervised contrastive learning for histopathological image classification. Medical Image Analysis 81, 102559 (2022) https://doi.org/10.1016/j.media.2022.102559. Accessed 2026-01-12

[10]

Xu, H., Usuyama, N., Bagga, J., Zhang, S., Rao, R., Naumann, T., Wong, C., Gero, Z., González, J., Gu, Y., Xu, Y., Wei, M., Wang, W., Ma, S., Wei, F., Yang, J., Li, C., Gao, J., Rosemon, J., Bower, T., Lee, S., Weerasinghe, R., Wright, B.J., Robicsek, A., Piening, B., Bifulco, C., Wang, S., Poon, H.: A whole-slide foundation model for digital pathology ...

Nature 630(8015), 181–188 (2024). doi:10.1038/s41586-024-07441-w

[11]

Shaikovski, G., Casson, A., Severson, K., Zimmermann, E., Wang, Y.K., Kunz, J.D., Retamero, J.A., Oakley, G., Klimstra, D., Kanan, C., Hanna, M., Zelechowski, M., Viret, J., Tenenholtz, N., Hall, J., Fusi, N., Yousfi, R., Hamilton, P., Moye, W.A., Vorontsov, E., Liu, S., Fuchs, T.J.: PRISM: A Multi-Modal Generative Foundation Model for Slide-Level Histopa...

doi:10.48550/arxiv.2405.10254 (2024)

[12]

Xiong, C., Chen, H., Sung, J.J.Y.: A Survey of Pathology Foundation Model: Progress and Future Directions. arXiv. arXiv:2504.04045 [cs] (2025). https://doi.org/10.48550/arXiv.2504.04045. http://arxiv.org/abs/2504.04045. Accessed 2026-01-07

[13]

Jaume, G., Vaidya, A., Zhang, A., Song, A.H., Chen, R.J., Sahai, S., Mo, D., Madrigal, E., Le, L.P., Mahmood, F.: Multistain Pretraining for Slide Representation Learning in Pathology. Computer Vision – ECCV 2024 15091, 19–37 (2025) https://doi.org/10.1007/978-3-031-73414-4_2. Series Title: Lecture Notes in Computer Science. Accessed 2026-03-03

[14]

Ilse, M., Tomczak, J.M., Welling, M.: Attention-based Deep Multiple Instance Learning

[15]

Neidlinger, P., El Nahhas, O.S.M., Muti, H.S., Lenz, T., Hoffmeister, M., Brenner, H., Van Treeck, M., Langer, R., Dislich, B., Behrens, H.M., Röcken, C., Foersch, S., Truhn, D., Marra, A., Saldanha, O.L., Kather, J.N.: Benchmarking foundation models as feature extractors for weakly supervised computational pathology. Nature Biomedical Engineering (20...

Nature Biomedical Engineering (2025). https://doi.org/10.1038/s41551-025-01516-3

[16]

Tizhoosh, H.R.: Beyond the Failures: Rethinking Foundation Models in Pathology. arXiv. arXiv:2510.23807 [cs] (2025). https://doi.org/10.48550/arXiv.2510.23807. http://arxiv.org/abs/2510.23807. Accessed 2026-01-06

[17]

Kalra, S., Tizhoosh, H.R., Choi, C., Shah, S., Diamandis, P., Campbell, C.J.V., Pantanowitz, L.: Yottixel – An Image Search Engine for Large Archives of Histopathology Whole Slide Images. Medical Image Analysis 65, 101757 (2020) https://doi.org/10.1016/j.media.2020.101757. Accessed 2026-01-23

[18]

Kalra, S., Tizhoosh, H.R., Shah, S., Choi, C., Damaskinos, S., Safarpoor, A., Shafiei, S., Babaie, M., Diamandis, P., Campbell, C.J.V., Pantanowitz, L.: Pan-cancer diagnostic consensus through searching archival histopathology images using artificial intelligence. npj Digital Medicine 3(1), 31 (2020) https://doi.org/10.1038/s41746-020-0238-2. Accessed 2026-01-27

[19]

Lahr, I., Alfasly, S., Nejat, P., Khan, J., Kottom, L., Kumbhar, V., Alsaafin, A., Shafique, A., Hemati, S., Alabtah, G., Comfere, N., Murphree, D., Mangold, A., Yasir, S., Meroueh, C., Boardman, L., Shah, V.H., Garcia, J.J., Tizhoosh, H.R.: Analysis and Validation of Image Search Engines in Histopathology. IEEE Reviews in Biomedical Engineering 18, 350–36...

IEEE Reviews in Biomedical Engineering 18, 350–367 (2025) https://doi.org/10.1109/RBME.2024

[20]

Alfasly, S., Alabtah, G., Hemati, S., Kalari, K.R., Garcia, J.J., Tizhoosh, H.R.: Validation of histopathology foundation models through whole slide image retrieval. Scientific Reports 15(1), 3990 (2025) https://doi.org/10.1038/s41598-025-88545-9. Accessed 2026-01-14

[21]

Litjens, G., Kooi, T., Bejnordi, B.E., Setio, A.A.A., Ciompi, F., Ghafoorian, M., Van Der Laak, J.A., Van Ginneken, B., Sánchez, C.I.: A survey on deep learning in medical image analysis. Medical Image Analysis 42, 60–88 (2017)

[22]

Campanella, G., Hanna, M.G., Geneslaw, L., Miraflor, A., Werneck Krauss Silva, V., Busam, K.J., Brogi, E., Reuter, V.E., Klimstra, D.S., Fuchs, T.J.: Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nature Medicine 25(8), 1301–1309 (2019)

[23]

Ehteshami Bejnordi, B., Veta, M., Diest, P., Van Ginneken, B., Karssemeijer, N., Litjens, G., Van Der Laak, J.A., consortium, C., Hermsen, M., Manson, Q.F., et al.: Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. JAMA 318(22), 2199–2210 (2017)

[24]

Tizhoosh, H.R., Diamandis, P., Campbell, C.J., Safarpoor, A., Kalra, S., Maleki, D., Riasatian, A., Babaie, M.: Searching images for consensus: can AI remove observer variability in pathology? The American Journal of Pathology 191(10), 1702–1708 (2021)

[25]

Wolf, J.L., Nederveen, F., Blaauwgeers, H., Marx, A., Nicholson, A.G., Roden, A.C., Ströbel, P., Timens, W., Weissferdt, A., Thüsen, J., Bakker, M.A.: Interobserver variation in the classification of thymic lesions including biopsies and resection specimens in an international digital microscopy panel. Histopathology 77(5), 734–741 (2020) https://doi.o...

doi:10.1111/his.14167

[26]

2017.0280

Hernandez-Prera, J.C., Machado, R.A., Asa, S.L., Baloch, Z., Faquin, W.C., Ghossein, R., LiVolsi, V.A., Lloyd, R.V., Mete, O., Nikiforov, Y.E., Seethala, R.R., Suster, S., Thompson, L.D., Turk, A.T., Sadow, P.M., Urken, M.L., Wenig, B.M.: Pathologic reporting of tall cell variant of papillary thyroid cancer: Have we reached a consensus? Thyroid 28(12), 168...

work page doi:10.1089/thy 2018

[27]

Aldape, K., Simmons, M.L., Davis, R.L., Miike, R., Wiencke, J., Barger, G., Lee, M., Chen, P., Wrensch, M.: Discrepancies in diagnoses of neuroepithelial neoplasms: The San Francisco Bay Area Adult Glioma Study. Cancer 88(10), 2342–2349 (2000) https://doi.org/10.1002/(SICI)1097-0142(20000515)88:10⟨2342::AID-CNCR19⟩3.0.CO;2-X

[28]

Weinstein, J.N., Collisson, E.A., Mills, G.B., Shaw, K.R., Ozenberger, B.A., Ellrott, K., Shmulevich, I., Sander, C., Stuart, J.M.: The Cancer Genome Atlas Pan-Cancer analysis project. Nature Genetics 45(10), 1113–1120 (2013)

[29]

Uegami, W.,et al.: Disease subtyping for computational pathology benchmarking with tcga dataset—white paper authors/creators. preprint (2026) https://doi.org/10.5281/zenodo.19736866


[30]

Zhang, A., Jaume, G., Vaidya, A., Ding, T., Mahmood, F.: Accelerating data processing and benchmarking of AI models for pathology. arXiv preprint arXiv:2502.06750 (2025)

[31]

Tizhoosh, H.R., Pantanowitz, L.: On image search in histopathology. Journal of Pathology Informatics 15, 100375 (2024)

[32]

Lahr, I., Alfasly, S., Nejat, P., Khan, J., Kottom, L., Kumbhar, V., Alsaafin, A., Shafique, A., Hemati, S., Alabtah, G., et al.: Analysis and validation of image search engines in histopathology. IEEE Reviews in Biomedical Engineering 18, 350–367 (2024)

[33]

Tizhoosh, H.R., Zhu, S., Lo, H., Chaudhari, V., Mehdi, T.: MinMax Radon Barcodes for Medical Image Retrieval. arXiv. arXiv:1610.00318 [cs] (2016). https://doi.org/10.48550/arXiv.1610.00318. http://arxiv.org/abs/1610.00318. Accessed 2026-03-02

[34]

Abdi, H.: Holm's sequential Bonferroni procedure. Encyclopedia of Research Design 1(8), 1–8 (2010)