arXiv: 2410.06723 · v2 · submitted 2024-10-09 · eess.IV · cs.CV · cs.LG

Evaluating Computational Pathology Foundation Models for Prostate Cancer Grading under Distribution Shifts

Pith reviewed 2026-05-23 19:18 UTC · model grok-4.3

classification: eess.IV · cs.CV · cs.LG
keywords: computational pathology · foundation models · distribution shift · prostate cancer grading · whole-slide images · robustness · domain generalization · weakly supervised learning
The pith

Pathology foundation models for prostate cancer grading lose substantial performance when moved to a new hospital site.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper benchmarks several pathology foundation models as frozen feature extractors inside weakly supervised slide-level graders on the PANDA prostate cancer dataset. These models match or beat a natural-image baseline when training and test slides come from the same collection site. Performance falls sharply, however, on slides from a second site, while the same models remain relatively stable under changes in the distribution of cancer grades. Feature visualizations confirm that site identity separates the representations more strongly than grade identity. The results indicate that visual domain shift, not label shift, is the main barrier to reliable use.

Core claim

Large-scale pretraining produces strong in-distribution representations for prostate cancer grading from whole-slide images, yet these representations do not transfer robustly across collection sites; cross-site visual shifts dominate label-distribution shifts in both performance loss and feature-space separation.

What carries the argument

Frozen patch-level encoders from pathology foundation models inserted into weakly supervised multiple-instance learning models for slide-level grading, together with t-SNE or similar visualization of site versus grade clustering in the resulting embeddings.
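
The attention-based pooling step at the heart of such a slide-level model can be sketched as follows. This is a minimal illustration in the style of attention-based multiple-instance learning (ABMIL), not the paper's implementation: the function name `abmil_pool`, the weight names `V` and `w`, the toy dimensions, and the random weights standing in for learned parameters are all assumptions.

```python
import math
import random


def abmil_pool(patch_features, V, w):
    """Attention-weighted average of patch features (ABMIL-style pooling).

    patch_features: N feature vectors of length D from a frozen encoder.
    V: H rows of length D (hypothetical learned projection).
    w: vector of length H (hypothetical learned attention weights).
    """
    scores = []
    for h in patch_features:
        # Hidden representation tanh(V h), then scalar score w^T tanh(V h).
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(wj * zj for wj, zj in zip(w, hidden)))
    # Softmax over patches, shifted by the max for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    attn = [e / total for e in exps]
    dim = len(patch_features[0])
    slide = [sum(a * h[d] for a, h in zip(attn, patch_features)) for d in range(dim)]
    return slide, attn


# Toy stand-ins: 5 patches with 4-dimensional features, random "learned" weights.
rng = random.Random(0)
N, D, H = 5, 4, 3
feats = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(N)]
V = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(H)]
w = [rng.gauss(0, 1) for _ in range(H)]
slide_embedding, attention = abmil_pool(feats, V, w)
```

Because the encoder is frozen, only the pooling weights and the classifier on top of `slide_embedding` would be trained, which is why the diversity of the downstream training data matters for generalization.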

If this is right

  • All evaluated pathology foundation models exhibit clear accuracy drops under the Radboud-to-Karolinska site transfer.
  • The same models show smaller degradation when only the label distribution over grade groups is shifted.
  • Embeddings from every tested foundation model continue to separate primarily by collection site rather than by cancer grade.
  • Generalization remains limited by the diversity of the data used to train the downstream slide-level predictor.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Methods that explicitly align or adapt representations across sites may be required before these models can be deployed across institutions.
  • Collecting pretraining data from multiple sites and scanners could reduce the observed domain gaps.
  • The same visual-shift problem is likely to appear in other computational pathology tasks that involve different staining batches or scanner vendors.

Load-bearing premise

The Radboud-to-Karolinska split and the weakly supervised slide-level modeling choices in PANDA are representative of the distribution shifts that would appear in real clinical deployment.

What would settle it

Repeating the cross-site evaluation on a third independent collection site that uses similar staining and scanning protocols and finding no large performance drop for any of the tested foundation models.

Figures

Figures reproduced from arXiv: 2410.06723 by Fredrik K. Gustafsson, Mattias Rantalainen.

Figure 1: Performance comparison of UNI, CONCH and Resnet-IN across different PANDA subsets, when utilized as patch-level feature …

Figure 2: Performance comparison of the ISUP grade models …

Figure 3: Top: Detailed performance comparison of UNI, CONCH and Resnet-IN, when utilized as patch-level feature extractors in the ABMIL ISUP grade model. Bottom: Detailed performance comparison of the three ISUP grade models ABMIL, Mean Feature and kNN, when utilizing UNI as the patch-level feature extractor. All results are mean±std over 10 random cross-validation folds.

Figure 4: We study robustness in terms of two common types of distribution shifts: …

Figure 5: Overview of the three evaluated ISUP grade classification models: …
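
The two simpler slide-level models named in the Figure 3 caption, Mean Feature and kNN, can be illustrated with a rough sketch. It assumes that Mean Feature averages the frozen patch features of a slide and that kNN takes a majority vote among the nearest averaged training slides; the value of `k`, the Euclidean distance, the function names, and the toy data are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def mean_feature(patch_features):
    """Average the patch feature vectors of one slide into a single vector."""
    n = len(patch_features)
    return [sum(col) / n for col in zip(*patch_features)]


def knn_predict(query, train_vectors, train_labels, k=3):
    """Majority vote among the k nearest training slides (Euclidean distance)."""
    order = sorted(range(len(train_vectors)),
                   key=lambda i: math.dist(query, train_vectors[i]))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy slides: each slide is a list of 2-dimensional patch features.
train_slides = [
    [[0.0, 0.1], [0.2, 0.0]],
    [[0.1, 0.1], [0.0, 0.3]],
    [[5.0, 5.1], [4.8, 5.0]],
    [[5.2, 4.9], [5.0, 5.0]],
]
train_grades = [0, 0, 5, 5]  # hypothetical ISUP grades

train_vectors = [mean_feature(s) for s in train_slides]
query_vector = mean_feature([[4.9, 5.0], [5.1, 5.2]])
predicted_grade = knn_predict(query_vector, train_vectors, train_grades, k=3)
```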
Original abstract

Pathology foundation models (PFMs) have emerged as powerful pretrained encoders for computational pathology, but their robustness under clinically relevant distribution shifts remains insufficiently understood. We benchmark the robustness of recent PFMs in the setting of prostate cancer grading from whole-slide images (WSIs). Using the PANDA dataset, we evaluate PFMs as frozen patch-level feature extractors within weakly supervised slide-level grading models, and assess robustness to two important forms of distribution shift: shifts in WSI image appearance across collection sites, and shifts in the label distribution over cancer grade groups. Across in-distribution settings, PFMs consistently achieve strong performance and clearly outperform a natural-image baseline. Under cross-site transfer from Radboud to Karolinska, however, performance drops substantially for all models, showing that large-scale pretraining alone does not guarantee robust downstream generalization. In contrast, PFMs are less sensitive to label-distribution shift, indicating that visually grounded domain shift is the dominant challenge. Representation analysis further supports these findings by revealing persistent domain separation between sites across all PFMs. While grade-related structure is present, it is comparatively weak, indicating that domain-related variation dominates in the learned feature space. Together, these results provide a comprehensive benchmark of PFMs under distribution shift and highlight an important practical message: although PFMs provide strong representations, generalizability remains constrained by the quality and diversity of the data used to train downstream prediction models.
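
One way to probe label-distribution shift in isolation is to resample slides from a single site so that the grade distribution changes while image appearance does not. The sketch below draws a grade-balanced subset from a skewed label list. It is an illustration under assumptions: the function name `subsample_uniform`, the class counts, and the choice of a uniform target are hypothetical, and the abstract does not specify how the paper constructs its label-shifted subsets.

```python
import random
from collections import Counter, defaultdict


def subsample_uniform(labels, per_class, seed=0):
    """Return sorted slide indices with exactly `per_class` slides per grade."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    chosen = []
    for y in sorted(by_class):
        if len(by_class[y]) < per_class:
            raise ValueError(f"grade {y} has fewer than {per_class} slides")
        chosen.extend(rng.sample(by_class[y], per_class))
    return sorted(chosen)


# Hypothetical skewed distribution over six ISUP grade labels (0 to 5).
labels = [0] * 30 + [1] * 25 + [2] * 10 + [3] * 10 + [4] * 8 + [5] * 7
subset = subsample_uniform(labels, per_class=5)
subset_counts = Counter(labels[i] for i in subset)
```

Training on the original list and testing on such a subset (or the reverse) changes only the label proportions, which separates this effect from the cross-site appearance shift.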

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript benchmarks pathology foundation models (PFMs) as frozen patch-level encoders within weakly supervised slide-level models for prostate cancer grading on the PANDA dataset. It reports strong in-distribution performance that outperforms a natural-image baseline, substantial degradation under cross-site shift (Radboud to Karolinska), comparatively smaller effects from label-distribution shift, and representation analysis showing persistent domain separation that dominates grade-related structure in the feature space. The central claim is that large-scale pretraining alone does not guarantee robust generalization and that visually grounded domain shift is the dominant practical challenge.

Significance. If the empirical patterns hold, the work supplies a useful benchmark demonstrating concrete limits of current PFMs under site-level appearance shifts, along with a practical takeaway: the diversity of downstream training data matters more than pretraining scale alone. The representation analysis adds interpretive value beyond accuracy numbers.

major comments (2)
  1. [Results, cross-site transfer paragraph] Cross-site transfer results: the claim of a 'substantial' drop for all models is presented without reported confidence intervals, p-values, or paired statistical tests against the in-distribution baselines; this weakens the assertion that the degradation is consistent and load-bearing for the conclusion that pretraining does not guarantee robustness.
  2. [Representation analysis subsection] Representation analysis: the statement that 'domain-related variation dominates' rests on visual inspection of embeddings; without quantitative support such as domain-classification accuracy on the frozen features or a direct comparison of cluster separation metrics between domain and grade labels, the dominance claim remains qualitative and does not fully substantiate that visual shift is the primary driver.
minor comments (2)
  1. [Abstract] The abstract states that PFMs are 'less sensitive' to label-distribution shift but does not quantify the relative magnitude of the two shift types (e.g., via delta-AUC or normalized drop); adding a direct side-by-side comparison would improve clarity.
  2. [Methods] The description of the weakly supervised slide-level modeling choices (MIL aggregator, aggregation function, etc.) is referenced but not fully specified in the provided text; expanding this in the methods would aid reproducibility.
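
The side-by-side comparison requested in minor comment 1 could be as simple as a normalized drop per shift type. The sketch below shows the arithmetic only. The function name `relative_drop` is an assumption, and the three scores are placeholders chosen for illustration, not results from the paper.

```python
def relative_drop(in_dist, shifted):
    """Fraction of in-distribution performance lost under a shift."""
    if in_dist <= 0:
        raise ValueError("in-distribution score must be positive")
    return (in_dist - shifted) / in_dist


# Placeholder scores for illustration only, NOT results from the paper.
hypothetical_in_dist = 0.90
hypothetical_cross_site = 0.45
hypothetical_label_shift = 0.81

site_drop = relative_drop(hypothetical_in_dist, hypothetical_cross_site)
label_drop = relative_drop(hypothetical_in_dist, hypothetical_label_shift)
```

Reporting both numbers for each model on the same scale would make the claim that one shift type dominates directly checkable.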

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for these constructive comments, which highlight opportunities to strengthen the statistical rigor and quantitative support in our manuscript. We address each major comment below and will revise the paper accordingly.

Point-by-point responses
  1. Referee: [Results, cross-site transfer paragraph] Cross-site transfer results: the claim of a 'substantial' drop for all models is presented without reported confidence intervals, p-values, or paired statistical tests against the in-distribution baselines; this weakens the assertion that the degradation is consistent and load-bearing for the conclusion that pretraining does not guarantee robustness.

    Authors: We agree that adding confidence intervals and statistical tests will make the claims more robust. In the revised manuscript we will report 95% confidence intervals (via bootstrap resampling over slides) for all AUC and accuracy metrics. We will also add paired statistical tests (Wilcoxon signed-rank test on per-slide performance scores) comparing in-distribution versus cross-site results for each model, with p-values and effect sizes. These additions will directly support the consistency of the observed drops.
    Revision: yes
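
A percentile bootstrap over slides, as proposed in this response, might look like the sketch below. It is a minimal version for accuracy only, and it assumes per-slide correctness indicators are available. The function name `bootstrap_ci`, the number of resamples, and the toy data are hypothetical. The paired test would need a separate routine (for example a Wilcoxon signed-rank implementation from a statistics library).

```python
import random


def bootstrap_ci(correct, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy, resampling slides with replacement.

    correct: list of 0/1 indicators, one per slide.
    Returns (point_estimate, lower, upper).
    """
    rng = random.Random(seed)
    n = len(correct)
    stats = []
    for _ in range(n_boot):
        sample = [correct[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(sample) / n)
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return sum(correct) / n, lower, upper


# Toy data: 100 slides, 70 graded correctly (illustrative, not from the paper).
toy_correct = [1] * 70 + [0] * 30
acc, ci_low, ci_high = bootstrap_ci(toy_correct)
```

Resampling at the slide level keeps all patches of a slide together, which matches the unit on which the grade is predicted.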

  2. Referee: [Representation analysis subsection] Representation analysis: the statement that 'domain-related variation dominates' rests on visual inspection of embeddings; without quantitative support such as domain-classification accuracy on the frozen features or a direct comparison of cluster separation metrics between domain and grade labels, the dominance claim remains qualitative and does not fully substantiate that visual shift is the primary driver.

    Authors: We acknowledge that the current dominance claim is supported primarily by t-SNE visualizations. In the revision we will add quantitative analyses: (1) linear probe accuracies for predicting site (domain) versus grade from the frozen PFM features, and (2) silhouette scores and between-cluster variance ratios comparing domain-based versus grade-based clustering on the embeddings. These metrics will provide direct quantitative evidence that domain separation is stronger than grade-related structure.
    Revision: yes
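
The silhouette comparison mentioned in this response can be sketched in a few lines. The example computes the mean silhouette coefficient of the same embeddings under two labelings, site and grade. Everything here is illustrative: the function name `silhouette`, the Euclidean distance, and the synthetic 2-dimensional points (built so that site separation is large and grade separation is small) are assumptions, not the paper's features or results.

```python
import math


def silhouette(points, labels):
    """Mean silhouette coefficient with Euclidean distance."""
    n = len(points)
    groups = sorted(set(labels))
    total = 0.0
    for i in range(n):
        mean_dist = {}
        for g in groups:
            d = [math.dist(points[i], points[j])
                 for j in range(n) if labels[j] == g and j != i]
            if d:
                mean_dist[g] = sum(d) / len(d)
        own = labels[i]
        if own not in mean_dist:
            continue  # a singleton cluster contributes 0
        a = mean_dist[own]                                      # cohesion
        b = min(v for g, v in mean_dist.items() if g != own)    # separation
        total += (b - a) / max(a, b)
    return total / n


# Synthetic embeddings: site shifts x by 10, grade shifts y by only 1.
points, site_labels, grade_labels = [], [], []
for site in (0, 1):
    for grade in (0, 1):
        for k in range(5):
            points.append((10.0 * site + 0.1 * k, 1.0 * grade + 0.1 * k))
            site_labels.append(site)
            grade_labels.append(grade)

sil_site = silhouette(points, site_labels)
sil_grade = silhouette(points, grade_labels)
```

On real features, a site silhouette well above the grade silhouette would give the quantitative form of the visual impression that embeddings cluster by collection site.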

Circularity Check

0 steps flagged

Empirical benchmark study with no derivations or self-referential reductions

The paper is a pure empirical benchmark: it measures slide-level grading performance of frozen PFMs on held-out PANDA splits (in-distribution and Radboud-to-Karolinska cross-site) and reports representation statistics. No equations, ansatzes, uniqueness theorems, or fitted parameters are introduced whose outputs are then relabeled as predictions. All reported numbers are direct evaluations on disjoint data; the central claim that visual domain shift dominates is therefore a measured outcome rather than a quantity forced by the modeling choices themselves. Self-citations, if present, are not load-bearing for any derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The evaluation rests on standard machine-learning assumptions about dataset splits and model training; no free parameters, invented entities, or non-standard axioms are introduced.

axioms (1)
  • Domain assumption: The PANDA dataset collection-site and grade-group splits constitute meaningful proxies for clinically relevant distribution shifts.
    Invoked when defining in-distribution vs. cross-site and label-shift experiments.

Supplementary material

Table S1. Raw numerical results for Figure 1. All results are mean±std over 10 random cross-validation folds. Column headers: PANDA, Karolinska, Radboud, Radboud→Karolinska, Radboud-U, Radboud-U→Karolinska-U, Radboud-...