
arxiv: 2604.24679 · v1 · submitted 2026-04-27 · 💻 cs.CV · cs.LG

Benchmarking Pathology Foundation Models for Breast Cancer Survival Prediction

Pith reviewed 2026-05-08 04:15 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords pathology foundation models · breast cancer survival prediction · whole-slide images · model benchmarking · external validation · histopathology · survival analysis · computational pathology

The pith

H-optimus-1 achieves the strongest performance in predicting breast cancer survival from pathology images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper benchmarks multiple pathology foundation models on predicting how long breast cancer patients will survive based on whole-slide histopathology images. It applies the same steps for pulling features from image patches and building survival models to three separate patient groups with long-term records, training on one group and testing on the other two. The results show that H-optimus-1 ranks highest overall, newer model generations beat older ones, yet gains between many recent models stay small. A much smaller distilled model even edges out its larger teacher model while using under 8 percent of the parameters. The work matters because it helps decide which models to use in real clinics where both accuracy and speed matter for handling large images.

Core claim

The paper finds that H-optimus-1 delivers the best survival predictions when its image features are used in a fixed survival model, with second-generation pathology foundation models outperforming first-generation ones across external tests. Differences among many recent models remain modest, indicating that simply adding more pretraining data or parameters yields limited further benefit. The compact H0-mini model slightly surpasses its teacher H-optimus-0 despite using fewer than 8% of the parameters and extracting features faster. These findings come from training in one cohort and validating in two others, with the three cohorts together comprising more than 5,400 patients.

What carries the argument

A standardized pipeline that extracts patch-level features from whole-slide images with each foundation model then applies one shared survival modeling framework, evaluated through training on one cohort and external testing on two others.
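
As a rough illustration of the shape of such a pipeline, the sketch below pools patch-level features into one vector per slide and scores a linear risk head with the negative Cox partial log-likelihood. Mean pooling, the linear head `w`, and the function names `pool_patches` and `cox_neg_log_partial_likelihood` are assumptions made for illustration; this summary does not specify the paper's actual aggregation step or survival head.

```python
import numpy as np

def pool_patches(patch_features):
    """Mean-pool an (n_patches, dim) array into one slide-level vector (assumed aggregator)."""
    return np.asarray(patch_features, dtype=float).mean(axis=0)

def cox_neg_log_partial_likelihood(risk, time, event):
    """Negative Cox partial log-likelihood, averaged over observed events.

    risk  : predicted log-hazard per patient (higher means worse prognosis)
    time  : follow-up time per patient
    event : 1 if the event was observed, 0 if censored
    Tied event times are not given special treatment in this sketch.
    """
    risk = np.asarray(risk, dtype=float)
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    # at_risk[i, j] is True when patient j is still at risk at patient i's time
    at_risk = time[None, :] >= time[:, None]
    # numerically stable log-sum-exp over each risk set
    m = risk.max()
    log_denominator = m + np.log((np.exp(risk - m)[None, :] * at_risk).sum(axis=1))
    return -(risk[event] - log_denominator[event]).sum() / max(event.sum(), 1)

# Toy usage: three hypothetical slides with different patch counts, 4-dim features
rng = np.random.default_rng(0)
slides = [rng.normal(size=(n, 4)) for n in (5, 8, 3)]
X = np.stack([pool_patches(s) for s in slides])   # (n_slides, dim)
w = np.zeros(4)                                   # untrained linear head (hypothetical)
loss = cox_neg_log_partial_likelihood(X @ w, time=[1.0, 2.0, 3.0], event=[1, 1, 1])
```

Because the same pooling and loss are applied to every encoder's features, any difference in the resulting risk scores is attributable to the features themselves, which is the point of a fixed downstream framework.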

If this is right

  • Second-generation pathology foundation models provide better representations for survival prediction than first-generation versions.
  • Further scaling of pretraining data or model size alone brings only modest gains for breast cancer survival tasks.
  • Compact distilled models can match or exceed larger models while cutting computation time for feature extraction.
  • External validation across separate cohorts supports the reliability of the performance comparisons.
  • Clinical settings gain concrete options for choosing accurate yet efficient models when processing large histopathology slides.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar standardized benchmarks applied to other cancer types or tasks such as metastasis prediction could test whether generational gains and diminishing returns appear more widely.
  • Task-specific adaptation of the top models might narrow remaining performance gaps more than continued pretraining scale.
  • The edge shown by the distilled model points to greater investment in compression techniques for pathology foundation models.
  • For many clinical image tasks, balancing predictive strength against inference speed may prove more practical than pursuing ever-larger models.

Load-bearing premise

The fixed patch extraction and survival modeling steps treat every model family equally without built-in advantages for any particular architecture or pretraining approach.

What would settle it

The top ranking would be falsified if a different model outperformed H-optimus-1 by a meaningful margin when the same benchmark is repeated with a different survival modeling method or on a new independent cohort.

Figures

Figures reproduced from arXiv: 2604.24679 by Constance Boissin, David A. Clifton, Fredrik K. Gustafsson, Johan Vallon-Christersson, Mattias Rantalainen.

Figure 1: Main model comparison across evaluation settings, showing performance in terms of C-index (↑) for all thirteen evaluated models. Models are evaluated for recurrence-free survival (RFS) and progression-free survival (PFS), each assessed both for the full cohort (‘All Patients’) and the ‘ER+ & HER2-’ patient subgroup. Bars show the bootstrap mean C-index with 95% confidence intervals
Figure 2: Two-group Kaplan-Meier risk stratification for three representative models. KM survival curves showing stratification into low- and high-risk groups for RFS and PFS, each assessed for the full cohort (‘All Patients’) and the ‘ER+ & HER2-’ patient subgroup. Results for Resnet-IN (left column), UNI (middle), and H-optimus-1 (right). Each plot includes the C-index, log-rank test p-value, and the number of pat…
Figure 3: Four-group Kaplan-Meier risk stratification for three representative models. KM survival curves showing stratification into four risk groups for RFS and PFS, each assessed for the full cohort (‘All Patients’) and the ‘ER+ & HER2-’ patient subgroup. Results for Resnet-IN (left column), UNI (middle), and H-optimus-1 (right). Note the difference in range of the y-axis between RFS and PFS.
Figure 4: Effect of training data size on survival prediction performance for three representative models. C-index performance for Resnet-IN, UNI, and H-optimus-1 as a function of the fraction of training data used (10%, 25%, 50%, 75%, 100%) across all four evaluation settings (RFS and PFS, ‘All Patients’ and ‘ER+ & HER2-’). Results are reported as mean ± standard deviation (std) over five random seeds, where a new …
Figure 5: Effect of training data size on survival prediction performance for the three top-performing models. C-index performance for H-optimus-0, H-optimus-1, and H0-mini as a function of the fraction of training data used (10%, 25%, 50%, 75%, 100%) across all four evaluation settings. The experimental setup is identical to …
Figure 6: UMAP visualizations of learned feature representations for three representative models. UMAP projections of the mean patch-level feature vectors for Resnet-IN (left column), UNI (middle), and H-optimus-1 (right) on the combined KS-Solna and SCAN-B-Lund evaluation set, with one point per patient. Upper row: points are colored by dataset, green for KS-Solna and orange for SCAN-B-Lund. Lower row: points are c…

Abstract

Pathology foundation models (PFMs) have recently emerged as powerful pretrained encoders for computational pathology, enabling transfer learning across a wide range of downstream tasks. However, systematic comparisons of these models for clinically meaningful prediction problems remain limited, especially in the context of survival prediction under external validation. In this study, we benchmark widely used and recently proposed PFMs for breast cancer survival prediction from whole-slide histopathology images. Using a standardized pipeline based on patch-level feature extraction and a unified survival modeling framework, we evaluate model representations across three independent clinical cohorts comprising more than 5,400 patients with long-term follow-up. Models are trained on one cohort and evaluated on two independent external cohorts, enabling a rigorous assessment of cross-dataset generalization. Overall, H-optimus-1 achieves the strongest survival prediction performance. More broadly, we observe consistent generational improvements across model families, with second-generation PFMs outperforming their first-generation counterparts. However, absolute performance differences between many recent PFMs remain modest, suggesting diminishing returns from further scaling of pretraining data or model size alone. Notably, the compact distilled model H0-mini slightly outperforms its larger teacher model H-optimus-0, despite using fewer than 8% of the parameters and enabling significantly faster feature extraction. Together, these results provide the first large-scale, externally validated benchmark of PFMs for breast cancer survival prediction, and offer practical guidance for efficient deployment of PFMs in clinical workflows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript benchmarks multiple pathology foundation models (PFMs) for breast cancer survival prediction from whole-slide histopathology images. Using a standardized pipeline of patch-level feature extraction followed by a unified survival modeling framework, the authors evaluate first- and second-generation PFMs across three independent cohorts comprising more than 5,400 patients, training on one cohort and testing on the other two. They conclude that H-optimus-1 achieves the strongest performance, second-generation models consistently outperform first-generation counterparts, absolute differences among recent PFMs are modest (implying diminishing returns from scaling), and the compact distilled model H0-mini slightly outperforms its larger teacher H-optimus-0 despite using <8% of the parameters.

Significance. If the reported rankings prove statistically robust, the work supplies a valuable large-scale, externally validated benchmark for PFM selection in clinically relevant survival tasks. Strengths include the cross-cohort design, unified evaluation pipeline that enables direct comparisons, and attention to model efficiency via the distilled variant. These elements can guide practical deployment choices in computational pathology.

major comments (3)
  1. [Results section (performance tables)] The central claims that H-optimus-1 is strongest, that second-generation PFMs consistently outperform first-generation ones, and that differences are modest enough to indicate diminishing returns rest on single-point concordance estimates. No bootstrap confidence intervals, cross-validation standard errors, or paired statistical tests (e.g., DeLong or permutation tests) between models are reported, so modest gaps could arise from cohort sampling variability or downstream-head stochasticity rather than representation quality.
  2. [Methods (survival modeling framework)] The unified survival head is trained with a single set of hyperparameters and, apparently, a single random seed per model. Combined with point-estimate external evaluation, this leaves the modest performance differences (including H0-mini vs. H-optimus-0) vulnerable to optimization noise and undermines the reliability of the ranking and generational-improvement conclusions.
  3. [Abstract and Results] The abstract states that 'absolute performance differences between many recent PFMs remain modest' and that H0-mini 'slightly outperforms' its teacher, yet no numerical deltas, confidence intervals, or p-values are supplied. Without these, the practical significance of the observed gaps and the claim of diminishing returns cannot be evaluated.
minor comments (2)
  1. [Abstract] The abstract does not list the concrete concordance values or cohort sizes that support the headline claims; adding one or two key numbers would improve readability.
  2. [Figures and Tables] Figure legends and table captions should explicitly state the exact survival metric (C-index, IBS, etc.) and whether any clinical covariates were included in the Cox models.
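
To make the request in major comment 1 concrete, the sketch below computes Harrell's C-index and a percentile bootstrap interval over patients. It is a minimal NumPy illustration on toy data, not the paper's evaluation code; the function names, the half-credit handling of tied risk scores, and the default of 1,000 resamples are assumptions.

```python
import numpy as np

def harrell_c_index(risk, time, event):
    """Harrell's C-index: fraction of comparable pairs ordered correctly.

    A pair (i, j) is comparable when patient i has an observed event and
    time[i] < time[j]. It is concordant when risk[i] > risk[j];
    tied risk scores count as 0.5.
    """
    risk = np.asarray(risk, dtype=float)
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    comparable = event[:, None] & (time[:, None] < time[None, :])
    concordant = (risk[:, None] > risk[None, :]) & comparable
    tied = (risk[:, None] == risk[None, :]) & comparable
    n_comparable = comparable.sum()
    if n_comparable == 0:
        return float("nan")  # no comparable pairs in this sample
    return (concordant.sum() + 0.5 * tied.sum()) / n_comparable

def bootstrap_c_index(risk, time, event, n_resamples=1000, seed=0):
    """Percentile bootstrap over patients: returns (mean, 2.5th, 97.5th percentile)."""
    risk, time, event = map(np.asarray, (risk, time, event))
    rng = np.random.default_rng(seed)
    n = len(time)
    values = []
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # resample patients with replacement
        values.append(harrell_c_index(risk[idx], time[idx], event[idx]))
    values = np.array(values)
    values = values[~np.isnan(values)]    # drop resamples without comparable pairs
    lower, upper = np.percentile(values, [2.5, 97.5])
    return values.mean(), lower, upper

# Toy cohort: risk scores decrease as follow-up time increases
time = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0])
event = np.array([1, 1, 0, 1, 0, 1])
risk = np.array([0.9, 0.7, 0.6, 0.4, 0.3, 0.1])
c = harrell_c_index(risk, time, event)
mean_c, lo, hi = bootstrap_c_index(risk, time, event, n_resamples=200)
```

Resampling whole patients keeps each risk score paired with its own follow-up time and event indicator, so the interval reflects cohort sampling variability rather than noise in the survival head.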

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed review of our manuscript. We appreciate the emphasis on statistical rigor and have carefully considered each point. Below we provide point-by-point responses and outline the revisions we will make to strengthen the reliability of our benchmark results.

  1. Referee: Results section (performance tables): the central claims that H-optimus-1 is strongest, that second-generation PFMs consistently outperform first-generation ones, and that differences are modest enough to indicate diminishing returns rest on single-point concordance estimates. No bootstrap confidence intervals, cross-validation standard errors, or paired statistical tests (e.g., DeLong or permutation tests) between models are reported, so modest gaps could arise from cohort sampling variability or downstream-head stochasticity rather than representation quality.

    Authors: We agree that single-point estimates limit the interpretability of model rankings and the claim of diminishing returns. In the revised manuscript we will add bootstrap confidence intervals (1,000 resamples) for all C-index values on the external cohorts and perform paired permutation tests between models to determine whether observed differences are statistically significant. These additions will be presented in updated performance tables and discussed in the Results section to substantiate the reported ordering and generational trends. revision: yes

  2. Referee: Methods (survival modeling framework): the unified survival head is trained with a single set of hyperparameters and, apparently, a single random seed per model. Combined with point-estimate external evaluation, this leaves the modest performance differences (including H0-mini vs. H-optimus-0) vulnerable to optimization noise and undermines the reliability of the ranking and generational-improvement conclusions.

    Authors: We acknowledge that reliance on a single seed and fixed hyperparameters introduces potential stochastic variability. We will revise the Methods and Results to report performance averaged over five independent random seeds for the survival head, including standard deviations. A modest hyperparameter sensitivity analysis for the survival modeling framework will also be included to confirm that the relative ordering of PFMs remains stable. These changes will directly address concerns about optimization noise. revision: yes

  3. Referee: Abstract and Results: the abstract states that 'absolute performance differences between many recent PFMs remain modest' and that H0-mini 'slightly outperforms' its teacher, yet no numerical deltas, confidence intervals, or p-values are supplied. Without these, the practical significance of the observed gaps and the claim of diminishing returns cannot be evaluated.

    Authors: We will update both the abstract and the Results section to include explicit numerical C-index deltas between key models, the newly computed bootstrap confidence intervals, and p-values from the permutation tests. While the abstract must remain concise, we will incorporate the most salient quantitative findings and direct readers to the detailed tables and supplementary statistical analyses for complete reporting. This will allow readers to assess the practical magnitude of the observed differences. revision: partial
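
The paired permutation test promised in the first response could look roughly like the sketch below. It assumes the null hypothesis that the two models are exchangeable, so each patient's pair of scores can be swapped at random; scores are rank-transformed first so that swapping values between differently scaled models is meaningful. The function names, the toy data, and the rank transform are illustrative assumptions, not the authors' procedure.

```python
import numpy as np

def c_index(risk, time, event):
    """Harrell's C-index; tied risk scores count as half-concordant."""
    risk = np.asarray(risk, dtype=float)
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    comparable = event[:, None] & (time[:, None] < time[None, :])
    higher = risk[:, None] > risk[None, :]
    tied = risk[:, None] == risk[None, :]
    return ((higher & comparable).sum() + 0.5 * (tied & comparable).sum()) / comparable.sum()

def paired_permutation_test(risk_a, risk_b, time, event, n_permutations=1000, seed=0):
    """Two-sided paired permutation test for the C-index difference (model A minus model B)."""
    # Rank-transform so both models' scores live on the same scale (assumption)
    rank_a = np.argsort(np.argsort(risk_a)).astype(float)
    rank_b = np.argsort(np.argsort(risk_b)).astype(float)
    observed = c_index(rank_a, time, event) - c_index(rank_b, time, event)
    rng = np.random.default_rng(seed)
    n_extreme = 0
    for _ in range(n_permutations):
        swap = rng.random(len(rank_a)) < 0.5       # swap the two scores per patient
        perm_a = np.where(swap, rank_b, rank_a)
        perm_b = np.where(swap, rank_a, rank_b)
        diff = c_index(perm_a, time, event) - c_index(perm_b, time, event)
        n_extreme += int(abs(diff) >= abs(observed) - 1e-12)
    # Add-one correction keeps the estimated p-value strictly positive
    return observed, (n_extreme + 1) / (n_permutations + 1)

# Toy cohort with two hypothetical models
time = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
event = np.array([1, 1, 0, 1, 1, 0, 1, 1])
risk_a = np.array([8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0])  # orders patients perfectly
risk_b = np.array([5.0, 8.0, 1.0, 6.0, 2.0, 7.0, 3.0, 4.0])  # noisier ordering
delta, p_value = paired_permutation_test(risk_a, risk_b, time, event, n_permutations=500)
same_delta, same_p = paired_permutation_test(risk_a, risk_a, time, event, n_permutations=100)
```

Because both models are scored on the same patients, swapping within patient preserves the pairing, which is what distinguishes this from an unpaired comparison of two independent C-index estimates.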

Circularity Check

0 steps flagged

No circularity: empirical benchmarking on external cohorts with no derivations or self-referential fits

The paper is a pure empirical benchmarking study. It extracts patch-level features from various PFMs using a fixed pipeline, fits a standard survival model on one cohort, and reports C-index (or equivalent) on two held-out external cohorts. No equations, first-principles derivations, or parameter fits are presented as 'predictions' that later reduce to the inputs by construction. Claims about generational improvements and modest differences follow directly from tabulated performance numbers on independent data; the pipeline is described as standardized and externally reproducible. No self-citations are invoked to justify uniqueness theorems, ansatzes, or load-bearing premises. The evaluation is falsifiable via new cohorts or statistical tests, satisfying the criteria for non-circularity.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

Empirical benchmarking study applying existing PFMs and standard survival techniques; it introduces no new axioms or invented entities, and no free parameters beyond those of routine training.

free parameters (1)
  • Survival model hyperparameters
    Routine parameters fitted during training on cohorts but not defining the comparative claims.

pith-pipeline@v0.9.0 · 9743 in / 1028 out tokens · 84234 ms · 2026-05-08T04:15:25.132906+00:00 · methodology


Reference graph

Works this paper leans on

50 extracted references · 12 canonical work pages · 2 internal anchors

  1. [1]

    A systematic pan-cancer study on deep learning-based prediction of multi- omic biomarkers from routine pathology images.Communi- cations Medicine, 4(1):48, 2024

    Salim Arslan, Julian Schmidt, Cher Bass, Debapriya Mehro- tra, Andre Geraldes, Shikha Singhal, Julius Hense, Xiusi Li, Pandu Raharja-Liu, Oscar Maiques, et al. A systematic pan-cancer study on deep learning-based prediction of multi- omic biomarkers from routine pathology images.Communi- cations Medicine, 4(1):48, 2024. 1

  2. [2]

    Kazerouni, I

    Bobby Azad, Reza Azad, Sania Eskandari, Afshin Bo- zorgpour, Amirhossein Kazerouni, Islem Rekik, and Dorit 14 Merhof. Foundational models in medical imaging: A comprehensive survey and future vision.arXiv preprint arXiv:2310.18689, 2023. 1

  3. [3]

    Evaluating vision and pathology foundation models for computational pathology: a compre- hensive benchmark study.medRxiv, pages 2025–05, 2025

    Rohan Bareja, Francisco Carrillo-Perez, Yuanning Zheng, Marija Pizurica, Tarak Nath Nandi, Jeanne Shen, Ravi Mad- duri, and Olivier Gevaert. Evaluating vision and pathology foundation models for computational pathology: a compre- hensive benchmark study.medRxiv, pages 2025–05, 2025. 1

  4. [4]

    Foundation models in computational pathology: A review of challenges, opportunities, and impact

    Mohsin Bilal, Manahil Raza, Youssef Altherwy, Anas Al- suhaibani, Abdulrahman Abduljabbar, Fahdah Almarshad, Paul Golding, Nasir Rajpoot, et al. Foundation models in computational pathology: A review of challenges, opportu- nities, and impact.arXiv preprint arXiv:2502.08333, 2025. 1, 3

  5. [5]

    H-optimus-1, 2025

    Bioptimus. H-optimus-1, 2025. URLhttps : / / huggingface.co/bioptimus/H- optimus- 1. 3, 13

  6. [6]

    On the Opportunities and Risks of Foundation Models

    Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Alt- man, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models.arXiv preprint arXiv:2108.07258, 2021. 1

  7. [7]

    A comprehensive eval- uation of histopathology foundation models for ovarian can- cer subtype classification.npj Precision Oncology, 9(1):33,

    Jack Breen, Katie Allen, Kieran Zucker, Lucy Godson, Nico- las M Orsi, and Nishant Ravikumar. A comprehensive eval- uation of histopathology foundation models for ovarian can- cer subtype classification.npj Precision Oncology, 9(1):33,

  8. [8]

    Artifi- cial intelligence for diagnosis and gleason grading of prostate cancer: the PANDA challenge.Nature Medicine, 28(1):154– 163, 2022

    Wouter Bulten, Kimmo Kartasalo, Po-Hsuan Cameron Chen, Peter Str ¨om, Hans Pinckaers, Kunal Nagpal, Yuannan Cai, David F Steiner, Hester Van Boven, Robert Vink, et al. Artifi- cial intelligence for diagnosis and gleason grading of prostate cancer: the PANDA challenge.Nature Medicine, 28(1):154– 163, 2022. 1

  9. [9]

    A clinical benchmark of public self-supervised pathol- ogy foundation models.Nature Communications, 16(1): 3640, 2025

    Gabriele Campanella, Shengjia Chen, Manbir Singh, Ruchika Verma, Silke Muehlstedt, Jennifer Zeng, Aryeh Stock, Matt Croken, Brandon Veremis, Abdulkadir Elmas, et al. A clinical benchmark of public self-supervised pathol- ogy foundation models.Nature Communications, 16(1): 3640, 2025. 1

  10. [10]

    Towards a general-purpose foundation model for computational pathology.Nature Medicine, 30(3):850–862,

    Richard J Chen, Tong Ding, Ming Y Lu, Drew FK Williamson, Guillaume Jaume, Andrew H Song, Bowen Chen, Andrew Zhang, Daniel Shao, Muhammad Shaban, et al. Towards a general-purpose foundation model for computational pathology.Nature Medicine, 30(3):850–862,

  11. [11]

    A multimodal whole-slide foundation model for pathology

    Tong Ding, Sophia J Wagner, Andrew H Song, Richard J Chen, Ming Y Lu, Andrew Zhang, Anurag J Vaidya, Guil- laume Jaume, Muhammad Shaban, Ahrong Kim, et al. A multimodal whole-slide foundation model for pathology. Nature Medicine, pages 1–13, 2025. 1, 3, 4, 14

  12. [12]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. InInternational Conference on Learning Representa- tions (ICLR), 2021....

  13. [13]

    Distilling foundation models for robust and efficient models in digital pathology

    Alexandre Filiot, Nicolas Dop, Oussama Tchita, Auriane Riou, R ´emy Dubois, Thomas Peeters, Daria Valter, Marin Scalbert, Charlie Saillard, Genevi`eve Robin, et al. Distilling foundation models for robust and efficient models in digital pathology. InInternational Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pages 162–...

  14. [14]

    Evaluating the yield of medical tests.Jama, 247(18):2543–2546, 1982

    Frank E Harrell, Robert M Califf, David B Pryor, Kerry L Lee, and Robert A Rosati. Evaluating the yield of medical tests.Jama, 247(18):2543–2546, 1982. 13

  15. [15]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceed- ings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016. 3, 13

  16. [16]

    Colorectal cancer risk stratification on histological slides based on survival curves predicted by deep learning

    Julia H ¨ohn, Eva Krieghoff-Henning, Christoph Wies, Lennard Kiehl, Martin J Hetz, Tabea-Clara Bucher, Jitendra Jonnagaddala, Kurt Zatloukal, Heimo M¨uller, Markus Plass, et al. Colorectal cancer risk stratification on histological slides based on survival curves predicted by deep learning. npj Precision Oncology, 7(1):98, 2023. 1

  17. [17]

    Ahmed Raza, Fayyaz Minhas, and Nasir Rajpoot

    Mostafa Jahanifar, Manahil Raza, Kesi Xu, Trinh Thi Le Vuong, Robert Jewsbury, Adam Shephard, Neda Zamanita- jeddin, Jin Tae Kwak, Shan E. Ahmed Raza, Fayyaz Minhas, and Nasir Rajpoot. Domain generalization in computational pathology: Survey and guidelines.ACM Computing Surveys, Just Accepted, April 2025. doi: 10.1145/3724391. URL https://doi.org/10.1145/...

  18. [18]

    End-to-end prognostication in col- orectal cancer by deep learning: a retrospective, multicentre study.The Lancet Digital Health, 6(1):e33–e43, 2024

    Xiaofeng Jiang, Michael Hoffmeister, Hermann Brenner, Hannah Sophie Muti, Tanwei Yuan, Sebastian Foersch, Nicholas P West, Alexander Brobeil, Jitendra Jonnagaddala, Nicholas Hawkins, et al. End-to-end prognostication in col- orectal cancer by deep learning: a retrospective, multicentre study.The Lancet Digital Health, 6(1):e33–e43, 2024. 1

  19. [19]

    Nonparametric estima- tion from incomplete observations.Journal of the American statistical association, 53(282):457–481, 1958

    Edward L Kaplan and Paul Meier. Nonparametric estima- tion from incomplete observations.Journal of the American statistical association, 53(282):457–481, 1958. 13

  20. [20]

    A comprehensive benchmark of histopathology foundation models for kidney digital pathol- ogy images.arXiv preprint arXiv:2603.15967, 2026

    Harishwar Reddy Kasireddy, Patricio S La Rosa, Ak- shita Gupta, Anindya S Paul, Jamie L Fermin, William L Clapp, Meryl A Waldman, Tarek M El-Ashkar, Sanjay Jain, Luis Rodrigues, et al. A comprehensive benchmark of histopathology foundation models for kidney digital pathol- ogy images.arXiv preprint arXiv:2603.15967, 2026. 1

  21. [21]

    Deep- Surv: personalized treatment recommender system using a cox proportional hazards deep neural network.BMC Medi- cal Research Methodology, 18(1):24, 2018

    Jared L Katzman, Uri Shaham, Alexander Cloninger, Jonathan Bates, Tingting Jiang, and Yuval Kluger. Deep- Surv: personalized treatment recommender system using a cox proportional hazards deep neural network.BMC Medi- cal Research Methodology, 18(1):24, 2018. 13

  22. [22]

    A survey on com- putational pathology foundation models: Datasets, adaptation strategies, and evaluation tasks

    Dong Li, Guihong Wan, Xintao Wu, Xinyu Wu, Ajit J Nir- mal, Christine G Lian, Peter K Sorger, Yevgeniy R Semenov, and Chen Zhao. A survey on computational pathology foun- dation models: Datasets, adaptation strategies, and evalua- tion tasks.arXiv preprint arXiv:2501.15724, 2025. 1

  23. [23]

    Unveiling institution-specific bias in pathology foundation models: Detriments, causes, and potential solutions.arXiv preprint arXiv:2502.16889, 2025

    Weiping Lin, Shen Liu, Runchen Zhu, and Liansheng Wang. Unveiling institution-specific bias in pathology foundation models: Detriments, causes, and potential solutions.arXiv preprint arXiv:2502.16889, 2025. 3 15

  24. [24]

    Decoupled weight de- cay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight de- cay regularization. InInternational Conference on Learn- ing Representations (ICLR), 2019. URLhttps : / / openreview.net/forum?id=Bkg6RiCqY7. 13

  25. [25]

    A visual- language foundation model for computational pathology

    Ming Y Lu, Bowen Chen, Drew FK Williamson, Richard J Chen, Ivy Liang, Tong Ding, Guillaume Jaume, Igor Odintsov, Long Phi Le, Georg Gerber, et al. A visual- language foundation model for computational pathology. Nature Medicine, 30:863–874, 2024. 1, 3, 4, 14

  26. [26]

    PathBench: A comprehensive comparison benchmark for pathology foundation models towards preci- sion oncology.arXiv preprint arXiv:2505.20202, 2025

    Jiabo Ma, Yingxue Xu, Fengtao Zhou, Yihui Wang, Cheng Jin, Zhengrui Guo, Jianfeng Wu, On Ki Tang, Huajun Zhou, Xi Wang, et al. PathBench: A comprehensive comparison benchmark for pathology foundation models towards preci- sion oncology.arXiv preprint arXiv:2505.20202, 2025. 3

  27. [27]

    A method for normalizing histology slides for quantitative analysis

    Marc Macenko, Marc Niethammer, James S Marron, David Borland, John T Woosley, Xiaojun Guan, Charles Schmitt, and Nancy E Thomas. A method for normalizing histology slides for quantitative analysis. In2009 IEEE International Symposium on Biomedical Imaging: from Nano to Macro, pages 1107–1110. IEEE, 2009. 12

  28. [28]

    Uni2-h, 2024

    MahmoodLab. Uni2-h, 2024. URLhttps : / / huggingface.co/MahmoodLab/UNI2-h. 3, 13

  29. [29]

    THUNDER: Tile- level Histopathology image UNDERstanding benchmark

    Pierre Marza, Leo Fillioux, Sofi `ene Boutaj, Kunal Mahatha, Christian Desrosiers, Pablo Piantanida, Jose Dolz, Stergios Christodoulidis, and Maria Vakalopoulou. THUNDER: Tile- level Histopathology image UNDERstanding benchmark. In The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2025. 1

  30. [30]

    UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction

    Leland McInnes, John Healy, and James Melville. UMAP: Uniform manifold approximation and projection for dimen- sion reduction.arXiv preprint arXiv:1802.03426, 2018. 7

  31. [31]

    Foundation models for generalist medi- cal artificial intelligence.Nature, 616(7956):259–265, 2023

    Michael Moor, Oishi Banerjee, Zahra Shakeri Hossein Abad, Harlan M Krumholz, Jure Leskovec, Eric J Topol, and Pranav Rajpurkar. Foundation models for generalist medi- cal artificial intelligence.Nature, 616(7956):259–265, 2023. 1

  32. [32]

    Benchmarking foundation models as feature extractors for weakly supervised computational pathology

    Peter Neidlinger, Omar SM El Nahhas, Hannah Sophie Muti, Tim Lenz, Michael Hoffmeister, Hermann Brenner, Marko van Treeck, Rupert Langer, Bastian Dislich, Hans Michael Behrens, et al. Benchmarking foundation models as feature extractors for weakly supervised computational pathology. Nature Biomedical Engineering, pages 1–11, 2025. 1

  33. [33]

    Generalizable biomarker prediction from can- cer pathology slides with self-supervised deep learning: A retrospective multi-centric study.Cell Reports Medicine, 4 (4), 2023

    Jan Moritz Niehues, Philip Quirke, Nicholas P West, Heike I Grabsch, Marko van Treeck, Yoni Schirris, Gregory P Veld- huizen, Gordon GA Hutchins, Susan D Richman, Sebastian Foersch, et al. Generalizable biomarker prediction from can- cer pathology slides with self-supervised deep learning: A retrospective multi-centric study.Cell Reports Medicine, 4 (4), 2023. 1

  34. [34]

    Otsu, A threshold selection method from gray-level histograms

    Nobuyuki Otsu. A threshold selection method from gray- level histograms.IEEE Transactions on Systems, Man, and Cybernetics, 9(1):62–66, 1979. doi: 10.1109/TSMC.1979. 4310076. 12

  35. [35]

    Diatom autofo- cusing in brightfield microscopy: a comparative study

    Jose Luis Pech-Pacheco, Gabriel Cristobal, Jesus Chamorro- Martinez, and Joaquin Fernandez-Valdivia. Diatom autofo- cusing in brightfield microscopy: a comparative study. In Proceedings of the 15th International Conference on Pattern Recognition (ICPR), volume 3, pages 314–317. IEEE, 2000. 12

  36. [36]

    Imagenet large scale visual recognition challenge.International Journal of Computer Vision (IJCV), 115:211–252, 2015

    Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, San- jeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge.International Journal of Computer Vision (IJCV), 115:211–252, 2015. 13

  37. [37]

    H-optimus-0, 2024

    Charlie Saillard, Rodolphe Jenatton, Felipe Llinares-L ´opez, Zelda Mariet, David Cahan´e, Eric Durand, and Jean-Philippe Vert. H-optimus-0, 2024. URLhttps://github. com/bioptimus/releases/tree/main/models/ h-optimus/v0. 1, 3, 13

  38. [38]

    Validation of an AI-based solu- tion for breast cancer risk stratification using routine digital histopathology images.Breast Cancer Research, 26(1):123,

    Abhinav Sharma, Sandy Kang L ¨ovgren, Kajsa Ledesma Eriksson, Yinxi Wang, Stephanie Robertson, Johan Hartman, and Mattias Rantalainen. Validation of an AI-based solu- tion for breast cancer risk stratification using routine digital histopathology images.Breast Cancer Research, 26(1):123,

  39. [39]

    Development and prognostic validation of a three-level nhg-like deep learning-based model for histolog- ical grading of breast cancer.Breast Cancer Research, 26 (1):17, 2024

    Abhinav Sharma, Philippe Weitz, Yinxi Wang, Bojing Liu, Johan Vallon-Christersson, Johan Hartman, and Mattias Rantalainen. Development and prognostic validation of a three-level nhg-like deep learning-based model for histolog- ical grading of breast cancer.Breast Cancer Research, 26 (1):17, 2024. 12

  40. [40]

    Morphological prototyping for unsupervised slide representation learning in computational pathology

    Andrew H Song, Richard J Chen, Tong Ding, Drew FK Williamson, Guillaume Jaume, and Faisal Mahmood. Morphological prototyping for unsupervised slide representation learning in computational pathology. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

  41. [41]

    Scanner-Induced Domain Shifts Undermine the Robustness of Pathology Foundation Models

    Erik Thiringer, Fredrik K Gustafsson, Kajsa Ledesma Eriksson, and Mattias Rantalainen. Scanner-induced domain shifts undermine the robustness of pathology foundation models. arXiv preprint arXiv:2601.04163, 2026.

  42. [42]

    Cross comparison and prognostic assessment of breast cancer multigene signatures in a large population-based contemporary clinical series. Scientific Reports, 9(1):12184,

    Johan Vallon-Christersson, Jari Häkkinen, Cecilia Hegardt, Lao H Saal, Christer Larsson, Anna Ehinger, Henrik Lindman, Helena Olofsson, Tobias Sjöblom, Fredrik Wärnberg, et al. Cross comparison and prognostic assessment of breast cancer multigene signatures in a large population-based contemporary clinical series. Scientific Reports, 9(1):12184,

  43. [43]

    Prediction of recurrence risk in endometrial cancer with multimodal deep learning. Nature Medicine, pages 1–12, 2024

    Sarah Volinsky-Fremond, Nanda Horeweg, Sonali Andani, Jurriaan Barkey Wolf, Maxime W Lafarge, Cor D de Kroon, Gitte Ørtoft, Estrid Høgdall, Jouke Dijkstra, Jan J Jobsen, et al. Prediction of recurrence risk in endometrial cancer with multimodal deep learning. Nature Medicine, pages 1–12, 2024.

  44. [44]

    A foundation model for clinical-grade computational pathology and rare cancers detection. Nature Medicine, pages 1–12, 2024

    Eugene Vorontsov, Alican Bozkurt, Adam Casson, George Shaikovski, Michal Zelechowski, Kristen Severson, Eric Zimmermann, James Hall, Neil Tenenholtz, Nicolo Fusi, et al. A foundation model for clinical-grade computational pathology and rare cancers detection. Nature Medicine, pages 1–12, 2024.

  45. [45]

    Transformer-based unsupervised contrastive learning for histopathological image classification. Medical Image Analysis, 84:102710, 2023

    Hongming Wang, Jie Wang, Liya Yu, and Dinggang Shen. Transformer-based unsupervised contrastive learning for histopathological image classification. Medical Image Analysis, 84:102710, 2023.

  46. [46]

    RetCCL: Clustering-guided contrastive learning for whole-slide image retrieval. Medical Image Analysis, 83:102645, 2023

    Xiyue Wang, Yuexi Du, Sen Yang, Jun Zhang, Minghui Wang, Jing Zhang, Wei Yang, Junzhou Huang, and Xiao Han. RetCCL: Clustering-guided contrastive learning for whole-slide image retrieval. Medical Image Analysis, 83:102645, 2023.

  47. [47]

    Improved breast cancer histological grading using deep learning. Annals of Oncology, 33(1):89–98, 2022

    Y Wang, B Acs, S Robertson, B Liu, Leslie Solorzano, Carolina Wählby, J Hartman, and M Rantalainen. Improved breast cancer histological grading using deep learning. Annals of Oncology, 33(1):89–98, 2022.

  48. [48]

    A whole-slide foundation model for digital pathology from real-world data. Nature, pages 1–8, 2024

    Hanwen Xu, Naoto Usuyama, Jaspreet Bagga, Sheng Zhang, Rajesh Rao, Tristan Naumann, Cliff Wong, Zelalem Gero, Javier González, Yu Gu, et al. A whole-slide foundation model for digital pathology from real-world data. Nature, pages 1–8, 2024.

  49. [49]

    Virchow2: Scaling self-supervised mixed magnification models in pathology

    Eric Zimmermann, Eugene Vorontsov, Julian Viret, Adam Casson, Michal Zelechowski, George Shaikovski, Neil Tenenholtz, James Hall, Thomas Fuchs, Nicolo Fusi, et al. Virchow2: Scaling self-supervised mixed magnification models in pathology. arXiv preprint arXiv:2408.00738,

Supplementary Material

A. Supplementary Figures

This section contains Figures S1-S6.

[Figure: survival probability versus time (years), stratified by risk score group (low, medium low, medium high, high). Panel: Resnet-IN, C-index: 0.612. ...]