Diagnostics for Individual-Level Prediction Instability in Machine Learning for Healthcare

Elizabeth W. Miller; Jeffrey D. Blume

arXiv: 2603.00192 · v2 · submitted 2026-02-27 · cs.LG · stat.AP · stat.ML

Pith reviewed 2026-05-15 18:48 UTC · model grok-4.3

classification: cs.LG · stat.AP · stat.ML

keywords: prediction instability · machine learning · healthcare · individual risk estimates · optimization randomness · decision thresholds · model validation

The pith

Optimization randomness alone can make the same patient's risk estimate vary as much as retraining on entirely new data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that in overparameterized machine learning models for healthcare, random choices during optimization and initialization alone produce large differences in predicted risk for any single patient. Standard performance numbers such as accuracy or log-loss hide this variability because they average across many patients. The authors introduce two simple checks: one that measures how wide the range of risk scores becomes across repeated trainings, and another that counts how often a patient crosses a clinical decision threshold. On both simulated data and the GUSTO-I clinical trial set, neural networks exhibit far more patient-level flip-flops than logistic regression, and the size of the instability matches the size seen when the entire training set is resampled. The central message is that clinical trust requires checking stability at the individual level, not only aggregate accuracy.

Core claim

Even when data and model architecture are held fixed, randomness introduced by optimization and initialization can lead to materially different risk estimates for the same patient; this procedural arbitrariness is comparable in magnitude to the variability obtained by resampling the entire training dataset and can change threshold-based treatment recommendations.

What carries the argument

Two complementary diagnostics carry the argument: empirical prediction interval width (ePIW), which quantifies the spread in continuous risk scores, and empirical decision flip rate (eDFR), which quantifies instability in binary clinical decisions. Both are computed across repeated trainings with identical data and architecture.
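As a rough illustration of how such diagnostics can be computed, the sketch below works from a matrix of predicted risks with one row per repeated training and one column per patient. The exact definitions used in the paper are not reproduced on this page, so two assumptions are made here: ePIW is taken as the width of a central quantile interval across runs, and eDFR as the share of runs whose threshold decision disagrees with the majority decision. The names `epiw`, `edfr`, and `preds`, and the toy numbers, are hypothetical.

```python
import numpy as np

def epiw(preds, level=0.95):
    """Per-patient width of the central `level` interval of predicted risk.

    preds: array of shape (n_runs, n_patients), one row per repeated training.
    ASSUMPTION: a quantile-based width; the paper's definition may differ.
    """
    lo, hi = np.quantile(preds, [(1 - level) / 2, (1 + level) / 2], axis=0)
    return hi - lo

def edfr(preds, threshold):
    """Per-patient share of runs that disagree with the majority decision.

    ASSUMPTION: a "flip" is counted against the majority decision across runs.
    """
    treat = preds >= threshold            # binary decision in each run
    share_treated = treat.mean(axis=0)    # fraction of runs recommending treatment
    return np.minimum(share_treated, 1 - share_treated)

# Toy input: 4 repeated trainings, 3 patients (made-up numbers).
preds = np.array([
    [0.10, 0.18, 0.70],
    [0.12, 0.22, 0.72],
    [0.11, 0.25, 0.69],
    [0.09, 0.19, 0.71],
])
widths = epiw(preds, level=1.0)       # level=1.0 gives the full range (max minus min)
flips = edfr(preds, threshold=0.20)   # only the middle patient straddles 0.20
```

In this toy input the second patient's risk sits on both sides of the 0.20 threshold, so half of the runs disagree with the other half, while the other two patients never change decision even though their scores still vary.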

If this is right

Flexible models such as neural networks show substantially greater individual-level instability than simpler models such as logistic regression.
Models that look identical on aggregate metrics can still differ markedly in which patients receive a given treatment recommendation.
Risk instability concentrated near clinical thresholds can directly alter patient management decisions.
Routine model validation in healthcare should include explicit checks for individual-level stability in addition to aggregate performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Repeating the training process a modest number of times and reporting the range of predictions could become a low-cost addition to existing validation pipelines.
The same diagnostics could be applied to other high-stakes domains where decisions are made at the individual level, such as credit scoring or recidivism prediction.
If instability proves hard to reduce, averaging predictions across multiple independently trained models may offer a practical mitigation.
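The averaging idea in the last point can be illustrated with a small simulation. This is a sketch under strong assumptions, not a result from the paper: each training run is modelled as a hypothetical true risk of 0.20 plus independent Gaussian run-to-run noise with standard deviation 0.05 on the probability scale. Under that assumption, averaging `k` independently trained models should shrink the spread by roughly a factor of the square root of `k`.

```python
import numpy as np

rng = np.random.default_rng(0)

# ASSUMPTION: one patient, underlying risk 0.20, and independent additive
# run-to-run noise (sd 0.05). Real training noise need not be Gaussian
# or independent across runs.
n_ensembles, k = 2000, 10
single_runs = 0.20 + rng.normal(0.0, 0.05, size=(n_ensembles, k))

single = single_runs[:, 0]            # prediction from one trained model
ensemble = single_runs.mean(axis=1)   # average over k independently trained models

def interval_width(x, level=0.95):
    """Width of the central `level` interval of a 1-D sample."""
    lo, hi = np.quantile(x, [(1 - level) / 2, (1 + level) / 2])
    return hi - lo

w_single = interval_width(single)
w_ensemble = interval_width(ensemble)
print(f"single-model width:     {w_single:.3f}")
print(f"{k}-model average width: {w_ensemble:.3f}")
```

Averaging only dampens the run-to-run component. Variability that comes from the training data itself is shared by every model in the ensemble and would not be reduced this way.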

Load-bearing premise

The observed differences in individual predictions are caused mainly by optimization and initialization randomness rather than by unmeasured aspects of the data or training process.

What would settle it

If repeated trainings on exactly the same data and architecture produced individual risk estimates whose spread was consistently much smaller than the spread obtained by retraining on bootstrap-resampled versions of the data, the claim of comparable instability would be falsified.

Figures

Figures reproduced from arXiv: 2603.00192 by Elizabeth W. Miller, Jeffrey D. Blume.

**Figure 1.** Individual-level prediction instability in simulation (top) and the GUSTO-I clinical dataset (bottom). Panels ...

**Figure 2.** Individual-level prediction distributions across model families. Each panel displays the empirical distribution ...

**Figure 3.** Prediction instability for a fixed out-of-sample individual (true risk ...

**Figure 4.** Predicted risk versus true risk across model classes, training sample sizes, and sources of stochasticity. ...

**Figure 5.** Individual-level prediction instability, bias, and mean squared error as functions of true risk in the simulated ...

Original abstract

In healthcare, predictive models increasingly inform patient-level decisions, yet little attention is paid to the variability in individual risk estimates and its impact on treatment decisions. For overparameterized models, now standard in machine learning, a substantial source of variability often goes undetected. Even when the data and model architecture are held fixed, randomness introduced by optimization and initialization can lead to materially different risk estimates for the same patient. This problem is largely obscured by standard evaluation practices, which rely on aggregate performance metrics (e.g., log-loss, accuracy) that are agnostic to individual-level stability. As a result, models with indistinguishable aggregate performance can nonetheless exhibit substantial procedural arbitrariness, which can undermine clinical trust. We propose an evaluation framework that quantifies individual-level prediction instability by using two complementary diagnostics: empirical prediction interval width (ePIW), which captures variability in continuous risk estimates, and empirical decision flip rate (eDFR), which measures instability in threshold-based clinical decisions. We apply these diagnostics to simulated data and GUSTO-I clinical dataset. Across observed settings, we find that for flexible machine-learning models, randomness arising solely from optimization and initialization can induce individual-level variability comparable to that produced by resampling the entire training dataset. Neural networks exhibit substantially greater instability in individual risk predictions compared to logistic regression models. Risk estimate instability near clinically relevant decision thresholds can alter treatment recommendations. These findings that stability diagnostics should be incorporated into routine model validation for assessing clinical reliability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The new ePIW and eDFR diagnostics usefully flag patient-level instability from training randomness in flexible models, but the resampling comparison mixes optimization and data effects without clean separation.

The paper's core contribution is a pair of simple diagnostics that measure how much a single patient's risk score or decision can shift across repeated training runs on the same data. ePIW tracks the spread in continuous predictions and eDFR counts threshold flips. On both simulated data and the GUSTO-I set, the authors show that neural nets produce noticeably more of this instability than logistic regression, and that its scale can reach the level seen when the training set itself is resampled. That observation is worth having on record for anyone validating clinical models, because standard aggregate metrics will miss it entirely. The authors are right that this kind of procedural arbitrariness matters when a model is meant to support individual treatment choices. The work is straightforward and the motivation is practical.

The main limitation is the resampling comparison. When each bootstrap replicate is trained with its own independent optimization and initialization, the variability attributed to data resampling necessarily includes the same sources of randomness that are being isolated in the fixed-data arm. Without a variance decomposition or fixed-seed controls across resamples, it is hard to tell how much of the reported comparability reflects optimization and initialization alone and how much reflects the combined effect. The abstract and methods description give no quantitative details on run counts, effect sizes, or confidence intervals, which leaves the strength of the claims difficult to judge from the text alone.

This is the sort of paper that belongs in a methods or applied ML journal rather than a top-tier venue. Readers who build or audit clinical prediction tools would get value from the diagnostics themselves, even if the exact comparison to resampling needs tightening. It is coherent on its own terms and engages the right literature on stability, so it deserves a serious referee who can ask for the missing controls and numbers.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes two diagnostics—empirical prediction interval width (ePIW) and empirical decision flip rate (eDFR)—to quantify individual-level prediction instability arising from optimization and initialization randomness in flexible machine-learning models for healthcare. It reports that, on simulated data and the GUSTO-I dataset, this source of variability produces individual risk estimate changes comparable in magnitude to those induced by resampling the full training set, with neural networks exhibiting substantially higher instability than logistic regression; instability near decision thresholds can alter clinical recommendations. The work argues that aggregate metrics obscure this procedural arbitrariness and that stability diagnostics should be routine.

Significance. If the central comparison holds after clarification, the result is significant because it identifies a practically consequential but previously under-quantified source of patient-level arbitrariness in clinical ML. The proposed diagnostics are simple to compute from repeated training runs and directly address a gap between standard validation practices and the requirements of individualized decision-making.

major comments (2)

[Methods (resampling procedure)] Methods section describing the resampling experiment: the reported comparability between opt/init variability (fixed data) and full-dataset resampling variability is difficult to interpret because the resampling arm appears to permit independent optimization runs on each bootstrap replicate. Without fixed random seeds or an explicit variance decomposition isolating data-induced effects, the observed individual-level differences may be driven by the union of both sources rather than opt/init randomness alone. This directly affects the load-bearing claim that opt/init randomness suffices to produce comparable instability.
[Results] Results section and abstract claims: no numerical values, confidence intervals, or effect-size summaries are supplied for ePIW or eDFR on either the simulated or GUSTO-I experiments, nor are sample sizes or number of repeated training runs stated. Without these quantities it is impossible to judge whether the reported comparability is statistically reliable or clinically material.
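The variance decomposition requested in the first major comment could look something like the sketch below. It assumes a crossed design that the page does not confirm the authors ran: for one patient, every bootstrap resample `b` is trained with several seeds `s`, giving a prediction matrix of shape (resamples, seeds). The law of total variance then splits the total spread into a seed-to-seed part within each resample and a between-resample part. The function name `decompose_variance` and the simulated effect sizes are hypothetical.

```python
import numpy as np

def decompose_variance(preds):
    """Split one patient's prediction variance into data and optimization parts.

    preds: array of shape (n_resamples, n_seeds); entry [b, s] is the predicted
    risk from the model trained on bootstrap resample b with seed s.
    """
    within = preds.var(axis=1).mean()    # average seed-to-seed variance per resample
    between = preds.mean(axis=1).var()   # variance of seed-averaged predictions
    total = preds.var()                  # variance over all runs pooled together
    return {"optimization": within, "data": between, "total": total}

# Synthetic illustration only: additive, independent data and seed effects.
rng = np.random.default_rng(1)
data_effect = rng.normal(0.0, 0.03, size=(200, 1))    # shared by all seeds of a resample
seed_effect = rng.normal(0.0, 0.04, size=(200, 20))   # independent for every run
parts = decompose_variance(0.30 + data_effect + seed_effect)
print(parts)
```

With a balanced design and population variances (NumPy's default `ddof=0`), the two components add up to the total exactly. With a finite number of seeds, the "data" term still carries a small residue of seed noise (about the seed variance divided by the number of seeds), so it slightly overstates the pure data effect.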

minor comments (1)

[Abstract] Abstract: the sentence 'These findings that stability diagnostics should be incorporated' is grammatically incomplete; rephrase to 'These findings suggest that stability diagnostics should be incorporated'.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and will revise the manuscript to improve methodological clarity and quantitative reporting.

Referee: [Methods (resampling procedure)] Methods section describing the resampling experiment: the reported comparability between opt/init variability (fixed data) and full-dataset resampling variability is difficult to interpret because the resampling arm appears to permit independent optimization runs on each bootstrap replicate. Without fixed random seeds or an explicit variance decomposition isolating data-induced effects, the observed individual-level differences may be driven by the union of both sources rather than opt/init randomness alone. This directly affects the load-bearing claim that opt/init randomness suffices to produce comparable instability.

Authors: We acknowledge the validity of this observation. In the submitted manuscript, each bootstrap replicate was indeed trained with independent optimization and initialization, so the resampling-arm variability conflates data effects with procedural randomness. This ambiguity weakens the direct comparison. We will revise the Methods section to fix random seeds for optimization and initialization across all bootstrap replicates, add an explicit variance decomposition separating data-induced and optimization-induced components, and re-run the experiments. The updated results will demonstrate that opt/init variability on fixed data remains comparable in magnitude to data-only variability, thereby preserving the central claim while making the evidence cleaner and more interpretable.

Revision: yes

Referee: [Results] Results section and abstract claims: no numerical values, confidence intervals, or effect-size summaries are supplied for ePIW or eDFR on either the simulated or GUSTO-I experiments, nor are sample sizes or number of repeated training runs stated. Without these quantities it is impossible to judge whether the reported comparability is statistically reliable or clinically material.

Authors: We agree that the absence of specific numerical summaries limits evaluation of the results. The revised manuscript will add tables reporting mean ePIW and eDFR values together with 95% confidence intervals for both the simulated and GUSTO-I experiments. We will also state the number of repeated training runs (50 for simulations, 30 for GUSTO-I) and the sample sizes, and include effect-size ratios comparing opt/init variability to resampling variability. These quantitative details will be incorporated into the Results section, abstract, and supplementary material to permit assessment of statistical reliability and clinical materiality.

Revision: yes
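One way a mean diagnostic with a 95% confidence interval might be reported is sketched below, using a percentile bootstrap over patients. This is an assumption about the interval method, since the rebuttal does not say how the intervals would be constructed. The per-patient ePIW values are synthetic and are not the paper's results, and `mean_with_ci` is a hypothetical helper.

```python
import numpy as np

def mean_with_ci(values, n_boot=2000, level=0.95, seed=0):
    """Mean of per-patient values with a percentile bootstrap interval.

    Patients are resampled with replacement; the interval is read off the
    quantiles of the resampled means.
    """
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [(1 - level) / 2, (1 + level) / 2])
    return values.mean(), lo, hi

# Synthetic per-patient ePIW values for 500 hypothetical patients.
rng = np.random.default_rng(2)
epiw_demo = rng.gamma(shape=2.0, scale=0.03, size=500)

mean_epiw, lo, hi = mean_with_ci(epiw_demo)
print(f"mean ePIW {mean_epiw:.3f} (95% CI {lo:.3f} to {hi:.3f})")
```

This interval reflects uncertainty from the finite set of evaluated patients only. Uncertainty from the finite number of repeated training runs would need a separate treatment, for example by resampling runs as well.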

Circularity Check

0 steps flagged

Empirical diagnostics defined from repeated runs without self-referential reduction

The paper defines ePIW and eDFR directly from multiple independent training runs on fixed data and reports observed variability on simulated and GUSTO-I datasets. The central comparison to full-dataset resampling is an empirical observation rather than a derivation that reduces by the paper's own equations to quantities fitted from the same runs. No self-citation chain, ansatz smuggling, or fitted-input-called-prediction pattern appears in the load-bearing steps; the diagnostics remain independent of the instability magnitudes they measure.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the work defines empirical diagnostics from repeated model trainings and applies them to existing datasets without additional postulates.


