Diagnostics for Individual-Level Prediction Instability in Machine Learning for Healthcare
Pith reviewed 2026-05-15 18:48 UTC · model grok-4.3
The pith
Optimization randomness alone can make the same patient's risk estimate vary as much as retraining on entirely new data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Even when data and model architecture are held fixed, randomness introduced by optimization and initialization can lead to materially different risk estimates for the same patient; this procedural arbitrariness is comparable in magnitude to the variability obtained by resampling the entire training dataset and can change threshold-based treatment recommendations.
What carries the argument
empirical prediction interval width (ePIW) and empirical decision flip rate (eDFR), two complementary diagnostics that quantify spread in continuous risk scores and instability in binary clinical decisions across repeated trainings with identical data and architecture.
If this is right
- Flexible models such as neural networks show substantially greater individual-level instability than simpler models such as logistic regression.
- Models that look identical on aggregate metrics can still differ markedly in which patients receive a given treatment recommendation.
- Risk instability concentrated near clinical thresholds can directly alter patient management decisions.
- Routine model validation in healthcare should include explicit checks for individual-level stability in addition to aggregate performance.
Where Pith is reading between the lines
- Repeating the training process a modest number of times and reporting the range of predictions could become a low-cost addition to existing validation pipelines.
- The same diagnostics could be applied to other high-stakes domains where decisions are made at the individual level, such as credit scoring or recidivism prediction.
- If instability proves hard to reduce, averaging predictions across multiple independently trained models may offer a practical mitigation.
Load-bearing premise
The observed differences in individual predictions are caused mainly by optimization and initialization randomness rather than by unmeasured aspects of the data or training process.
What would settle it
If multiple trainings with the exact same data and architecture produce individual risk estimates whose spread is no larger than the spread obtained from a single model trained on bootstrap-resampled versions of the data, the claim of comparable instability would be falsified.
Figures
read the original abstract
In healthcare, predictive models increasingly inform patient-level decisions, yet little attention is paid to the variability in individual risk estimates and its impact on treatment decisions. For overparameterized models, now standard in machine learning, a substantial source of variability often goes undetected. Even when the data and model architecture are held fixed, randomness introduced by optimization and initialization can lead to materially different risk estimates for the same patient. This problem is largely obscured by standard evaluation practices, which rely on aggregate performance metrics (e.g., log-loss, accuracy) that are agnostic to individual-level stability. As a result, models with indistinguishable aggregate performance can nonetheless exhibit substantial procedural arbitrariness, which can undermine clinical trust. We propose an evaluation framework that quantifies individual-level prediction instability by using two complementary diagnostics: empirical prediction interval width (ePIW), which captures variability in continuous risk estimates, and empirical decision flip rate (eDFR), which measures instability in threshold-based clinical decisions. We apply these diagnostics to simulated data and GUSTO-I clinical dataset. Across observed settings, we find that for flexible machine-learning models, randomness arising solely from optimization and initialization can induce individual-level variability comparable to that produced by resampling the entire training dataset. Neural networks exhibit substantially greater instability in individual risk predictions compared to logistic regression models. Risk estimate instability near clinically relevant decision thresholds can alter treatment recommendations. These findings that stability diagnostics should be incorporated into routine model validation for assessing clinical reliability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes two diagnostics—empirical prediction interval width (ePIW) and empirical decision flip rate (eDFR)—to quantify individual-level prediction instability arising from optimization and initialization randomness in flexible machine-learning models for healthcare. It reports that, on simulated data and the GUSTO-I dataset, this source of variability produces individual risk estimate changes comparable in magnitude to those induced by resampling the full training set, with neural networks exhibiting substantially higher instability than logistic regression; instability near decision thresholds can alter clinical recommendations. The work argues that aggregate metrics obscure this procedural arbitrariness and that stability diagnostics should be routine.
Significance. If the central comparison holds after clarification, the result is significant because it identifies a practically consequential but previously under-quantified source of patient-level arbitrariness in clinical ML. The proposed diagnostics are simple to compute from repeated training runs and directly address a gap between standard validation practices and the requirements of individualized decision-making.
major comments (2)
- [Methods (resampling procedure)] Methods section describing the resampling experiment: the reported comparability between opt/init variability (fixed data) and full-dataset resampling variability is difficult to interpret because the resampling arm appears to permit independent optimization runs on each bootstrap replicate. Without fixed random seeds or an explicit variance decomposition isolating data-induced effects, the observed individual-level differences may be driven by the union of both sources rather than opt/init randomness alone. This directly affects the load-bearing claim that opt/init randomness suffices to produce comparable instability.
- [Results] Results section and abstract claims: no numerical values, confidence intervals, or effect-size summaries are supplied for ePIW or eDFR on either the simulated or GUSTO-I experiments, nor are sample sizes or number of repeated training runs stated. Without these quantities it is impossible to judge whether the reported comparability is statistically reliable or clinically material.
minor comments (1)
- [Abstract] Abstract: the sentence 'These findings that stability diagnostics should be incorporated' is grammatically incomplete; rephrase to 'These findings suggest that stability diagnostics should be incorporated'.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below and will revise the manuscript to improve methodological clarity and quantitative reporting.
read point-by-point responses
-
Referee: [Methods (resampling procedure)] Methods section describing the resampling experiment: the reported comparability between opt/init variability (fixed data) and full-dataset resampling variability is difficult to interpret because the resampling arm appears to permit independent optimization runs on each bootstrap replicate. Without fixed random seeds or an explicit variance decomposition isolating data-induced effects, the observed individual-level differences may be driven by the union of both sources rather than opt/init randomness alone. This directly affects the load-bearing claim that opt/init randomness suffices to produce comparable instability.
Authors: We acknowledge the validity of this observation. In the submitted manuscript, each bootstrap replicate was indeed trained with independent optimization and initialization, so the resampling-arm variability conflates data effects with procedural randomness. This ambiguity weakens the direct comparison. We will revise the Methods section to fix random seeds for optimization and initialization across all bootstrap replicates, add an explicit variance decomposition separating data-induced and optimization-induced components, and re-run the experiments. The updated results will demonstrate that opt/init variability on fixed data remains comparable in magnitude to data-only variability, thereby preserving the central claim while making the evidence cleaner and more interpretable. revision: yes
-
Referee: [Results] Results section and abstract claims: no numerical values, confidence intervals, or effect-size summaries are supplied for ePIW or eDFR on either the simulated or GUSTO-I experiments, nor are sample sizes or number of repeated training runs stated. Without these quantities it is impossible to judge whether the reported comparability is statistically reliable or clinically material.
Authors: We agree that the absence of specific numerical summaries limits evaluation of the results. The revised manuscript will add tables reporting mean ePIW and eDFR values together with 95% confidence intervals for both the simulated and GUSTO-I experiments. We will also state the number of repeated training runs (50 for simulations, 30 for GUSTO-I) and the sample sizes, and include effect-size ratios comparing opt/init variability to resampling variability. These quantitative details will be incorporated into the Results section, abstract, and supplementary material to permit assessment of statistical reliability and clinical materiality. revision: yes
Circularity Check
Empirical diagnostics defined from repeated runs without self-referential reduction
full rationale
The paper defines ePIW and eDFR directly from multiple independent training runs on fixed data and reports observed variability on simulated and GUSTO-I datasets. The central comparison to full-dataset resampling is an empirical observation rather than a derivation that reduces by the paper's own equations to quantities fitted from the same runs. No self-citation chain, ansatz smuggling, or fitted-input-called-prediction pattern appears in the load-bearing steps; the diagnostics remain independent of the instability magnitudes they measure.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
The GUSTO Investigators. An international randomized trial comparing four thrombolytic strategies for acute myocardial infarction.The New England Journal of Medicine, 329(10):673–682, 1993
work page 1993
-
[2]
Poon, Christy Harris Lemak, Juan C
Eric G. Poon, Christy Harris Lemak, Juan C. Rojas, Janet Guptill, and David Classen. Adoption of artificial intelligence in healthcare: Survey of health system priorities, successes, and challenges.Journal of the American Medical Informatics Association, 32(7):1093–1100, 2025
work page 2025
-
[3]
John Atkinson and Emily Atkinson. Machine learning and healthcare: Potential benefits and issues.The Journal of Ambulatory Care Management, 46(2):114–120, 2023
work page 2023
-
[4]
Alexander D’Amour, Katherine Heller, Dan Moldovan, Ben Adlam, Babak Alipanahi, Alex Beutel, Christina Chen, Jonathan Deaton, Jacob Eisenstein, Matthew D. Hoffman, Farhad Hormozdiari, Neil Houlsby, Shaobo Hou, Ghassen Jerfel, Alan Karthikesalingam, Mario Lucic, Yi-An Ma, Cory McLean, Diana Mincu, Akinori Mitani, Andrea Montanari, Zachary Nado, Vivek Natara...
work page 2022
-
[5]
Bansak, Elisabeth Paulson, and Dominik Rothenhaeusler
Kirk C. Bansak, Elisabeth Paulson, and Dominik Rothenhaeusler. Learning under random distributional shifts. In Sanjoy Dasgupta, Stephan Mandt, and Yingzhen Li, editors,Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, volume 238 ofProceedings of Machine Learning Research, pages 3943–3951. PMLR, 02–04 May 2024
work page 2024
-
[6]
Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In Doina Precup and Yee Whye Teh, editors,Proceedings of the 34th International Conference on Machine Learning, volume 70 ofProceedings of Machine Learning Research, pages 1321–1330. PMLR, 06–11 Aug 2017
work page 2017
-
[7]
Ziad Obermeyer, Brian Powers, Christine V ogeli, and Sendhil Mullainathan. Dissecting racial bias in an algorithm used to manage the health of populations.Science, 366(6464):447–453, 2019
work page 2019
-
[8]
Hwanjun Song, Minseok Kim, Dongmin Park, Yooju Shin, and Jae-Gil Lee. Learning from noisy labels with deep neural networks: A survey.IEEE Transactions on Neural Networks and Learning Systems, 34(11):8135–8153, 2023
work page 2023
-
[9]
M. Yusufujiang, C.L.A. Navarro, J.A. Damen, et al. Prediction models developed using artificial intelligence: similar predictive performances with highly varying predictions for individuals – an illustration in deep vein thrombosis.Diagnostic and Prognostic Research, 10(1), 2026. 10 Prediction Instability in ML for Healthcare
work page 2026
-
[10]
Riley, Andrew Pate, Paramjit Dhiman, Laura Archer, Gary P
Richard D. Riley, Andrew Pate, Paramjit Dhiman, Laura Archer, Gary P. Martin, and Gary S. Collins. Clinical prediction models and the multiverse of madness.BMC Medicine, 21(1), 2023
work page 2023
-
[11]
Richard D. Riley and Gary S. Collins. Stability of clinical prediction models developed using statistical or machine learning methods.Biometrical Journal, 65(8), 2023
work page 2023
-
[12]
Trevor Hastie, Robert Tibshirani, and Jerome Friedman.The Elements of Statistical Learning. Springer, 2009
work page 2009
-
[13]
Different horses for different courses: Comparing bias mitigation algorithms in ml
Prakhar Ganesh, Usman Gohar, Lu Cheng, and Golnoosh Farnadi. Different horses for different courses: Comparing bias mitigation algorithms in ml. InWorkshop on Algorithmic Fairness through the Lens of Metrics and Evaluation (AFME) at NeurIPS, 2024. arXiv:2411.11101
-
[14]
Lelkes, Akshit Tyagi, Jonas Kemp, Ethan Steinberg, N
Daniel Lopez-Martinez, Alex Yakubovich, Martin Seneviratne, Adam D. Lelkes, Akshit Tyagi, Jonas Kemp, Ethan Steinberg, N. Lance Downing, Ron C. Li, Keith E. Morse, Nigam H. Shah, and Ming-Jun Chen. Instability in clinical risk stratification models using deep learning. In Antonio Parziale, Monica Agrawal, Shalmali Joshi, Irene Y . Chen, Shengpu Tang, Luis...
work page 2022
-
[15]
Gary S. Collins, Johannes B. Reitsma, Douglas G. Altman, and Karel G.M. Moons. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (tripod).Circulation, 131(2):211–219, 2015
work page 2015
-
[16]
Statistical modeling: The two cultures.Statistical Science, 16(3):199–215, 2001
Leo Breiman. Statistical modeling: The two cultures.Statistical Science, 16(3):199–215, 2001
work page 2001
-
[17]
Lesia Semenova, Cynthia Rudin, and Ronald E. Parr. On the existence of simpler machine learning models. In Proceedings of the Conference on Fairness, Accountability, and Transparency (FAT*), 2019
work page 2019
-
[18]
Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off.Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019
work page 2019
-
[19]
Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J. Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation.The Annals of Statistics, 50(2):949–986, 2022
work page 2022
-
[20]
Bradley Efron. The estimation of prediction error: Covariance penalties and cross-validation.Journal of the American Statistical Association, 99(467):619–632, 2004
work page 2004
-
[21]
Ewout W. Steyerberg, Marinus J. C. Eijkemans, Eric Boersma, J. Dik F. Habbema, and Hans C. van Houwelin- gen. Equally valid models gave divergent predictions for mortality in acute myocardial infarction patients in a comparison of logistic regression models.Journal of Clinical Epidemiology, 58(4):383–390, 2005
work page 2005
-
[22]
Riley, Danielle van der Windt, Peter Croft, and Karel G
Richard D. Riley, Danielle van der Windt, Peter Croft, and Karel G. M. Moons, editors.Prognosis Research in Healthcare: Concepts, Methods and Impact. Oxford University Press, Oxford, UK, 2019
work page 2019
-
[23]
Goodfellow, Oriol Vinyals, and Andrew M
Ian J. Goodfellow, Oriol Vinyals, and Andrew M. Saxe. Qualitatively characterizing neural network optimization problems. InProceedings of the International Conference on Learning Representations (ICLR), 2015. Published as a conference paper at ICLR 2015
work page 2015
-
[24]
Florian Markowetz. All models are wrong and yours are useless: making clinical prediction models impactful for patients.npj Precision Oncology, 8(1):54, 2024
work page 2024
-
[25]
Rakesh Kumar, Ajay Singh, Ahmed Subahi Ahmed Kassar, Mohammed Ismail Humaida, Sudhanshu Joshi, and Manu Sharma. Adoption challenges to artificial intelligence literacy in public healthcare: An evidence based study in saudi arabia.Frontiers in Public Health, 13:1558772, 2025
work page 2025
-
[26]
Nora Arvai, Gellért Katonai, and Bertalan Mesko. Health care professionals’ concerns about medical ai and psychological barriers and strategies for successful implementation: Scoping review.Journal of Medical Internet Research, 27:e66986, 2025. 11 Prediction Instability in ML for Healthcare A Supplementary Results and Figures Table 4: Empirical prediction...
work page 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.