An empirical evaluation of the risks of AI model updates using clinical data: stability, arbitrariness, and fairness
Pith reviewed 2026-05-08 03:54 UTC · model grok-4.3
The pith
Updating clinical AI models can cause predictions to flip for many cases, increase arbitrariness, and worsen fairness across patient groups.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When models predicting severe hyperglycemia are retrained on new weekly observations from four U.S. Type 1 diabetes datasets, different update strategies produce substantial prediction flips, greater arbitrariness, degraded accuracy equity, and unbalanced error rates across subpopulations defined by sociodemographic variables. The authors therefore propose multiple dimensions for continuous monitoring and state that such monitoring is required for trustworthy clinical decision support systems.
What carries the argument
The empirical testbed that applies multiple model-update strategies to high-resolution CGM data and measures stability via prediction flips, arbitrariness via output variability, and fairness via accuracy equity and error-rate balance across subgroups.
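The paper's exact metric definitions are not reproduced on this page, but the three quantities named above admit simple illustrative forms. A minimal sketch, assuming binary predictions, an ensemble of equally plausible retrained models for the arbitrariness measure, and subgroup accuracy gap as the accuracy-equity proxy (all function names are our own, not the authors'):

```python
import numpy as np

def flip_rate(preds_old, preds_new):
    """Stability: fraction of cases whose binary prediction changes after an update."""
    preds_old, preds_new = np.asarray(preds_old), np.asarray(preds_new)
    return float(np.mean(preds_old != preds_new))

def arbitrariness(pred_matrix):
    """Arbitrariness: mean per-case disagreement p * (1 - p) across an
    ensemble of retrained models, where p is the fraction of models
    predicting the positive class.  pred_matrix: (n_models, n_cases)."""
    p = np.mean(np.asarray(pred_matrix), axis=0)
    return float(np.mean(p * (1.0 - p)))

def accuracy_equity_gap(y_true, y_pred, groups):
    """Fairness proxy: largest minus smallest subgroup accuracy."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    accs = [np.mean(y_pred[groups == g] == y_true[groups == g])
            for g in np.unique(groups)]
    return float(max(accs) - min(accs))
```

Error-rate balance can be measured the same way with subgroup false-positive or false-negative rates in place of accuracy.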
Load-bearing premise
The risks seen in these four diabetes datasets and the tested update strategies will appear in other clinical prediction tasks and real-world deployments.
What would settle it
A new study that applies the same update strategies to an independent clinical dataset and finds no increase in prediction flips, arbitrariness, or fairness degradation would falsify the claim.
Original abstract
Artificial Intelligence and Machine Learning (AI/ML) models used in clinical settings are increasingly deployed to support clinical decision-making. However, when training data become stale due to changes in demographics, environment, or patient behaviors, model performance can degrade substantially. While updating models with new training data is necessary, such updates may also introduce new risks. We evaluated the proposed monitoring framework on four publicly available U.S.-based Type 1 Diabetes datasets containing high-resolution continuous glucose monitoring (CGM) data, comprising approximately 11,300 weekly observations from 496 participants under 20 years of age. All datasets included structured sociodemographic information. Using the prediction of severe hyperglycemia events in children with type 1 diabetes as a case study, we examine how different model update strategies can adversely affect model stability (e.g., by causing predictions to "flip" for a large number of cases after an update), increase arbitrariness in predictions, or worsen accuracy equity and the balance of error rates across subpopulations. We propose multiple dimensions for continuous monitoring to detect these issues and argue that such monitoring is essential for the development of trustworthy clinical decision support systems.
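The abstract's prediction target, severe hyperglycemia from weekly CGM observations, can be made concrete with a labeling sketch. The paper's exact event definition is not stated on this page; the cutoff and fraction threshold below are illustrative, with 250 mg/dL being a common "level 2" hyperglycemia cutoff in CGM consensus guidance:

```python
import numpy as np

SEVERE_HYPER_MGDL = 250  # common CGM level-2 hyperglycemia cutoff;
                         # the paper's exact definition is not given here

def weekly_label(glucose_mgdl, frac_threshold=0.05):
    """Label a week of CGM readings 1 (severe hyperglycemia event) if the
    fraction of readings above the cutoff exceeds frac_threshold.
    Both thresholds are illustrative assumptions, not the authors' values."""
    readings = np.asarray(glucose_mgdl, dtype=float)
    frac_above = np.mean(readings > SEVERE_HYPER_MGDL)
    return int(frac_above > frac_threshold)
```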
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents an empirical evaluation of risks associated with updating AI/ML models in clinical settings, using prediction of severe hyperglycemia events in pediatric Type 1 diabetes as a case study. It analyzes four publicly available U.S.-based CGM datasets comprising ~11,300 weekly observations from 496 participants under 20, incorporating sociodemographic data, and examines how different model update strategies affect stability (e.g., prediction flips), arbitrariness, and fairness/equity across subpopulations. The authors propose multiple dimensions for continuous monitoring to detect these issues and argue that such monitoring is essential for trustworthy clinical decision support systems.
Significance. If the empirical findings hold, the work provides concrete, data-driven evidence of practical risks in model updating for clinical AI, particularly stability and equity concerns in a high-stakes pediatric diabetes context with real CGM and sociodemographic data. This strengthens the case for proactive monitoring frameworks and could inform deployment practices, though the focused scope on one prediction task limits immediate broad impact without further validation.
minor comments (3)
- Abstract: While the setup and dataset sizes are described, the abstract would benefit from a brief mention of key quantitative outcomes (e.g., magnitude of prediction flips or fairness metric changes) to better convey the strength of the observed risks.
- Methods section: The specific machine learning models, hyperparameters, and exact definitions/thresholds used to quantify 'arbitrariness' and 'prediction flips' should be stated explicitly with pseudocode or equations for reproducibility.
- Results: Tables or figures comparing update strategies to a no-update baseline would clarify whether the reported adverse effects are attributable to the updates themselves rather than baseline variability.
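The baseline comparison requested in the last comment can be sketched with synthetic predictions: retrain twice on the same data to estimate seed-only flip variability, then attribute only the excess flip rate to the update itself. All numbers below are hypothetical, chosen purely to illustrate the comparison:

```python
import numpy as np

rng = np.random.default_rng(0)

def flip_rate(a, b):
    """Fraction of cases whose prediction differs between two model versions."""
    return float(np.mean(np.asarray(a) != np.asarray(b)))

# Hypothetical binary predictions on a fixed evaluation set of 1000 cases:
#   base_a, base_b: two retrainings on the SAME data (seed-only variability)
#   updated:        retraining after adding a new week of data
base_a = rng.integers(0, 2, 1000)
base_b = np.where(rng.random(1000) < 0.02, 1 - base_a, base_a)   # ~2% seed noise
updated = np.where(rng.random(1000) < 0.10, 1 - base_a, base_a)  # ~10% flips

# Only the excess over the seed-noise baseline is attributable to the update.
excess = flip_rate(base_a, updated) - flip_rate(base_a, base_b)
```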
Simulated Author's Rebuttal
We thank the referee for their positive summary of the manuscript, recognition of its empirical contributions on model update risks in clinical AI, and recommendation for minor revision. We appreciate the acknowledgment that the findings provide concrete evidence for stability, arbitrariness, and fairness concerns in a pediatric diabetes context.
Circularity Check
Purely empirical study with no derivation chain or fitted predictions
full rationale
The paper is an empirical evaluation of model-update risks on four pediatric T1D CGM datasets. It reports observed effects on stability, arbitrariness, and fairness under different update strategies and proposes monitoring dimensions based on those observations. No mathematical derivations, first-principles predictions, or equations are present; claims rest directly on measured outcomes from public data rather than any reduction to fitted parameters or self-citation chains. The analysis is therefore self-contained with no load-bearing steps that collapse to inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- model hyperparameters and update thresholds
axioms (1)
- domain assumption: The four U.S.-based Type 1 Diabetes datasets with CGM and sociodemographic data are representative of real clinical populations and data shifts.
Reference graph
Works this paper leans on
- [1] D. Vela, A. Sharp, R. Zhang, T. Nguyen, A. Hoang, and O. S. Pianykh, "Temporal quality degradation in AI models," Sci. Rep. 12(1), 2022.
- [2] J. Hatherley, "A moving target in AI-assisted decision-making: Dataset shift, model updating, and the problem of update opacity," Ethics and Inf. Tech. 27(2), 2025.
- [3] J. Feng, R. V. Phillips, I. Malenica, A. Bishara, A. E. Hubbard, L. A. Celi, and R. Pirracchio, "Clinical artificial intelligence quality improvement: towards continual monitoring and updating of AI algorithms in healthcare," NPJ Digital Med. 5(1), 2022.
- [4] A. Khoshravan Azar, B. Draghi, Y. Rotalinti, P. Myles, and A. Tucker, "The impact of bias on drift detection in AI health software," in Proc. AIME. Springer, 2023.
- [5] M. Mermillod, A. Bugaiska, and P. Bonin, "The stability-plasticity dilemma: Investigating the continuum from catastrophic forgetting to age-limited learning effects," 2013.
- [6] G. M. van de Ven, N. Soures, and D. Kudithipudi, "Continual learning and catastrophic forgetting," preprint arXiv:2403.05175, 2024.
- [7] L. M. Meijerink, Z. S. Dunias, A. M. Leeuwenberg, A. A. de Hond, D. A. Jenkins, G. P. Martin, M. Sperrin, N. Peek, R. Spijker, L. Hooft et al., "Updating methods for artificial intelligence–based clinical prediction models: a scoping review," J. of Clinical Epidemiology 178, 2025.
- [8] I. Zhang and D. Rothenhäusler, "Data quality or data quantity? prioritizing data collection under distribution shift with the data usefulness coefficient," preprint arXiv:2504.06570, 2025.
- [9] S. Patchipala, "Tackling data and model drift in AI: Strategies for maintaining accuracy during ML model inference," Int. J. of Sci. and Research 10(2), 2023.
- [10] V. Subasri, A. Krishnan, A. Kore, A. Dhalla, D. Pandya, B. Wang, D. Malkin, F. Razak, A. A. Verma, A. Goldenberg et al., "Detecting and remediating harmful data shifts for the responsible deployment of clinical AI models," JAMA Network Open 8(6), 2025.
- [11] M. Casimiro, P. Romano, D. Garlan, and L. Rodrigues, "Towards a framework for adapting machine learning components," in Proc. ACSOS. IEEE, 2022.
- [12] R. Florence, S. Leo, S. Kyle, C. Mark, and M. Thomas, "When to retrain a machine learning model," preprint arXiv:2505.14903, 2025.
- [13] B.-Q. Wei, J.-J. Chen, Y.-C. Tseng, and P.-T. P. Kuo, "Representative data selection for efficient medical incremental learning," in Proc. EMBC. IEEE, 2023.
- [14] J. H. Shen, I. D. Raji, and I. Y. Chen, "The data addition dilemma," preprint arXiv:2408.04154, 2024.
- [15] S. Dissanayake, R. Krishna, P. N. Pathirana, M. K. Horne, D. J. Smulewicz, and L. A. Corben, "Continuous optimization of a hierarchical bayesian network for Friedreich's ataxia severity classification," in Proc. EMBC. IEEE, 2024.
- [16] V. E. Leonard et al., "Mitigating catastrophic forgetting in medical imaging via incremental learning," Procedia Comp. Sci. 269, 2025.
- [17] H. Guan, D. Bates, and L. Zhou, "Keeping medical AI healthy and trustworthy: A review of detection and correction methods for system degradation," IEEE Transactions on Biomedical Engineering, 2026.
- [18] M. Emmanuel and A. Rajuroy, "Adaptive self-healing AI models for enhancing selfi performance with real-time retraining detection."
- [19] S. E. Davis, C. G. Walsh, and M. E. Matheny, "Open questions and research gaps for monitoring and updating AI-enabled tools in clinical settings," Frontiers in Digital Health 4, 2022.
- [20] J. Y. Kim, A. Hasan, K. C. Kellogg, W. Ratliff, S. G. Murray, H. Suresh, A. Valladares, K. Shaw, D. Tobey, D. E. Vidal et al., "Development and preliminary testing of health equity across the AI lifecycle (HEAAL): A framework for healthcare delivery organizations to mitigate the risk of AI solutions worsening health inequities," PLOS Digital Health 3(5), 2024.
- [21] M. H. Rahman, M. D. Hossain, K. M. R. Hossan, M. K. S. Uddin, A. A. Zaiem, and M. B. Ullah, "Adaptive fairness in continuous learning AI healthcare systems: Frameworks for dynamic equity alignment," preprint SSRN 5437834, 2025.
- [22] M. Ceccon, D. Dalle Pezze, A. Fabris, and G. A. Susto, "Fairness evolution in continual learning for medical imaging," in Proc. IFAC. Elsevier, 2025.
- [23] P. Ganesh, A. Taik, and G. Farnadi, "The curious case of arbitrariness in machine learning," preprint arXiv:2501.14959, 2025.
- [24] D. Bertsimas, V. Digalakis Jr, Y. Ma, and P. Paschalidis, "Towards stable machine learning model retraining via slowly varying sequences," preprint arXiv:2403.19871, 2024.
- [25] D. Ueda, T. Kakinuma, S. Fujita, K. Kamagata, Y. Fushimi, R. Ito, Y. Matsui, T. Nozaki, T. Nakaura, N. Fujima et al., "Fairness of artificial intelligence in healthcare: review and recommendations," Japanese J. of Radiology 42(1), 2024.
- [26] X. Wang and M. Yin, "Watch out for updates: Understanding the effects of model explanation updates in AI-assisted decision making," in Proc. CHI, 2023, pp. 1–19.
- [27] A.-F. Näher, I. Krumpal, E.-M. Antão, E. Ong, M. Rojo, F. Kaggwa, F. Balzer, L. A. Celi, K. Braune, L. H. Wieler et al., "Measuring fairness preferences is important for artificial intelligence in health care," The Lancet Digital Health 6(5), 2024.
- [28] I. Bilionis, R. C. Berrios, L. Fernandez-Luque, and C. Castillo, "Disparate model performance and stability in machine learning clinical support for diabetes and heart diseases," in Proc. AMIA, 2025.
- [29] J. K. Paulus and D. M. Kent, "Predictably unequal: understanding and addressing concerns that algorithmic clinical prediction may increase health disparities," NPJ Digital Medicine 3(1), 2020.
- [30] S. Jain, M. Wang, K. Creel, and A. Wilson, "Allocation multiplicity: Evaluating the promises of the Rashomon set," in Proc. FAccT, 2025.
- [31] G. Dai, P. Ravishankar, R. Yuan, E. Black, and D. B. Neill, "Be intentional about fairness!: Fairness, size, and multiplicity in the Rashomon set," in Proc. EAAMO, 2025.
- [32] K. Kobylińska, M. Krzyziński, R. Machowicz, M. Adamek, and P. Biecek, "Exploration of the Rashomon set assists trustworthy explanations for medical data," IEEE J. of Biomed. and Health Inf. 28(11), 2024.
- [33] A. F. Cooper, K. Lee, M. Z. Choksi, S. Barocas, C. De Sa, J. Grimmelmann, J. Kleinberg, S. Sen, and B. Zhang, "Arbitrariness and social prediction: The confounding role of variance in fair classification," in Proc. AAAI, 2024.
- [34] A. N. Angelopoulos and S. Bates, "A gentle introduction to conformal prediction and distribution-free uncertainty quantification," preprint arXiv:2107.07511, 2021.
- [35] G. Bansal, B. Nushi, E. Kamar, D. S. Weld, W. S. Lasecki, and E. Horvitz, "Updates in human-AI teams: Understanding and addressing the performance/compatibility tradeoff," in Proc. AAAI, 2019.
- [36] K. Wu, E. Wu, K. Rodolfa, D. E. Ho, and J. Zou, "Regulating AI adaptation: An analysis of AI medical device updates," preprint arXiv:2407.16900, 2024.
- [37] T. Battelino, T. Danne, R. M. Bergenstal, S. A. Amiel, R. Beck, T. Biester, E. Bosi, B. A. Buckingham, W. T. Cefalu, K. L. Close et al., "Clinical targets for continuous glucose monitoring data interpretation: recommendations from the international consensus on time in range," Diabetes Care 42(8), 2019.
- [38] M. I. Maiorino, S. Signoriello, A. Maio, P. Chiodini, G. Bellastella, L. Scappaticcio, M. Longo, D. Giugliano, and K. Esposito, "Effects of continuous glucose monitoring on metrics of glycemic control in diabetes: a systematic review with meta-analysis of randomized controlled trials," Diabetes Care 43(5), 2020.
- [39] M. Mouri and M. Badireddy, "Hyperglycemia," in StatPearls [Online]. StatPearls Publishing, 2023.
- [40] Diabetes Care, "Standards of care in diabetes—2023," Diabetes Care 46, 2023.
- [41] PEDAP Trial Study Group, "PEDAP Public Dataset (Release 4)," retrieved from https://public.jaeb.org/dataset/599, 2024. ClinicalTrials.gov Identifier: NCT04796779.
- [42] Bionic Pancreas Research Group, "Insulin Only Bionic Pancreas Pivotal Trial (IOBP2) Dataset," retrieved from https://public.jaeb.org/dataset/579, 2019. ClinicalTrials.gov Identifier: NCT04200313.
- [43] University of Virginia, "DCLP5 Dataset," retrieved from http://public.jaeb.org/dataset/535, 2019. ClinicalTrials.gov Identifier: NCT03844789.
- [44] Jaeb Center for Health Research, "CGM Intervention in Teens and Young Adults with T1D (CITY Public Dataset)," retrieved from http://public.jaeb.org/dataset/565, 2017. ClinicalTrials.gov Identifier: NCT03263494.