An empirical evaluation of the risks of AI model updates using clinical data: stability, arbitrariness, and fairness
Pith reviewed 2026-05-08 03:54 UTC · model grok-4.3
The pith
Updating clinical AI models can cause predictions to flip for many cases, increase arbitrariness, and worsen fairness across patient groups.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When models predicting severe hyperglycemia are retrained on new weekly observations from four U.S. Type 1 diabetes datasets, different update strategies produce substantial prediction flips, greater arbitrariness, degraded accuracy equity, and unbalanced error rates across subpopulations defined by sociodemographic variables. The authors therefore propose multiple dimensions for continuous monitoring and state that such monitoring is required for trustworthy clinical decision support systems.
What carries the argument
The empirical testbed that applies multiple model-update strategies to high-resolution CGM data and measures stability via prediction flips, arbitrariness via output variability, and fairness via accuracy equity and error-rate balance across subgroups.
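The paper's exact metric definitions are not reproduced on this page, but the three quantities named above admit simple illustrative forms. A minimal sketch, assuming binary predictions, an ensemble of equally plausible retrained models for the arbitrariness measure, and subgroup accuracy gap as the accuracy-equity proxy (all function names are our own, not the authors'):

```python
import numpy as np

def flip_rate(preds_old, preds_new):
    """Stability: fraction of cases whose binary prediction changes after an update."""
    preds_old, preds_new = np.asarray(preds_old), np.asarray(preds_new)
    return float(np.mean(preds_old != preds_new))

def arbitrariness(pred_matrix):
    """Arbitrariness: mean per-case disagreement p * (1 - p) across an
    ensemble of retrained models, where p is the fraction of models
    predicting the positive class.  pred_matrix: (n_models, n_cases)."""
    p = np.mean(np.asarray(pred_matrix), axis=0)
    return float(np.mean(p * (1.0 - p)))

def accuracy_equity_gap(y_true, y_pred, groups):
    """Fairness proxy: largest minus smallest subgroup accuracy."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    accs = [np.mean(y_pred[groups == g] == y_true[groups == g])
            for g in np.unique(groups)]
    return float(max(accs) - min(accs))
```

Error-rate balance can be measured the same way with subgroup false-positive or false-negative rates in place of accuracy.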
Load-bearing premise
The risks seen in these four diabetes datasets and the tested update strategies will appear in other clinical prediction tasks and real-world deployments.
What would settle it
A new study that applies the same update strategies to an independent clinical dataset and finds no increase in prediction flips, arbitrariness, or fairness degradation would falsify the claim.
Original abstract
Artificial Intelligence and Machine Learning (AI/ML) models used in clinical settings are increasingly deployed to support clinical decision-making. However, when training data become stale due to changes in demographics, environment, or patient behaviors, model performance can degrade substantially. While updating models with new training data is necessary, such updates may also introduce new risks. We evaluated the proposed monitoring framework on four publicly available U.S.-based Type 1 Diabetes datasets containing high-resolution continuous glucose monitoring (CGM) data, comprising approximately 11,300 weekly observations from 496 participants under 20 years of age. All datasets included structured sociodemographic information. Using the prediction of severe hyperglycemia events in children with type 1 diabetes as a case study, we examine how different model update strategies can adversely affect model stability (e.g., by causing predictions to "flip" for a large number of cases after an update), increase arbitrariness in predictions, or worsen accuracy equity and the balance of error rates across subpopulations. We propose multiple dimensions for continuous monitoring to detect these issues and argue that such monitoring is essential for the development of trustworthy clinical decision support systems.
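The abstract's prediction target, severe hyperglycemia from weekly CGM observations, can be made concrete with a labeling sketch. The paper's exact event definition is not stated on this page; the cutoff and fraction threshold below are illustrative, with 250 mg/dL being a common "level 2" hyperglycemia cutoff in CGM consensus guidance:

```python
import numpy as np

SEVERE_HYPER_MGDL = 250  # common CGM level-2 hyperglycemia cutoff;
                         # the paper's exact definition is not given here

def weekly_label(glucose_mgdl, frac_threshold=0.05):
    """Label a week of CGM readings 1 (severe hyperglycemia event) if the
    fraction of readings above the cutoff exceeds frac_threshold.
    Both thresholds are illustrative assumptions, not the authors' values."""
    readings = np.asarray(glucose_mgdl, dtype=float)
    frac_above = np.mean(readings > SEVERE_HYPER_MGDL)
    return int(frac_above > frac_threshold)
```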
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents an empirical evaluation of risks associated with updating AI/ML models in clinical settings, using prediction of severe hyperglycemia events in pediatric Type 1 diabetes as a case study. It analyzes four publicly available U.S.-based CGM datasets comprising ~11,300 weekly observations from 496 participants under 20, incorporating sociodemographic data, and examines how different model update strategies affect stability (e.g., prediction flips), arbitrariness, and fairness/equity across subpopulations. The authors propose multiple dimensions for continuous monitoring to detect these issues and argue that such monitoring is essential for trustworthy clinical decision support systems.
Significance. If the empirical findings hold, the work provides concrete, data-driven evidence of practical risks in model updating for clinical AI, particularly stability and equity concerns in a high-stakes pediatric diabetes context with real CGM and sociodemographic data. This strengthens the case for proactive monitoring frameworks and could inform deployment practices, though the focused scope on one prediction task limits immediate broad impact without further validation.
minor comments (3)
- Abstract: While the setup and dataset sizes are described, the abstract would benefit from a brief mention of key quantitative outcomes (e.g., magnitude of prediction flips or fairness metric changes) to better convey the strength of the observed risks.
- Methods section: The specific machine learning models, hyperparameters, and exact definitions/thresholds used to quantify 'arbitrariness' and 'prediction flips' should be stated explicitly with pseudocode or equations for reproducibility.
- Results: Tables or figures comparing update strategies to a no-update baseline would clarify whether the reported adverse effects are attributable to the updates themselves rather than baseline variability.
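The baseline comparison requested in the last comment can be sketched with synthetic predictions: retrain twice on the same data to estimate seed-only flip variability, then attribute only the excess flip rate to the update itself. All numbers below are hypothetical, chosen purely to illustrate the comparison:

```python
import numpy as np

rng = np.random.default_rng(0)

def flip_rate(a, b):
    """Fraction of cases whose prediction differs between two model versions."""
    return float(np.mean(np.asarray(a) != np.asarray(b)))

# Hypothetical binary predictions on a fixed evaluation set of 1000 cases:
#   base_a, base_b: two retrainings on the SAME data (seed-only variability)
#   updated:        retraining after adding a new week of data
base_a = rng.integers(0, 2, 1000)
base_b = np.where(rng.random(1000) < 0.02, 1 - base_a, base_a)   # ~2% seed noise
updated = np.where(rng.random(1000) < 0.10, 1 - base_a, base_a)  # ~10% flips

# Only the excess over the seed-noise baseline is attributable to the update.
excess = flip_rate(base_a, updated) - flip_rate(base_a, base_b)
```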
Simulated Author's Rebuttal
We thank the referee for their positive summary of the manuscript, recognition of its empirical contributions on model update risks in clinical AI, and recommendation for minor revision. We appreciate the acknowledgment that the findings provide concrete evidence for stability, arbitrariness, and fairness concerns in a pediatric diabetes context.
Circularity Check
Purely empirical study with no derivation chain or fitted predictions
full rationale
The paper is an empirical evaluation of model-update risks on four pediatric T1D CGM datasets. It reports observed effects on stability, arbitrariness, and fairness under different update strategies and proposes monitoring dimensions based on those observations. No mathematical derivations, first-principles predictions, or equations are present; claims rest directly on measured outcomes from public data rather than any reduction to fitted parameters or self-citation chains. The analysis is therefore self-contained with no load-bearing steps that collapse to inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- model hyperparameters and update thresholds
axioms (1)
- domain assumption: The four U.S.-based Type 1 Diabetes datasets with CGM and sociodemographic data are representative of real clinical populations and data shifts.
Reference graph
Works this paper leans on
- [1] D. Vela, A. Sharp, R. Zhang, T. Nguyen, A. Hoang, and O. S. Pianykh, "Temporal quality degradation in AI models," Sci. Rep. 12(1), 2022.
- [2] J. Hatherley, "A moving target in AI-assisted decision-making: Dataset shift, model updating, and the problem of update opacity," Ethics and Inf. Tech. 27(2), 2025.
- [3] J. Feng, R. V. Phillips, I. Malenica, A. Bishara, A. E. Hubbard, L. A. Celi, and R. Pirracchio, "Clinical artificial intelligence quality improvement: towards continual monitoring and updating of AI algorithms in healthcare," NPJ Digital Med. 5(1), 2022.
- [4] A. Khoshravan Azar, B. Draghi, Y. Rotalinti, P. Myles, and A. Tucker, "The impact of bias on drift detection in AI health software," in Proc. AIME. Springer, 2023.
- [5] M. Mermillod, A. Bugaiska, and P. Bonin, "The stability-plasticity dilemma: Investigating the continuum from catastrophic forgetting to age-limited learning effects," 2013.
- [6] G. M. van de Ven, N. Soures, and D. Kudithipudi, "Continual learning and catastrophic forgetting," preprint arXiv:2403.05175, 2024.
- [7] L. M. Meijerink, Z. S. Dunias, A. M. Leeuwenberg, A. A. de Hond, D. A. Jenkins, G. P. Martin, M. Sperrin, N. Peek, R. Spijker, L. Hooft et al., "Updating methods for artificial intelligence–based clinical prediction models: a scoping review," J. of Clinical Epidemiology 178, 2025.
- [8] I. Zhang and D. Rothenhäusler, "Data quality or data quantity? prioritizing data collection under distribution shift with the data usefulness coefficient," preprint arXiv:2504.06570, 2025.
- [9] S. Patchipala, "Tackling data and model drift in AI: Strategies for maintaining accuracy during ML model inference," Int. J. of Sci. and Research 10(2), 2023.
- [10] V. Subasri, A. Krishnan, A. Kore, A. Dhalla, D. Pandya, B. Wang, D. Malkin, F. Razak, A. A. Verma, A. Goldenberg et al., "Detecting and remediating harmful data shifts for the responsible deployment of clinical AI models," JAMA Network Open 8(6), 2025.
- [11] M. Casimiro, P. Romano, D. Garlan, and L. Rodrigues, "Towards a framework for adapting machine learning components," in Proc. ACSOS. IEEE, 2022.
- [12] R. Florence, S. Leo, S. Kyle, C. Mark, and M. Thomas, "When to retrain a machine learning model," preprint arXiv:2505.14903, 2025.
- [13] B.-Q. Wei, J.-J. Chen, Y.-C. Tseng, and P.-T. P. Kuo, "Representative data selection for efficient medical incremental learning," in Proc. EMBC. IEEE, 2023.
- [14] J. H. Shen, I. D. Raji, and I. Y. Chen, "The data addition dilemma," preprint arXiv:2408.04154, 2024.
- [15] S. Dissanayake, R. Krishna, P. N. Pathirana, M. K. Horne, D. J. Smulewicz, and L. A. Corben, "Continuous optimization of a hierarchical bayesian network for Friedreich's ataxia severity classification," in Proc. EMBC. IEEE, 2024.
- [16] V. E. Leonard et al., "Mitigating catastrophic forgetting in medical imaging via incremental learning," Procedia Comp. Sci. 269, 2025.
- [17] H. Guan, D. Bates, and L. Zhou, "Keeping medical AI healthy and trustworthy: A review of detection and correction methods for system degradation," IEEE Transactions on Biomedical Engineering, 2026.
- [18] M. Emmanuel and A. Rajuroy, "Adaptive self-healing AI models for enhancing selfi performance with real-time retraining detection."
- [19] S. E. Davis, C. G. Walsh, and M. E. Matheny, "Open questions and research gaps for monitoring and updating AI-enabled tools in clinical settings," Frontiers in Digital Health 4, 2022.
- [20] J. Y. Kim, A. Hasan, K. C. Kellogg, W. Ratliff, S. G. Murray, H. Suresh, A. Valladares, K. Shaw, D. Tobey, D. E. Vidal et al., "Development and preliminary testing of health equity across the AI lifecycle (HEAAL): A framework for healthcare delivery organizations to mitigate the risk of AI solutions worsening health inequities," PLOS Digital Health 3(5), 2024.
- [21] M. H. Rahman, M. D. Hossain, K. M. R. Hossan, M. K. S. Uddin, A. A. Zaiem, and M. B. Ullah, "Adaptive fairness in continuous learning AI healthcare systems: Frameworks for dynamic equity alignment," preprint SSRN 5437834, 2025.
- [22] M. Ceccon, D. Dalle Pezze, A. Fabris, and G. A. Susto, "Fairness evolution in continual learning for medical imaging," in Proc. IFAC. Elsevier, 2025.
- [23] P. Ganesh, A. Taik, and G. Farnadi, "The curious case of arbitrariness in machine learning," preprint arXiv:2501.14959, 2025.
- [24] D. Bertsimas, V. Digalakis Jr, Y. Ma, and P. Paschalidis, "Towards stable machine learning model retraining via slowly varying sequences," preprint arXiv:2403.19871, 2024.
- [25] D. Ueda, T. Kakinuma, S. Fujita, K. Kamagata, Y. Fushimi, R. Ito, Y. Matsui, T. Nozaki, T. Nakaura, N. Fujima et al., "Fairness of artificial intelligence in healthcare: review and recommendations," Japanese J. of Radiology 42(1), 2024.
- [26] X. Wang and M. Yin, "Watch out for updates: Understanding the effects of model explanation updates in AI-assisted decision making," in Proc. CHI, 2023, pp. 1–19.
- [27] A.-F. Näher, I. Krumpal, E.-M. Antão, E. Ong, M. Rojo, F. Kaggwa, F. Balzer, L. A. Celi, K. Braune, L. H. Wieler et al., "Measuring fairness preferences is important for artificial intelligence in health care," The Lancet Digital Health 6(5), 2024.
- [28] I. Bilionis, R. C. Berrios, L. Fernandez-Luque, and C. Castillo, "Disparate model performance and stability in machine learning clinical support for diabetes and heart diseases," in Proc. AMIA, 2025.
- [29] J. K. Paulus and D. M. Kent, "Predictably unequal: understanding and addressing concerns that algorithmic clinical prediction may increase health disparities," NPJ Digital Medicine 3(1), 2020.
- [30] S. Jain, M. Wang, K. Creel, and A. Wilson, "Allocation multiplicity: Evaluating the promises of the Rashomon set," in Proc. FAccT, 2025.
- [31] G. Dai, P. Ravishankar, R. Yuan, E. Black, and D. B. Neill, "Be intentional about fairness!: Fairness, size, and multiplicity in the Rashomon set," in Proc. EAAMO, 2025.
- [32] K. Kobylińska, M. Krzyziński, R. Machowicz, M. Adamek, and P. Biecek, "Exploration of the Rashomon set assists trustworthy explanations for medical data," IEEE J. of Biomed. and Health Inf. 28(11), 2024.
- [33] A. F. Cooper, K. Lee, M. Z. Choksi, S. Barocas, C. De Sa, J. Grimmelmann, J. Kleinberg, S. Sen, and B. Zhang, "Arbitrariness and social prediction: The confounding role of variance in fair classification," in Proc. AAAI, 2024.
- [34] A. N. Angelopoulos and S. Bates, "A gentle introduction to conformal prediction and distribution-free uncertainty quantification," preprint arXiv:2107.07511, 2021.
- [35] G. Bansal, B. Nushi, E. Kamar, D. S. Weld, W. S. Lasecki, and E. Horvitz, "Updates in human-AI teams: Understanding and addressing the performance/compatibility tradeoff," in Proc. AAAI, 2019.
- [36] K. Wu, E. Wu, K. Rodolfa, D. E. Ho, and J. Zou, "Regulating AI adaptation: An analysis of AI medical device updates," preprint arXiv:2407.16900, 2024.
- [37] T. Battelino, T. Danne, R. M. Bergenstal, S. A. Amiel, R. Beck, T. Biester, E. Bosi, B. A. Buckingham, W. T. Cefalu, K. L. Close et al., "Clinical targets for continuous glucose monitoring data interpretation: recommendations from the international consensus on time in range," Diabetes Care 42(8), 2019.
- [38] M. I. Maiorino, S. Signoriello, A. Maio, P. Chiodini, G. Bellastella, L. Scappaticcio, M. Longo, D. Giugliano, and K. Esposito, "Effects of continuous glucose monitoring on metrics of glycemic control in diabetes: a systematic review with meta-analysis of randomized controlled trials," Diabetes Care 43(5), 2020.
- [39] M. Mouri and M. Badireddy, "Hyperglycemia," in StatPearls [Online]. StatPearls Publishing, 2023.
- [40] Diabetes Care, "Standards of care in diabetes—2023," Diabetes Care 46, 2023.
- [41] PEDAP Trial Study Group, "PEDAP Public Dataset (Release 4)," retrieved from https://public.jaeb.org/dataset/599, 2024. ClinicalTrials.gov Identifier: NCT04796779.
- [42] Bionic Pancreas Research Group, "Insulin Only Bionic Pancreas Pivotal Trial (IOBP2) Dataset," retrieved from https://public.jaeb.org/dataset/579, 2019. ClinicalTrials.gov Identifier: NCT04200313.
- [43] University of Virginia, "DCLP5 Dataset," retrieved from http://public.jaeb.org/dataset/535, 2019. ClinicalTrials.gov Identifier: NCT03844789.
- [44] Jaeb Center for Health Research, "CGM Intervention in Teens and Young Adults with T1D (CITY Public Dataset)," retrieved from http://public.jaeb.org/dataset/565, 2017. ClinicalTrials.gov Identifier: NCT03263494.