Calibration, Uncertainty Communication, and Deployment Readiness in CKD Risk Prediction: A Framework Evaluation Study

Michael O. Eniolade

arxiv: 2605.21566 · v1 · pith:VUVG546Vnew · submitted 2026-05-20 · 💻 cs.LG

Pith reviewed 2026-05-22 09:34 UTC · model grok-4.3

classification 💻 cs.LG

keywords chronic kidney disease · risk prediction · calibration · conformal prediction · machine learning · external validation · deployment readiness · clinical prediction models

The pith

Five machine learning models for chronic kidney disease risk prediction reach perfect internal discrimination but lose most of their discrimination, calibration, and conformal coverage on an external hospital cohort.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Five classifiers reached AUROC of 1.00 on the internal UCI CKD test set. After isotonic recalibration their calibration error dropped near zero internally, yet the same models produced AUROC values between 0.48 and 0.58, Expected Calibration Error between 0.68 and 0.76, and conformal coverage of only 0.21 to 0.25 against a 90 percent target when applied to the MIMIC-IV demo cohort. None of the models scored higher than four out of sixteen points on the eight-criterion deployment readiness checklist. A sympathetic reader would conclude that internal discrimination performance alone supplies no warrant for clinical use and that calibration stability plus uncertainty coverage must be verified on external data before any deployment consideration.

Core claim

Near-perfect internal performance did not transfer. All five models reached AUROC 1.00 on the UCI test set. Isotonic recalibration reduced internal ECE to 0.000-0.022. On MIMIC-IV, AUROC fell to 0.48-0.58, ECE rose to 0.68-0.76, and conformal coverage dropped from 0.80-0.98 to 0.21-0.25 against a 90% target. No model scored above 4 out of 16 on the deployment readiness checklist.

What carries the argument

The distributional stress-test that applies each recalibrated model to an external cohort under prevalence shift and feature missingness, scored by Expected Calibration Error, Brier Score, split conformal prediction, and the eight-criterion deployment readiness framework.
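The two calibration metrics named here can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes the binary, positive-class form of Expected Calibration Error with ten equal-width bins, and the function names are hypothetical. The paper's binning choices are not stated in this summary.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(probs, labels, n_bins=10):
    """Equal-width binned ECE for the positive-class probability.

    Each non-empty bin contributes |observed event rate - mean predicted
    probability|, weighted by the fraction of samples that fall in the bin.
    The bin count is an assumption, not a value taken from the paper.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for p, _ in members) / len(members)
        event_rate = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(event_rate - mean_prob)
    return ece
```

For illustration only: a model that outputs 0.95 for every patient in a group where one in four has the outcome yields an ECE of about 0.70 under this definition, which shows the scale of miscalibration such a number represents.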

If this is right

Calibration stability and conformal coverage must be evaluated on external data before deployment.
Internal AUROC scores alone are insufficient evidence of clinical suitability for CKD risk models.
Models must demonstrate robustness to prevalence shifts and missing features to qualify as deployment-ready.
Uncertainty communication via conformal prediction requires external validation to be trusted in practice.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same transfer failure pattern is likely to appear in other medical prediction tasks where training and deployment populations differ in disease prevalence.
Clinical ML pipelines may need to treat external-cohort calibration testing as a required gate rather than an optional check.
Hospitals evaluating risk models could demand published external validation results on calibration and coverage as standard procurement evidence.

Load-bearing premise

The UCI CKD dataset and MIMIC-IV demo cohort are representative enough to demonstrate generalizability failure, and the eight-criterion deployment readiness framework provides a meaningful measure of clinical suitability.
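How a 16-point total over eight criteria might be tallied can be sketched as follows. The split of two points per criterion is an assumption inferred from the totals, the equal weighting follows the description in the simulated rebuttal further down, and the criterion names are placeholders because this summary does not list the actual criteria.

```python
MAX_POINTS_PER_CRITERION = 2  # assumption: 16 total points / 8 criteria


def readiness_score(criterion_scores):
    """Sum equally weighted criterion scores, each an integer from 0 to 2."""
    if len(criterion_scores) != 8:
        raise ValueError("expected exactly eight criterion scores")
    for name, points in criterion_scores.items():
        if points not in range(MAX_POINTS_PER_CRITERION + 1):
            raise ValueError(f"{name}: score must be 0, 1, or 2")
    return sum(criterion_scores.values())


# Placeholder names; the paper's actual criteria are not listed in this summary.
example = {f"criterion_{i}": 0 for i in range(1, 9)}
example["criterion_1"] = 2
example["criterion_2"] = 2
total = readiness_score(example)  # two fully met criteria, six unmet
```

Under this scheme a score of 4 could mean two criteria fully met or four partly met, which is one reason a per-criterion breakdown is more informative than the total alone.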

What would settle it

A model that maintained AUROC above 0.80, ECE below 0.10, and conformal coverage within five points of the 90 percent target when tested on an independent external cohort with different CKD prevalence would directly contradict the transfer failure claim.
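The settling condition above can be written as a single predicate. This is a sketch with a hypothetical function name; the thresholds are the ones stated in the paragraph, and a small tolerance is added only to avoid floating-point edge effects at exactly five points from the target.

```python
def contradicts_transfer_failure(auroc, ece, coverage, target_coverage=0.90):
    """True if external-cohort metrics meet all three stated thresholds."""
    return (
        auroc > 0.80
        and ece < 0.10
        and abs(coverage - target_coverage) <= 0.05 + 1e-9
    )


# Best external values reported for any model in this study.
reported_best = contradicts_transfer_failure(auroc=0.58, ece=0.68, coverage=0.25)
```

With the best reported MIMIC-IV values the predicate is false on all three conditions.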

Figures

Figures reproduced from arXiv: 2605.21566 by Michael O. Eniolade.

**Figure 2.** Reliability diagrams for SVM and NB before and after post-hoc calibration on the …

**Figure 3.** Reliability diagrams for the best-calibrated variant of each classifier on the MIMIC …

**Figure 4.** Prediction set size distribution (UCI test set). Bar charts of set size (0 = empty, …

**Figure 5.** Individual-level uncertainty display for NB (top 50 UCI test patients, sorted by …

**Figure 6.** Deployment readiness heatmap. Rows are models; columns are the eight criteria.

Original abstract

Machine learning models for chronic kidney disease (CKD) risk prediction often post strong discrimination scores on internal test sets. Calibration and uncertainty quantification get far less attention, leaving clinicians without reliable information about whether the probability outputs are accurate. We trained five classifiers on the UCI CKD dataset (400 patients, 62.5% CKD prevalence): logistic regression, random forest, XGBoost, SVM with Platt scaling, and Gaussian naive Bayes. We evaluated each across calibration quality, conformal prediction coverage, and an eight-criterion deployment readiness framework. A distributional stress-test applied the best-calibrated variant of each model to the open-access MIMIC-IV demo cohort (97 patients, 23.7% CKD) to assess behaviour under prevalence shift and feature missingness. We measured calibration before and after Platt scaling and isotonic regression using Expected Calibration Error and Brier Score, and quantified uncertainty through split conformal prediction targeting 90% marginal coverage. All five models reached AUROC 1.00 on the UCI test set. Isotonic recalibration reduced internal ECE to 0.000-0.022. On MIMIC-IV, AUROC fell to 0.48-0.58, ECE rose to 0.68-0.76, and conformal coverage dropped from 0.80-0.98 to 0.21-0.25 against a 90% target. No model scored above 4 out of 16 on the deployment readiness checklist. Near-perfect internal performance did not transfer. Calibration stability and conformal coverage should be evaluated on external data before any clinical prediction model moves toward deployment.
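Split conformal prediction for a binary classifier, as named in the abstract, can be sketched as below. This is an illustration under stated assumptions rather than the paper's code: the nonconformity score is taken to be one minus the probability assigned to the true label, and the function names are hypothetical.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha=0.10):
    """Split-conformal threshold from a held-out calibration set.

    cal_probs are predicted probabilities of the positive class (label 1).
    The nonconformity score is 1 minus the probability given to the true label.
    """
    scores = sorted(
        1.0 - (p if y == 1 else 1.0 - p) for p, y in zip(cal_probs, cal_labels)
    )
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    if rank > n:
        return 1.0  # too few calibration points: every label is included
    return scores[rank - 1]


def prediction_set(prob_pos, threshold):
    """Labels whose nonconformity score is within the threshold (size 0, 1, or 2)."""
    candidates = {1: prob_pos, 0: 1.0 - prob_pos}
    return {label for label, p in candidates.items() if 1.0 - p <= threshold}


def empirical_coverage(probs, labels, threshold):
    """Fraction of cases whose true label lies in the prediction set."""
    hits = sum(y in prediction_set(p, threshold) for p, y in zip(probs, labels))
    return hits / len(labels)
```

A set of size 0 means neither label is plausible at the chosen level, size 1 is a confident call, and size 2 means the model cannot separate the classes. Empirical coverage is then compared with the 90% target.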

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper trains five classifiers (logistic regression, random forest, XGBoost, SVM+Platt, Gaussian naive Bayes) on the UCI CKD dataset (n=400, 62.5% prevalence) and reports perfect internal discrimination (AUROC 1.00) with strong calibration after isotonic regression (ECE 0.000-0.022). It then applies the models to the MIMIC-IV demo cohort (n=97, 23.7% prevalence) as a distributional stress test, finding degraded AUROC (0.48-0.58), elevated ECE (0.68-0.76), conformal coverage collapse (0.21-0.25 vs 90% target), and no model exceeding 4/16 on an eight-criterion deployment-readiness checklist. The central claim is that near-perfect internal performance does not transfer and that calibration stability plus conformal coverage must be evaluated externally before clinical deployment.

Significance. If the empirical findings are robust, the work supplies a concrete, reproducible demonstration using open datasets that internal AUROC and calibration metrics are insufficient for assessing clinical readiness. It highlights the practical value of split conformal prediction and a structured deployment checklist under prevalence shift and missingness, reinforcing the broader literature on external validation for medical ML models.

major comments (3)

[Results, MIMIC-IV stress-test] External validation results (MIMIC-IV demo cohort): with only 97 patients and ~23 positive cases, the reported coverage drop from 0.80-0.98 to 0.21-0.25 and AUROC degradation are subject to high binomial variance and prevalence shift. No bootstrap CIs, sensitivity analyses to sample size, or power calculations are described, so the 65-point coverage collapse cannot be confidently attributed to model properties rather than finite-sample instability.
[Methods, Deployment Readiness Framework] Deployment readiness framework: the conclusion that no model scores above 4/16 rests on an eight-criterion checklist whose selection, weighting, and clinical validation are not detailed in the methods or appendix. Without this, the low scores are difficult to interpret as a load-bearing measure of deployment suitability.
[Methods, Uncertainty Quantification] Conformal prediction implementation: split conformal prediction is applied targeting 90% marginal coverage, yet the external calibration set size (subset of n=97) is small enough that finite-sample deviations are expected. The manuscript should quantify how the observed coverage relates to the theoretical guarantee under the reported prevalence shift.
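The bootstrap intervals requested in the first comment could be computed along these lines. This is a minimal percentile-bootstrap sketch with hypothetical names. The 1000 resamples match the figure given in the simulated rebuttal, while the example data are invented to land inside the reported coverage range and do not come from the paper.

```python
import random


def bootstrap_ci(values, statistic, n_resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for a statistic of per-patient values."""
    rng = random.Random(seed)
    n = len(values)
    estimates = sorted(
        statistic([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    tail = (1.0 - level) / 2.0
    lower = estimates[int(tail * n_resamples)]
    upper = estimates[min(n_resamples - 1, int((1.0 - tail) * n_resamples))]
    return lower, upper


def mean(xs):
    return sum(xs) / len(xs)


# Hypothetical data: 1 = true label covered by the prediction set, 0 = missed.
# 22 hits out of 97 gives coverage of about 0.23, inside the reported range.
covered = [1] * 22 + [0] * 75
low, high = bootstrap_ci(covered, mean)
```

The same function works for AUROC or ECE if `values` holds (probability, label) pairs and `statistic` computes the metric from a resampled list.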

minor comments (2)

[Abstract] Abstract and results tables would benefit from explicit enumeration of the eight deployment-readiness criteria rather than a summary score alone.
[Methods, Data Preprocessing] Feature missingness handling during the MIMIC-IV stress-test is mentioned but not quantified (e.g., percentage of missing values per feature or imputation strategy).
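The per-feature missingness figures requested in the second minor comment are straightforward to report. The sketch below uses hypothetical records and illustrative feature names that are not taken from the paper; a value counts as missing when it is `None` or the key is absent.

```python
def missingness_by_feature(rows, features):
    """Percentage of records in which each feature is missing (None or absent)."""
    n = len(rows)
    return {
        f: 100.0 * sum(1 for r in rows if r.get(f) is None) / n
        for f in features
    }


# Hypothetical records; feature names are illustrative, not taken from the paper.
rows = [
    {"serum_creatinine": 1.2, "hemoglobin": 13.5},
    {"serum_creatinine": 3.4, "hemoglobin": None},
    {"serum_creatinine": None},
    {"serum_creatinine": 0.9, "hemoglobin": 12.1},
]
rates = missingness_by_feature(rows, ["serum_creatinine", "hemoglobin"])
```

Reporting this table alongside the imputation strategy would let readers judge how much of the external degradation could stem from absent inputs.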

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their detailed and constructive review. The comments highlight important statistical and methodological considerations that we have addressed through revisions and additional analyses. We respond to each major comment below.

Referee: [Results, MIMIC-IV stress-test] External validation results (MIMIC-IV demo cohort): with only 97 patients and ~23 positive cases, the reported coverage drop from 0.80-0.98 to 0.21-0.25 and AUROC degradation are subject to high binomial variance and prevalence shift. No bootstrap CIs, sensitivity analyses to sample size, or power calculations are described, so the 65-point coverage collapse cannot be confidently attributed to model properties rather than finite-sample instability.

Authors: We agree that the small MIMIC-IV demo cohort size (n=97, ~23 positives) introduces substantial sampling variability and that the observed metric degradations must be interpreted with caution. In the revised manuscript we have added bootstrap confidence intervals (1000 resamples) for AUROC, ECE, Brier score, and conformal coverage on the external cohort. We also performed a sensitivity analysis by repeatedly subsampling the UCI internal test set to n=97 while preserving prevalence, which reproduced qualitatively similar drops in performance and coverage. A brief power discussion has been added to the limitations section acknowledging that larger external cohorts would be needed for more precise estimates.
revision: yes

Referee: [Methods, Deployment Readiness Framework] Deployment readiness framework: the conclusion that no model scores above 4/16 rests on an eight-criterion checklist whose selection, weighting, and clinical validation are not detailed in the methods or appendix. Without this, the low scores are difficult to interpret as a load-bearing measure of deployment suitability.

Authors: We acknowledge that the eight-criterion deployment-readiness checklist was insufficiently documented. The criteria were derived from a synthesis of clinical ML deployment literature (e.g., guidelines on data quality, interpretability, fairness, regulatory readiness, and prospective validation). In the revised Methods section we now explicitly list each criterion, describe the equal-weight scoring scheme, and provide references. An expanded table in the appendix shows per-model scores with justifications. We have also added a limitations paragraph noting that the checklist is a pragmatic starting point rather than a clinically validated instrument.
revision: yes

Referee: [Methods, Uncertainty Quantification] Conformal prediction implementation: split conformal prediction is applied targeting 90% marginal coverage, yet the external calibration set size (subset of n=97) is small enough that finite-sample deviations are expected. The manuscript should quantify how the observed coverage relates to the theoretical guarantee under the reported prevalence shift.

Authors: We appreciate the reminder that the marginal coverage guarantee of split conformal prediction assumes exchangeability, which is violated by the prevalence shift and missingness pattern between UCI and MIMIC-IV. We have added a dedicated subsection in Methods that (i) restates the theoretical guarantee, (ii) notes its inapplicability under distribution shift, and (iii) quantifies the empirical coverage deviation relative to the 90% target. We further discuss how the small external calibration subset contributes to finite-sample variability and include a short simulation study (reported in the appendix) showing expected coverage under similar shift magnitudes.
revision: yes
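The guarantee referred to in this response can be stated compactly. The form below is the standard split-conformal result from the general literature, not a formula quoted from the paper. Here n denotes the number of calibration points, C is the prediction-set function, and α = 0.10 corresponds to the 90% target.

```latex
1 - \alpha \;\le\; \Pr\bigl(Y_{n+1} \in C(X_{n+1})\bigr) \;\le\; 1 - \alpha + \frac{1}{n+1}
```

The lower bound requires the calibration points and the test point to be exchangeable; the upper bound additionally assumes the nonconformity scores are almost surely distinct. When the threshold is calibrated on one cohort and applied to another with different prevalence and missingness, exchangeability fails and neither bound need hold.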

Circularity Check

0 steps flagged

No significant circularity: purely empirical evaluation study

The manuscript is an empirical evaluation study that trains classifiers on the UCI CKD dataset, measures internal performance metrics (AUROC, ECE, Brier score, conformal coverage), applies the models to the external MIMIC-IV demo cohort, and scores them against an eight-criterion deployment-readiness checklist. All reported results are direct computations on held-out data; no derivations, parameter fits presented as predictions, or self-citation chains are used to support the central claims. The observed performance drop and low deployment scores follow immediately from the tabulated measurements without any reduction to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on assumptions about dataset representativeness and the validity of chosen calibration and readiness metrics rather than new parameters or entities.

axioms (2)

domain assumption: The UCI CKD dataset with 400 patients is appropriate for training and internal testing of risk models.
    Primary source for model development and internal metrics.
domain assumption: The MIMIC-IV demo cohort with 97 patients provides a valid test of prevalence shift and missingness.
    Used for distributional stress-test.
