pith. machine review for the scientific record.

arxiv: 2604.04698 · v1 · submitted 2026-04-06 · 💻 cs.LG · cs.CV

Explainable Machine Learning for Sepsis Outcome Prediction Using a Novel Romanian Electronic Health Record Dataset

Pith reviewed 2026-05-10 19:07 UTC · model grok-4.3

classification 💻 cs.LG cs.CV
keywords sepsis · machine learning · explainable AI · electronic health records · eosinopenia · outcome prediction · SHAP

The pith

Machine learning models on a new Romanian EHR dataset predict sepsis outcomes at up to 0.983 AUC and flag eosinopenia as a key signal.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper trains five explainable machine learning models on a dataset of 12,286 hospitalizations that includes demographics, ICD-10 codes, and hundreds of laboratory tests. It evaluates three binary classification tasks that separate sepsis patients by final status and measures performance with accuracy and AUC while using SHAP to surface which inputs drive the predictions. The strongest results appear when distinguishing deceased from recovered patients, where cardiovascular comorbidities, urea, liver enzymes, platelet counts, and eosinophil percentage rank highest. Eosinopenia stands out as an underused marker that current clinical standards overlook. The authors conclude that the attained performance levels support moving these models toward clinical use.

Core claim

Models trained on the Romanian sepsis EHR dataset reach AUC 0.983 and accuracy 0.93 for the deceased-versus-recovered task; SHAP explanations consistently rank eosinophil percentage among the top predictors alongside cardiovascular comorbidities, urea, aspartate aminotransferase, and platelet count.

What carries the argument

SHAP explanations applied to the trained models to rank the contribution of laboratory values and comorbidities to outcome predictions.
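
As a rough illustration of what a SHAP-style ranking measures, the sketch below computes exact Shapley values for a toy three-feature risk score by enumerating feature coalitions. The model `toy_risk`, the feature names, the baseline, and the patient values are all hypothetical and are not taken from the paper; the paper's own SHAP computation on its trained classifiers is not reproduced here.

```python
from itertools import combinations
from math import factorial

FEATURES = ["urea", "platelets", "eosinophil_pct"]  # hypothetical feature names


def toy_risk(x):
    """Hypothetical linear risk score; NOT the paper's model."""
    return 0.02 * x["urea"] - 0.001 * x["platelets"] - 0.3 * x["eosinophil_pct"]


def shapley_values(model, x, baseline):
    """Exact Shapley values; features outside a coalition take baseline values."""
    n = len(FEATURES)
    phi = {}
    for f in FEATURES:
        others = [g for g in FEATURES if g != f]
        total = 0.0
        for size in range(n):
            for coalition in combinations(others, size):
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                with_f = {g: x[g] if (g in coalition or g == f) else baseline[g]
                          for g in FEATURES}
                without_f = {g: x[g] if g in coalition else baseline[g]
                             for g in FEATURES}
                total += weight * (model(with_f) - model(without_f))
        phi[f] = total
    return phi


# Hypothetical reference patient and patient of interest.
baseline = {"urea": 40.0, "platelets": 250.0, "eosinophil_pct": 2.0}
patient = {"urea": 120.0, "platelets": 90.0, "eosinophil_pct": 0.0}

phi = shapley_values(toy_risk, patient, baseline)
# Rank features by absolute contribution, as a SHAP summary plot would.
ranking = sorted(phi, key=lambda f: abs(phi[f]), reverse=True)
```

The contributions sum to the difference between the patient's score and the baseline score, which is the additivity property that makes such rankings comparable across features. Exact enumeration is only feasible for a handful of features; practical SHAP implementations approximate it.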

If this is right

  • Eosinophil percentage could be added to existing sepsis risk scores because it ranks as a strong predictor here.
  • High internal AUC on the deceased-versus-recovered task indicates the models may be ready for prospective clinical testing.
  • Limiting input features to the 10–50 most frequent lab tests still preserves useful performance while increasing the number of usable patient records.
  • Explainable outputs help clinicians see why a prediction is made rather than treating the model as a black box.
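
The trade-off between feature richness and patient coverage can be sketched as follows. The records are invented, and the sketch assumes a patient is usable only when every selected test is present (a complete-case rule); the paper may handle missing tests differently.

```python
from collections import Counter

# Hypothetical records: patient id -> set of lab tests with at least one result.
records = {
    "p1": {"urea", "platelets", "ast", "eosinophil_pct"},
    "p2": {"urea", "platelets", "ast"},
    "p3": {"urea", "platelets"},
    "p4": {"urea", "eosinophil_pct"},
    "p5": {"urea", "platelets", "ast", "eosinophil_pct", "crp"},
}


def top_k_tests(records, k):
    """Return the k most frequently ordered tests."""
    counts = Counter(t for tests in records.values() for t in tests)
    # Sort by descending frequency, then by name, so ties resolve deterministically.
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    return ordered[:k]


def coverage(records, k):
    """Number of patients who have every one of the k most frequent tests."""
    chosen = set(top_k_tests(records, k))
    return sum(1 for tests in records.values() if chosen <= tests)


coverage_by_k = {k: coverage(records, k) for k in (1, 2, 3, 4)}
```

On this toy data, each additional required test removes one patient from the usable cohort, which is the tension the 10–50 test subsets are meant to explore.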

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Repeating the same SHAP analysis on datasets from other countries would show whether eosinopenia remains important outside the Romanian population.
  • Combining the top SHAP features with established scores such as SOFA might yield a hybrid rule that improves calibration without losing interpretability.
  • Deploying the models in real-time EHR dashboards would allow measurement of whether clinicians actually change decisions when shown the explanations.

Load-bearing premise

That models trained and tested on internal splits of data from one Romanian hospital will perform similarly on patients from other hospitals or regions.

What would settle it

Retraining or testing the reported models on an independent sepsis EHR collection from a different hospital system yields AUC well below 0.9 for the same classification tasks.
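
A minimal sketch of that check, assuming predicted risk scores are available for an independent cohort. AUC is computed here as the fraction of positive-negative pairs that the scores order correctly; all labels and scores below are invented, and the 0.9 threshold simply mirrors the sentence above.

```python
def auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Tied scores count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = deceased) and model scores for two cohorts.
internal_auc = auc([1, 1, 1, 0, 0, 0], [0.9, 0.8, 0.7, 0.3, 0.2, 0.1])
external_auc = auc([1, 1, 1, 0, 0, 0], [0.9, 0.4, 0.35, 0.6, 0.3, 0.1])

# The transportability premise would be in doubt if this is True.
fails_external_check = external_auc < 0.9
```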

Figures

Figures reproduced from arXiv: 2604.04698 by Andrei-Alexandru Bunea, Dan-Matei Popovici, Ion Daniel, Octavian Andronic, Ovidiu Ghibea.

Figure 1: Descriptive statistics of the dataset: (a) age distribution, (b) sex distribution, and ...
Figure 2: AUC comparison of individual classifiers for Task 1: Deceased vs. Discharged.
Figure 3: Model performance comparison (AUC) for Deceased vs. Recovered patients across ...
Figure 4: Model performance comparison (AUC) for Recovered vs. Ameliorated patients ...
Figure 5: SHAP summary plot for Deceased vs. Discharged top classifier (HistGB model ...

Original abstract

We develop and analyze explainable machine learning (ML) models for sepsis outcome prediction using a novel Electronic Health Record (EHR) dataset from 12,286 hospitalizations at a large emergency hospital in Romania. The dataset includes demographics, International Classification of Diseases (ICD-10) diagnostics, and 600 types of laboratory tests. This study aims to identify clinically strong predictors while achieving state-of-the-art results across three classification tasks: (1) deceased vs. discharged, (2) deceased vs. recovered, and (3) recovered vs. ameliorated. We trained five ML models to capture complex distributions while preserving clinical interpretability. Experiments explored the trade-off between feature richness and patient coverage, using subsets of the 10–50 most frequent laboratory tests. Model performance was evaluated using accuracy and area under the curve (AUC), and explainability was assessed using SHapley Additive exPlanations (SHAP). The highest performance was obtained for the deceased vs. recovered case study (AUC=0.983, accuracy=0.93). SHAP analysis identified several strong predictors such as cardiovascular comorbidities, urea levels, aspartate aminotransferase, platelet count, and eosinophil percentage. Eosinopenia emerged as a top predictor, highlighting its value as an underutilized marker that is not included in current assessment standards, while the high performance suggests the applicability of these models in clinical settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces a novel EHR dataset from 12,286 sepsis hospitalizations at a single Romanian hospital, including demographics, ICD-10 codes, and 600 lab tests. It trains five ML models on three binary tasks (deceased vs. discharged, deceased vs. recovered, recovered vs. ameliorated) using subsets of the 10-50 most frequent labs, evaluates with accuracy and AUC, and applies SHAP to identify predictors such as cardiovascular comorbidities, urea, AST, platelets, and eosinophil percentage. Peak results are AUC 0.983 and accuracy 0.93 on deceased vs. recovered, with the conclusion that eosinopenia is a valuable underutilized marker and that the models suggest clinical applicability.

Significance. A new public or shareable sepsis EHR dataset from an under-represented region combined with SHAP-based interpretability is potentially valuable for the community. If the reported internal performance proves robust under proper validation, the identification of eosinophil percentage as a top predictor could prompt re-examination of current sepsis scoring systems. However, the single-center design and missing methodological safeguards limit the strength of any claim to immediate clinical utility.

major comments (3)
  1. [Abstract] Abstract: The headline AUC of 0.983 and accuracy of 0.93 for the deceased-vs-recovered task are presented without any description of the cross-validation procedure, missing-data strategy, class-imbalance correction, or whether the selection of the 10–50 most frequent laboratory tests occurred inside or outside the CV loop. This information is required to assess whether the metrics are optimistically biased.
  2. [Abstract] Abstract and Discussion: The statement that the high performance 'suggests the applicability of these models in clinical settings' rests entirely on internal performance within one hospital’s 12k-record cohort. No external validation cohort, temporal hold-out across years, or multi-center test is reported, leaving the transportability of the learned boundaries (and therefore the clinical-applicability claim) unsupported.
  3. [Methods] Methods (feature selection and SHAP): The trade-off experiments that retain only the most frequent labs and the subsequent SHAP ranking of eosinophil percentage are load-bearing for the paper’s interpretability contribution, yet no details are given on hyperparameter tuning, leakage prevention, or stability of the SHAP rankings across different feature-subset sizes.
minor comments (1)
  1. [Abstract] Abstract: The three tasks are labeled (1) deceased vs. discharged, (2) deceased vs. recovered, and (3) recovered vs. ameliorated. Clarifying the clinical distinction between 'discharged' and 'recovered' (and whether these labels are mutually exclusive) would prevent reader confusion.
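
The first major comment turns on whether feature selection happens inside or outside the cross-validation loop. The sketch below shows the inside-the-loop variant, where the most frequent tests are chosen from the training folds only. The fold assignment, helper names, and data are hypothetical and do not describe the paper's actual pipeline.

```python
import random
from collections import Counter


def stratified_folds(labels, n_folds, seed=0):
    """Assign each index to a fold so that every class is spread evenly."""
    rng = random.Random(seed)
    fold_of = [0] * len(labels)
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            fold_of[i] = pos % n_folds
    return fold_of


def most_frequent_tests(rows, k):
    """Return the k tests that appear in the most rows (ties broken by name)."""
    counts = Counter(t for tests in rows for t in tests)
    return sorted(counts, key=lambda t: (-counts[t], t))[:k]


def per_fold_selection(tests_per_patient, labels, k, n_folds=5):
    """Yield (held-out fold, tests chosen from the training folds only)."""
    fold_of = stratified_folds(labels, n_folds)
    for fold in range(n_folds):
        train_rows = [t for t, f in zip(tests_per_patient, fold_of) if f != fold]
        yield fold, most_frequent_tests(train_rows, k)


# Hypothetical cohort: 10 deceased (1), 10 recovered (0).
labels = [1] * 10 + [0] * 10
tests_per_patient = [{"urea", "platelets"} if i % 2 else {"urea", "ast"}
                     for i in range(20)]
selections = dict(per_fold_selection(tests_per_patient, labels, k=2))
```

Frequency counting does not look at outcome labels, so selecting on the full cohort may bias the metrics less than a label-dependent selection would. Reporting which variant was used still lets readers judge that for themselves.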

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments have helped us identify areas where the manuscript can be clarified and strengthened. Below we provide point-by-point responses to the major comments. We have revised the manuscript to incorporate additional methodological details and to moderate claims of clinical applicability in light of the single-center design.

  1. Referee: [Abstract] Abstract: The headline AUC of 0.983 and accuracy of 0.93 for the deceased-vs-recovered task are presented without any description of the cross-validation procedure, missing-data strategy, class-imbalance correction, or whether the selection of the 10–50 most frequent laboratory tests occurred inside or outside the CV loop. This information is required to assess whether the metrics are optimistically biased.

    Authors: We agree that these details are essential for assessing potential bias. The revised manuscript now includes an expanded Methods section describing the 5-fold stratified cross-validation, median imputation for missing laboratory values, and class-weighting to address imbalance. Feature selection of the most frequent labs was performed on the full cohort prior to cross-validation to ensure adequate patient coverage across subsets; we now explicitly state this choice and discuss its implications for potential optimistic bias. A brief summary of the validation procedure has also been added to the abstract. revision: yes

  2. Referee: [Abstract] Abstract and Discussion: The statement that the high performance 'suggests the applicability of these models in clinical settings' rests entirely on internal performance within one hospital’s 12k-record cohort. No external validation cohort, temporal hold-out across years, or multi-center test is reported, leaving the transportability of the learned boundaries (and therefore the clinical-applicability claim) unsupported.

    Authors: We acknowledge that the single-center nature of the dataset limits claims about transportability. In the revised abstract and Discussion we have replaced the original phrasing with more cautious language indicating that the results 'suggest potential applicability subject to external validation.' We have also added an explicit limitations subsection highlighting the absence of temporal or multi-center testing and the consequent need for further studies before clinical deployment. revision: partial

  3. Referee: [Methods] Methods (feature selection and SHAP): The trade-off experiments that retain only the most frequent labs and the subsequent SHAP ranking of eosinophil percentage are load-bearing for the paper’s interpretability contribution, yet no details are given on hyperparameter tuning, leakage prevention, or stability of the SHAP rankings across different feature-subset sizes.

    Authors: We appreciate this observation. The revised Methods section now details the hyperparameter tuning procedure (grid search within cross-validation folds for each model), confirms that frequency-based feature selection was performed once on the full dataset for the trade-off experiments while model training and SHAP computation occurred inside the CV loop to limit leakage, and reports SHAP stability by showing that eosinophil percentage remains among the top-ranked features across the 10-, 20-, 30-, 40-, and 50-lab subsets. revision: yes
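
The median imputation and class weighting mentioned in the first response could look roughly like the sketch below. It assumes that medians are estimated on the training fold only and that weights follow the common "balanced" heuristic, n_samples / (n_classes × class count). The rebuttal does not state either detail, and the feature names and values are invented.

```python
from collections import Counter
from statistics import median


def fit_medians(train_rows, features):
    """Per-feature median over the observed (non-None) training values."""
    return {f: median(r[f] for r in train_rows if r[f] is not None)
            for f in features}


def impute(rows, medians):
    """Replace missing values with the medians learned from the training fold."""
    return [{f: (r[f] if r[f] is not None else m) for f, m in medians.items()}
            for r in rows]


def balanced_class_weights(labels):
    """Weight n_samples / (n_classes * class_count): rarer classes weigh more."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}


# Hypothetical training fold with missing laboratory values.
train = [
    {"urea": 30.0, "eosinophil_pct": None},
    {"urea": None, "eosinophil_pct": 1.0},
    {"urea": 50.0, "eosinophil_pct": 3.0},
    {"urea": 90.0, "eosinophil_pct": 0.0},
]
medians = fit_medians(train, ["urea", "eosinophil_pct"])
held_out = impute([{"urea": None, "eosinophil_pct": None}], medians)
weights = balanced_class_weights([1, 0, 0, 0])
```

Fitting the medians on the training fold and reusing them on the held-out fold keeps test-set information out of preprocessing, which is the leakage concern raised in the third comment.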

Circularity Check

0 steps flagged

No circularity: empirical ML evaluation on held-out data

The paper describes a standard empirical pipeline: collection of a new single-center EHR dataset, selection of frequent lab features, training of off-the-shelf ML classifiers on three binary outcome tasks, evaluation via accuracy and AUC on held-out data, and post-hoc SHAP attribution. No equations, derivations, or self-referential steps appear; performance numbers are direct outputs of train/test splits rather than fitted parameters renamed as predictions. No self-citations support load-bearing uniqueness claims or ansatzes, and no known results are merely renamed. The reported results are therefore measured rather than assumed, and the paper receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on the representativeness of the single-hospital dataset, standard i.i.d. assumptions for ML training, and the reliability of SHAP for ranking clinical predictors without causal claims.

free parameters (2)
  • Number of most frequent lab tests retained (10-50)
    Chosen to trade off feature richness against patient coverage; exact selection rule not detailed in abstract.
  • Hyperparameters of the five ML models
    Tuned internally but values and tuning procedure not reported in abstract.
axioms (2)
  • domain assumption: The collected EHR records accurately reflect true clinical states and outcomes.
    Required for any supervised learning claim on observational data.
  • domain assumption: SHAP values provide stable and clinically meaningful feature attributions.
    Standard assumption when using SHAP for interpretability in medical ML.
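
One simple way to probe the stability assumption is to compare the top-ranked features obtained with different lab-subset sizes. The sketch below uses a Jaccard overlap of top-k sets and a membership check; the rankings are invented for illustration and are not the paper's results.

```python
def top_k_overlap(ranking_a, ranking_b, k):
    """Jaccard overlap between the top-k features of two rankings."""
    a, b = set(ranking_a[:k]), set(ranking_b[:k])
    return len(a & b) / len(a | b)


def always_in_top_k(rankings, feature, k):
    """True if the feature is in the top k of every ranking."""
    return all(feature in r[:k] for r in rankings.values())


# Hypothetical SHAP rankings keyed by the number of lab tests retained.
rankings = {
    10: ["urea", "eosinophil_pct", "platelets", "ast"],
    30: ["urea", "ast", "eosinophil_pct", "platelets"],
    50: ["cardiovascular", "urea", "eosinophil_pct", "ast"],
}

overlap_10_50 = top_k_overlap(rankings[10], rankings[50], k=3)
eosinophil_stable = always_in_top_k(rankings, "eosinophil_pct", k=3)
```

A feature that stays in the top k while the overall overlap is only moderate, as in this toy case, is the pattern that would support singling it out.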

pith-pipeline@v0.9.0 · 5572 in / 1373 out tokens · 85391 ms · 2026-05-10T19:07:51.186414+00:00 · methodology

Supplementary Materials

S1. Dataset and Laboratory Test Coverage

Each diagnostic was associated with a high-level comorb...