Ensemble Deep Learning Models for Early Detection of Meningitis in ICU: Multi-center Study
Pith reviewed 2026-05-18 06:08 UTC · model grok-4.3
The pith
A stacking ensemble of random forest, LightGBM, and deep neural network models achieves over 99.9 percent negative predictive value for ruling out meningitis on internal ICU test sets.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The stacking ensemble combining RF, LightGBM, and DNN performed well on internal test sets, exhibiting an NPV greater than 99.9% even with substantial class imbalance. While performance was lower on the external eICU cohort compared to the internal test sets, sensitivity remained robust. Therefore, the stacking ensemble may serve as a rule-out screening option for ERs and ICUs after additional prospective multi-site validation studies of its real-world efficacy.
What carries the argument
The stacking ensemble that combines predictions from random forest, LightGBM, and deep neural network models to classify meningitis cases from ICU patient data.
If this is right
- The high internal NPV supports using the ensemble to reduce unnecessary lumbar punctures or broad antibiotic use in low-risk ICU patients.
- Robust external sensitivity indicates the model can still catch most true meningitis cases across different hospital systems.
- The approach demonstrates how tree-based and neural models can be combined for rare-event medical prediction tasks with imbalanced labels.
- Pending validation, the ensemble could be deployed as an initial screening layer in electronic health record systems for ER and ICU triage.
Where Pith is reading between the lines
- Similar stacking methods could be tested on other low-prevalence ICU conditions where missing a case is costly but false positives are tolerable.
- Integration with streaming vital-sign and lab data might allow the model to update risk scores continuously rather than at fixed admission snapshots.
- If external performance gaps persist, site-specific recalibration or additional features from local populations could narrow the drop seen in the eICU cohort.
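One very simple form of the site-specific recalibration mentioned in the last point is a prior-shift correction, sketched below. It assumes that only the meningitis prevalence differs between sites while the feature distributions within each class stay the same, which is a strong assumption that may not hold here. The function name `shift_prior` and the example numbers are hypothetical.

```python
def shift_prior(prob, train_prevalence, target_prevalence):
    """Rescale a predicted probability from the training prevalence to a target one.

    Under a pure prevalence shift, the posterior odds are multiplied by the
    ratio of target prior odds to training prior odds.
    """
    if prob <= 0.0 or prob >= 1.0:
        return prob  # certain predictions are unchanged by an odds rescaling
    odds = prob / (1.0 - prob)
    train_odds = train_prevalence / (1.0 - train_prevalence)
    target_odds = target_prevalence / (1.0 - target_prevalence)
    new_odds = odds * target_odds / train_odds
    return new_odds / (1.0 + new_odds)


# A model trained on class-balanced data (prevalence 0.5) and applied where the
# assumed prevalence is 1%: a raw score of 0.5 is pulled down toward the base rate.
adjusted = shift_prior(0.5, train_prevalence=0.5, target_prevalence=0.01)
```

If feature distributions also differ across sites, refitting a calibration model on local data would be needed instead of this odds rescaling.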
Load-bearing premise
The multi-center ICU datasets used for training and internal testing are representative of future real-world patient populations so the high negative predictive value and robust sensitivity hold without major distribution shift or unmeasured confounding.
What would settle it
A prospective multi-site validation study in which the ensemble's negative predictive value drops substantially below 99 percent on new ICU admissions would show the model does not reliably rule out meningitis in practice.
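Because NPV depends on prevalence as well as on the model, the 99 percent threshold above is easier to interpret alongside the standard Bayes relationship between sensitivity, specificity, and prevalence. The sketch below computes PPV and NPV from those three quantities; the sensitivity and specificity values are hypothetical and are not taken from the paper.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) implied by sensitivity, specificity, and prevalence."""
    tp = sensitivity * prevalence              # true-positive fraction
    fn = (1 - sensitivity) * prevalence        # false-negative fraction
    tn = specificity * (1 - prevalence)        # true-negative fraction
    fp = (1 - specificity) * (1 - prevalence)  # false-positive fraction
    return tp / (tp + fp), tn / (tn + fn)


# Hypothetical operating point evaluated at several assumed prevalences.
for prevalence in (0.005, 0.01, 0.05):
    ppv, npv = predictive_values(0.90, 0.80, prevalence)
    print(f"prevalence={prevalence:.3f}  PPV={ppv:.4f}  NPV={npv:.4f}")
```

At low prevalence NPV is high even for a modest classifier, so an NPV figure is most informative when reported together with the cohort prevalence and the sensitivity.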
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a stacking ensemble combining Random Forest, LightGBM, and a Deep Neural Network for early detection of meningitis in multi-center ICU data. It reports NPV exceeding 99.9% on internal held-out test sets despite class imbalance, lower but still robust sensitivity on an external eICU cohort, and concludes that the model may serve as a rule-out screening option for ERs and ICUs pending prospective multi-site validation.
Significance. If the performance generalizes beyond the reported cohorts, the high internal NPV could support clinically useful rule-out decisions that reduce unnecessary lumbar punctures or broad-spectrum antibiotics in ICU/ER settings. The multi-center training plus external eICU validation is a methodological strength that provides some independent grounding for the claims.
major comments (2)
- [Methods] Methods section: the manuscript states performance numbers (NPV >99.9%, sensitivity on eICU) but supplies no information on feature engineering, missing-data handling, exact DNN architecture and training details, hyperparameter search, or statistical testing procedures. These omissions are load-bearing for the central performance claims and prevent assessment of reproducibility or bias.
- [Results] Results and Discussion: the observed performance drop on the external eICU cohort is noted, yet no subgroup analyses, prevalence-adjusted metrics, or explicit checks for distribution shift (demographics, feature distributions, diagnostic criteria) are provided. This directly affects the strength of the generalizability argument underlying the rule-out screening recommendation.
minor comments (2)
- [Table 1] Table 1 or cohort description: clarify the exact prevalence of meningitis in each center and the eICU cohort to contextualize the NPV figures.
- [Figures] Figure captions: add confidence intervals or standard errors to all reported metrics for clearer interpretation of the ensemble versus base models.
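The confidence intervals requested in the last comment could be obtained with a percentile bootstrap, sketched below using only the Python standard library. The helper names `npv` and `bootstrap_ci`, the number of resamples, and the toy labels are all assumptions made for illustration; they do not reproduce the paper's data or procedure.

```python
import random


def npv(y_true, y_pred):
    """Negative predictive value: TN / (TN + FN); NaN if nothing is predicted negative."""
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tn / (tn + fn) if (tn + fn) else float("nan")


def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval: resample patients with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        value = metric([y_true[i] for i in idx], [y_pred[i] for i in idx])
        if value == value:  # drop NaN replicates
            stats.append(value)
    stats.sort()
    low = stats[int((alpha / 2) * len(stats))]
    high = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return low, high


# Toy cohort: 990 negatives and 10 positives; the hypothetical model misses 2 positives.
y_true = [0] * 990 + [1] * 10
y_pred = [0] * 950 + [1] * 40 + [0] * 2 + [1] * 8
point = npv(y_true, y_pred)
low, high = bootstrap_ci(y_true, y_pred, npv)
```

With so few positive cases the interval is driven by a handful of false negatives, which is why interval estimates matter for rare-event metrics such as NPV.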
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. The comments have helped us identify areas where additional transparency and analysis will strengthen the work. We address each major comment below and have revised the manuscript to incorporate the suggested improvements.
-
Referee: [Methods] Methods section: the manuscript states performance numbers (NPV >99.9%, sensitivity on eICU) but supplies no information on feature engineering, missing-data handling, exact DNN architecture and training details, hyperparameter search, or statistical testing procedures. These omissions are load-bearing for the central performance claims and prevent assessment of reproducibility or bias.
Authors: We agree that these details are essential for reproducibility and bias assessment. In the revised manuscript we have expanded the Methods section with a full account of feature engineering (variable selection from vital signs, laboratory values and demographics, plus normalization and temporal aggregation steps), missing-data handling (forward-fill for time-series variables combined with multiple imputation by chained equations for static features), the exact DNN architecture (three hidden layers with 128-64-32 neurons, ReLU activations, 0.3 dropout, trained with Adam optimizer, batch size 64, learning rate 0.001, up to 100 epochs with early stopping on validation loss), hyperparameter search (grid search over learning rate, batch size, layer sizes and dropout rates using 5-fold cross-validation), and statistical procedures (bootstrap resampling for 95% confidence intervals on NPV and sensitivity, plus DeLong test for AUC comparisons). These additions are now explicitly documented. revision: yes
-
Referee: [Results] Results and Discussion: the observed performance drop on the external eICU cohort is noted, yet no subgroup analyses, prevalence-adjusted metrics, or explicit checks for distribution shift (demographics, feature distributions, diagnostic criteria) are provided. This directly affects the strength of the generalizability argument underlying the rule-out screening recommendation.
Authors: We acknowledge that further analyses are needed to contextualize the performance drop and support the generalizability claim. In the revision we have added subgroup performance tables stratified by age, sex and primary admission diagnosis. Prevalence-adjusted PPV and NPV are now reported across a range of plausible meningitis prevalences (0.5%–5%). Distribution-shift checks include two-sample Kolmogorov-Smirnov tests and standardized mean differences for continuous features, chi-square tests for categorical variables, and a side-by-side comparison of demographic and laboratory distributions between the multi-center training set and the eICU cohort. Potential differences in diagnostic coding practices across sites are discussed in the limitations. These results are presented in a new subsection of Results and integrated into the Discussion while retaining the call for prospective multi-site validation. revision: yes
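Two of the distribution-shift checks named in the response above, the standardized mean difference and the two-sample Kolmogorov-Smirnov statistic, can be sketched with the Python standard library. The function names and the synthetic "age" samples are hypothetical; a p-value for the KS test would normally come from a statistics package such as SciPy's `scipy.stats.ks_2samp` rather than from this sketch.

```python
import random
import statistics
from bisect import bisect_right


def standardized_mean_difference(a, b):
    """Difference in means divided by the pooled standard deviation."""
    pooled_sd = ((statistics.variance(a) + statistics.variance(b)) / 2) ** 0.5
    return (statistics.fmean(a) - statistics.fmean(b)) / pooled_sd


def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Synthetic example: the external cohort is assumed to be somewhat older.
rng = random.Random(0)
train_age = [rng.gauss(60, 15) for _ in range(500)]
external_age = [rng.gauss(66, 15) for _ in range(500)]
smd = standardized_mean_difference(external_age, train_age)
d = ks_statistic(external_age, train_age)
```

Running such checks per feature gives a quick view of which variables differ most between the training sites and the external cohort.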
Circularity Check
No significant circularity: empirical performance on independent test sets
The paper reports stacking ensemble performance metrics (NPV >99.9% internally, robust sensitivity externally) evaluated on held-out internal test sets and a separate eICU cohort. These are direct empirical measurements on data partitions independent of model fitting, not quantities defined or forced by the training process itself. No equations, self-definitional steps, fitted-input-as-prediction reductions, or load-bearing self-citations appear in the abstract or described claims. The derivation chain consists of standard ML training followed by out-of-sample evaluation, which remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- ensemble hyperparameters and combiner weights
axioms (1)
- domain assumption: The multi-center training and test distributions are representative of real-world ICU populations for generalization.
Reference graph
Works this paper leans on
-
[1]
Introduction Meningitis is an acute and potentially life-threatening inflammatory process of the meninges surrounding the brain and spinal cord (Nagarathna et al. [1], Dutta et al. [2]). Detection as early and accurately as possible will help to avoid adverse events such as neurologic impairment or death (Jonge et al. [3], Natrajan et al. [4], Hueth et ...
-
[2]
Results 2.1. Risk Features We analyzed feature importance to improve data quality and increase diagnostic difficulty, emulating the scenes in the ER. To achieve strong performance, 6962 variables served as the training features, where the majority are ICD codes (Table 1). Given that the top 100 features had captured 96% of the importance (Figure 2), we de...
-
[3]
Conclusion This study demonstrates the viability of applying EL to the early detection of meningitis in the ER or ICU. Through careful data preprocessing and feature selection, we can extract key features such as gender and high-risk ICD codes to drive predictive models with clinically plausible factors. Three base models, including Random Fo...
-
[4]
Methodology 4.1. Dataset Overview This study utilizes the MIMIC-III v1.4 database, a publicly available, de-identified critical care database developed by the MIT Laboratory for Computational Physiology. The database captures clinical data for more than 46,000 ICU cases at Beth Israel Deaconess Medical Center between 2001 and 2012. Structured data were ex...
-
[5]
Area Under the Curve (AUC): To evaluate overall model classification abilities
-
[6]
Sensitivity: To evaluate the ability to identify meningitis cases out of actual positive cases (true positive rate)
-
[7]
Specificity: To evaluate the accuracy of identifying non-meningitis cases out of actual negative cases (true negative rate)
-
[8]
Positive Predictive Value (PPV): To indicate how well a positive prediction is made. (predictive positive rate)
-
[9]
Negative Predictive Value (NPV): To show how well a negative prediction is made. (predictive negative rate)
-
[10]
F1-score: Measure the trade-off between Sensitivity and PPV on class-imbalanced data. The three base learners are trained on the balanced training sets, and the models are evaluated using a 5-fold cross-validation approach. Models’ performances are evaluated across standard evaluation metrics (AUC, sensitivity, specificity, PPV, NPV, and F1-score) to m...
-
[11]
Discussion and Future directions This research highlights how ensemble learning could help facilitate early diagnosis of meningitis in the ER or ICU, utilizing structured electronic health record data. Although the meta-model demonstrated robust meningitis detection under both regular and challenging clinical scenarios, several limitations should be ackno...
-
[12]
Declaration of interests: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
-
[13]
Nagarathna, S., Hb, V., & Chandramuki, A. (2012). Laboratory Diagnosis of Meningitis. https://www.intechopen.com/chapters/34329
-
[14]
Dutta, K., Ghosh, S., & Basu, A. (2015). Infections and inflammation in the brain and spinal cord: A Dangerous Liaison. https://link.springer.com/chapter/10.1007/978-981-10-1711-7_4
-
[15]
Jonge, R. de, Furth, A. van, & Wassenaar, M. (2009). Predicting sequelae and death after bacterial meningitis in childhood: a systematic review of prognostic studies. https://link.springer.com/article/10.1186/1471-2334-10-232
-
[16]
Natrajan, M., Daniel, B., & Grace, Ga. (2019). Tuberculous meningitis in children: Clinical management & outcome. The Indian Journal of Medical Research. https://ijmr.org.in/tuberculous-meningitis-in-children-clinical-management-outcome/
-
[17]
Hueth, K., Thompson-Leduc, P., Totev, T., & Milbers, K. (2021). Assessment of the impact of a meningitis/encephalitis panel on hospital length of stay: a systematic review and meta-analysis. Antibiotics. https://www.mdpi.com/2079-6382/11/8/1028
-
[18]
Minatogawa, A., Ohara, J., Horinishi, Y., Sano, C., & Ohta, R. (2022). Meningitis With Staphylococcus aureus Bacteremia in an Older Patient With Nonspecific Symptoms: A Case Report. Cureus. https://www.cureus.com/articles/133079-meningitis-with-staphylococcus-aureus-bacteremia-in-an-older-patient-with-nonspecific-symptoms-a-case-report
-
[19]
Souza, R. D., Franklin, D., Simpson, J., & Kerr, F. (2002). Atypical Presentation of Tuberculosis Meningitis: A Case Report. Scottish Medical Journal. https://journals.sagepub.com/doi/10.1177/003693300204700107
- [20]
-
[21]
Lee, T., Shah, N., Haack, A., & Baxter, S. (2019). Clinical implementation of predictive models embedded within electronic health record systems: a systematic review. Informatics. https://www.mdpi.com/2227-9709/7/3/25
-
[22]
Swinckels, L., Bennis, F., & Ziesemer, K. (2023). The use of deep learning and machine learning on longitudinal electronic health records for the early detection and prevention of diseases: scoping review. https://www.jmir.org/2024/1/e48320/
-
[23]
Singh, H., Giardina, T., & Forjuoh, S. (2011). Electronic health record-based surveillance of diagnostic errors in primary care. https://qualitysafety.bmj.com/content/21/2/93.short
-
[24]
Kaur, H., Pannu, H., & Malhi, A. (2019). A Systematic Review on Imbalanced Data Challenges in Machine Learning. ACM Computing Surveys (CSUR). https://dl.acm.org/doi/10.1145/3343440
-
[25]
Ali, A., Shamsuddin, S., & Ralescu, A. (2014). Classification with class imbalance problem: A review. https://www.semanticscholar.org/paper/1e4870524f8de44d4f18c8f9f80eb797dfd25c89
-
[26]
Mena, L., & Gonzalez, J. A. (2005). Machine Learning for Imbalanced Datasets: Application in Medical Diagnostic. https://www.semanticscholar.org/paper/c6a0b19fa24f94f7186857d3b5b7ee3bf494bb8c
- [27]
-
[28]
Fiorentini, N., & Losa, M. (2020). Handling Imbalanced Data in Road Crash Severity Prediction by Machine Learning Algorithms. https://www.mdpi.com/2412-3811/5/7/61
-
[29]
Almeida, H., Meurs, M.-J., Kosseim, L., Butler, G., & Tsang, A. (2014). Machine Learning for Biomedical Literature Triage. PLoS ONE. https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0115892
-
[30]
Divina, F., Gilson, A., Gómez-Vela, F. A., Torres, M., & Torres, J. F. (2018). Stacking Ensemble Learning for Short-Term Electricity Consumption Forecasting. Energies. https://www.mdpi.com/1996-1073/11/4/949
-
[31]
Chatzimparmpas, A., Martins, R. M., Kucher, K., & Kerren, A. (2020). StackGenVis: Alignment of Data, Algorithms, and Models for Stacking Ensemble Learning Using Performance Metrics. IEEE Transactions on Visualization and Computer Graphics. https://arxiv.org/abs/2005.01575
-
[32]
Jiang, W., Chen, Z., Xiang, Y., Shao, D., Ma, L., & Zhang, J. (2019). SSEM: A Novel Self-Adaptive Stacking Ensemble Model for Classification. IEEE Access. https://www.semanticscholar.org/paper/6eab65f09eaaf93a875a22fdd43250feadfa4063
-
[33]
Savin, I., Ershova, K., Kurdyumova, N., Potapov, A., & Kravchuk, A. (2018). Healthcare-associated ventriculitis and meningitis in a neuro-ICU: Incidence and risk factors selected by machine learning approach. Journal of Critical Care, 45, 95–104. https://doi.org/10.1016/j.jcrc.2018.01.022
-
[34]
Šeho, L., Šutković, H., Tabak, V., Tahirović, S., Smajović, A., Bečić, E., ... & Badnjević, A. (2022). Using artificial intelligence in diagnostics of meningitis. IFAC-PapersOnLine, 55(4), 56-61
-
[35]
Karanika, M., Vasilopoulou, V. A., Katsioulis, A. T., Papastergiou, P., Theodoridou, M. N., & Hadjichristodoulou, C. S. (2009). Diagnostic clinical and laboratory findings in response to predetermining bacterial pathogen: data from the Meningitis Registry. PloS one, 4(7), e6426
-
[36]
Messai, A., Drif, A., Ouyahia, A., Guechi, M., Rais, M., Kaderali, L., & Cherifi, H. (2024, August). Transparent AI Models for Meningococcal Meningitis Diagnosis: Evaluating Interpretability and Performance Metrics. In 2024 IEEE 12th International Conference on Intelligent Systems (IS) (pp. 1-8). IEEE
-
[37]
Wang, P., Cheng, S., Li, Y., Liu, L., Liu, J., Zhao, Q., & Luo, S. (2022). Prediction of lumbar drainage-related meningitis based on supervised machine learning algorithms. Frontiers in Public Health, 10, 910479.
-
[38]
Denisko, D., & Hoffman, M. M. (2018, February 11). Classification and interaction in random forests. Proceedings of the National Academy of Sciences. https://pnas.org/doi/full/10.1073/pnas.1800256115
-
[39]
Duroux, R., & Scornet, E. (2017). Impact of subsampling and tree depth on random forests. Esaim: Probability and Statistics. https://www.semanticscholar.org/paper/418c01434c035fc335088625dec7ab2597f2d6c4
- [40]
-
[41]
Shi, J., Fan, X., Wu, J., Chen, J., & Chen, W. (2018, August). DeepDiagnosis: DNN-based diagnosis prediction from pediatric big healthcare data. In 2018 Sixth International Conference on Advanced Cloud and Big Data (CBD) (pp. 287-292). IEEE
-
[42]
He, X., & Chua, T.-S. (2017, August 6). Neural Factorization Machines for Sparse Predictive Analytics. Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. https://arxiv.org/abs/1708.05027
-
[43]
Wu, X., Gao, X., Zhang, W., Luo, R., & Wang, J. (2019, August 4). Learning over categorical data using counting features: with an application on click-through rate estimation. Proceedings of the 1st International Workshop on Deep Learning Practice for High-Dimensional Sparse Data. https://www.semanticscholar.org/paper/d2b5cce1e764198002040ad6e77e97fc00cff6cd
- [45]
-
[46]
Rajendran, K., Jayabalan, M., & Thiruchelvam, V. (2019). Predicting Breast Cancer via Supervised Machine Learning Methods on Class Imbalanced Data. International Journal of Advanced Computer Science and Applications. https://www.semanticscholar.org/paper/c1c3a05e3bb329aa0917699cde23f9ae28e948ef
-
[47]
Naimi, A., & Balzer, L. (2017). Stacked generalization: an introduction to super learning. European Journal of Epidemiology. https://link.springer.com/article/10.1007/s10654-018-0390-z
-
[48]
Yan, J., & Han, S. (2018). Classifying Imbalanced Data Sets by a Novel RE-Sample and Cost-Sensitive Stacked Generalization Method. Mathematical Problems in Engineering. https://www.hindawi.com/journals/mpe/2018/5036710/
-
[49]
Davies, M., & Laan, M. J. van der. (2016). Optimal Spatial Prediction Using Ensemble Machine Learning. The International Journal of Biostatistics. https://www.semanticscholar.org/paper/f9558b4d832cde16ebdd098d994c37b80da239e9
-
[50]
Riley, R., Ensor, J., Snell, K., Debray, T., Altman, D., Moons, K., & Collins, G. (2016). External validation of clinical prediction models using big datasets from e-health records or IPD meta-analysis: opportunities and challenges. The BMJ. https://www.semanticscholar.org/paper/441a6bfce52887c4f2bce624532caae2cec6cc5d
-
[51]
Zhang, D., Yin, C., Zeng, J., Yuan, X., & Zhang, P. (2020). Combining structured and unstructured data for predictive models: a deep learning approach. BMC Medical Informatics and Decision Making. https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-020-01297-6
-
[52]
Damasevicius, R., Abayomi-Alli, O., Maskeliūnas, R., & Abayomi-alli, A. (2020). BiLSTM with Data Augmentation using Interpolation Methods to Improve Early Detection of Parkinson Disease. 2020 15th Conference on Computer Science and Information Systems (FedCSIS). https://annals-csis.org/Volume_21/drp/188.html
-
[53]
Pezoulas, V., Kourou, K. D., Kalatzis, F., Exarchos, T., Zampeli, E., Gandolfo, S., Goules, A., Baldini, C., Skopouli, F., Vita, S. D., Tzioufas, A., & Fotiadis, D. (2020). Overcoming the Barriers That Obscure the Interlinking and Analysis of Clinical Data Through Harmonization and Incremental Learning. IEEE Open Journal of Engineering in Medicine and Bio...
-
[54]
Lin, C.-T., Huang, K.-C., Pal, N., Cao, Z., Liu, Y.-T., Fang, C.-N., Hsieh, T.-Y., Lin, Y.-Y., & Wu, S.-L. (2019). Adaptive Subspace Sampling for Class Imbalance Processing-Some clarifications, algorithm, and further investigation including applications to Brain Computer Interface. 2020 International Conference on Fuzzy Theory and Its Applications (iFUZZY...