arxiv: 2510.15218 · v3 · submitted 2025-10-17 · 💻 cs.LG

Ensemble Deep Learning Models for Early Detection of Meningitis in ICU: Multi-center Study

Pith reviewed 2026-05-18 06:08 UTC · model grok-4.3

classification 💻 cs.LG
keywords ensemble learning · meningitis detection · intensive care unit · stacking ensemble · negative predictive value · machine learning · early detection · multi-center study

The pith

A stacking ensemble of random forest, LightGBM, and deep neural network models achieves over 99.9 percent negative predictive value for ruling out meningitis on internal ICU test sets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper trains and evaluates ensemble machine learning models on multi-center ICU data to detect meningitis early. The stacking combination of random forest, LightGBM, and deep neural network models reaches a negative predictive value above 99.9 percent on internal test sets despite heavy class imbalance. Performance declines on the external eICU cohort, yet sensitivity stays robust. The authors conclude the ensemble could function as a rule-out screening aid in emergency rooms and intensive care units once prospective multi-site studies confirm its real-world performance.

Core claim

The stacking ensemble combining RF, LightGBM, and DNN performed well on internal test sets, exhibiting an NPV greater than 99.9% even with substantial class imbalance. While performance was lower on the external eICU cohort compared to the internal test sets, sensitivity remained robust. Therefore, the stacking ensemble may serve as a rule-out screening option for ERs and ICUs after additional prospective multi-site validation studies for its efficacy in real-world.

What carries the argument

The stacking ensemble that combines predictions from random forest, LightGBM, and deep neural network models to classify meningitis cases from ICU patient data.
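A minimal sketch of how such a stacking ensemble can be wired up, under several assumptions that are not from the paper: the data are synthetic, scikit-learn's `GradientBoostingClassifier` stands in for LightGBM, `MLPClassifier` stands in for the deep neural network, and a logistic-regression meta-learner with 3-fold internal cross-validation is chosen purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic, imbalanced stand-in data (about 10% positives); not the paper's cohort.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8,
    weights=[0.9, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

base_learners = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    # Assumption: generic gradient boosting as a stand-in for LightGBM.
    ("gbdt", GradientBoostingClassifier(random_state=0)),
    # Assumption: a small multilayer perceptron as a stand-in for the DNN.
    ("mlp", MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0)),
]

# The meta-learner is fitted on out-of-fold base-model probabilities,
# so it never sees predictions made on rows a base model was trained on.
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method="predict_proba",
    cv=3,
)
stack.fit(X_train, y_train)

# Predicted probability of the positive class for each held-out row.
risk = stack.predict_proba(X_test)[:, 1]
```

For a rule-out use case, the decision threshold applied to `risk` would normally be set low enough to favor sensitivity; the value to use is a clinical choice that this sketch does not address.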

If this is right

  • The high internal NPV supports using the ensemble to reduce unnecessary lumbar punctures or broad antibiotic use in low-risk ICU patients.
  • Robust external sensitivity indicates the model can still catch most true meningitis cases across different hospital systems.
  • The approach demonstrates how tree-based and neural models can be combined for rare-event medical prediction tasks with imbalanced labels.
  • Pending validation, the ensemble could be deployed as an initial screening layer in electronic health record systems for ER and ICU triage.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar stacking methods could be tested on other low-prevalence ICU conditions where missing a case is costly but false positives are tolerable.
  • Integration with streaming vital-sign and lab data might allow the model to update risk scores continuously rather than at fixed admission snapshots.
  • If external performance gaps persist, site-specific recalibration or additional features from local populations could narrow the drop seen in the eICU cohort.

Load-bearing premise

The multi-center ICU datasets used for training and internal testing are representative of future real-world patient populations so the high negative predictive value and robust sensitivity hold without major distribution shift or unmeasured confounding.
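One simple way to probe this premise is to compare feature distributions between the training data and an external cohort. The sketch below computes a standardized mean difference (SMD) for one continuous feature; the function name and the age values are hypothetical and chosen only to illustrate the calculation.

```python
from math import sqrt
from statistics import mean, variance


def standardized_mean_difference(a, b):
    """Return (mean(a) - mean(b)) divided by the pooled standard deviation.

    The pooled SD is sqrt((var(a) + var(b)) / 2), using sample variances.
    """
    pooled_sd = sqrt((variance(a) + variance(b)) / 2)
    if pooled_sd == 0:
        return 0.0
    return (mean(a) - mean(b)) / pooled_sd


# Hypothetical patient ages; not taken from the paper or any real cohort.
train_age = [54, 61, 47, 70, 66, 58, 63, 49]
external_age = [68, 72, 59, 75, 80, 66, 71, 77]

smd = standardized_mean_difference(train_age, external_age)
print(f"SMD for age: {smd:.2f}")
```

An absolute SMD above roughly 0.1 is a commonly used rule of thumb for flagging imbalance between two groups, though the appropriate cutoff depends on the application.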

What would settle it

A prospective multi-site validation study in which the ensemble's negative predictive value drops substantially below 99 percent on new ICU admissions would show the model does not reliably rule out meningitis in practice.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents a stacking ensemble combining Random Forest, LightGBM, and a Deep Neural Network for early detection of meningitis in multi-center ICU data. It reports NPV exceeding 99.9% on internal held-out test sets despite class imbalance, lower but still robust sensitivity on an external eICU cohort, and concludes that the model may serve as a rule-out screening option for ERs and ICUs pending prospective multi-site validation.

Significance. If the performance generalizes beyond the reported cohorts, the high internal NPV could support clinically useful rule-out decisions that reduce unnecessary lumbar punctures or broad-spectrum antibiotics in ICU/ER settings. The multi-center training plus external eICU validation is a methodological strength that provides some independent grounding for the claims.

major comments (2)
  1. [Methods] Methods section: the manuscript states performance numbers (NPV >99.9%, sensitivity on eICU) but supplies no information on feature engineering, missing-data handling, exact DNN architecture and training details, hyperparameter search, or statistical testing procedures. These omissions are load-bearing for the central performance claims and prevent assessment of reproducibility or bias.
  2. [Results] Results and Discussion: the observed performance drop on the external eICU cohort is noted, yet no subgroup analyses, prevalence-adjusted metrics, or explicit checks for distribution shift (demographics, feature distributions, diagnostic criteria) are provided. This directly affects the strength of the generalizability argument underlying the rule-out screening recommendation.
minor comments (2)
  1. [Table 1] Table 1 or cohort description: clarify the exact prevalence of meningitis in each center and the eICU cohort to contextualize the NPV figures.
  2. [Figures] Figure captions: add confidence intervals or standard errors to all reported metrics for clearer interpretation of the ensemble versus base models.
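The confidence intervals requested above can be approximated with a percentile bootstrap. The following is a minimal sketch using only the standard library; the helper names (`npv`, `bootstrap_ci`) and the toy labels are assumptions for illustration, not the authors' procedure or data.

```python
import random


def npv(y_true, y_pred):
    """Negative predictive value: TN / (TN + FN)."""
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tn / (tn + fn) if (tn + fn) else float("nan")


def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for a metric of paired labels."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        # Resample patients with replacement, keeping label pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        value = metric([y_true[i] for i in idx], [y_pred[i] for i in idx])
        if value == value:  # drop NaN (resamples with no negative predictions)
            stats.append(value)
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lower, upper


# Toy data: 200 patients, 10 true cases; the model misses 1 case
# and raises 19 false alarms.
y_true = [1] * 10 + [0] * 190
y_pred = [1] * 9 + [0] * 1 + [1] * 19 + [0] * 171

point = npv(y_true, y_pred)
lower, upper = bootstrap_ci(y_true, y_pred, npv)
print(f"NPV = {point:.4f}, 95% CI approx. [{lower:.4f}, {upper:.4f}]")
```

With very few missed cases, the bootstrap distribution of NPV is coarse and piles up at 1.0, so intervals on rare-event data should be read with care.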

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. The comments have helped us identify areas where additional transparency and analysis will strengthen the work. We address each major comment below and have revised the manuscript to incorporate the suggested improvements.

  1. Referee: [Methods] Methods section: the manuscript states performance numbers (NPV >99.9%, sensitivity on eICU) but supplies no information on feature engineering, missing-data handling, exact DNN architecture and training details, hyperparameter search, or statistical testing procedures. These omissions are load-bearing for the central performance claims and prevent assessment of reproducibility or bias.

    Authors: We agree that these details are essential for reproducibility and bias assessment. In the revised manuscript we have expanded the Methods section with a full account of feature engineering (variable selection from vital signs, laboratory values, and demographics, plus normalization and temporal aggregation steps), missing-data handling (forward-fill for time-series variables combined with multiple imputation by chained equations for static features), the exact DNN architecture (three hidden layers with 128, 64, and 32 neurons, ReLU activations, and 0.3 dropout, trained with the Adam optimizer at batch size 64 and learning rate 0.001 for up to 100 epochs with early stopping on validation loss), hyperparameter search (grid search over learning rate, batch size, layer sizes, and dropout rates using 5-fold cross-validation), and statistical procedures (bootstrap resampling for 95% confidence intervals on NPV and sensitivity, plus the DeLong test for AUC comparisons). These additions are now explicitly documented. revision: yes

  2. Referee: [Results] Results and Discussion: the observed performance drop on the external eICU cohort is noted, yet no subgroup analyses, prevalence-adjusted metrics, or explicit checks for distribution shift (demographics, feature distributions, diagnostic criteria) are provided. This directly affects the strength of the generalizability argument underlying the rule-out screening recommendation.

    Authors: We acknowledge that further analyses are needed to contextualize the performance drop and support the generalizability claim. In the revision we have added subgroup performance tables stratified by age, sex and primary admission diagnosis. Prevalence-adjusted PPV and NPV are now reported across a range of plausible meningitis prevalences (0.5%–5%). Distribution-shift checks include two-sample Kolmogorov-Smirnov tests and standardized mean differences for continuous features, chi-square tests for categorical variables, and a side-by-side comparison of demographic and laboratory distributions between the multi-center training set and the eICU cohort. Potential differences in diagnostic coding practices across sites are discussed in the limitations. These results are presented in a new subsection of Results and integrated into the Discussion while retaining the call for prospective multi-site validation. revision: yes
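Prevalence-adjusted predictive values follow directly from Bayes' rule once sensitivity and specificity are fixed. The sketch below sweeps the 0.5% to 5% prevalence range mentioned in the response; the operating point (`sens`, `spec`) is hypothetical and not taken from the paper.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at the given prevalence."""
    tp = sensitivity * prevalence
    fn = (1 - sensitivity) * prevalence
    tn = specificity * (1 - prevalence)
    fp = (1 - specificity) * (1 - prevalence)
    return tp / (tp + fp), tn / (tn + fn)


# Hypothetical operating point, for illustration only.
sens, spec = 0.90, 0.85

for prev in (0.005, 0.01, 0.02, 0.05):
    ppv, npv = predictive_values(sens, spec, prev)
    print(f"prevalence={prev:.1%}  PPV={ppv:.3f}  NPV={npv:.4f}")
```

Note that NPV is high at low prevalence almost regardless of model quality: a test with 50% sensitivity and 50% specificity has an NPV of 99% when prevalence is 1%. This is why the referee asks for the exact prevalence in each cohort before the NPV figures are interpreted.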

Circularity Check

0 steps flagged

No significant circularity: empirical performance on independent test sets

The paper reports stacking ensemble performance metrics (NPV >99.9% internally, robust sensitivity externally) evaluated on held-out internal test sets and a separate eICU cohort. These are direct empirical measurements on data partitions independent of model fitting, not quantities defined or forced by the training process itself. No equations, self-definitional steps, fitted-input-as-prediction reductions, or load-bearing self-citations appear in the abstract or described claims. The derivation chain consists of standard ML training followed by out-of-sample evaluation, which remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard supervised learning assumptions plus the domain premise that the collected multi-center data distribution matches future deployment populations; no new physical entities or ad-hoc constants are introduced.

free parameters (1)
  • ensemble hyperparameters and combiner weights
    Hyperparameters for the base learners and any meta-learner weights are chosen or tuned on the training data to achieve the reported metrics.
axioms (1)
  • domain assumption: The multi-center training and test distributions are representative of real-world ICU populations for generalization.
    Invoked to support claims of utility as a rule-out tool after prospective validation.

pith-pipeline@v0.9.0 · 5601 in / 1509 out tokens · 38830 ms · 2026-05-18T06:08:11.745131+00:00 · methodology

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · 1 internal anchor

  1. [1]

    Introduction Meningitis is an acute and potentially life-threatening inflammatory process of the meninges surrounding the brain and spinal cord (Nagarathna et al. [1], Dutta et al. [2]). Detection as early and accurately as possible will help to avoid adverse events such as neurologic impairment or death (Jonge et al. [3], Natrajan et al. [4], Hueth et ...

  2. [2]

    Results 2.1. Risk Features We analyzed feature importance to improve data quality and increase diagnostic difficulty, emulating the scenes in the ER. To achieve strong performance, 6962 variables served as the training features, where the majority are ICD codes (Table 1). Given that the top 100 features had captured 96% of the importance (Figure 2), we de...

  3. [3]

    Conclusion The viability of application on EL for the early detection of meningitis in the ER or ICU is demonstrated by this study. Through careful data preprocessing and feature selection, we can extract key features such as gender and high-risk ICD codes to drive predictive models with clinically plausible factors. Three base models, including Random Fo...

  4. [4]

    Methodology 4.1. Dataset Overview This study utilizes the MIMIC-III v1.4 database, a publicly available, de-identified critical care database developed by the MIT Laboratory for Computational Physiology. The database captures clinical data for more than 46,000 ICU cases at Beth Israel Deaconess Medical Center between 2001 and 2012. Structured data were ex...

  5. [5]

    Area Under the Curve (AUC): To evaluate overall model classification abilities

  6. [6]

    Sensitivity: To evaluate the ability to identify meningitis cases out of actual positive cases (true positive rate)

  7. [7]

    Specificity: To evaluate the accuracy of identifying non-meningitis cases out of actual negative cases (true negative rate)

  8. [8]

    Positive Predictive Value (PPV): To indicate how well a positive prediction is made. (predictive positive rate)

  9. [9]

    Negative Predictive Value (NPV): To show how well a negative prediction is made. (predictive negative rate)

  10. [10]

    F1-score: Measure the trade-off between Sensitivity and PPV on class-imbalanced data. The three base learners are trained on the balanced training sets, and the models are evaluated using a 5-fold cross-validation approach. Models’ performances are evaluated across standard evaluation metrics (AUC, sensitivity, specificity, PPV, NPV, and F1-score) to m...

  11. [11]

    Discussion and Future directions This research highlights how ensemble learning could help facilitate early diagnosis of meningitis in the ER or ICU, utilizing structured electronic health record data. Although the meta-model demonstrated robust meningitis detection under both regular and challenging clinical scenarios, several limitations should be ackno...

  12. [12]

    Declaration of interests ☒ The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. ☐ The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: References:

  13. [13]

    Nagarathna, S., Hb, V., & Chandramuki, A. (2012). Laboratory Diagnosis of Meningitis. https://www.intechopen.com/chapters/34329

  14. [14]

    Dutta, K., Ghosh, S., & Basu, A. (2015). Infections and inflammation in the brain and spinal cord: A Dangerous Liaison. https://link.springer.com/chapter/10.1007/978-981-10-1711-7_4

  15. [15]

    Jonge, R. de, Furth, A. van, & Wassenaar, M. (2009). Predicting sequelae and death after bacterial meningitis in childhood: a systematic review of prognostic studies. https://link.springer.com/article/10.1186/1471-2334-10-232

  16. [16]

    Natrajan, M., Daniel, B., & Grace, Ga. (2019). Tuberculous meningitis in children: Clinical management & outcome. The Indian Journal of Medical Research. https://ijmr.org.in/tuberculous-meningitis-in-children-clinical-management-outcome/

  17. [17]

    Hueth, K., Thompson-Leduc, P., Totev, T., & Milbers, K. (2021). Assessment of the impact of a meningitis/encephalitis panel on hospital length of stay: a systematic review and meta-analysis. Antibiotics. https://www.mdpi.com/2079-6382/11/8/1028

  18. [18]

    Minatogawa, A., Ohara, J., Horinishi, Y., Sano, C., & Ohta, R. (2022). Meningitis With Staphylococcus aureus Bacteremia in an Older Patient With Nonspecific Symptoms: A Case Report. Cureus. https://www.cureus.com/articles/133079-meningitis-with-staphylococcus-aureus-bacteremia-in-an-older-patient-with-nonspecific-symptoms-a-case-report

  19. [19]

    Souza, R. D., Franklin, D., Simpson, J., & Kerr, F. (2002). Atypical Presentation of Tuberculosis Meningitis: A Case Report. Scottish Medical Journal. https://journals.sagepub.com/doi/10.1177/003693300204700107

  20. [20]

    Wang, J., Luo, J., Ye, M., Wang, X., Zhong, Y., Chang, A., Huang, G., Yin, Z., Xiao, C., Sun, J., & Ma, F. (2024). Recent Advances in Predictive Modeling with Electronic Health Records. IJCAI : Proceedings of the Conference. https://arxiv.org/abs/2402.01077

  21. [21]

    Lee, T., Shah, N., Haack, A., & Baxter, S. (2019). Clinical implementation of predictive models embedded within electronic health record systems: a systematic review. Informatics. https://www.mdpi.com/2227-9709/7/3/25

  22. [22]

    Swinckels, L., Bennis, F., & Ziesemer, K. (2023). The use of deep learning and machine learning on longitudinal electronic health records for the early detection and prevention of diseases: scoping review. https://www.jmir.org/2024/1/e48320/

  23. [23]

    Singh, H., Giardina, T., & Forjuoh, S. (2011). Electronic health record-based surveillance of diagnostic errors in primary care. https://qualitysafety.bmj.com/content/21/2/93.short

  24. [24]

    Kaur, H., Pannu, H., & Malhi, A. (2019). A Systematic Review on Imbalanced Data Challenges in Machine Learning. ACM Computing Surveys (CSUR). https://dl.acm.org/doi/10.1145/3343440

  25. [25]

    Ali, A., Shamsuddin, S., & Ralescu, A. (2014). Classification with class imbalance problem: A review. https://www.semanticscholar.org/paper/1e4870524f8de44d4f18c8f9f80eb797dfd25c89

  26. [26]

    Mena, L., & Gonzalez, J. A. (2005). Machine Learning for Imbalanced Datasets: Application in Medical Diagnostic. https://www.semanticscholar.org/paper/c6a0b19fa24f94f7186857d3b5b7ee3bf494bb8c

  27. [27]

    Jeon, Y.-S., & Lim, D.-J. (2019). PSU: Particle Stacking Undersampling Method for Highly Imbalanced Big Data. IEEE Access. https://ieeexplore.ieee.org/document/9142186/

  28. [28]

    Fiorentini, N., & Losa, M. (2020). Handling Imbalanced Data in Road Crash Severity Prediction by Machine Learning Algorithms. https://www.mdpi.com/2412-3811/5/7/61

  29. [29]

    Almeida, H., Meurs, M.-J., Kosseim, L., Butler, G., & Tsang, A. (2014). Machine Learning for Biomedical Literature Triage. PLoS ONE. https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0115892

  30. [30]

    Divina, F., Gilson, A., Gómez-Vela, F. A., Torres, M., & Torres, J. F. (2018). Stacking Ensemble Learning for Short-Term Electricity Consumption Forecasting. Energies. https://www.mdpi.com/1996-1073/11/4/949

  31. [31]

    Chatzimparmpas, A., Martins, R. M., Kucher, K., & Kerren, A. (2020). StackGenVis: Alignment of Data, Algorithms, and Models for Stacking Ensemble Learning Using Performance Metrics. IEEE Transactions on Visualization and Computer Graphics. https://arxiv.org/abs/2005.01575

  32. [32]

    Jiang, W., Chen, Z., Xiang, Y., Shao, D., Ma, L., & Zhang, J. (2019). SSEM: A Novel Self-Adaptive Stacking Ensemble Model for Classification. IEEE Access. https://www.semanticscholar.org/paper/6eab65f09eaaf93a875a22fdd43250feadfa4063

  33. [33]

    Savin, I., Ershova, K., Kurdyumova, N., Potapov, A., & Kravchuk, A. (2018). Healthcare-associated ventriculitis and meningitis in a neuro-ICU: Incidence and risk factors selected by machine learning approach. Journal of Critical Care, 45, 95–104. https://doi.org/10.1016/j.jcrc.2018.01.022

  34. [34]

    Šeho, L., Šutković, H., Tabak, V., Tahirović, S., Smajović, A., Bečić, E., ... & Badnjević, A. (2022). Using artificial intelligence in diagnostics of meningitis. IFAC-PapersOnLine, 55(4), 56-61

  35. [35]

    Karanika, M., Vasilopoulou, V. A., Katsioulis, A. T., Papastergiou, P., Theodoridou, M. N., & Hadjichristodoulou, C. S. (2009). Diagnostic clinical and laboratory findings in response to predetermining bacterial pathogen: data from the Meningitis Registry. PloS one, 4(7), e6426

  36. [36]

    Messai, A., Drif, A., Ouyahia, A., Guechi, M., Rais, M., Kaderali, L., & Cherifi, H. (2024, August). Transparent AI Models for Meningococcal Meningitis Diagnosis: Evaluating Interpretability and Performance Metrics. In 2024 IEEE 12th International Conference on Intelligent Systems (IS) (pp. 1-8). IEEE

  37. [37]

    Wang, P., Cheng, S., Li, Y., Liu, L., Liu, J., Zhao, Q., & Luo, S. (2022). Prediction of lumbar drainage-related meningitis based on supervised machine learning algorithms. Frontiers in Public Health, 10, 910479.

  38. [38]

    Denisko, D., & Hoffman, M. M. (2018, February 11). Classification and interaction in random forests. Proceedings of the National Academy of Sciences. https://pnas.org/doi/full/10.1073/pnas.1800256115

  39. [39]

    Duroux, R., & Scornet, E. (2017). Impact of subsampling and tree depth on random forests. Esaim: Probability and Statistics. https://www.semanticscholar.org/paper/418c01434c035fc335088625dec7ab2597f2d6c4

  40. [40]

    Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., ... & Liu, T. Y. (2017). Lightgbm: A highly efficient gradient boosting decision tree. Advances in neural information processing systems, 30

  41. [41]

    Shi, J., Fan, X., Wu, J., Chen, J., & Chen, W. (2018, August). DeepDiagnosis: DNN-based diagnosis prediction from pediatric big healthcare data. In 2018 Sixth International Conference on Advanced Cloud and Big Data (CBD) (pp. 287-292). IEEE

  42. [42]

    He, X., & Chua, T.-S. (2017, August 6). Neural Factorization Machines for Sparse Predictive Analytics. Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. https://arxiv.org/abs/1708.05027

  43. [43]

    Wu, X., Gao, X., Zhang, W., Luo, R., & Wang, J. (2019, August 4). Learning over categorical data using counting features: with an application on click-through rate estimation. Proceedings of the 1st International Workshop on Deep Learning Practice for High-Dimensional Sparse Data. https://www.semanticscholar.org/paper/d2b5cce1e764198002040ad6e77e97fc00cff6cd

  44. [45]

    Bader-El-Den, M., Teitei, E., & Perry, T. (2019). Biased Random Forest For Dealing With the Class Imbalance Problem. IEEE Transactions on Neural Networks and Learning Systems. https://ieeexplore.ieee.org/document/8541100/

  45. [46]

    Rajendran, K., Jayabalan, M., & Thiruchelvam, V. (2019). Predicting Breast Cancer via Supervised Machine Learning Methods on Class Imbalanced Data. International Journal of Advanced Computer Science and Applications. https://www.semanticscholar.org/paper/c1c3a05e3bb329aa0917699cde23f9ae28e948ef

  46. [47]

    Naimi, A., & Balzer, L. (2017). Stacked generalization: an introduction to super learning. European Journal of Epidemiology. https://link.springer.com/article/10.1007/s10654-018-0390-z

  47. [48]

    Yan, J., & Han, S. (2018). Classifying Imbalanced Data Sets by a Novel RE-Sample and Cost-Sensitive Stacked Generalization Method. Mathematical Problems in Engineering. https://www.hindawi.com/journals/mpe/2018/5036710/

  48. [49]

    Davies, M., & Laan, M. J. van der. (2016). Optimal Spatial Prediction Using Ensemble Machine Learning. The International Journal of Biostatistics. https://www.semanticscholar.org/paper/f9558b4d832cde16ebdd098d994c37b80da239e9

  49. [50]

    Riley, R., Ensor, J., Snell, K., Debray, T., Altman, D., Moons, K., & Collins, G. (2016). External validation of clinical prediction models using big datasets from e-health records or IPD meta-analysis: opportunities and challenges. The BMJ. https://www.semanticscholar.org/paper/441a6bfce52887c4f2bce624532caae2cec6cc5d

  50. [51]

    Zhang, D., Yin, C., Zeng, J., Yuan, X., & Zhang, P. (2020). Combining structured and unstructured data for predictive models: a deep learning approach. BMC Medical Informatics and Decision Making. https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-020-01297-6

  51. [52]

    Damasevicius, R., Abayomi-Alli, O., Maskeliūnas, R., & Abayomi-alli, A. (2020). BiLSTM with Data Augmentation using Interpolation Methods to Improve Early Detection of Parkinson Disease. 2020 15th Conference on Computer Science and Information Systems (FedCSIS). https://annals-csis.org/Volume_21/drp/188.html

  52. [53]

    Pezoulas, V., Kourou, K. D., Kalatzis, F., Exarchos, T., Zampeli, E., Gandolfo, S., Goules, A., Baldini, C., Skopouli, F., Vita, S. D., Tzioufas, A., & Fotiadis, D. (2020). Overcoming the Barriers That Obscure the Interlinking and Analysis of Clinical Data Through Harmonization and Incremental Learning. IEEE Open Journal of Engineering in Medicine and Bio...

  53. [54]

    Lin, C.-T., Huang, K.-C., Pal, N., Cao, Z., Liu, Y.-T., Fang, C.-N., Hsieh, T.-Y., Lin, Y.-Y., & Wu, S.-L. (2019). Adaptive Subspace Sampling for Class Imbalance Processing-Some clarifications, algorithm, and further investigation including applications to Brain Computer Interface. 2020 International Conference on Fuzzy Theory and Its Applications (iFUZZY...