Advancing Tabular Stroke Modelling Through a Novel Hybrid Architecture and Feature-Selection Synergy
Pith reviewed 2026-05-22 13:39 UTC · model grok-4.3
The pith
A hybrid ensemble of Random Forest, XGBoost, LightGBM and support-vector classifier with logistic regression meta-learner reaches 97.2 percent accuracy predicting stroke from ten routine variables.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that rigorous preprocessing followed by a stacking architecture that combines Random Forest, XGBoost, LightGBM and a support-vector classifier, with logistic regression as meta-learner, converts a 4,981-record public stroke cohort into a predictor that reaches 97.2 percent accuracy and 97.15 percent F1-score, substantially outperforming any individual algorithm.
What carries the argument
The stacking ensemble with logistic regression meta-learner built on features chosen by point-biserial correlation and random-forest Gini importance after SMOTE balancing and outlier removal.
If this is right
- Rigorous preprocessing plus hybrid stacking can push tabular medical prediction models above the 95 percent accuracy threshold.
- Low-cost routine variables become sufficient for high-performance stroke-risk assessment.
- Diverse base learners with a linear meta-learner improve results over the best single model such as LightGBM.
- The framework remains interpretable while delivering near-clinical performance on public tabular data.
Where Pith is reading between the lines
- The same preprocessing-plus-stacking recipe may transfer to other tabular medical prediction tasks that suffer from class imbalance and moderate sample sizes.
- If validated on broader populations, the approach could support real-time risk scoring inside electronic health record systems with minimal added data collection.
- Testing whether the selected features remain stable when the model is retrained on data from different countries would reveal how sensitive the pipeline is to population shifts.
Load-bearing premise
The 4,981-record public cohort is representative of real-world stroke distributions and the combination of outlier removal, SMOTE oversampling and feature selection does not create artifacts that inflate performance on data drawn from the same distribution.
What would settle it
Retraining and testing the identical hybrid pipeline on an independent stroke dataset collected from a different population or healthcare system and checking whether accuracy remains above 95 percent.
Original abstract
Brain stroke remains one of the principal causes of death and disability worldwide, yet most tabular-data prediction models still hover below the 95% accuracy threshold, limiting real-world utility. Addressing this gap, the present work develops and validates a completely data-driven and interpretable machine-learning framework designed to predict strokes using ten routinely gathered demographic, lifestyle, and clinical variables sourced from a public cohort of 4,981 records. We employ a detailed exploratory data analysis (EDA) to understand the dataset's structure and distribution, followed by rigorous data preprocessing, including handling missing values, outlier removal, and class imbalance correction using Synthetic Minority Over-sampling Technique (SMOTE). To streamline feature selection, point-biserial correlation and random-forest Gini importance were utilized, and ten varied algorithms (encompassing tree ensembles, boosting, kernel methods, and a multilayer neural network) were optimized using stratified five-fold cross-validation. Their predictions based on probabilities helped us build the proposed model, which included Random Forest, XGBoost, LightGBM, and a support-vector classifier, with logistic regression acting as a meta-learner. The proposed model achieved an accuracy rate of 97.2% and an F1-score of 97.15%, indicating a significant enhancement compared to the leading individual model, LightGBM, which had an accuracy of 91.4%. Our study's findings indicate that rigorous preprocessing, coupled with a diverse hybrid model, can convert low-cost tabular data into a nearly clinical-grade stroke-risk assessment tool.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a hybrid ensemble model (Random Forest, XGBoost, LightGBM, SVC with logistic regression meta-learner) for binary stroke prediction on a public tabular cohort of 4,981 records using 10 demographic/lifestyle/clinical features. After EDA, missing-value handling, outlier removal, SMOTE oversampling, and dual feature selection (point-biserial correlation plus RF Gini importance), the authors report 97.2% accuracy and 97.15% F1-score under stratified 5-fold cross-validation, outperforming the best single model (LightGBM at 91.4%). The work emphasizes interpretability and the utility of routine tabular data for near-clinical-grade risk assessment.
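The dual feature selection named in the summary can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the page does not say how the two criteria were combined, so the rank-averaging rule, the `rank_features` helper, and `top_k` are hypothetical, and the data are synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def point_biserial(x, y):
    """Point-biserial correlation between a continuous x and a binary (0/1) y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    x1, x0 = x[y == 1], x[y == 0]
    n1, n0, n = len(x1), len(x0), len(x)
    s = x.std()  # population standard deviation (ddof=0)
    if s == 0:
        return 0.0
    return (x1.mean() - x0.mean()) / s * np.sqrt(n1 * n0 / n**2)


def rank_features(X, y, top_k=5, random_state=0):
    """Hypothetical combination rule: average the ranks from both criteria."""
    r_pb = np.array([abs(point_biserial(X[:, j], y)) for j in range(X.shape[1])])
    forest = RandomForestClassifier(n_estimators=200, random_state=random_state)
    gini = forest.fit(X, y).feature_importances_  # mean decrease in Gini impurity
    # Double argsort turns scores into ranks; rank 0 is the strongest feature.
    rank_pb = np.argsort(np.argsort(-r_pb))
    rank_gini = np.argsort(np.argsort(-gini))
    mean_rank = (rank_pb + rank_gini) / 2
    selected = np.argsort(mean_rank, kind="stable")[:top_k]
    return selected, r_pb, gini


# Synthetic stand-in data (assumption): 10 features, 4 of them informative.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=4, n_redundant=0,
    shuffle=False, random_state=0,
)
selected, r_pb, gini = rank_features(X, y, top_k=4)
print("selected feature indices:", selected)
```

Point-biserial correlation is numerically the Pearson correlation between a continuous variable and a 0/1 variable, which gives a simple check on the implementation. To avoid the leakage discussed in the referee report, a function like this would have to be called on training folds only.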
Significance. If the reported performance is shown to be free of leakage, the hybrid architecture plus explicit feature-selection synergy would constitute a useful incremental advance in tabular stroke modeling, illustrating how careful preprocessing and stacking can push accuracy above the 95% threshold on modest-sized public cohorts. The use of a reproducible public dataset and the combination of correlation and importance-based selection are positive elements that could be built upon.
major comments (2)
- [Methods] Methods section (cross-validation and preprocessing description): the reported 97.2% accuracy and 97.15% F1-score rest on stratified 5-fold CV performed after SMOTE and feature selection (point-biserial + RF Gini) were applied to the full 4,981-record cohort. No statement indicates that SMOTE ratios, correlation thresholds, or importance rankings were recomputed inside each training fold using only training data. This global preprocessing introduces leakage that can inflate metrics and undermine the generalization claim relative to LightGBM (91.4%).
- [Results] Results and abstract: the headline performance gap (97.2% vs. 91.4%) is large enough that even modest leakage could account for it; the manuscript supplies no external validation set, no calibration plots, and no sensitivity analysis of the SMOTE ratio or retained-feature count, leaving the central claim dependent on an unverified assumption that the reported CV reflects true out-of-distribution performance.
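The fold-wise protocol requested in the first major comment can be sketched as below: feature selection and oversampling are fitted on each training fold, and the test fold is left untouched. This is a hedged illustration, not the manuscript's code. The `smote` function is a deliberately minimal re-implementation of the interpolation idea, `fold_safe_cv` and its parameters are hypothetical names, a single Random Forest stands in for the full ensemble, and the data are synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import NearestNeighbors


def smote(X_min, n_new, k=5, rng=None):
    """Minimal SMOTE: interpolate between minority points and their neighbours."""
    rng = np.random.default_rng(rng)
    k = min(k, len(X_min) - 1)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]  # drop self
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation weight in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])


def fold_safe_cv(X, y, top_k=6, n_splits=5, seed=0):
    """Stratified CV with selection and oversampling fitted per training fold."""
    scores = []
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in cv.split(X, y):
        X_tr, y_tr = X[train_idx], y[train_idx]
        X_te, y_te = X[test_idx], y[test_idx]

        # 1) Feature selection sees the training fold only.
        ranker = RandomForestClassifier(n_estimators=100, random_state=seed)
        importances = ranker.fit(X_tr, y_tr).feature_importances_
        keep = np.argsort(importances)[::-1][:top_k]
        X_tr, X_te = X_tr[:, keep], X_te[:, keep]

        # 2) Oversampling sees the training fold only; the test fold stays real.
        X_min = X_tr[y_tr == 1]
        n_new = int((y_tr == 0).sum() - (y_tr == 1).sum())
        if n_new > 0 and len(X_min) > 1:
            X_tr = np.vstack([X_tr, smote(X_min, n_new, rng=seed)])
            y_tr = np.concatenate([y_tr, np.ones(n_new, dtype=y_tr.dtype)])

        model = RandomForestClassifier(n_estimators=100, random_state=seed)
        scores.append(f1_score(y_te, model.fit(X_tr, y_tr).predict(X_te)))
    return np.array(scores)


# Synthetic stand-in data (assumption), roughly 10% positives.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)
scores = fold_safe_cv(X, y)
print("per-fold minority-class F1:", np.round(scores, 3))
```

In practice the same effect is usually obtained with the `SMOTE` sampler and `Pipeline` from the imbalanced-learn package, which resample during fitting only. Note that scores are computed on the original, imbalanced test fold, so accuracy and F1 are no longer measured on synthetic records.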
minor comments (3)
- [Methods] Abstract and Methods: hyperparameter search details (grid, random, or Bayesian; search space; number of trials) are not provided, making it impossible to assess whether the individual base models were fairly optimized.
- [Results] Table or results section: class distribution before/after SMOTE and the exact number of features retained after selection should be stated explicitly to allow replication.
- [Discussion] Discussion: the claim that the framework is 'completely data-driven and interpretable' would be strengthened by reporting feature importances or SHAP values for the final hybrid model rather than only for the base learners.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. The comments highlight important issues around potential data leakage and the need for additional validation analyses to support our performance claims. We address each major comment below and outline the revisions we will make.
-
Referee: [Methods] Methods section (cross-validation and preprocessing description): the reported 97.2% accuracy and 97.15% F1-score rest on stratified 5-fold CV performed after SMOTE and feature selection (point-biserial + RF Gini) were applied to the full 4,981-record cohort. No statement indicates that SMOTE ratios, correlation thresholds, or importance rankings were recomputed inside each training fold using only training data. This global preprocessing introduces leakage that can inflate metrics and undermine the generalization claim relative to LightGBM (91.4%).
Authors: We agree that the manuscript text does not explicitly describe performing SMOTE, point-biserial correlation, and RF Gini importance strictly inside each training fold. This omission leaves open the possibility of leakage. We will revise the Methods section to implement and document a nested procedure: all preprocessing and feature-selection steps will be recomputed using only training data within each fold of the stratified 5-fold CV. We have re-executed the experiments under this corrected protocol and will report the updated metrics in the revised manuscript. This change directly addresses the leakage concern and strengthens the comparison to the single-model baseline.
revision: yes
-
Referee: [Results] Results and abstract: the headline performance gap (97.2% vs. 91.4%) is large enough that even modest leakage could account for it; the manuscript supplies no external validation set, no calibration plots, and no sensitivity analysis of the SMOTE ratio or retained-feature count, leaving the central claim dependent on an unverified assumption that the reported CV reflects true out-of-distribution performance.
Authors: We acknowledge that the reported performance gap warrants additional safeguards. In the revision we will add: (i) calibration plots for the ensemble and baseline models, (ii) sensitivity analyses showing how accuracy and F1-score vary with different SMOTE ratios and different numbers of retained features, and (iii) an independent held-out test set (approximately 20% of the data) on which final performance will be reported after all model selection and preprocessing decisions are frozen on the training portion. These additions will provide direct evidence that the observed improvement is not an artifact of leakage and will better substantiate the generalization claim.
revision: yes
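Items (i) and (iii) of this response can be sketched together: split off a stratified 20% hold-out before any modelling decision, then compute a reliability curve and Brier score on it. This is an assumption-laden illustration, with synthetic data and a plain logistic regression standing in for the ensemble; the bin count and split seed are arbitrary.

```python
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption).
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    weights=[0.8, 0.2], random_state=0,
)

# Freeze a 20% hold-out first; all tuning would happen on the development part.
X_dev, X_hold, y_dev, y_hold = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Stand-in model; scaling is fitted on the development portion only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_dev, y_dev)
proba = model.predict_proba(X_hold)[:, 1]

# Reliability curve: observed positive rate versus mean predicted probability.
frac_pos, mean_pred = calibration_curve(y_hold, proba, n_bins=5, strategy="quantile")
brier = brier_score_loss(y_hold, proba)

for observed, predicted in zip(frac_pos, mean_pred):
    print(f"predicted {predicted:.2f} -> observed {observed:.2f}")
print(f"Brier score on hold-out: {brier:.3f}")
```

A hold-out drawn from the same 4,981-record cohort still tests in-distribution generalization only; it does not replace the independent-population check described under "What would settle it".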
Circularity Check
Applying SMOTE and feature selection to the full dataset before CV creates leakage, making the 97.2% accuracy a fitted metric
specific steps
-
fitted input called prediction
[Abstract (preprocessing and CV description)]
"followed by rigorous data preprocessing, including handling missing values, outlier removal, and class imbalance correction using Synthetic Minority Over-sampling Technique (SMOTE). To streamline feature selection, point-biserial correlation and random-forest Gini importance were utilized, and ten varied algorithms (encompassing tree ensembles, boosting, kernel methods, and a multilayer neural network) were optimized using stratified five-fold cross-validation. ... The proposed model achieved an accuracy rate of 97.2% and an F1-score of 97.15%"
Preprocessing and feature selection are presented as completed on the full cohort before CV is applied. When SMOTE and correlation/importance thresholds use the entire dataset, test-fold records shape the synthetic minority samples and the retained feature set; the subsequent CV therefore measures performance on a transformed dataset that already contains test information, forcing the 97.2% accuracy and the claimed improvement over LightGBM by construction rather than by independent prediction.
full rationale
The paper describes preprocessing (outlier removal, SMOTE, point-biserial correlation, RF Gini feature selection) followed by stratified 5-fold CV on the 4,981-record cohort, with no indication that these steps were nested inside training folds. This allows test-set information to influence synthetic samples and selected features, so the reported accuracy/F1 (and the gap over LightGBM) reduces to an in-sample fit rather than generalization. This matches the fitted-input-called-prediction pattern and justifies the reader's 6.0 score. No self-citation or definitional circularity is present; the derivation is otherwise self-contained but the evaluation step is not.
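The mechanism can be made concrete with a standard demonstration on pure noise, where no honest procedure should beat chance. The sketch below is illustrative only and does not reproduce the paper's pipeline: the features are random, the labels are independent of them by construction, and the sizes, `k=20`, and the classifier are arbitrary assumptions.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5000))  # pure-noise features
y = np.repeat([0, 1], 50)         # labels carry no information about X

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = LogisticRegression(max_iter=1000)

# Leaky protocol: pick the 20 "best" features on ALL rows, then cross-validate.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_acc = cross_val_score(clf, X_leaky, y, cv=cv).mean()

# Honest protocol: selection sits inside the pipeline and is refit per fold.
honest = make_pipeline(SelectKBest(f_classif, k=20), clf)
honest_acc = cross_val_score(honest, X, y, cv=cv).mean()

print(f"selection before CV: {leaky_acc:.2f}")
print(f"selection inside CV: {honest_acc:.2f}")
```

Any gap between the two numbers is produced entirely by the order of operations, since the data contain no signal. Whether a comparable effect explains the 97.2% versus 91.4% gap cannot be decided from this page; it only shows why the ordering has to be reported.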
Axiom & Free-Parameter Ledger
free parameters (3)
- SMOTE oversampling ratio
- Number of retained features
- Base-model hyperparameters
axioms (2)
- domain assumption: Samples are independent and identically distributed.
- ad hoc to paper: SMOTE-generated samples preserve the true conditional distribution of stroke given the features.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
We employ a detailed exploratory data analysis (EDA) ... class imbalance correction using Synthetic Minority Over-sampling Technique (SMOTE). To streamline feature selection, point-biserial correlation and random-forest Gini importance were utilized ... proposed model, which included Random Forest, XGBoost, LightGBM, and a support-vector classifier, with logistic regression acting as a meta-learner.
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
The proposed model achieved an accuracy rate of 97.2% and an F1-score of 97.15%
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.