arxiv: 2505.15844 · v1 · pith:IZBHVEBXnew · submitted 2025-05-18 · 🧬 q-bio.QM · cs.LG · stat.AP

Advancing Tabular Stroke Modelling Through a Novel Hybrid Architecture and Feature-Selection Synergy

Pith reviewed 2026-05-22 13:39 UTC · model grok-4.3

classification 🧬 q-bio.QM · cs.LG · stat.AP
keywords stroke prediction, hybrid machine learning, tabular data, ensemble learning, feature selection, SMOTE, medical risk assessment, stacking classifier

The pith

A hybrid ensemble of Random Forest, XGBoost, LightGBM and a support-vector classifier, with a logistic regression meta-learner, reaches 97.2 percent accuracy in predicting stroke from ten routine variables.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a completely data-driven machine-learning system to predict brain stroke using only ten common demographic, lifestyle and clinical variables drawn from a public set of 4,981 records. After exploratory analysis, the authors handle missing values, remove outliers, correct class imbalance with SMOTE, and select features by point-biserial correlation plus random-forest Gini importance. Ten candidate algorithms are tuned under stratified five-fold cross-validation; the probability outputs of four of them (Random Forest, XGBoost, LightGBM and a support-vector classifier) are then stacked with logistic regression as the meta-learner. The resulting hybrid model attains 97.2 percent accuracy and 97.15 percent F1-score, well above the strongest single model (LightGBM at 91.4 percent). A sympathetic reader would care because the work claims that ordinary tabular data, after disciplined cleaning and ensembling, can support near-clinical-grade risk assessment without costly imaging or lab tests.

Core claim

The central claim is that rigorous preprocessing followed by a stacking architecture that combines Random Forest, XGBoost, LightGBM and a support-vector classifier, with logistic regression as meta-learner, converts a 4,981-record public stroke cohort into a predictor that reaches 97.2 percent accuracy and 97.15 percent F1-score, substantially outperforming any individual algorithm.
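
As a rough illustration of the stacking idea in this claim, the sketch below fits a logistic-regression meta-learner on the probability outputs of four base models. It is a minimal standard-library sketch, not the authors' implementation: the base-model probabilities are simulated by a hypothetical helper (`simulate_base_probs`), and the learning rate, epoch count and noise level are arbitrary assumptions.

```python
import math
import random

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_meta_learner(base_probs, labels, lr=0.5, epochs=500):
    """Fit logistic regression on base-model probabilities by batch gradient descent."""
    n_models = len(base_probs[0])
    weights = [0.0] * n_models
    bias = 0.0
    n = len(labels)
    for _ in range(epochs):
        grad_w = [0.0] * n_models
        grad_b = 0.0
        for row, y in zip(base_probs, labels):
            err = sigmoid(sum(w * p for w, p in zip(weights, row)) + bias) - y
            for j, p in enumerate(row):
                grad_w[j] += err * p
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def predict_meta(weights, bias, row):
    return sigmoid(sum(w * p for w, p in zip(weights, row)) + bias)

def simulate_base_probs(n, rng, n_models=4, noise=0.3):
    """Hypothetical stand-in for the probabilities produced by four base models."""
    rows, labels = [], []
    for i in range(n):
        y = i % 2
        centre = 0.7 if y == 1 else 0.3
        rows.append([centre + rng.uniform(-noise, noise) for _ in range(n_models)])
        labels.append(y)
    return rows, labels

def accuracy(preds, labels):
    return sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)

rng = random.Random(0)
train_rows, train_labels = simulate_base_probs(400, rng)
test_rows, test_labels = simulate_base_probs(400, rng)

weights, bias = fit_meta_learner(train_rows, train_labels)
stacked_pred = [int(predict_meta(weights, bias, r) >= 0.5) for r in test_rows]
stacked_acc = accuracy(stacked_pred, test_labels)
base_accs = [accuracy([int(r[j] >= 0.5) for r in test_rows], test_labels)
             for j in range(4)]
print("base accuracies:", base_accs, "stacked accuracy:", stacked_acc)
```

In this toy setting the meta-learner gains by averaging out independent noise across the four inputs. Real base models trained on the same records make correlated errors, so the gain there may be smaller.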

What carries the argument

The stacking ensemble with logistic regression meta-learner built on features chosen by point-biserial correlation and random-forest Gini importance after SMOTE balancing and outlier removal.
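
For readers unfamiliar with the first selection criterion, the point-biserial correlation between a numeric feature and a binary label is r_pb = (M1 - M0) / s_n * sqrt(n1 * n0 / n^2), where M1 and M0 are the feature means in the two classes and s_n is the population standard deviation of the feature. The sketch below computes it and checks that it coincides with Pearson's r on the 0/1 labels. The ages and labels are hypothetical, and the cutoff the authors applied to this statistic is not stated here.

```python
import math

def point_biserial(values, labels):
    """Point-biserial correlation between a numeric feature and a 0/1 label."""
    n = len(values)
    group1 = [v for v, y in zip(values, labels) if y == 1]
    group0 = [v for v, y in zip(values, labels) if y == 0]
    n1, n0 = len(group1), len(group0)
    if n1 == 0 or n0 == 0:
        raise ValueError("both classes must be present")
    mean1 = sum(group1) / n1
    mean0 = sum(group0) / n0
    mean_all = sum(values) / n
    # Population standard deviation of the feature over all records.
    std_all = math.sqrt(sum((v - mean_all) ** 2 for v in values) / n)
    if std_all == 0:
        return 0.0
    return (mean1 - mean0) / std_all * math.sqrt(n1 * n0 / (n * n))

def pearson(xs, ys):
    """Plain Pearson correlation, used only as a cross-check."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical ages and stroke labels, for illustration only.
ages = [34.0, 45.0, 51.0, 62.0, 67.0, 70.0, 74.0, 80.0]
stroke = [0, 0, 0, 0, 1, 0, 1, 1]
r_pb = point_biserial(ages, stroke)
print("point-biserial r:", r_pb, "Pearson r:", pearson(ages, stroke))
```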

If this is right

  • Rigorous preprocessing plus hybrid stacking can push tabular medical prediction models above the 95 percent accuracy threshold.
  • Low-cost routine variables become sufficient for high-performance stroke-risk assessment.
  • Diverse base learners with a linear meta-learner improve results over the best single model such as LightGBM.
  • The framework remains interpretable while delivering near-clinical performance on public tabular data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same preprocessing-plus-stacking recipe may transfer to other tabular medical prediction tasks that suffer from class imbalance and moderate sample sizes.
  • If validated on broader populations, the approach could support real-time risk scoring inside electronic health record systems with minimal added data collection.
  • Testing whether the selected features remain stable when the model is retrained on data from different countries would reveal how sensitive the pipeline is to population shifts.

Load-bearing premise

The 4,981-record public cohort is representative of real-world stroke distributions and the combination of outlier removal, SMOTE oversampling and feature selection does not create artifacts that inflate performance on data drawn from the same distribution.
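
To make the SMOTE part of this premise concrete, the sketch below shows the core interpolation step: each synthetic record is placed on the line segment between a real minority record and one of its nearest minority neighbours. It is a simplified standard-library illustration rather than the implementation the authors used, and the function name `smote_like`, the neighbour count and the (age, average glucose) values are assumptions.

```python
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority point
    and one of its k nearest minority neighbours (the core SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance.
        others = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

# Hypothetical minority records: (age, average glucose level).
minority = [(67.0, 228.7), (80.0, 105.9), (74.0, 70.1), (61.0, 202.2), (79.0, 174.1)]
synthetic = smote_like(minority, n_new=10)
for point in synthetic:
    print(point)
```

Because every synthetic point is a convex combination of two real records, it never leaves the range spanned by the minority class. The same arithmetic applied to binary-coded features produces fractional values, which is one way artifacts of the kind named in the premise can arise.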

What would settle it

Retraining and testing the identical hybrid pipeline on an independent stroke dataset collected from a different population or healthcare system and checking whether accuracy remains above 95 percent.
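
One caveat when applying a 95 percent accuracy bar to an external cohort: if strokes are rare in that cohort, accuracy alone can be cleared trivially, so F1 (which the paper also reports) is worth checking alongside it. The sketch below works through the arithmetic on a hypothetical cohort; the prevalence and the prediction counts are invented for illustration.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels (1 = stroke)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical external cohort: 1,000 records, 50 of them strokes.
y_true = [1] * 50 + [0] * 950

# A degenerate model that never predicts stroke.
never_stroke = binary_metrics(y_true, [0] * 1000)

# A model that finds 40 of the 50 strokes at the cost of 20 false alarms.
y_pred = [1] * 40 + [0] * 10 + [1] * 20 + [0] * 930
useful = binary_metrics(y_true, y_pred)

print("never predicts stroke:", never_stroke)
print("finds most strokes:   ", useful)
```

The degenerate model matches the 95 percent accuracy bar while its F1 is zero, whereas the second model has only slightly higher accuracy but a much higher F1.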

Figures

Figures reproduced from arXiv: 2505.15844 by Md. Jalal Uddin Chowdhury, Sumon Chandra Das, Yousuf Islam.

Figure 1: A visual overview of the key steps involved in this study
Figure 2: Class Distribution of Stroke Status
Figure 3: Numerical Feature Distributions
Figure 4: Categorical Feature Distributions
Figure 5: Box Plots Before and After Outlier Removal
Figure 7: Random Forest Feature Importance
Figure 8: Class Distribution After Applying SMOTE to Training Data
Figure 9: All and Important Features Between All Models
Figure 10: Model Performance Comparison
Figure 12: Correlation Based Model Performance Heatmap
Figure 13: Random Forest Based Model Performance Heatmap
Original abstract

Brain stroke remains one of the principal causes of death and disability worldwide, yet most tabular-data prediction models still hover below the 95% accuracy threshold, limiting real-world utility. Addressing this gap, the present work develops and validates a completely data-driven and interpretable machine-learning framework designed to predict strokes using ten routinely gathered demographic, lifestyle, and clinical variables sourced from a public cohort of 4,981 records. We employ a detailed exploratory data analysis (EDA) to understand the dataset's structure and distribution, followed by rigorous data preprocessing, including handling missing values, outlier removal, and class imbalance correction using Synthetic Minority Over-sampling Technique (SMOTE). To streamline feature selection, point-biserial correlation and random-forest Gini importance were utilized, and ten varied algorithms (encompassing tree ensembles, boosting, kernel methods, and a multilayer neural network) were optimized using stratified five-fold cross-validation. Their predictions based on probabilities helped us build the proposed model, which included Random Forest, XGBoost, LightGBM, and a support-vector classifier, with logistic regression acting as a meta-learner. The proposed model achieved an accuracy rate of 97.2% and an F1-score of 97.15%, indicating a significant enhancement compared to the leading individual model, LightGBM, which had an accuracy of 91.4%. Our study's findings indicate that rigorous preprocessing, coupled with a diverse hybrid model, can convert low-cost tabular data into a nearly clinical-grade stroke-risk assessment tool.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript proposes a hybrid ensemble model (Random Forest, XGBoost, LightGBM, SVC with logistic regression meta-learner) for binary stroke prediction on a public tabular cohort of 4,981 records using 10 demographic/lifestyle/clinical features. After EDA, missing-value handling, outlier removal, SMOTE oversampling, and dual feature selection (point-biserial correlation plus RF Gini importance), the authors report 97.2% accuracy and 97.15% F1-score under stratified 5-fold cross-validation, outperforming the best single model (LightGBM at 91.4%). The work emphasizes interpretability and the utility of routine tabular data for near-clinical-grade risk assessment.

Significance. If the reported performance is shown to be free of leakage, the hybrid architecture plus explicit feature-selection synergy would constitute a useful incremental advance in tabular stroke modeling, illustrating how careful preprocessing and stacking can push accuracy above the 95% threshold on modest-sized public cohorts. The use of a reproducible public dataset and the combination of correlation and importance-based selection are positive elements that could be built upon.

major comments (2)
  1. [Methods] Methods section (cross-validation and preprocessing description): the reported 97.2% accuracy and 97.15% F1-score rest on stratified 5-fold CV performed after SMOTE and feature selection (point-biserial + RF Gini) were applied to the full 4,981-record cohort. No statement indicates that SMOTE ratios, correlation thresholds, or importance rankings were recomputed inside each training fold using only training data. This global preprocessing introduces leakage that can inflate metrics and undermine the generalization claim relative to LightGBM (91.4%).
  2. [Results] Results and abstract: the headline performance gap (97.2% vs. 91.4%) is large enough that even modest leakage could account for it; the manuscript supplies no external validation set, no calibration plots, and no sensitivity analysis of the SMOTE ratio or retained-feature count, leaving the central claim dependent on an unverified assumption that the reported CV reflects true out-of-distribution performance.
minor comments (3)
  1. [Methods] Abstract and Methods: hyperparameter search details (grid, random, or Bayesian; search space; number of trials) are not provided, making it impossible to assess whether the individual base models were fairly optimized.
  2. [Results] Table or results section: class distribution before/after SMOTE and the exact number of features retained after selection should be stated explicitly to allow replication.
  3. [Discussion] Discussion: the claim that the framework is 'completely data-driven and interpretable' would be strengthened by reporting feature importances or SHAP values for the final hybrid model rather than only for the base learners.
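
The fold-wise protocol that major comment 1 asks for can be sketched as follows. This is a minimal standard-library illustration, not the authors' code: random duplication of minority rows stands in for SMOTE, the function names are hypothetical, and the 10-in-100 class balance is invented. The point is only the ordering: folds are fixed first, and resampling touches training rows alone.

```python
import random

def stratified_folds(labels, n_folds=5, seed=0):
    """Assign each record index to a fold, keeping class proportions similar."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % n_folds].append(i)
    return folds

def oversample_minority(indices, labels, rng):
    """Stand-in for SMOTE: duplicate random minority indices until balanced."""
    pos = [i for i in indices if labels[i] == 1]
    neg = [i for i in indices if labels[i] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return indices + extra

def leak_free_splits(labels, n_folds=5, seed=0):
    """Yield (balanced_train_indices, untouched_test_indices) for each fold."""
    rng = random.Random(seed)
    folds = stratified_folds(labels, n_folds, seed)
    for k, test_idx in enumerate(folds):
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        # Resampling (and, in a full pipeline, feature selection) sees training rows only.
        yield oversample_minority(train_idx, labels, rng), test_idx

# Hypothetical cohort: 10 stroke records among 100.
labels = [1] * 10 + [0] * 90
splits = list(leak_free_splits(labels))
for train_idx, test_idx in splits:
    n_pos = sum(labels[i] for i in train_idx)
    print(len(train_idx), "training rows,", n_pos, "positive;", len(test_idx), "test rows")
```

Each test fold keeps the original class ratio, so the metrics computed on it describe performance on unbalanced data, which is the setting a deployed model would face.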

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments highlight important issues around potential data leakage and the need for additional validation analyses to support our performance claims. We address each major comment below and outline the revisions we will make.

Point-by-point responses
  1. Referee: [Methods] Methods section (cross-validation and preprocessing description): the reported 97.2% accuracy and 97.15% F1-score rest on stratified 5-fold CV performed after SMOTE and feature selection (point-biserial + RF Gini) were applied to the full 4,981-record cohort. No statement indicates that SMOTE ratios, correlation thresholds, or importance rankings were recomputed inside each training fold using only training data. This global preprocessing introduces leakage that can inflate metrics and undermine the generalization claim relative to LightGBM (91.4%).

    Authors: We agree that the manuscript text does not explicitly describe performing SMOTE, point-biserial correlation, and RF Gini importance strictly inside each training fold. This omission leaves open the possibility of leakage. We will revise the Methods section to implement and document a nested procedure: all preprocessing and feature-selection steps will be recomputed using only training data within each fold of the stratified 5-fold CV. We have re-executed the experiments under this corrected protocol and will report the updated metrics in the revised manuscript. This change directly addresses the leakage concern and strengthens the comparison to the single-model baseline. revision: yes

  2. Referee: [Results] Results and abstract: the headline performance gap (97.2% vs. 91.4%) is large enough that even modest leakage could account for it; the manuscript supplies no external validation set, no calibration plots, and no sensitivity analysis of the SMOTE ratio or retained-feature count, leaving the central claim dependent on an unverified assumption that the reported CV reflects true out-of-distribution performance.

    Authors: We acknowledge that the reported performance gap warrants additional safeguards. In the revision we will add: (i) calibration plots for the ensemble and baseline models, (ii) sensitivity analyses showing how accuracy and F1-score vary with different SMOTE ratios and different numbers of retained features, and (iii) an independent held-out test set (approximately 20% of the data) on which final performance will be reported after all model selection and preprocessing decisions are frozen on the training portion. These additions will provide direct evidence that the observed improvement is not an artifact of leakage and will better substantiate the generalization claim. revision: yes
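
The calibration plots promised in item (i) are the graphical form of a reliability table: predictions are grouped into probability bins, and the mean predicted risk in each bin is compared with the observed event rate. The sketch below is one simple way to build such a table and a Brier score with the standard library. The predicted probabilities and outcomes are hypothetical, and the bin count is an arbitrary choice.

```python
def reliability_table(probs, labels, n_bins=5):
    """Group predictions into equal-width probability bins and compare the
    mean predicted risk with the observed event rate in each bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        k = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[k].append((p, y))
    table = []
    for k, members in enumerate(bins):
        if not members:
            continue
        mean_pred = sum(p for p, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        table.append((k, len(members), mean_pred, observed))
    return table

def brier_score(probs, labels):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)

# Hypothetical held-out predictions and outcomes.
probs = [0.05, 0.10, 0.15, 0.30, 0.35, 0.55, 0.65, 0.85, 0.90, 0.95]
outcomes = [0, 0, 0, 0, 1, 1, 0, 1, 1, 1]

table = reliability_table(probs, outcomes)
for bin_index, count, mean_pred, observed in table:
    print(bin_index, count, round(mean_pred, 3), round(observed, 3))
print("Brier score:", brier_score(probs, outcomes))
```

A well-calibrated model shows mean predicted risk close to the observed rate in every bin. A model trained on SMOTE-balanced data and evaluated on unbalanced data may overstate risk, and a table like this is one way that would show up.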

Circularity Check

1 steps flagged

SMOTE and feature selection on full dataset before CV creates leakage, making 97.2% accuracy a fitted metric

specific steps
  1. fitted input called prediction [Abstract (preprocessing and CV description)]
    "followed by rigorous data preprocessing, including handling missing values, outlier removal, and class imbalance correction using Synthetic Minority Over-sampling Technique (SMOTE). To streamline feature selection, point-biserial correlation and random-forest Gini importance were utilized, and ten varied algorithms (encompassing tree ensembles, boosting, kernel methods, and a multilayer neural network) were optimized using stratified five-fold cross-validation. ... The proposed model achieved an accuracy rate of 97.2% and an F1-score of 97.15%"

    Preprocessing and feature selection are presented as completed on the full cohort before CV is applied. When SMOTE and correlation/importance thresholds use the entire dataset, test-fold records shape the synthetic minority samples and the retained feature set; the subsequent CV therefore measures performance on a transformed dataset that already contains test information, forcing the 97.2% accuracy and the claimed improvement over LightGBM by construction rather than by independent prediction.
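
The mechanism described here can be illustrated with a toy simulation. It is a deliberately simplified sketch: records are represented only by index, synthetic rows are built from two randomly chosen minority parents instead of nearest neighbours, and the counts (20 minority records, 80 synthetic rows, 4 held out) are invented. It counts how many synthetic rows have a parent that later lands in the test fold, and contrasts this with oversampling restricted to training records.

```python
import random

rng = random.Random(0)

# Hypothetical cohort: 20 minority records, identified only by index.
minority_ids = list(range(20))

# Global oversampling: each synthetic row is built from two real minority parents.
synthetic = [tuple(rng.sample(minority_ids, 2)) for _ in range(80)]

# Only afterwards is one fifth of the real minority records held out as a test fold.
test_ids = set(rng.sample(minority_ids, 4))

# A synthetic row carries test information if either parent is a test record.
leaky = [pair for pair in synthetic if test_ids & set(pair)]
leak_fraction = len(leaky) / len(synthetic)

# Nested alternative: parents are drawn from training records only.
train_ids = [i for i in minority_ids if i not in test_ids]
nested = [tuple(rng.sample(train_ids, 2)) for _ in range(80)]
nested_leaky = [pair for pair in nested if test_ids & set(pair)]

print("global oversampling, rows with a test parent:", leak_fraction)
print("nested oversampling, rows with a test parent:", len(nested_leaky))
```

With random parents, the expected share of affected rows under global oversampling is 1 - (16/20)(15/19), about 37 percent. The share under real nearest-neighbour SMOTE would differ, and this sketch says nothing about how much such overlap changes the reported metrics.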

full rationale

The paper describes preprocessing (outlier removal, SMOTE, point-biserial correlation, RF Gini feature selection) followed by stratified 5-fold CV on the 4,981-record cohort, with no indication that these steps were nested inside training folds. This allows test-set information to influence synthetic samples and selected features, so the reported accuracy/F1 (and the gap over LightGBM) reduces to an in-sample fit rather than generalization. This matches the fitted-input-called-prediction pattern and justifies the reader's 6.0 score. No self-citation or definitional circularity is present; the derivation is otherwise self-contained but the evaluation step is not.

Axiom & Free-Parameter Ledger

3 free parameters · 2 axioms · 0 invented entities

The central performance claim rests on standard machine-learning assumptions plus several data-processing choices that are tuned to the given dataset.

free parameters (3)
  • SMOTE oversampling ratio
    Chosen to balance the minority stroke class; exact ratio not stated in abstract but directly affects training distribution.
  • Number of retained features
    Set to ten after correlation and Gini ranking; the cutoff is a modeling decision that influences all downstream results.
  • Base-model hyperparameters
    Optimized inside cross-validation but specific values are not reported; these control the individual learners whose outputs feed the meta-learner.
axioms (2)
  • domain assumption Samples are independent and identically distributed.
    Required for the validity of cross-validation estimates.
  • ad hoc to paper SMOTE-generated samples preserve the true conditional distribution of stroke given the features.
    Invoked when oversampling is used to correct class imbalance.
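
To show why the first free parameter in this ledger matters, the sketch below computes how many synthetic minority rows a given target ratio implies. The function name and the class counts are hypothetical, since the ratio and counts the authors used are not stated here. The ratio is defined as minority size divided by majority size after resampling.

```python
def synthetic_needed(n_minority, n_majority, target_ratio=1.0):
    """Synthetic minority rows needed so that minority / majority == target_ratio."""
    if not 0 < target_ratio <= 1:
        raise ValueError("target_ratio must be in (0, 1]")
    target_minority = round(target_ratio * n_majority)
    # No rows are added if the minority class already meets the target.
    return max(0, target_minority - n_minority)

# Hypothetical training split: 50 stroke records, 950 non-stroke records.
full_balance = synthetic_needed(50, 950)        # ratio 1.0
half_balance = synthetic_needed(50, 950, 0.5)   # ratio 0.5
print("synthetic rows for full balance:", full_balance)
print("synthetic rows for half balance:", half_balance)
```

At full balance in this example, synthetic rows outnumber real minority rows eighteen to one, which is why the ledger's second axiom about SMOTE preserving the conditional distribution carries so much weight.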

pith-pipeline@v0.9.0 · 5817 in / 1646 out tokens · 58813 ms · 2026-05-22T13:39:00.497739+00:00 · methodology


