
arXiv: 2509.02826 · v2 · submitted 2025-09-02 · 💻 cs.LG · cs.AI · stat.AP · stat.CO

Ensemble Learning for Healthcare: A Comparative Analysis of Hybrid Voting and Ensemble Stacking in Obesity Risk Prediction

Pith reviewed 2026-05-18 19:03 UTC · model grok-4.3

classification: 💻 cs.LG · cs.AI · stat.AP · stat.CO
keywords: ensemble learning · obesity risk prediction · stacking · majority voting · machine learning · healthcare prediction · base learners · comparative analysis

The pith

Ensemble stacking outperforms hybrid majority voting for obesity risk prediction, especially on complex datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests hybrid majority voting against ensemble stacking for predicting obesity risk on two health datasets. It evaluates nine algorithms across fifty hyperparameter configurations, keeps the top three models as base learners, applies balancing and outlier detection, then builds majority hard voting, weighted hard voting, and stacking with a multi-layer perceptron meta-classifier. Stacking matches or exceeds the voting methods, showing its clearest advantage on the dataset with more intricate patterns. A sympathetic reader would care because improved risk models could support earlier interventions for a condition strongly linked to diabetes, heart disease, and cancer. The work positions stacking as the stronger option when data complexity rises while treating voting as a reliable simpler choice.

Core claim

On Dataset-1 weighted hard voting and stacking both reached accuracy 0.920304 and F1-score near 0.920, outperforming majority hard voting. On Dataset-2 stacking achieved accuracy 0.989837 and F1 0.989825, beating majority hard voting at accuracy 0.981707 while weighted hard voting performed worst. The results establish that stacking supplies stronger predictive capability for complex data distributions, with hybrid majority voting remaining a robust alternative.

What carries the argument

Ensemble construction from the top three base learners chosen from nine machine learning algorithms, assembled either as hybrid hard voting (majority or weighted) or as stacking with a multi-layer perceptron meta-classifier.
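The three ensemble constructions can be sketched with scikit-learn as follows. This is a minimal illustration, not the authors' pipeline: the synthetic data, the choice of random forest, gradient boosting, and logistic regression as the "top three" base learners, the voting weights `[2, 2, 1]`, and the MLP layer size are all assumptions made for the example, since this page does not name the selected models or the weighting scheme.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data; the paper's obesity datasets are not reproduced here.
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=6, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)


def base_learners():
    # Hypothetical "top three" base learners; fresh instances per ensemble.
    return [
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
    ]


ensembles = {
    # Each base learner casts one vote; the most frequent label wins.
    "majority_hard_voting": VotingClassifier(base_learners(), voting="hard"),
    # Votes are multiplied by per-model weights (assumed values) before counting.
    "weighted_hard_voting": VotingClassifier(
        base_learners(), voting="hard", weights=[2, 2, 1]
    ),
    # Out-of-fold base-learner predictions become features for an MLP meta-classifier.
    "stacking_mlp": StackingClassifier(
        base_learners(),
        final_estimator=MLPClassifier(
            hidden_layer_sizes=(16,), max_iter=1000, random_state=0
        ),
        cv=5,
    ),
}

results = {}
for name, model in ensembles.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "f1_weighted": f1_score(y_test, pred, average="weighted"),
    }

for name, scores in results.items():
    print(name, scores)
```

The structural difference is that voting combines the base learners' labels with a fixed rule, while stacking learns the combination rule from data, which is one plausible reason it could help more when class boundaries are intricate.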

If this is right

  • Stacking is preferable when obesity data exhibits complex distributions.
  • Hybrid majority voting serves as a dependable lower-complexity option.
  • Tuning and selecting multiple base learners before ensembling improves reliability for healthcare tasks.
  • The comparative results can inform model choice in other medical prediction settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same stacking preference may appear in risk models for related conditions such as diabetes.
  • Real-world clinical streams with missing values could narrow or widen the observed performance gap.
  • Varying the meta-classifier beyond the multi-layer perceptron might further optimize stacking results.

Load-bearing premise

Selecting the top three models after evaluating fifty hyperparameter configurations, combined with dataset balancing and outlier detection, yields an unbiased comparison of the two ensemble approaches, free of selection effects and data artifacts.
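Why this premise matters can be shown with a small simulation. The sketch below is illustrative only and uses assumed numbers: every one of 50 configurations is given the same true accuracy (0.90) on a 500-item evaluation set, so any spread in observed scores is pure sampling noise. Picking the best observed score on that same set still produces a figure above the true accuracy.

```python
import random


def simulate_selection_bias(n_configs=50, n_eval=500, true_acc=0.90, seed=0):
    """Compare the best-of-n observed accuracy with a fresh-data re-score."""
    rng = random.Random(seed)

    def observed_accuracy():
        # Each prediction is correct independently with probability true_acc.
        return sum(rng.random() < true_acc for _ in range(n_eval)) / n_eval

    # Score every configuration on the SAME evaluation set used for reporting.
    selection_scores = [observed_accuracy() for _ in range(n_configs)]
    best_on_selection = max(selection_scores)
    # Re-score the winner on fresh data: its expected accuracy is still true_acc.
    best_on_fresh = observed_accuracy()
    return best_on_selection, best_on_fresh


reported, fresh = simulate_selection_bias()
print(f"best-of-50 on selection data: {reported:.3f}")
print(f"same config on fresh data:    {fresh:.3f}")
```

The inflation here comes only from taking a maximum over noisy scores. Whether it is large enough to matter for the paper's results depends on test-set sizes that this page does not report.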

What would settle it

On a new obesity dataset processed identically, stacking fails to match or exceed the accuracy and F1-score of the voting ensembles.

Original abstract

Obesity is a critical global health issue driven by dietary, physiological, and environmental factors, and is strongly associated with chronic diseases such as diabetes, cardiovascular disorders, and cancer. Machine learning has emerged as a promising approach for early obesity risk prediction, yet a comparative evaluation of ensemble techniques -- particularly hybrid majority voting and ensemble stacking -- remains limited. This study aims to compare hybrid majority voting and ensemble stacking methods for obesity risk prediction, identifying which approach delivers higher accuracy and efficiency. The analysis seeks to highlight the complementary strengths of these ensemble techniques in guiding better predictive model selection for healthcare applications. Two datasets were utilized to evaluate three ensemble models: Majority Hard Voting, Weighted Hard Voting, and Stacking (with a Multi-Layer Perceptron as meta-classifier). A pool of nine Machine Learning (ML) algorithms, evaluated across a total of 50 hyperparameter configurations, was analyzed to identify the top three models to serve as base learners for the ensemble methods. Preprocessing steps involved dataset balancing, and outlier detection, and model performance was evaluated using Accuracy and F1-Score. On Dataset-1, weighted hard voting and stacking achieved nearly identical performance (Accuracy: 0.920304, F1: 0.920070), outperforming majority hard voting. On Dataset-2, stacking demonstrated superior results (Accuracy: 0.989837, F1: 0.989825) compared to majority hard voting (Accuracy: 0.981707, F1: 0.981675) and weighted hard voting, which showed the lowest performance. The findings confirm that ensemble stacking provides stronger predictive capability, particularly for complex data distributions, while hybrid majority voting remains a robust alternative.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript compares three ensemble methods—Majority Hard Voting, Weighted Hard Voting, and Stacking (MLP meta-learner)—for obesity risk prediction on two datasets. Nine base ML algorithms are evaluated across 50 hyperparameter configurations to select the top three as base learners. After preprocessing (balancing and outlier detection), Accuracy and F1-Score are reported: on Dataset-1, weighted voting and stacking reach ~0.9203 accuracy; on Dataset-2, stacking reaches 0.9898 accuracy and outperforms the voting variants. The authors conclude that stacking provides stronger predictive capability for complex data distributions while majority voting remains a robust alternative.

Significance. If the comparative results hold under unbiased evaluation, the work offers practical empirical guidance on ensemble selection for healthcare risk prediction tasks. The use of two datasets and concrete numeric results (Accuracy/F1) is a strength, but the absence of error bars, statistical tests, or nested validation limits the strength of claims about superiority for 'complex data distributions.' The contribution is incremental rather than foundational.

major comments (1)
  1. [Methods / Experimental Setup] The model selection pipeline (evaluation of nine algorithms over 50 hyperparameter configurations to choose the top three base learners for all ensembles) is performed on the same data regime later used to report final Accuracy/F1 on Dataset-1 (0.9203) and Dataset-2 (0.9898 for stacking). This introduces selection bias that directly inflates the headline numbers and prevents clean attribution of the observed gap (especially the ~0.008 difference on Dataset-2) to the ensemble method itself rather than to which models were permitted to participate. Nested cross-validation or a held-out selection set is required to support the central claim.
minor comments (2)
  1. [Data Description] Dataset-1 and Dataset-2 are referenced only by number; their sources, sizes, feature counts, and class distributions should be stated explicitly in the data section for reproducibility.
  2. [Results] No standard deviations, confidence intervals, or statistical significance tests accompany the reported Accuracy and F1 values; adding these would strengthen the comparison between ensembles.
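One way to supply the uncertainty estimates requested in the second minor comment is a paired bootstrap over test items. The sketch below is an assumption about how this could be done, not something the paper reports: the function name `paired_bootstrap_accuracy_diff`, the percentile interval, and the synthetic predictions are all hypothetical.

```python
import random


def paired_bootstrap_accuracy_diff(
    y_true, pred_a, pred_b, n_boot=2000, alpha=0.05, seed=0
):
    """Return (observed_diff, ci_low, ci_high) for accuracy(A) - accuracy(B)."""
    n = len(y_true)
    # Per-item difference in correctness: +1, 0, or -1.
    deltas = [int(a == t) - int(b == t) for t, a, b in zip(y_true, pred_a, pred_b)]
    observed = sum(deltas) / n
    rng = random.Random(seed)
    boot = []
    for _ in range(n_boot):
        # Resample test items with replacement, keeping the A/B pairing intact.
        sample = rng.choices(deltas, k=n)
        boot.append(sum(sample) / n)
    boot.sort()
    # Percentile interval from the sorted bootstrap differences.
    low = boot[int((alpha / 2) * n_boot)]
    high = boot[int((1 - alpha / 2) * n_boot) - 1]
    return observed, low, high


# Hypothetical predictions: model A errs on about 1% of items, model B on about 2%.
data_rng = random.Random(42)
y_true = [data_rng.randrange(4) for _ in range(1000)]
pred_a = [t if data_rng.random() > 0.01 else (t + 1) % 4 for t in y_true]
pred_b = [t if data_rng.random() > 0.02 else (t + 1) % 4 for t in y_true]

diff, low, high = paired_bootstrap_accuracy_diff(y_true, pred_a, pred_b)
print(f"accuracy difference: {diff:.4f}, 95% CI [{low:.4f}, {high:.4f}]")
```

If the interval for stacking minus majority voting on Dataset-2 contained zero, the reported gap of roughly 0.008 would not by itself distinguish the two methods.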

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We appreciate the emphasis on methodological rigor in the experimental setup. Below we provide a point-by-point response to the major comment, explaining our position and outlining the revisions we will make.

Point-by-point responses
  1. Referee: [Methods / Experimental Setup] The model selection pipeline (evaluation of nine algorithms over 50 hyperparameter configurations to choose the top three base learners for all ensembles) is performed on the same data regime later used to report final Accuracy/F1 on Dataset-1 (0.9203) and Dataset-2 (0.9898 for stacking). This introduces selection bias that directly inflates the headline numbers and prevents clean attribution of the observed gap (especially the ~0.008 difference on Dataset-2) to the ensemble method itself rather than to which models were permitted to participate. Nested cross-validation or a held-out selection set is required to support the central claim.

    Authors: We agree that conducting the base-learner selection and hyperparameter search on the same data later used for final reporting can introduce selection bias and produce somewhat optimistic absolute performance figures. This is a legitimate methodological concern. At the same time, because the identical selection procedure (nine algorithms, 50 configurations, top-three base learners) was applied uniformly to all three ensemble methods, the relative comparisons between Majority Hard Voting, Weighted Hard Voting, and Stacking remain internally consistent and are not confounded by differential model selection. The performance gap observed on Dataset-2, where stacking reaches 0.9898 accuracy while the voting variants are lower, can therefore still be attributed to the ensemble strategy itself. To fully address the referee’s point and strengthen the claims, we will revise the manuscript to adopt nested cross-validation: an outer loop for unbiased performance estimation and an inner loop for model selection and hyperparameter tuning. Updated results, methodology description, and any changes to the reported numbers will be included in the revised version.

    revision: yes
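The nested cross-validation described in the response can be sketched with scikit-learn. This is a minimal illustration under assumptions: the synthetic data, the two candidate algorithms, the small grids, and the fold counts are placeholders, not the nine algorithms or 50 configurations from the paper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption); replace with the real dataset.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5, random_state=0
)

pipe = Pipeline(
    [("scale", StandardScaler()), ("clf", LogisticRegression(max_iter=1000))]
)

# Inner search space: algorithm choice and hyperparameters are tuned together.
param_grid = [
    {"clf": [LogisticRegression(max_iter=1000)], "clf__C": [0.1, 1.0, 10.0]},
    {"clf": [RandomForestClassifier(random_state=0)], "clf__n_estimators": [50, 100]},
]

# Inner loop: model selection. Outer loop: performance estimation.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

search = GridSearchCV(pipe, param_grid, cv=inner_cv, scoring="accuracy")

# Each outer fold reruns the whole search on its training part only, so the
# held-out part never influences which model or configuration is chosen.
outer_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="accuracy")
print(f"nested CV accuracy: {outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")
```

The outer-fold spread also gives a rough standard deviation for the reported accuracy, which speaks to the referee's request for error estimates.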

Circularity Check

0 steps flagged

No significant circularity in empirical ensemble comparison

Full rationale

The paper reports direct experimental results from training nine base ML models across 50 hyperparameter configurations on two obesity datasets, selecting the top three, and measuring Accuracy/F1 for Majority Hard Voting, Weighted Hard Voting, and Stacking ensembles. These are standard held-out performance metrics with no mathematical derivation, first-principles claim, or quantity that reduces by construction to the selection step or fitted inputs. No self-citations, uniqueness theorems, or ansatzes are invoked in a load-bearing way. The study is self-contained empirical benchmarking without logical loops.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claims rest on standard supervised learning assumptions and empirical tuning rather than new theoretical constructs; the main added cost is the specific hyperparameter search and preprocessing choices.

free parameters (1)
  • Hyperparameter configurations for base learners = 50 configurations
    Fifty configurations across nine algorithms were evaluated to select the top three base learners for the ensembles.
axioms (1)
  • domain assumption: The two datasets are representative of real-world obesity risk factors and suitable for supervised prediction after balancing and outlier removal.
    Invoked to justify applying the models to healthcare risk prediction.


