Ensemble Learning for Healthcare: A Comparative Analysis of Hybrid Voting and Ensemble Stacking in Obesity Risk Prediction
Pith reviewed 2026-05-18 19:03 UTC · model grok-4.3
The pith
Ensemble stacking outperforms hybrid majority voting for obesity risk prediction, especially on complex datasets.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
On Dataset-1 weighted hard voting and stacking both reached accuracy 0.920304 and F1-score near 0.920, outperforming majority hard voting. On Dataset-2 stacking achieved accuracy 0.989837 and F1 0.989825, beating majority hard voting at accuracy 0.981707 while weighted hard voting performed worst. The results establish that stacking supplies stronger predictive capability for complex data distributions, with hybrid majority voting remaining a robust alternative.
What carries the argument
Ensemble construction from the top three base learners chosen from nine machine learning algorithms, assembled either as hybrid hard voting (majority or weighted) or as stacking with a multi-layer perceptron meta-classifier.
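The two hard-voting rules named here can be sketched in a few lines of plain Python. This is an illustration only, not the authors' implementation: the labels, the three base-learner predictions, and the weights are all hypothetical, and the summary does not say how the paper derived its voting weights or broke ties.

```python
from collections import Counter, defaultdict


def majority_vote(predictions):
    """Return the label predicted by the most base learners.

    Ties go to the label seen first (an assumption; the source does not
    describe a tie-breaking rule).
    """
    return Counter(predictions).most_common(1)[0][0]


def weighted_vote(predictions, weights):
    """Return the label whose supporting learners have the largest summed weight."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


# Hypothetical predictions from three base learners for one patient.
preds = ["obese", "normal", "normal"]
# Hypothetical weights, for example each learner's validation accuracy.
weights = [0.95, 0.40, 0.45]

majority_label = majority_vote(preds)
weighted_label = weighted_vote(preds, weights)
print(majority_label, weighted_label)
```

With these made-up inputs the two rules disagree: two of three learners say "normal", but the single heavily weighted learner carries the weighted vote. That is the mechanism by which weighted voting can land above or below majority voting on a given dataset.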
If this is right
- Stacking is preferable when obesity data exhibits complex distributions.
- Hybrid majority voting serves as a dependable lower-complexity option.
- Tuning and selecting multiple base learners before ensembling improves reliability for healthcare tasks.
- The comparative results can inform model choice in other medical prediction settings.
Where Pith is reading between the lines
- The same stacking preference may appear in risk models for related conditions such as diabetes.
- Real-world clinical streams with missing values could narrow or widen the observed performance gap.
- Varying the meta-classifier beyond the multi-layer perceptron might further optimize stacking results.
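The last point can be tried out with a short experiment. The sketch below assumes scikit-learn is installed and uses synthetic data rather than the obesity datasets. The three base learners, the two meta-classifiers, and all hyperparameters are arbitrary stand-ins, not the paper's selected models.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class stand-in data (not the paper's datasets).
X, y = make_classification(n_samples=400, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Arbitrary base learners standing in for the paper's "top three".
base_learners = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
    ("tree", DecisionTreeClassifier(max_depth=5, random_state=0)),
]

# Candidate meta-classifiers: an MLP (as in the paper) and a simpler alternative.
meta_classifiers = {
    "mlp": MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
    "logreg": LogisticRegression(max_iter=1000),
}

scores = {}
for name, meta in meta_classifiers.items():
    # cv=5: the meta-classifier is trained on out-of-fold base predictions.
    stack = StackingClassifier(estimators=base_learners, final_estimator=meta, cv=5)
    stack.fit(X_train, y_train)
    scores[name] = stack.score(X_test, y_test)

print(scores)
```

Whichever meta-classifier scores higher on this toy data says nothing about the obesity datasets; the point is only that the comparison is cheap to run once the base learners are fixed.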
Load-bearing premise
Selecting the top three models after evaluating fifty hyperparameter configurations plus dataset balancing and outlier detection produces an unbiased comparison of the two ensemble approaches without selection effects or data artifacts.
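A small simulation shows why this premise matters. In the sketch below every one of 50 simulated configurations has the same true accuracy, yet the three that look best on a finite evaluation set appear better than they are. The true accuracy of 0.90 and the evaluation-set size of 200 are hypothetical values chosen for illustration; only the count of 50 configurations and the top-three rule come from the source.

```python
import random

rng = random.Random(0)

TRUE_ACCURACY = 0.90   # every simulated configuration is equally good
N_CONFIGS = 50         # matches the 50 configurations in the source
N_TEST = 200           # hypothetical evaluation-set size
TOP_K = 3


def observed_accuracy(n_samples):
    """Accuracy of one configuration measured on n_samples random cases."""
    correct = sum(rng.random() < TRUE_ACCURACY for _ in range(n_samples))
    return correct / n_samples


# Score every configuration on a finite sample, then keep the top three.
scores = [observed_accuracy(N_TEST) for _ in range(N_CONFIGS)]
top_scores = sorted(scores, reverse=True)[:TOP_K]
selected_mean = sum(top_scores) / TOP_K

# Re-score the selected configurations on fresh data. All configurations
# share one true accuracy, so a fresh draw is an unbiased estimate.
fresh_mean = sum(observed_accuracy(N_TEST) for _ in range(TOP_K)) / TOP_K

print(f"mean accuracy of top {TOP_K} on selection data: {selected_mean:.3f}")
print(f"mean accuracy of the same models on fresh data: {fresh_mean:.3f}")
```

The gap between the two printed numbers is pure selection effect, since no configuration is actually better than another. Any real comparison that selects and reports on the same data carries some amount of this optimism.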
What would settle it
On a new obesity dataset processed identically, stacking fails to match or exceed the accuracy and F1-score of the voting ensembles.
Original abstract

Obesity is a critical global health issue driven by dietary, physiological, and environmental factors, and is strongly associated with chronic diseases such as diabetes, cardiovascular disorders, and cancer. Machine learning has emerged as a promising approach for early obesity risk prediction, yet a comparative evaluation of ensemble techniques -- particularly hybrid majority voting and ensemble stacking -- remains limited. This study aims to compare hybrid majority voting and ensemble stacking methods for obesity risk prediction, identifying which approach delivers higher accuracy and efficiency. The analysis seeks to highlight the complementary strengths of these ensemble techniques in guiding better predictive model selection for healthcare applications. Two datasets were utilized to evaluate three ensemble models: Majority Hard Voting, Weighted Hard Voting, and Stacking (with a Multi-Layer Perceptron as meta-classifier). A pool of nine Machine Learning (ML) algorithms, evaluated across a total of 50 hyperparameter configurations, was analyzed to identify the top three models to serve as base learners for the ensemble methods. Preprocessing steps involved dataset balancing, and outlier detection, and model performance was evaluated using Accuracy and F1-Score. On Dataset-1, weighted hard voting and stacking achieved nearly identical performance (Accuracy: 0.920304, F1: 0.920070), outperforming majority hard voting. On Dataset-2, stacking demonstrated superior results (Accuracy: 0.989837, F1: 0.989825) compared to majority hard voting (Accuracy: 0.981707, F1: 0.981675) and weighted hard voting, which showed the lowest performance. The findings confirm that ensemble stacking provides stronger predictive capability, particularly for complex data distributions, while hybrid majority voting remains a robust alternative.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript compares three ensemble methods—Majority Hard Voting, Weighted Hard Voting, and Stacking (MLP meta-learner)—for obesity risk prediction on two datasets. Nine base ML algorithms are evaluated across 50 hyperparameter configurations to select the top three as base learners. After preprocessing (balancing and outlier detection), Accuracy and F1-Score are reported: on Dataset-1, weighted voting and stacking reach ~0.9203 accuracy; on Dataset-2, stacking reaches 0.9898 accuracy and outperforms the voting variants. The authors conclude that stacking provides stronger predictive capability for complex data distributions while majority voting remains a robust alternative.
Significance. If the comparative results hold under unbiased evaluation, the work offers practical empirical guidance on ensemble selection for healthcare risk prediction tasks. The use of two datasets and concrete numeric results (Accuracy/F1) is a strength, but the absence of error bars, statistical tests, or nested validation limits the strength of claims about superiority for 'complex data distributions.' The contribution is incremental rather than foundational.
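To make the missing error bars concrete, a Wilson score interval can be computed around each reported accuracy. The sketch below is a rough illustration: the test-set size of 500 is a hypothetical value, because the summary does not state how many cases were evaluated, and treating the two accuracies as independent proportions ignores that both models were scored on the same cases.

```python
import math


def wilson_interval(accuracy, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    denom = 1.0 + z * z / n
    centre = (accuracy + z * z / (2 * n)) / denom
    half = z * math.sqrt(accuracy * (1 - accuracy) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


N_TEST = 500  # hypothetical; the source does not give test-set sizes

stacking_ci = wilson_interval(0.989837, N_TEST)  # Dataset-2 stacking accuracy
majority_ci = wilson_interval(0.981707, N_TEST)  # Dataset-2 majority voting accuracy

# True when the lower end of the stacking interval sits below the
# upper end of the majority-voting interval.
intervals_overlap = stacking_ci[0] <= majority_ci[1]

print("stacking:", stacking_ci)
print("majority:", majority_ci)
print("overlap:", intervals_overlap)
```

At this assumed size the two intervals overlap, so the reported gap of about 0.008 would not be distinguishable from sampling noise on that basis alone. A larger test set would narrow both intervals.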
major comments (1)
- [Methods / Experimental Setup] The model selection pipeline (evaluation of nine algorithms over 50 hyperparameter configurations to choose the top three base learners for all ensembles) is performed on the same data regime later used to report final Accuracy/F1 on Dataset-1 (0.9203) and Dataset-2 (0.9898 for stacking). This introduces selection bias that directly inflates the headline numbers and prevents clean attribution of the observed gap (especially the ~0.008 difference on Dataset-2) to the ensemble method itself rather than to which models were permitted to participate. Nested cross-validation or a held-out selection set is required to support the central claim.
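The nested cross-validation the referee asks for can be sketched without any library. In the plain-Python example below an inner loop picks a hyperparameter using only the outer-training portion, and the outer loop scores that choice on data never seen during selection. The 1-D synthetic data, the k-nearest-neighbour model, and the candidate grid are hypothetical stand-ins for the paper's nine algorithms and 50 configurations.

```python
import random


def k_fold_indices(n, k, rng):
    """Shuffle range(n) and split it into k nearly equal, disjoint folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def knn_predict(train, x, k):
    """Predict a 0/1 label by majority vote among the k nearest training points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return int(sum(label for _, label in nearest) * 2 > len(nearest))


def accuracy(train, test, k):
    return sum(knn_predict(train, x, k) == y for x, y in test) / len(test)


def inner_cv_score(rows, k, folds):
    """Mean validation accuracy of hyperparameter k across the inner folds."""
    total = 0.0
    for m, val_idx in enumerate(folds):
        val = [rows[j] for j in val_idx]
        fit = [rows[j] for f, fold in enumerate(folds) if f != m for j in fold]
        total += accuracy(fit, val, k)
    return total / len(folds)


rng = random.Random(0)
# Synthetic data: label is 1 when x > 0.5, with 10% label noise.
data = []
for _ in range(150):
    x = rng.random()
    y = int(x > 0.5)
    if rng.random() < 0.1:
        y = 1 - y
    data.append((x, y))

candidate_ks = [1, 5, 15]  # hypothetical hyperparameter grid
outer_folds = k_fold_indices(len(data), 5, rng)
outer_scores = []

for i, test_idx in enumerate(outer_folds):
    outer_test = [data[j] for j in test_idx]
    outer_train = [data[j] for f, fold in enumerate(outer_folds) if f != i for j in fold]

    # Inner loop: model selection sees only the outer-training portion.
    inner_folds = k_fold_indices(len(outer_train), 3, rng)
    best_k = max(candidate_ks, key=lambda k: inner_cv_score(outer_train, k, inner_folds))

    # Outer loop: score the selected k on the held-out outer fold.
    outer_scores.append(accuracy(outer_train, outer_test, best_k))

nested_estimate = sum(outer_scores) / len(outer_scores)
print(f"nested CV accuracy estimate: {nested_estimate:.3f}")
```

For the paper's setting, the inner loop would also cover choosing the top three base learners and fitting the ensemble, so that every selection decision stays inside the outer-training data.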
minor comments (2)
- [Data Description] Dataset-1 and Dataset-2 are referenced only by number; their sources, sizes, feature counts, and class distributions should be stated explicitly in the data section for reproducibility.
- [Results] No standard deviations, confidence intervals, or statistical significance tests accompany the reported Accuracy and F1 values; adding these would strengthen the comparison between ensembles.
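One significance test suited to this comparison is McNemar's exact test, which compares two classifiers evaluated on the same test cases by looking only at the cases where they disagree. The sketch below uses the standard library only. The labels and predictions are invented for illustration and are not the paper's results.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar test for two classifiers on the same cases.

    Returns (b, c, p_value): b counts cases only classifier A got right,
    c counts cases only classifier B got right.
    """
    b = sum(a == t and p != t for t, a, p in zip(y_true, pred_a, pred_b))
    c = sum(a != t and p == t for t, a, p in zip(y_true, pred_a, pred_b))
    n = b + c
    if n == 0:
        return b, c, 1.0  # the classifiers never disagree
    # Under the null hypothesis each disagreement favours A or B with p = 0.5.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Hypothetical labels and predictions for 12 test cases.
y_true   = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
voting   = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
stacking = [0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1]

b, c, p_value = mcnemar_exact(y_true, voting, stacking)
print(b, c, p_value)
```

In this toy example stacking wins seven of the eight disagreements, yet the two-sided p-value is about 0.07, which shows how many disagreeing cases are needed before a gap becomes statistically convincing.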
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We appreciate the emphasis on methodological rigor in the experimental setup. Below we provide a point-by-point response to the major comment, explaining our position and outlining the revisions we will make.
Point-by-point responses
Referee: [Methods / Experimental Setup] The model selection pipeline (evaluation of nine algorithms over 50 hyperparameter configurations to choose the top three base learners for all ensembles) is performed on the same data regime later used to report final Accuracy/F1 on Dataset-1 (0.9203) and Dataset-2 (0.9898 for stacking). This introduces selection bias that directly inflates the headline numbers and prevents clean attribution of the observed gap (especially the ~0.008 difference on Dataset-2) to the ensemble method itself rather than to which models were permitted to participate. Nested cross-validation or a held-out selection set is required to support the central claim.
Authors: We agree that conducting the base-learner selection and hyperparameter search on the same data later used for final reporting can introduce selection bias and produce somewhat optimistic absolute performance figures. This is a legitimate methodological concern. At the same time, because the identical selection procedure (nine algorithms, 50 configurations, top-three base learners) was applied uniformly to all three ensemble methods, the relative comparisons between Majority Hard Voting, Weighted Hard Voting, and Stacking remain internally consistent and are not confounded by differential model selection. The performance gap observed on Dataset-2, where stacking reaches 0.9898 accuracy while the voting variants are lower, can therefore still be attributed to the ensemble strategy itself. To fully address the referee’s point and strengthen the claims, we will revise the manuscript to adopt nested cross-validation: an outer loop for unbiased performance estimation and an inner loop for model selection and hyperparameter tuning. Updated results, methodology description, and any changes to the reported numbers will be included in the revised version.
Revision: yes
Circularity Check
No significant circularity in empirical ensemble comparison
The paper reports direct experimental results from training nine base ML models across 50 hyperparameter configurations on two obesity datasets, selecting the top three, and measuring Accuracy/F1 for Majority Hard Voting, Weighted Hard Voting, and Stacking ensembles. These are standard held-out performance metrics with no mathematical derivation, first-principles claim, or quantity that reduces by construction to the selection step or fitted inputs. No self-citations, uniqueness theorems, or ansatzes are invoked in a load-bearing way. The study is self-contained empirical benchmarking without logical loops.
Axiom & Free-Parameter Ledger
free parameters (1)
- Hyperparameter configurations for base learners = 50 configurations
axioms (1)
- Domain assumption: The two datasets are representative of real-world obesity risk factors and suitable for supervised prediction after balancing and outlier removal.
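The two preprocessing steps this assumption depends on can be illustrated with the standard library. The source does not say which outlier-detection or balancing method the paper used, so the sketch below substitutes two common choices, Tukey's interquartile-range rule and random oversampling, purely as stand-ins. All data values are hypothetical.

```python
import random
import statistics
from collections import Counter


def iqr_inliers(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if low <= v <= high]


def random_oversample(rows, rng):
    """Duplicate random minority-class rows until every class matches the largest."""
    by_class = {}
    for features, label in rows:
        by_class.setdefault(label, []).append((features, label))
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced


# Hypothetical body weights in kg with one implausible entry.
weights_kg = [52, 58, 61, 64, 67, 70, 73, 76, 400]
cleaned = iqr_inliers(weights_kg)

# Hypothetical (height_cm, weight_kg) rows with a 6:2 class imbalance.
rows = [((170, 65), "normal")] * 6 + [((165, 95), "obese")] * 2
# Oversampling should be applied to training folds only; duplicating rows
# before the train/test split would leak copies into the test set.
balanced = random_oversample(rows, random.Random(0))
class_counts = Counter(label for _, label in balanced)

print(cleaned)
print(class_counts)
```

Whether such steps preserve the representativeness the axiom assumes depends on how many rows they remove or duplicate, which is one reason the dataset sizes and class distributions requested by the referee matter.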
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
- Relation between the paper passage and the cited Recognition theorem: unclear.
- Paper passage: "A pool of nine Machine Learning (ML) algorithms, evaluated across a total of 50 hyperparameter configurations, was analyzed to identify the top three models to serve as base learners for the ensemble methods."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.