Novel GPU Boruta algorithms for feature selection from high-dimensional data
Pith reviewed 2026-05-12 03:51 UTC · model grok-4.3
The pith
Two GPU versions of the Boruta algorithm select features from high-dimensional data much faster than the original while matching its accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors present two GPU-accelerated Boruta variants, one based on permutation importance and one on impurity reduction, that achieve substantial reductions in running time on high-dimensional data while producing feature selections whose accuracy matches that of the sequential Boruta algorithm, as shown by direct comparisons on a self-constructed dataset and several publicly available datasets.
What carries the argument
The parallel GPU implementations of the two Boruta procedures that compute feature importance scores across many trees or permutations simultaneously.
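For orientation, the sketch below shows one round of the shadow-feature mechanism that both GPU variants parallelize. It is a minimal CPU illustration assuming scikit-learn and impurity-based importances; it is not the authors' implementation, and the function name `boruta_round` is ours.

```python
# Minimal sketch of one Boruta round (illustrative; not the authors' code).
# Assumes scikit-learn and uses impurity-based importances for brevity.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def boruta_round(X, y, rng, n_estimators=200):
    """Add column-shuffled 'shadow' copies of every feature, fit a forest,
    and flag the real features whose importance beats the best shadow."""
    n_features = X.shape[1]
    # Shuffling each column independently breaks its link to y while
    # preserving its marginal distribution; that is the shadow baseline.
    shadows = np.column_stack(
        [rng.permutation(X[:, j]) for j in range(n_features)]
    )
    forest = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
    forest.fit(np.hstack([X, shadows]), y)
    imp = forest.feature_importances_
    real, shadow = imp[:n_features], imp[n_features:]
    return real > shadow.max()  # one "hit" per winning feature this round
```

Boruta repeats such rounds and accepts or rejects features by a binomial test on the accumulated hits; the paper's contribution is parallelizing the forest training and permutation work inside each round on the GPU.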
If this is right
- Boruta feature selection on high-dimensional datasets can now complete in practical wall-clock time on commodity GPU hardware.
- The permutation-based GPU version preserves the statistical properties of the original Boruta most closely.
- The impurity-reduction GPU version may assign inflated importance to some features, so downstream analysis should treat its rankings with extra caution.
- Large-scale data analysis becomes more cost-effective because the same hardware now handles the full wrapper procedure.
Where Pith is reading between the lines
- The same parallelization pattern could be applied to other wrapper feature selection methods that rely on repeated model training.
- Reproducibility checks between CPU and GPU runs would benefit from explicit control of random number streams across device boundaries (a minimal seeding sketch follows this list).
- The observed overestimation in the impurity version points to possible adjustments in how importance is aggregated across parallel workers.
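On the RNG point, the sketch below illustrates why seed pinning alone does not settle CPU/GPU equivalence: identical seeds fed to different backends generally produce different streams. CuPy stands in for the device-side library here as an assumption; the abstract does not name the GPU stack.

```python
# Sketch: pinning both RNG streams before a CPU/GPU comparison.
# CuPy is an assumption; the abstract does not name the GPU stack.
import numpy as np
import cupy as cp

SEED = 12345
cpu_rng = np.random.default_rng(SEED)  # stream for the sequential reference
cp.random.seed(SEED)                   # stream for device-side permutations

cpu_perm = cpu_rng.permutation(10)
gpu_perm = cp.asnumpy(cp.random.permutation(10))
# Equal seeds across different backends generally do NOT reproduce the same
# draws, so CPU/GPU agreement has to be checked statistically, not bitwise.
print(np.array_equal(cpu_perm, gpu_perm))
```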
Load-bearing premise
The GPU parallel code produces importance scores and final feature selections that are statistically equivalent to the original sequential Boruta without systematic biases from parallel random generation or memory access.
What would settle it
Run both the original Boruta and the two GPU versions on the same fixed dataset using identical random seeds and check whether the ranked importance values and the final selected feature set differ by more than random variation.
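Concretely, the comparison step could compute rank agreement of the importance vectors and overlap of the selected sets. The sketch below is our illustration, assuming each run returns a per-feature importance vector and a boolean selection mask (the wrapper producing them is left out):

```python
# Sketch of the settling experiment's comparison step (our illustration).
# imp_*: per-feature importance vectors; sel_*: boolean selection masks.
import numpy as np
from scipy.stats import spearmanr

def compare_runs(imp_cpu, sel_cpu, imp_gpu, sel_gpu):
    rho, _ = spearmanr(imp_cpu, imp_gpu)  # agreement of importance rankings
    union = np.logical_or(sel_cpu, sel_gpu).sum()
    jaccard = np.logical_and(sel_cpu, sel_gpu).sum() / union if union else 1.0
    return rho, jaccard  # both near 1.0 would support equivalence
```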
Original abstract
Most feature selection algorithms, especially wrapper methods, run inefficiently on CPU based platforms because of their high computational complexity. This inefficiency makes them unsuitable for processing large scale datasets. To address this challenge, the present study proposed two GPU accelerated versions of the Boruta feature selection procedure, in which Boruta-Permut relies on permutation based feature importance and Boruta-TreeImp employs importance based on impurity reduction. To evaluate these methods we conducted experiments on both a self constructed dataset and several publicly available datasets. The experimental results show that the proposed GPU accelerated algorithms greatly improve computational efficiency while preserving feature selection accuracy comparable to the original Boruta algorithm. In our analysis we also observe that the impurity reduction based version can overestimate the importance of some features. Overall these findings suggest that performing Boruta feature selection on GPUs offers an effective and cost efficient solution for large scale data analysis, which is a good deal.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces two GPU-accelerated variants of the Boruta feature selection algorithm: Boruta-Permut, which employs permutation-based feature importance, and Boruta-TreeImp, which uses impurity reduction for importance scoring. Experiments are performed on a self-constructed dataset and several publicly available datasets, with the abstract claiming that these implementations greatly improve computational efficiency while preserving feature selection accuracy comparable to the original sequential Boruta procedure. The authors additionally note that the impurity-reduction variant can overestimate the importance of some features.
Significance. If the GPU versions faithfully replicate the original Boruta's stochastic processes (random permutations for shadow features and consistent importance rankings) and yield statistically equivalent feature selections, the work would address a practical bottleneck in applying wrapper-based feature selection to high-dimensional data. The dual variants provide implementation choices, and evaluation across multiple datasets offers some breadth. The explicit caveat on impurity overestimation is a useful observation, though it directly qualifies one of the proposed methods.
major comments (3)
- [Abstract] The central claim that the GPU algorithms 'preserve feature selection accuracy comparable to the original Boruta algorithm' is unsupported by quantitative metrics, selected-feature overlap statistics, error bars, or statistical tests comparing GPU and CPU outputs on identical inputs. The assertion is load-bearing for the paper's main contribution, and this absence makes it unevaluable.
- [Abstract] The statement that 'the impurity reduction based version can overestimate the importance of some features' indicates a systematic deviation in Boruta-TreeImp from the original procedure's importance distribution, which is what Boruta uses to compare real features against shadow features. Without further analysis of how the overestimation affects acceptance/rejection thresholds, this directly undermines the comparability claim for the TreeImp variant.
- [Experimental results] (inferred from the abstract) No verification is reported that the GPU implementations produce identical, or noise-equivalent, feature selections to the sequential Boruta on the same random seeds and inputs. Divergences in parallel RNG streams, tree-construction ordering, or floating-point accumulation could alter which features cross significance thresholds, violating the assumption that the efficiency gains preserve statistical behavior.
minor comments (2)
- [Abstract] The abstract's final sentence contains unclear and informal phrasing ('which is a good deal').
- [Abstract] Minor terminology inconsistency: 'self constructed dataset' should be hyphenated as 'self-constructed dataset'.
Simulated Author's Rebuttal
We appreciate the referee's insightful comments on our manuscript. We address each major comment below, providing clarifications and outlining revisions to enhance the rigor of our claims regarding accuracy preservation and implementation fidelity.
Point-by-point responses
- Referee: [Abstract] The central claim that the GPU algorithms 'preserve feature selection accuracy comparable to the original Boruta algorithm' is unsupported by quantitative metrics, selected-feature overlap statistics, error bars, or statistical tests comparing GPU and CPU outputs on identical inputs. The assertion is load-bearing for the paper's main contribution, and this absence makes it unevaluable.
Authors: We agree that quantitative evidence is necessary to support the comparability claim. In the revised version, we will add tables and figures showing feature selection overlap (e.g., percentage of common features selected), mean accuracy differences with standard deviations across multiple runs, and appropriate statistical tests to demonstrate that the GPU variants yield statistically equivalent results to the CPU Boruta. revision: yes
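A sketch of what the promised accuracy comparison could look like, with a paired nonparametric test over repeated runs. The numbers below are placeholders for illustration, not results from the paper:

```python
# Sketch of the promised accuracy comparison across repeated runs.
# The values below are placeholders, not results from the paper.
import numpy as np
from scipy.stats import wilcoxon

acc_cpu = np.array([0.91, 0.90, 0.92, 0.91, 0.89])  # downstream accuracy, CPU-selected features
acc_gpu = np.array([0.90, 0.91, 0.92, 0.90, 0.90])  # same model, GPU-selected features
diff = acc_gpu - acc_cpu
stat, p = wilcoxon(acc_gpu, acc_cpu)  # paired nonparametric test
print(f"mean diff {diff.mean():+.3f} +/- {diff.std(ddof=1):.3f}, Wilcoxon p = {p:.3f}")
```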
- Referee: [Abstract] The statement that 'the impurity reduction based version can overestimate the importance of some features' indicates a systematic deviation in Boruta-TreeImp from the original procedure's importance distribution, which is what Boruta uses to compare real features against shadow features. Without further analysis of how the overestimation affects acceptance/rejection thresholds, this directly undermines the comparability claim for the TreeImp variant.
Authors: The observation of overestimation in the impurity-based variant is indeed noted in our manuscript as a caveat. To address this, we will include additional analysis in the results section comparing the importance score distributions between Boruta-TreeImp and the original Boruta, and evaluate the impact on feature acceptance rates. This will provide a clearer picture of when and how the overestimation affects the final selections. revision: yes
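One way the promised distributional analysis could be operationalized is a two-sample test between the TreeImp importances and the reference importances for the same features. The sketch below assumes SciPy and is our illustration, not the authors' analysis plan:

```python
# Sketch of a distributional check for the overestimation caveat (ours).
# imp_treeimp / imp_reference: importance scores for the same features.
import numpy as np
from scipy.stats import ks_2samp

def overestimation_check(imp_treeimp, imp_reference):
    stat, p = ks_2samp(imp_treeimp, imp_reference)  # distribution-shift test
    shift = np.median(imp_treeimp) - np.median(imp_reference)
    return stat, p, shift  # positive shift = TreeImp scores inflated upward
```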
- Referee: [Experimental results] (inferred from the abstract) No verification is reported that the GPU implementations produce identical, or noise-equivalent, feature selections to the sequential Boruta on the same random seeds and inputs. Divergences in parallel RNG streams, tree-construction ordering, or floating-point accumulation could alter which features cross significance thresholds, violating the assumption that the efficiency gains preserve statistical behavior.
Authors: We recognize the importance of verifying the statistical equivalence of the GPU implementations. While exact replication on identical seeds may not be feasible due to differences in parallel computing environments and RNG implementations, we will report results from repeated experiments with varied seeds, including overlap statistics and consistency measures. We will also add a discussion on potential numerical differences and how they are mitigated. revision: partial
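A simple consistency measure of the kind the response mentions is per-feature selection frequency across seeds. The sketch below assumes each run yields a boolean selection mask; the data layout is hypothetical:

```python
# Sketch of a seed-robustness summary: per-feature selection frequency.
# `selections` is a hypothetical (n_runs, n_features) boolean array.
import numpy as np

def selection_frequency(selections):
    freq = np.asarray(selections, dtype=float).mean(axis=0)
    return freq  # 1.0 = always selected; values near 0.5 flag instability
```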
Circularity Check
No circularity: empirical benchmarking of GPU Boruta ports
Full rationale
The paper introduces two GPU implementations of the pre-existing Boruta algorithm (permutation-based and impurity-based) and validates them solely through runtime and accuracy experiments on external datasets. No equations, derivations, fitted parameters, or self-citations are used to establish the central claims; efficiency gains and 'comparable accuracy' are asserted directly from measured wall-clock times and feature-selection outcomes versus the sequential reference. The single observational note on impurity overestimation is an empirical finding, not a load-bearing premise. The derivation chain is therefore self-contained and non-circular.