Novel GPU Boruta algorithms for feature selection from high-dimensional data
Pith reviewed 2026-05-12 03:51 UTC · model grok-4.3
The pith
Two GPU versions of the Boruta algorithm select features from high-dimensional data much faster than the original while matching its accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors present two GPU-accelerated Boruta variants, one based on permutation importance and one on impurity reduction, that achieve substantial reductions in running time on high-dimensional data while producing feature selections whose accuracy matches that of the sequential Boruta algorithm, as shown by direct comparisons on a self-constructed dataset and several publicly available datasets.
What carries the argument
The parallel GPU implementations of the two Boruta procedures that compute feature importance scores across many trees or permutations simultaneously.
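For orientation, the sketch below shows one round of the shadow-feature mechanism that both GPU variants parallelize. It is a minimal CPU illustration assuming scikit-learn and impurity-based importances; it is not the authors' implementation, and the function name `boruta_round` is ours.

```python
# Minimal sketch of one Boruta round (illustrative; not the authors' code).
# Assumes scikit-learn and uses impurity-based importances for brevity.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def boruta_round(X, y, rng, n_estimators=200):
    """Add column-shuffled 'shadow' copies of every feature, fit a forest,
    and flag the real features whose importance beats the best shadow."""
    n_features = X.shape[1]
    # Shuffling each column independently breaks its link to y while
    # preserving its marginal distribution; that is the shadow baseline.
    shadows = np.column_stack(
        [rng.permutation(X[:, j]) for j in range(n_features)]
    )
    forest = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
    forest.fit(np.hstack([X, shadows]), y)
    imp = forest.feature_importances_
    real, shadow = imp[:n_features], imp[n_features:]
    return real > shadow.max()  # one "hit" per winning feature this round
```

Boruta repeats such rounds and accepts or rejects features by a binomial test on the accumulated hits; the paper's contribution is parallelizing the forest training and permutation work inside each round on the GPU.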
If this is right
- Boruta feature selection on high-dimensional datasets can now complete in practical wall-clock time on commodity GPU hardware.
- The permutation-based GPU version preserves the statistical properties of the original Boruta most closely.
- The impurity-reduction GPU version may assign inflated importance to some features, so downstream analysis should treat its rankings with extra caution.
- Large-scale data analysis becomes more cost-effective because the same hardware now handles the full wrapper procedure.
Where Pith is reading between the lines
- The same parallelization pattern could be applied to other wrapper feature selection methods that rely on repeated model training.
- Reproducibility checks between CPU and GPU runs would benefit from explicit control of random number streams across device boundaries (a minimal seeding sketch follows this list).
- The observed overestimation in the impurity version points to possible adjustments in how importance is aggregated across parallel workers.
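On the RNG point, the sketch below illustrates why seed pinning alone does not settle CPU/GPU equivalence: identical seeds fed to different backends generally produce different streams. CuPy stands in for the device-side library here as an assumption; the abstract does not name the GPU stack.

```python
# Sketch: pinning both RNG streams before a CPU/GPU comparison.
# CuPy is an assumption; the abstract does not name the GPU stack.
import numpy as np
import cupy as cp

SEED = 12345
cpu_rng = np.random.default_rng(SEED)  # stream for the sequential reference
cp.random.seed(SEED)                   # stream for device-side permutations

cpu_perm = cpu_rng.permutation(10)
gpu_perm = cp.asnumpy(cp.random.permutation(10))
# Equal seeds across different backends generally do NOT reproduce the same
# draws, so CPU/GPU agreement has to be checked statistically, not bitwise.
print(np.array_equal(cpu_perm, gpu_perm))
```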
Load-bearing premise
The GPU parallel code produces importance scores and final feature selections that are statistically equivalent to the original sequential Boruta without systematic biases from parallel random generation or memory access.
What would settle it
Run both the original Boruta and the two GPU versions on the same fixed dataset using identical random seeds and check whether the ranked importance values and the final selected feature set differ by more than random variation.
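Concretely, the comparison step could compute rank agreement of the importance vectors and overlap of the selected sets. The sketch below is our illustration, assuming each run returns a per-feature importance vector and a boolean selection mask (the wrapper producing them is left out):

```python
# Sketch of the settling experiment's comparison step (our illustration).
# imp_*: per-feature importance vectors; sel_*: boolean selection masks.
import numpy as np
from scipy.stats import spearmanr

def compare_runs(imp_cpu, sel_cpu, imp_gpu, sel_gpu):
    rho, _ = spearmanr(imp_cpu, imp_gpu)  # agreement of importance rankings
    union = np.logical_or(sel_cpu, sel_gpu).sum()
    jaccard = np.logical_and(sel_cpu, sel_gpu).sum() / union if union else 1.0
    return rho, jaccard  # both near 1.0 would support equivalence
```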
Original abstract
Most feature selection algorithms, especially wrapper methods, run inefficiently on CPU based platforms because of their high computational complexity. This inefficiency makes them unsuitable for processing large scale datasets. To address this challenge, the present study proposed two GPU accelerated versions of the Boruta feature selection procedure, in which Boruta-Permut relies on permutation based feature importance and Boruta-TreeImp employs importance based on impurity reduction. To evaluate these methods we conducted experiments on both a self constructed dataset and several publicly available datasets. The experimental results show that the proposed GPU accelerated algorithms greatly improve computational efficiency while preserving feature selection accuracy comparable to the original Boruta algorithm. In our analysis we also observe that the impurity reduction based version can overestimate the importance of some features. Overall these findings suggest that performing Boruta feature selection on GPUs offers an effective and cost efficient solution for large scale data analysis, which is a good deal.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces two GPU-accelerated variants of the Boruta feature selection algorithm: Boruta-Permut, which employs permutation-based feature importance, and Boruta-TreeImp, which uses impurity reduction for importance scoring. Experiments are performed on a self-constructed dataset and several publicly available datasets, with the abstract claiming that these implementations greatly improve computational efficiency while preserving feature selection accuracy comparable to the original sequential Boruta procedure. The authors additionally note that the impurity-reduction variant can overestimate the importance of some features.
Significance. If the GPU versions faithfully replicate the original Boruta's stochastic processes (random permutations for shadow features and consistent importance rankings) and yield statistically equivalent feature selections, the work would address a practical bottleneck in applying wrapper-based feature selection to high-dimensional data. The dual variants provide implementation choices, and evaluation across multiple datasets offers some breadth. The explicit caveat on impurity overestimation is a useful observation, though it directly qualifies one of the proposed methods.
major comments (3)
- [Abstract] The central claim that the GPU algorithms 'preserve feature selection accuracy comparable to the original Boruta algorithm' is unsupported by quantitative metrics, selected-feature overlap statistics, error bars, or statistical tests comparing GPU and CPU outputs on identical inputs. The assertion is load-bearing for the paper's main contribution, and this absence makes it unevaluable.
- [Abstract] The statement that 'the impurity reduction based version can overestimate the importance of some features' indicates a systematic deviation in Boruta-TreeImp from the original procedure's importance distribution, which is what Boruta uses to compare real features against shadow features. Without further analysis of how the overestimation affects acceptance/rejection thresholds, this directly undermines the comparability claim for the TreeImp variant.
- [Experimental results] (inferred from the abstract) No verification is reported that the GPU implementations produce identical, or noise-equivalent, feature selections to the sequential Boruta on the same random seeds and inputs. Divergences in parallel RNG streams, tree-construction ordering, or floating-point accumulation could alter which features cross significance thresholds, violating the assumption that the efficiency gains preserve statistical behavior.
minor comments (2)
- [Abstract] The abstract's final sentence contains unclear and informal phrasing ('which is a good deal').
- [Abstract] Minor terminology inconsistency: 'self constructed dataset' should be hyphenated as 'self-constructed dataset'.
Simulated Author's Rebuttal
We appreciate the referee's insightful comments on our manuscript. We address each major comment below, providing clarifications and outlining revisions to enhance the rigor of our claims regarding accuracy preservation and implementation fidelity.
Point-by-point responses
- Referee: [Abstract] The central claim that the GPU algorithms 'preserve feature selection accuracy comparable to the original Boruta algorithm' is unsupported by quantitative metrics, selected-feature overlap statistics, error bars, or statistical tests comparing GPU and CPU outputs on identical inputs. The assertion is load-bearing for the paper's main contribution, and this absence makes it unevaluable.
Authors: We agree that quantitative evidence is necessary to support the comparability claim. In the revised version, we will add tables and figures showing feature selection overlap (e.g., percentage of common features selected), mean accuracy differences with standard deviations across multiple runs, and appropriate statistical tests to demonstrate that the GPU variants yield statistically equivalent results to the CPU Boruta. revision: yes
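A sketch of what the promised accuracy comparison could look like, with a paired nonparametric test over repeated runs. The numbers below are placeholders for illustration, not results from the paper:

```python
# Sketch of the promised accuracy comparison across repeated runs.
# The values below are placeholders, not results from the paper.
import numpy as np
from scipy.stats import wilcoxon

acc_cpu = np.array([0.91, 0.90, 0.92, 0.91, 0.89])  # downstream accuracy, CPU-selected features
acc_gpu = np.array([0.90, 0.91, 0.92, 0.90, 0.90])  # same model, GPU-selected features
diff = acc_gpu - acc_cpu
stat, p = wilcoxon(acc_gpu, acc_cpu)  # paired nonparametric test
print(f"mean diff {diff.mean():+.3f} +/- {diff.std(ddof=1):.3f}, Wilcoxon p = {p:.3f}")
```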
- Referee: [Abstract] The statement that 'the impurity reduction based version can overestimate the importance of some features' indicates a systematic deviation in Boruta-TreeImp from the original procedure's importance distribution, which is what Boruta uses to compare real features against shadow features. Without further analysis of how the overestimation affects acceptance/rejection thresholds, this directly undermines the comparability claim for the TreeImp variant.
Authors: The observation of overestimation in the impurity-based variant is indeed noted in our manuscript as a caveat. To address this, we will include additional analysis in the results section comparing the importance score distributions between Boruta-TreeImp and the original Boruta, and evaluate the impact on feature acceptance rates. This will provide a clearer picture of when and how the overestimation affects the final selections. revision: yes
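One way the promised distributional analysis could be operationalized is a two-sample test between the TreeImp importances and the reference importances for the same features. The sketch below assumes SciPy and is our illustration, not the authors' analysis plan:

```python
# Sketch of a distributional check for the overestimation caveat (ours).
# imp_treeimp / imp_reference: importance scores for the same features.
import numpy as np
from scipy.stats import ks_2samp

def overestimation_check(imp_treeimp, imp_reference):
    stat, p = ks_2samp(imp_treeimp, imp_reference)  # distribution-shift test
    shift = np.median(imp_treeimp) - np.median(imp_reference)
    return stat, p, shift  # positive shift = TreeImp scores inflated upward
```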
- Referee: [Experimental results] (inferred from the abstract) No verification is reported that the GPU implementations produce identical, or noise-equivalent, feature selections to the sequential Boruta on the same random seeds and inputs. Divergences in parallel RNG streams, tree-construction ordering, or floating-point accumulation could alter which features cross significance thresholds, violating the assumption that the efficiency gains preserve statistical behavior.
Authors: We recognize the importance of verifying the statistical equivalence of the GPU implementations. While exact replication on identical seeds may not be feasible due to differences in parallel computing environments and RNG implementations, we will report results from repeated experiments with varied seeds, including overlap statistics and consistency measures. We will also add a discussion on potential numerical differences and how they are mitigated. revision: partial
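A simple consistency measure of the kind the response mentions is per-feature selection frequency across seeds. The sketch below assumes each run yields a boolean selection mask; the data layout is hypothetical:

```python
# Sketch of a seed-robustness summary: per-feature selection frequency.
# `selections` is a hypothetical (n_runs, n_features) boolean array.
import numpy as np

def selection_frequency(selections):
    freq = np.asarray(selections, dtype=float).mean(axis=0)
    return freq  # 1.0 = always selected; values near 0.5 flag instability
```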
Circularity Check
No circularity: empirical benchmarking of GPU Boruta ports
Full rationale
The paper introduces two GPU implementations of the pre-existing Boruta algorithm (permutation-based and impurity-based) and validates them solely through runtime and accuracy experiments on external datasets. No equations, derivations, fitted parameters, or self-citations are used to establish the central claims; efficiency gains and 'comparable accuracy' are asserted directly from measured wall-clock times and feature-selection outcomes versus the sequential reference. The single observational note on impurity overestimation is an empirical finding, not a load-bearing premise. The derivation chain is therefore self-contained and non-circular.