MIBoost: A gradient boosting algorithm for variable selection after multiple imputation
Pith reviewed 2026-05-19 03:03 UTC · model grok-4.3
The pith
MIBoost extends gradient boosting for variable selection to multiple imputed datasets via a single combined loss function.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that the unified loss function approach for regularization methods can be adapted to component-wise gradient boosting. MIBoost implements a uniform variable-selection mechanism across imputed datasets, producing a single coherent set of selected variables rather than separate ones for each imputation. Paired with the MIBoostCV routine for tuning, this framework delivers models with predictive performance comparable to other methods in simulation studies.
What carries the argument
The MIBoost algorithm, which integrates a single loss function over multiple imputations to achieve consistent variable selection in component-wise gradient boosting.
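To make the mechanism concrete, here is a minimal Python sketch of component-wise L2 boosting with a shared selection step. It is not the authors' implementation and not the booami API. It assumes centered numeric covariates and outcomes, a squared-error loss, no intercept, and simple linear base learners. The function names (`fit_component`, `mi_boost_sketch`), the step length `nu`, and the choice to keep one coefficient vector per imputed dataset are illustrative assumptions. The one element taken from the paper passage quoted on this page is the selection rule: at each iteration, the covariate with the smallest loss summed over all imputed datasets is chosen for every dataset.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]  # rows = observations, columns = covariates


def fit_component(x_col: Sequence[float], u: Sequence[float]) -> float:
    """Least-squares slope (no intercept) of the residuals u on one covariate."""
    sxx = sum(x * x for x in x_col)
    return sum(x * r for x, r in zip(x_col, u)) / sxx if sxx > 0 else 0.0


def mi_boost_sketch(
    X_list: List[Matrix],
    y_list: List[List[float]],
    n_steps: int = 50,
    nu: float = 0.1,
) -> Tuple[List[List[float]], List[int]]:
    """Component-wise L2 boosting with one shared selection per iteration.

    X_list[m] and y_list[m] hold the m-th imputed dataset (assumed centered).
    Returns per-imputation coefficients and the sequence of selected columns.
    """
    M, p = len(X_list), len(X_list[0][0])
    beta = [[0.0] * p for _ in range(M)]   # assumption: one vector per imputation
    resid = [list(y) for y in y_list]      # for L2 loss, negative gradient = residuals
    path: List[int] = []
    for _ in range(n_steps):
        best_r, best_loss, best_fits = -1, float("inf"), []
        for r in range(p):
            fits, loss = [], 0.0
            for m in range(M):
                col = [row[r] for row in X_list[m]]
                b = fit_component(col, resid[m])
                fits.append(b)
                # Aggregated loss: sum of squared errors over all imputed datasets.
                loss += sum((u - b * x) ** 2 for u, x in zip(resid[m], col))
            if loss < best_loss:
                best_r, best_loss, best_fits = r, loss, fits
        # The same covariate is updated in every imputed dataset.
        for m in range(M):
            step = nu * best_fits[m]
            beta[m][best_r] += step
            resid[m] = [u - step * row[best_r] for u, row in zip(resid[m], X_list[m])]
        path.append(best_r)
    return beta, path
```

Because the selected index is shared, every imputed dataset ends up with the same set of nonzero coefficients, which is the property the section describes. How the per-imputation coefficients would be combined into a final model is not specified on this page and is left out of the sketch.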
If this is right
- It provides a unified variable selection process that avoids the drawbacks of pooling models separately from each imputed dataset.
- Simulations demonstrate that predictive performance remains competitive with established approaches for missing data.
- The method supplies a practical tool for analysts dealing with incomplete datasets in statistical learning tasks.
- The corresponding cross-validation procedure supports reliable tuning of the boosting model under multiple imputation.
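The tuning step can be illustrated with a small Python sketch. The page does not spell out how MIBoostCV works, so this is only an assumed outline: validation losses are computed for each fold and each imputed dataset along the boosting path, summed, and the iteration count with the smallest total is chosen. The function name `select_stopping_iteration` and the nested-list layout `val_loss[k][m][t]` are hypothetical.

```python
from typing import List


def select_stopping_iteration(val_loss: List[List[List[float]]]) -> int:
    """Return the 1-based iteration count with the smallest aggregated loss.

    val_loss[k][m][t] is the validation loss on fold k and imputed dataset m
    after t + 1 boosting iterations (hypothetical layout).
    """
    n_iter = len(val_loss[0][0])
    totals = [0.0] * n_iter
    for fold in val_loss:
        for imputation in fold:
            for t, loss in enumerate(imputation):
                totals[t] += loss  # aggregate over folds and imputations
    best_t = min(range(n_iter), key=totals.__getitem__)
    return best_t + 1
```

Aggregating before taking the minimum yields a single stopping iteration for all imputed datasets, in keeping with the idea of one shared selection mechanism.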
Where Pith is reading between the lines
- Similar principles could apply to other ensemble or tree-based methods for handling missing data in variable selection.
- Adoption might improve feature stability in domains like medical research where missing values are common.
- Further validation on datasets with complex missingness patterns would strengthen confidence in the approach.
- Integration with different imputation strategies beyond standard multiple imputation could be tested.
Load-bearing premise
A single combined loss function across multiple imputed datasets produces stable and useful variable selections without introducing new biases.
What would settle it
An experiment where models selected by MIBoost show substantially lower predictive accuracy or greater variability in chosen variables compared to methods that fit separate models per imputation and then combine them.
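One way to quantify "variability in chosen variables" in such an experiment is a pairwise overlap measure between selected variable sets, for example across repeated simulation runs. The Python sketch below uses the mean pairwise Jaccard index. This metric is an assumption made for illustration; the page does not say which stability measure the paper uses, and the function names are hypothetical.

```python
from itertools import combinations
from typing import List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard index of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(selections: List[Set[str]]) -> float:
    """Average Jaccard index over all pairs of selected-variable sets."""
    if len(selections) < 2:
        raise ValueError("need at least two selections to compare")
    pairs = list(combinations(selections, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

A value of 1.0 means every run selected exactly the same variables; lower values indicate less stable selection. Comparing this number between MIBoost and a fit-separately-then-combine baseline would address the second half of the criterion above.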
Original abstract
Statistical learning methods for automated variable selection, such as the Least Absolute Shrinkage and Selection Operator (LASSO), elastic nets, and gradient boosting, have become increasingly popular tools for building powerful prediction models. Yet, in practice, analyses are often complicated by missing data. The most widely used approach to address missingness is multiple imputation, which involves creating several completed datasets. However, there is an ongoing debate about how to perform model selection in the presence of multiple imputed datasets. Simple strategies, such as pooling models across datasets, have been shown to have suboptimal properties. Although more sophisticated methods exist, they are often difficult to implement and therefore not widely applied. In contrast, two recent approaches extend the regularization methods LASSO and elastic nets to multiply imputed datasets by defining a single loss function, resulting in a unified set of coefficients across imputations. Our key contribution is to extend this principle to the framework of component-wise gradient boosting by proposing MIBoost, a novel algorithm that employs a uniform variable-selection mechanism across imputed datasets, together with its corresponding cross-validation routine MIBoostCV. In a simulation study, MIBoost yielded predictive performance comparable to that of other established methods, providing a practical boosting-based approach for variable selection with multiply imputed data. The proposed framework is implemented as the R package booami.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes MIBoost, an extension of component-wise gradient boosting to multiple imputation settings. It defines a single combined loss function across imputed datasets to enforce a uniform variable-selection mechanism at each boosting iteration, paired with a tailored cross-validation routine called MIBoostCV. A simulation study is reported to show predictive performance comparable to existing methods for multiply imputed data, and the approach is implemented in the R package booami.
Significance. If the central algorithmic extension holds, the work supplies a practical, boosting-based option for variable selection after multiple imputation, building directly on the unified-loss principle already used for LASSO and elastic net. The reproducible R package is a clear strength that lowers the barrier to adoption. The contribution is incremental rather than foundational but addresses a documented gap in the boosting literature.
major comments (1)
- [Simulation Study] Simulation Study section: the manuscript states that MIBoost yields 'comparable predictive performance' but provides insufficient detail on the data-generating process (missingness mechanism, fraction missing, number of imputations M), the evaluation metrics (e.g., MSE, AUC, or variable-selection stability), and the exact baseline methods. Without these elements it is not possible to verify that the single combined loss produces stable selection without new bias, which is the load-bearing claim of the paper.
minor comments (2)
- [Method] The description of how the combined loss is formed (e.g., summation, averaging, or weighting across imputations) should be stated explicitly in the algorithm box or in the paragraph immediately following the definition of MIBoost.
- [Method] Notation for the per-imputation loss functions and the aggregated loss should be introduced consistently; currently the transition from single-imputation boosting to the multi-imputation version is abrupt.
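The first minor comment asks whether the combined loss is a sum, an average, or a weighted combination. A short Python sketch shows why the distinction matters only in the weighted case: summing and averaging differ by a positive constant and therefore select the same component, whereas unequal weights can change the selection. The function name `argmin_component` and the numbers are made up for illustration.

```python
from typing import List, Optional


def argmin_component(
    losses: List[List[float]], weights: Optional[List[float]] = None
) -> int:
    """Index of the component with the smallest weighted loss over imputations.

    losses[m][r] is the loss of candidate component r on imputed dataset m.
    With weights=None every imputation gets weight 1 (plain summation).
    """
    M, p = len(losses), len(losses[0])
    w = weights if weights is not None else [1.0] * M
    totals = [sum(w[m] * losses[m][r] for m in range(M)) for r in range(p)]
    return min(range(p), key=totals.__getitem__)


losses = [[4.0, 2.5, 3.0], [3.5, 3.0, 2.0], [4.5, 2.0, 3.5]]
by_sum = argmin_component(losses)
by_mean = argmin_component(losses, [1 / 3, 1 / 3, 1 / 3])
by_weights = argmin_component(losses, [0.1, 0.8, 0.1])
```

Here `by_sum` and `by_mean` agree, while `by_weights` picks a different component because the second imputation dominates the total.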
Simulated Author's Rebuttal
We thank the referee for their constructive comments and positive overall assessment of our work. We address the major comment point by point below and have revised the manuscript to incorporate the requested details.
-
Referee: [Simulation Study] Simulation Study section: the manuscript states that MIBoost yields 'comparable predictive performance' but provides insufficient detail on the data-generating process (missingness mechanism, fraction missing, number of imputations M), the evaluation metrics (e.g., MSE, AUC, or variable-selection stability), and the exact baseline methods. Without these elements it is not possible to verify that the single combined loss produces stable selection without new bias, which is the load-bearing claim of the paper.
Authors: We agree that the simulation study section requires additional detail to allow readers to fully evaluate the performance claims and the behavior of the unified loss function. In the revised manuscript we have substantially expanded this section to describe the data-generating process (including the missingness mechanism and fraction of missing values), the number of imputations M, the precise evaluation metrics (MSE for predictive performance together with measures of variable-selection stability across imputations), and the exact baseline methods used for comparison. These additions directly support verification that the single combined loss yields stable selection without introducing new bias, consistent with the load-bearing claim of the paper.
Revision: yes
Circularity Check
No significant circularity in algorithmic extension
The paper defines MIBoost as a direct algorithmic extension of the single combined loss function (already used for LASSO/elastic net on multiply imputed data) to component-wise gradient boosting, with a tailored CV routine. No equations, derivations, or self-citations are shown to reduce the claimed uniform variable selection or performance to quantities fitted inside the same procedure by construction. Simulation results are presented as separate empirical evaluation. The central claim remains an independent algorithmic proposal rather than a self-referential reduction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Multiple imputation produces valid completed datasets under the missing-at-random assumption.
- ad hoc to paper: Component-wise gradient boosting can be adapted to a single loss defined over multiple datasets without altering its variable-selection properties.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: "base-learner selection at each iteration is based on the aggregated loss, i.e., the sum of the losses across all datasets... $L_r^{[t]} := \sum_{m} \sum_{i} \bigl( u_i^{(m)[t]} - \hat{h}_{\beta_r}^{(m)[t]}(x_{i,r}^{(m)}) \bigr)^2$"
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: "MIBoost... uniform variable-selection mechanism across imputed datasets"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Donald B. Rubin. Multiple Imputation for Nonresponse in Surveys. Wiley. ISBN 9780471087052. doi: 10.1002/9780470316696
- [2] Angela M. Wood, Ian R. White, and Patrick Royston. How should variable selection be performed with multiply imputed data? Statistics in Medicine, 27(17):3227–3246, 2008. ISSN 0277-6715. doi: 10.1002/sim.3177
- [3] Yize Zhao and Qi Long. Variable selection in the presence of missing data: Imputation-based methods. Wiley interdisciplinary reviews. Computational statistics, 9(5), 2017. ISSN 1939-5108. doi: 10.1002/wics.1402
- [4] Jiacong Du, Jonathan Boss, Peisong Han, Lauren J. Beesley, Michael Kleinsasser, Stephen A. Goutman, Stuart Batterman, Eva L. Feldman, and Bhramar Mukherjee. Variable selection with multiply-imputed datasets: choosing between stacked and grouped methods. Journal of Computational and Graphical Statistics : a joint publication of American Statistical As...
- [5] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288. ISSN 00359246. doi: 10.1111/j.2517-6161.1996.tb02080.x
- [6] Alexander Rix and Jiacong Du. CRAN: Contributed Packages. 2020. doi: 10.32614/CRAN.package.miselect
- [7] Peter Bühlmann and Bin Yu. Boosting with the L2 loss. Journal of the American Statistical Association, 98(462):324–339, 2003. ISSN 0162-1459. doi: 10.1198/016214503000125
- [8] Robert Kuchen. booami: Component-wise gradient boosting after multiple imputation, 2025. URL https://cran.r-project.org/web/packages/booami/index.html
- [9] Jerome H. Friedman. Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5):1189–1232, 2001. ISSN 0090-5364. doi: 10.1214/aos/1013203451
- [10] Trevor Hastie and Robert Tibshirani. Generalized additive models. Statistical Science, 1(3):297–310, 1986. ISSN 0883-4237. doi: 10.1214/ss/1177013604
- [11] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). The Annals of Statistics, 28(2):337–407, 2000. ISSN 0090-5364. doi: 10.1214/aos/1016218223
- [12] Peter Bühlmann. Boosting for high-dimensional linear models. The Annals of Statistics, 34(2):559–583, 2006. ISSN 0090-5364. doi: 10.1214/009053606000000092
- [13] Lauren J. Beesley and Jeremy M. G. Taylor. A stacked approach for chained equations multiple imputation incorporating the substantive model. Biometrics, 77(4):1342–1354, 2021. doi: 10.1111/biom.13372
- [14] M. A. Tanner and W. H. Wong. The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82(398):528–540, 1987
- [15] W. Sauerbrei and M. Schumacher. A bootstrap resampling procedure for model building: application to the Cox regression model. Statistics in Medicine, 11(16):2093–2109, 1992. ISSN 0277-6715. doi: 10.1002/sim.4780111607
- [16] Lin S. Chen, Ross L. Prentice, and Pei Wang. A penalized EM algorithm incorporating missing data mechanism for Gaussian parameter estimation. Biometrics, 70(2):312–322, 2014. doi: 10.1111/biom.12149
- [17] Joseph L. Schafer. Analysis of Incomplete Multivariate Data, volume 72 of Monographs on Statistics and Applied Probability. Chapman & Hall/CRC, London, UK, 1997. ISBN 9780412040610
- [18] Hui Zou and Trevor Hastie. Addendum: Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society Series B: Statistical Methodology, 67(5):768, 2005. ISSN 1369-7412. doi: 10.1111/j.1467-9868.2005.00527.x
- [19] Hui Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006. ISSN 0162-1459. doi: 10.1198/016214506000000735
- [20] Hui Zou and Hao Helen Zhang. On the adaptive elastic-net with a diverging number of parameters. The Annals of Statistics, 37(4):1733–1751, 2009. ISSN 0090-5364. doi: 10.1214/08-AOS625
- [21] Stef van Buuren and Karin Groothuis-Oudshoorn. mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45(3):1–67. doi: 10.18637/jss.v045.i03
Appendix
First of all, it is important to note that the miselect package provides built-in cross-validation (CV) functions to find the optimal values for λ and α. However, these functions can only be applied to data that have already been imputed. As elucidated in subsection 3.2, to prevent data leakage, data should only be imput...