
arxiv: 2507.21807 · v6 · submitted 2025-07-29 · 📊 stat.ML · cs.LG

MIBoost: A gradient boosting algorithm for variable selection after multiple imputation

Pith reviewed 2026-05-19 03:03 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords multiple imputation · variable selection · gradient boosting · missing data · MIBoost · statistical learning · component-wise boosting · cross-validation

The pith

MIBoost extends gradient boosting for variable selection to multiple imputed datasets via a single combined loss function.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes MIBoost, an algorithm for variable selection by component-wise gradient boosting when missing values are handled through multiple imputation, which produces several completed datasets. It builds on recent extensions of LASSO and elastic nets by defining one loss function that enforces the same variable selection across all imputed versions of the data. This matters because simple strategies for combining results across imputations, such as pooling models, have been shown to have suboptimal properties, while more sophisticated methods are hard to implement; a boosting approach offers an alternative that is straightforward to apply. The work also supplies a cross-validation routine, MIBoostCV, and reports simulations in which the resulting models predict about as well as established methods.

Core claim

The authors establish that the unified loss function approach for regularization methods can be adapted to component-wise gradient boosting. MIBoost implements a uniform variable-selection mechanism across imputed datasets, producing a single coherent set of selected variables rather than separate ones for each imputation. Paired with the MIBoostCV routine for tuning, this framework delivers models with predictive performance comparable to other methods in simulation studies.

What carries the argument

The MIBoost algorithm, which integrates a single loss function over multiple imputations to achieve consistent variable selection in component-wise gradient boosting.
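The idea of one selection step shared by all imputations can be sketched as follows. This is a minimal illustration, not the authors' implementation or the booami API: it assumes squared-error loss, single-column least-squares base learners, an unweighted average of the per-imputation losses as the combined loss, and a final average of per-imputation coefficients. The function name `miboost_sketch` and all parameter names are hypothetical.

```python
import numpy as np


def miboost_sketch(X_list, y_list, n_iter=100, nu=0.1):
    """Component-wise L2 boosting with one selection step shared by all imputations.

    X_list: list of M arrays of shape (n, p), one per imputed dataset, columns centred.
    y_list: list of M outcome arrays of shape (n,).
    Returns (averaged coefficients, averaged intercept, sorted selected column indices).
    """
    M = len(X_list)
    p = X_list[0].shape[1]
    betas = np.zeros((M, p))                      # one coefficient vector per imputation
    intercepts = np.array([y.mean() for y in y_list])
    selected = set()
    for _ in range(n_iter):
        # Residuals are the negative gradient of the squared-error loss.
        resid = [y - b0 - X @ b
                 for X, y, b0, b in zip(X_list, y_list, intercepts, betas)]
        # Least-squares slope of the residual on each single column, per imputation.
        slopes = np.array([(X.T @ r) / (X ** 2).sum(axis=0)
                           for X, r in zip(X_list, resid)])        # shape (M, p)
        # Assumed combined loss: residual sum of squares of each candidate,
        # averaged over the M imputed datasets.
        loss = np.zeros(p)
        for X, r, s in zip(X_list, resid, slopes):
            loss += ((r[:, None] - X * s) ** 2).sum(axis=0)
        loss /= M
        j = int(np.argmin(loss))                  # same column chosen for every imputation
        betas[:, j] += nu * slopes[:, j]          # damped update in each imputation
        selected.add(j)
    return betas.mean(axis=0), intercepts.mean(), sorted(selected)


# Toy data: columns 0 and 2 carry signal; 10% of column 0 is "missing".
rng = np.random.default_rng(0)
n, p, M = 300, 6, 3
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.5, size=n)
mask = rng.random(n) < 0.1
obs = X[~mask, 0]

X_list = []
for _ in range(M):
    Xm = X.copy()
    # Crude stochastic imputation from the observed marginal (illustration only).
    Xm[mask, 0] = obs.mean() + obs.std() * rng.normal(size=mask.sum())
    X_list.append(Xm - Xm.mean(axis=0))
y_list = [y] * M

coef, intercept, selected = miboost_sketch(X_list, y_list)
```

Because the column index `j` is chosen once per iteration from the averaged loss, every imputation updates the same variable, so the set of selected variables is identical across imputations by construction. The toy imputation above ignores the outcome and the other covariates; a real analysis would use a proper multiple-imputation procedure.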

If this is right

  • It provides a unified variable selection process that avoids the drawbacks of pooling models separately from each imputed dataset.
  • Simulations demonstrate that predictive performance remains competitive with established approaches for missing data.
  • The method supplies a practical tool for analysts dealing with incomplete datasets in statistical learning tasks.
  • The corresponding cross-validation procedure supports reliable tuning of the boosting model under multiple imputation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar principles could apply to other ensemble or tree-based methods for handling missing data in variable selection.
  • Adoption might improve feature stability in domains like medical research where missing values are common.
  • Further validation on datasets with complex missingness patterns would strengthen confidence in the approach.
  • Integration with different imputation strategies beyond standard multiple imputation could be tested.

Load-bearing premise

A single combined loss function across multiple imputed datasets produces stable and useful variable selections without introducing new biases.

What would settle it

An experiment where models selected by MIBoost show substantially lower predictive accuracy or greater variability in chosen variables compared to methods that fit separate models per imputation and then combine them.
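One way to quantify "variability in chosen variables" in such an experiment is the mean pairwise Jaccard similarity between selected sets, computed across repeated fits (for example, simulation replicates, or the separate per-imputation fits of a baseline). The metric choice and the helper name `mean_pairwise_jaccard` are assumptions for illustration; the paper summary does not say which stability measure it uses.

```python
from itertools import combinations


def mean_pairwise_jaccard(selections):
    """Average Jaccard similarity over all pairs of selected-variable sets.

    selections: iterable of collections of variable indices, one per fit.
    Returns a float in [0, 1]; 1.0 means every fit chose the same variables.
    """
    sets = [set(s) for s in selections]
    pairs = list(combinations(sets, 2))
    if not pairs:
        raise ValueError("need at least two selections to compare")

    def jaccard(a, b):
        union = a | b
        # Two empty selections are treated as identical.
        return 1.0 if not union else len(a & b) / len(union)

    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical selections from three separate per-imputation fits.
separate_fits = [{0, 2, 5}, {0, 2}, {0, 2, 3}]
score_separate = mean_pairwise_jaccard(separate_fits)

# A unified procedure returns the same set for every imputation.
unified_fits = [{0, 2}, {0, 2}, {0, 2}]
score_unified = mean_pairwise_jaccard(unified_fits)
```

A unified procedure scores 1.0 across imputations by construction, so a fair comparison would compute the score across simulation replicates or resamples for both methods.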

Original abstract

Statistical learning methods for automated variable selection, such as the Least Absolute Shrinkage and Selection Operator (LASSO), elastic nets, and gradient boosting, have become increasingly popular tools for building powerful prediction models. Yet, in practice, analyses are often complicated by missing data. The most widely used approach to address missingness is multiple imputation, which involves creating several completed datasets. However, there is an ongoing debate about how to perform model selection in the presence of multiple imputed datasets. Simple strategies, such as pooling models across datasets, have been shown to have suboptimal properties. Although more sophisticated methods exist, they are often difficult to implement and therefore not widely applied. In contrast, two recent approaches extend the regularization methods LASSO and elastic nets to multiply imputed datasets by defining a single loss function, resulting in a unified set of coefficients across imputations. Our key contribution is to extend this principle to the framework of component-wise gradient boosting by proposing MIBoost, a novel algorithm that employs a uniform variable-selection mechanism across imputed datasets, together with its corresponding cross-validation routine MIBoostCV. In a simulation study, MIBoost yielded predictive performance comparable to that of other established methods, providing a practical boosting-based approach for variable selection with multiply imputed data. The proposed framework is implemented as the R package booami.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes MIBoost, an extension of component-wise gradient boosting to multiple imputation settings. It defines a single combined loss function across imputed datasets to enforce a uniform variable-selection mechanism at each boosting iteration, paired with a tailored cross-validation routine called MIBoostCV. A simulation study is reported to show predictive performance comparable to existing methods for multiply imputed data, and the approach is implemented in the R package booami.

Significance. If the central algorithmic extension holds, the work supplies a practical, boosting-based option for variable selection after multiple imputation, building directly on the unified-loss principle already used for LASSO and elastic net. The reproducible R package is a clear strength that lowers the barrier to adoption. The contribution is incremental rather than foundational but addresses a documented gap in the boosting literature.

major comments (1)
  1. [Simulation Study] Simulation Study section: the manuscript states that MIBoost yields 'comparable predictive performance' but provides insufficient detail on the data-generating process (missingness mechanism, fraction missing, number of imputations M), the evaluation metrics (e.g., MSE, AUC, or variable-selection stability), and the exact baseline methods. Without these elements it is not possible to verify that the single combined loss produces stable selection without new bias, which is the load-bearing claim of the paper.
minor comments (2)
  1. [Method] The description of how the combined loss is formed (e.g., summation, averaging, or weighting across imputations) should be stated explicitly in the algorithm box or in the paragraph immediately following the definition of MIBoost.
  2. [Method] Notation for the per-imputation loss functions and the aggregated loss should be introduced consistently; currently the transition from single-imputation boosting to the multi-imputation version is abrupt.
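For illustration, notation along the following lines would address both minor comments. It assumes the combined loss is an unweighted average over imputations, which this report cannot confirm from the manuscript; the symbols $L^{(m)}$, $L_{\mathrm{agg}}$, $f^{(m)}$ and $h_j^{(m)}$ are placeholders rather than the authors' notation.

```latex
% Assumed setting: M imputed datasets, n observations, p candidate covariates,
% pointwise loss \ell, current predictor f^{(m)} on imputation m,
% and h_j^{(m)} the base learner for covariate j fitted on imputation m.
\[
  L^{(m)}\bigl(f^{(m)}\bigr)
    = \frac{1}{n}\sum_{i=1}^{n} \ell\bigl(y_i^{(m)},\, f^{(m)}(x_i^{(m)})\bigr),
  \qquad m = 1,\dots,M,
\]
\[
  L_{\mathrm{agg}} = \frac{1}{M}\sum_{m=1}^{M} L^{(m)}\bigl(f^{(m)}\bigr),
  \qquad
  j^{*} = \arg\min_{j \in \{1,\dots,p\}}
          \frac{1}{M}\sum_{m=1}^{M} L^{(m)}\bigl(f^{(m)} + h_j^{(m)}\bigr).
\]
```

Setting $M = 1$ recovers ordinary component-wise boosting, which makes the transition from the single-imputation case explicit.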

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive comments and positive overall assessment of our work. We address the major comment point by point below and have revised the manuscript to incorporate the requested details.

Point-by-point responses
  1. Referee: [Simulation Study] Simulation Study section: the manuscript states that MIBoost yields 'comparable predictive performance' but provides insufficient detail on the data-generating process (missingness mechanism, fraction missing, number of imputations M), the evaluation metrics (e.g., MSE, AUC, or variable-selection stability), and the exact baseline methods. Without these elements it is not possible to verify that the single combined loss produces stable selection without new bias, which is the load-bearing claim of the paper.

    Authors: We agree that the simulation study section requires additional detail to allow readers to fully evaluate the performance claims and the behavior of the unified loss function. In the revised manuscript we have substantially expanded this section to describe the data-generating process (including the missingness mechanism and fraction of missing values), the number of imputations M, the precise evaluation metrics (MSE for predictive performance together with measures of variable-selection stability across imputations), and the exact baseline methods used for comparison. These additions directly support verification that the single combined loss yields stable selection without introducing new bias, consistent with the load-bearing claim of the paper.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity in algorithmic extension

Full rationale

The paper defines MIBoost as a direct algorithmic extension of the single combined loss function (already used for LASSO/elastic net on multiply imputed data) to component-wise gradient boosting, with a tailored CV routine. No equations, derivations, or self-citations are shown to reduce the claimed uniform variable selection or performance to quantities fitted inside the same procedure by construction. Simulation results are presented as separate empirical evaluation. The central claim remains an independent algorithmic proposal rather than a self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The proposal rests on standard assumptions of multiple imputation (missing at random) and component-wise gradient boosting (additive model updates), with no new invented entities or explicitly fitted free parameters beyond typical boosting hyperparameters such as learning rate and number of iterations.

axioms (2)
  • [domain assumption] Multiple imputation produces valid completed datasets under the missing-at-random assumption.
    The entire framework presupposes that the imputations are already generated and appropriate; this is invoked when defining the unified loss across imputations.
  • [ad hoc to paper] Component-wise gradient boosting can be adapted to a single loss defined over multiple datasets without altering its variable-selection properties.
    This is the core modeling choice that allows the uniform selection mechanism; it is introduced in the key contribution paragraph.

pith-pipeline@v0.9.0 · 5760 in / 1402 out tokens · 53858 ms · 2026-05-19T03:03:36.936103+00:00 · methodology


