A Bayes-Factor-Guided Approach to Post-Double Selection with Bootstrapped Multiple Imputation

arxiv: 2604.12783 · v2 · submitted 2026-04-14 · 📊 stat.ME · econ.EM

A Bayes-Factor-Guided Approach to Post-Double Selection with Bootstrapped Multiple Imputation

Johannes Bleher (1) , Claudia Tarantola (2) ((1) Department of Econometrics , Empirical Economics & Computational Science Hub , University of Hohenheim , (2) Department of Economics , Management , Quantitative Methods , University of Milan) This is my paper

Pith reviewed 2026-05-10 14:38 UTC · model grok-4.3

classification 📊 stat.ME econ.EM

keywords variable selectionmultiple imputationbootstrappingBayes factorsequential testingpost-double selectionmodel aggregation

0 comments p. Extension

The pith

Treating detections of each variable across bootstrap-imputation iterations as Bernoulli trials lets a likelihood ratio accumulate into an approximate Bayes factor that supplies both an inclusion threshold and an automatic stopping rule.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Variable selection performed separately on many bootstrapped and multiply imputed datasets usually produces different chosen variables each time, and simply taking their union tends to yield overly dense final models. The paper models the sequence of yes-no detection outcomes for any given candidate variable as independent Bernoulli trials whose success probability is unknown. A likelihood-ratio process is run on these trials; the running ratio admits an approximate Bayes-factor reading that can be compared to a threshold to decide inclusion and can be monitored to decide when enough iterations have been seen. This removes the need to choose the number of bootstrap-imputation rounds before seeing the data. The procedure is compared with union and other aggregation rules in 126 Monte Carlo scenarios and illustrated on real data.

Core claim

The paper claims that detection outcomes across bootstrap and multiple-imputation iterations can be treated as independent Bernoulli trials, so that a sequential likelihood-ratio statistic can be formed whose value supplies both a variable-inclusion criterion and a stopping rule for further iterations, all under an approximate Bayes-factor interpretation of the accumulated evidence.

What carries the argument

The sequential likelihood-ratio process on Bernoulli detection outcomes, read as an approximate Bayes factor for variable relevance.

If this is right

The selected model contains only those variables whose accumulated evidence crosses a pre-chosen Bayes-factor threshold.
The total number of bootstrap-imputation iterations is determined by the evidence process itself rather than fixed in advance.
Final models are less dense than those produced by taking the union of selected variables across all iterations.
Performance relative to union and other aggregation methods is demonstrated across 126 simulation scenarios and one real-data example.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same Bernoulli-trial evidence accumulation could be applied to other repeated-perturbation schemes such as cross-validation folds or random feature subsamples.
In high-dimensional or computationally expensive settings the data-driven stopping rule may reduce total run time by halting once evidence is sufficient.
The approach offers a template for turning any repeated selection procedure into a sequential evidence-monitoring method.

Load-bearing premise

Detection outcomes across successive bootstrap-imputation iterations behave as independent Bernoulli trials with constant probability.

What would settle it

A simulation study in which the true relevant variables are known shows that the Bayes-factor procedure selects sets whose out-of-sample performance is no better than, or whose size differs markedly from, the sets obtained by simple union of selections across the same iterations.

Figures

Figures reproduced from arXiv: 2604.12783 by (2) Department of Economics, Claudia Tarantola (2) ((1) Department of Econometrics, Empirical Economics & Computational Science Hub, Johannes Bleher (1), Management, Quantitative Methods, University of Hohenheim, University of Milan).

**Figure 1.** Figure 1: Procedure Overview for Bootstrapped Multiply Imputed Variable Selection The figure illustrates how datasets containing missing values (gray shading) are bootstrapped and subsequently imputed. Variable selection is depicted by green shaded columns. The resulting selected variables can subsequently be aggregated using the proposed evidence measure. Original data set X X m Xm, 1 Xm, 2 Xm,p · · · X m − 1 X m −… view at source ↗

**Figure 6.** Figure 6: Empirical Evidence Paths The figure displays cumulative log-evidence paths log Ejt for all p = 55 candidate variables from the BOOT-MI sequential evidence procedure applied to the ESS data (n = 8,543, pilot-calibrated πˆ0, c = log(1000)). Paths corresponding to variables classified as relevant are shown with higher opacity; paths of irrelevant variables are faded. Dashed horizontal lines indicate the symme… view at source ↗

**Figure 7.** Figure 7: TPR–FPR Frontier by Sample Size and Missingness Mechanism (Fixed-Budget) Each point represents the average TPR and FPR of a method within a sample-size and missingness-mechanism cell. Rows correspond to MCAR, MAR, and MNAR; columns to n ∈ {100, 500, 1000}. The cross marks the ideal point at (0, 1). nn: 100 nn: 500 nn: 1000 miss_type: MCAR miss_type: MAR miss_type: MNAR 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.… view at source ↗

**Figure 9.** Figure 9: Mean Distance to Ideal by Sample Size (Fixed-Budget) Euclidean distance to the ideal point (0, 1) by sample size under the fixed-budget comparison, with 95% confidence intervals. 0.4 0.6 0.8 1.0 100 500 1000 Sample size Mean distance to ideal Union Rule Frequency threshold (50%) Frequency threshold (75%) Proposed Method 65 [PITH_FULL_IMAGE:figures/full_fig_p066_9.png] view at source ↗

**Figure 10.** Figure 10: Mean Distance to Ideal by Missingness Rate (Fixed-Budget) Euclidean distance to the ideal point (0, 1) by missingness rate and mechanism under the fixed-budget comparison. MCAR MAR MNAR 20% 40% 60% 20% 40% 60% 20% 40% 60% 0.4 0.6 0.8 1.0 Missingness rate Mean distance to ideal Union Rule Frequency threshold (50%) Frequency threshold (75%) Proposed Method [PITH_FULL_IMAGE:figures/full_fig_p067_10.png] view at source ↗

**Figure 11.** Figure 11: Mean Distance to Ideal by Sample Size (Matched-Budget) As [PITH_FULL_IMAGE:figures/full_fig_p067_11.png] view at source ↗

read the original abstract

When variable selection methods are applied to bootstrapped and multiply imputed datasets, the set of selected variables typically varies across iterations. Aggregating results via the union rule can lead to overly dense models. We propose a sequential evidence aggregation procedure that models detection outcomes across perturbation iterations as Bernoulli trials and accumulates evidence for variable relevance through a likelihood-ratio process admitting an approximate Bayes-factor interpretation. The procedure provides both a variable inclusion criterion and a stopping rule that eliminates the need to fix the number of bootstrap-imputation iterations ex ante. A Monte Carlo study across 126 scenarios and an empirical illustration demonstrate the method's performance relative to existing aggregation approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a sequential likelihood-ratio stopping rule for post-double selection under bootstrap and multiple imputation by treating selection outcomes as Bernoulli trials, but the shared data structure across replicates likely breaks the independence needed for the approximation to work as a Bayes factor.

read the letter

The main thing here is a procedure that models whether a variable gets picked by post-double selection on each bootstrap-imputed replicate as a Bernoulli trial, then accumulates a likelihood ratio across trials to decide both inclusion and when to stop adding more replicates. This replaces the usual fixed number of iterations or simple union rule with something that tries to stop when evidence stabilizes. The Monte Carlo covers 126 scenarios plus an empirical example, which is enough to show how it compares to standard aggregation approaches in terms of model size and performance metrics. That part is useful for applied people who deal with missing data and want to avoid overly dense models without guessing the iteration count ahead of time. The framing is direct and the goal is practical, which is a plus for work in econometrics or statistics with incomplete datasets. The soft spot is the independence assumption. Replicates come from the same original sample via resampling and draws from the same imputation model, so they carry over the same collinearity, finite-sample effects, and imputation uncertainty. Treating the indicators as i.i.d. Bernoulli means the product of per-trial likelihoods is not the joint likelihood ratio, and the stopping rule loses its martingale justification. The paper does not report diagnostics for dependence or sensitivity checks under exchangeable or Markov alternatives, so it is not clear how much the numerical values still correspond to an approximate Bayes factor. Readers who routinely run variable selection on multiply imputed data will find the stopping rule and the simulation results worth looking at. The work is clear enough on its own terms to deserve a serious referee who can press on the approximation and the dependence issue. I would send it to peer review rather than desk reject.

Referee Report

3 major / 2 minor

Summary. The paper proposes a sequential evidence aggregation method for post-double selection on bootstrapped multiply-imputed data. Selection indicators across iterations are modeled as i.i.d. Bernoulli trials; a likelihood-ratio process is accumulated and interpreted as an approximate Bayes factor that supplies both a variable-inclusion threshold and a data-driven stopping rule, removing the need to pre-specify the number of bootstrap-imputation replicates. Performance is assessed via a Monte Carlo study over 126 scenarios and one empirical example, with comparisons to union-rule and other aggregation baselines.

Significance. If the approximate Bayes-factor construction remains valid, the procedure supplies a principled, adaptive alternative to fixed-iteration or union-based aggregation in missing-data variable selection, potentially improving sparsity without sacrificing detection power. The scale of the simulation design (126 scenarios) is a clear strength and allows broad exploration of operating characteristics.

major comments (3)

[Sequential aggregation procedure] The modeling of detection outcomes as i.i.d. Bernoulli trials whose likelihood ratio yields an approximate Bayes factor (described in the sequential aggregation procedure) is load-bearing for both the inclusion criterion and the stopping rule. Because every replicate is generated from the same observed sample via resampling and draws from the identical posterior predictive, the indicators share finite-sample structure, imputation uncertainty, and collinearity; they are therefore dependent. Under dependence the product of marginal likelihoods no longer equals the joint likelihood ratio, the martingale property required for the stopping time fails, and the numerical value no longer corresponds to a Bayes factor. The Monte Carlo study reports no diagnostic for serial dependence and no sensitivity analysis that replaces the i.i.d. assumption with a Markov or exchangeable model.
[Method description] No derivation or explicit justification of the Bayes-factor approximation is supplied, nor are the numerical thresholds for inclusion (e.g., BF > k) or stopping (e.g., BF crossing a boundary) stated. Without these quantities the Monte Carlo results cannot be reproduced or interpreted as evidence that the procedure controls error rates or improves sparsity relative to fixed-iteration baselines.
[Monte Carlo study] The Monte Carlo design does not report standard errors or confidence bands on the reported performance metrics, nor does it examine sensitivity of the results to the specific choices of p0 < 0.5 and p1 > 0.5 used in the Bernoulli likelihoods. These omissions make it impossible to judge whether the claimed superiority over existing aggregation approaches is robust.

minor comments (2)

Notation for the per-iteration selection indicator and the accumulated likelihood ratio should be introduced with a clear equation or algorithm box to improve readability.
The abstract states that the procedure 'eliminates the need to fix the number of bootstrap-imputation iterations ex ante,' but the manuscript should clarify whether a maximum iteration cap is still imposed in practice and how it interacts with the stopping rule.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our paper. We address each of the major comments point by point below, providing clarifications and outlining the revisions we will make to strengthen the manuscript.

read point-by-point responses

Referee: [Sequential aggregation procedure] The modeling of detection outcomes as i.i.d. Bernoulli trials whose likelihood ratio yields an approximate Bayes factor (described in the sequential aggregation procedure) is load-bearing for both the inclusion criterion and the stopping rule. Because every replicate is generated from the same observed sample via resampling and draws from the identical posterior predictive, the indicators share finite-sample structure, imputation uncertainty, and collinearity; they are therefore dependent. Under dependence the product of marginal likelihoods no longer equals the joint likelihood ratio, the martingale property required for the stopping time fails, and the numerical value no longer corresponds to a Bayes factor. The Monte Carlo study reports no diagnostic for serial dependence and no sensitivity analysis that replaces the i.i.d. assumption with a Markov

Authors: We acknowledge that the detection indicators are dependent due to the shared data source. The i.i.d. Bernoulli model is presented as an approximation that enables the sequential likelihood ratio accumulation and its interpretation as an approximate Bayes factor. While the strict martingale property may not hold, the procedure is designed to provide a practical stopping rule and inclusion threshold that performs well in finite samples, as evidenced by our simulations. In the revision, we will add a discussion of the dependence issue, include diagnostics for serial correlation in the indicators, and perform a sensitivity analysis using a first-order Markov model for the sequence of detections. revision: partial
Referee: [Method description] No derivation or explicit justification of the Bayes-factor approximation is supplied, nor are the numerical thresholds for inclusion (e.g., BF > k) or stopping (e.g., BF crossing a boundary) stated. Without these quantities the Monte Carlo results cannot be reproduced or interpreted as evidence that the procedure controls error rates or improves sparsity relative to fixed-iteration baselines.

Authors: We will include a detailed derivation of the likelihood-ratio process and its approximate Bayes factor interpretation in the revised methods section. We will also explicitly state the numerical thresholds used for variable inclusion and the stopping criterion (e.g., the specific BF values for moderate and strong evidence). This will enhance reproducibility and allow readers to better interpret the Monte Carlo results. revision: yes
Referee: [Monte Carlo study] The Monte Carlo design does not report standard errors or confidence bands on the reported performance metrics, nor does it examine sensitivity of the results to the specific choices of p0 < 0.5 and p1 > 0.5 used in the Bernoulli likelihoods. These omissions make it impossible to judge whether the claimed superiority over existing aggregation approaches is robust.

Authors: We agree that reporting standard errors and confidence intervals for the performance metrics would strengthen the presentation. We will add these to the Monte Carlo results. Additionally, we will include a sensitivity analysis varying the values of p0 and p1 to demonstrate the robustness of our findings to these choices. revision: yes

Circularity Check

0 steps flagged

No circularity: new sequential procedure rests on explicit modeling assumptions with external Monte Carlo validation

full rationale

The paper proposes a sequential evidence-aggregation method that explicitly models post-double-selection indicators across bootstrap-imputation replicates as Bernoulli trials and accumulates a likelihood-ratio statistic given an approximate Bayes-factor reading. This construction is introduced as a modeling choice rather than derived from the data; the stopping rule and inclusion threshold follow directly from the assumed i.i.d. Bernoulli likelihoods and the chosen p0/p1 thresholds. No equation reduces a claimed prediction to a parameter fitted on the same quantity, no self-citation supplies a uniqueness theorem, and the Monte Carlo study across 126 scenarios supplies independent performance checks. The derivation chain therefore remains self-contained and does not collapse to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The method appears to rest on the standard assumption that selection indicators behave as independent Bernoulli trials and that the likelihood-ratio process approximates a Bayes factor.

pith-pipeline@v0.9.0 · 5442 in / 1072 out tokens · 19588 ms · 2026-05-10T14:38:06.982956+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

3 extracted references · 3 canonical work pages

[1]

A., McCauley, T

Bainter, S. A., McCauley, T. G., Fahmy, M. M., Goodman, Z. T., Kupis, L. and Rao, J. S.: 2023, Comparing bayesian variable selection to lasso approaches for applications in psychology,Psychometrikapp. 1–24. Belloni, A., Chernozhukov, V. and Hansen, C.: 2014a, High-dimensional methods and inference on structural and treatment effects,The Journal of Economi...

work page 2023
[2]

and Wang, S.: 2013, Variable selection for multiply-imputed data with application to dioxin exposure study,Statistics in Medicine32(21), 3646–3659

Chen, Q. and Wang, S.: 2013, Variable selection for multiply-imputed data with application to dioxin exposure study,Statistics in Medicine32(21), 3646–3659. Du, J., Boss, J., Han, P., Beesley, L. J., Kleinsasser, M., Goutman, S. A., Batterman, S. A., Feldman, E. L. and Mukherjee, B.: 2020, Variable selection with multiply-imputed datasets: Choosing betwee...

work page 2013
[3]

Sellke, T., Bayarri, M. J. and Berger, J. O.: 2001, Calibration of p values for testing precise null hypotheses,The American Statistician55(1), 62–71. Wald, A.: 1945, Sequential tests of statistical hypotheses,The Annals of Mathematical Statistics16(2), 117–186. 40 Wood, A. M., White, I. R. and Royston, P.: 2008, How should variable selection be performed...

work page 2001

[1] [1]

A., McCauley, T

Bainter, S. A., McCauley, T. G., Fahmy, M. M., Goodman, Z. T., Kupis, L. and Rao, J. S.: 2023, Comparing bayesian variable selection to lasso approaches for applications in psychology,Psychometrikapp. 1–24. Belloni, A., Chernozhukov, V. and Hansen, C.: 2014a, High-dimensional methods and inference on structural and treatment effects,The Journal of Economi...

work page 2023

[2] [2]

and Wang, S.: 2013, Variable selection for multiply-imputed data with application to dioxin exposure study,Statistics in Medicine32(21), 3646–3659

Chen, Q. and Wang, S.: 2013, Variable selection for multiply-imputed data with application to dioxin exposure study,Statistics in Medicine32(21), 3646–3659. Du, J., Boss, J., Han, P., Beesley, L. J., Kleinsasser, M., Goutman, S. A., Batterman, S. A., Feldman, E. L. and Mukherjee, B.: 2020, Variable selection with multiply-imputed datasets: Choosing betwee...

work page 2013

[3] [3]

Sellke, T., Bayarri, M. J. and Berger, J. O.: 2001, Calibration of p values for testing precise null hypotheses,The American Statistician55(1), 62–71. Wald, A.: 1945, Sequential tests of statistical hypotheses,The Annals of Mathematical Statistics16(2), 117–186. 40 Wood, A. M., White, I. R. and Royston, P.: 2008, How should variable selection be performed...

work page 2001