Improving reproducibility by controlling random seed stability in machine learning based estimation via bagging

Alejandro Schuler; Nicholas Williams

arXiv: 2604.17694 · v1 · submitted 2026-04-20 · 📊 stat.ME · cs.LG · stat.ML

Pith reviewed 2026-05-10 04:44 UTC · model grok-4.3

classification 📊 stat.ME · cs.LG · stat.ML

keywords random seed stability · subbagging · cross-bagging · debiased machine learning · reproducibility · nuisance estimation · concentration condition · machine learning estimation

The pith

Subbagging guarantees stability against random seed changes for any bounded-outcome regression algorithm and adaptive cross-bagging removes seed dependence from both nuisance estimation and sample splitting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Machine learning predictions can shift with different random seeds, which propagates instability into debiased statistical estimators that use those predictions as nuisance components. The paper defines random seed stability through a concentration condition and proves that subbagging, by averaging over multiple subsamples, satisfies the condition for any regression method whose outputs remain bounded. It then presents adaptive cross-bagging as a modified cross-fitting scheme that simultaneously removes seed effects from the nuisance fits and from the data-splitting step. Experiments confirm the procedure reaches the intended stability level while adding only a modest computational overhead compared with fixing seeds or other alternatives.

Core claim

We formalize random seed stability via a concentration condition and prove that subbagging guarantees stability for any bounded-outcome regression algorithm. We introduce a new cross-fitting procedure, adaptive cross-bagging, which simultaneously eliminates seed dependence from both nuisance estimation and sample splitting in debiased machine learning.

What carries the argument

Subbagging, which averages predictions across multiple random subsamples, enforces the concentration condition on seed-induced variation; adaptive cross-bagging extends this by dynamically adjusting splits to control seed effects in both nuisance models and sample partitioning.
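The averaging step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the base learner `fit_random_feature_ridge` is a hypothetical seed-dependent regressor invented for the demo, and the function name `subbag_predict`, the subsample fraction `frac`, and the clipping interval `bounds` are all illustrative choices. Clipping is included only to enforce the bounded-output premise.

```python
import numpy as np

def fit_random_feature_ridge(X, y, seed, n_features=20, ridge=1e-2):
    """Hypothetical seed-dependent learner: ridge regression on random cosine features."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_features))       # random projection, set by the seed
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)  # random phases, set by the seed

    def featurize(A):
        return np.cos(A @ W + b)

    Z = featurize(X)
    coef = np.linalg.solve(Z.T @ Z + ridge * np.eye(n_features), Z.T @ y)
    return lambda A: featurize(A) @ coef

def subbag_predict(X, y, X_new, seed, n_bags=50, frac=0.5, bounds=(0.0, 1.0)):
    """Average clipped predictions over n_bags subsamples drawn without replacement."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    m = max(1, int(frac * n))
    preds = np.empty((n_bags, X_new.shape[0]))
    for i in range(n_bags):
        idx = rng.choice(n, size=m, replace=False)      # one subsample (a "bag")
        learner_seed = int(rng.integers(0, 2**31 - 1))  # fresh seed for the base learner
        model = fit_random_feature_ridge(X[idx], y[idx], learner_seed)
        preds[i] = np.clip(model(X_new), *bounds)       # enforce the bounded-output premise
    return preds.mean(axis=0)

# Fixed toy data: only the seed changes between the runs below.
data_rng = np.random.default_rng(123)
X = data_rng.normal(size=(200, 3))
y = np.clip(1.0 / (1.0 + np.exp(-X[:, 0])) + 0.1 * data_rng.normal(size=200), 0.0, 1.0)
x0 = np.zeros((1, 3))

single = np.array([subbag_predict(X, y, x0, seed=s, n_bags=1)[0] for s in range(30)])
bagged = np.array([subbag_predict(X, y, x0, seed=s, n_bags=50)[0] for s in range(30)])
spread_single, spread_bagged = single.std(), bagged.std()
print(f"seed-to-seed SD, 1 bag: {spread_single:.4f}; 50 bags: {spread_bagged:.4f}")
```

With the data held fixed, the printed standard deviations compare how much the prediction at `x0` moves across seeds with one bag versus fifty.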

If this is right

Debiased machine learning estimators become independent of random seed choice.
Stability holds for every bounded-outcome regression algorithm without further modification.
Adaptive cross-bagging simultaneously neutralizes seed dependence in nuisance estimation and sample splitting.
Targeted stability is reached at only a small computational cost relative to standard cross-fitting.
Alternative seed-control methods either fail to achieve stability or incur substantially larger costs.
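To see where sample splitting enters, the sketch below implements plain K-fold cross-fitting for an AIPW estimate of the ATE, plus the standard repeated cross-fitting baseline that averages over several splits. It is not the paper's adaptive cross-bagging, whose details are not reproduced on this page. The nuisance models (per-arm OLS and a constant propensity, which assumes randomized treatment), the function names, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def ols_fit_predict(X_tr, y_tr, X_te):
    """Least-squares fit with an intercept, then predictions on X_te."""
    A_tr = np.column_stack([np.ones(len(X_tr)), X_tr])
    A_te = np.column_stack([np.ones(len(X_te)), X_te])
    coef, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    return A_te @ coef

def crossfit_aipw(X, a, y, n_folds, seed):
    """Cross-fitted AIPW estimate of the ATE; the seed only decides the fold split."""
    rng = np.random.default_rng(seed)
    n = len(y)
    folds = rng.permutation(n) % n_folds  # balanced random fold labels
    phi = np.empty(n)
    for k in range(n_folds):
        te = folds == k
        tr = ~te
        m1 = ols_fit_predict(X[tr & (a == 1)], y[tr & (a == 1)], X[te])
        m0 = ols_fit_predict(X[tr & (a == 0)], y[tr & (a == 0)], X[te])
        pi = a[tr].mean()  # constant propensity: assumes randomized treatment
        phi[te] = (m1 - m0
                   + a[te] * (y[te] - m1) / pi
                   - (1 - a[te]) * (y[te] - m0) / (1 - pi))
    return float(phi.mean())

def repeated_crossfit_aipw(X, a, y, n_folds, seed, n_repeats=20):
    """Average the cross-fitted estimate over n_repeats independent splits."""
    rng = np.random.default_rng(seed)
    split_seeds = rng.integers(0, 2**31 - 1, size=n_repeats)
    return float(np.mean([crossfit_aipw(X, a, y, n_folds, int(s)) for s in split_seeds]))

# Fixed toy data with a randomized binary treatment.
data_rng = np.random.default_rng(7)
n = 300
X = data_rng.normal(size=(n, 2))
a = data_rng.integers(0, 2, size=n)
y = a + X[:, 0] ** 2 + X[:, 1] + data_rng.normal(size=n)

single = [crossfit_aipw(X, a, y, 2, s) for s in range(30)]
repeated = [repeated_crossfit_aipw(X, a, y, 2, s) for s in range(30)]
print(f"seed-to-seed SD, one split: {np.std(single):.4f}; 20 splits: {np.std(repeated):.4f}")
```

Because the nuisance fits here are deterministic, all seed-to-seed variation comes from the fold assignment. Averaging over splits shrinks it at the price of refitting every nuisance model once per repeat, which is the cost the review says the adaptive procedure is meant to avoid.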

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same stability logic could be applied to other sources of algorithmic randomness such as hyperparameter sampling.
Adoption would increase the reliability of causal estimates obtained from observational data via machine learning.
Software libraries for debiased estimation could incorporate the procedure as a default option to enforce reproducibility.
Relaxing the bounded-outcome requirement would extend the result to a wider class of modern regression methods.

Load-bearing premise

The regression algorithm must produce bounded outcomes for the subbagging stability proof to hold under the stated concentration condition.

What would settle it

Apply subbagging to an unbounded regression algorithm on data where predictions can grow without limit and check whether the concentration condition on seed variation is violated in the resulting debiased estimator.
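A check of this kind could be run as follows, assuming the concentration condition takes the form "the probability over seeds that the estimate deviates from its seed-average by more than ε is at most δ". The (ε, δ) notation appears in the Figure 1 caption, but the exact definition is not reproduced on this page, so this form and the function names are assumptions.

```python
import numpy as np

def empirical_exceedance(estimates, eps):
    """Fraction of seed-specific estimates farther than eps from their average."""
    estimates = np.asarray(estimates, dtype=float)
    center = estimates.mean()  # stand-in for the expectation over seeds
    return float(np.mean(np.abs(estimates - center) > eps))

def is_empirically_stable(estimates, eps, delta):
    """True if the observed exceedance rate does not exceed delta."""
    return empirical_exceedance(estimates, eps) <= delta

# Ten estimates from ten hypothetical seeds; one run is an outlier.
estimates = [0.50, 0.52, 0.48, 0.51, 0.49, 0.90, 0.50, 0.50, 0.51, 0.49]
rate = empirical_exceedance(estimates, eps=0.1)
print(rate, is_empirically_stable(estimates, eps=0.1, delta=0.2))
```

In practice `estimates` would hold one value per seed from the estimator under test, with the data held fixed across runs.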

Figures

Figures reproduced from arXiv: 2604.17694 by Alejandro Schuler, Nicholas Williams.

**Figure 1.** Empirical seed (𝜖, 𝛿)-stability. Left: single layer neural network predictions. The red line corresponds to no bagging and the black line to subbagging using the estimated minimum 𝑉-bags for seed (0.1, 0.1)-stability (dashed line). Right: AIPW estimator for the ATE. The blue, red, and green lines correspond to 2-fold, 10-fold, and LOO cross-fitting respectively; the black line corresponds to adaptive cross…

Original abstract

Predictions from machine learning algorithms can vary across random seeds, inducing instability in downstream debiased machine learning estimators. We formalize random seed stability via a concentration condition and prove that subbagging guarantees stability for any bounded-outcome regression algorithm. We introduce a new cross-fitting procedure, adaptive cross-bagging, which simultaneously eliminates seed dependence from both nuisance estimation and sample splitting in debiased machine learning. Numerical experiments confirm that the method achieves the targeted level of stability whereas alternatives do not. Our method incurs a small computational penalty relative to standard practice whereas alternative methods incur large penalties.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper proves subbagging stabilizes seed effects under bounded outcomes and proposes adaptive cross-bagging for debiased ML, but the bound limits its reach.

Your colleague should know that the paper proves subbagging ensures random seed stability for any regression algorithm with bounded outcomes, and that it introduces adaptive cross-bagging, a new cross-fitting variant that stabilizes both nuisance estimation and sample splitting in debiased machine learning estimators. The work does a solid job of turning the reproducibility issue into a precise mathematical condition and then showing how subbagging satisfies it. That proof seems to be the core new piece, along with the adaptive procedure that avoids the computational expense of standard repeated cross-fitting. The experiments are presented as showing that the method works as intended while other approaches do not, and that the overhead is small. This kind of targeted improvement is helpful for people who need consistent results across runs.

One clear soft spot is the requirement that the base learner have bounded outcomes. The concentration inequality the authors use depends on a known bound M to make the probability of large deviations small as the number of bags increases. Without that bound the guarantee does not hold, and many popular algorithms in this area, such as deep learning models or unconstrained boosting, produce predictions that can be arbitrarily large. The paper might assume this is handled in practice or restrict itself to linear models or trees with bounds, but either way it limits how broadly the result applies.

This paper is for applied statisticians and causal inference researchers who use machine learning for nuisance estimation and care about getting the same answer from the same data regardless of the random seed. A reader who wants a method that is both theoretically justified and computationally reasonable would find value here. I would send it to peer review. The central claims are grounded in a proof and backed by experiments, so referees can check the details and suggest improvements around the boundedness issue.

Referee Report

1 major / 2 minor

Summary. The manuscript formalizes random seed stability in ML-based estimation via a concentration condition and proves that subbagging guarantees this stability for any bounded-outcome regression algorithm. It introduces adaptive cross-bagging, a cross-fitting procedure that removes seed dependence from both nuisance estimation and sample splitting in debiased machine learning. Numerical experiments are used to show that the method achieves the targeted stability level with only a small computational penalty, unlike standard alternatives.

Significance. If the results hold, the work is significant for enhancing reproducibility in statistical applications of machine learning, such as debiased or double machine learning for causal inference. The formal proof provides a rigorous guarantee under stated assumptions, and adaptive cross-bagging offers a practical innovation addressing multiple instability sources simultaneously. Credit is due for the machine-checked-style formalization (via the concentration condition and subbagging properties) and the confirmatory experiments. The bounded-outcome scope is explicit but restricts broader impact.

major comments (1)

[Theorem on subbagging stability (Section 3)] The proof that subbagging delivers the concentration condition for seed stability rests on the bounded-outcome assumption for the base learner. This is correctly scoped in the statement, but the assumption is restrictive for typical nuisance estimators in debiased ML (e.g., neural nets or gradient boosting without explicit truncation), as unbounded predictions prevent the deviation probability from being controlled by increasing the number of bags. The manuscript should add discussion or practical guidance on enforcing boundedness (e.g., via clipping or regularization) without distorting the downstream estimator.

minor comments (2)

[Abstract] The claim of a 'small computational penalty' is qualitative; adding a specific factor (e.g., relative runtime) or reference to the experimental table would improve clarity.
[Numerical experiments] Experimental setup: the description of how bounded outcomes were handled (or if the tested algorithms naturally satisfied the assumption) is unclear; this detail is needed to evaluate whether the experiments directly support the theorem's scope.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their positive evaluation of the manuscript's significance and for the constructive major comment. We address the point below and will revise the manuscript accordingly to strengthen its practical guidance.

Referee: [Theorem on subbagging stability (Section 3)] The proof that subbagging delivers the concentration condition for seed stability rests on the bounded-outcome assumption for the base learner. This is correctly scoped in the statement, but the assumption is restrictive for typical nuisance estimators in debiased ML (e.g., neural nets or gradient boosting without explicit truncation), as unbounded predictions prevent the deviation probability from being controlled by increasing the number of bags. The manuscript should add discussion or practical guidance on enforcing boundedness (e.g., via clipping or regularization) without distorting the downstream estimator.

Authors: We agree that the bounded-outcome assumption is essential to the subbagging guarantee, as it permits direct application of concentration inequalities for bounded variables (e.g., Hoeffding) that would fail without it. The theorem statement already makes this scope explicit. In the revised manuscript we will insert a new paragraph immediately after the theorem in Section 3 that supplies the requested practical guidance. We will recommend (i) post-prediction clipping of each base-learner output to a fixed interval [a,b] chosen from domain knowledge or quantiles of the observed outcomes, noting that clipping preserves the boundedness required by the proof; (ii) use of bounded activation functions or output constraints during training of neural nets or boosted trees; and (iii) a brief sensitivity check showing that moderate clipping introduces negligible bias in the downstream doubly-robust estimator. We will also note that many nuisance estimators in DML applications already operate on bounded or near-bounded targets (e.g., probabilities, normalized outcomes). These additions will not alter the formal results but will improve usability for practitioners.

Revision: yes
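Recommendation (i) in the response above, post-prediction clipping to an interval [a,b] taken from quantiles of the observed outcomes, can be sketched as follows. The function names, the 1% and 99% quantile levels, and the toy predictor are assumptions for illustration; neither the paper nor the rebuttal specifies them.

```python
import numpy as np

def quantile_bounds(y, lower=0.01, upper=0.99):
    """Pick the clipping interval [a, b] from quantiles of the observed outcomes."""
    a, b = np.quantile(y, [lower, upper])
    return float(a), float(b)

def clipped_predict(predict_fn, X_new, bounds):
    """Clip any base learner's predictions to [a, b] so its outputs are bounded."""
    a, b = bounds
    return np.clip(predict_fn(X_new), a, b)

rng = np.random.default_rng(0)
y_obs = rng.normal(size=500)              # toy observed outcomes
bounds = quantile_bounds(y_obs)
steep_linear = lambda X: 10.0 * X[:, 0]   # hypothetical unbounded predictor
X_new = rng.normal(size=(100, 1))
clipped = clipped_predict(steep_linear, X_new, bounds)
print(bounds, clipped.min(), clipped.max())
```

Clipping is applied after prediction, so it wraps any base learner without retraining. Whether the bias it introduces is negligible downstream is the separate sensitivity question raised in item (iii).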

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

The paper defines random seed stability explicitly as a concentration condition on the estimator across seeds, then derives that subbagging satisfies this condition for any regression algorithm whose outputs are bounded by a fixed M. This follows directly from applying a standard concentration inequality for bounded random variables (e.g., Hoeffding) to the average over independent bags; the bound M is an input assumption, not derived from the stability statement itself. Adaptive cross-bagging is constructed by combining this stable nuisance estimator with a modified sample-splitting scheme; neither step renames a fitted quantity as a prediction nor relies on a self-citation whose content is unverified. No equation reduces to its own inputs by construction, and the central claims remain independent of any prior work by the same authors.
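For concreteness, here is the textbook Hoeffding calculation that the paragraph above alludes to. It is a sketch under stated assumptions, not the paper's theorem: conditional on the data $D$, the $B$ bag-level predictions $\hat f_1(x), \dots, \hat f_B(x)$ are taken to be independent and identically distributed with values in $[-M, M]$, and the symbols $B$, $\bar f_B$, $\varepsilon$ and $\delta$ are introduced here for illustration. The paper's own bound may differ.

```latex
\[
\bar f_B(x) = \frac{1}{B}\sum_{b=1}^{B} \hat f_b(x),
\qquad
\Pr\!\left( \bigl|\bar f_B(x) - \mathbb{E}[\hat f_1(x) \mid D]\bigr| \ge \varepsilon \,\middle|\, D \right)
\le 2\exp\!\left( -\frac{2B\varepsilon^2}{(2M)^2} \right)
= 2\exp\!\left( -\frac{B\varepsilon^2}{2M^2} \right).
\]
\[
2\exp\!\left( -\frac{B\varepsilon^2}{2M^2} \right) \le \delta
\quad\Longleftrightarrow\quad
B \ge \frac{2M^2}{\varepsilon^2}\,\log\frac{2}{\delta}.
\]
```

As a worked instance, $M = 1$ and $\varepsilon = \delta = 0.1$ give $B \ge 200 \log 20 \approx 599.1$, so 600 bags suffice under this bound. The bound is worst-case, so the number of bags needed in practice can be much smaller. Without a finite $M$ the right-hand side gives no control, which is why the bounded-outcome assumption carries the argument.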

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on formalizing stability as a concentration condition and assuming bounded outcomes for the regression algorithms to enable the subbagging proof. No free parameters or invented entities are indicated in the abstract.

axioms (2)

domain assumption: Regression algorithms produce bounded outcomes. Required for the subbagging stability guarantee under the concentration condition.
domain assumption: Random seed stability can be formalized as a concentration condition. This is the key formalization used to prove the result.

pith-pipeline@v0.9.0 · 5389 in / 1269 out tokens · 36460 ms · 2026-05-10T04:44:38.488748+00:00 · methodology

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages

[1] Andonova, S., Elisseeff, A., Evgeniou, T. & Pontil, M. (2002). A simple algorithm for learning stable machines. In ECAI.
[2] Anthropic (2025). Claude Opus 4.6. https://www.anthropic.com. Large language model.
[3] Breiman, L. (1996a). Bagging predictors. Machine Learning 24, 123–140.
[4] Breiman, L. (1996b). Heuristics of instability and stabilization in model selection. The Annals of Statistics 24, 2350–2383.
[5] Breiman, L. (2001). Random forests. Machine Learning 45, 5–32.
[6] Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W. & Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal 21, C1–C68.
[7] Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics 7, 1–26.
[8] Naimi, A. I., Yu, Y.-H. & Bodnar, L. M. (2024). Pseudo-random number generator influences on average treatment effect estimates obtained with machine learning. Epidemiology 35, 779–786.
[9] Popoviciu, T. (1935). Sur les équations algébriques ayant toutes leurs racines réelles. Mathematica 9, 20.
[10] R Core Team (2025). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
[11] Robins, J. M., Rotnitzky, A. & Zhao, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association 89, 846–866.
[12] Rumelhart, D. E., Hinton, G. E. & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature 323, 533–536.
[13] Schader, L., Song, W., Kempker, R. & Benkeser, D. (2024). Don’t let your analysis go to seed: on the impact of random seed on machine learning-based causal inference. Epidemiology 35, 764–778.
[14] Shalev-Shwartz, S., Shamir, O., Srebro, N. & Sridharan, K. (2010). Learnability, stability and uniform convergence. The Journal of Machine Learning Research 11, 2635–2670.
[15] Soloff, J. A., Barber, R. F. & Willett, R. (2024). Bagging provides assumption-free stability. Journal of Machine Learning Research 25, 1–35.
[16] van der Laan, M. J. & Rubin, D. B. (2006). Targeted maximum likelihood learning. The International Journal of Biostatistics 2, Article 11.
[17] Williams, N., Rosenblum, M. & Díaz, I. (2022). Optimising precision and power by machine learning in randomised trials with ordinal and time-to-event outcomes with an application to COVID-19. Journal of the Royal Statistical Society Series A: Statistics in Society 185, 2156–2178.
[18] Williams, N. T., Hung, A. & Rudolph, K. E. (2025). Re: Don’t let your analysis go to seed: On the impact of random seed on machine learning-based causal inference. Epidemiology 36, e12–e13.
[19] Zheng, W. & van der Laan, M. J. (2011). Cross-Validated Targeted Minimum-Loss-Based Estimation. New York, NY: Springer New York, pp. 459–474.
[20] Zivich, P. N. (2024). Commentary: The seedy side of causal effect estimation with machine learning. Epidemiology 35, 787–790.
