Predicting missing values: A good idea?

Stef van Buuren

arXiv: 2605.03733 · v1 · submitted 2026-05-05 · stat.ML · cs.LG · stat.ME

Pith reviewed 2026-05-07 12:44 UTC · model grok-4.3

classification: stat.ML · cs.LG · stat.ME

keywords: missing data imputation · mean squared error · stochastic imputation · bias in downstream analysis · variance reduction · predictive vs stochastic · correlation bias · regression slope

The pith

Minimizing mean squared error when imputing missing values shrinks data variability and biases variance, correlations, slopes, and R-squared.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that imputing missing values by choosing the single number that minimizes prediction error replaces those values with averages. These averages reduce the natural spread of the data and thereby distort key quantities in any later analysis. Adding random noise whose size matches the original prediction error restores the lost variability and removes the distortions in both toy simulations and real software. Popular predictive imputation routines in missForest, softImpute, and mice display the same systematic biases. The result implies that accuracy measured by MSE alone is the wrong target whenever the completed data will be used for statistical inference rather than pure prediction.

Core claim

Imputed values that minimize the mean squared error are conditional means. These means reduce the variability present in the original data, which in turn biases estimates of variance, prevalence, correlation, slope, and explained variance. In a multivariate normal example, predictive imputation produces biased results while stochastic imputation, which adds noise scaled to the MSE, produces unbiased results. The same biases appear in missForest, softImpute, and mice when used predictively. Therefore, MSE is not a sufficient criterion for good imputation.
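
The mechanism can be illustrated with a minimal Python sketch. It assumes a bivariate standard normal pair, values of y deleted completely at random, and a linear regression imputation model fitted to the observed cases. The function name `compare_imputations`, the sample size, the correlation of 0.6, and the 50% missing rate are illustrative choices, not values taken from the paper, and the sketch ignores the parameter uncertainty that a proper multiple-imputation routine would also draw.

```python
import numpy as np


def compare_imputations(n=20000, rho=0.6, p_miss=0.5, seed=0):
    """Compare predictive and stochastic regression imputation of y given x.

    All defaults are illustrative assumptions, not settings from the paper.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    y = rho * x + np.sqrt(1.0 - rho**2) * rng.standard_normal(n)
    miss = rng.random(n) < p_miss  # missing completely at random

    # Fit y ~ x on the observed cases only.
    slope, intercept = np.polyfit(x[~miss], y[~miss], 1)
    pred = intercept + slope * x

    # Residual standard deviation: the square root of the (estimated) MSE.
    resid = y[~miss] - pred[~miss]
    sigma = resid.std(ddof=2)

    # Predictive imputation: plug in the conditional mean.
    y_pred = np.where(miss, pred, y)
    # Stochastic imputation: conditional mean plus noise scaled to the MSE.
    y_stoch = np.where(miss, pred + rng.normal(0.0, sigma, n), y)

    def summarize(v):
        return {"var": v.var(ddof=1), "cor": np.corrcoef(x, v)[0, 1]}

    return {
        "complete": summarize(y),
        "predictive": summarize(y_pred),
        "stochastic": summarize(y_stoch),
    }


results = compare_imputations()
for name, stats in results.items():
    print(f"{name:>10}: var={stats['var']:.3f}  cor={stats['cor']:.3f}")
```

Under these assumptions the predictive row should show a variance below, and a correlation above, the complete-data values, while the stochastic row should sit close to them.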

What carries the argument

The argument rests on the distinction between predictive imputation, which replaces missing entries with conditional expectations, and stochastic imputation, which adds noise whose variance equals the mean squared prediction error and thereby restores the original joint distribution.

If this is right

Downstream regression, correlation, and prevalence estimates will be biased when performed on predictively imputed data.
Imputation methods must incorporate stochastic noise to preserve the joint distribution of the variables.
MSE is an inadequate sole measure of imputation quality whenever the goal is valid statistical inference.
Software defaults in common packages should switch from pure predictive to stochastic modes for most analytic uses.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same variability reduction may distort uncertainty estimates in other point-prediction settings outside imputation.
The correction may require adjustment when missingness depends on unobserved values or when the imputation model is misspecified.
Multiple-imputation frameworks that already embed stochastic draws could serve as a practical safeguard without separate noise tuning.

Load-bearing premise

The premise is that the observed biases are caused solely by the reduction in variability from using averages, and that noise scaled exactly to the MSE corrects them without introducing new distortions in finite samples or under model misspecification.

What would settle it

Generate complete data from a known distribution, delete values at random, impute with noise whose variance equals the observed MSE on the deleted entries, and check whether the sample variances and correlations in the completed data equal those of the original complete data within Monte Carlo error.
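
One way this check could be set up is sketched below in Python. It assumes a bivariate normal population, deletion completely at random, and a linear imputation model; the noise variance is set to the MSE measured on the deleted entries, which is only possible in a simulation where the true values are known. The names `one_replication` and `monte_carlo`, and all numeric settings, are hypothetical choices for illustration rather than the paper's design.

```python
import numpy as np


def one_replication(rng, n=500, rho=0.6, p_miss=0.5):
    """Return (variance diff, correlation diff) versus the complete data."""
    x = rng.standard_normal(n)
    y = rho * x + np.sqrt(1.0 - rho**2) * rng.standard_normal(n)
    miss = rng.random(n) < p_miss  # delete values at random

    slope, intercept = np.polyfit(x[~miss], y[~miss], 1)
    pred = intercept + slope * x

    # MSE on the deleted entries, available because the truth is known here.
    mse = np.mean((y[miss] - pred[miss]) ** 2)
    noise = rng.normal(0.0, np.sqrt(mse), n)

    completed = {
        "predictive": np.where(miss, pred, y),
        "stochastic": np.where(miss, pred + noise, y),
    }
    cor_full = np.corrcoef(x, y)[0, 1]
    return {
        name: (v.var(ddof=1) - y.var(ddof=1),
               np.corrcoef(x, v)[0, 1] - cor_full)
        for name, v in completed.items()
    }


def monte_carlo(reps=300, seed=1):
    """Average the differences over replications, with Monte Carlo SEs."""
    rng = np.random.default_rng(seed)
    runs = [one_replication(rng) for _ in range(reps)]
    out = {}
    for name in ("predictive", "stochastic"):
        d = np.array([r[name] for r in runs])  # shape (reps, 2)
        out[name] = {
            "mean_diff": d.mean(axis=0),                     # [variance, correlation]
            "mcse": d.std(axis=0, ddof=1) / np.sqrt(reps),   # Monte Carlo standard error
        }
    return out


summary = monte_carlo()
for name, s in summary.items():
    print(name, "mean diff (var, cor):", s["mean_diff"], "MCSE:", s["mcse"])
```

The claim would be supported if the stochastic mean differences lie within a few Monte Carlo standard errors of zero while the predictive ones do not; small finite-sample deviations are possible even in this correctly specified setting.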

Figures

Figures reproduced from arXiv: 2605.03733 by Stef van Buuren.

**Figure 1.** Visualization of two imputation methods applied to …

Original abstract

Minimizing the Mean Squared Error (MSE) is a key objective in machine learning and is commonly used for imputing missing values. While this approach provides accurate point estimates, it introduces systematic biases in downstream analyses. These biases affect key parameters such as variance, prevalence, correlation, slope, and explained variance. The root cause is that imputed values optimized for MSE are averages, which reduce the natural variability in the data. This paper demonstrates that adding noise to imputed values can effectively eliminate these biases. The required noise level is proportional to the MSE. Using a toy example in a multivariate normal setting, we compare two methods: predictive imputation, which minimizes MSE, and stochastic imputation, which incorporates random noise. Simulation results show that predictive methods systematically introduce bias, while stochastic methods preserve the data's natural variability and produce unbiased estimates. We also evaluate three popular imputation tools -- missForest, softImpute, and mice -- and observe consistent biases in predictive methods. These findings highlight that MSE is an inadequate measure of imputation quality, as it prioritizes accuracy over variability. Incorporating noise into imputation methods is essential to prevent biases and ensure valid downstream analyses, underscoring the importance of stochastic approaches for handling incomplete data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper usefully flags how MSE-driven imputation in tools like missForest and softImpute shrinks variance and biases correlations or slopes, but the core point is already standard in multiple-imputation work and the noise-fix may not hold under misspecification.

The main thing here is that predictive imputation, by returning conditional means, cuts the natural spread in the data and thereby biases variance, correlations, slopes, and R-squared in downstream steps. The paper shows this cleanly in a multivariate normal toy case and then checks it on missForest, softImpute, and mice, where the same pattern appears. Adding noise scaled to the reported MSE is presented as the remedy that restores the moments without changing the point estimates much. That demonstration is concrete and easy to follow, and it gives practitioners a quick way to see why point imputation can mislead even when the imputed values look accurate by MSE. The simulations back the direction of the effect, which is the paper's real service.

The central observation itself is not new; the multiple-imputation literature has long noted that single imputation understates variability and distorts associations. What is added is the side-by-side run on three current packages, which makes the warning more immediate for people who actually use those routines.

A soft spot is the assumption that the MSE from the imputation step equals the right noise variance to add back. In the MVN toy example this holds by construction, but when the learner is misspecified (trees on interactions, matrix factorization on non-low-rank data) the internal error estimate can be too small or too large, so the added noise may leave residual bias in the second moments. The paper does not appear to test that regime directly, which limits how far the recommendation travels.

This is for applied statisticians and ML users who impute data and then run regressions or compute correlations; they will get a practical reminder and a simple adjustment to try. It is not a theoretical breakthrough, but the empirical illustrations are sharp enough that a serious referee should see it. I would send it to review rather than desk-reject.

Referee Report

1 major / 2 minor

Summary. The paper claims that MSE-minimizing predictive imputation introduces systematic biases in downstream statistical analyses (variance, correlations, regression slopes, R²) by replacing missing values with conditional expectations, thereby reducing natural variability. It proposes adding independent noise with variance proportional to the imputation MSE to restore unbiased estimates of these quantities. This is illustrated with a clean multivariate normal toy example comparing predictive and stochastic imputation, supported by simulations, and extended to evaluations of missForest, softImpute, and mice, where predictive versions show biases that stochastic versions mitigate.

Significance. If the proposed noise-addition rule generalizes, the result would have substantial practical significance for imputation in statistical modeling and machine learning pipelines, as it suggests that current MSE-focused methods are inadequate for preserving data properties needed for valid inference. The manuscript earns credit for its transparent toy example in the MVN setting, where the mechanism is exact, and for reproducible simulation comparisons that demonstrate the bias direction. However, the significance is tempered by the need to verify robustness under model misspecification.

major comments (1)

§4 (Evaluation of imputation tools): The evaluations of missForest, softImpute, and mice rely on scaling noise to each method's internal MSE estimate. However, when the underlying model is misspecified (e.g., random forests on non-additive interactions or matrix factorization on non-low-rank data), this internal MSE does not equal the true conditional variance. Consequently, the added noise may under- or over-correct the second moments, leaving residual bias in correlations or R². The manuscript should include a controlled misspecification experiment to test whether the simple scaling rule remains effective.

minor comments (2)

[Abstract] The abstract states that 'the required noise level is proportional to the MSE' without specifying the constant of proportionality or whether it is exactly 1 in the MVN case; this should be clarified for precision.
[Simulation results] It is unclear whether error bars or confidence intervals are reported for the simulation results on bias magnitudes; adding these would strengthen the evidence.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive review. The comment raises an important point about robustness under model misspecification, which we address below.

Referee: §4 (Evaluation of imputation tools): The evaluations of missForest, softImpute, and mice rely on scaling noise to each method's internal MSE estimate. However, when the underlying model is misspecified (e.g., random forests on non-additive interactions or matrix factorization on non-low-rank data), this internal MSE does not equal the true conditional variance. Consequently, the added noise may under- or over-correct the second moments, leaving residual bias in correlations or R². The manuscript should include a controlled misspecification experiment to test whether the simple scaling rule remains effective.

Authors: We agree that the internal MSE reported by missForest, softImpute, or mice will generally differ from the true conditional variance under misspecification, and that this mismatch could leave some residual bias after noise scaling. The multivariate normal toy example in the paper is correctly specified, so the scaling is exact there. In Section 4 we applied noise scaled to each tool's reported MSE on the simulated data and observed clear bias reduction relative to the predictive versions, but we did not include a dedicated misspecification test. We will add such an experiment to the revised manuscript: we will generate data with known misspecifications (e.g., strong interactions for random-forest imputation and non-low-rank structure for matrix factorization) and report the remaining bias in variance, correlations, and R² after applying the simple scaling rule.

revision: yes
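
A much simpler stand-in for such a misspecification experiment is sketched below in Python. It is not the experiment the authors describe: instead of random forests or matrix factorization, it assumes a deliberately wrong linear imputation model applied to data with a quadratic relationship, and it treats the in-sample residual MSE as the model's "internal" error estimate. All names and settings are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50000

# True relationship is quadratic; the imputation model below is linear.
x = rng.standard_normal(n)
y = x**2 + 0.5 * rng.standard_normal(n)
miss = rng.random(n) < 0.5  # missing completely at random

# Misspecified imputation model: y ~ x, fitted on the observed cases.
slope, intercept = np.polyfit(x[~miss], y[~miss], 1)
pred = intercept + slope * x

# "Internal" MSE of the misspecified model, measured on observed cases.
mse = np.mean((y[~miss] - pred[~miss]) ** 2)

# Stochastic imputation with noise variance equal to that MSE.
y_stoch = np.where(miss, pred + rng.normal(0.0, np.sqrt(mse), n), y)

# The marginal variance is roughly restored ...
var_ratio = y_stoch.var(ddof=1) / y.var(ddof=1)
# ... but the association with x**2, which the model ignores, is attenuated.
cor_true = np.corrcoef(x**2, y)[0, 1]
cor_stoch = np.corrcoef(x**2, y_stoch)[0, 1]

print(f"variance ratio (stochastic / complete): {var_ratio:.3f}")
print(f"cor(x^2, y) complete:   {cor_true:.3f}")
print(f"cor(x^2, y) stochastic: {cor_stoch:.3f}")
```

In this toy setting the residual MSE absorbs the lack of fit, so the marginal variance looks fine while a relationship outside the imputation model is still distorted. That suggests a misspecification study should report more than marginal second moments.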

Circularity Check

0 steps flagged

No circularity; claims rest on explicit MVN derivation and simulations

The paper derives the bias mechanism in the MVN toy example by noting that MSE-optimal imputation yields conditional expectations whose variance is reduced by exactly the conditional variance term; adding independent noise with variance equal to that MSE then restores the marginal moments by direct algebra, not by fitting or renaming. Subsequent evaluations of missForest, softImpute and mice are empirical comparisons that do not invoke self-citations as load-bearing premises or smuggle ansatzes. No step equates a prediction to its own input by construction.
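
For concreteness, the algebra can be written out for the bivariate normal case with Y missing and X observed. This is the standard textbook form, given here as a sketch rather than as the paper's own notation; the symbols μ, σ, ρ, and ε are assumptions of this illustration, and the block uses `amsmath`.

```latex
\begin{align}
\operatorname{E}[Y \mid X] &= \mu_Y + \rho\,\frac{\sigma_Y}{\sigma_X}\,(X - \mu_X), \\
\mathrm{MSE} = \operatorname{E}\bigl[(Y - \operatorname{E}[Y \mid X])^2\bigr]
  &= \operatorname{Var}(Y \mid X) = \sigma_Y^2\,(1 - \rho^2), \\
\operatorname{Var}\bigl(\operatorname{E}[Y \mid X]\bigr)
  &= \rho^2 \sigma_Y^2 = \sigma_Y^2 - \sigma_Y^2\,(1 - \rho^2), \\
\operatorname{Var}\bigl(\operatorname{E}[Y \mid X] + \varepsilon\bigr)
  &= \rho^2 \sigma_Y^2 + \sigma_Y^2\,(1 - \rho^2) = \sigma_Y^2,
  \qquad \varepsilon \sim N\bigl(0,\ \sigma_Y^2 (1 - \rho^2)\bigr),\ \varepsilon \perp X, \\
\operatorname{Cov}\bigl(X,\ \operatorname{E}[Y \mid X] + \varepsilon\bigr)
  &= \rho\,\sigma_X \sigma_Y .
\end{align}
```

The third line is the variance lost by predictive imputation, and the fourth shows that independent noise with variance equal to the MSE restores it while leaving the covariance with X unchanged.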

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on standard statistical assumptions for its toy example and simulations; no new free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption: The toy data follow a multivariate normal distribution.
Used to compare predictive and stochastic imputation in a controlled setting.

