Distribution Shift in Missing Data Imputation: A Risk-Based Perspective and Importance-Weighted Correction under MAR

Katarzyna Reluga; Luke Shannon; Song Liu

arxiv: 2602.06713 · v2 · pith:5CF2FRIBnew · submitted 2026-02-06 · 📊 stat.ML · cs.LG

Pith reviewed 2026-05-16 06:29 UTC · model grok-4.3

keywords: missing data imputation, distribution shift, MAR, importance weighting, risk minimization, simulation study

The pith

Standard imputation fails to minimize full-data error under MAR because observed training data differs in distribution from the target.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formulates missing-data imputation as minimizing mean-squared error risk over the complete data distribution. It shows that when missingness depends on observed values, training solely on observed cases produces a shifted distribution, so ordinary learners do not achieve the intended minimum. The authors derive an importance-weighted objective that reweights observed samples by the inverse probability of being observed, allowing the model to target the correct risk. In simulation experiments, the corrected procedure reduces RMSE by about 3 percent and Wasserstein distance by about 7 percent on average, relative to otherwise identical uncorrected baselines.

Core claim

When the probability of missingness depends on the data, the observed training distribution differs from the full data distribution, so standard imputation models trained by empirical risk minimization on observed cases do not minimize mean-squared error on the complete data. The paper supplies an explicit importance-weighted risk objective that corrects for this shift under the MAR assumption by reweighting each observed sample according to the estimated probability it was observed.
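The identity behind this correction fits in one short derivation. The sketch below uses notation assumed here rather than taken from the paper: Z denotes the always-observed variables, X the variable to impute, R = 1 when X is observed, π(Z) = P(R = 1 | Z) > 0 the probability of being observed, and g the imputation model. MAR is read as R being independent of X given Z.

```latex
\begin{aligned}
\mathbb{E}\!\left[\frac{R}{\pi(Z)}\,\bigl(g(Z)-X\bigr)^2\right]
&= \mathbb{E}\!\left[\frac{\mathbb{E}[R \mid Z, X]}{\pi(Z)}\,\bigl(g(Z)-X\bigr)^2\right] \\
&= \mathbb{E}\!\left[\frac{\pi(Z)}{\pi(Z)}\,\bigl(g(Z)-X\bigr)^2\right]
 = \mathbb{E}\!\left[\bigl(g(Z)-X\bigr)^2\right].
\end{aligned}
```

Without the weight, the observed-case objective is E[R (g(Z) - X)^2] = E[π(Z) (g(Z) - X)^2], which over-represents regions where observation is likely. The paper's exact weighting may differ in form, for example if its target is the distribution of the missing entries rather than the full population.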

What carries the argument

Importance-weighted mean-squared-error risk minimization: each observed instance is weighted by the reciprocal of its probability of being observed, with that probability estimated from the observed data.
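A small numerical illustration may make the mechanism concrete. The sketch below is not the authors' algorithm: it assumes a single always-observed covariate z, a quadratic relationship for the variable x to be imputed, a deliberately misspecified linear imputation model, and a known logistic observation probability. All names and constants are hypothetical choices made for illustration.

```python
import math
import random


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def weighted_linear_fit(z, x, w):
    """Minimise sum_i w_i * (a + b*z_i - x_i)^2 in closed form; returns (a, b)."""
    sw = sum(w)
    z_bar = sum(wi * zi for wi, zi in zip(w, z)) / sw
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    cov = sum(wi * (zi - z_bar) * (xi - x_bar) for wi, zi, xi in zip(w, z, x))
    var = sum(wi * (zi - z_bar) ** 2 for wi, zi in zip(w, z))
    b = cov / var
    return x_bar - b * z_bar, b


def full_data_mse(a, b, z, x):
    """Mean-squared error of the line a + b*z over every row, observed or not."""
    return sum((a + b * zi - xi) ** 2 for zi, xi in zip(z, x)) / len(z)


rng = random.Random(0)
n = 5000
z = [rng.uniform(-2.0, 2.0) for _ in range(n)]      # always observed
x = [zi ** 2 + rng.gauss(0.0, 0.5) for zi in z]     # subject to missingness
pi = [sigmoid(-1.5 * zi) for zi in z]               # P(x observed | z), assumed known
observed = [rng.random() < p for p in pi]           # MAR: depends on z only

z_obs = [zi for zi, r in zip(z, observed) if r]
x_obs = [xi for xi, r in zip(x, observed) if r]
w_obs = [1.0 / p for p, r in zip(pi, observed) if r]  # inverse-probability weights

a_u, b_u = weighted_linear_fit(z_obs, x_obs, [1.0] * len(z_obs))  # uncorrected
a_w, b_w = weighted_linear_fit(z_obs, x_obs, w_obs)               # corrected
mse_unweighted = full_data_mse(a_u, b_u, z, x)
mse_weighted = full_data_mse(a_w, b_w, z, x)
print(f"uncorrected full-data MSE: {mse_unweighted:.3f}")
print(f"corrected full-data MSE:   {mse_weighted:.3f}")
```

The model is misspecified on purpose. If the linear model could represent the true conditional mean exactly, both fits would converge to the same line and the weights would matter much less.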

If this is right

Any imputation pipeline that trains a model on incomplete cases can be upgraded by inserting the importance weights without changing the model class.
Downstream tasks that use the imputed values inherit lower error on the full population distribution.
The same reweighting idea applies to any learner whose loss is evaluated on a distribution different from the training sample.
Gains are consistent across multiple simulation regimes, averaging a 3 percent reduction in RMSE and a 7 percent reduction in Wasserstein distance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same shift-correction logic could be tested on real-world datasets where missingness probabilities are plausibly estimable from observed covariates.
Extending the weights to handle high-dimensional or structured data might require more flexible missingness models.
If the MAR assumption is mildly violated, the method may still reduce bias relative to the uncorrected baseline, but the magnitude of improvement would need separate verification.

Load-bearing premise

The data are missing at random (MAR), so the probability that an entry is missing can be estimated from the observed variables alone without further bias.

What would settle it

Run the corrected versus uncorrected imputers on data generated under MNAR missingness and check whether the RMSE gap disappears or reverses.

Original abstract

Missing data imputation, where a model is trained on observed data to estimate unobserved values, is a fundamental problem in machine learning. In this paper, we rigorously formulate imputation model learning as a mean-squared error risk minimisation problem. We show that when the probability of missingness depends on the data, many state-of-the-art methods fail to account for the resulting distribution shift between the observed data used for training and the full data distribution used for evaluation. Consequently, these approaches do not minimise mean-squared error on the full data distribution. Instead, we propose a novel imputation algorithm designed to learn an imputation model from the observed data while explicitly accounting for this distribution shift. Simulation studies show consistent improvements over otherwise identical uncorrected baselines, with average reductions of 3% in RMSE and 7% in Wasserstein distance.
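The two reported metrics measure different things: RMSE compares each imputed value with its true value, while Wasserstein distance compares the distribution of imputed values with the distribution of true values. The sketch below computes both for a one-dimensional toy case. It assumes the order-1 Wasserstein distance between equal-size samples; the abstract does not say which variant the paper uses, so this is an illustration rather than the paper's evaluation code.

```python
import math


def rmse(imputed, truth):
    """Root-mean-squared error between paired imputed and true values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(imputed, truth)) / len(truth))


def wasserstein_1d(sample_a, sample_b):
    """Wasserstein-1 distance between two equal-size 1D empirical distributions.

    For equal sample sizes the optimal coupling matches sorted values,
    so the distance is the mean absolute gap between order statistics.
    """
    if len(sample_a) != len(sample_b):
        raise ValueError("this sketch assumes equal sample sizes")
    pairs = zip(sorted(sample_a), sorted(sample_b))
    return sum(abs(a - b) for a, b in pairs) / len(sample_a)


truth = [0.0, 2.0, 4.0]
mean_imputed = [2.0, 2.0, 2.0]  # impute every entry with the mean of the truth

print(rmse(mean_imputed, truth))            # sqrt(8/3)
print(wasserstein_1d(mean_imputed, truth))  # 4/3
```

Mean imputation collapses the imputed distribution to a point, which the Wasserstein distance penalises even though the point is the best single guess for RMSE.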

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper recasts imputation as full-data risk minimization under MAR and adds an importance-weighted correction, with small simulation gains, but the finite-sample weight estimation needs a closer look.

The paper frames imputation training as minimizing MSE risk on the full data distribution, then shows that MAR missingness shifts the observed training distribution away from that target. The authors correct for the shift by reweighting the observed-case loss by the inverse of each case's estimated probability of being observed. The combination is the new piece: within the cited literature, prior work on imputation and on importance weighting does not combine the two in this way.

Referee Report

2 major / 2 minor

Summary. The manuscript formulates missing data imputation as minimization of mean-squared error risk on the full data distribution. It shows that under MAR (where missingness depends on observed data), standard methods trained on observed samples incur a distribution shift and fail to minimize the target risk. The authors propose an importance-weighted correction to the imputation objective that reweights by inverse missingness probabilities to recover the full-data risk minimizer, and report simulation gains of 3% lower RMSE and 7% lower Wasserstein distance versus identical uncorrected baselines.

Significance. If the central correction is valid, the work supplies a clean risk-based framing that links imputation to importance sampling and offers a simple, plug-in modification for existing algorithms. The consistent (if modest) simulation improvements indicate potential practical utility in MAR settings. Stronger finite-sample analysis and more detailed empirical protocols would increase the result's impact in statistical machine learning.

major comments (2)

[§3] §3 (theoretical derivation): the unbiasedness argument for the importance-weighted risk assumes known missingness probabilities π; when π is estimated from the finite observed sample the weights become random and potentially correlated with the imputation residuals, yet the manuscript provides no finite-sample bias bound or convergence rate requirement on the propensity estimator. This is load-bearing for the claim that the corrected procedure minimizes full-data MSE.
[Simulation studies] Simulation studies section: the reported average reductions (3% RMSE, 7% Wasserstein) are presented without error bars, number of Monte Carlo repetitions, or the precise data-generating process and propensity model used, preventing assessment of whether the gains are statistically reliable or sensitive to propensity estimation error.

minor comments (2)

[Notation] Notation in the risk objective (early sections) could more explicitly distinguish the observed-data measure from the full-data measure to avoid reader confusion.
[References] The manuscript would benefit from citing recent doubly-robust or augmented inverse-probability weighting literature for missing-data problems.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify important aspects of the theoretical and empirical contributions. We respond to each major comment below and outline the revisions we will incorporate.

Referee: [§3] §3 (theoretical derivation): the unbiasedness argument for the importance-weighted risk assumes known missingness probabilities π; when π is estimated from the finite observed sample the weights become random and potentially correlated with the imputation residuals, yet the manuscript provides no finite-sample bias bound or convergence rate requirement on the propensity estimator. This is load-bearing for the claim that the corrected procedure minimizes full-data MSE.

Authors: We agree that the derivation in §3 establishes unbiasedness of the importance-weighted objective under the assumption that the missingness probabilities π are known. When π is estimated from the finite sample, the resulting weights are random and may correlate with the imputation residuals, so the finite-sample properties are not automatically guaranteed by the population argument. The manuscript does not supply explicit finite-sample bias bounds or convergence rates for the propensity estimator. In the revision we will add a clarifying paragraph in §3 that states the consistency result holds under standard conditions (e.g., consistent estimation of π at a rate faster than n^{-1/4} and the use of sample splitting to break dependence between weights and residuals). While a complete finite-sample analysis lies outside the present scope, these additions will make the assumptions and limitations explicit.
revision: partial

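The sample splitting mentioned in this response can be sketched as cross-fitting: the propensity model used to weight a row is fitted only on rows from the other fold, so a row's weight does not depend on its own missingness indicator. The code below is a minimal illustration under assumptions made here, not the authors' procedure: one covariate, a logistic propensity model fitted by plain gradient ascent, two folds, and a hypothetical clipping floor of 0.05 to bound the weights.

```python
import math
import random


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def fit_logistic(z, r, steps=300, lr=1.0):
    """Gradient ascent on the mean log-likelihood of P(r=1 | z) = sigmoid(a + b*z)."""
    a, b, n = 0.0, 0.0, len(z)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for zi, ri in zip(z, r):
            err = ri - sigmoid(a + b * zi)
            grad_a += err
            grad_b += err * zi
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b


def cross_fitted_weights(z, r, n_folds=2, floor=0.05):
    """Inverse-propensity weight per row, from a model fitted on the other folds."""
    n = len(z)
    weights = [0.0] * n
    for k in range(n_folds):
        train = [i for i in range(n) if i % n_folds != k]
        a, b = fit_logistic([z[i] for i in train], [r[i] for i in train])
        for i in range(k, n, n_folds):  # rows held out from this fit
            p_hat = max(sigmoid(a + b * z[i]), floor)  # clip to bound the weight
            weights[i] = 1.0 / p_hat
    return weights


rng = random.Random(1)
n = 2000
z = [rng.gauss(0.0, 1.0) for _ in range(n)]
r = [1 if rng.random() < sigmoid(0.5 + zi) else 0 for zi in z]  # 1 = observed

weights = cross_fitted_weights(z, r)
obs_weights = [w for w, ri in zip(weights, r) if ri == 1]
print(f"observed rows: {len(obs_weights)}, weight sum / n: {sum(obs_weights) / n:.3f}")
```

If the propensity model is adequate, the weights of the observed rows should sum to roughly n, which gives a quick sanity check before the weights are passed to the imputation loss.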
Referee: [Simulation studies] Simulation studies section: the reported average reductions (3% RMSE, 7% Wasserstein) are presented without error bars, number of Monte Carlo repetitions, or the precise data-generating process and propensity model used, preventing assessment of whether the gains are statistically reliable or sensitive to propensity estimation error.

Authors: We thank the referee for highlighting the lack of reproducibility details. The reported averages were computed over 100 Monte Carlo repetitions on a multivariate Gaussian covariate model with a logistic missingness mechanism and a logistic-regression propensity estimator. In the revised manuscript we will (i) report standard errors or error bars for all metrics, (ii) state the number of repetitions explicitly, (iii) provide the complete data-generating process and propensity model specification, and (iv) include a short sensitivity analysis with respect to propensity estimation error. These changes will allow readers to assess statistical reliability and robustness.
revision: yes
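The reporting the referee asks for, an average gain with its Monte Carlo standard error, can be sketched as follows. This is a toy protocol under assumptions made here and does not reproduce the paper's simulations: a single Gaussian covariate, a quadratic outcome, a misspecified linear imputer, and the true logistic propensity in place of an estimated one. The printed numbers are therefore not comparable with the 3 percent figure.

```python
import math
import random
import statistics


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def weighted_linear_fit(z, x, w):
    """Closed-form weighted least squares for x ~ a + b*z; returns (a, b)."""
    sw = sum(w)
    z_bar = sum(wi * zi for wi, zi in zip(w, z)) / sw
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    cov = sum(wi * (zi - z_bar) * (xi - x_bar) for wi, zi, xi in zip(w, z, x))
    var = sum(wi * (zi - z_bar) ** 2 for wi, zi in zip(w, z))
    b = cov / var
    return x_bar - b * z_bar, b


def full_data_rmse(a, b, z, x):
    return math.sqrt(sum((a + b * zi - xi) ** 2 for zi, xi in zip(z, x)) / len(z))


def one_repetition(rng, n=2000):
    """Relative RMSE reduction of the weighted fit over the unweighted fit."""
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    x = [zi ** 2 + rng.gauss(0.0, 0.5) for zi in z]
    pi = [sigmoid(0.5 - zi) for zi in z]  # true P(observed | z), used directly
    obs = [i for i in range(n) if rng.random() < pi[i]]
    z_obs = [z[i] for i in obs]
    x_obs = [x[i] for i in obs]
    a_u, b_u = weighted_linear_fit(z_obs, x_obs, [1.0] * len(obs))
    a_w, b_w = weighted_linear_fit(z_obs, x_obs, [1.0 / pi[i] for i in obs])
    baseline = full_data_rmse(a_u, b_u, z, x)
    corrected = full_data_rmse(a_w, b_w, z, x)
    return (baseline - corrected) / baseline


rng = random.Random(2026)
gains = [one_repetition(rng) for _ in range(100)]  # 100 Monte Carlo repetitions
mean_gain = statistics.mean(gains)
std_error = statistics.stdev(gains) / math.sqrt(len(gains))
print(f"relative RMSE reduction: {mean_gain:.3f} "
      f"(standard error {std_error:.3f}, {len(gains)} repetitions)")
```

Reporting the standard error next to the mean lets a reader judge whether an average gain of a few percent is distinguishable from Monte Carlo noise.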

Circularity Check

0 steps flagged

No circularity: importance weighting follows from external MAR theory applied to a new risk objective

The paper states the imputation task as full-data MSE risk minimization, notes that standard methods optimize only the observed-data risk under MAR-induced shift, and proposes reweighting the observed loss by inverse missingness probabilities. This step is a direct application of standard importance-sampling identities that hold when propensities are known or consistently estimated; the identities are not derived from the paper's own equations and do not reduce the claimed minimizer to a fitted parameter by construction. No self-citation is load-bearing for the core correction, no ansatz is smuggled, and no uniqueness theorem is invoked. Simulations report empirical gains rather than algebraic identities. The derivation is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the standard MAR assumption and the ability to estimate missingness probabilities from observed data; no new free parameters or invented entities are introduced beyond conventional importance sampling.

axioms (1)

Domain assumption: the data are missing at random (MAR), i.e., the probability of missingness depends only on observed data.
Invoked in the title and abstract to justify the distribution shift analysis and the feasibility of importance weighting.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

$J(g) = \sum_i p(R_i = 0)\, J_i(g_i)$ with $J_i(g_i) = \mathbb{E}\bigl[(g_i(\tilde{X}) - X_i)^2 \mid R_i = 0\bigr]$

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.