Substantive-Model-Compatible Multiple Imputation for Cox Regression with a Diverging Number of Covariates
Pith reviewed 2026-05-22 08:31 UTC · model grok-4.3
The pith
The paper proposes a multiple imputation method for Cox regression that maintains consistency and asymptotic normality even as the number of covariates diverges with sample size.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish a semiparametric multiple imputation framework for Cox regression with missing covariates of diverging dimension. Missing covariates are imputed through high-dimensional SMC-FCS driven by Cox-model likelihood contributions, with rejection sampling enforcing substantive-model compatibility and ridge-regularized posterior draws stabilizing the imputation models. The algorithm then stabilizes the Cox estimator via imputation-regularized optimization, and inference for low-dimensional linear functionals is obtained by combining debiased estimators through Rubin's rules, yielding consistency and asymptotic normality of the pooled estimator.
What carries the argument
High-dimensional SMC-FCS imputation procedure that uses ridge-regularized posterior draws and rejection sampling to generate datasets compatible with the Cox model while stabilizing subsequent estimation of low-dimensional functionals.
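The rejection step can be illustrated with a minimal sketch. It assumes the standard SMC-FCS bound for the Cox model, where a proposed covariate vector is accepted with probability equal to its Cox likelihood contribution divided by the maximum that contribution can take. The paper's exact sampler is not reproduced here; the function names, the `propose` callback, and the `max_tries` cap are hypothetical.

```python
import numpy as np

def cox_acceptance_prob(x_star, beta, H0_t, event):
    """Acceptance probability for a proposed covariate vector (assumed bound).

    x_star : proposed full covariate vector (missing entries filled in)
    beta   : current Cox coefficient draw
    H0_t   : baseline cumulative hazard at the subject's observed time (> 0)
    event  : True if the subject had the event, False if censored
    """
    eta = float(np.dot(x_star, beta))
    cum_haz = H0_t * np.exp(eta)  # H0(t) * exp(x' beta)
    if event:
        # Contribution is proportional to exp(eta) * exp(-cum_haz),
        # which is maximized when cum_haz = 1; the ratio is at most 1.
        return float(cum_haz * np.exp(1.0 - cum_haz))
    # Censored: the contribution is the survival probability, bounded by 1.
    return float(np.exp(-cum_haz))

def draw_compatible(propose, beta, H0_t, event, rng, max_tries=1000):
    """Propose until accepted; raise if max_tries (hypothetical cap) is hit."""
    for _ in range(max_tries):
        x_star = propose(rng)
        if rng.uniform() <= cox_acceptance_prob(x_star, beta, H0_t, event):
            return x_star
    raise RuntimeError("no proposal accepted within max_tries")
```

In this sketch `propose` would wrap a draw from the ridge-regularized imputation model; capping the number of tries is what introduces the finite-rejection Monte Carlo error.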
If this is right
- The pooled estimator obtained by Rubin's rules applied to debiased within-imputation estimators remains consistent for low-dimensional linear functionals of the regression coefficients.
- Asymptotic normality holds for the combined estimator, permitting standard inference procedures even when the covariate dimension diverges.
- The stabilization step via imputation-regularized optimization produces reliable finite-sample behavior for the Cox estimator under missing data.
- The method directly supports analysis of high-dimensional survival data with incomplete covariates in biomedical applications.
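The pooling step in the first two points can be sketched with the classical form of Rubin's rules. This is a minimal illustration: it assumes each imputed dataset yields one debiased estimate of the scalar contrast and one within-imputation variance, and the function name `pool_rubin` is hypothetical. Any refinement the paper applies to the variance formula is not shown.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool m >= 2 scalar estimates and their variances by Rubin's rules.

    Returns the pooled point estimate and the total variance
    T = W + (1 + 1/m) * B, with W the mean within-imputation variance
    and B the between-imputation sample variance.
    """
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = q.size
    q_bar = q.mean()          # pooled point estimate
    within = u.mean()         # W
    between = q.var(ddof=1)   # B
    total = within + (1.0 + 1.0 / m) * between
    return q_bar, total

# Usage: a normal-approximation 95% interval for the contrast.
q_bar, total = pool_rubin([0.42, 0.47, 0.39], [0.010, 0.012, 0.011])
half_width = 1.959964 * np.sqrt(total)
interval = (q_bar - half_width, q_bar + half_width)
```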
Where Pith is reading between the lines
- The same stabilization and compatibility techniques could be tested on other semiparametric survival models to check whether consistency extends beyond the Cox case.
- Hybridization with existing penalization methods might allow the approach to handle even faster growth in dimension while retaining the multiple-imputation variance accounting.
Load-bearing premise
The ridge-regularized high-dimensional imputation models preserve substantive-model compatibility with the Cox regression without distorting the asymptotic behavior of the debiased estimator when the number of covariates grows with sample size.
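To make the premise concrete, here is a minimal sketch of a ridge-regularized posterior draw. It assumes a Gaussian linear imputation model for one covariate with known noise variance `sigma2` and a Gaussian prior whose precision is `lam / sigma2`; these modelling choices and all names are illustrative assumptions, not the paper's specification. The point is that the ridge term keeps the posterior well defined even when the number of predictors exceeds the sample size.

```python
import numpy as np

def ridge_mean(Z, x, lam):
    """Posterior mean (Z'Z + lam I)^{-1} Z'x of the imputation coefficients."""
    p = Z.shape[1]
    A = Z.T @ Z + lam * np.eye(p)
    return np.linalg.solve(A, Z.T @ x)

def ridge_posterior_draw(Z, x, lam, sigma2, rng):
    """One draw from N(ridge_mean, sigma2 * (Z'Z + lam I)^{-1})."""
    p = Z.shape[1]
    A = Z.T @ Z + lam * np.eye(p)   # positive definite whenever lam > 0
    L = np.linalg.cholesky(A)       # A = L L^T
    # Solving L^T v = z with z ~ N(0, I) gives v with covariance A^{-1}.
    noise = np.linalg.solve(L.T, rng.standard_normal(p))
    return ridge_mean(Z, x, lam) + np.sqrt(sigma2) * noise
```

Shrinking `lam` toward zero reduces the regularization bias but, when p exceeds n, also makes `A` nearly singular, which is the trade-off the premise depends on.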
What would settle it
A simulation study in which the pooled estimator exhibits non-vanishing bias, or in which the empirical coverage of Rubin's-rule confidence intervals for a low-dimensional contrast departs materially from the nominal 95 percent level, when the covariate dimension increases proportionally to log n or faster.
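The coverage criterion can be computed as in the following minimal sketch. It assumes each simulation replicate supplies a pooled estimate and a pooled standard error for the contrast, and that the true value is known; the function names are hypothetical.

```python
import numpy as np

def empirical_coverage(estimates, std_errors, truth, z=1.959964):
    """Share of replicates whose normal-approximation interval covers truth."""
    est = np.asarray(estimates, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    lower = est - z * se
    upper = est + z * se
    covered = (lower <= truth) & (truth <= upper)
    return float(covered.mean())

def coverage_mc_se(coverage, n_replicates):
    """Monte Carlo standard error of an empirical coverage proportion."""
    return float(np.sqrt(coverage * (1.0 - coverage) / n_replicates))
```

Comparing the gap between empirical and nominal coverage with a few multiples of `coverage_mc_se` helps separate genuine undercoverage from simulation noise.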
Original abstract
Modern biomedical survival studies with high-dimensional genomic and clinical predictors are challenged by missing covariates. Existing methods conduct inference through penalization and debiasing when the number of covariates diverges with sample size, but they are typically developed with fully observed covariates. Conversely, substantive-model-compatible multiple imputation methods, particularly substantive-model-compatible fully conditional specification (SMC-FCS), provide principled handling of missing covariates while preserving compatibility with the Cox model, yet current methodology and theory remain largely restricted to fixed-dimensional settings. To address these limitations, we propose a semiparametric multiple imputation framework for inference in Cox regression with missing covariates of a diverging dimension. Missing covariates are imputed through a high-dimensional SMC-FCS procedure driven by Cox-model likelihood contributions, with rejection sampling used to enforce substantive-model compatibility and ridge-regularized posterior draws used to stabilize the imputation models. The algorithm stabilizes the Cox estimator through an imputation-regularized optimization iteration and then generates multiply imputed datasets from a stabilized chain. Inference for low-dimensional linear functionals or contrasts, $c^\top \beta$, is obtained by combining debiased estimators and within-imputation variance estimates through Rubin's rules. We establish consistency and asymptotic normality of the resulting pooled estimator under a diverging-dimensional regime. Simulation studies demonstrate favorable finite-sample performance, and an application to the Boston Lung Cancer Survival Cohort illustrates the practical utility of the proposed method for high-dimensional survival studies with incomplete covariates.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a semiparametric multiple imputation framework for Cox regression with missing covariates under a diverging-dimensional regime (p diverging with n). Missing covariates are imputed via high-dimensional SMC-FCS driven by Cox likelihoods, stabilized by ridge regularization and enforced for compatibility via rejection sampling. The procedure generates multiply imputed datasets, fits debiased Cox estimators on each, and combines them via Rubin's rules for inference on low-dimensional functionals c^T β. The central theoretical contribution is a proof of consistency and asymptotic normality of the pooled estimator. Simulations and an application to the Boston Lung Cancer Survival Cohort are included.
Significance. If the asymptotic results hold, the work would meaningfully extend substantive-model-compatible imputation to the high-dimensional survival setting common in biomedical genomics, where existing penalization/debiasing methods assume fully observed data and standard SMC-FCS is limited to fixed p. The combination of ridge stabilization, rejection sampling, and debiased estimation with Rubin's rules is a technically interesting approach. Credit is due for attempting to derive the required rates under diverging p while preserving compatibility.
major comments (2)
- [§3 (Asymptotic Theory)] The consistency and asymptotic normality claims require that the bias introduced by ridge regularization in the high-dimensional SMC-FCS imputation models vanishes at rate o_p(n^{-1/2}). The manuscript does not state or verify an explicit condition on the ridge parameter λ_n (e.g., λ_n = o(n^{-1/2}) or the precise rate needed for the debiasing correction), which is load-bearing for the central claim; without it the extra term in the expansion of the pooled estimator may not be negligible.
- [§2.3 (Rejection Sampling Procedure)] Finite rejection sampling is used to enforce substantive-model compatibility, yet the resulting Monte Carlo error must be shown to be o_p(n^{-1/2}) uniformly when p diverges. The current argument does not control this term at the rate required for sqrt(n)-consistency of the debiased pooled estimator, undermining the asymptotic normality result.
minor comments (2)
- [Abstract] The abstract refers to an 'imputation-regularized optimization iteration' without a brief definition or reference to the relevant equation; adding one sentence would improve readability.
- [Simulation Studies] Simulation studies should explicitly link the chosen (n, p) sequences to the regularity conditions assumed in the diverging-p theory.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive major comments. The observations on the asymptotic analysis are well-taken and point to places where the manuscript can be strengthened by making implicit rate conditions explicit. We address each comment below and will incorporate the suggested clarifications into the revised manuscript.
- Referee: [§3 (Asymptotic Theory)] The consistency and asymptotic normality claims require that the bias introduced by ridge regularization in the high-dimensional SMC-FCS imputation models vanishes at rate o_p(n^{-1/2}). The manuscript does not state or verify an explicit condition on the ridge parameter λ_n (e.g., λ_n = o(n^{-1/2}) or the precise rate needed for the debiasing correction), which is load-bearing for the central claim; without it the extra term in the expansion of the pooled estimator may not be negligible.
  Authors: We agree that an explicit condition on λ_n is required to guarantee the regularization bias term is o_p(n^{-1/2}) after debiasing. In the revision we will add a new assumption (Assumption 3.4) stating that λ_n = o(n^{-1/2} p^{-1/2}) and verify in the proof of Theorem 3.1 that, under this rate together with the existing sparsity and eigenvalue conditions, the extra bias term is absorbed into the o_p(n^{-1/2}) remainder. This addition does not change the main consistency and normality statements but makes the proof self-contained.
  Revision: yes
- Referee: [§2.3 (Rejection Sampling Procedure)] Finite rejection sampling is used to enforce substantive-model compatibility, yet the resulting Monte Carlo error must be shown to be o_p(n^{-1/2}) uniformly when p diverges. The current argument does not control this term at the rate required for sqrt(n)-consistency of the debiased pooled estimator, undermining the asymptotic normality result.
  Authors: We concur that a uniform bound on the Monte Carlo error induced by finite rejection sampling is necessary when p diverges. We will revise Section 2.3 to include a new lemma (Lemma 2.1) that controls the total variation distance between the finite-rejection and exact conditional distributions by O(1/M + exp(-c n / p)) under a mild lower bound on the acceptance probability; choosing M to grow faster than n^{1/2} (that is, n^{1/2}/M → 0) then ensures the Monte Carlo contribution is o_p(n^{-1/2}) uniformly in the diverging dimension. This lemma will be invoked in the proof of asymptotic normality to close the argument.
  Revision: yes
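The rate arithmetic behind this bound can be checked numerically. The sketch below assumes the O(1/M + exp(-c n / p)) bound quoted in the response, with a hypothetical constant c = 1 and a hypothetical function name. The Monte Carlo term is o(n^{-1/2}) exactly when sqrt(n) times the bound tends to zero, which requires M to grow faster than n^{1/2} and p to grow slowly enough that the exponential term vanishes.

```python
import math

def scaled_mc_error(n, p, M, c=1.0):
    """sqrt(n) times the assumed bound 1/M + exp(-c * n / p).

    Values tending to zero as n grows indicate an o(n^{-1/2}) contribution.
    """
    return math.sqrt(n) * (1.0 / M + math.exp(-c * n / p))

# Illustration with p = sqrt(n): M = n drives the scaled error to zero,
# while M = n^{1/4} lets it grow with n.
for n in (10**4, 10**6, 10**8):
    p = math.sqrt(n)
    fast = scaled_mc_error(n, p, M=n)
    slow = scaled_mc_error(n, p, M=n ** 0.25)
    print(n, fast, slow)
```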
Circularity Check
No significant circularity; asymptotic claims rest on external semiparametric theory
full rationale
The paper introduces a ridge-regularized high-dimensional SMC-FCS imputation procedure with rejection sampling to enforce compatibility, then derives consistency and asymptotic normality for the pooled debiased Cox estimator when p diverges with n. The derivation chain invokes standard high-dimensional M-estimation expansions and Rubin's rules combination, without any step that defines a target functional in terms of itself or renames a fitted quantity as a prediction. No load-bearing self-citation reduces the central result to an unverified prior claim by the same authors; the compatibility and stabilization arguments are presented as algorithmic choices whose approximation error is controlled by the stated rates on the ridge parameter and acceptance probability. The framework is therefore self-contained against external benchmarks for semiparametric inference.
Axiom & Free-Parameter Ledger
free parameters (1)
- ridge regularization parameter
axioms (1)
- Domain assumption: standard regularity conditions for the Cox model and asymptotic normality under diverging dimensions
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction · reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: We establish consistency and asymptotic normality of the resulting pooled estimator under a diverging-dimensional regime.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.