Testing independence in the presence of missing data: high-dimensional case
Pith reviewed 2026-05-08 10:52 UTC · model grok-4.3
The pith
Two modifications to a Kendall-based statistic enable testing independence in high-dimensional data with missing observations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Building upon a recently proposed Kendall-based statistic, the authors introduce two new modifications specifically designed to accommodate incomplete observations. The proposed methods are studied from both theoretical and empirical perspectives. A comprehensive simulation study illustrates the robustness and applicability of the new approaches for testing independence in high-dimensional settings with missing data.
What carries the argument
Two modifications of the Kendall-based statistic that adjust the pairwise ranking computations to incorporate incomplete observations while preserving the test's ability to detect dependence.
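To make the idea of pairwise ranking computations on incomplete data concrete, here is a minimal sketch of a pairwise-complete Kendall's tau. It is not either of the authors' two modifications, whose formulas are not given on this page; the function name `pairwise_complete_tau` and the encoding of missing values as `None` or NaN are assumptions made for illustration.

```python
import math
from itertools import combinations


def pairwise_complete_tau(x, y):
    """Kendall's tau-a using only rows where both x and y are observed.

    Illustrative assumption: missing entries are encoded as None or NaN.
    Returns (tau, m), where m is the number of jointly observed rows.
    """
    def observed(v):
        return v is not None and not (isinstance(v, float) and math.isnan(v))

    # Keep only the rows in which both coordinates are available.
    rows = [(a, b) for a, b in zip(x, y) if observed(a) and observed(b)]
    m = len(rows)
    if m < 2:
        return float("nan"), m

    s = 0
    for (a1, b1), (a2, b2) in combinations(rows, 2):
        prod = (a1 - a2) * (b1 - b2)
        s += (prod > 0) - (prod < 0)  # +1 concordant, -1 discordant, 0 tie

    return s / (m * (m - 1) / 2), m
```

Note that the number of usable rows `m` differs from one pair of variables to the next, which is why any normalization has to account for it.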
If this is right
- The adapted tests achieve correct asymptotic size and nontrivial power under high-dimensional regimes with missing entries.
- Performance remains stable across varying proportions and patterns of missingness in finite-sample simulations.
- The methods enlarge the set of available nonparametric tools for analyzing incomplete high-dimensional data structures.
Where Pith is reading between the lines
- If missingness is non-ignorable, the procedures could become invalid, so sensitivity checks on the missing-data mechanism would be prudent before application.
- Analogous adjustments could be developed for other rank-based dependence measures such as Spearman's rho in the same missing-data context.
- Computational scaling for very large dimensions and sample sizes would benefit from efficient implementations of the modified pairwise calculations.
Load-bearing premise
The modifications correctly handle the missing data mechanism without introducing bias in the high-dimensional regime.
What would settle it
A simulation in which missingness depends on the unobserved values themselves; if the empirical type I error rate under the null deviates substantially from the nominal level, the claim fails.
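A rough sketch of such a check is given below, under stated assumptions. The helper names `mnar_mask` and `empirical_size` are hypothetical, the data are independent standard normals (so the null holds), and `test_fn` stands in for whichever test is being examined: it is assumed to take a list of rows with `None` for missing entries and return a p-value. None of this is taken from the paper.

```python
import random


def mnar_mask(row, base=0.1, slope=0.3, rng=random):
    """Delete each entry with a probability that depends on its own value.

    Illustrative MNAR mechanism: positive values are dropped with
    probability base + slope, non-positive values with probability base.
    """
    out = []
    for v in row:
        prob = min(1.0, max(0.0, base + slope * (v > 0)))
        out.append(None if rng.random() < prob else v)
    return out


def empirical_size(test_fn, n, p, reps, alpha=0.05, seed=0):
    """Fraction of null replications in which test_fn rejects at level alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        # Independent coordinates, so the null hypothesis is true.
        data = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
        data = [mnar_mask(row, rng=rng) for row in data]
        if test_fn(data) < alpha:
            rejections += 1
    return rejections / reps
```

If the returned rate sits far from `alpha` for a given test, that would be evidence against validity under this particular mechanism; a rate close to `alpha` would not by itself establish validity under other mechanisms.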
Original abstract
In this paper, we consider the problem of testing independence in high-dimensional settings with missing data. Building upon a recently proposed Kendall-based statistic, we introduce two new modifications specifically designed to accommodate incomplete observations. The proposed methods are studied from both theoretical and empirical perspectives. A comprehensive simulation study illustrates the robustness and applicability of the new approaches. The findings contribute to the development of nonparametric methods for analyzing high-dimensional and incomplete data structures.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes two modifications to a recently introduced Kendall-based statistic for testing independence, adapted to handle incomplete observations in high-dimensional regimes. It develops theoretical results on the asymptotic behavior of the modified statistics and presents a simulation study to illustrate robustness under missing data.
Significance. If the central theoretical claims hold, the work provides a useful nonparametric extension for independence testing when data are high-dimensional and partially observed, a setting common in applications. The combination of theory and simulations is a strength, though the scope of the missingness assumptions requires clarification for the results to be broadly applicable.
major comments (2)
- [§3] Theoretical results on the asymptotic null distribution: the argument that the modified pairwise concordance terms remain unbiased under the null, and that the normalization accounts for random per-pair sample sizes, must be shown explicitly under MAR (missing at random), not just MCAR (missing completely at random). When p/n does not remain bounded, the effective sample size per pair is random, and the concentration used for complete-data Kendall statistics does not automatically carry over without additional bias-correction terms or reweighting.
- [Simulation section, likely §4] The reported robustness is demonstrated only under MCAR or simple MAR. No results are shown for MAR where missingness depends on observed covariates, which is the regime where bias in the high-dimensional limit would be expected to appear and which directly tests the load-bearing premise.
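The point about random per-pair sample sizes can be illustrated with a short sketch that counts, for every pair of variables, how many rows have both entries observed. The function name `pair_sample_sizes` and the `None` encoding of missing values are assumptions for illustration and do not come from the paper.

```python
from itertools import combinations


def pair_sample_sizes(data):
    """Count jointly observed rows for every pair of columns.

    data: list of equal-length rows, with None marking a missing entry.
    Returns a dict mapping (j, k) with j < k to the number of rows in
    which both column j and column k are observed.
    """
    p = len(data[0])
    sizes = {}
    for j, k in combinations(range(p), 2):
        sizes[(j, k)] = sum(
            1 for row in data if row[j] is not None and row[k] is not None
        )
    return sizes
```

Because the mask is random, these counts are random variables, and with many variables some pairs may end up with far fewer usable rows than others.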
minor comments (2)
- [§2] Notation for the two modifications is introduced without a clear side-by-side comparison table; adding one would improve readability.
- [Abstract] The abstract states that the methods are 'studied from both theoretical and empirical perspectives' but does not specify the precise missingness rates or (p,n) regimes used in the theory; this should be stated explicitly.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments on our manuscript. The points raised regarding the scope of the missingness assumptions and the simulation design are helpful for strengthening the paper. We address each major comment below and indicate the planned revisions.
Point-by-point responses
- Referee: §3 (theoretical results on asymptotic null distribution): the argument that the modified pairwise concordance terms remain unbiased under the null and that the normalization accounts for random per-pair sample sizes must be shown explicitly under MAR (not just MCAR); when p/n does not remain bounded the effective sample size per pair is random and the concentration used for complete-data Kendall statistics does not automatically carry over without additional bias-correction terms or reweighting.
Authors: We agree that the current theoretical development in Section 3 is carried out under the MCAR assumption, which guarantees unbiasedness of the modified pairwise concordance terms and permits the normalization to handle random per-pair sample sizes. The referee is correct that the concentration arguments do not automatically extend to MAR without further work when p/n is unbounded. In the revision we will add an explicit derivation under MAR, including the necessary bias-correction or reweighting steps to control the random effective sample sizes and establish the asymptotic null distribution under the stated conditions. revision: yes
- Referee: Simulation section (likely §4): the reported robustness is demonstrated only under MCAR or simple MAR; no results are shown for MAR where missingness depends on observed covariates, which is the regime where the skeptic concern about bias in the high-dimensional limit would manifest and directly tests the load-bearing assumption.
Authors: We acknowledge that the simulation study focuses on MCAR and a basic MAR mechanism and does not yet include missingness that depends on observed covariates. We will expand the simulation section to incorporate such MAR scenarios. The additional experiments will directly examine performance in the regime the referee identifies and will be reported with the same metrics used in the current study. revision: yes
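One way such a MAR scenario could be generated is sketched below. This is a hypothetical mechanism, not the one the authors plan to use: the first column is treated as an always-observed covariate, and the probability of deleting the remaining entries in a row depends only on that covariate. The name `mar_mask` and the parameters `low` and `high` are illustrative.

```python
import random


def mar_mask(data, low=0.05, high=0.5, seed=0):
    """Apply a simple MAR mechanism driven by the first column.

    Column 0 is always kept. Every other entry in a row is deleted with
    probability `high` if that row's column-0 value is positive, and with
    probability `low` otherwise. Missing entries are returned as None.
    """
    rng = random.Random(seed)
    out = []
    for row in data:
        prob = high if row[0] > 0 else low
        masked = [None if rng.random() < prob else v for v in row[1:]]
        out.append([row[0]] + masked)
    return out
```

Since the deletion probability depends only on a value that is always observed, the mechanism is MAR but not MCAR whenever `low` and `high` differ.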
Circularity Check
No significant circularity; new modifications derived and validated independently
Full rationale
The paper explicitly builds on an external recently proposed Kendall-based statistic and introduces two distinct modifications for incomplete observations. These modifications are then subjected to separate theoretical analysis of their asymptotic behavior under high-dimensional regimes and assessed via comprehensive simulations for robustness. No equations or claims reduce by construction to fitted inputs, self-definitions, or unverified self-citations; the derivation chain remains self-contained with independent content in the new adjustments and their validation.