Recognition: unknown
A Semi-Supervised Kernel Two-Sample Test
Pith reviewed 2026-05-09 17:07 UTC · model grok-4.3
The pith
A semi-supervised kernel two-sample test uses abundant unlabeled covariates to produce an asymptotically normal statistic that is easy to calibrate and often more powerful than standard kernel tests.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By incorporating semi-supervised covariate data into a kernel framework, the method constructs a test statistic whose distribution under the null converges to a normal limit, enabling direct use of standard normal critical values, while the resulting procedure attains greater asymptotic power than covariate-ignoring kernel tests and is consistent against fixed and local alternatives.
What carries the argument
A kernel-based test statistic that folds in unlabeled covariate information while preserving asymptotic normality under the null hypothesis of identical distributions.
Load-bearing premise
Adding covariate information to the test statistic still yields asymptotic normality under the null even though it breaks the exchangeability that permutation tests rely on.
What would settle it
A Monte Carlo study under the null with informative covariates in which the empirical distribution of the proposed statistic deviates substantially from normality at sample sizes where the theory predicts convergence.
Figures
read the original abstract
We consider the problem of two-sample testing in a semi-supervised setting with abundant unlabeled covariate data. Standard two-sample tests neglect covariate information, which has the potential to significantly boost performance. However, incorporating covariates potentially breaks the exchangeability assumption under the null, which further complicates a calibration procedure. To address these issues, we propose a semi-supervised method that produces a test statistic with asymptotic normality, while effectively integrating additional information from covariates. Our test is straightforward to calibrate due to the asymptotic normality under the null and achieves asymptotic power that is often much higher than existing kernel tests without covariates. Furthermore, we formally show that the proposed method is consistent in power against fixed and local alternatives. Simulations confirm the practical and theoretical strengths of our approach.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a semi-supervised kernel two-sample test that incorporates abundant unlabeled covariate data to improve performance over standard kernel tests. It claims the resulting test statistic is asymptotically normal under the null (enabling calibration via normal critical values), achieves higher asymptotic power than existing kernel methods without covariates, and is consistent against both fixed and local alternatives. These properties are supported by theoretical analysis and simulation experiments.
Significance. If the asymptotic normality result holds, the work offers a practical advance for two-sample testing in semi-supervised regimes by allowing covariate information to boost power without requiring complex resampling-based calibration. The consistency proofs and power comparisons would strengthen the case for adopting such methods when unlabeled covariates are available.
major comments (1)
- [Abstract and theoretical derivation of the test statistic] The central claim of asymptotic normality under the null (stated in the abstract and presumably derived in the theoretical section) is load-bearing for the entire calibration procedure, power analysis, and consistency results. The manuscript must explicitly verify that the semi-supervised construction preserves the conditions for the CLT (e.g., appropriate centering and variance terms) even though covariates break exchangeability; without this step-by-step check, the normality assertion cannot be confirmed and the claimed advantages over standard kernel tests do not follow.
minor comments (2)
- Clarify the precise form of the semi-supervised kernel estimator and any additional assumptions (e.g., on the covariate distribution or kernel bandwidth) needed for the asymptotic results.
- In the simulation section, report the exact sample sizes, covariate dimensions, and number of Monte Carlo replications to allow direct replication of the power comparisons.
Simulated Author's Rebuttal
We thank the referee for the careful review and for emphasizing the need for explicit verification of the asymptotic normality result. This is central to our claims, and we will revise the manuscript to strengthen the theoretical presentation as requested.
read point-by-point responses
-
Referee: The central claim of asymptotic normality under the null (stated in the abstract and presumably derived in the theoretical section) is load-bearing for the entire calibration procedure, power analysis, and consistency results. The manuscript must explicitly verify that the semi-supervised construction preserves the conditions for the CLT (e.g., appropriate centering and variance terms) even though covariates break exchangeability; without this step-by-step check, the normality assertion cannot be confirmed and the claimed advantages over standard kernel tests do not follow.
Authors: We agree that an explicit, step-by-step verification of the CLT conditions is required, especially since the unlabeled covariates break exchangeability of the labeled samples under the null. In the original derivation (Section 3), the test statistic is constructed by first estimating the conditional kernel mean embeddings from the abundant unlabeled covariates and then centering the labeled kernel terms accordingly; this yields a sum of conditionally mean-zero terms whose variance converges in probability to a positive constant. To address the referee's concern directly, we will add a new subsection that (i) states the null hypothesis in terms of the covariate-conditional distributions, (ii) verifies the centering removes the bias induced by the covariates, (iii) shows the variance estimator is consistent by law of large numbers on the unlabeled data, and (iv) confirms the Lindeberg condition holds for the triangular array of semi-supervised terms. These additions will make transparent that asymptotic normality is preserved and that the power gains relative to the standard kernel test follow from the reduced variance. We will also ensure the abstract and introduction reference this expanded derivation. revision: yes
Circularity Check
No significant circularity: asymptotic normality derived from standard CLT arguments on the semi-supervised statistic
full rationale
The paper proposes a new semi-supervised kernel two-sample statistic that incorporates unlabeled covariates while claiming to retain asymptotic normality under the null (despite broken exchangeability). This normality is established via direct analysis of the statistic's mean and variance under the null, not by redefining the target quantity in terms of itself or by fitting parameters on the same data and relabeling the fit as a prediction. No self-definitional loops, fitted-input predictions, or load-bearing self-citations appear in the derivation chain; the consistency and power results follow from the same limiting distribution without reducing to the input data by construction. The approach is therefore self-contained against external benchmarks such as standard kernel MMD tests.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Kernel functions are positive definite and appropriate for two-sample testing
- domain assumption The constructed test statistic has asymptotic normality under the null hypothesis
Reference graph
Works this paper leans on
-
[1]
Semi-supervised inference: General theory and estimation of means
Zhang, Anru and Brown, Lawrence D and Cai, T Tony , journal=. Semi-supervised inference: General theory and estimation of means. 2019 , publisher=
2019
-
[2]
Journal of Machine Learning Research , volume=
On the Optimality of Gaussian Kernel Based Nonparametric Tests against Smooth Alternatives , author=. Journal of Machine Learning Research , volume=
-
[3]
Statistical Papers , volume=
Hellinger distances and -entropy in a one-parameter class of density functions , author=. Statistical Papers , volume=. 1989 , publisher=
1989
-
[4]
Two-sample smooth tests for the equality of distributions , volume =
Zhou, Wen-Xin and Zheng, Chao and Zhang, Zhen , year =. Two-sample smooth tests for the equality of distributions , volume =. Bernoulli , publisher =
-
[5]
Sutherland and Hsiao-Yu Tung and Heiko Strathmann and Soumyajit De and Aaditya Ramdas and Alex Smola and Arthur Gretton , booktitle=
Danica J. Sutherland and Hsiao-Yu Tung and Heiko Strathmann and Soumyajit De and Aaditya Ramdas and Alex Smola and Arthur Gretton , booktitle=. Generative Models and Model Criticism via Optimized Maximum Mean Discrepancy
-
[6]
Computational-Statistical Trade-off in Kernel Two-Sample Testing with Random Fourier Features
Choi, Ikjun and Kim, Ilmun , journal=. Computational-Statistical Trade-off in Kernel Two-Sample Testing with Random Fourier Features
-
[7]
B-test: A non-parametric, low variance kernel two-sample test
Zaremba, Wojciech and Gretton, Arthur and Blaschko, Matthew , journal=. B-test: A non-parametric, low variance kernel two-sample test
-
[8]
Advances in Neural Information Processing Systems , volume=
Optimal kernel choice for large-scale two-sample tests , author=. Advances in Neural Information Processing Systems , volume=
-
[9]
The Annals of Statistics , volume=
The projected covariance measure for assumption-lean variable significance testing , author=. The Annals of Statistics , volume=. 2024 , publisher=
2024
-
[10]
Statistic Surveys , volume=
Methods for quantifying dataset similarity: a review, taxonomy and comparison , author=. Statistic Surveys , volume=. 2024 , publisher=
2024
-
[11]
Advances in Neural Information Processing Systems , volume=
An efficient doubly-robust test for the kernel treatment effect , author=. Advances in Neural Information Processing Systems , volume=
-
[12]
R \'e nyi divergence and Kullback-Leibler divergence
Van Erven, Tim and Harremos, Peter , journal=. R \'e nyi divergence and Kullback-Leibler divergence. 2014 , publisher=
2014
-
[13]
Bobkov, Sergey G and Chistyakov, GP and G. R. The Annals of Probability , volume=. 2019 , publisher=
2019
-
[14]
2012 , publisher=
Asymptotic methods in statistical decision theory , author=. 2012 , publisher=
2012
-
[15]
IEEE Transactions on Information Theory , volume=
Minimax rates of entropy estimation on large alphabets via best polynomial approximation , author=. IEEE Transactions on Information Theory , volume=. 2016 , publisher=
2016
-
[16]
The Berry-Esseen bound for Student's statistic
Bentkus, Vidmantas and G. The Berry-Esseen bound for Student's statistic. The Annals of Probability , volume=. 1996 , publisher=
1996
-
[17]
On normal approximations to U-statistics
Bentkus, Vidmantas and Jing, Bing-Yi and Zhou, Wang , journal=. On normal approximations to U-statistics. 2009 , publisher=
2009
-
[18]
, author=
Lower bounds on the smallest eigenvalue of a sample covariance matrix. , author=. Electronic Communications in Probability , volume=. 2014 , publisher=
2014
-
[19]
p-Norm bounds on the expectation of the maximum of a possibly dependent sample
Arnold, Barry C , journal=. p-Norm bounds on the expectation of the maximum of a possibly dependent sample. 1985 , publisher=
1985
-
[20]
Biometrika , volume=
High-dimensional semi-supervised learning: in search of optimal inference of the mean , author=. Biometrika , volume=. 2022 , publisher=
2022
-
[21]
Information Theory: From Coding to Learning
Polyanskiy, Yury and Wu, Yihong , year=. Information Theory: From Coding to Learning
-
[22]
, author=
The correlation-assisted missing data estimator. , author=. Journal of Machine Learning Research , volume=
-
[23]
The Annals of Statistics , volume=
A constrained risk inequality with applications to nonparametric functional estimation , author=. The Annals of Statistics , volume=. 1996 , publisher=
1996
-
[24]
International Conference on Artificial Intelligence and Statistics , pages=
A constrained risk inequality for general losses , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2021 , organization=
2021
-
[25]
Double robust semi-supervised inference for the mean: selection bias under MAR labeling with decaying overlap
Zhang, Yuqian and Chakrabortty, Abhishek and Bradic, Jelena , journal=. Double robust semi-supervised inference for the mean: selection bias under MAR labeling with decaying overlap. 2023 , publisher=
2023
-
[26]
arXiv preprint arXiv:2306.00265 , year=
Doubly Robust Self-Training , author=. arXiv preprint arXiv:2306.00265 , year=
-
[27]
Angelopoulos and Stephen Bates and Clara Fannjiang and Michael I
Anastasios N. Angelopoulos and Stephen Bates and Clara Fannjiang and Michael I. Jordan and Tijana Zrnic , title =. Science , volume =
-
[28]
High dimensional M-estimation with missing outcomes: A semi-parametric framework
Chakrabortty, Abhishek and Lu, Jiarui and Cai, T Tony and Li, Hongzhe , journal=. High dimensional M-estimation with missing outcomes: A semi-parametric framework
-
[29]
The Annals of Statistics , number =
Efficient and adaptive linear regression in semi-supervised settings , author =. The Annals of Statistics , number =
-
[30]
Deng, Siyi and Ning, Yang and Zhao, Jiwei and Zhang, Heping , year =. Optimal and Safe Estimation for High-Dimensional Semi-Supervised Learning. doi:10.1080/01621459.2023.2277409 , journal =
-
[31]
Journal of the American Statistical Association , volume=
Semi-supervised linear regression , author=. Journal of the American Statistical Association , volume=. 2022 , publisher=
2022
-
[32]
Chakrabortty, Abhishek and Dai, Guorong and Tchetgen, Eric Tchetgen , journal=
-
[33]
A General M-estimation Theory in Semi-Supervised Framework
Song, Shanshan and Lin, Yuanyuan and Zhou, Yong , journal=. A General M-estimation Theory in Semi-Supervised Framework. 2023 , publisher=
2023
-
[34]
2018 , institution=
A few notes on contiguity, asymptotics, and local asymptotic normality , author=. 2018 , institution=
2018
-
[35]
Asymptotic Statistics
van der Vaart, Aad W , volume=. Asymptotic Statistics. 2000 , publisher=
2000
-
[36]
Semiparametric statistics
Van der Vaart, Aad W , pages=. Semiparametric statistics. 2002 , journal=
2002
-
[37]
Semiparametric Doubly Robust Targeted Double Machine Learning: A Review
Semiparametric doubly robust targeted double machine learning: a review , author=. arXiv preprint arXiv:2203.06469 , year=
-
[38]
Lee, A. J. , year=. U-statistics: Theory and Practice
-
[39]
Wassily Hoeffding , journal =
-
[40]
CLT for U-statistics with growing dimension
DiCiccio, Cyrus and Romano, Joseph , journal=. CLT for U-statistics with growing dimension
-
[41]
The Econometrics Journal , volume=
Double/debiased machine learning foratment and structural parameters , author=. The Econometrics Journal , volume=. 2018 , page=
2018
-
[42]
Proceedings of the National Academy of Sciences , volume=
Universal inference , author=. Proceedings of the National Academy of Sciences , volume=. 2020 , publisher=
2020
-
[43]
, year =
Kennedy, Edward H. , year =. Electronic Journal of Statistics , publisher =
-
[44]
V. S. Koroljuk and Yu. V. Borovskich , title = ". 1994 , publisher =. doi:10.1007/978-94-017-3515-5 , url =
-
[45]
Larry Wasserman , title =. 2004 , publisher =. doi:10.1007/978-0-387-21736-9 , url =
-
[46]
Applications of the van Trees inequality: a Bayesian Cram \'e r-Rao bound
Gill, Richard D and Levit, Boris Y , journal=. Applications of the van Trees inequality: a Bayesian Cram \'e r-Rao bound. 1995 , publisher=
1995
-
[47]
2010 , publisher=
Asymptotic theory for cross-validated targeted maximum likelihood estimation , author=. 2010 , publisher=
2010
-
[48]
On a family of distributions attaining the Bhattacharyya bound
Tanaka, Hidekazu and Akahira, Masafumi , journal=. On a family of distributions attaining the Bhattacharyya bound. 2003 , publisher=
2003
-
[49]
On some analogues of the amount of information and their use in statistical estimation , author=. Sankhy. 1946 , publisher=
1946
-
[50]
2006 , publisher=
Theory of point estimation , author=. 2006 , publisher=
2006
-
[51]
The Journal of Machine Learning Research , volume=
Quantifying uncertainty in random forests via confidence intervals and hypothesis tests , author=. The Journal of Machine Learning Research , volume=. 2016 , publisher=
2016
-
[52]
V-statistics and variance estimation
Zhou, Zhengze and Mentch, Lucas and Hooker, Giles , journal=. V-statistics and variance estimation. 2021 , publisher=
2021
-
[53]
Jackknifing U-statistics
Arvesen, James N , journal=. Jackknifing U-statistics. 1969 , publisher=
1969
-
[54]
A Distribution-Free Theory of Nonparametric Regression
L. A Distribution-Free Theory of Nonparametric Regression. 2002 , publisher =. doi:10.1007/b97848 , url =
-
[55]
Nonparametric regression using deep neural networks with ReLU activation function , volume=
Johannes Schmidt-Hieber , title = ". 2020 , month = aug, publisher =. doi:10.1214/19-aos1875 , url =
-
[56]
arXiv preprint arXiv:2102.12034 (accepted to Biometrika) , year=
Semiparametric counterfactual density estimation , author=. arXiv preprint arXiv:2102.12034 (accepted to Biometrika) , year=
-
[57]
The Journal of Machine Learning Research , volume=
Analysis of a random forests model , author=. The Journal of Machine Learning Research , volume=. 2012 , publisher=
2012
-
[58]
Semi-supervised regression: A recent review
Kostopoulos, Georgios and Karlos, Stamatis and Kotsiantis, Sotiris and Ragos, Omiros , journal=. Semi-supervised regression: A recent review. 2018 , publisher=
2018
-
[59]
Peter J. Downey , title =. 1990 , month = may, publisher =. doi:10.1016/0167-6377(90)90018-z , url =
-
[60]
Ph. Rigollet and A. B. Tsybakov , title =. 2007 , month = sep, publisher =. doi:10.3103/s1066530707030052 , url =
-
[61]
Learning Theory and Kernel Machines: 16th Annual Conference on Learning Theory and 7th Kernel Workshop, COLT/Kernel 2003, Washington, DC, USA, August 24-27, 2003
Optimal rates of aggregation , author=. Learning Theory and Kernel Machines: 16th Annual Conference on Learning Theory and 7th Kernel Workshop, COLT/Kernel 2003, Washington, DC, USA, August 24-27, 2003. Proceedings , pages=. 2003 , organization=
2003
-
[62]
van der Laan and Eric C Polley and Alan E
Mark J. van der Laan and Eric C Polley and Alan E. Hubbard , title = ". 2007 , month = jan, publisher =. doi:10.2202/1544-6115.1309 , url =
-
[63]
Estimating the D imension of a M odel
Herman Callaert and Paul Janssen , title = ". 1978 , month = mar, publisher =. doi:10.1214/aos/1176344132 , url =
-
[64]
Berry , title = "
Andrew C. Berry , title = ". 1941 , publisher =
1941
-
[65]
, series=
Esseen, C.G. , series=. On the Liapounoff Limit of Error in the Theory of Probability. 1942 , publisher=
1942
-
[66]
, year =
Zrnic, Tijana and Candès, Emmanuel J. , year =. Cross-prediction-powered inference , volume =. Proceedings of the National Academy of Sciences , publisher =
-
[67]
Normal approximation by Stein's method
Chen, Louis HY and Goldstein, Larry and Shao, Qi-Man , volume=. Normal approximation by Stein's method. 2011 , publisher=
2011
-
[68]
arXiv preprint arXiv:2111.15546 , year=
Black box tests for algorithmic stability , author=. arXiv preprint arXiv:2111.15546 , year=
-
[69]
Bagging Provides Assumption-free Stability
Soloff, Jake A and Barber, Rina Foygel and Willett, Rebecca , journal=. Bagging Provides Assumption-free Stability
-
[70]
Advances in Neural Information Processing Systems , volume=
Debiased machine learning without sample-splitting for stable estimators , author=. Advances in Neural Information Processing Systems , volume=
-
[71]
Double/de-biased machine learning of global and local parameters using regularized Riesz representers
Chernozhukov, Victor and Newey, W and Robins, James and Singh, Rahul , journal=. Double/de-biased machine learning of global and local parameters using regularized Riesz representers
-
[72]
2021 , publisher=
Hirshberg, David A and Wager, Stefan , journal=. 2021 , publisher=
2021
-
[73]
Journal of the American Statistical Association , volume=
A general framework for inference on algorithm-agnostic variable importance , author=. Journal of the American Statistical Association , volume=. 2023 , publisher=
2023
-
[74]
Mark J. van der Laan and Daniel Rubin , title =". 2006 , month = jan, publisher =. doi:10.2202/1557-4679.1043 , url =
-
[75]
Alexander R. Luedtke and Mark J. van der Laan , title = ". 2016 , month = apr, publisher =. doi:10.1214/15-aos1384 , url =
-
[76]
Adversarial Estimation of Riesz Representers
Chernozhukov, Victor and Newey, Whitney and Singh, Rahul and Syrgkanis, Vasilis , journal=. Adversarial Estimation of Riesz Representers
-
[77]
NATO science series sub series iii computer and systems sciences , volume=
Leave-one-out error and stability of learning algorithms with applications , author=. NATO science series sub series iii computer and systems sciences , volume=. 2003 , publisher=
2003
-
[78]
Cross-Validation and Mean-Square Stability
Kale, Satyen and Kumar, Ravi and Vassilvitskii, Sergei , booktitle=. Cross-Validation and Mean-Square Stability
-
[79]
Stability and Generalization
Bousquet, Olivier and Elisseeff, Andr. Stability and Generalization. The Journal of Machine Learning Research , volume=. 2002 , publisher=
2002
-
[80]
Train faster, generalize better: Stability of stochastic gradient descent
Hardt, Moritz and Recht, Ben and Singer, Yoram , booktitle=. Train faster, generalize better: Stability of stochastic gradient descent. 2016 , organization=
2016
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.