pith. machine review for the scientific record. sign in

arxiv: 2605.01775 · v1 · submitted 2026-05-03 · 📊 stat.ML · cs.LG· stat.ME

Recognition: unknown

A Semi-Supervised Kernel Two-Sample Test

Authors on Pith no claims yet

Pith reviewed 2026-05-09 17:07 UTC · model grok-4.3

classification 📊 stat.ML cs.LGstat.ME
keywords semi-supervisedtwo-sample testkernel methodsasymptotic normalitycovariateshypothesis testingpower analysis
0
0 comments X

The pith

A semi-supervised kernel two-sample test uses abundant unlabeled covariates to produce an asymptotically normal statistic that is easy to calibrate and often more powerful than standard kernel tests.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses two-sample testing when extra unlabeled data on covariates is available but the main samples are limited. Standard kernel tests ignore this covariate information and rely on exchangeability for calibration, which adding covariates can break. The proposed method builds a test statistic that remains asymptotically normal under the null hypothesis of equal distributions, allowing calibration via normal quantiles. This integration of covariates yields higher asymptotic power against alternatives while maintaining consistency for both fixed and local alternatives. Simulations support that the approach improves performance in practice without complicating the calibration step.

Core claim

By incorporating semi-supervised covariate data into a kernel framework, the method constructs a test statistic whose distribution under the null converges to a normal limit, enabling direct use of standard normal critical values, while the resulting procedure attains greater asymptotic power than covariate-ignoring kernel tests and is consistent against fixed and local alternatives.

What carries the argument

A kernel-based test statistic that folds in unlabeled covariate information while preserving asymptotic normality under the null hypothesis of identical distributions.

Load-bearing premise

Adding covariate information to the test statistic still yields asymptotic normality under the null even though it breaks the exchangeability that permutation tests rely on.

What would settle it

A Monte Carlo study under the null with informative covariates in which the empirical distribution of the proposed statistic deviates substantially from normality at sample sizes where the theory predicts convergence.

Figures

Figures reproduced from arXiv: 2605.01775 by Gyumin Lee, Ilmun Kim, Shubhanshu Shekhar.

Figure 1
Figure 1. Figure 1: Experimental results for the distribution view at source ↗
Figure 2
Figure 2. Figure 2: We consider the case of PV = N(0d, ΣV ) and PW = N(aϵ,j , ΣW ), where aϵ,j ∈ R d has its first j entries equal to ϵ and the rest zero. We let ΣV = ΣW = ρ1d1 ⊤ d + (1 − ρ)Id and obtain {Vi} n1+m1 i=1 by sampling n1 + m1 independent samples from PV . We then con￾struct V = (V ⊤ 1 , . . . , V ⊤ n1 ) ⊤ ∈ R n1×d and obtain a set of n1 labeled samples, X = V · b, where b = (bi) d i=1 ∈ R d with bi = 1 if i belon… view at source ↗
Figure 2
Figure 2. Figure 2: Power comparisons across different dependence scenarios. The xssMMD tests, employing various view at source ↗
Figure 3
Figure 3. Figure 3: An illustration of the construction of the xssMMD statistic based on the same principles as the general view at source ↗
Figure 4
Figure 4. Figure 4: Experimental results for the distribution of view at source ↗
Figure 5
Figure 5. Figure 5: Experimental results for the distribution of view at source ↗
Figure 6
Figure 6. Figure 6: Experimental results for the distribution of view at source ↗
Figure 7
Figure 7. Figure 7: Power analysis of the xssMMD test in various settings. The first two subfigures depict scenarios in view at source ↗
Figure 8
Figure 8. Figure 8: An example of data construction when testing coastal birds against grassland birds. Labeled data view at source ↗
Figure 9
Figure 9. Figure 9: An example of data construction of images with Gaussian noise of view at source ↗
read the original abstract

We consider the problem of two-sample testing in a semi-supervised setting with abundant unlabeled covariate data. Standard two-sample tests neglect covariate information, which has the potential to significantly boost performance. However, incorporating covariates potentially breaks the exchangeability assumption under the null, which further complicates a calibration procedure. To address these issues, we propose a semi-supervised method that produces a test statistic with asymptotic normality, while effectively integrating additional information from covariates. Our test is straightforward to calibrate due to the asymptotic normality under the null and achieves asymptotic power that is often much higher than existing kernel tests without covariates. Furthermore, we formally show that the proposed method is consistent in power against fixed and local alternatives. Simulations confirm the practical and theoretical strengths of our approach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes a semi-supervised kernel two-sample test that incorporates abundant unlabeled covariate data to improve performance over standard kernel tests. It claims the resulting test statistic is asymptotically normal under the null (enabling calibration via normal critical values), achieves higher asymptotic power than existing kernel methods without covariates, and is consistent against both fixed and local alternatives. These properties are supported by theoretical analysis and simulation experiments.

Significance. If the asymptotic normality result holds, the work offers a practical advance for two-sample testing in semi-supervised regimes by allowing covariate information to boost power without requiring complex resampling-based calibration. The consistency proofs and power comparisons would strengthen the case for adopting such methods when unlabeled covariates are available.

major comments (1)
  1. [Abstract and theoretical derivation of the test statistic] The central claim of asymptotic normality under the null (stated in the abstract and presumably derived in the theoretical section) is load-bearing for the entire calibration procedure, power analysis, and consistency results. The manuscript must explicitly verify that the semi-supervised construction preserves the conditions for the CLT (e.g., appropriate centering and variance terms) even though covariates break exchangeability; without this step-by-step check, the normality assertion cannot be confirmed and the claimed advantages over standard kernel tests do not follow.
minor comments (2)
  1. Clarify the precise form of the semi-supervised kernel estimator and any additional assumptions (e.g., on the covariate distribution or kernel bandwidth) needed for the asymptotic results.
  2. In the simulation section, report the exact sample sizes, covariate dimensions, and number of Monte Carlo replications to allow direct replication of the power comparisons.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful review and for emphasizing the need for explicit verification of the asymptotic normality result. This is central to our claims, and we will revise the manuscript to strengthen the theoretical presentation as requested.

read point-by-point responses
  1. Referee: The central claim of asymptotic normality under the null (stated in the abstract and presumably derived in the theoretical section) is load-bearing for the entire calibration procedure, power analysis, and consistency results. The manuscript must explicitly verify that the semi-supervised construction preserves the conditions for the CLT (e.g., appropriate centering and variance terms) even though covariates break exchangeability; without this step-by-step check, the normality assertion cannot be confirmed and the claimed advantages over standard kernel tests do not follow.

    Authors: We agree that an explicit, step-by-step verification of the CLT conditions is required, especially since the unlabeled covariates break exchangeability of the labeled samples under the null. In the original derivation (Section 3), the test statistic is constructed by first estimating the conditional kernel mean embeddings from the abundant unlabeled covariates and then centering the labeled kernel terms accordingly; this yields a sum of conditionally mean-zero terms whose variance converges in probability to a positive constant. To address the referee's concern directly, we will add a new subsection that (i) states the null hypothesis in terms of the covariate-conditional distributions, (ii) verifies the centering removes the bias induced by the covariates, (iii) shows the variance estimator is consistent by law of large numbers on the unlabeled data, and (iv) confirms the Lindeberg condition holds for the triangular array of semi-supervised terms. These additions will make transparent that asymptotic normality is preserved and that the power gains relative to the standard kernel test follow from the reduced variance. We will also ensure the abstract and introduction reference this expanded derivation. revision: yes

Circularity Check

0 steps flagged

No significant circularity: asymptotic normality derived from standard CLT arguments on the semi-supervised statistic

full rationale

The paper proposes a new semi-supervised kernel two-sample statistic that incorporates unlabeled covariates while claiming to retain asymptotic normality under the null (despite broken exchangeability). This normality is established via direct analysis of the statistic's mean and variance under the null, not by redefining the target quantity in terms of itself or by fitting parameters on the same data and relabeling the fit as a prediction. No self-definitional loops, fitted-input predictions, or load-bearing self-citations appear in the derivation chain; the consistency and power results follow from the same limiting distribution without reducing to the input data by construction. The approach is therefore self-contained against external benchmarks such as standard kernel MMD tests.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard asymptotic theory for kernel tests and the assumption that covariates can be integrated without destroying normality under the null.

axioms (2)
  • domain assumption Kernel functions are positive definite and appropriate for two-sample testing
    Implied by the use of kernel methods in the proposal.
  • domain assumption The constructed test statistic has asymptotic normality under the null hypothesis
    This is the key property claimed for calibration and is central to the method.

pith-pipeline@v0.9.0 · 5422 in / 1293 out tokens · 42842 ms · 2026-05-09T17:07:05.898451+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

177 extracted references · 38 canonical work pages · 2 internal anchors

  1. [1]

    Semi-supervised inference: General theory and estimation of means

    Zhang, Anru and Brown, Lawrence D and Cai, T Tony , journal=. Semi-supervised inference: General theory and estimation of means. 2019 , publisher=

  2. [2]

    Journal of Machine Learning Research , volume=

    On the Optimality of Gaussian Kernel Based Nonparametric Tests against Smooth Alternatives , author=. Journal of Machine Learning Research , volume=

  3. [3]

    Statistical Papers , volume=

    Hellinger distances and -entropy in a one-parameter class of density functions , author=. Statistical Papers , volume=. 1989 , publisher=

  4. [4]

    Two-sample smooth tests for the equality of distributions , volume =

    Zhou, Wen-Xin and Zheng, Chao and Zhang, Zhen , year =. Two-sample smooth tests for the equality of distributions , volume =. Bernoulli , publisher =

  5. [5]

    Sutherland and Hsiao-Yu Tung and Heiko Strathmann and Soumyajit De and Aaditya Ramdas and Alex Smola and Arthur Gretton , booktitle=

    Danica J. Sutherland and Hsiao-Yu Tung and Heiko Strathmann and Soumyajit De and Aaditya Ramdas and Alex Smola and Arthur Gretton , booktitle=. Generative Models and Model Criticism via Optimized Maximum Mean Discrepancy

  6. [6]

    Computational-Statistical Trade-off in Kernel Two-Sample Testing with Random Fourier Features

    Choi, Ikjun and Kim, Ilmun , journal=. Computational-Statistical Trade-off in Kernel Two-Sample Testing with Random Fourier Features

  7. [7]

    B-test: A non-parametric, low variance kernel two-sample test

    Zaremba, Wojciech and Gretton, Arthur and Blaschko, Matthew , journal=. B-test: A non-parametric, low variance kernel two-sample test

  8. [8]

    Advances in Neural Information Processing Systems , volume=

    Optimal kernel choice for large-scale two-sample tests , author=. Advances in Neural Information Processing Systems , volume=

  9. [9]

    The Annals of Statistics , volume=

    The projected covariance measure for assumption-lean variable significance testing , author=. The Annals of Statistics , volume=. 2024 , publisher=

  10. [10]

    Statistic Surveys , volume=

    Methods for quantifying dataset similarity: a review, taxonomy and comparison , author=. Statistic Surveys , volume=. 2024 , publisher=

  11. [11]

    Advances in Neural Information Processing Systems , volume=

    An efficient doubly-robust test for the kernel treatment effect , author=. Advances in Neural Information Processing Systems , volume=

  12. [12]

    R \'e nyi divergence and Kullback-Leibler divergence

    Van Erven, Tim and Harremos, Peter , journal=. R \'e nyi divergence and Kullback-Leibler divergence. 2014 , publisher=

  13. [13]

    Bobkov, Sergey G and Chistyakov, GP and G. R. The Annals of Probability , volume=. 2019 , publisher=

  14. [14]

    2012 , publisher=

    Asymptotic methods in statistical decision theory , author=. 2012 , publisher=

  15. [15]

    IEEE Transactions on Information Theory , volume=

    Minimax rates of entropy estimation on large alphabets via best polynomial approximation , author=. IEEE Transactions on Information Theory , volume=. 2016 , publisher=

  16. [16]

    The Berry-Esseen bound for Student's statistic

    Bentkus, Vidmantas and G. The Berry-Esseen bound for Student's statistic. The Annals of Probability , volume=. 1996 , publisher=

  17. [17]

    On normal approximations to U-statistics

    Bentkus, Vidmantas and Jing, Bing-Yi and Zhou, Wang , journal=. On normal approximations to U-statistics. 2009 , publisher=

  18. [18]

    , author=

    Lower bounds on the smallest eigenvalue of a sample covariance matrix. , author=. Electronic Communications in Probability , volume=. 2014 , publisher=

  19. [19]

    p-Norm bounds on the expectation of the maximum of a possibly dependent sample

    Arnold, Barry C , journal=. p-Norm bounds on the expectation of the maximum of a possibly dependent sample. 1985 , publisher=

  20. [20]

    Biometrika , volume=

    High-dimensional semi-supervised learning: in search of optimal inference of the mean , author=. Biometrika , volume=. 2022 , publisher=

  21. [21]

    Information Theory: From Coding to Learning

    Polyanskiy, Yury and Wu, Yihong , year=. Information Theory: From Coding to Learning

  22. [22]

    , author=

    The correlation-assisted missing data estimator. , author=. Journal of Machine Learning Research , volume=

  23. [23]

    The Annals of Statistics , volume=

    A constrained risk inequality with applications to nonparametric functional estimation , author=. The Annals of Statistics , volume=. 1996 , publisher=

  24. [24]

    International Conference on Artificial Intelligence and Statistics , pages=

    A constrained risk inequality for general losses , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2021 , organization=

  25. [25]

    Double robust semi-supervised inference for the mean: selection bias under MAR labeling with decaying overlap

    Zhang, Yuqian and Chakrabortty, Abhishek and Bradic, Jelena , journal=. Double robust semi-supervised inference for the mean: selection bias under MAR labeling with decaying overlap. 2023 , publisher=

  26. [26]

    arXiv preprint arXiv:2306.00265 , year=

    Doubly Robust Self-Training , author=. arXiv preprint arXiv:2306.00265 , year=

  27. [27]

    Angelopoulos and Stephen Bates and Clara Fannjiang and Michael I

    Anastasios N. Angelopoulos and Stephen Bates and Clara Fannjiang and Michael I. Jordan and Tijana Zrnic , title =. Science , volume =

  28. [28]

    High dimensional M-estimation with missing outcomes: A semi-parametric framework

    Chakrabortty, Abhishek and Lu, Jiarui and Cai, T Tony and Li, Hongzhe , journal=. High dimensional M-estimation with missing outcomes: A semi-parametric framework

  29. [29]

    The Annals of Statistics , number =

    Efficient and adaptive linear regression in semi-supervised settings , author =. The Annals of Statistics , number =

  30. [30]

    Journal of the American Statistical Association, 1–24 (2024) https://doi.org/10.1080/01621459.2023.2300522

    Deng, Siyi and Ning, Yang and Zhao, Jiwei and Zhang, Heping , year =. Optimal and Safe Estimation for High-Dimensional Semi-Supervised Learning. doi:10.1080/01621459.2023.2277409 , journal =

  31. [31]

    Journal of the American Statistical Association , volume=

    Semi-supervised linear regression , author=. Journal of the American Statistical Association , volume=. 2022 , publisher=

  32. [32]

    Chakrabortty, Abhishek and Dai, Guorong and Tchetgen, Eric Tchetgen , journal=

  33. [33]

    A General M-estimation Theory in Semi-Supervised Framework

    Song, Shanshan and Lin, Yuanyuan and Zhou, Yong , journal=. A General M-estimation Theory in Semi-Supervised Framework. 2023 , publisher=

  34. [34]

    2018 , institution=

    A few notes on contiguity, asymptotics, and local asymptotic normality , author=. 2018 , institution=

  35. [35]

    Asymptotic Statistics

    van der Vaart, Aad W , volume=. Asymptotic Statistics. 2000 , publisher=

  36. [36]

    Semiparametric statistics

    Van der Vaart, Aad W , pages=. Semiparametric statistics. 2002 , journal=

  37. [37]

    Semiparametric Doubly Robust Targeted Double Machine Learning: A Review

    Semiparametric doubly robust targeted double machine learning: a review , author=. arXiv preprint arXiv:2203.06469 , year=

  38. [38]

    Lee, A. J. , year=. U-statistics: Theory and Practice

  39. [39]

    Wassily Hoeffding , journal =

  40. [40]

    CLT for U-statistics with growing dimension

    DiCiccio, Cyrus and Romano, Joseph , journal=. CLT for U-statistics with growing dimension

  41. [41]

    The Econometrics Journal , volume=

    Double/debiased machine learning foratment and structural parameters , author=. The Econometrics Journal , volume=. 2018 , page=

  42. [42]

    Proceedings of the National Academy of Sciences , volume=

    Universal inference , author=. Proceedings of the National Academy of Sciences , volume=. 2020 , publisher=

  43. [43]

    , year =

    Kennedy, Edward H. , year =. Electronic Journal of Statistics , publisher =

  44. [44]

    V. S. Koroljuk and Yu. V. Borovskich , title = ". 1994 , publisher =. doi:10.1007/978-94-017-3515-5 , url =

  45. [45]

    2004 , publisher =

    Larry Wasserman , title =. 2004 , publisher =. doi:10.1007/978-0-387-21736-9 , url =

  46. [46]

    Applications of the van Trees inequality: a Bayesian Cram \'e r-Rao bound

    Gill, Richard D and Levit, Boris Y , journal=. Applications of the van Trees inequality: a Bayesian Cram \'e r-Rao bound. 1995 , publisher=

  47. [47]

    2010 , publisher=

    Asymptotic theory for cross-validated targeted maximum likelihood estimation , author=. 2010 , publisher=

  48. [48]

    On a family of distributions attaining the Bhattacharyya bound

    Tanaka, Hidekazu and Akahira, Masafumi , journal=. On a family of distributions attaining the Bhattacharyya bound. 2003 , publisher=

  49. [49]

    On some analogues of the amount of information and their use in statistical estimation , author=. Sankhy. 1946 , publisher=

  50. [50]

    2006 , publisher=

    Theory of point estimation , author=. 2006 , publisher=

  51. [51]

    The Journal of Machine Learning Research , volume=

    Quantifying uncertainty in random forests via confidence intervals and hypothesis tests , author=. The Journal of Machine Learning Research , volume=. 2016 , publisher=

  52. [52]

    V-statistics and variance estimation

    Zhou, Zhengze and Mentch, Lucas and Hooker, Giles , journal=. V-statistics and variance estimation. 2021 , publisher=

  53. [53]

    Jackknifing U-statistics

    Arvesen, James N , journal=. Jackknifing U-statistics. 1969 , publisher=

  54. [54]

    A Distribution-Free Theory of Nonparametric Regression

    L. A Distribution-Free Theory of Nonparametric Regression. 2002 , publisher =. doi:10.1007/b97848 , url =

  55. [55]

    Nonparametric regression using deep neural networks with ReLU activation function , volume=

    Johannes Schmidt-Hieber , title = ". 2020 , month = aug, publisher =. doi:10.1214/19-aos1875 , url =

  56. [56]

    arXiv preprint arXiv:2102.12034 (accepted to Biometrika) , year=

    Semiparametric counterfactual density estimation , author=. arXiv preprint arXiv:2102.12034 (accepted to Biometrika) , year=

  57. [57]

    The Journal of Machine Learning Research , volume=

    Analysis of a random forests model , author=. The Journal of Machine Learning Research , volume=. 2012 , publisher=

  58. [58]

    Semi-supervised regression: A recent review

    Kostopoulos, Georgios and Karlos, Stamatis and Kotsiantis, Sotiris and Ragos, Omiros , journal=. Semi-supervised regression: A recent review. 2018 , publisher=

  59. [59]

    Downey , title =

    Peter J. Downey , title =. 1990 , month = may, publisher =. doi:10.1016/0167-6377(90)90018-z , url =

  60. [60]

    Rigollet and A

    Ph. Rigollet and A. B. Tsybakov , title =. 2007 , month = sep, publisher =. doi:10.3103/s1066530707030052 , url =

  61. [61]

    Learning Theory and Kernel Machines: 16th Annual Conference on Learning Theory and 7th Kernel Workshop, COLT/Kernel 2003, Washington, DC, USA, August 24-27, 2003

    Optimal rates of aggregation , author=. Learning Theory and Kernel Machines: 16th Annual Conference on Learning Theory and 7th Kernel Workshop, COLT/Kernel 2003, Washington, DC, USA, August 24-27, 2003. Proceedings , pages=. 2003 , organization=

  62. [62]

    van der Laan and Eric C Polley and Alan E

    Mark J. van der Laan and Eric C Polley and Alan E. Hubbard , title = ". 2007 , month = jan, publisher =. doi:10.2202/1544-6115.1309 , url =

  63. [63]

    Estimating the D imension of a M odel

    Herman Callaert and Paul Janssen , title = ". 1978 , month = mar, publisher =. doi:10.1214/aos/1176344132 , url =

  64. [64]

    Berry , title = "

    Andrew C. Berry , title = ". 1941 , publisher =

  65. [65]

    , series=

    Esseen, C.G. , series=. On the Liapounoff Limit of Error in the Theory of Probability. 1942 , publisher=

  66. [66]

    , year =

    Zrnic, Tijana and Candès, Emmanuel J. , year =. Cross-prediction-powered inference , volume =. Proceedings of the National Academy of Sciences , publisher =

  67. [67]

    Normal approximation by Stein's method

    Chen, Louis HY and Goldstein, Larry and Shao, Qi-Man , volume=. Normal approximation by Stein's method. 2011 , publisher=

  68. [68]

    arXiv preprint arXiv:2111.15546 , year=

    Black box tests for algorithmic stability , author=. arXiv preprint arXiv:2111.15546 , year=

  69. [69]

    Bagging Provides Assumption-free Stability

    Soloff, Jake A and Barber, Rina Foygel and Willett, Rebecca , journal=. Bagging Provides Assumption-free Stability

  70. [70]

    Advances in Neural Information Processing Systems , volume=

    Debiased machine learning without sample-splitting for stable estimators , author=. Advances in Neural Information Processing Systems , volume=

  71. [71]

    Double/de-biased machine learning of global and local parameters using regularized Riesz representers

    Chernozhukov, Victor and Newey, W and Robins, James and Singh, Rahul , journal=. Double/de-biased machine learning of global and local parameters using regularized Riesz representers

  72. [72]

    2021 , publisher=

    Hirshberg, David A and Wager, Stefan , journal=. 2021 , publisher=

  73. [73]

    Journal of the American Statistical Association , volume=

    A general framework for inference on algorithm-agnostic variable importance , author=. Journal of the American Statistical Association , volume=. 2023 , publisher=

  74. [74]

    van der Laan and Daniel Rubin

    Mark J. van der Laan and Daniel Rubin , title =". 2006 , month = jan, publisher =. doi:10.2202/1557-4679.1043 , url =

  75. [75]

    Luedtke and Mark J

    Alexander R. Luedtke and Mark J. van der Laan , title = ". 2016 , month = apr, publisher =. doi:10.1214/15-aos1384 , url =

  76. [76]

    Adversarial Estimation of Riesz Representers

    Chernozhukov, Victor and Newey, Whitney and Singh, Rahul and Syrgkanis, Vasilis , journal=. Adversarial Estimation of Riesz Representers

  77. [77]

    NATO science series sub series iii computer and systems sciences , volume=

    Leave-one-out error and stability of learning algorithms with applications , author=. NATO science series sub series iii computer and systems sciences , volume=. 2003 , publisher=

  78. [78]

    Cross-Validation and Mean-Square Stability

    Kale, Satyen and Kumar, Ravi and Vassilvitskii, Sergei , booktitle=. Cross-Validation and Mean-Square Stability

  79. [79]

    Stability and Generalization

    Bousquet, Olivier and Elisseeff, Andr. Stability and Generalization. The Journal of Machine Learning Research , volume=. 2002 , publisher=

  80. [80]

    Train faster, generalize better: Stability of stochastic gradient descent

    Hardt, Moritz and Recht, Ben and Singer, Yoram , booktitle=. Train faster, generalize better: Stability of stochastic gradient descent. 2016 , organization=

Showing first 80 references.