General Frameworks for Conditional Two-Sample Testing
Pith reviewed 2026-05-23 19:18 UTC · model grok-4.3
The pith
Two frameworks convert conditional independence tests into conditional two-sample tests or reduce the problem to marginal testing via density ratios under targeted distribution classes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Conditional two-sample testing is hard in general: without assumptions on the distributions, no valid test can have significant power against any single alternative. Under targeted classes of distributions, two frameworks succeed. The first converts any conditional independence test into a conditional two-sample test in a black-box manner while preserving its asymptotic properties. The second transforms the task into a comparison of marginal distributions using estimated density ratios, and the paper demonstrates it with classification and kernel methods. Simulation studies illustrate the frameworks' behavior in finite samples.
What carries the argument
Black-box conversion from any conditional independence test to a conditional two-sample test that preserves asymptotics, together with density ratio estimation that reduces the conditional problem to marginal two-sample testing.
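To make the black-box idea concrete, here is a minimal sketch that pools the two samples, attaches a binary group label A, and passes (X, A, Z) to a conditional independence test. Everything named here is an assumption for illustration: `ci_test_pvalue` is a toy residual-covariance test with linear regressions and scalar Z, standing in for whichever base test is plugged in, and the naive pooling step may differ from the paper's construction (for example in how the two sample sizes are handled).

```python
import math
import random

def residuals(y, z):
    """Residuals from a least-squares fit of y on scalar z with an intercept."""
    n = len(y)
    my, mz = sum(y) / n, sum(z) / n
    szz = sum((zi - mz) ** 2 for zi in z)
    slope = sum((zi - mz) * (yi - my) for zi, yi in zip(z, y)) / szz
    return [yi - my - slope * (zi - mz) for yi, zi in zip(y, z)]

def ci_test_pvalue(x, a, z):
    """Toy stand-in CI test of X independent of A given Z (hypothetical, linear fits only)."""
    rx, ra = residuals(x, z), residuals(a, z)
    prod = [u * v for u, v in zip(rx, ra)]
    n = len(prod)
    mean = sum(prod) / n
    var = sum((p - mean) ** 2 for p in prod) / n
    stat = math.sqrt(n) * mean / math.sqrt(var)
    return math.erfc(abs(stat) / math.sqrt(2))  # two-sided standard normal p-value

def conditional_two_sample_pvalue(sample1, sample2, ci_test=ci_test_pvalue):
    """Pool two samples of (x, z) pairs, label them by group, and call the CI test."""
    pooled = sample1 + sample2
    x = [p[0] for p in pooled]
    z = [p[1] for p in pooled]
    a = [0.0] * len(sample1) + [1.0] * len(sample2)
    return ci_test(x, a, z)

def draw(n, z_mean, x_shift, rng):
    """Toy data: Z ~ N(z_mean, 1) and X = Z + x_shift + N(0, 1) noise."""
    out = []
    for _ in range(n):
        z = rng.gauss(z_mean, 1.0)
        out.append((z + x_shift + rng.gauss(0.0, 1.0), z))
    return out

rng = random.Random(0)
# Null: the groups differ only in the law of Z, so X | Z is the same in both.
p_null = conditional_two_sample_pvalue(draw(500, 0.0, 0.0, rng), draw(500, 0.5, 0.0, rng))
# Alternative: X | Z is shifted by 1 in the second group.
p_alt = conditional_two_sample_pvalue(draw(500, 0.0, 0.0, rng), draw(500, 0.5, 1.0, rng))
print(f"null p-value: {p_null:.3f}, alternative p-value: {p_alt:.2e}")
```

Because the base test is passed in as `ci_test`, swapping in a different conditional independence test requires no change to the pooling step, which is the sense in which the conversion is black-box.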
If this is right
- Any existing conditional independence test can be repurposed directly for conditional two-sample testing while retaining its theoretical guarantees.
- Standard marginal two-sample testing procedures become applicable once density ratios are estimated from data.
- The methods inherit the validity and power properties of the base tests within the targeted distribution classes.
- Simulation studies can be used to check finite-sample performance of the converted or reduced tests.
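The density-ratio bullet above can be illustrated with a small sketch. It assumes one natural reading of the reduction: weight the first sample by an estimate of r(z) = p2(z) / p1(z), the ratio of the marginal densities of Z, so that under the null the weighted first sample and the second sample share the same joint law of (X, Z). The ratio is estimated with a hand-rolled logistic classifier on scalar Z; the function names and the toy data are hypothetical and are not taken from the paper.

```python
import math
import random

def fit_logistic(z, labels, steps=300, lr=1.0):
    """Fit P(label = 1 | z) = sigmoid(b0 + b1 * z) by gradient ascent on the mean log-likelihood."""
    b0 = b1 = 0.0
    n = len(z)
    for _ in range(steps):
        g0 = g1 = 0.0
        for zi, yi in zip(z, labels):
            err = yi - 1.0 / (1.0 + math.exp(-(b0 + b1 * zi)))
            g0 += err
            g1 += err * zi
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

def density_ratio_weights(z1, z2):
    """Estimate r(z) = p2(z) / p1(z) at the points of z1 with a classifier (z2 is label 1)."""
    b0, b1 = fit_logistic(z1 + z2, [0.0] * len(z1) + [1.0] * len(z2))
    prior_odds = len(z2) / len(z1)
    # Classifier odds equal prior odds times the density ratio, so divide the prior odds out.
    return [math.exp(b0 + b1 * zi) / prior_odds for zi in z1]

rng = random.Random(1)
z1 = [rng.gauss(0.0, 1.0) for _ in range(1000)]
z2 = [rng.gauss(0.5, 1.0) for _ in range(1000)]
# Null case: X | Z is identical in both groups; only the law of Z differs.
x1 = [z + rng.gauss(0.0, 1.0) for z in z1]
x2 = [z + rng.gauss(0.0, 1.0) for z in z2]

w = density_ratio_weights(z1, z2)
mean1 = sum(x1) / len(x1)
mean2 = sum(x2) / len(x2)
weighted_mean1 = sum(wi * xi for wi, xi in zip(w, x1)) / sum(w)
print(f"raw mean 1: {mean1:.3f}, weighted mean 1: {weighted_mean1:.3f}, mean 2: {mean2:.3f}")
```

Here the comparison of means is only a sanity check that reweighting removes the difference caused by Z. A full test would feed the weights into a marginal two-sample statistic, such as a classifier-based or kernel-based one.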
Where Pith is reading between the lines
- The frameworks could simplify testing for group differences in fairness settings by allowing reuse of conditional independence tools to control for confounders.
- If density ratio estimation error grows with dimension, the second framework's power may degrade faster than the first in high-dimensional data.
- The same reduction ideas might apply to other conditional testing problems by swapping the base test or the marginal comparator.
Load-bearing premise
The frameworks must target specific classes of distributions to achieve both validity and power.
What would settle it
A concrete counterexample in which the black-box conversion fails to preserve the type I error control or asymptotic power of the original conditional independence test would disprove the first framework.
Original abstract
We study the problem of conditional two-sample testing, which aims to determine whether two populations have the same distribution after accounting for confounding factors. This problem commonly arises in various applications, such as domain adaptation and algorithmic fairness, where comparing two groups is essential while controlling for confounding variables. We begin by establishing a hardness result for conditional two-sample testing, demonstrating that no valid test can have significant power against any single alternative without proper assumptions. We then introduce two general frameworks that implicitly or explicitly target specific classes of distributions for their validity and power. Our first framework allows us to convert any conditional independence test into a conditional two-sample test in a black-box manner, while preserving the asymptotic properties of the original conditional independence test. The second framework transforms the problem into comparing marginal distributions with estimated density ratios, which allows us to leverage existing methods for marginal two-sample testing. We demonstrate this idea in a concrete manner with classification and kernel-based methods. Finally, simulation studies are conducted to illustrate the proposed frameworks in finite-sample scenarios.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript establishes a hardness result showing that nonparametric conditional two-sample testing admits no valid test with nontrivial power against a fixed alternative without distributional assumptions. It then presents two frameworks that target specific distribution classes: the first reduces the conditional two-sample problem (X ⊥ A | Z with A binary) to conditional independence testing via a direct equivalence, permitting any valid CI test to be applied in black-box fashion while inheriting its asymptotics; the second re-expresses the problem as a marginal two-sample test after estimating density ratios. The frameworks are instantiated with classification and kernel methods and assessed via finite-sample simulations.
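The two reductions named in the summary can be written out in a few lines. The notation below is assumed for illustration and is not quoted from the manuscript: A ∈ {1, 2} is the group label of a pooled draw (X, A, Z), p^{(a)} denotes the joint density of (X, Z) in group a, and the statements are meant to hold for almost every z in a region where both groups have positive density.

```latex
% First framework: the conditional two-sample null as a conditional independence null.
H_0:\; P^{(1)}_{X \mid Z} = P^{(2)}_{X \mid Z}
\quad\Longleftrightarrow\quad
P(X \in B \mid A = 1, Z) = P(X \in B \mid A = 2, Z) \ \text{for all measurable } B
\quad\Longleftrightarrow\quad
X \perp\!\!\!\perp A \mid Z .

% Second framework: with r(z) = p^{(2)}_Z(z) / p^{(1)}_Z(z), the null ties the two joint densities together.
p^{(2)}(x, z) = p^{(2)}_Z(z)\, p^{(2)}(x \mid z)
\overset{H_0}{=} r(z)\, p^{(1)}_Z(z)\, p^{(1)}(x \mid z)
= r(z)\, p^{(1)}(x, z) .
```

The second identity is why an estimate of r turns the conditional problem into a marginal comparison: the first sample, reweighted by r(Z), can be compared with the second sample on the joint space of (X, Z).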
Significance. If the claimed equivalence and asymptotic preservation hold, the work supplies a general, modular route to conditional two-sample testing that reuses existing CI and two-sample procedures. This is relevant to domain adaptation and fairness applications. The explicit alignment with the hardness result (targeting restricted distribution classes) and the black-box character of the first framework are constructive features.
minor comments (4)
- [§2] §2 (hardness result): the class of alternatives against which power is precluded should be stated precisely in a formal theorem rather than described at a high level, to make the necessity of the subsequent assumptions fully transparent.
- [§4.1] §4.1 (first framework): the reduction step that maps the conditional two-sample null to a CI null should include an explicit statement of the measure-theoretic conditions under which the equivalence is measure-preserving, even if standard.
- [§5] §5 (density-ratio framework): the error propagation from density-ratio estimation into the marginal test statistic is only sketched; a short lemma bounding the additional bias term would strengthen the asymptotic claim.
- [Simulations] Simulation section: the reported power curves would benefit from an additional panel showing type-I error under the null for each method, to confirm that the black-box conversion does not inflate size in finite samples.
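A size check of the kind requested in the last comment could be organized as below. This is a sketch under stated assumptions rather than the authors' simulation design: the test being checked is a stand-in (a permutation test that shuffles group labels within strata of a discrete Z), and the data-generating process, sample sizes, and function names are all hypothetical.

```python
import random

def stratified_perm_pvalue(x, a, z, n_perm=49, rng=random):
    """Permutation p-value for X independent of A given discrete Z, shuffling labels within Z strata."""
    strata = {}
    for i, zi in enumerate(z):
        strata.setdefault(zi, []).append(i)

    def stat(labels):
        # Sum over strata of the absolute difference in group means of X.
        total = 0.0
        for idx in strata.values():
            g1 = [x[i] for i in idx if labels[i] == 1]
            g0 = [x[i] for i in idx if labels[i] == 0]
            if g1 and g0:
                total += abs(sum(g1) / len(g1) - sum(g0) / len(g0))
        return total

    observed = stat(a)
    labels = list(a)
    exceed = 0
    for _ in range(n_perm):
        for idx in strata.values():
            vals = [labels[i] for i in idx]
            rng.shuffle(vals)
            for i, v in zip(idx, vals):
                labels[i] = v
        if stat(labels) >= observed:
            exceed += 1
    return (1 + exceed) / (1 + n_perm)

def simulate_null(n, rng):
    """Null data: group membership depends on Z, but X depends on Z only."""
    z = [rng.randrange(3) for _ in range(n)]
    a = [1 if rng.random() < 0.2 + 0.3 * zi else 0 for zi in z]
    x = [zi + rng.gauss(0.0, 1.0) for zi in z]
    return x, a, z

def empirical_size(n_reps=200, n=60, alpha=0.1, seed=0):
    """Fraction of null replications in which the test rejects at level alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_reps):
        x, a, z = simulate_null(n, rng)
        if stratified_perm_pvalue(x, a, z, rng=rng) <= alpha:
            rejections += 1
    return rejections / n_reps

size = empirical_size()
print(f"empirical size at nominal level 0.1: {size:.3f}")
```

Plotting `empirical_size` across sample sizes for each method, next to the power curves, would give the additional panel the comment asks for.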
Simulated Author's Rebuttal
We thank the referee for the positive summary of our work, the recognition of its relevance to domain adaptation and fairness, and the recommendation of minor revision. The report correctly captures the hardness result, the two frameworks, and their modular character. No specific major comments were raised in the report.
Circularity Check
No significant circularity in derivation chain
The paper establishes a hardness result showing that nonparametric conditional two-sample testing requires assumptions for nontrivial power, then presents two frameworks: one that converts any conditional independence test to a conditional two-sample test via black-box equivalence while preserving asymptotics, and a second that re-expresses the problem using density-ratio reweighting to reduce to marginal two-sample testing. These are explicit methodological reductions based on distributional equivalences (X ⊥ A | Z with A binary), not self-definitions, fitted parameters renamed as predictions, or self-citation chains. No load-bearing ansatz, uniqueness theorem from the same authors, or renaming of known results is present; the central claims remain independent of the paper's own fitted quantities or prior self-references.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: regularity conditions sufficient for the asymptotic properties of conditional independence tests and marginal two-sample tests to carry over