
arXiv: 2410.16636 · v2 · submitted 2024-10-22 · stat.ML · cs.LG · math.ST · stat.TH

General Frameworks for Conditional Two-Sample Testing

Pith reviewed 2026-05-23 19:18 UTC · model grok-4.3

classification: stat.ML · cs.LG · math.ST · stat.TH
keywords: conditional two-sample testing · conditional independence testing · density ratio estimation · hardness result · black-box conversion · domain adaptation · algorithmic fairness

The pith

Two frameworks convert conditional independence tests into conditional two-sample tests or reduce the problem to marginal testing via density ratios under targeted distribution classes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper first proves a hardness result: without assumptions on the distributions, no valid test can achieve significant power against any single alternative in conditional two-sample testing. It then presents two frameworks that achieve validity and power by targeting specific distribution classes. The first converts any conditional independence test into a conditional two-sample test through a black-box procedure that preserves the original test's asymptotic properties. The second reduces the conditional problem to comparing marginal distributions after estimating density ratios, which permits direct use of existing marginal two-sample methods such as classification-based or kernel-based approaches. These constructions matter for applications like domain adaptation and algorithmic fairness, where one must compare groups while controlling for confounders, and the paper illustrates their finite-sample behavior through simulations.

Core claim

Conditional two-sample testing is hard in general because no valid test can have significant power against any single alternative without assumptions. Under targeted classes of distributions, two frameworks succeed: the first converts any conditional independence test into a conditional two-sample test in a black-box manner while preserving asymptotic properties; the second transforms the task into marginal distribution comparison using estimated density ratios and demonstrates this with classification and kernel methods. Simulation studies confirm the frameworks' behavior in finite samples.
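
The two reductions can be written out as a short worked derivation. This is a sketch under stated assumptions: the superscripts (1) and (2) for the two populations, the label A ∈ {1, 2} with P(A = 1) strictly between 0 and 1, and the symbol r for the marginal density ratio are notation introduced here for illustration, and densities are assumed to exist with f^{(2)}_Z(z) > 0 wherever f^{(1)}_Z(z) > 0.

```latex
\begin{align*}
% Framework 1: with (X, Z) \mid A = a \sim P^{(a)}_{XZ}, the null is a CI statement.
H_0:\; P^{(1)}_{X \mid Z} = P^{(2)}_{X \mid Z} \ \text{(almost surely in } Z)
  \quad &\Longleftrightarrow \quad X \perp\!\!\!\perp A \mid Z, \\[4pt]
% Framework 2: under H_0 the joint densities differ only through the marginal of Z.
f^{(1)}_{XZ}(x, z)
  &= f^{(1)}_{X \mid Z}(x \mid z)\, f^{(1)}_{Z}(z) \\
  &= f^{(2)}_{X \mid Z}(x \mid z)\, f^{(1)}_{Z}(z) \qquad \text{(under } H_0) \\
  &= r(z)\, f^{(2)}_{XZ}(x, z),
  \qquad r(z) = \frac{f^{(1)}_{Z}(z)}{f^{(2)}_{Z}(z)} .
\end{align*}
```

The last line is what allows a marginal two-sample statistic to be used: after reweighting the second sample by r(Z), its joint distribution matches that of the first sample exactly when the null holds.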

What carries the argument

Black-box conversion from any conditional independence test to a conditional two-sample test that preserves asymptotics, together with density ratio estimation that reduces the conditional problem to marginal two-sample testing.
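
As a rough illustration of the black-box conversion, the sketch below pools the two samples, attaches a binary group label A, and hands the pooled data to a conditional independence test of X ⊥ A | Z. The function names (`gcm_pvalue`, `conditional_two_sample_pvalue`), the linear-regression residual test standing in for the CI test (in the spirit of the generalised covariance measure), and the simulated data are all assumptions made for illustration. The paper's own conversion procedure may differ, for example in how the pooled sample is formed.

```python
import math
import numpy as np


def gcm_pvalue(x, a, z):
    """Two-sided p-value for X independent of A given Z (residual-product test).

    Assumption: conditional means are modelled by least-squares linear
    regression on Z; any other regression method could be substituted.
    """
    design = np.column_stack([np.ones(len(z)), z])
    # Residuals of X and of A after regressing each on Z.
    res_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    res_a = a - design @ np.linalg.lstsq(design, a, rcond=None)[0]
    prod = res_x * res_a
    # Normalised mean of residual products, compared with N(0, 1).
    stat = math.sqrt(len(prod)) * prod.mean() / prod.std()
    return math.erfc(abs(stat) / math.sqrt(2.0))


def conditional_two_sample_pvalue(x1, z1, x2, z2, ci_pvalue=gcm_pvalue):
    """Pool both samples, label them with A, and call a CI test as a black box."""
    x = np.concatenate([x1, x2])
    z = np.concatenate([z1, z2])
    a = np.concatenate([np.ones(len(x1)), np.zeros(len(x2))])
    return ci_pvalue(x, a, z)


# Hypothetical data: the confounder Z is shifted between groups in both cases.
rng = np.random.default_rng(0)
n = 500
z1, z2 = rng.normal(0.5, 1.0, n), rng.normal(0.0, 1.0, n)
x2 = z2 + rng.normal(size=n)
x1_null = z1 + rng.normal(size=n)        # same conditional law of X given Z
x1_alt = z1 + 1.0 + rng.normal(size=n)   # conditional law shifted by 1

p_null = conditional_two_sample_pvalue(x1_null, z1, x2, z2)
p_alt = conditional_two_sample_pvalue(x1_alt, z1, x2, z2)
print(f"null p-value: {p_null:.3f}, alternative p-value: {p_alt:.2e}")
```

Any other CI test that returns a p-value from `(x, a, z)` can be passed through the `ci_pvalue` argument, which is the sense in which the conversion is black-box.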

If this is right

  • Any existing conditional independence test can be repurposed directly for conditional two-sample testing while retaining its theoretical guarantees.
  • Standard marginal two-sample testing procedures become applicable once density ratios are estimated from data.
  • The methods inherit the validity and power properties of the base tests within the targeted distribution classes.
  • Simulation studies can be used to check finite-sample performance of the converted or reduced tests.
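
The density-ratio route in the second bullet can be sketched as follows. The example estimates the marginal density ratio of the confounder Z by probabilistic classification (a hand-written logistic regression) and uses it to reweight the second sample. It is a minimal sketch, not the paper's estimator or test: the function names, the logistic model, and the simulated data are assumptions, and only a single reweighted moment is compared rather than a full classifier-based or kernel-based statistic.

```python
import numpy as np


def fit_logistic(features, labels, n_iter=25):
    """Logistic regression with intercept, fitted by Newton's method."""
    design = np.column_stack([np.ones(len(features)), features])
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-design @ beta))
        grad = design.T @ (labels - prob)
        hess = (design * (prob * (1.0 - prob))[:, None]).T @ design
        beta = beta + np.linalg.solve(hess, grad)
    return beta


def density_ratio(z_eval, z1, z2):
    """Estimate r(z) = f1_Z(z) / f2_Z(z) at the points z_eval."""
    z = np.concatenate([z1, z2])
    a = np.concatenate([np.ones(len(z1)), np.zeros(len(z2))])
    beta = fit_logistic(z, a)
    logit = np.column_stack([np.ones(len(z_eval)), z_eval]) @ beta
    # Bayes' rule: f1 / f2 = [P(A=1 | z) / P(A=0 | z)] * (n2 / n1).
    return np.exp(logit) * len(z2) / len(z1)


# Hypothetical null scenario: same law of X given Z, different marginals of Z.
rng = np.random.default_rng(1)
n = 4000
z1, z2 = rng.normal(0.5, 1.0, n), rng.normal(0.0, 1.0, n)
x1 = z1 + rng.normal(size=n)
x2 = z2 + rng.normal(size=n)

weights = density_ratio(z2, z1, z2)
raw_gap = abs(x2.mean() - x1.mean())
# Self-normalised importance-weighted mean of X in sample 2.
weighted_gap = abs(np.average(x2, weights=weights) - x1.mean())
print(f"unweighted gap: {raw_gap:.3f}, reweighted gap: {weighted_gap:.3f}")
```

Under the null, reweighting should remove most of the gap that the covariate shift alone creates. A full test would feed the same weights into a marginal two-sample statistic and calibrate it accordingly.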

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The frameworks could simplify testing for group differences in fairness settings by allowing reuse of conditional independence tools to control for confounders.
  • If density ratio estimation error grows with dimension, the second framework's power may degrade faster than the first in high-dimensional data.
  • The same reduction ideas might apply to other conditional testing problems by swapping the base test or the marginal comparator.

Load-bearing premise

The frameworks must target specific classes of distributions to achieve both validity and power.

What would settle it

A concrete counterexample in which the black-box conversion fails to preserve the type I error control or asymptotic power of the original conditional independence test would disprove the first framework.

Figures

Figures reproduced from arXiv: 2410.16636 by Ilmun Kim, Seongchan Lee, Suman Cha.

Figure 1: Rejection rates for Scenario 1 under null and alternative hypotheses, shown for both unbounded …
Figure 2: Rejection rates for Scenario 2 under null and alternative hypotheses, shown for both unbounded …
Figure 3: Rejection rates for Scenario 3 under null and alternative hypotheses, shown for both unbounded …
Figure 4: Performance comparison of DRT methods on diamonds and superconductivity datasets using LL …
Figure 5: Log-scaled mean squared errors of marginal density ratio …
Figure 6: Rejection rates of CIT methods on the diamonds and superconductivity datasets under null and …

Original abstract

We study the problem of conditional two-sample testing, which aims to determine whether two populations have the same distribution after accounting for confounding factors. This problem commonly arises in various applications, such as domain adaptation and algorithmic fairness, where comparing two groups is essential while controlling for confounding variables. We begin by establishing a hardness result for conditional two-sample testing, demonstrating that no valid test can have significant power against any single alternative without proper assumptions. We then introduce two general frameworks that implicitly or explicitly target specific classes of distributions for their validity and power. Our first framework allows us to convert any conditional independence test into a conditional two-sample test in a black-box manner, while preserving the asymptotic properties of the original conditional independence test. The second framework transforms the problem into comparing marginal distributions with estimated density ratios, which allows us to leverage existing methods for marginal two-sample testing. We demonstrate this idea in a concrete manner with classification and kernel-based methods. Finally, simulation studies are conducted to illustrate the proposed frameworks in finite-sample scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 4 minor

Summary. The manuscript establishes a hardness result showing that nonparametric conditional two-sample testing admits no valid test with nontrivial power against a fixed alternative without distributional assumptions. It then presents two frameworks that target specific distribution classes: the first reduces the conditional two-sample problem (X ⊥ A | Z with A binary) to conditional independence testing via a direct equivalence, permitting any valid CI test to be applied in black-box fashion while inheriting its asymptotics; the second re-expresses the problem as a marginal two-sample test after estimating density ratios. The frameworks are instantiated with classification and kernel methods and assessed via finite-sample simulations.

Significance. If the claimed equivalence and asymptotic preservation hold, the work supplies a general, modular route to conditional two-sample testing that reuses existing CI and two-sample procedures. This is relevant to domain adaptation and fairness applications. The explicit alignment with the hardness result (targeting restricted distribution classes) and the black-box character of the first framework are constructive features.

minor comments (4)
  1. §2 (hardness result): the class of alternatives against which power is precluded should be given as a formal theorem rather than described at a high level, to make the necessity of the subsequent assumptions fully transparent.
  2. §4.1 (first framework): the reduction step that maps the conditional two-sample null to a CI null should include an explicit statement of the measure-theoretic conditions under which the equivalence is measure-preserving, even if standard.
  3. §5 (density-ratio framework): the error propagation from density-ratio estimation into the marginal test statistic is only sketched; a short lemma bounding the additional bias term would strengthen the asymptotic claim.
  4. Simulation section: the reported power curves would benefit from an additional panel showing type-I error under the null for each method, to confirm that the black-box conversion does not inflate size in finite samples.
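
The size check suggested in comment 4 could be prototyped with a small Monte Carlo loop such as the one below. This is an illustrative sketch only: the data-generating process, the residual-product CI test used as the base test, the sample sizes, and the function names are assumptions and do not reproduce the paper's simulation scenarios.

```python
import math
import numpy as np


def gcm_pvalue(x, a, z):
    """Residual-product test of X independent of A given Z (linear regressions)."""
    design = np.column_stack([np.ones(len(z)), z])
    res_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    res_a = a - design @ np.linalg.lstsq(design, a, rcond=None)[0]
    prod = res_x * res_a
    stat = math.sqrt(len(prod)) * prod.mean() / prod.std()
    return math.erfc(abs(stat) / math.sqrt(2.0))


def rejection_rate(shift, n=200, reps=300, level=0.05, seed=0):
    """Fraction of simulated datasets on which the pooled CI test rejects."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(reps):
        # Z differs between groups; X given Z differs only when shift != 0.
        z1, z2 = rng.normal(0.5, 1.0, n), rng.normal(0.0, 1.0, n)
        x1 = z1 + shift + rng.normal(size=n)
        x2 = z2 + rng.normal(size=n)
        x, z = np.concatenate([x1, x2]), np.concatenate([z1, z2])
        a = np.concatenate([np.ones(n), np.zeros(n)])
        rejections += gcm_pvalue(x, a, z) < level
    return rejections / reps


size = rejection_rate(shift=0.0)   # empirical type-I error under the null
power = rejection_rate(shift=0.5)  # empirical power under a shifted alternative
print(f"empirical size: {size:.3f}, empirical power: {power:.3f}")
```

An empirical size close to the nominal level, reported next to the power, is the kind of evidence the extra panel would supply.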

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary of our work, the recognition of its relevance to domain adaptation and fairness, and the recommendation of minor revision. The report correctly captures the hardness result, the two frameworks, and their modular character. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper establishes a hardness result showing that nonparametric conditional two-sample testing requires assumptions for nontrivial power, then presents two frameworks: one that converts any conditional independence test to a conditional two-sample test via black-box equivalence while preserving asymptotics, and a second that re-expresses the problem using density-ratio reweighting to reduce to marginal two-sample testing. These are explicit methodological reductions based on distributional equivalences (X ⊥ A | Z with A binary), not self-definitions, fitted parameters renamed as predictions, or self-citation chains. No load-bearing ansatz, uniqueness theorem from the same authors, or renaming of known results is present; the central claims remain independent of the paper's own fitted quantities or prior self-references.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard regularity conditions for asymptotic validity of the converted tests and on the feasibility of density ratio estimation; no new free parameters or invented entities are mentioned.

axioms (1)
  • Domain assumption: regularity conditions sufficient for asymptotic properties of conditional independence and marginal two-sample tests to carry over
    Invoked when the paper states that asymptotic properties are preserved.


