Deep-testing: the case of dependence detection
Pith reviewed 2026-05-07 12:40 UTC · model grok-4.3
The pith
A neural network trained on simulated null and alternative samples yields a test statistic that, in the paper's large-scale simulation study, attains the highest overall power for independence testing against nineteen competing methods.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Deep-testing approaches the classical problem of hypothesis testing by training a deep neural network on simulated data satisfying the null and alternative hypotheses; the resulting classification map serves as the test statistic and leverages the network's strong discriminating power to produce a highly powerful test. As a proof of concept the method is applied to independence testing, where a large-scale simulation study shows that deep-testing attains the highest overall power among nineteen competing procedures across a broad range of complex dependence structures.
What carries the argument
The classification map learned by a deep neural network trained on simulated samples from the null and alternative hypotheses, used directly as the test statistic.
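The idea can be sketched in a few lines. The block below is a toy illustration under stated assumptions, not the paper's implementation: a logistic regression on a 4 x 4 histogram "image" of the scatter plot stands in for the deep network, and the dependent data-generating process, the function names (`simulate_sample`, `sample_to_image`, `fit_test_statistic`) and all tuning values are hypothetical.

```python
import math
import random


def simulate_sample(n, dependent, rng):
    """Draw n pairs with Uniform(0, 1) margins; dependent=True links y to x."""
    pairs = []
    for _ in range(n):
        x = rng.random()
        if dependent:
            # Toy alternative (assumption): y stays in a wrapped band around x.
            y = (x + rng.uniform(-0.15, 0.15)) % 1.0
        else:
            y = rng.random()
        pairs.append((x, y))
    return pairs


def sample_to_image(pairs, bins=4):
    """Flatten the scatter plot into a bins x bins grid of cell proportions."""
    image = [0.0] * (bins * bins)
    for x, y in pairs:
        row = min(int(x * bins), bins - 1)
        col = min(int(y * bins), bins - 1)
        image[row * bins + col] += 1.0 / len(pairs)
    return image


def predict(weights, bias, image):
    """Classifier output in (0, 1); larger values favour the alternative."""
    z = bias + sum(w * v for w, v in zip(weights, image))
    z = max(-30.0, min(30.0, z))  # guard against overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def train_classifier(images, labels, rng, epochs=30, lr=1.0):
    """Logistic regression fitted by stochastic gradient descent."""
    weights, bias = [0.0] * len(images[0]), 0.0
    order = list(range(len(images)))
    for _ in range(epochs):
        rng.shuffle(order)
        for k in order:
            error = labels[k] - predict(weights, bias, images[k])
            bias += lr * error
            for idx, value in enumerate(images[k]):
                weights[idx] += lr * error * value
    return weights, bias


def fit_test_statistic(n=50, units_per_class=200, seed=0):
    """Train on simulated null/alternative units; return the learned statistic."""
    rng = random.Random(seed)
    images, labels = [], []
    for label in (0, 1):
        for _ in range(units_per_class):
            images.append(sample_to_image(simulate_sample(n, bool(label), rng)))
            labels.append(label)
    weights, bias = train_classifier(images, labels, rng)
    # The classification map itself is the test statistic.
    return lambda pairs: predict(weights, bias, sample_to_image(pairs))
```

Calling `fit_test_statistic()` returns a function that maps a sample of pairs to a number in (0, 1). Both toy classes have uniform margins by construction, so this classifier can only separate them through the dependence between the coordinates.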
If this is right
- The learned classifier can serve as a test statistic for independence without requiring explicit formulas for the null distribution.
- High power is maintained across a wide variety of dependence structures that are difficult for traditional tests.
- The procedure offers a general template that can be applied to other hypothesis-testing problems by changing the simulation protocol.
- Performance gains arise from the network's ability to extract discriminating features directly from the sample geometry.
Where Pith is reading between the lines
- The same training strategy could be used to construct tests for multivariate dependence or for conditional independence by adjusting the simulation design.
- One would still need to verify that the p-value obtained from the network output is calibrated under the null on data sets whose marginal distributions differ from those used in training.
- Hybrid methods that combine the network-based statistic with classical rank-based tests might improve robustness when sample sizes are small.
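One standard way to make a statistic insensitive to the marginal distributions is to compute it on ranks. Whether the paper does this is not stated here, so the sketch below is only an assumption-laden illustration; `pseudo_observations` and `to_rank_scale` are hypothetical names.

```python
import math
import random


def pseudo_observations(values):
    """Map values to ranks / (n + 1), which lie strictly between 0 and 1."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    for position, index in enumerate(order, start=1):
        ranks[index] = position / (n + 1)
    return ranks


def to_rank_scale(pairs):
    """Replace each coordinate of a bivariate sample by its pseudo-observation."""
    xs = pseudo_observations([x for x, _ in pairs])
    ys = pseudo_observations([y for _, y in pairs])
    return list(zip(xs, ys))


rng = random.Random(0)
gaussian_pairs = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(20)]
# Same sample after strictly increasing transformations of each margin.
skewed_pairs = [(math.exp(x), y ** 3) for x, y in gaussian_pairs]
same_input = to_rank_scale(gaussian_pairs) == to_rank_scale(skewed_pairs)
print(same_input)
```

Because ranks are unchanged by strictly increasing transformations, any statistic computed on the rank-scaled sample sees the same input regardless of continuous margins. The calibration question raised above then applies only to statistics computed on the raw scale.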
Load-bearing premise
A neural network trained on simulated null and alternative samples will produce a test statistic whose null distribution can be reliably calibrated and that generalizes to yield valid and powerful tests on real data.
What would settle it
Applying the trained classifier to fresh samples generated under the null hypothesis of independence, drawn from the same distributions used in training, and checking whether the empirical rejection rate matches the nominal significance level up to Monte Carlo error.
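A minimal sketch of such a check follows. It assumes a stand-in statistic (absolute Pearson correlation, not the paper's network), Gaussian margins, and hypothetical helper names; the point is only the shape of the experiment.

```python
import math
import random


def abs_pearson(pairs):
    """Stand-in statistic (NOT the paper's network): absolute Pearson correlation."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return abs(sxy) / math.sqrt(sxx * syy)


def simulate_null(n, rng):
    """Independent pairs from the (assumed) training margins."""
    return [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]


def empirical_rejection_rate(statistic, threshold, n, trials, rng):
    """Fraction of fresh null samples on which the test rejects."""
    rejections = sum(statistic(simulate_null(n, rng)) > threshold
                     for _ in range(trials))
    return rejections / trials


def monte_carlo_tolerance(alpha, trials, z=3.0):
    """z binomial standard errors around alpha for a rate based on `trials` draws."""
    return z * math.sqrt(alpha * (1 - alpha) / trials)


alpha, n, trials = 0.05, 30, 2000
rng = random.Random(0)
# Threshold from a separate batch of null simulations
# (its own Monte Carlo error is ignored in the tolerance below).
null_stats = sorted(abs_pearson(simulate_null(n, rng)) for _ in range(5000))
threshold = null_stats[math.ceil((1 - alpha) * len(null_stats)) - 1]
rate = empirical_rejection_rate(abs_pearson, threshold, n, trials, rng)
tolerance = monte_carlo_tolerance(alpha, trials)
print(f"rejection rate {rate:.3f}, target {alpha}, tolerance {tolerance:.3f}")
```

A rejection rate far outside the tolerance band would indicate miscalibration. Repeating the run with `simulate_null` drawing from margins that differ from the training ones would probe the generalization concern as well.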
Original abstract
Deep learning methods have proved highly effective for classification and image recognition problems. In this paper, we ask whether this success can be transferred to hypothesis testing: if a neural network can distinguish, for example, an image of a handwritten digit from another, can it also distinguish an "image of a sample" (such as a scatter plot) generated under a given statistical model from one generated outside that model? Motivated by this idea, we propose a novel procedure called deep-testing, which approaches the classical inferential problem of hypothesis testing through deep learning. More specifically, the test statistic is a classification map learned by a deep neural network from simulated data satisfying the null and alternative hypotheses, leveraging its strong discriminating power to construct a highly powerful test. As a proof of concept, we apply deep-testing to the problem of independence testing, arguably one of the most important problems in statistics. In a large-scale simulation study, deep-testing achieves the highest overall power against nineteen competing methods across a broad range of complex dependence structures, confirming the viability of the proposed approach.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes deep-testing, a hypothesis testing framework that trains a deep neural network on simulated data drawn from the null and alternative distributions to learn a classification map serving as the test statistic. As a proof of concept, the method is applied to independence testing; a large-scale simulation study reports that deep-testing attains the highest overall power among nineteen competing procedures across a range of complex dependence structures.
Significance. If the simulation results are obtained under strict separation between training and evaluation distributions and the procedure maintains valid type-I error control, the work demonstrates that deep-learning classifiers can be repurposed as powerful, flexible test statistics for nonparametric problems where analytic forms are unavailable. The empirical breadth of the study provides concrete evidence that the approach is viable for dependence detection, though its broader utility hinges on generalization beyond the simulated regimes.
major comments (2)
- [Simulation study] Simulation study section: the manuscript does not explicitly state whether the dependence structures (or their parameterizations) used to generate the training alternatives are disjoint from those used to evaluate power. Without this separation, the reported power ranking could reflect the network's ability to exploit simulation-specific artifacts rather than a general advantage over the nonparametric competitors.
- [Method] Method section (around the definition of the test statistic): it is unclear how the threshold for the learned classifier output is calibrated to guarantee finite-sample or asymptotic type-I error control. Because the network is trained on external simulated data, the null distribution of the resulting statistic is not automatically pivotal and requires a separate calibration step whose details are not provided.
minor comments (2)
- [Abstract] The abstract lists 'nineteen competing methods' without naming them or providing a reference table; adding this information would allow readers to assess the breadth of the comparison immediately.
- [Notation] Notation for the network output (e.g., the precise mapping from classifier probability to test statistic) should be introduced with an equation number in the methods section to facilitate later discussion of calibration.
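For concreteness, the kind of notation the referee asks for might look as follows. This is a hypothetical sketch: the symbols $T_n$, $\hat{g}_\theta$, $\phi_\alpha$ and $c_\alpha$ are illustrative choices and are not taken from the paper.

```latex
% Hypothetical notation; symbols are illustrative, not taken from the paper.
\begin{align}
  T_n &= \hat{g}_\theta\bigl((X_1, Y_1), \dots, (X_n, Y_n)\bigr) \in [0, 1], \\
  \phi_\alpha &= \mathbf{1}\{T_n > c_\alpha\}, \qquad
  c_\alpha = \inf\bigl\{c : \Pr\nolimits_{H_0}(T_n > c) \le \alpha\bigr\}.
\end{align}
```

Here $\hat{g}_\theta$ denotes the trained classification map applied to the whole sample, and $c_\alpha$ is the critical value that a calibration step would have to estimate.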
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive comments. We address the major comments point by point below.
- Referee: [Simulation study] Simulation study section: the manuscript does not explicitly state whether the dependence structures (or their parameterizations) used to generate the training alternatives are disjoint from those used to evaluate power. Without this separation, the reported power ranking could reflect the network's ability to exploit simulation-specific artifacts rather than a general advantage over the nonparametric competitors.
  Authors: We thank the referee for this observation. Upon checking, the training alternatives were generated from dependence structures and parameterizations that are disjoint from the evaluation set to avoid any potential for the network to exploit simulation-specific features. We will revise the manuscript to explicitly state this separation in the Simulation study section, including a description of the distinct sets used for training and evaluation.
  Revision: yes
- Referee: [Method] Method section (around the definition of the test statistic): it is unclear how the threshold for the learned classifier output is calibrated to guarantee finite-sample or asymptotic type-I error control. Because the network is trained on external simulated data, the null distribution of the resulting statistic is not automatically pivotal and requires a separate calibration step whose details are not provided.
  Authors: We agree that additional details on threshold calibration are necessary. The procedure involves simulating a large number of samples under the null hypothesis after training, computing the classifier outputs on these samples, and determining the threshold as the appropriate quantile to achieve the desired type-I error rate. This ensures finite-sample control. We will update the Method section to provide a complete description of this calibration process, along with theoretical justification for its validity.
  Revision: yes
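The calibration step described in this response can be sketched as below. This is an illustration under assumptions, not the authors' code: `classifier_output` is a hypothetical stand-in for the trained network (squared Pearson correlation), and the quantile index follows the usual Monte Carlo test convention, which controls the level when the observed and simulated statistics are exchangeable under the null.

```python
import math
import random


def calibrate_threshold(null_statistics, alpha):
    """Critical value c such that rejecting when T > c matches the p-value rule below."""
    ordered = sorted(null_statistics)
    b = len(ordered)
    # Order-statistic index tied to the (1 + count) / (B + 1) p-value;
    # the small offset guards against floating-point round-up.
    k = math.ceil((1 - alpha) * (b + 1) - 1e-9)
    return ordered[k - 1] if k <= b else math.inf


def monte_carlo_p_value(null_statistics, observed):
    """Monte Carlo p-value with the usual +1 correction."""
    exceed = sum(t >= observed for t in null_statistics)
    return (1 + exceed) / (len(null_statistics) + 1)


def classifier_output(pairs):
    """Hypothetical stand-in for the trained network: squared correlation in [0, 1]."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy * sxy / (sxx * syy)


rng = random.Random(0)
n, alpha, b = 40, 0.05, 999
# Step 1: simulate B null samples after training and score each one.
null_outputs = [classifier_output([(rng.random(), rng.random()) for _ in range(n)])
                for _ in range(b)]
# Step 2: threshold at the appropriate upper quantile of the null outputs.
threshold = calibrate_threshold(null_outputs, alpha)
# Step 3: score an observed sample (here a dependent toy sample) and test.
observed = []
for _ in range(n):
    x = rng.random()
    observed.append((x, x + rng.gauss(0, 0.3)))
observed_output = classifier_output(observed)
p_value = monte_carlo_p_value(null_outputs, observed_output)
print(threshold, p_value, observed_output > threshold)
```

The guarantee only covers null distributions that match the simulated one. If the real data have margins unlike those used in Step 1, the finite-sample claim needs the separate justification the referee requests.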
Circularity Check
No significant circularity in deep-testing procedure or simulation claims
The paper defines deep-testing by training a neural network classifier on independently generated simulated samples drawn from the null (independence) and chosen alternatives, then uses the resulting classification map as the test statistic. Power is assessed via a separate large-scale Monte Carlo study that applies the trained statistic to fresh draws from a range of dependence structures and compares rejection rates against nineteen other methods. No equation or claim reduces by construction to a parameter fitted on the same data being tested, no self-citation supplies a load-bearing uniqueness result, and the training simulations are external to any real-data application. The derivation is therefore self-contained as a standard empirical simulation-based procedure.
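A Monte Carlo power study of this kind has a simple skeleton, sketched below under stated assumptions. The statistic is a stand-in (absolute Pearson correlation), the two alternatives are toy examples rather than the paper's dependence structures, and `simulate`, `estimate_power` and `power_table` are hypothetical names.

```python
import math
import random


def abs_pearson(pairs):
    """Stand-in competitor statistic: absolute Pearson correlation."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return abs(sxy) / math.sqrt(sxx * syy)


def simulate(kind, n, rng):
    """Toy DGPs (assumptions): 'null' is independence, the others are alternatives."""
    pairs = []
    for _ in range(n):
        x = rng.uniform(-1, 1)
        if kind == "null":
            y = rng.uniform(-1, 1)
        elif kind == "linear":
            y = x + rng.gauss(0, 0.5)
        elif kind == "quadratic":
            y = x * x + rng.gauss(0, 0.1)
        else:
            raise ValueError(f"unknown DGP: {kind}")
        pairs.append((x, y))
    return pairs


def estimate_power(statistic, kind, threshold, n, trials, rng):
    """Rejection rate of the calibrated test on fresh draws from one alternative."""
    hits = sum(statistic(simulate(kind, n, rng)) > threshold for _ in range(trials))
    return hits / trials


def power_table(statistic, alternatives, n=30, alpha=0.05,
                calib=2000, trials=500, seed=0):
    """Calibrate once under the null, then estimate power per alternative."""
    rng = random.Random(seed)
    null_stats = sorted(statistic(simulate("null", n, rng)) for _ in range(calib))
    threshold = null_stats[math.ceil((1 - alpha) * calib) - 1]
    return {kind: estimate_power(statistic, kind, threshold, n, trials, rng)
            for kind in alternatives}


table = power_table(abs_pearson, ["linear", "quadratic"])
print(table)
```

Running the same harness for each competing statistic, on alternatives kept disjoint from any training set, gives the per-structure rejection rates that an overall power ranking would aggregate. The linear-only sensitivity of the stand-in statistic shows why such rankings depend heavily on which alternatives are included.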
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: A deep neural network trained on simulated samples can learn a discriminating map between null and alternative distributions that generalizes to produce a powerful test statistic on real data.