Comparing Two Categorical Gini Correlations with Applications to Classification Problems
Pith reviewed 2026-05-20 01:43 UTC · model grok-4.3
The pith
A test for the difference between two categorical Gini correlations enables comparison of predictor importance for categorical classification outcomes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the difference of two categorical Gini correlations yields a test statistic that is asymptotically normal under both the null and alternative hypotheses, is consistent against alternatives, accommodates arbitrary and unequal predictor dimensions, and remains valid when the predictor groups are dependent.
What carries the argument
The difference of two categorical Gini correlations, together with the derived test statistic whose asymptotic normality is established under regularity conditions on the data.
If this is right
- Predictor groups of arbitrary and unequal dimensions can be compared directly for their association with the categorical outcome.
- The test remains valid when the two predictor groups exhibit dependence.
- Inference can proceed either through the established asymptotic normal approximation or through a nonparametric bootstrap procedure.
- The method supplies a practical tool for assessing relative predictor importance in classification problems with categorical responses.
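The bootstrap route mentioned in the list above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It assumes the sample CGC takes the form 1 minus the class-weighted within-class mean pairwise distance divided by the overall mean pairwise distance (the definition in Dang et al. as recalled here), and it resamples within each class with a percentile interval, which is one reasonable reading of the paper's bootstrap rather than a confirmed detail. All function names and the toy data are hypothetical.

```python
import numpy as np


def mean_pairwise_distance(x):
    """Average Euclidean distance over distinct pairs of rows (U-statistic form)."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    n = x.shape[0]
    if n < 2:
        return 0.0
    dists = np.sqrt(((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1))
    return dists.sum() / (n * (n - 1))  # diagonal is zero, so only i != j contributes


def sample_cgc(x, y):
    """Assumed sample CGC: 1 - sum_k p_k * Delta_k / Delta."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    y = np.asarray(y)
    overall = mean_pairwise_distance(x)
    within = sum(np.mean(y == k) * mean_pairwise_distance(x[y == k])
                 for k in np.unique(y))
    return 1.0 - within / overall


def bootstrap_cgc_difference(x1, x2, y, n_boot=500, level=0.95, seed=0):
    """Observed CGC difference and a percentile bootstrap interval.

    The same resampled rows are used for both predictor groups, so any
    dependence between x1 and x2 is carried into each bootstrap replicate.
    """
    rng = np.random.default_rng(seed)
    x1, x2, y = np.asarray(x1, float), np.asarray(x2, float), np.asarray(y)
    class_rows = [np.flatnonzero(y == k) for k in np.unique(y)]
    observed = sample_cgc(x1, y) - sample_cgc(x2, y)
    reps = np.empty(n_boot)
    for b in range(n_boot):
        # Resample with replacement inside each class (class sizes stay fixed).
        idx = np.concatenate([rng.choice(rows, size=rows.size, replace=True)
                              for rows in class_rows])
        reps[b] = sample_cgc(x1[idx], y[idx]) - sample_cgc(x2[idx], y[idx])
    alpha = 1.0 - level
    lo, hi = np.quantile(reps, [alpha / 2, 1.0 - alpha / 2])
    return observed, (lo, hi)


# Toy data (hypothetical): a 3-column informative group vs. a 5-column noise group.
rng = np.random.default_rng(1)
y = np.repeat([0, 1], 30)
x1 = rng.standard_normal((60, 3)) + 2.0 * y[:, None]
x2 = rng.standard_normal((60, 5))
diff, (lo, hi) = bootstrap_cgc_difference(x1, x2, y, n_boot=200)
```

An interval that excludes zero would suggest the two groups differ in their association with the outcome. The groups here deliberately have different numbers of columns to mirror the unequal-dimension setting.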
Where Pith is reading between the lines
- The framework could support sequential testing to rank more than two predictor groups in a single analysis.
- Similar difference-of-dependence tests might be constructed for other categorical dependence measures beyond the Gini correlation.
- In applied settings the test could guide feature-set selection before fitting a final classifier.
Load-bearing premise
The data distributions must satisfy regularity conditions that make the categorical Gini correlation a valid dependence measure and allow the central limit theorem to apply to the test statistic.
What would settle it
Large-sample simulations drawn from distributions that violate the regularity conditions should produce a test statistic whose empirical distribution deviates markedly from normality under the null hypothesis of equal correlations.
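A simulation of this kind could be set up roughly as follows. The sketch rests on several assumptions that are not taken from the paper: the sample CGC is written as 1 minus the class-weighted within-class mean pairwise distance over the overall mean pairwise distance, a Cauchy sampler stands in for "a distribution that violates the regularity conditions" on the assumption that those conditions include finite moments, and all names are hypothetical. It only produces the replicated statistics and two crude shape summaries; it makes no claim about what they will show.

```python
import numpy as np


def mean_pairwise_distance(x):
    """Average Euclidean distance over distinct pairs of rows of a 2-D array."""
    n = x.shape[0]
    d = np.sqrt(((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1))
    return d.sum() / (n * (n - 1))


def sample_cgc(x, y):
    """Assumed sample CGC: 1 - sum_k p_k * Delta_k / Delta."""
    within = sum((y == k).mean() * mean_pairwise_distance(x[y == k])
                 for k in np.unique(y))
    return 1.0 - within / mean_pairwise_distance(x)


def null_differences(sampler, n=100, reps=200, seed=0):
    """Replicates of sqrt(n) * (CGC_1 - CGC_2) when both groups are drawn the same way."""
    rng = np.random.default_rng(seed)
    y = np.repeat([0, 1], n // 2)
    out = np.empty(reps)
    for r in range(reps):
        # Identical generating mechanism for both groups, so neither is favoured.
        x1 = sampler(rng, n) + y[:, None]
        x2 = sampler(rng, n) + y[:, None]
        out[r] = np.sqrt(n) * (sample_cgc(x1, y) - sample_cgc(x2, y))
    return out


def skewness_and_excess_kurtosis(z):
    """Crude normality diagnostics; both are near zero for Gaussian data."""
    z = (z - z.mean()) / z.std()
    return (z ** 3).mean(), (z ** 4).mean() - 3.0


normal_draws = null_differences(lambda rng, n: rng.standard_normal((n, 2)))
cauchy_draws = null_differences(lambda rng, n: rng.standard_cauchy((n, 2)))
normal_shape = skewness_and_excess_kurtosis(normal_draws)
cauchy_shape = skewness_and_excess_kurtosis(cauchy_draws)
```

Comparing `normal_shape` with `cauchy_shape`, ideally over increasing `n` and with a formal normality test, is one way to look for the deviation described above.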
Original abstract
This article proposes an inferential framework for comparing predictor importance in classification problems with categorical response variables. The approach is based on the categorical Gini correlation (CGC) proposed by Dang et al. (2020), a measure of dependence between numerical predictors and categorical outcomes. Predictor importance is evaluated by testing differences in CGCs across competing predictor groups. The proposed methodology accommodates predictors of arbitrary and unequal dimensions and allows for dependence between predictor groups. Asymptotic normality of the test statistic is established under both the null and alternative hypotheses, and the resulting test is shown to be consistent. In addition to deriving the asymptotic distribution, a nonparametric bootstrap procedure is developed as an alternative approach to inference. Simulation studies, along with applications to breast cancer and human activity recognition datasets, demonstrate the effectiveness of the proposed framework.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a test for the difference between two categorical Gini correlations (CGCs) to compare predictor importance for a categorical response in classification settings. Extending Dang et al. (2020), it derives asymptotic normality of the difference estimator under both the null and alternative, proves consistency, accommodates unequal predictor dimensions, and allows dependence between the two predictor groups. A nonparametric bootstrap is also developed. The claims are supported by simulation studies and applications to breast cancer and human activity recognition data.
Significance. If the joint asymptotic results hold, the work supplies a practical inferential tool for assessing relative predictor strength when dimensions differ and groups may be dependent, which is common in classification. The bootstrap alternative and real-data examples add immediate usability. The extension to the difference of two CGCs under dependence fills a methodological gap left by the single-CGC theory in the cited prior work.
major comments (1)
- [§3.2, Theorem 3.1] §3.2, Theorem 3.1 and the subsequent joint CLT argument: the asymptotic normality claim under the alternative when the two predictor groups are dependent requires an explicit joint limiting distribution that incorporates the cross-covariance between the two empirical CGC estimators. The manuscript appears to state the marginal CLTs from Dang et al. (2020) and then invoke a delta-method step; without a displayed expression for the off-diagonal covariance term or a verification that the regularity conditions (e.g., finite moments of the joint kernel) remain sufficient under dependence, the limiting variance under HA is not fully secured.
minor comments (2)
- [§2.1] The notation for the two predictor vectors X and Y is introduced without an explicit statement that their dimensions p and q may be unequal; a short sentence clarifying this point would aid readability.
- [§4] Simulation tables report empirical rejection rates but do not include standard errors across the Monte Carlo replications; adding these would strengthen the evidence that the bootstrap and asymptotic versions behave comparably.
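The standard error requested in the second minor comment is the usual binomial one for a proportion estimated from independent replications. A minimal sketch follows; the function name and the counts are hypothetical and are not taken from the manuscript's tables.

```python
import math


def rejection_rate_with_se(rejections, replications):
    """Empirical rejection rate and its Monte Carlo standard error.

    Treats each replication as an independent Bernoulli trial, so the
    standard error is sqrt(p * (1 - p) / R).
    """
    rate = rejections / replications
    se = math.sqrt(rate * (1.0 - rate) / replications)
    return rate, se


# Hypothetical example: 53 rejections in 1000 replications at nominal level 0.05.
rate, se = rejection_rate_with_se(rejections=53, replications=1000)
```

With these illustrative counts the standard error is about 0.007, so two methods whose rejection rates differ by less than roughly 0.01 to 0.02 cannot be distinguished at that number of replications.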
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive feedback on our manuscript. The point raised about the joint limiting distribution under dependence is well taken, and we address it directly below.
- Referee: [§3.2, Theorem 3.1] §3.2, Theorem 3.1 and the subsequent joint CLT argument: the asymptotic normality claim under the alternative when the two predictor groups are dependent requires an explicit joint limiting distribution that incorporates the cross-covariance between the two empirical CGC estimators. The manuscript appears to state the marginal CLTs from Dang et al. (2020) and then invoke a delta-method step; without a displayed expression for the off-diagonal covariance term or a verification that the regularity conditions (e.g., finite moments of the joint kernel) remain sufficient under dependence, the limiting variance under HA is not fully secured.
  Authors: We agree that the presentation in Section 3.2 would benefit from an explicit statement of the joint limiting distribution of the two empirical CGC estimators under dependence. In the revision we will add the full asymptotic covariance matrix, including the off-diagonal cross-covariance term between the two U-statistic estimators, and apply the delta method to the difference. We will also verify that the moment conditions on the joint kernel (finite second moments) remain sufficient under the dependence structure permitted by the paper. These clarifications will be inserted into the statement of Theorem 3.1 and the surrounding text.
  Revision: yes
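The step at issue in this exchange can be written out generically. The display below is a sketch under the assumption that a joint CLT holds for the two estimators; the symbols \Sigma, \sigma_{11}, \sigma_{22}, and \sigma_{12} are notation introduced here for illustration and are not taken from the manuscript.

```latex
\sqrt{n}\left(
  \begin{pmatrix}\hat\rho_1 \\ \hat\rho_2\end{pmatrix}
  - \begin{pmatrix}\rho_1 \\ \rho_2\end{pmatrix}
\right)
\xrightarrow{d} N\!\left(\mathbf{0},\, \Sigma\right),
\qquad
\Sigma = \begin{pmatrix}\sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22}\end{pmatrix}
\quad\Longrightarrow\quad
\sqrt{n}\left((\hat\rho_1 - \hat\rho_2) - (\rho_1 - \rho_2)\right)
\xrightarrow{d} N\!\left(0,\; \sigma_{11} + \sigma_{22} - 2\sigma_{12}\right).
```

The limiting variance is the quadratic form of \Sigma with the gradient (1, -1) of the difference map. The cross term -2\sigma_{12} is exactly what the marginal CLTs alone do not supply, which is why the referee asks for it to be displayed.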
Circularity Check
No significant circularity; derivation is self-contained
The paper defines its test statistic from the categorical Gini correlation measure introduced in the independent prior work of Dang et al. (2020) and then derives the joint asymptotic normality of the difference under both null and alternative hypotheses, including the case of dependent predictor groups with unequal dimensions. This joint limiting distribution is presented as a new technical result rather than a direct renaming or algebraic reduction of the single-CGC asymptotics. The nonparametric bootstrap is offered as a separate computational alternative, not as the justification for the limiting distribution. No self-definitional loops, fitted-input predictions, or load-bearing self-citations appear in the derivation chain; the central claims therefore retain independent mathematical content.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: Theorem 2.1 … √n(ρ̂1 − ρ̂2) → N(0, σ0²) … jackknife estimator ĉM … U-statistic kernel hkl
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: C1–C4 … non-degenerate U-statistic … bootstrap within each class
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Anguita, D., Ghio, A., Oneto, L., Parra, X., and Reyes-Ortiz, J. L. (2012). Human Activity Recognition Using Smartphones [Dataset]. UCI Machine Learning Repository. DOI: https://doi.org/10.24432/C54S4K
- [2] Anguita, D., Ghio, A., Oneto, L., Parra, X., and Reyes-Ortiz, J. L. (2013). A Public Domain Dataset for Human Activity Recognition Using Smartphones. Proceedings of the 21st European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN). Bruges, Belgium, 437–442. Available at: https://i6doc.com/en/book/?GCOI=28001...
- [3] Aich, A., Murshed, M. M., Hewage, S., and Mayeaux, A. (2026). A copula based supervised filter for feature selection in machine learning driven diabetes risk prediction. Scientific Reports, 16, 12132. DOI: https://doi.org/10.1038/s41598-026-41874-9
- [4] Aich, A., Murshed, M. M., Hewage, S., and Mayeaux, A. (2025). CopulaSMOTE: A Copula-Based Oversampling Approach for Imbalanced Classification in Diabetes Prediction. arXiv preprint arXiv:2506.17326. DOI: https://doi.org/10.48550/arXiv.2506.17326
- [5] Aich, A., Hewage, S., and Murshed, M. M. (2025). Copula Based Fusion of Clinical and Genomic Machine Learning Risk Scores for Breast Cancer Risk Stratification. arXiv preprint arXiv:2511.17605. DOI: https://doi.org/10.48550/arXiv.2511.17605
- [6] Aich, A., Murshed, M. M., Hewage, S., and Aich, A. B. (2026). Bayesian Inference for Joint Tail Risk in Paired Biomarkers via Archimedean Copulas with Restricted Jeffreys Priors. arXiv preprint arXiv:2602.15319. DOI: https://doi.org/10.48550/arXiv.2602.15319
- [7] Belhechmi, S., De Bin, R., Rotolo, F. and Michiels, S. (2020). Accounting for grouped predictor variables or pathways in high dimensional penalized Cox regression models. BMC Bioinformatics, 21(1):277. DOI: https://doi.org/10.1186/s12859-020-03618-y
- [8] Buch, G., Schulz, A., Schmidtmann, I., Strauch, K. and Wild, P. S. (2021). A systematic review and evaluation of statistical methods for group variable selection. Stat. Med., 42, 331-352. DOI: https://doi.org/10.1002/sim.9620
- [9] Cheng, G., Li, X., Lai, P., Song, F. and Yu, J. (2017). Robust rank screening for ultrahigh dimensional discriminant analysis. Stat. Comput., 27(2), 535-545. DOI: https://doi.org/10.1007/s11222-016-9637-2
- [10] Cui, H., Li, R. and Zhong, W. (2015). Model-free feature screening for ultrahigh dimensional discriminant analysis. J. Amer. Statist. Assoc., 110, 630-641. DOI: https://doi.org/10.1080/01621459.2014.920256
- [11] Dang, X., Nguyen, D., Chen, X. and Zhang, J. (2021). A new Gini correlation between quantitative and qualitative variables. Scand. J. Stat., 48(4), 1314-1343. DOI: https://doi.org/10.1111/sjos.12490
- [12] Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space (with discussion). Journal of the Royal Statistical Society, Series B, 70, 849-911. DOI: https://doi.org/10.1111/j.1467-9868.2008.00674.x
- [13] Gini, C. (1914). On the measurement of concentration and variability of characters. Metron, LXIII(1), 3-38.
- [14] Goldman, M.J., Craft, B., Hastie, M., Repečka, K., McDade, F., Kamath, A., Banerjee, A., Luo, Y., Rogers, D., Brooks, A.N. and Zhu, J. (2020). Visualizing and interpreting cancer genomics data via the Xena platform. Nat. Biotechnol., 38(6), 675-678. DOI: https://doi.org/10.1038/s41587-020-0546-8
- [15] Hewage, S. and Sang, Y. (2024). Jackknife empirical likelihood confidence intervals for the categorical Gini correlation. J. Stat. Plan. Inference, 231, 106123. DOI: https://doi.org/10.1016/j.jspi.2023.106123
- [16] Hewage, S. (2025). A Nonparametric K-sample Test for Variability Based on Gini’s Mean Difference. J. Stat. Theory Appl., 24(2), 334–353. DOI: https://doi.org/10.1007/s44199-025-00112-3
- [17] Hewage, S. (2025). gcor: A Python Implementation of Categorical Gini Correlation and Its Inference. arXiv preprint arXiv:2506.19230. DOI: https://doi.org/10.48550/arXiv.2506.19230
- [18] Hewage, S. S. (2025). Statistical Inference for Categorical Gini Correlation and Gini’s Mean Difference. University of Louisiana at Lafayette.
- [19]
- [20] DOI: https://doi.org/10.1016/j.csda.2019.02.003
- [21] Hoeffding, W. (1948). A class of statistics with asymptotically normal distribution. Ann. Math. Statist., 19, 293-325. DOI: https://doi.org/10.1214/aoms/1177730196
- [22] Hotelling, H. (1940). The selection of variates for use in prediction with some comments on the general problem of nuisance parameters. Ann. Math. Statist., 11, 271-283. DOI: https://doi.org/10.1214/aoms/1177731867
- [23] Lai, P., Song, F., Chen, K. and Liu, Z. (2017). Model free feature screening with dependent variable in ultrahigh dimensional binary classification. Statist. Probab. Lett., 125, 141-148. DOI: https://doi.org/10.1016/j.spl.2017.02.011
- [24] Mai, Q. and Zou, H. (2013). The Kolmogorov Filter for Variance Screening in High-Dimensional Binary Classification. Biometrika, 100, 229-234. DOI: https://doi.org/10.1093/biomet/ass062
- [25] Meier, L., Van De Geer, S. and Bühlmann, P. (2008). The group Lasso for logistic regression. J. R. Stat. Soc. Ser. B. Stat. Methodol., 70, 53-71. DOI: https://doi.org/10.1111/j.1467-9868.2007.00627.x
- [26] Meng, X.L., Rosenthal, R. and Rubin, D.B. (1992). Comparing correlated correlation coefficients. Psych. Bull., 111, 172-175. DOI: https://doi.org/10.1037/0033-2909.111.1.172
- [27] Mercer, J. (1909). Functions of positive and negative type, and their connection with the theory of integral equations. Philos. Trans. Roy. Soc. A, 209, 415-446. DOI: https://doi.org/10.1098/rsta.1909.0016
- [28] Neil, J.J. and Dunn, O.J. (1975). Equality of dependent correlation coefficients. Biometrics, 31, 531-543. DOI: https://doi.org/10.2307/2529435
- [29] Ni, L. and Fang, F. (2016). Entropy-based model-free feature screening for ultrahigh dimensional multiclass classification. J. Nonparametr. Stat., 28(3), 515-530. DOI: https://doi.org/10.1080/10485252.2016.1167206
- [30] Niu, Y., Zhang, R., Liu, J. and Li, H. (2020). Group screening for ultra-high-dimensional feature under linear model. Stat. Theor. Relat. Field., 4(1), 43-54. DOI: https://doi.org/10.1080/24754269.2019.1633763
- [31] Olkin, I. (1967). Correlations revisited. In J.C. Stanley (Ed.), Improving experimental design and statistical analysis. Chicago, IL: Rand McNally, pp. 102-128.
- [32] Parker, J.S., Mullins, M., Cheang, M.C., Leung, S., Voduc, D., Vickery, T., Davies, S., Fauron, C., He, X., Hu, Z. and Quackenbush, J.F. (2009). Supervised risk predictor of breast cancer based on intrinsic subtypes. J. Clin. Oncol., 27(8), 1160-1167. DOI: https://doi.org/10.1200/JCO.2008.18.1370
- [33] Ramos-Carreño, C. and Torrecilla, J. L. (2023). dcor: Distance correlation and energy statistics in Python. SoftwareX, 22, 101326. DOI: https://doi.org/10.1016/j.softx.2023.101326
- [34] Sang, Y. and Dang, X. (2024). Grouped feature screening for ultrahigh-dimensional classification via Gini distance correlation. J. Multivar. Anal. DOI: https://doi.org/10.1016/j.jmva.2024.105360
- [35] Sang, Y. and Dang, X. (2023). Asymptotic normality of Gini correlation in high dimension with applications to the K-sample problem. Electron. J. Stat., 17(2), 2539-2574. DOI: https://doi.org/10.1214/23-EJS2165
- [36] Serfling, R.J. (1980). Approximation theorems of mathematical statistics. John Wiley & Sons. DOI: https://doi.org/10.1002/9780470316481
- [37] Shao, J. and Tu, D. (1996). The Jackknife and Bootstrap. Springer, New York. DOI: https://doi.org/10.1007/978-1-4612-0795-5
- [38] Székely, G.J. and Rizzo, M.L. (2013a). Energy statistics: A class of statistics based on distances. J. Stat. Plan. Infer., 143, 1249-1272. DOI: https://doi.org/10.1016/j.jspi.2013.03.018
- [39] Székely, G.J. and Rizzo, M.L. (2017). The energy of data. Ann. Rev. Stat. Appl., 4(1), 447-479. DOI: https://doi.org/10.1146/annurev-statistics-060116-054026
- [40] Székely, G.J., Rizzo, M.L. and Bakirov, N. (2007). Measuring and testing dependence by correlation of distances. Ann. Statist., 35(6), 2769-2794. DOI: https://doi.org/10.1214/009053607000000505
- [41] Wang, Z., Deng, G., and Xu, H. (2023). Group feature screening based on Gini impurity for ultrahigh-dimensional multi-classification. AIMS Math., 8(2), 4342-4362. DOI: https://doi.org/10.3934/math.2023216
- [42] Williams, E.J. (1959a). Significance of difference between two non-independent correlation coefficients. Biometrics, 15, 135-136.
- [43] Wolberg, W., Mangasarian, O., Street, N., and Street, W. (1995). Breast Cancer Wisconsin (Diagnostic) [Dataset]. UCI Machine Learning Repository. DOI: https://doi.org/10.24432/C5DW2B
- [44] Yitzhaki, S. (2003). Gini’s mean difference: A superior measure of variability for non-normal distributions. Metron, 61(2), 285-316.
- [45] Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B. Stat. Methodol., 68(1), 49-67. DOI: https://doi.org/10.1111/j.1467-9868.2005.00532.x
- [46] Zhang, S., Dang, X., Nguyen, D., Wilkins, D. and Chen, Y. (2019). Estimating feature-label dependence using Gini distance statistics. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(6), 1947-1963. DOI: https://doi.org/10.1109/TPAMI.2019.2960358
- [47] Zou, G.Y. (2007). Toward using confidence intervals to compare correlations. Psychol. Methods, 12(4), 399. DOI: https://doi.org/10.1037/1082-989X.12.4.399