arxiv: 2605.17763 · v1 · pith:MHYELPMPnew · submitted 2026-05-18 · 📊 stat.ME · stat.ML

Comparing Two Categorical Gini Correlations with Applications to Classification Problems

Pith reviewed 2026-05-20 01:43 UTC · model grok-4.3

classification 📊 stat.ME stat.ML
keywords categorical Gini correlation · predictor importance · classification · asymptotic normality · bootstrap test · dependence measure · categorical response

The pith

A test for the difference between two categorical Gini correlations enables comparison of predictor importance for categorical classification outcomes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops an inferential method to compare how strongly different groups of numerical predictors relate to a categorical response variable. It builds on the categorical Gini correlation as a dependence measure and constructs a test statistic whose asymptotic distribution is normal under both the hypothesis of equal correlations and the hypothesis of differing correlations. The framework handles predictor groups that may have unequal numbers of variables and may be statistically dependent. A bootstrap procedure is also derived for inference, and the approach is illustrated with simulation studies plus applications to breast cancer and human activity recognition data.

Core claim

The central claim is that the difference of two categorical Gini correlations yields a test statistic that is asymptotically normal under both the null and alternative hypotheses, is consistent against alternatives, accommodates arbitrary and unequal predictor dimensions, and remains valid when the predictor groups are dependent.

What carries the argument

The difference of two categorical Gini correlations, together with the derived test statistic whose asymptotic normality is established under regularity conditions on the data.
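To make the building block concrete, here is a minimal Python sketch of a sample categorical Gini correlation. It assumes the form (Δ − Σ_k p_k Δ_k) / Δ from Dang et al., where Δ is the overall Gini mean difference E‖X1 − X2‖ and Δ_k is the same quantity within class k, with each piece estimated by a U-statistic. The function names are hypothetical, and the paper's own estimator and test statistic may differ in detail.

```python
import numpy as np

def gini_mean_difference(x):
    """U-statistic estimate of E||X1 - X2|| from the rows of x."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    n = x.shape[0]
    if n < 2:
        return 0.0  # convention for this sketch: no pairs, no spread
    # All pairwise Euclidean distances; the diagonal contributes zero.
    dists = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
    return dists.sum() / (n * (n - 1))

def categorical_gini_correlation(x, labels):
    """Assumed sample CGC: (overall GMD - weighted within-class GMD) / overall GMD."""
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    overall = gini_mean_difference(x)
    within = 0.0
    for k in np.unique(labels):
        mask = labels == k
        # mask.mean() is the sample proportion of class k.
        within += mask.mean() * gini_mean_difference(x[mask])
    return (overall - within) / overall
```

Because each predictor group enters only through pairwise distances, the two groups may have different numbers of columns; the quantity being tested is then `categorical_gini_correlation(X, z) - categorical_gini_correlation(Y, z)`.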

If this is right

  • Predictor groups of arbitrary and unequal dimensions can be compared directly for their association with the categorical outcome.
  • The test remains valid when the two predictor groups exhibit dependence.
  • Inference can proceed either through the established asymptotic normal approximation or through a nonparametric bootstrap procedure.
  • The method supplies a practical tool for assessing relative predictor importance in classification problems with categorical responses.
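The bootstrap route mentioned in the list above can be sketched as follows. This is an illustrative percentile bootstrap, not the paper's procedure: it resamples whole observations (X_i, Y_i, Z_i) so that any dependence between the two predictor groups is preserved, and rejects equality when the percentile interval for the difference excludes zero. The CGC formula and all function names are assumptions made for the example.

```python
import numpy as np

def _gmd(x):
    # U-statistic estimate of the Gini mean difference E||X1 - X2||.
    n = x.shape[0]
    if n < 2:
        return 0.0
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
    return d.sum() / (n * (n - 1))

def cgc(x, z):
    # Assumed sample CGC: (overall GMD - weighted within-class GMD) / overall GMD.
    z = np.asarray(z)
    x = np.asarray(x, dtype=float).reshape(len(z), -1)
    overall = _gmd(x)
    within = sum((z == k).mean() * _gmd(x[z == k]) for k in np.unique(z))
    return (overall - within) / overall

def bootstrap_cgc_difference(x, y, z, n_boot=500, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for cgc(x, z) - cgc(y, z)."""
    rng = np.random.default_rng(seed)
    z = np.asarray(z)
    n = len(z)
    x = np.asarray(x, dtype=float).reshape(n, -1)
    y = np.asarray(y, dtype=float).reshape(n, -1)
    observed = cgc(x, z) - cgc(y, z)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        # Resample row indices so X, Y and Z stay paired.
        idx = rng.integers(0, n, size=n)
        diffs[b] = cgc(x[idx], z[idx]) - cgc(y[idx], z[idx])
    lower, upper = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    reject = not (lower <= 0.0 <= upper)
    return observed, (lower, upper), reject
```

Resampling rows jointly is the design choice that matters here: resampling X and Y separately would break the dependence between the two groups that the method is meant to accommodate.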

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The framework could support sequential testing to rank more than two predictor groups in a single analysis.
  • Similar difference-of-dependence tests might be constructed for other categorical dependence measures beyond the Gini correlation.
  • In applied settings the test could guide feature-set selection before fitting a final classifier.

Load-bearing premise

The data distributions must satisfy regularity conditions that make the categorical Gini correlation a valid dependence measure and allow the central limit theorem to apply to the test statistic.

What would settle it

Large-sample simulations drawn from distributions that violate the regularity conditions should produce a test statistic whose empirical distribution deviates markedly from normality under the null hypothesis of equal correlations.

Figures

Figures reproduced from arXiv: 2605.17763 by Sameera Hewage, Yongli Sang.

Figure 1
Figure 1: Size and power of tests in Example 3.2(b). Dashed horizontal line is the nominal level 0.05.

Example 3.3 (Binary Logistic Regression Model). In this example, we consider a logistic regression model where the binary response is generated as $\log \frac{P(Z = 1 \mid V)}{P(Z = -1 \mid V)} = -3 + 2V_1 + 2V_2 + 2V_3 + 3\sin(V_4) + 4V_5^2$, where $V \sim N(0, \Sigma)$ with $\Sigma = (\rho_{ij})_{p \times p}$ having two scenarios, $\rho_{ij} = 0$ and $\rho_{ij} = 0.5^{|j-i|}$, $i \neq$ …
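The data-generating model quoted in this excerpt can be simulated with a few lines of NumPy. The sketch below is only an illustration: the excerpt is truncated, so the dimension `p`, the sample size, and the reading of the second scenario as an AR(1)-type correlation 0.5^|j−i| are assumptions, and the function name is hypothetical.

```python
import numpy as np

def simulate_logistic_example(n, p=5, rho=0.5, seed=0):
    """Draw (V, Z) from the quoted logistic model with labels in {-1, 1}."""
    if p < 5:
        raise ValueError("the model uses V1..V5, so p must be at least 5")
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    # rho = 0 gives independent coordinates; otherwise Sigma_ij = rho^|i-j| (assumed).
    if rho == 0:
        sigma = np.eye(p)
    else:
        sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    v = rng.multivariate_normal(np.zeros(p), sigma, size=n)
    # Log-odds of Z = 1 versus Z = -1, as in the quoted equation.
    log_odds = (-3 + 2 * v[:, 0] + 2 * v[:, 1] + 2 * v[:, 2]
                + 3 * np.sin(v[:, 3]) + 4 * v[:, 4] ** 2)
    prob_pos = 1.0 / (1.0 + np.exp(-log_odds))
    z = np.where(rng.random(n) < prob_pos, 1, -1)
    return v, z
```

Splitting the columns of `v` into two groups, for example the first three coordinates against the remaining ones, yields the kind of competing predictor sets the test is designed to compare.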
Figure 2
Figure 2: Top 15 Random Forest feature importances for the Wisconsin Breast Cancer dataset, …
Abstract

This article proposes an inferential framework for comparing predictor importance in classification problems with categorical response variables. The approach is based on the categorical Gini correlation (CGC) proposed by Dang et al. (2020), a measure of dependence between numerical predictors and categorical outcomes. Predictor importance is evaluated by testing differences in CGCs across competing predictor groups. The proposed methodology accommodates predictors of arbitrary and unequal dimensions and allows for dependence between predictor groups. Asymptotic normality of the test statistic is established under both the null and alternative hypotheses, and the resulting test is shown to be consistent. In addition to deriving the asymptotic distribution, a nonparametric bootstrap procedure is developed as an alternative approach to inference. Simulation studies, along with applications to breast cancer and human activity recognition datasets, demonstrate the effectiveness of the proposed framework.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes a test for the difference between two categorical Gini correlations (CGCs) to compare predictor importance for a categorical response in classification settings. Extending Dang et al. (2020), it derives asymptotic normality of the difference estimator under both the null and alternative, proves consistency, accommodates unequal predictor dimensions, and allows dependence between the two predictor groups. A nonparametric bootstrap is also developed. The claims are supported by simulation studies and applications to breast cancer and human activity recognition data.

Significance. If the joint asymptotic results hold, the work supplies a practical inferential tool for assessing relative predictor strength when dimensions differ and groups may be dependent, which is common in classification. The bootstrap alternative and real-data examples add immediate usability. The extension to the difference of two CGCs under dependence fills a methodological gap left by the single-CGC theory in the cited prior work.

major comments (1)
  1. [§3.2, Theorem 3.1] §3.2, Theorem 3.1 and the subsequent joint CLT argument: the asymptotic normality claim under the alternative when the two predictor groups are dependent requires an explicit joint limiting distribution that incorporates the cross-covariance between the two empirical CGC estimators. The manuscript appears to state the marginal CLTs from Dang et al. (2020) and then invoke a delta-method step; without a displayed expression for the off-diagonal covariance term or a verification that the regularity conditions (e.g., finite moments of the joint kernel) remain sufficient under dependence, the limiting variance under HA is not fully secured.
minor comments (2)
  1. [§2.1] The notation for the two predictor vectors X and Y is introduced without an explicit statement that their dimensions p and q may be unequal; a short sentence clarifying this point would aid readability.
  2. [§4] Simulation tables report empirical rejection rates but do not include standard errors across the Monte Carlo replications; adding these would strengthen the evidence that the bootstrap and asymptotic versions behave comparably.
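The standard errors requested in the second minor comment follow from the binomial variance of an empirical rejection rate. A minimal sketch, with a hypothetical function name and an illustrative replication count that is not taken from the paper:

```python
import math

def rejection_rate_se(rate, n_reps):
    """Monte Carlo standard error of an empirical rejection rate."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must lie in [0, 1]")
    if n_reps <= 0:
        raise ValueError("n_reps must be positive")
    # Each replication is a Bernoulli trial, so the rate has variance p(1 - p) / R.
    return math.sqrt(rate * (1.0 - rate) / n_reps)

# Illustrative values only: nominal size 0.05 with 1000 replications.
print(round(rejection_rate_se(0.05, 1000), 4))
```

At a rate of 0.05 with 1000 replications the standard error is about 0.0069, so differences of one percentage point between the bootstrap and asymptotic versions would be within Monte Carlo noise at that replication count.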

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and constructive feedback on our manuscript. The point raised about the joint limiting distribution under dependence is well taken, and we address it directly below.

  1. Referee: [§3.2, Theorem 3.1] §3.2, Theorem 3.1 and the subsequent joint CLT argument: the asymptotic normality claim under the alternative when the two predictor groups are dependent requires an explicit joint limiting distribution that incorporates the cross-covariance between the two empirical CGC estimators. The manuscript appears to state the marginal CLTs from Dang et al. (2020) and then invoke a delta-method step; without a displayed expression for the off-diagonal covariance term or a verification that the regularity conditions (e.g., finite moments of the joint kernel) remain sufficient under dependence, the limiting variance under HA is not fully secured.

    Authors: We agree that the presentation in Section 3.2 would benefit from an explicit statement of the joint limiting distribution of the two empirical CGC estimators under dependence. In the revision we will add the full asymptotic covariance matrix, including the off-diagonal cross-covariance term between the two U-statistic estimators, and apply the delta method to the difference. We will also verify that the moment conditions on the joint kernel (finite second moments) remain sufficient under the dependence structure permitted by the paper. These clarifications will be inserted into the statement of Theorem 3.1 and the surrounding text.

    Revision: yes
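The step described in this response can be written out in a few lines. The notation is assumed for illustration and is not taken from the paper: $\hat\rho_1, \hat\rho_2$ denote the two empirical CGCs, and $\sigma_1^2$, $\sigma_2^2$, $\sigma_{12}$ the entries of their joint asymptotic covariance matrix. The display only shows how the cross-covariance enters once a joint CLT is available.

```latex
\sqrt{n}
\begin{pmatrix} \hat\rho_1 - \rho_1 \\ \hat\rho_2 - \rho_2 \end{pmatrix}
\xrightarrow{d}
N\!\left( \mathbf{0},\
\begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix}
\right)
\quad\Longrightarrow\quad
\sqrt{n}\,\bigl\{ (\hat\rho_1 - \hat\rho_2) - (\rho_1 - \rho_2) \bigr\}
\xrightarrow{d}
N\!\left( 0,\ \sigma_1^2 + \sigma_2^2 - 2\sigma_{12} \right).
```

The map from the pair to the difference is linear with gradient $(1, -1)$, so the limiting variance is $\sigma_1^2 + \sigma_2^2 - 2\sigma_{12}$. Dropping $\sigma_{12}$ is justified only when the two estimators are asymptotically uncorrelated, which is the case the referee asks the authors not to assume.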

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

The paper defines its test statistic from the categorical Gini correlation measure introduced in the independent prior work of Dang et al. (2020) and then derives the joint asymptotic normality of the difference under both null and alternative hypotheses, including the case of dependent predictor groups with unequal dimensions. This joint limiting distribution is presented as a new technical result rather than a direct renaming or algebraic reduction of the single-CGC asymptotics. The nonparametric bootstrap is offered as a separate computational alternative, not as the justification for the limiting distribution. No self-definitional loops, fitted-input predictions, or load-bearing self-citations appear in the derivation chain; the central claims therefore retain independent mathematical content.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; therefore no specific free parameters, axioms, or invented entities can be extracted or audited from the provided text.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 1 internal anchor

  1. [1]

    Anguita, D., Ghio, A., Oneto, L., Parra, X., and Reyes-Ortiz, J. L. (2012). Human Activity Recognition Using Smartphones [Dataset]. UCI Machine Learning Repository. DOI: https://doi.org/10.24432/C54S4K

  2. [2]

    Anguita, D., Ghio, A., Oneto, L., Parra, X., and Reyes-Ortiz, J. L. (2013). A Public Domain Dataset for Human Activity Recognition Using Smartphones. Proceedings of the 21st European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN). Bruges, Belgium, 437–442. Available at: https://i6doc.com/en/book/?GCOI=28001...

  3. [3]

    Aich, A., Murshed, M. M., Hewage, S., and Mayeaux, A. (2026). A copula based supervised filter for feature selection in machine learning driven diabetes risk prediction. Scientific Reports, 16, 12132. DOI: https://doi.org/10.1038/s41598-026-41874-9

  4. [4]

    Aich, A., Murshed, M. M., Hewage, S., and Mayeaux, A. (2025). CopulaSMOTE: A Copula-Based Oversampling Approach for Imbalanced Classification in Diabetes Prediction. arXiv preprint arXiv:2506.17326. DOI: https://doi.org/10.48550/arXiv.2506.17326

  5. [5]

    Aich, A., Hewage, S., and Murshed, M. M. (2025). Copula Based Fusion of Clinical and Genomic Machine Learning Risk Scores for Breast Cancer Risk Stratification. arXiv preprint arXiv:2511.17605. DOI: https://doi.org/10.48550/arXiv.2511.17605

  6. [6]

    Aich, A., Murshed, M. M., Hewage, S., and Aich, A. B. (2026). Bayesian Inference for Joint Tail Risk in Paired Biomarkers via Archimedean Copulas with Restricted Jeffreys Priors. arXiv preprint arXiv:2602.15319. DOI: https://doi.org/10.48550/arXiv.2602.15319

  7. [7]

    Belhechmi, S., De Bin, R., Rotolo, F. and Michiels, S. (2020). Accounting for grouped predictor variables or pathways in high dimensional penalized Cox regression models. BMC Bioinformatics, 21(1), 277. DOI: https://doi.org/10.1186/s12859-020-03618-y

  8. [8]

    Buch, G., Schulz, A., Schmidtmann, I., Strauch, K. and Wild, P. S. (2021). A systematic review and evaluation of statistical methods for group variable selection. Stat. Med., 42, 331-352. DOI: https://doi.org/10.1002/sim.9620

  9. [9]

    Cheng, G., Li, X., Lai, P., Song, F. and Yu, J. (2017). Robust rank screening for ultrahigh dimensional discriminant analysis. Stat. Comput., 27(2), 535-545. DOI: https://doi.org/10.1007/s11222-016-9637-2

  10. [10]

    Cui, H., Li, R. and Zhong, W. (2015). Model-free feature screening for ultrahigh dimensional discriminant analysis. J. Amer. Statist. Assoc., 110, 630-641. DOI: https://doi.org/10.1080/01621459.2014.920256

  11. [11]

    Dang, X., Nguyen, D., Chen, X. and Zhang, J. (2021). A new Gini correlation between quantitative and qualitative variables. Scand. J. Stat., 48(4), 1314-1343. DOI: https://doi.org/10.1111/sjos.12490

  12. [12]

    Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space (with discussion). Journal of the Royal Statistical Society, Series B, 70, 849-911. DOI: https://doi.org/10.1111/j.1467-9868.2008.00674.x

  13. [13]

    Gini, C. (1914). On the measurement of concentration and variability of characters. Metron, LXIII(1), 3-38.

  14. [14]

    Goldman, M.J., Craft, B., Hastie, M., Repečka, K., McDade, F., Kamath, A., Banerjee, A., Luo, Y., Rogers, D., Brooks, A.N. and Zhu, J. (2020). Visualizing and interpreting cancer genomics data via the Xena platform. Nat. Biotechnol., 38(6), 675-678. DOI: https://doi.org/10.1038/s41587-020-0546-8

  15. [15]

    Hewage, S. and Sang, Y. (2024). Jackknife empirical likelihood confidence intervals for the categorical Gini correlation. J. Stat. Plan. Inference, 231, 106123. DOI: https://doi.org/10.1016/j.jspi.2023.106123

  16. [16]

    Hewage, S. (2025). A Nonparametric K-sample Test for Variability Based on Gini’s Mean Difference. J. Stat. Theory Appl., 24(2), 334–353. DOI: https://doi.org/10.1007/s44199-025-00112-3

  17. [17]

    Hewage, S. (2025). gcor: A Python Implementation of Categorical Gini Correlation and Its Inference. arXiv preprint arXiv:2506.19230. DOI: https://doi.org/10.48550/arXiv.2506.19230

  18. [18]

    Hewage, S. S. (2025). Statistical Inference for Categorical Gini Correlation and Gini’s Mean Difference. University of Louisiana at Lafayette.

  19. [19]

    He, S., Ma, S. and Xu, W. (2019). A modified mean-variance feature-screening procedure for ultrahigh-dimensional discriminant analysis. Comput. Statist. Data Anal., 137, 155-. DOI: https://doi.org/10.1016/j.csda.2019.02.003

  21. [21]

    Hoeffding, W. (1948). A class of statistics with asymptotically normal distribution. Ann. Math. Statist., 19, 293-325. DOI: https://doi.org/10.1214/aoms/1177730196

  22. [22]

    Hotelling, H. (1940). The selection of variates for use in prediction with some comments on the general problem of nuisance parameters. Ann. Math. Statist., 11, 271-283. DOI: https://doi.org/10.1214/aoms/1177731867

  23. [23]

    Lai, P., Song, F., Chen, K. and Liu, Z. (2017). Model free feature screening with dependent variable in ultrahigh dimensional binary classification. Statist. Probab. Lett., 125, 141-148. DOI: https://doi.org/10.1016/j.spl.2017.02.011

  24. [24]

    Mai, Q. and Zou, H. (2013). The Kolmogorov Filter for Variable Screening in High-Dimensional Binary Classification. Biometrika, 100, 229-234. DOI: https://doi.org/10.1093/biomet/ass062

  25. [25]

    Meier, L., Van De Geer, S. and Bühlmann, P. (2008). The group Lasso for logistic regression. J. R. Stat. Soc. Ser. B. Stat. Methodol., 70, 53-71. DOI: https://doi.org/10.1111/j.1467-9868.2007.00627.x

  26. [26]

    Meng, X.L., Rosenthal, R. and Rubin, D.B. (1992). Comparing correlated correlation coefficients. Psych. Bull., 111, 172-175. DOI: https://doi.org/10.1037/0033-2909.111.1.172

  27. [27]

    Mercer, J. (1909). Functions of positive and negative type, and their connection with the theory of integral equations. Philos. Trans. Roy. Soc. A, 209, 415-446. DOI: https://doi.org/10.1098/rsta.1909.0016

  28. [28]

    Neil, J.J. and Dunn, O.J. (1975). Equality of dependent correlation coefficients. Biometrics, 31, 531-543. DOI: https://doi.org/10.2307/2529435

  29. [29]

    Ni, L. and Fang, F. (2016). Entropy-based model-free feature screening for ultrahigh dimensional multiclass classification. J. Nonparametr. Stat., 28(3), 515-530. DOI: https://doi.org/10.1080/10485252.2016.1167206

  30. [30]

    Niu, Y., Zhang, R., Liu, J. and Li, H. (2020). Group screening for ultra-high-dimensional feature under linear model. Stat. Theor. Relat. Field., 4(1), 43-54. DOI: https://doi.org/10.1080/24754269.2019.1633763

  31. [31]

    Olkin, I. (1967). Correlations revisited. In J.C. Stanley (Ed.), Improving experimental design and statistical analysis. Chicago, IL: Rand McNally, pp. 102-128.

  32. [32]

    Parker, J.S., Mullins, M., Cheang, M.C., Leung, S., Voduc, D., Vickery, T., Davies, S., Fauron, C., He, X., Hu, Z. and Quackenbush, J.F. (2009). Supervised risk predictor of breast cancer based on intrinsic subtypes. J. Clin. Oncol., 27(8), 1160-1167. DOI: https://doi.org/10.1200/JCO.2008.18.1370

  33. [33]

    Ramos-Carreño, C. and Torrecilla, J. L. (2023). dcor: Distance correlation and energy statistics in Python. SoftwareX, 22, 101326. DOI: https://doi.org/10.1016/j.softx.2023.101326

  34. [34]

    Sang, Y. and Dang, X. (2024). Grouped feature screening for ultrahigh-dimensional classification via Gini distance correlation. J. Multivar. Anal. DOI: https://doi.org/10.1016/j.jmva.2024.105360

  35. [35]

    Sang, Y. and Dang, X. (2023). Asymptotic normality of Gini correlation in high dimension with applications to the K-sample problem. Electron. J. Stat., 17(2), 2539-2574. DOI: https://doi.org/10.1214/23-EJS2165

  36. [36]

    Serfling, R.J. (1980). Approximation theorems of mathematical statistics. John Wiley & Sons. DOI: https://doi.org/10.1002/9780470316481

  37. [37]

    Shao, J. and Tu, D. (1996). The Jackknife and Bootstrap. Springer, New York. DOI: https://doi.org/10.1007/978-1-4612-0795-5

  38. [38]

    Székely, G.J. and Rizzo, M.L. (2013a). Energy statistics: A class of statistics based on distances. J. Stat. Plan. Infer., 143, 1249-1272. DOI: https://doi.org/10.1016/j.jspi.2013.03.018

  39. [39]

    Székely, G.J. and Rizzo, M.L. (2017). The energy of data. Ann. Rev. Stat. Appl., 4(1), 447-479. DOI: https://doi.org/10.1146/annurev-statistics-060116-054026

  40. [40]

    Székely, G.J., Rizzo, M.L. and Bakirov, N. (2007). Measuring and testing dependence by correlation of distances. Ann. Statist., 35(6), 2769-2794. DOI: https://doi.org/10.1214/009053607000000505

  41. [41]

    Wang, Z., Deng, G., and Xu, H. (2023). Group feature screening based on Gini impurity for ultrahigh-dimensional multi-classification. AIMS Math., 8(2), 4342-4362. DOI: https://doi.org/10.3934/math.2023216

  42. [42]

    Williams, E.J. (1959a). Significance of difference between two non-independent correlation coefficients. Biometrics, 15, 135-136.

  43. [43]

    Wolberg, W., Mangasarian, O., Street, N., and Street, W. (1995). Breast Cancer Wisconsin (Diagnostic) [Dataset]. UCI Machine Learning Repository. DOI: https://doi.org/10.24432/C5DW2B

  44. [44]

    Yitzhaki, S. (2003). Gini’s mean difference: A superior measure of variability for non-normal distributions. Metron, 61(2), 285-316.

  45. [45]

    Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B. Stat. Methodol., 68(1), 49-67. DOI: https://doi.org/10.1111/j.1467-9868.2005.00532.x

  46. [46]

    Zhang, S., Dang, X., Nguyen, D., Wilkins, D. and Chen, Y. (2019). Estimating feature-label dependence using Gini distance statistics. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(6), 1947-1963. DOI: https://doi.org/10.1109/TPAMI.2019.2960358

  47. [47]

    Zou, G.Y. (2007). Toward using confidence intervals to compare correlations. Psychol. Methods, 12(4), 399. DOI: https://doi.org/10.1037/1082-989X.12.4.399