Comparing Two Categorical Gini Correlations with Applications to Classification Problems

Sameera Hewage; Yongli Sang

arxiv: 2605.17763 · v1 · pith:MHYELPMPnew · submitted 2026-05-18 · 📊 stat.ME · stat.ML

Pith reviewed 2026-05-20 01:43 UTC · model grok-4.3

classification 📊 stat.ME stat.ML

keywords: categorical Gini correlation · predictor importance · classification · asymptotic normality · bootstrap test · dependence measure · categorical response

The pith

A test for the difference between two categorical Gini correlations enables comparison of predictor importance for categorical classification outcomes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops an inferential method to compare how strongly different groups of numerical predictors relate to a categorical response variable. It builds on the categorical Gini correlation as a dependence measure and constructs a test statistic whose asymptotic distribution is normal under both the hypothesis of equal correlations and the hypothesis of differing correlations. The framework handles predictor groups that may have unequal numbers of variables and may be statistically dependent. A bootstrap procedure is also derived for inference, and the approach is illustrated with simulation studies plus applications to breast cancer and human activity recognition data.

Core claim

The central claim is that the difference of two categorical Gini correlations yields a test statistic that is asymptotically normal under both the null and alternative hypotheses, is consistent against alternatives, accommodates arbitrary and unequal predictor dimensions, and remains valid when the predictor groups are dependent.

What carries the argument

The difference of two categorical Gini correlations, together with the derived test statistic whose asymptotic normality is established under regularity conditions on the data.
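To make the quantity concrete, here is a minimal sketch of a sample categorical Gini correlation and of the difference between two predictor groups. It assumes the definition recalled from Dang et al., namely (Δ − Σ_k p_k Δ_k) / Δ, where Δ is the Gini mean difference of the predictors and Δ_k is the same quantity within class k. The function names, the plug-in form, and the toy data are illustrative assumptions, not the authors' code.

```python
import numpy as np


def gini_mean_difference(x):
    """Sample Gini mean difference: average Euclidean distance over distinct pairs."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    n = x.shape[0]
    if n < 2:
        return 0.0  # convention assumed here for classes with a single observation
    # O(n^2) pairwise distances; fine for a sketch, not for large n.
    dists = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    return dists.sum() / (n * (n - 1))


def categorical_gini_correlation(x, y):
    """Plug-in estimate of (Delta - sum_k p_k Delta_k) / Delta."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    overall = gini_mean_difference(x)
    within = sum((y == k).mean() * gini_mean_difference(x[y == k]) for k in np.unique(y))
    return (overall - within) / overall


# Toy data (hypothetical): group 1 has 3 informative columns, group 2 has 1 noise column.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)
x1 = rng.normal(size=(100, 3)) + 2.0 * y[:, None]
x2 = rng.normal(size=(100, 1))
diff = categorical_gini_correlation(x1, y) - categorical_gini_correlation(x2, y)
print(f"estimated difference: {diff:.3f}")
```

The two groups deliberately have different numbers of columns, since the measure is built from distances and so does not require equal dimensions. The standardization that turns this difference into the paper's test statistic is not reproduced here.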

If this is right

Predictor groups of arbitrary and unequal dimensions can be compared directly for their association with the categorical outcome.
The test remains valid when the two predictor groups exhibit dependence.
Inference can proceed either through the established asymptotic normal approximation or through a nonparametric bootstrap procedure.
The method supplies a practical tool for assessing relative predictor importance in classification problems with categorical responses.
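The bootstrap route can be illustrated with a generic percentile bootstrap for the difference. This is a sketch under stated assumptions: rows are resampled with replacement within each class, the same resampled rows feed both predictor groups so that their dependence is kept, and the null of equal correlations is rejected when the interval excludes zero. The paper's actual procedure may differ; all names and the toy data are hypothetical.

```python
import numpy as np


def _gmd(x):
    """Average Euclidean distance over distinct pairs of rows."""
    n = x.shape[0]
    if n < 2:
        return 0.0
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    return d.sum() / (n * (n - 1))


def cgc(x, y):
    """Plug-in categorical Gini correlation (definition assumed from Dang et al.)."""
    within = sum((y == k).mean() * _gmd(x[y == k]) for k in np.unique(y))
    total = _gmd(x)
    return (total - within) / total


def bootstrap_cgc_difference(x1, x2, y, n_boot=300, alpha=0.05, seed=0):
    """Percentile bootstrap interval for cgc(x1, y) - cgc(x2, y)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    n = len(y)
    x1 = np.asarray(x1, dtype=float).reshape(n, -1)
    x2 = np.asarray(x2, dtype=float).reshape(n, -1)
    observed = cgc(x1, y) - cgc(x2, y)
    class_rows = [np.flatnonzero(y == k) for k in np.unique(y)]
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        # Resample rows with replacement inside each class; the same rows are
        # used for both groups so their dependence is preserved.
        idx = np.concatenate([rng.choice(rows, size=rows.size) for rows in class_rows])
        diffs[b] = cgc(x1[idx], y[idx]) - cgc(x2[idx], y[idx])
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return observed, (lo, hi), not (lo <= 0.0 <= hi)


# Toy data (hypothetical): x2 shares noise with x1 but carries no class signal.
rng = np.random.default_rng(1)
y = np.repeat([0, 1], 30)
e = rng.normal(size=(60, 2))
x1 = e + 3.0 * y[:, None]
x2 = np.hstack([0.5 * e[:, :1] + rng.normal(size=(60, 1)), rng.normal(size=(60, 3))])
observed, ci, reject = bootstrap_cgc_difference(x1, x2, y)
print(observed, ci, reject)
```

Resampling whole rows, rather than each group separately, is the design choice that matters when the two predictor groups are dependent.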

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The framework could support sequential testing to rank more than two predictor groups in a single analysis.
Similar difference-of-dependence tests might be constructed for other categorical dependence measures beyond the Gini correlation.
In applied settings the test could guide feature-set selection before fitting a final classifier.

Load-bearing premise

The data distributions must satisfy regularity conditions that make the categorical Gini correlation a valid dependence measure and allow the central limit theorem to apply to the test statistic.

What would settle it

Large-sample simulations drawn from distributions that violate the regularity conditions should produce a test statistic whose empirical distribution deviates markedly from normality under the null hypothesis of equal correlations.

Figures

Figures reproduced from arXiv: 2605.17763 by Sameera Hewage, Yongli Sang.

**Figure 1.** Size and power of tests in Example 3.2(b). The dashed horizontal line is the nominal level 0.05.

Example 3.3 (Binary Logistic Regression Model). In this example, we consider a logistic regression model where the binary response is generated as $\log \frac{P(Z = 1 \mid V)}{P(Z = -1 \mid V)} = -3 + 2V_1 + 2V_2 + 2V_3 + 3\sin(V_4) + 4V_5^2$, where $V \sim N(0, \Sigma)$ with $\Sigma = (\rho_{ij})_{p \times p}$ having two scenarios, $\rho_{ij} = 0$ and $\rho_{ij} = 0.5^{|j-i|}$, $i \neq$ …

**Figure 2.** Top 15 Random Forest feature importances for the Wisconsin Breast Cancer dataset.

Original abstract

This article proposes an inferential framework for comparing predictor importance in classification problems with categorical response variables. The approach is based on the categorical Gini correlation (CGC) proposed by Dang et al. (2020), a measure of dependence between numerical predictors and categorical outcomes. Predictor importance is evaluated by testing differences in CGCs across competing predictor groups. The proposed methodology accommodates predictors of arbitrary and unequal dimensions and allows for dependence between predictor groups. Asymptotic normality of the test statistic is established under both the null and alternative hypotheses, and the resulting test is shown to be consistent. In addition to deriving the asymptotic distribution, a nonparametric bootstrap procedure is developed as an alternative approach to inference. Simulation studies, along with applications to breast cancer and human activity recognition datasets, demonstrate the effectiveness of the proposed framework.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The paper adds a test for differences between two categorical Gini correlations that handles unequal dimensions and dependence, with claimed joint asymptotics and a bootstrap, but the joint limiting distribution under the alternative needs close checking.

The core contribution is a test statistic for whether two groups of predictors differ in their categorical Gini correlation with a categorical response. It extends the single-measure results from Dang et al. (2020) by allowing unequal predictor dimensions and dependence between the groups, then supplies asymptotic normality under both null and alternative plus a nonparametric bootstrap. Simulations and two real-data examples (breast cancer classification and human activity recognition) are included to show practical behavior. That is the main new piece: turning the existing dependence measure into a comparative inferential tool for feature selection in classification settings.

The approach is straightforward and directly addresses a common task without requiring a full parametric model. The real-data applications give a sense of how the test behaves on messy, moderate-sized problems, which is useful. The bootstrap option is a sensible practical addition when sample sizes are not huge.

The main soft spot is the joint asymptotic claim. The abstract states normality holds under the alternative even with dependence between predictor groups, but extending marginal results to a joint central limit theorem requires controlling the cross-covariance terms carefully. If the derivations only treat the marginal cases and then invoke bootstrap without a full joint expansion, the theoretical guarantee under the alternative could rest on unstated regularity conditions. The reader's stress-test note flags exactly this point, and it is worth verifying in the proofs. The regularity conditions inherited from the 2020 paper are also taken as given; any gaps there would propagate.

This work is aimed at statisticians and machine-learning researchers who need a dependence-based way to rank or compare predictors when the response is categorical. Readers already familiar with Gini correlations or rank-based measures will see the most immediate value.

It is not a sweeping theoretical advance, but it fills a practical gap with enough supporting material to be worth refereeing. I would send it to peer review, with the main questions focused on the joint limiting distribution and the precise conditions under which the bootstrap is valid.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes a test for the difference between two categorical Gini correlations (CGCs) to compare predictor importance for a categorical response in classification settings. Extending Dang et al. (2020), it derives asymptotic normality of the difference estimator under both the null and alternative, proves consistency, accommodates unequal predictor dimensions, and allows dependence between the two predictor groups. A nonparametric bootstrap is also developed. The claims are supported by simulation studies and applications to breast cancer and human activity recognition data.

Significance. If the joint asymptotic results hold, the work supplies a practical inferential tool for assessing relative predictor strength when dimensions differ and groups may be dependent, which is common in classification. The bootstrap alternative and real-data examples add immediate usability. The extension to the difference of two CGCs under dependence fills a methodological gap left by the single-CGC theory in the cited prior work.

major comments (1)

[§3.2, Theorem 3.1] In the joint CLT argument that follows Theorem 3.1, the asymptotic normality claim under the alternative when the two predictor groups are dependent requires an explicit joint limiting distribution that incorporates the cross-covariance between the two empirical CGC estimators. The manuscript appears to state the marginal CLTs from Dang et al. (2020) and then invoke a delta-method step. Without a displayed expression for the off-diagonal covariance term, or a verification that the regularity conditions (e.g., finite moments of the joint kernel) remain sufficient under dependence, the limiting variance under the alternative is not fully secured.
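For orientation, the step at issue can be written out as a short derivation. The symbols below (σ₁², σ₂², σ₁₂ for the entries of the joint asymptotic covariance matrix) are notation assumed here for illustration and are not taken from the manuscript; the display only shows why the cross-covariance term cannot be dropped when the groups are dependent.

```latex
\sqrt{n}\left(
  \begin{pmatrix}\hat\rho_1 \\ \hat\rho_2\end{pmatrix}
  - \begin{pmatrix}\rho_1 \\ \rho_2\end{pmatrix}
\right)
\xrightarrow{d}
N\!\left(0,\ \begin{pmatrix}\sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2\end{pmatrix}\right)
\quad\Longrightarrow\quad
\sqrt{n}\left((\hat\rho_1 - \hat\rho_2) - (\rho_1 - \rho_2)\right)
\xrightarrow{d}
N\!\left(0,\ \sigma_1^2 + \sigma_2^2 - 2\sigma_{12}\right)
```

The implication applies the linear map g(a, b) = a − b, with gradient (1, −1), to the joint limit. If the groups were independent, σ₁₂ would vanish and the two marginal variances would suffice. Under dependence the limiting variance is not determined by the marginal CLTs alone.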

minor comments (2)

[§2.1] The notation for the two predictor vectors X and Y is introduced without an explicit statement that their dimensions p and q may be unequal; a short sentence clarifying this point would aid readability.
[§4] Simulation tables report empirical rejection rates but do not include standard errors across the Monte Carlo replications; adding these would strengthen the evidence that the bootstrap and asymptotic versions behave comparably.
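The Monte Carlo standard error asked for in the second comment is a one-line binomial calculation. The sketch below assumes independent replications; the rate 0.05 and the count of 1000 replications are hypothetical values chosen only for illustration.

```python
import math


def rejection_rate_se(rate, n_reps):
    """Binomial standard error of an empirical rejection rate over n_reps replications."""
    return math.sqrt(rate * (1.0 - rate) / n_reps)


# Hypothetical example: nominal level 0.05 estimated from 1000 replications.
se = rejection_rate_se(0.05, 1000)
print(f"Monte Carlo SE: {se:.4f}")
```

With these illustrative numbers the standard error is about 0.0069, so an empirical size anywhere from roughly 0.036 to 0.064 would lie within two standard errors of the nominal level.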

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and constructive feedback on our manuscript. The point raised about the joint limiting distribution under dependence is well taken, and we address it directly below.

Referee: [§3.2, Theorem 3.1] In the joint CLT argument that follows Theorem 3.1, the asymptotic normality claim under the alternative when the two predictor groups are dependent requires an explicit joint limiting distribution that incorporates the cross-covariance between the two empirical CGC estimators. The manuscript appears to state the marginal CLTs from Dang et al. (2020) and then invoke a delta-method step. Without a displayed expression for the off-diagonal covariance term, or a verification that the regularity conditions (e.g., finite moments of the joint kernel) remain sufficient under dependence, the limiting variance under the alternative is not fully secured.

Authors: We agree that the presentation in Section 3.2 would benefit from an explicit statement of the joint limiting distribution of the two empirical CGC estimators under dependence. In the revision we will add the full asymptotic covariance matrix, including the off-diagonal cross-covariance term between the two U-statistic estimators, and apply the delta method to the difference. We will also verify that the moment conditions on the joint kernel (finite second moments) remain sufficient under the dependence structure permitted by the paper. These clarifications will be inserted into the statement of Theorem 3.1 and the surrounding text.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

The paper defines its test statistic from the categorical Gini correlation measure introduced in the independent prior work of Dang et al. (2020) and then derives the joint asymptotic normality of the difference under both null and alternative hypotheses, including the case of dependent predictor groups with unequal dimensions. This joint limiting distribution is presented as a new technical result rather than a direct renaming or algebraic reduction of the single-CGC asymptotics. The nonparametric bootstrap is offered as a separate computational alternative, not as the justification for the limiting distribution. No self-definitional loops, fitted-input predictions, or load-bearing self-citations appear in the derivation chain; the central claims therefore retain independent mathematical content.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; therefore no specific free parameters, axioms, or invented entities can be extracted or audited from the provided text.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: Theorem 2.1 … √n(ρ̂1 − ρ̂2) → N(0, σ0²) … jackknife estimator ĉM … U-statistic kernel hkl

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: C1–C4 … non-degenerate U-statistic … bootstrap within each class

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 1 internal anchor

[1] Anguita, D., Ghio, A., Oneto, L., Parra, X., and Reyes-Ortiz, J. L. (2012). Human Activity Recognition Using Smartphones [Dataset]. UCI Machine Learning Repository. DOI: https://doi.org/10.24432/C54S4K
[2] Anguita, D., Ghio, A., Oneto, L., Parra, X., and Reyes-Ortiz, J. L. (2013). A Public Domain Dataset for Human Activity Recognition Using Smartphones. Proceedings of the 21st European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN). Bruges, Belgium, 437–442. Available at: https://i6doc.com/en/book/?GCOI=28001...
[3] Aich, A., Murshed, M. M., Hewage, S., and Mayeaux, A. (2026). A copula based supervised filter for feature selection in machine learning driven diabetes risk prediction. Scientific Reports, 16, 12132. DOI: https://doi.org/10.1038/s41598-026-41874-9
[4] Aich, A., Murshed, M. M., Hewage, S., and Mayeaux, A. (2025). CopulaSMOTE: A Copula-Based Oversampling Approach for Imbalanced Classification in Diabetes Prediction. arXiv preprint arXiv:2506.17326. DOI: https://doi.org/10.48550/arXiv.2506.17326
[5] Aich, A., Hewage, S., and Murshed, M. M. (2025). Copula Based Fusion of Clinical and Genomic Machine Learning Risk Scores for Breast Cancer Risk Stratification. arXiv preprint arXiv:2511.17605. DOI: https://doi.org/10.48550/arXiv.2511.17605
[6] Aich, A., Murshed, M. M., Hewage, S., and Aich, A. B. (2026). Bayesian Inference for Joint Tail Risk in Paired Biomarkers via Archimedean Copulas with Restricted Jeffreys Priors. arXiv preprint arXiv:2602.15319. DOI: https://doi.org/10.48550/arXiv.2602.15319
[7] Belhechmi, S., De Bin, R., Rotolo, F. and Michiels, S. (2020). Accounting for grouped predictor variables or pathways in high dimensional penalized Cox regression models. BMC Bioinformatics, 21(1):277. DOI: https://doi.org/10.1186/s12859-020-03618-y
[8] Buch, G., Schulz, A., Schmidtmann, I., Strauch, K. and Wild, P. S. (2021). A systematic review and evaluation of statistical methods for group variable selection. Stat. Med., 42, 331-352. DOI: https://doi.org/10.1002/sim.9620
[9] Cheng, G., Li, X., Lai, P., Song, F. and Yu, J. (2017). Robust rank screening for ultrahigh dimensional discriminant analysis. Stat. Comput., 27(2), 535-545. DOI: https://doi.org/10.1007/s11222-016-9637-2
[10] Cui, H., Li, R. and Zhong, W. (2015). Model-free feature screening for ultrahigh dimensional discriminant analysis. J. Amer. Statist. Assoc., 110, 630-641. DOI: https://doi.org/10.1080/01621459.2014.920256
[11] Dang, X., Nguyen, D., Chen, X. and Zhang, J. (2021). A new Gini correlation between quantitative and qualitative variables. Scand. J. Stat., 48(4), 1314-1343. DOI: https://doi.org/10.1111/sjos.12490
[12] Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space (with discussion). Journal of the Royal Statistical Society, Series B, 70, 849-911. DOI: https://doi.org/10.1111/j.1467-9868.2008.00674.x
[13] Gini, C. (1914). On the measurement of concentration and variability of characters. Metron, LXIII(1), 3-38.
[14] Goldman, M.J., Craft, B., Hastie, M., Repečka, K., McDade, F., Kamath, A., Banerjee, A., Luo, Y., Rogers, D., Brooks, A.N. and Zhu, J. (2020). Visualizing and interpreting cancer genomics data via the Xena platform. Nat. Biotechnol., 38(6), 675-678. DOI: https://doi.org/10.1038/s41587-020-0546-8
[15] Hewage, S. and Sang, Y. (2024). Jackknife empirical likelihood confidence intervals for the categorical Gini correlation. J. Stat. Plan. Inference, 231, 106123. DOI: https://doi.org/10.1016/j.jspi.2023.106123
[16] Hewage, S. (2025). A Nonparametric K-sample Test for Variability Based on Gini’s Mean Difference. J. Stat. Theory Appl., 24(2), 334–353. DOI: https://doi.org/10.1007/s44199-025-00112-3
[17] Hewage, S. (2025). gcor: A Python Implementation of Categorical Gini Correlation and Its Inference. arXiv preprint arXiv:2506.19230. DOI: https://doi.org/10.48550/arXiv.2506.19230
[18] Hewage, S. S. (2025). Statistical Inference for Categorical Gini Correlation and Gini’s Mean Difference. University of Louisiana at Lafayette.
[19] He, S., Ma, S. and Xu, W. (2019). A modified mean-variance feature-screening procedure for ultrahigh-dimensional discriminant analysis. Comput. Statist. Data Anal., 137, 155- DOI: https://doi.org/10.1016/j.csda.2019.02.003
[21] Hoeffding, W. (1948). A class of statistics with asymptotically normal distribution. Ann. Math. Statist., 19, 293-325. DOI: https://doi.org/10.1214/aoms/1177730196
[22] Hotelling, H. (1940). The selection of variates for use in prediction with some comments on the general problem of nuisance parameters. Ann. Math. Statist., 11, 271-283. DOI: https://doi.org/10.1214/aoms/1177731867
[23] Lai, P., Song, F., Chen, K. and Liu, Z. (2017). Model free feature screening with dependent variable in ultrahigh dimensional binary classification. Statist. Probab. Lett., 125, 141-148. DOI: https://doi.org/10.1016/j.spl.2017.02.011
[24] Mai, Q. and Zou, H. (2013). The Kolmogorov Filter for Variance Screening in High-Dimensional Binary Classification. Biometrika, 100, 229-234. DOI: https://doi.org/10.1093/biomet/ass062
[25] Meier, L., Van De Geer, S. and Bühlmann, P. (2008). The group Lasso for logistic regression. J. R. Stat. Soc. Ser. B. Stat. Methodol., 70, 53-71. DOI: https://doi.org/10.1111/j.1467-9868.2007.00627.x
[26] Meng, X.L., Rosenthal, R. and Rubin, D.B. (1992). Comparing correlated correlation coefficients. Psych. Bull., 111, 172-175. DOI: https://doi.org/10.1037/0033-2909.111.1.172
[27] Mercer, J. (1909). Functions of positive and negative type, and their connection with the theory of integral equations. Philos. Trans. Roy. Soc. A, 209, 415-446. DOI: https://doi.org/10.1098/rsta.1909.0016
[28] Neil, J.J. and Dunn, O.J. (1975). Equality of dependent correlation coefficients. Biometrics, 31, 531-543. DOI: https://doi.org/10.2307/2529435
[29] Ni, L. and Fang, F. (2016). Entropy-based model-free feature screening for ultrahigh dimensional multiclass classification. J. Nonparametr. Stat., 28(3), 515-530. DOI: https://doi.org/10.1080/10485252.2016.1167206
[30] Niu, Y., Zhang, R., Liu, J. and Li, H. (2020). Group screening for ultra-high-dimensional feature under linear model. Stat. Theor. Relat. Field., 4(1), 43-54. DOI: https://doi.org/10.1080/24754269.2019.1633763
[31] Olkin, I. (1967). Correlations revisited. In J.C. Stanley (Ed.), Improving experimental design and statistical analysis. Chicago, IL: Rand McNally, pp. 102-128.
[32] Parker, J.S., Mullins, M., Cheang, M.C., Leung, S., Voduc, D., Vickery, T., Davies, S., Fauron, C., He, X., Hu, Z. and Quackenbush, J.F. (2009). Supervised risk predictor of breast cancer based on intrinsic subtypes. J. Clin. Oncol., 27(8), 1160-1167. DOI: https://doi.org/10.1200/JCO.2008.18.1370
[33] Ramos-Carreño, C. and Torrecilla, J. L. (2023). dcor: Distance correlation and energy statistics in Python. SoftwareX, 22, 101326. DOI: https://doi.org/10.1016/j.softx.2023.101326
[34] Sang, Y. and Dang, X. (2024). Grouped feature screening for ultrahigh-dimensional classification via Gini distance correlation. J. Multivar. Anal. DOI: https://doi.org/10.1016/j.jmva.2024.105360
[35] Sang, Y. and Dang, X. (2023). Asymptotic normality of Gini correlation in high dimension with applications to the K-sample problem. Electron. J. Stat., 17(2), 2539-2574. DOI: https://doi.org/10.1214/23-EJS2165
[36] Serfling, R.J. (1980). Approximation theorems of mathematical statistics. John Wiley & Sons. DOI: https://doi.org/10.1002/9780470316481
[37] Shao, J. and Tu, D. (1996). The Jackknife and Bootstrap. Springer, New York. DOI: https://doi.org/10.1007/978-1-4612-0795-5
[38] Székely, G.J. and Rizzo, M.L. (2013a). Energy statistics: A class of statistics based on distances. J. Stat. Plan. Infer., 143, 1249-1272. DOI: https://doi.org/10.1016/j.jspi.2013.03.018
[39] Székely, G.J. and Rizzo, M.L. (2017). The energy of data. Ann. Rev. Stat. Appl., 4(1), 447-479. DOI: https://doi.org/10.1146/annurev-statistics-060116-054026
[40] Székely, G.J., Rizzo, M.L. and Bakirov, N. (2007). Measuring and testing dependence by correlation of distances. Ann. Statist., 35(6), 2769-2794. DOI: https://doi.org/10.1214/009053607000000505
[41] Wang, Z., Deng, G., and Xu, H. (2023). Group feature screening based on Gini impurity for ultrahigh-dimensional multi-classification. AIMS Math., 8(2), 4342-4362. DOI: https://doi.org/10.3934/math.2023216
[42] Williams, E.J. (1959a). Significance of difference between two non-independent correlation coefficients. Biometrics, 15, 135-136.
[43] Wolberg, W., Mangasarian, O., Street, N., and Street, W. (1995). Breast Cancer Wisconsin (Diagnostic) [Dataset]. UCI Machine Learning Repository. DOI: https://doi.org/10.24432/C5DW2B
[44] Yitzhaki, S. (2003). Gini’s mean difference: A superior measure of variability for non-normal distributions. Metron, 61(2), 285-316.
[45] Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B. Stat. Methodol., 68(1), 49-67. DOI: https://doi.org/10.1111/j.1467-9868.2005.00532.x
[46] Zhang, S., Dang, X., Nguyen, D., Wilkins, D. and Chen, Y. (2019). Estimating feature-label dependence using Gini distance statistics. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(6), 1947-1963. DOI: https://doi.org/10.1109/TPAMI.2019.2960358
[47] Zou, G.Y. (2007). Toward using confidence intervals to compare correlations. Psychol. Methods, 12(4), 399. DOI: https://doi.org/10.1037/1082-989X.12.4.399

[1] [1]

Anguita, D., Ghio, A., Oneto, L., Parra, X., and Reyes-Ortiz, J. L. (2012). Human Activity Recognition Using Smartphones [Dataset].UCI Machine Learning Repository. DOI:https: //doi.org/10.24432/C54S4K

work page doi:10.24432/c54s4k 2012

[2] [2]

Anguita, D., Ghio, A., Oneto, L., Parra, X., and Reyes-Ortiz, J. L. (2013). A Public Domain Dataset for Human Activity Recognition Using Smartphones.Proceedings of the 21st European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN). Bruges, Belgium, 437–442. Available at:https://i6doc. com/en/book/?GCOI=28001...

work page 2013

[3] [3]

M., Hewage, S., and Mayeaux, A

Aich, A., Murshed, M. M., Hewage, S., and Mayeaux, A. (2026). A copula based supervised filter for feature selection in machine learning driven diabetes risk prediction.Scientific Reports,16, 12132. DOI:https://doi.org/10.1038/s41598-026-41874-9

work page doi:10.1038/s41598-026-41874-9 2026

[4] [4]

M., Hewage, S., and Mayeaux, A

Aich, A., Murshed, M. M., Hewage, S., and Mayeaux, A. (2025). CopulaSMOTE: A Copula- Based Oversampling Approach for Imbalanced Classification in Diabetes Prediction.arXiv preprint arXiv:2506.17326. DOI:https://doi.org/10.48550/arXiv.2506.17326 22

work page doi:10.48550/arxiv.2506.17326 2025

[5] [5]

Aich, A., Hewage, S., and Murshed, M. M. (2025). Copula Based Fusion of Clinical and Ge- nomic Machine Learning Risk Scores for Breast Cancer Risk Stratification.arXiv preprint arXiv:2511.17605. DOI:https://doi.org/10.48550/arXiv.2511.17605

work page doi:10.48550/arxiv.2511.17605 2025

[6] [6]

M., Hewage, S., and Aich, A

Aich, A., Murshed, M. M., Hewage, S., and Aich, A. B. (2026). Bayesian Inference for Joint Tail Risk in Paired Biomarkers via Archimedean Copulas with Restricted Jeffreys Priors. arXiv preprint arXiv:2602.15319. DOI:https://doi.org/10.48550/arXiv.2602.15319

work page doi:10.48550/arxiv.2602.15319 2026

[7] [7]

and Michiels, S

Belhechmi, S., De Bin, R., Rotolo, F. and Michiels, S. (2020). Accounting for grouped predictor variables or pathways in high dimensional penalized Cox regression models.BMC Bioinformatics, 21(1):277. DOI:https://doi.org/10.1186/s12859-020-03618-y

work page doi:10.1186/s12859-020-03618-y 2020

[8] [8]

and Wild, P

Buch, G., Schulz, A., Schmidtmann, I., Strauch, K. and Wild, P. S. (2021). A systematic review and evaluation of statistical methods for group variable selection,Stat. Med.,42, 331-352. DOI:https://doi.org/10.1002/sim.9620

work page doi:10.1002/sim.9620 2021

[9] [9]

and Yu, J

Cheng, G., Li, X., Lai, P., Song, F. and Yu, J. (2017). Robust rank screening for ultrahigh dimensional discriminant analysis.Stat. Comput.,27(2), 535-545. DOI:https://doi. org/10.1007/s11222-016-9637-2

work page doi:10.1007/s11222-016-9637-2 2017

[10] [10]

and Zhong, W

Cui, H., Li, R. and Zhong, W. (2015). Model-free feature screening for ultrahigh di- mensional discriminant analysis.J. Amer. Statist. Assoc.,110, 630-641. DOI:https: //doi.org/10.1080/01621459.2014.920256

work page doi:10.1080/01621459.2014.920256 2015

[11] [11]

and Zhang, J

Dang, X., Nguyen, D., Chen, X. and Zhang, J. (2021). A new Gini correlation between quantitative and qualitative variables.Scand. J. Stat.,48(4), 1314-1343. DOI:https: //doi.org/10.1111/sjos.12490

work page doi:10.1111/sjos.12490 2021

[12] [12]

and Lv, J

Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space (with discussion).Journal of the Royal Statistical Society, Series B,70, 849-911. DOI:https://doi.org/10.1111/j.1467-9868.2008.00674.x

work page doi:10.1111/j.1467-9868.2008.00674.x 2008

[13] [13]

Gini, C. (1914). On the measurement of concentration and variability of characters.Metron, LXIII(1), 3-38

work page 1914

[14] [14]

and Zhu, J

Goldman, M.J., Craft, B., Hastie, M., Repeˇ cka, K., McDade, F., Kamath, A., Banerjee, A., Luo, Y., Rogers, D., Brooks, A.N. and Zhu, J. (2020). Visualizing and interpreting cancer genomics data via the Xena platform.Nat. Biotechnol.,38(6), 675-678. DOI:https: //doi.org/10.1038/s41587-020-0546-8

work page doi:10.1038/s41587-020-0546-8 2020

[15] [15]

and Sang, Y

Hewage, S. and Sang, Y. (2024). Jackknife empirical likelihood confidence intervals for the categorical Gini correlation.J. Stat. Plan. Inference,231, 106123. DOI:https://doi. org/10.1016/j.jspi.2023.106123

work page doi:10.1016/j.jspi.2023.106123 2024

[16] [16]

Hewage, S. (2025). A Nonparametric K-sample Test for Variability Based on Gini’s Mean Difference.J. Stat. Theory Appl.,24(2), 334–353. DOI:https://doi.org/10.1007/ s44199-025-00112-3

work page 2025

[17] [17]

Hewage, S. (2025). gcor: A Python Implementation of Categorical Gini Correlation and Its Inference.arXiv preprint arXiv:2506.19230. DOI:https://doi.org/10.48550/arXiv. 2506.19230 23

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv 2025

[18] [18]

Hewage, S. S. (2025).Statistical Inference for Categorical Gini Correlation and Gini’s Mean Difference. University of Louisiana at Lafayette

work page 2025

[19] [19]

and Xu, W

He, S., Ma, S. and Xu, W. (2019). A modified mean-variance feature-screening procedure for ultrahigh-dimensional discriminant analysis.Comput. Statist. Data Anal.,137, 155-

work page 2019

[20] [20]

DOI:https://doi.org/10.1016/j.csda.2019.02.003

work page doi:10.1016/j.csda.2019.02.003 2019

[21] [21]

Hoeffding, W. (1948). A class of statistics with asymptotically normal distribution.Ann. Math. Statist.19, 293-325. DOI:https://doi.org/10.1214/aoms/1177730196


Hotelling, H. (1940). The selection of variates for use in prediction with some comments on the general problem of nuisance parameters. Ann. Math. Statist., 11, 271-283. DOI: https://doi.org/10.1214/aoms/1177731867


Lai, P., Song, F., Chen, K. and Liu, Z. (2017). Model free feature screening with dependent variable in ultrahigh dimensional binary classification. Statist. Probab. Lett., 125, 141-148. DOI: https://doi.org/10.1016/j.spl.2017.02.011


Mai, Q. and Zou, H. (2013). The Kolmogorov Filter for Variable Screening in High-Dimensional Binary Classification. Biometrika, 100, 229-234. DOI: https://doi.org/10.1093/biomet/ass062


Meier, L., Van De Geer, S. and Bühlmann, P. (2008). The group Lasso for logistic regression. J. R. Stat. Soc. Ser. B. Stat. Methodol., 70, 53-71. DOI: https://doi.org/10.1111/j.1467-9868.2007.00627.x


Meng, X.L., Rosenthal, R. and Rubin, D.B. (1992). Comparing correlated correlation coefficients. Psych. Bull., 111, 172-175. DOI: https://doi.org/10.1037/0033-2909.111.1.172


Mercer, J. (1909). Functions of positive and negative type, and their connection with the theory of integral equations. Philos. Trans. Roy. Soc. A, 209, 415-446. DOI: https://doi.org/10.1098/rsta.1909.0016


Neil, J.J. and Dunn, O.J. (1975). Equality of dependent correlation coefficients. Biometrics, 31, 531-543. DOI: https://doi.org/10.2307/2529435


Ni, L. and Fang, F. (2016). Entropy-based model-free feature screening for ultrahigh dimensional multiclass classification. J. Nonparametr. Stat., 28(3), 515-530. DOI: https://doi.org/10.1080/10485252.2016.1167206


Niu, Y., Zhang, R., Liu, J. and Li, H. (2020). Group screening for ultra-high-dimensional feature under linear model. Stat. Theor. Relat. Field., 4(1), 43-54. DOI: https://doi.org/10.1080/24754269.2019.1633763


Olkin, I. (1967). Correlations revisited. In J.C. Stanley (Ed.), Improving experimental design and statistical analysis. Chicago, IL: Rand McNally, pp. 102-128.


Parker, J.S., Mullins, M., Cheang, M.C., Leung, S., Voduc, D., Vickery, T., Davies, S., Fauron, C., He, X., Hu, Z. and Quackenbush, J.F. (2009). Supervised risk predictor of breast cancer based on intrinsic subtypes. J. Clin. Oncol., 27(8), 1160-1167. DOI: https://doi.org/10.1200/JCO.2008.18.1370


Ramos-Carreño, C. and Torrecilla, J. L. (2023). dcor: Distance correlation and energy statistics in Python. SoftwareX, 22, 101326. DOI: https://doi.org/10.1016/j.softx.2023.101326


Sang, Y. and Dang, X. (2024). Grouped feature screening for ultrahigh-dimensional classification via Gini distance correlation. J. Multivar. Anal. DOI: https://doi.org/10.1016/j.jmva.2024.105360


Sang, Y. and Dang, X. (2023). Asymptotic normality of Gini correlation in high dimension with applications to the K-sample problem. Electron. J. Stat., 17(2), 2539-2574. DOI: https://doi.org/10.1214/23-EJS2165


Serfling, R.J. (1980). Approximation theorems of mathematical statistics. John Wiley & Sons. DOI: https://doi.org/10.1002/9780470316481


Shao, J. and Tu, D. (1996). The Jackknife and Bootstrap. Springer, New York. DOI: https://doi.org/10.1007/978-1-4612-0795-5


Székely, G.J. and Rizzo, M.L. (2013a). Energy statistics: A class of statistics based on distances. J. Stat. Plan. Infer., 143, 1249-1272. DOI: https://doi.org/10.1016/j.jspi.2013.03.018


Székely, G.J. and Rizzo, M.L. (2017). The energy of data. Ann. Rev. Stat. Appl., 4(1), 447-479. DOI: https://doi.org/10.1146/annurev-statistics-060116-054026


Székely, G.J., Rizzo, M.L. and Bakirov, N. (2007). Measuring and testing dependence by correlation of distances. Ann. Statist., 35(6), 2769-2794. DOI: https://doi.org/10.1214/009053607000000505


Wang, Z., Deng, G. and Xu, H. (2023). Group feature screening based on Gini impurity for ultrahigh-dimensional multi-classification. AIMS Math., 8(2), 4342-4362. DOI: https://doi.org/10.3934/math.2023216


Williams, E.J. (1959a). Significance of difference between two non-independent correlation coefficients. Biometrics, 15, 135-136.


Wolberg, W., Mangasarian, O., Street, N. and Street, W. (1995). Breast Cancer Wisconsin (Diagnostic) [Dataset]. UCI Machine Learning Repository. DOI: https://doi.org/10.24432/C5DW2B


Yitzhaki, S. (2003). Gini’s mean difference: A superior measure of variability for non-normal distributions. Metron, 61(2), 285-316.


Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B. Stat. Methodol., 68(1), 49-67. DOI: https://doi.org/10.1111/j.1467-9868.2005.00532.x


Zhang, S., Dang, X., Nguyen, D., Wilkins, D. and Chen, Y. (2019). Estimating feature-label dependence using Gini distance statistics. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(6), 1947-1963. DOI: https://doi.org/10.1109/TPAMI.2019.2960358


Zou, G.Y. (2007). Toward using confidence intervals to compare correlations. Psychol. Methods, 12(4), 399. DOI: https://doi.org/10.1037/1082-989X.12.4.399
