gcor: A Python Implementation of Categorical Gini Correlation and Its Inference

Sameera Hewage

arXiv: 2506.19230 · v4 · submitted 2025-06-24 · stat.ME · stat.CO

Pith reviewed 2026-05-19 08:37 UTC · model grok-4.3

keywords: Categorical Gini Correlation, Python implementation, dependence measure, confidence intervals, independence tests, feature screening, statistical software

The pith

A Python package implements Categorical Gini Correlation for dependence between numeric and categorical variables along with confidence intervals and independence tests.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper delivers a Python implementation called gcor for computing Categorical Gini Correlation, a measure of association between a numerical variable and a categorical variable that has the property that zero correlation implies independence. The library also supports constructing confidence intervals and performing independence tests. All procedures rely on efficient algorithms that use vectorization and parallelization to reduce computation time. A sympathetic reader would care because the tool makes this dependence measure practical for tasks such as feature screening in classification without requiring users to code the formulas themselves.

Core claim

The paper presents gcor, a Python package that computes the Categorical Gini Correlation introduced by Dang et al. and supplies optimized routines for confidence interval construction and independence testing, with all steps accelerated through vectorization and parallelization.

What carries the argument

The gcor Python package, which encodes the Categorical Gini Correlation formulas and inference procedures with vectorized and parallelized implementations.

If this is right

Users can apply Categorical Gini Correlation to feature screening for classification without writing custom code.
Independence tests become available for pairs consisting of one numeric and one categorical variable.
Larger data sets can be analyzed because the procedures run faster under vectorization and parallelization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The package may encourage direct empirical comparisons between Categorical Gini Correlation and other dependence measures on real data sets.
Integration into existing statistical workflows could make the measure routine for variable selection steps.
Similar implementations in other languages would follow naturally once a reference version exists.

Load-bearing premise

The provided Python code correctly implements the Categorical Gini Correlation formulas and inference procedures originally defined by Dang et al. without introducing computational errors or altering the statistical properties.

What would settle it

Run the package on a dataset where the numeric and categorical variables are known to be independent or dependent, then check whether the reported correlation values, interval coverage, and test decisions match the theoretical expectations.
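
One way to carry out such a check is sketched below. It does not call gcor, whose function signatures are not given on this page. Instead it uses a hypothetical helper, `categorical_gini_correlation`, written from the U-statistic form of the estimator for a univariate numeric variable: the overall mean pairwise distance minus the proportion-weighted within-group mean pairwise distances, divided by the overall mean. The simulated data, sample size, and seed are illustrative assumptions.

```python
import numpy as np


def mean_pairwise_distance(x):
    """Average of |x_i - x_j| over all pairs i < j (a U-statistic)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    if n < 2:
        # Assumption of this sketch: a group with one observation contributes 0.
        return 0.0
    # For sorted data, sum_{i<j} (x_j - x_i) = sum_i (2i - n + 1) * x_i, i = 0..n-1.
    weights = 2.0 * np.arange(n) - n + 1.0
    return 2.0 * np.dot(weights, x) / (n * (n - 1))


def categorical_gini_correlation(x, y):
    """Hypothetical reference estimator for univariate x and categorical y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    total = mean_pairwise_distance(x)
    if total == 0.0:
        raise ValueError("x is constant, so the correlation is undefined")
    within = 0.0
    for level in np.unique(y):
        group = x[y == level]
        within += group.size / x.size * mean_pairwise_distance(group)
    return (total - within) / total


rng = np.random.default_rng(0)
n = 2000
labels = rng.integers(0, 3, size=n)

# Independent case: x is drawn without reference to the labels.
rho_indep = categorical_gini_correlation(rng.normal(size=n), labels)

# Dependent case: the group mean shifts with the label.
rho_dep = categorical_gini_correlation(3.0 * labels + rng.normal(size=n), labels)

print(f"independent: {rho_indep:.4f}, dependent: {rho_dep:.4f}")
```

With independent data the estimate should sit near zero, and with a strong group shift it should be clearly positive. The same two data sets could then be passed to the package to see whether its output agrees with this reference.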

Figures

Figures reproduced from arXiv: 2506.19230 by Sameera Hewage.

**Figure 1.** Performance comparison between Python (gcor function) and R (GiniDistance package) implementations of categorical Gini correlation.

2.7 Reproducibility. All code developed in this study for computing the CGC, constructing confidence intervals, and performing independence tests is publicly available at https://github.com/sameera-hewage/gcor.

Original abstract

Categorical Gini Correlation (CGC), introduced by Dang et al. (2020), is a novel dependence measure designed to quantify the association between a numerical variable and a categorical variable. It has appealing properties compared to existing dependence measures, such as zero correlation mutually implying independence between the variables. It has also shown superior performance over existing methods when applied to feature screening for classification. This article presents a Python implementation for computing CGC, constructing confidence intervals, and performing independence tests based on it. Efficient algorithms have been implemented for all procedures, and they have been optimized using vectorization and parallelization to enhance computational efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a Python implementation of an existing dependence measure with efficiency claims but no validation or benchmarks shown.

Hi colleague, the main thing here is that this paper supplies a Python package called gcor for Categorical Gini Correlation, including routines for confidence intervals and independence tests, rather than any new statistical theory. It builds directly on the 2020 Dang et al. work that defined the measure and noted its advantages for feature screening and the zero-implies-independence property. The author added vectorization and parallelization to the computations, which is a practical engineering choice if the goal is to make the tool faster and easier to use in Python workflows. That part could help applied users who prefer this language over whatever the original code was written in.

The citation pattern is clean and properly credits the source paper without any circular reasoning or invented steps. On the soft side, the write-up asserts that the algorithms are efficient and optimized but provides no numerical verification, no reproduction of example values from the 2020 paper, no timing comparisons, and no accuracy checks after the parallel changes. Without those, the correctness and speedup claims rest on trust in the code rather than evidence in the manuscript.

This is aimed at practitioners doing dependence analysis or classification in Python who want a ready implementation instead of coding it themselves. Theoretical readers or those comparing dependence measures will not find new ground. I would not bring it to a methods-focused reading group and would not cite it unless I actually start using the package in a project. For peer review at a core statistics journal I would recommend against sending it out, since the advance is too narrow to justify referee time.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a Python package 'gcor' implementing Categorical Gini Correlation (CGC) for quantifying dependence between a numerical variable and a categorical variable, along with procedures for constructing confidence intervals and performing independence tests. It claims that efficient algorithms for these procedures have been implemented and optimized via vectorization and parallelization.

Significance. If the implementation is shown to be correct and the efficiency claims are substantiated, the package would offer a practical, accessible tool for applying CGC in statistical workflows such as feature screening for classification, extending the method introduced by Dang et al. (2020) to the Python ecosystem with performance improvements.

major comments (2)

[Abstract] Abstract: The assertion that 'efficient algorithms have been implemented for all procedures, and they have been optimized using vectorization and parallelization to enhance computational efficiency' is unsupported by any benchmark timings, scalability tests, error analysis, or verification against the original formulas.
[Implementation] Implementation section: No numerical validation, reproduction of example values from Dang et al. (2020), or comparison to a reference implementation is described to confirm that the vectorized and parallelized code faithfully reproduces the CGC dependence measure, confidence intervals, and independence tests without altering statistical properties.

minor comments (2)

The manuscript would benefit from explicit statements of the Python version, required dependencies, and installation instructions to improve reproducibility for users.
Consider including a small worked example with output values in the main text to illustrate usage of the core functions for CGC computation and testing.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript describing the gcor Python package. We address each major comment below and agree that additional empirical support is needed to substantiate the efficiency and correctness claims.

Referee: [Abstract] Abstract: The assertion that 'efficient algorithms have been implemented for all procedures, and they have been optimized using vectorization and parallelization to enhance computational efficiency' is unsupported by any benchmark timings, scalability tests, error analysis, or verification against the original formulas.

Authors: We agree that the abstract claim regarding efficiency optimizations is not supported by evidence in the current version. In the revision we will add benchmark timings (comparing vectorized/parallelized code to naive loops), scalability tests across sample sizes, and direct verification against the original CGC formulas from Dang et al. (2020). These results will be placed in a new 'Numerical Performance' subsection. revision: yes

Referee: [Implementation] Implementation section: No numerical validation, reproduction of example values from Dang et al. (2020), or comparison to a reference implementation is described to confirm that the vectorized and parallelized code faithfully reproduces the CGC dependence measure, confidence intervals, and independence tests without altering statistical properties.

Authors: The referee is correct that the manuscript lacks explicit numerical validation. We will revise the Implementation section to include reproduction of the numerical examples from Dang et al. (2020), side-by-side comparison with a reference (non-vectorized) implementation, and checks that confidence intervals and test p-values remain statistically equivalent. This will confirm that vectorization and parallelization preserve the original statistical properties. revision: yes
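
The kind of side-by-side check promised in these responses can be sketched as follows. This is an illustration only, not code from gcor: both function names are hypothetical, and the quantity compared is just the mean pairwise absolute difference, the building block of the CGC estimator. The sketch computes it once with a naive double loop and once with NumPy broadcasting, confirms the two agree numerically, and times each.

```python
import time

import numpy as np


def mean_abs_diff_loop(x):
    """Naive reference: average |x_i - x_j| over pairs i < j with two loops."""
    n = len(x)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += abs(x[i] - x[j])
    return 2.0 * total / (n * (n - 1))


def mean_abs_diff_vectorized(x):
    """Same quantity via broadcasting (uses O(n^2) memory)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    diffs = np.abs(x[:, None] - x[None, :])  # n x n matrix of |x_i - x_j|
    # The diagonal is zero and each pair appears twice, so divide by n(n-1).
    return diffs.sum() / (n * (n - 1))


rng = np.random.default_rng(1)
sample = rng.normal(size=500)

start = time.perf_counter()
slow = mean_abs_diff_loop(sample)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
fast = mean_abs_diff_vectorized(sample)
vectorized_seconds = time.perf_counter() - start

print(f"loop: {slow:.6f} in {loop_seconds:.4f}s")
print(f"vectorized: {fast:.6f} in {vectorized_seconds:.4f}s")
```

Agreement to floating-point tolerance is the correctness check; the timings depend on the machine and sample size, so no particular speedup is claimed here.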

Circularity Check

0 steps flagged

No circularity: direct implementation of prior method with no internal derivations

The paper is a software implementation contribution that references the CGC formulas and inference procedures from Dang et al. (2020) without introducing any new derivations, predictions, fitted parameters, or self-referential steps. No equations or claims reduce to the paper's own inputs by construction, and there is no load-bearing self-citation chain. The work is self-contained as an efficient Python package for an externally defined statistical measure; absence of numerical validation against the reference is a correctness concern rather than circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a software implementation paper. It introduces no new free parameters, axioms, or invented entities; all statistical content is drawn from the referenced 2020 work by Dang et al.

pith-pipeline@v0.9.0 · 5620 in / 1103 out tokens · 35104 ms · 2026-05-19T08:37:29.687002+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Categorical Gini correlation can then be estimated unbiasedly as a function of U-statistics: $\hat{\rho}_g(X, Y) = (\tilde{U} - \sum_k \hat{p}_k \tilde{U}_k)/\tilde{U}$, where $\tilde{U}_k$ and $\tilde{U}$ are averages of pairwise distances.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat induction and recovery
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: The independence test … employs a permutation procedure to estimate both the critical value and the p-value.
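
A permutation procedure of the kind this passage describes can be sketched as follows. This is a generic illustration under stated assumptions, not gcor's code: `gini_mean_difference`, `cgc`, and `permutation_test` are hypothetical names, the numeric variable is univariate, and the p-value uses the common (1 + count) / (permutations + 1) convention, which may differ from the package's choice.

```python
import numpy as np


def gini_mean_difference(x):
    """Average |x_i - x_j| over pairs i < j, computed from sorted values."""
    x = np.sort(x)
    n = x.size
    if n < 2:
        return 0.0  # assumption: singleton groups contribute nothing
    return 2.0 * np.dot(2.0 * np.arange(n) - n + 1.0, x) / (n * (n - 1))


def cgc(x, y):
    """Hypothetical CGC estimate: (overall - weighted within-group) / overall."""
    total = gini_mean_difference(x)
    within = sum((y == k).mean() * gini_mean_difference(x[y == k])
                 for k in np.unique(y))
    return (total - within) / total


def permutation_test(x, y, statistic, n_perm=199, alpha=0.05, seed=0):
    """Shuffle the labels to approximate the null distribution of `statistic`."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    observed = statistic(x, y)
    # Permuting y breaks any dependence on x while keeping both marginals.
    perm_stats = np.array([statistic(x, rng.permutation(y))
                           for _ in range(n_perm)])
    critical_value = np.quantile(perm_stats, 1.0 - alpha)
    p_value = (1 + np.sum(perm_stats >= observed)) / (n_perm + 1)
    return observed, critical_value, p_value


rng = np.random.default_rng(42)
labels = rng.integers(0, 2, size=100)
values = 2.0 * labels + rng.normal(size=100)  # group means differ by 2

observed, critical_value, p_value = permutation_test(values, labels, cgc)
print(f"statistic={observed:.3f}, critical={critical_value:.3f}, p={p_value:.3f}")
```

Independence is rejected at level `alpha` when the observed statistic exceeds the permutation critical value or, equivalently in most cases, when the p-value falls below `alpha`. Because each permutation is independent of the others, this loop is also a natural place for the parallelization the paper mentions.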

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Comparing Two Categorical Gini Correlations with Applications to Classification Problems
stat.ME 2026-05 unverdicted novelty 6.0

Proposes an inferential framework to test differences in categorical Gini correlations for predictor importance in classification, establishing asymptotic normality and consistency while accommodating unequal dimensio...

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · cited by 1 Pith paper

[1] Dang, X., Nguyen, D., Chen, X., & Zhang, J. (2021). A new Gini correlation between quantitative and qualitative variables. Scand. J. Stat., 48(4), 1314-1343. DOI: https://doi.org/10.1111/sjos.12490

[2] Nguyen, D., & Dang, X. (2025). GiniDistance: A new Gini correlation between quantitative and qualitative variables (Version 0.1.1) [R package]. https://CRAN.R-project.org/package=GiniDistance

[3] Jiménez-Gamero, M. D., & Sillero-Denamiel, M. R. (2025). The k-sample problem using Gini covariance for large k. J. Multivar. Anal., Article 105463. DOI: https://doi.org/10.1016/j.jmva.2025.105463

[4] Hewage, S., & Sang, Y. (2024). Jackknife empirical likelihood confidence intervals for the categorical Gini correlation. J. Stat. Plan. Inference, 231, 106123. DOI: https://doi.org/10.1016/j.jspi.2023.106123

[5] Hewage, S. (2025). A nonparametric K-sample test for variability based on Gini’s mean difference. J. Stat. Theory Appl., 1-20. DOI: https://doi.org/10.1007/s44199-025-00112-3

[6] Hewage, S. S. (2025). Statistical Inference for Categorical Gini Correlation and Gini’s Mean Difference (Doctoral dissertation, University of Louisiana at Lafayette).

[7] Ramos-Carreño, C., & Torrecilla, J. L. (2023). dcor: Distance correlation and energy statistics in Python. SoftwareX, 22, 101326. DOI: https://doi.org/10.1016/j.softx.2023.101326

[8] Sang, Y., & Dang, X. (2023). Asymptotic normality of Gini correlation in high dimension with applications to the K-sample problem. Electron. J. Stat., 17(2), 2539-2574. DOI: https://doi.org/10.1214/23-EJS2165

[9] Sang, Y., & Dang, X. (2024). Grouped feature screening for ultrahigh-dimensional classification via Gini distance correlation. J. Multivar. Anal. DOI: https://doi.org/10.1016/j.jmva.2024.105360

[10] Shang, D., Li, A., & Shang, P. (2023). An improved nonlinear correlation method for feature selection of complex data. Nonlinear Dyn., 111(12), 11357-11369. DOI: https://doi.org/10.1007/s11071-023-08406-w

[11] Shao, J., & Tu, D. (1996). The Jackknife and Bootstrap. Springer. DOI: https://doi.org/10.1007/978-1-4612-0795-5

[12] Liu, Y., & Shang, P. (2025). Measuring Feature-Label Dependence Using Projection Correlation Statistic. arXiv preprint arXiv:2504.19180. DOI: https://doi.org/10.48550/arXiv.2504.19180

[13] Suresh, S., & Kattumannil, S. K. (2024). JEL ratio test for independence between a continuous and a categorical random variable. arXiv preprint arXiv:2402.18105. DOI: https://doi.org/10.48550/arXiv.2402.18105

[14] Székely, G. J., Rizzo, M. L., & Bakirov, N. (2007). Measuring and testing dependence by correlation of distances. Ann. Statist., 35(6), 2769-2794. DOI: https://doi.org/10.1214/009053607000000505

[15] Székely, G. J., & Rizzo, M. L. (2009). Brownian distance covariance. Ann. Appl. Stat., 3(4), 1233-1303. DOI: https://doi.org/10.1214/09-AOAS312

[16] Székely, G. J., & Rizzo, M. L. (2013a). Energy statistics: A class of statistics based on distances. J. Stat. Plan. Infer., 143, 1249-1272. DOI: https://doi.org/10.1016/j.jspi.2013.03.018

[17] Wang, B., Shang, P., & Zhang, B. (2025). Generalized Gini dependence measures for complex data and their applications in K-sample problem and feature screening. Nonlinear Dyn., 113(9), 9709-9733. DOI: https://doi.org/10.1007/s11071-024-10620-z

[18] Yitzhaki, S., & Schechtman, E. (2013). The Gini Methodology. Springer. DOI: https://doi.org/10.1007/978-1-4614-4720-7_2

[19] Zhang, S., Dang, X., Nguyen, D., Wilkins, D., & Chen, Y. (2019). Estimating feature-label dependence using Gini distance statistics. IEEE Trans. Pattern Anal. Mach. Intell., 43(6), 1947-1963. DOI: https://doi.org/10.1109/TPAMI.2019.2960358
