
arxiv: 2506.19230 · v4 · submitted 2025-06-24 · 📊 stat.ME · stat.CO

gcor: A Python Implementation of Categorical Gini Correlation and Its Inference

Pith reviewed 2026-05-19 08:37 UTC · model grok-4.3

classification 📊 stat.ME stat.CO
keywords Categorical Gini Correlation · Python implementation · dependence measure · confidence intervals · independence tests · feature screening · statistical software

The pith

A Python package implements Categorical Gini Correlation for dependence between numeric and categorical variables along with confidence intervals and independence tests.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper delivers a Python implementation called gcor for computing Categorical Gini Correlation, a measure of association between a numerical variable and a categorical variable that has the property that zero correlation implies independence. The library also supports constructing confidence intervals and performing independence tests. All procedures rely on efficient algorithms that use vectorization and parallelization to reduce computation time. A sympathetic reader would care because the tool makes this dependence measure practical for tasks such as feature screening in classification without requiring users to code the formulas themselves.

Core claim

The paper presents gcor, a Python package that computes the Categorical Gini Correlation introduced by Dang et al. and supplies optimized routines for confidence interval construction and independence testing, with all steps accelerated through vectorization and parallelization.

What carries the argument

The gcor Python package, which encodes the Categorical Gini Correlation formulas and inference procedures with vectorized and parallelized implementations.
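To make the quantity concrete, the following is a minimal sketch of a plug-in estimate, assuming the representation of CGC as (Δ − Σ_k p_k Δ_k) / Δ, where Δ is the Gini mean difference of the numeric variable and Δ_k is the Gini mean difference within category k. The function names `gini_mean_difference` and `cgc_sketch` are hypothetical and are not the gcor API; the package's own estimator and interface may differ.

```python
import numpy as np


def gini_mean_difference(x):
    """Mean of |x_i - x_j| over distinct pairs, computed by sorting (O(n log n))."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    if n < 2:
        return 0.0  # convention for this sketch: a single point has no spread
    # For sorted data, sum_{i<j} (x_j - x_i) = sum_i (2i - n + 1) * x_i (0-based i).
    weights = 2 * np.arange(n) - n + 1
    return 2.0 * np.dot(weights, x) / (n * (n - 1))


def cgc_sketch(x, y):
    """Plug-in CGC: (overall GMD - weighted within-class GMD) / overall GMD."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    total = gini_mean_difference(x)
    if total == 0:
        raise ValueError("x is constant, so the ratio is undefined")
    labels, counts = np.unique(y, return_counts=True)
    within = sum(
        (count / x.size) * gini_mean_difference(x[y == label])
        for label, count in zip(labels, counts)
    )
    return (total - within) / total


# Illustrative data (not from the paper): two well-separated groups.
x = [1, 2, 3, 10, 11, 12]
y = ["a", "a", "a", "b", "b", "b"]
print(cgc_sketch(x, y))  # 69/89, roughly 0.775
```

The sorting identity avoids forming the full n-by-n matrix of pairwise differences, which is one plausible way to keep the computation cheap for large samples.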

If this is right

  • Users can apply Categorical Gini Correlation to feature screening for classification without writing custom code.
  • Independence tests become available for pairs consisting of one numeric and one categorical variable.
  • Larger data sets can be analyzed because the procedures run faster under vectorization and parallelization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The package may encourage direct empirical comparisons between Categorical Gini Correlation and other dependence measures on real data sets.
  • Integration into existing statistical workflows could make the measure routine for variable selection steps.
  • Similar implementations in other languages would follow naturally once a reference version exists.

Load-bearing premise

The provided Python code correctly implements the Categorical Gini Correlation formulas and inference procedures originally defined by Dang et al. without introducing computational errors or altering the statistical properties.

What would settle it

Run the package on a dataset where the numeric and categorical variables are known to be independent or dependent, then check whether the reported correlation values, interval coverage, and test decisions match the theoretical expectations.
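A check of this kind can be sketched with simulated data and a generic permutation test. This is an illustration under stated assumptions: the helper names (`pairwise_gmd`, `cgc`, `permutation_pvalue`) are hypothetical, the simulated design is made up for the example, and the permutation scheme is a standard stand-in rather than the test procedure the gcor package actually implements.

```python
import numpy as np


def pairwise_gmd(x):
    """Mean absolute difference over distinct pairs; 0.0 for fewer than two points."""
    x = np.asarray(x, dtype=float)
    n = x.size
    if n < 2:
        return 0.0
    return np.abs(x[:, None] - x[None, :]).sum() / (n * (n - 1))


def cgc(x, y):
    """Plug-in CGC, assuming the form (GMD - weighted within-class GMD) / GMD."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    total = pairwise_gmd(x)
    within = sum(np.mean(y == lab) * pairwise_gmd(x[y == lab]) for lab in np.unique(y))
    return (total - within) / total


def permutation_pvalue(x, y, n_perm=500, seed=0):
    """Share of label permutations whose CGC is at least the observed value."""
    rng = np.random.default_rng(seed)
    observed = cgc(x, y)
    hits = sum(cgc(x, rng.permutation(y)) >= observed for _ in range(n_perm))
    return (hits + 1) / (n_perm + 1)  # add-one correction keeps the p-value positive


rng = np.random.default_rng(1)
y = np.repeat([0, 1, 2], 40)                      # three balanced categories
x_dependent = rng.normal(loc=2.0 * y, scale=1.0)  # class means differ
x_independent = rng.normal(size=y.size)           # no relation to y

p_dependent = permutation_pvalue(x_dependent, y)
p_independent = permutation_pvalue(x_independent, y)
print(f"dependent case:   CGC={cgc(x_dependent, y):.3f}, p={p_dependent:.4f}")
print(f"independent case: CGC={cgc(x_independent, y):.3f}, p={p_independent:.4f}")
```

In the dependent case the p-value should be small; in the independent case it should look like a draw from a roughly uniform distribution, so a single run is not conclusive and repeated simulations would be needed to assess the test's size and any interval coverage.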

Figures

Figures reproduced from arXiv: 2506.19230 by Sameera Hewage.

Figure 1
Performance comparison between Python (gcor function) and R (GiniDistance package) implementations of categorical Gini correlation.

Reproducibility (Section 2.7 of the paper): All code developed in this study for computing the CGC, constructing confidence intervals, and performing independence tests is publicly available at https://github.com/sameera-hewage/gcor.
Original abstract

Categorical Gini Correlation (CGC), introduced by Dang et al. (2020), is a novel dependence measure designed to quantify the association between a numerical variable and a categorical variable. It has appealing properties compared to existing dependence measures, such as zero correlation mutually implying independence between the variables. It has also shown superior performance over existing methods when applied to feature screening for classification. This article presents a Python implementation for computing CGC, constructing confidence intervals, and performing independence tests based on it. Efficient algorithms have been implemented for all procedures, and they have been optimized using vectorization and parallelization to enhance computational efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a Python package 'gcor' implementing Categorical Gini Correlation (CGC) for quantifying dependence between a numerical variable and a categorical variable, along with procedures for constructing confidence intervals and performing independence tests. It claims that efficient algorithms for these procedures have been implemented and optimized via vectorization and parallelization.

Significance. If the implementation is shown to be correct and the efficiency claims are substantiated, the package would offer a practical, accessible tool for applying CGC in statistical workflows such as feature screening for classification, extending the method introduced by Dang et al. (2020) to the Python ecosystem with performance improvements.

major comments (2)
  1. [Abstract] Abstract: The assertion that 'efficient algorithms have been implemented for all procedures, and they have been optimized using vectorization and parallelization to enhance computational efficiency' is unsupported by any benchmark timings, scalability tests, error analysis, or verification against the original formulas.
  2. [Implementation] Implementation section: No numerical validation, reproduction of example values from Dang et al. (2020), or comparison to a reference implementation is described to confirm that the vectorized and parallelized code faithfully reproduces the CGC dependence measure, confidence intervals, and independence tests without altering statistical properties.
minor comments (2)
  1. The manuscript would benefit from explicit statements of the Python version, required dependencies, and installation instructions to improve reproducibility for users.
  2. Consider including a small worked example with output values in the main text to illustrate usage of the core functions for CGC computation and testing.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript describing the gcor Python package. We address each major comment below and agree that additional empirical support is needed to substantiate the efficiency and correctness claims.

  1. Referee: [Abstract] Abstract: The assertion that 'efficient algorithms have been implemented for all procedures, and they have been optimized using vectorization and parallelization to enhance computational efficiency' is unsupported by any benchmark timings, scalability tests, error analysis, or verification against the original formulas.

    Authors: We agree that the abstract claim regarding efficiency optimizations is not supported by evidence in the current version. In the revision we will add benchmark timings (comparing vectorized/parallelized code to naive loops), scalability tests across sample sizes, and direct verification against the original CGC formulas from Dang et al. (2020). These results will be placed in a new 'Numerical Performance' subsection. revision: yes

  2. Referee: [Implementation] Implementation section: No numerical validation, reproduction of example values from Dang et al. (2020), or comparison to a reference implementation is described to confirm that the vectorized and parallelized code faithfully reproduces the CGC dependence measure, confidence intervals, and independence tests without altering statistical properties.

    Authors: The referee is correct that the manuscript lacks explicit numerical validation. We will revise the Implementation section to include reproduction of the numerical examples from Dang et al. (2020), side-by-side comparison with a reference (non-vectorized) implementation, and checks that confidence intervals and test p-values remain statistically equivalent. This will confirm that vectorization and parallelization preserve the original statistical properties. revision: yes
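The kind of evidence promised in both responses can be illustrated with a small harness that compares a naive double loop against a vectorized version of the same quantity, here the Gini mean difference that underlies CGC. This is only a sketch: the function names are hypothetical, it is not the authors' benchmark, and the timings depend on the machine, so no expected numbers are given.

```python
import time

import numpy as np


def gmd_naive(x):
    """Reference implementation: explicit loop over all pairs i < j."""
    n = len(x)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += abs(x[i] - x[j])
    return 2.0 * total / (n * (n - 1))


def gmd_vectorized(x):
    """Same quantity via NumPy broadcasting over the n-by-n difference matrix."""
    x = np.asarray(x, dtype=float)
    n = x.size
    return np.abs(x[:, None] - x[None, :]).sum() / (n * (n - 1))


def time_call(fn, arg):
    """Return the function value and the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    value = fn(arg)
    return value, time.perf_counter() - start


rng = np.random.default_rng(0)
sample = rng.normal(size=400)

naive_value, naive_seconds = time_call(gmd_naive, sample)
vec_value, vec_seconds = time_call(gmd_vectorized, sample)

print(f"naive:      {naive_value:.10f} in {naive_seconds:.4f} s")
print(f"vectorized: {vec_value:.10f} in {vec_seconds:.4f} s")
print("agree to 1e-9:", abs(naive_value - vec_value) < 1e-9)
```

Agreement between the two values addresses correctness, while repeating the timing over a range of sample sizes would address scalability. A fuller benchmark would also average over several runs rather than rely on a single call.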

Circularity Check

0 steps flagged

No circularity: direct implementation of prior method with no internal derivations


The paper is a software implementation contribution that references the CGC formulas and inference procedures from Dang et al. (2020) without introducing any new derivations, predictions, fitted parameters, or self-referential steps. No equations or claims reduce to the paper's own inputs by construction, and there is no load-bearing self-citation chain. The work is self-contained as an efficient Python package for an externally defined statistical measure; absence of numerical validation against the reference is a correctness concern rather than circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a software implementation paper. It introduces no new free parameters, axioms, or invented entities; all statistical content is drawn from the referenced 2020 work by Dang et al.




Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Comparing Two Categorical Gini Correlations with Applications to Classification Problems

    stat.ME 2026-05 unverdicted novelty 6.0

    Proposes an inferential framework to test differences in categorical Gini correlations for predictor importance in classification, establishing asymptotic normality and consistency while accommodating unequal dimensio...

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · cited by 1 Pith paper

  1. [1]

    Dang, X., Nguyen, D., Chen, X., & Zhang, J. (2021). A new Gini correlation between quantitative and qualitative variables. Scand. J. Stat., 48(4), 1314-1343. DOI: https://doi.org/10.1111/sjos.12490

  2. [2]

    Nguyen, D., & Dang, X. (2025). GiniDistance: A new Gini correlation between quantitative and qualitative variables (Version 0.1.1) [R package]. https://CRAN.R-project.org/package=GiniDistance

  3. [3]

    Jiménez-Gamero, M. D., & Sillero-Denamiel, M. R. (2025). The k-sample problem using Gini covariance for large k. J. Multivar. Anal., Article 105463. DOI: https://doi.org/10.1016/j.jmva.2025.105463

  4. [4]

    Hewage, S., & Sang, Y. (2024). Jackknife empirical likelihood confidence intervals for the categorical Gini correlation. J. Stat. Plan. Inference, 231, 106123. DOI: https://doi.org/10.1016/j.jspi.2023.106123

  5. [5]

    Hewage, S. (2025). A nonparametric K-sample test for variability based on Gini’s mean difference. J. Stat. Theory Appl., 1-20. DOI: https://doi.org/10.1007/s44199-025-00112-3

  6. [6]

    Hewage, S. S. (2025). Statistical Inference for Categorical Gini Correlation and Gini’s Mean Difference (Doctoral dissertation, University of Louisiana at Lafayette)

  7. [7]

    Ramos-Carreño, C., & Torrecilla, J. L. (2023). dcor: Distance correlation and energy statistics in Python. SoftwareX, 22, 101326. DOI: https://doi.org/10.1016/j.softx.2023.101326

  8. [8]

    Sang, Y., & Dang, X. (2023). Asymptotic normality of Gini correlation in high dimension with applications to the K-sample problem. Electron. J. Stat., 17(2), 2539-2574. DOI: https://doi.org/10.1214/23-EJS2165

  9. [9]

    Sang, Y., & Dang, X. (2024). Grouped feature screening for ultrahigh-dimensional classification via Gini distance correlation. J. Multivar. Anal. DOI: https://doi.org/10.1016/j.jmva.2024.105360

  10. [10]

    Shang, D., Li, A., & Shang, P. (2023). An improved nonlinear correlation method for feature selection of complex data. Nonlinear Dyn., 111(12), 11357-11369. DOI: https://doi.org/10.1007/s11071-023-08406-w

  11. [11]

    Shao, J., & Tu, D. (1996). The Jackknife and Bootstrap. Springer. DOI: https://doi.org/10.1007/978-1-4612-0795-5

  12. [12]

    Liu, Y., & Shang, P. (2025). Measuring Feature-Label Dependence Using Projection Correlation Statistic. arXiv preprint arXiv:2504.19180. DOI: https://doi.org/10.48550/arXiv.2504.19180

  13. [13]

    Suresh, S., & Kattumannil, S. K. (2024). JEL ratio test for independence between a continuous and a categorical random variable. arXiv preprint arXiv:2402.18105. DOI: https://doi.org/10.48550/arXiv.2402.18105

  14. [14]

    Székely, G. J., Rizzo, M. L., & Bakirov, N. (2007). Measuring and testing dependence by correlation of distances. Ann. Statist., 35(6), 2769-2794. DOI: https://doi.org/10.1214/009053607000000505

  15. [15]

    Székely, G. J., & Rizzo, M. L. (2009). Brownian distance covariance. Ann. Appl. Stat., 3(4), 1233-1303. DOI: https://doi.org/10.1214/09-AOAS312

  16. [16]

    Székely, G. J., & Rizzo, M. L. (2013a). Energy statistics: A class of statistics based on distances. J. Stat. Plan. Infer., 143, 1249-1272. DOI: https://doi.org/10.1016/j.jspi.2013.03.018

  17. [17]

    Wang, B., Shang, P., & Zhang, B. (2025). Generalized Gini dependence measures for complex data and their applications in K-sample problem and feature screening. Nonlinear Dyn., 113(9), 9709-9733. DOI: https://doi.org/10.1007/s11071-024-10620-z

  18. [18]

    Yitzhaki, S., & Schechtman, E. (2013). The Gini Methodology. Springer. DOI: https://doi.org/10.1007/978-1-4614-4720-7_2

  19. [19]

    Zhang, S., Dang, X., Nguyen, D., Wilkins, D., & Chen, Y. (2019). Estimating feature-label dependence using Gini distance statistics. IEEE Trans. Pattern Anal. Mach. Intell., 43(6), 1947-1963. DOI: https://doi.org/10.1109/TPAMI.2019.2960358