gcor: A Python Implementation of Categorical Gini Correlation and Its Inference
Pith reviewed 2026-05-19 08:37 UTC · model grok-4.3
The pith
A Python package implements Categorical Gini Correlation for dependence between numeric and categorical variables along with confidence intervals and independence tests.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper presents gcor, a Python package that computes the Categorical Gini Correlation introduced by Dang et al. and supplies optimized routines for confidence interval construction and independence testing, with all steps accelerated through vectorization and parallelization.
What carries the argument
The gcor Python package, which encodes the Categorical Gini Correlation formulas and inference procedures with vectorized and parallelized implementations.
If this is right
- Users can apply Categorical Gini Correlation to feature screening for classification without writing custom code.
- Independence tests become available for pairs consisting of one numeric and one categorical variable.
- Larger data sets can be analyzed because the procedures run faster under vectorization and parallelization.
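The feature-screening use in the first bullet can be sketched as follows. This is a minimal NumPy illustration, not the gcor API: the helpers `mean_abs_diff`, `cgc`, and `screen_features` are hypothetical names, and the sketch assumes a univariate numeric feature, absolute difference as the distance, and at least two observations per class.

```python
import numpy as np

def mean_abs_diff(x):
    """Mean of |x_i - x_j| over all pairs i != j (needs at least 2 values)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    return np.abs(x[:, None] - x[None, :]).sum() / (n * (n - 1))

def cgc(x, y):
    """Hypothetical helper: sample CGC between numeric x and categorical y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    overall = mean_abs_diff(x)
    # Class-proportion-weighted average of within-class mean distances.
    within = sum(np.mean(y == k) * mean_abs_diff(x[y == k]) for k in np.unique(y))
    return (overall - within) / overall

def screen_features(X, y, top=2):
    """Score every column of X against y and return the best-scoring indices."""
    scores = np.array([cgc(X[:, j], y) for j in range(X.shape[1])])
    order = np.argsort(scores)[::-1]  # largest score first
    return order[:top], scores

# Simulated data: five noise features, of which only column 3 depends on the class.
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 100)
X = rng.normal(size=(300, 5))
X[:, 3] += 2.0 * y
top, scores = screen_features(X, y, top=1)
print("selected column:", top[0])
```

Ranking by the raw estimate is only one possible screening rule; a threshold on a test p-value would be an equally plausible choice.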
Where Pith is reading between the lines
- The package may encourage direct empirical comparisons between Categorical Gini Correlation and other dependence measures on real data sets.
- Integration into existing statistical workflows could make the measure routine for variable selection steps.
- Similar implementations in other languages would follow naturally once a reference version exists.
Load-bearing premise
The provided Python code correctly implements the Categorical Gini Correlation formulas and inference procedures originally defined by Dang et al. without introducing computational errors or altering the statistical properties.
What would settle it
Run the package on a dataset where the numeric and categorical variables are known to be independent or dependent, then check whether the reported correlation values, interval coverage, and test decisions match the theoretical expectations.
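A check of this kind might look like the sketch below. It does not call gcor; it uses a stand-alone estimator (hypothetical helpers `gmd_sorted` and `cgc`) that assumes a univariate numeric variable with absolute difference as the distance. The within-group and overall mean distances are computed with the standard order-statistics identity for Gini's mean difference.

```python
import numpy as np

def gmd_sorted(x):
    """Mean of |x_i - x_j| over all pairs, via sorted values in O(n log n)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    weights = 2 * np.arange(1, n + 1) - n - 1  # coefficient of the i-th order statistic
    return 2.0 * np.dot(weights, x) / (n * (n - 1))

def cgc(x, y):
    """Hypothetical helper: sample CGC between numeric x and categorical y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    overall = gmd_sorted(x)
    within = sum(np.mean(y == k) * gmd_sorted(x[y == k]) for k in np.unique(y))
    return (overall - within) / overall

rng = np.random.default_rng(1)
n = 400
y = rng.integers(0, 2, size=n)          # two classes
x_indep = rng.normal(size=n)            # independent of y by construction
x_dep = rng.normal(size=n) + 2.0 * y    # class-dependent location shift

rho_indep = cgc(x_indep, y)
rho_dep = cgc(x_dep, y)
print(f"independent: {rho_indep:.3f}, dependent: {rho_dep:.3f}")
```

The independent case should give an estimate close to zero and the shifted case a clearly positive one. Checking interval coverage and test size would need repeated simulation on top of this.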
Original abstract
Categorical Gini Correlation (CGC), introduced by Dang et al. (2020), is a novel dependence measure designed to quantify the association between a numerical variable and a categorical variable. It has appealing properties compared to existing dependence measures, such as zero correlation mutually implying independence between the variables. It has also shown superior performance over existing methods when applied to feature screening for classification. This article presents a Python implementation for computing CGC, constructing confidence intervals, and performing independence tests based on it. Efficient algorithms have been implemented for all procedures, and they have been optimized using vectorization and parallelization to enhance computational efficiency.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a Python package 'gcor' implementing Categorical Gini Correlation (CGC) for quantifying dependence between a numerical variable and a categorical variable, along with procedures for constructing confidence intervals and performing independence tests. It claims that efficient algorithms for these procedures have been implemented and optimized via vectorization and parallelization.
Significance. If the implementation is shown to be correct and the efficiency claims are substantiated, the package would offer a practical, accessible tool for applying CGC in statistical workflows such as feature screening for classification, extending the method introduced by Dang et al. (2020) to the Python ecosystem with performance improvements.
major comments (2)
- [Abstract] Abstract: The assertion that 'efficient algorithms have been implemented for all procedures, and they have been optimized using vectorization and parallelization to enhance computational efficiency' is unsupported by any benchmark timings, scalability tests, error analysis, or verification against the original formulas.
- [Implementation] Implementation section: No numerical validation, reproduction of example values from Dang et al. (2020), or comparison to a reference implementation is described to confirm that the vectorized and parallelized code faithfully reproduces the CGC dependence measure, confidence intervals, and independence tests without altering statistical properties.
minor comments (2)
- The manuscript would benefit from explicit statements of the Python version, required dependencies, and installation instructions to improve reproducibility for users.
- Consider including a small worked example with output values in the main text to illustrate usage of the core functions for CGC computation and testing.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript describing the gcor Python package. We address each major comment below and agree that additional empirical support is needed to substantiate the efficiency and correctness claims.
Point-by-point responses
- Referee: [Abstract] Abstract: The assertion that 'efficient algorithms have been implemented for all procedures, and they have been optimized using vectorization and parallelization to enhance computational efficiency' is unsupported by any benchmark timings, scalability tests, error analysis, or verification against the original formulas.
  Authors: We agree that the abstract claim regarding efficiency optimizations is not supported by evidence in the current version. In the revision we will add benchmark timings (comparing vectorized/parallelized code to naive loops), scalability tests across sample sizes, and direct verification against the original CGC formulas from Dang et al. (2020). These results will be placed in a new 'Numerical Performance' subsection.
  Revision: yes
- Referee: [Implementation] Implementation section: No numerical validation, reproduction of example values from Dang et al. (2020), or comparison to a reference implementation is described to confirm that the vectorized and parallelized code faithfully reproduces the CGC dependence measure, confidence intervals, and independence tests without altering statistical properties.
  Authors: The referee is correct that the manuscript lacks explicit numerical validation. We will revise the Implementation section to include reproduction of the numerical examples from Dang et al. (2020), side-by-side comparison with a reference (non-vectorized) implementation, and checks that confidence intervals and test p-values remain statistically equivalent. This will confirm that vectorization and parallelization preserve the original statistical properties.
  Revision: yes
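The comparison the authors promise (a loop-based reference against a vectorized version, with timings) could take roughly the following shape. Both functions are illustrative stand-ins written for this sketch, not code from gcor. They assume a univariate numeric variable, absolute difference as the distance, and at least two observations per class.

```python
import time
import numpy as np

def cgc_loops(x, y):
    """Reference version: explicit double loop over all pairs i < j."""
    x = [float(v) for v in x]
    y = list(y)
    n = len(x)
    total, pairs = 0.0, 0
    within_sum, within_pairs, counts = {}, {}, {}
    for label in y:
        counts[label] = counts.get(label, 0) + 1
    for i in range(n):
        for j in range(i + 1, n):
            d = abs(x[i] - x[j])
            total += d
            pairs += 1
            if y[i] == y[j]:
                within_sum[y[i]] = within_sum.get(y[i], 0.0) + d
                within_pairs[y[i]] = within_pairs.get(y[i], 0) + 1
    overall = total / pairs
    within = sum(counts[k] / n * within_sum[k] / within_pairs[k] for k in within_sum)
    return (overall - within) / overall

def cgc_vectorized(x, y):
    """Vectorized version: one broadcast distance matrix, sliced per class."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    n = x.size
    dist = np.abs(x[:, None] - x[None, :])
    overall = dist.sum() / (n * (n - 1))
    within = 0.0
    for k in np.unique(y):
        mask = y == k
        nk = int(mask.sum())
        within += (nk / n) * dist[np.ix_(mask, mask)].sum() / (nk * (nk - 1))
    return (overall - within) / overall

rng = np.random.default_rng(2)
x = rng.normal(size=300)
y = rng.integers(0, 3, size=300)

t0 = time.perf_counter()
slow = cgc_loops(x, y)
t1 = time.perf_counter()
fast = cgc_vectorized(x, y)
t2 = time.perf_counter()

print(f"loops: {slow:.6f} in {t1 - t0:.4f}s; vectorized: {fast:.6f} in {t2 - t1:.4f}s")
print("agree:", np.isclose(slow, fast))
```

A single timing like this is only indicative. A benchmark for the paper would repeat runs over several sample sizes and would also need to cover the parallelized paths, which this sketch does not.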
Circularity Check
No circularity: direct implementation of prior method with no internal derivations
Full rationale
The paper is a software implementation contribution that references the CGC formulas and inference procedures from Dang et al. (2020) without introducing any new derivations, predictions, fitted parameters, or self-referential steps. No equations or claims reduce to the paper's own inputs by construction, and there is no load-bearing self-citation chain. The work is self-contained as an efficient Python package for an externally defined statistical measure; absence of numerical validation against the reference is a correctness concern rather than circularity.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Categorical Gini correlation can then be estimated unbiasedly as a function of U-statistics: $\hat{\rho}_g(X, Y) = (\tilde{U} - \sum_k \hat{p}_k \tilde{U}_k)/\tilde{U}$, where $\tilde{U}_k$ and $\tilde{U}$ are averages of pairwise distances.
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean, LogicNat induction and recovery
  unclear: Relation between the paper passage and the cited Recognition theorem.
  The independence test … employs a permutation procedure to estimate both the critical value and the p-value.
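The two quoted passages (the U-statistic estimator and the permutation-based independence test) can be illustrated together. The sketch below is an assumption-laden reading, not the gcor implementation. `cgc` and `permutation_test` are hypothetical names, the distance is the absolute difference for a univariate variable, and the "+1" correction in the p-value is a common convention that the paper may or may not use.

```python
import numpy as np

def mean_abs_diff(x):
    """Average pairwise distance: mean of |x_i - x_j| over all pairs i != j."""
    x = np.asarray(x, dtype=float)
    n = x.size
    return np.abs(x[:, None] - x[None, :]).sum() / (n * (n - 1))

def cgc(x, y):
    """(U - sum_k p_k U_k) / U, with p_k the observed class proportions."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    u_all = mean_abs_diff(x)
    u_within = sum(np.mean(y == k) * mean_abs_diff(x[y == k]) for k in np.unique(y))
    return (u_all - u_within) / u_all

def permutation_test(x, y, n_perm=199, alpha=0.05, seed=0):
    """Shuffle the labels to approximate the null distribution of the statistic."""
    rng = np.random.default_rng(seed)
    observed = cgc(x, y)
    perm_stats = np.array([cgc(x, rng.permutation(y)) for _ in range(n_perm)])
    critical = np.quantile(perm_stats, 1 - alpha)  # estimated critical value
    p_value = (1 + np.sum(perm_stats >= observed)) / (n_perm + 1)
    return observed, critical, p_value

# Two classes with a location shift of 3, so independence should be rejected.
rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(0.0, 1.0, 60), rng.normal(3.0, 1.0, 60)])
y = np.repeat(["a", "b"], 60)
observed, critical, p_value = permutation_test(x, y)
print(f"rho = {observed:.3f}, critical = {critical:.3f}, p = {p_value:.3f}")
```

Because shuffling the labels leaves the overall mean distance unchanged, only the within-class term needs recomputing per permutation. An optimized version would exploit this.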
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Comparing Two Categorical Gini Correlations with Applications to Classification Problems
  Proposes an inferential framework to test differences in categorical Gini correlations for predictor importance in classification, establishing asymptotic normality and consistency while accommodating unequal dimensio...
Reference graph
Works this paper leans on
- [1] Dang, X., Nguyen, D., Chen, X., & Zhang, J. (2021). A new Gini correlation between quantitative and qualitative variables. Scand. J. Stat., 48(4), 1314-1343. DOI: https://doi.org/10.1111/sjos.12490
- [2] Nguyen, D., & Dang, X. (2025). GiniDistance: A new Gini correlation between quantitative and qualitative variables (Version 0.1.1) [R package]. https://CRAN.R-project.org/package=GiniDistance
- [3] Jiménez-Gamero, M. D., & Sillero-Denamiel, M. R. (2025). The k-sample problem using Gini covariance for large k. J. Multivar. Anal., Article 105463. DOI: https://doi.org/10.1016/j.jmva.2025.105463
- [4] Hewage, S., & Sang, Y. (2024). Jackknife empirical likelihood confidence intervals for the categorical Gini correlation. J. Stat. Plan. Inference, 231, 106123. DOI: https://doi.org/10.1016/j.jspi.2023.106123
- [5] Hewage, S. (2025). A nonparametric K-sample test for variability based on Gini’s mean difference. J. Stat. Theory Appl., 1-20. DOI: https://doi.org/10.1007/s44199-025-00112-3
- [6] Hewage, S. S. (2025). Statistical Inference for Categorical Gini Correlation and Gini’s Mean Difference (Doctoral dissertation, University of Louisiana at Lafayette).
- [7] Ramos-Carreño, C., & Torrecilla, J. L. (2023). dcor: Distance correlation and energy statistics in Python. SoftwareX, 22, 101326. DOI: https://doi.org/10.1016/j.softx.2023.101326
- [8] Sang, Y., & Dang, X. (2023). Asymptotic normality of Gini correlation in high dimension with applications to the K-sample problem. Electron. J. Stat., 17(2), 2539-2574. DOI: https://doi.org/10.1214/23-EJS2165
- [9]
- [10] Shang, D., Li, A., & Shang, P. (2023). An improved nonlinear correlation method for feature selection of complex data. Nonlinear Dyn., 111(12), 11357-11369. DOI: https://doi.org/10.1007/s11071-023-08406-w
- [11] Shao, J., & Tu, D. (1996). The Jackknife and Bootstrap. Springer. DOI: https://doi.org/10.1007/978-1-4612-0795-5
- [12]
- [13] Suresh, S., & Kattumannil, S. K. (2024). JEL ratio test for independence between a continuous and a categorical random variable. arXiv preprint arXiv:2402.18105. DOI: https://doi.org/10.48550/arXiv.2402.18105
- [14] Székely, G. J., Rizzo, M. L., & Bakirov, N. (2007). Measuring and testing dependence by correlation of distances. Ann. Statist., 35(6), 2769-2794. DOI: https://doi.org/10.1214/009053607000000505
- [15] Székely, G. J., & Rizzo, M. L. (2009). Brownian distance covariance. Ann. Appl. Stat., 3(4), 1233-1303. DOI: https://doi.org/10.1214/09-AOAS312
- [16] Székely, G. J., & Rizzo, M. L. (2013a). Energy statistics: A class of statistics based on distances. J. Stat. Plan. Infer., 143, 1249-1272. DOI: https://doi.org/10.1016/j.jspi.2013.03.018
- [17] Wang, B., Shang, P., & Zhang, B. (2025). Generalized Gini dependence measures for complex data and their applications in K-sample problem and feature screening. Nonlinear Dyn., 113(9), 9709-9733. DOI: https://doi.org/10.1007/s11071-024-10620-z
- [18] Yitzhaki, S., & Schechtman, E. (2013). The Gini Methodology. Springer. DOI: https://doi.org/10.1007/978-1-4614-4720-7_2
- [19] Zhang, S., Dang, X., Nguyen, D., Wilkins, D., & Chen, Y. (2019). Estimating feature-label dependence using Gini distance statistics. IEEE Trans. Pattern Anal. Mach. Intell., 43(6), 1947-1963. DOI: https://doi.org/10.1109/TPAMI.2019.2960358