An Empirical Comparison of Methods for Quantifying the Similarity of Categorical Datasets

Andrea Bommert; Jörg Rahnenführer; Marieke Stolte

arxiv: 2604.11458 · v1 · submitted 2026-04-13 · 📊 stat.ME · stat.CO

Pith reviewed 2026-05-10 16:16 UTC · model grok-4.3

keywords: categorical data, similarity measures, two-sample tests, multi-sample tests, Friedman-Rafsky test, Mahalanobis cross-match, empirical comparison, distributional differences

The pith

Simulations identify the Friedman-Rafsky test as a strong compromise for detecting differences between two categorical datasets and the MMCM test for multiple datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper conducts an empirical comparison of statistical methods for quantifying similarity between categorical datasets, evaluating how well each detects differences in distributions and what computational resources each requires. Edge count tests perform reliably in the two-sample setting, with the Friedman-Rafsky test emerging as a balanced choice across scenarios due to high detection rates, acceptable resource use, and an acceptable rate of computational errors. For datasets where each variable has five categories, the best performer depends on the nature of the distributional difference: either the constrained minimum (CM) distance and certain graph-based tests perform best, or the classifier-based tests (C2ST) do. In the multi-sample case, the Multi-Sample Mahalanobis Cross-Match (MMCM) test stands out for comparably good detection performance with low resource demands. These patterns help guide selection of methods for practical tasks such as data validation and quality assessment.

Core claim

Through targeted simulations on categorical data, the study shows that edge count tests perform well in identifying differences between two datasets, and that the Friedman-Rafsky test combines high performance with acceptable resource consumption and an acceptable rate of computational errors, making it the recommended compromise for two samples. For comparing multiple datasets, the Multi-Sample Mahalanobis Cross-Match (MMCM) test provides comparably good detection with low resource requirements.
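To make the recommended two-sample test concrete, here is a minimal sketch of a Friedman-Rafsky-style test for categorical data. It rests on several assumptions that are not taken from the paper: observations are compared with the Hamming distance, the minimum spanning tree (MST) is built with a plain Prim's algorithm, and the p-value comes from a label permutation scheme rather than the usual asymptotic approximation. The function names are hypothetical. Ties are common with categorical data, so the MST is generally not unique; the sketch simply keeps the tree that index order produces.

```python
import random


def hamming(a, b):
    """Number of variables on which two categorical observations differ."""
    return sum(u != v for u, v in zip(a, b))


def mst_edges(points):
    """Prim's algorithm on the complete Hamming-distance graph.

    Ties are broken by index order, so this returns only one of the
    possibly many minimum spanning trees.
    """
    n = len(points)
    in_tree = [False] * n
    best_dist = [float("inf")] * n
    best_from = [-1] * n
    best_dist[0] = 0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]),
                key=lambda i: best_dist[i])
        in_tree[u] = True
        if best_from[u] >= 0:
            edges.append((best_from[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = hamming(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v] = d
                    best_from[v] = u
    return edges


def cross_edges(edges, labels):
    """Count MST edges that join observations from different samples."""
    return sum(labels[i] != labels[j] for i, j in edges)


def friedman_rafsky_perm(x, y, n_perm=500, seed=0):
    """Return (cross-sample edge count, permutation p-value).

    Few cross-sample edges indicate that the two samples separate in the
    graph, so small counts are treated as evidence of a difference.
    """
    rng = random.Random(seed)
    pooled = list(x) + list(y)
    labels = [0] * len(x) + [1] * len(y)
    edges = mst_edges(pooled)  # depends on pooled data only, not on labels
    observed = cross_edges(edges, labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if cross_edges(edges, labels) <= observed:
            hits += 1
    return observed, (1 + hits) / (1 + n_perm)
```

Because the tree depends only on the pooled observations, it is built once and reused for every permutation; only the sample labels are reshuffled.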

What carries the argument

Simulation-based head-to-head comparison of graph-based tests, distance measures, and classifier tests on synthetic categorical distributions, tracking detection power, runtime, and error frequency.
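A comparison of this kind can be pictured as a small evaluation loop. The sketch below is an illustration under assumptions, not the authors' simulation code: `evaluate_method`, `test_fn`, and `make_datasets` are hypothetical names, the rejection threshold `alpha=0.05` is an assumed default, and the paper's own performance metric may be defined differently. It records the three quantities named above for one method: how often it rejects, how long it runs, and how often it fails with an error.

```python
import time


def evaluate_method(test_fn, make_datasets, n_rep=100, alpha=0.05):
    """Estimate rejection rate, error rate, and mean runtime of one method.

    test_fn(x, y) is assumed to return a p-value.
    make_datasets(rep) is assumed to return one simulated pair (x, y).
    """
    rejections = 0
    errors = 0
    start = time.perf_counter()
    for rep in range(n_rep):
        x, y = make_datasets(rep)
        try:
            p_value = test_fn(x, y)
        except Exception:
            # A failed run counts as a computational error, not as a decision.
            errors += 1
            continue
        if p_value <= alpha:
            rejections += 1
    elapsed = time.perf_counter() - start
    completed = n_rep - errors
    return {
        "rejection_rate": rejections / completed if completed else float("nan"),
        "error_rate": errors / n_rep,
        "mean_seconds": elapsed / n_rep,
    }
```

When the two datasets are drawn from different distributions, the rejection rate estimates detection power; when they share one distribution, it estimates the type I error rate.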

If this is right

The Friedman-Rafsky test offers a practical default for two-sample comparisons of categorical data.
The MMCM test is efficient for multi-sample similarity checks with limited computational budget.
When each variable has five categories, the optimal method can shift to constrained minimum distance or classifier two-sample tests depending on the difference type.
Resource consumption and occurrence of computational errors should be weighed alongside statistical power when choosing a method.
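For readers unfamiliar with the classifier two-sample tests mentioned in the list above, the following is a minimal sketch of the general idea: train a classifier to tell the two datasets apart and check whether its held-out accuracy beats chance. Everything specific here is an assumption for illustration. The "classifier" is a simple majority-vote lookup per category combination, not the classifier used in the paper, and the chance level of 0.5 is only appropriate when both samples have the same size. The function name `c2st` is hypothetical.

```python
import math
import random
from collections import Counter, defaultdict


def c2st(x, y, test_fraction=0.5, seed=0):
    """Return (held-out accuracy, one-sided binomial p-value)."""
    rng = random.Random(seed)
    data = [(tuple(obs), 0) for obs in x] + [(tuple(obs), 1) for obs in y]
    rng.shuffle(data)
    n_test = max(1, int(len(data) * test_fraction))
    test, train = data[:n_test], data[n_test:]

    # Lookup-table classifier: majority sample label for each
    # category combination seen in the training split.
    votes = defaultdict(Counter)
    for obs, label in train:
        votes[obs][label] += 1
    overall = Counter(label for _, label in train)
    default = overall.most_common(1)[0][0] if overall else 0

    def predict(obs):
        if obs in votes:
            return votes[obs].most_common(1)[0][0]
        return default  # unseen combination: fall back to the majority label

    correct = sum(predict(obs) == label for obs, label in test)
    # Exact tail probability P(X >= correct) for X ~ Binomial(n_test, 0.5).
    tail = sum(math.comb(n_test, k) for k in range(correct, n_test + 1))
    return correct / n_test, tail / 2 ** n_test
```

A held-out accuracy close to 0.5 suggests the datasets are hard to tell apart, while an accuracy well above 0.5 yields a small p-value.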

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

These recommendations could help standardize similarity checks during data preprocessing in machine learning pipelines.
Similar simulation designs might be extended to mixed continuous-categorical data or to larger numbers of categories to test robustness.
Domain-specific applications could validate the findings by applying the tests to categorical data from fields like genomics or survey analysis.

Load-bearing premise

The chosen simulation setups and specific types of distributional differences tested adequately represent the real-world scenarios where these similarity methods would be used.

What would settle it

Running the top-performing methods on real categorical datasets that contain independently confirmed distributional differences and checking whether detection rates match the simulation rankings.

Original abstract

Quantifying the similarity of two or more datasets has widespread applications in statistics and machine learning. The method choice is, however, difficult due to the abundance of proposed methods and the lack of neutral comparison studies, especially for categorical data. Here, the most promising methods are compared concerning their ability to detect certain differences between datasets and their resource consumption. The results show that the edge count tests perform well when comparing two datasets (i.e., the two-sample case). For certain scenarios, the constrained minimum (CM) distance performs even better. For categorical data consisting of variables with five categories each, the best method depends on the type of difference between the distributions, with either the CM distance and certain graph-based tests performing best, or the classifier-based tests (C2ST). This tendency is even clearer for multiple datasets. Overall, the Friedman-Rafsky test can be recommended for two samples as a compromise of high performance, acceptable resource consumption, and computational error occurrences. For the multi-sample case, the Multi-Sample Mahalanobis Cross-Match (MMCM) test can be recommended due to its comparably good performance and low resource consumption.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper delivers a useful empirical benchmark for categorical dataset similarity methods but its recommendations rest on a narrow slice of simulations with five categories per variable.

This paper runs an empirical comparison of methods for quantifying similarity between categorical datasets and offers some clear recommendations from the results. The key points are that the Friedman-Rafsky test works well as a balanced option for two-sample comparisons, while the Multi-Sample Mahalanobis Cross-Match test is a good pick for multiple samples due to solid performance and low resource demands.

What the paper does well is fill the gap in neutral benchmarks for this task. Categorical data similarity comes up often in statistics and machine learning, yet most prior work focused on continuous data or lacked systematic tests across methods. By simulating various distributional differences and tracking both detection ability and computational costs, the authors give practitioners something concrete to go on. They also note how the best method can depend on the exact type of difference present, which is a realistic observation.

The main soft spot is the fixed simulation setup. All the categorical variables have exactly five categories, and the tests cover only certain scenarios of how the distributions differ. Graph-based and other methods can be quite sensitive to the number of categories because it changes the effective dimensionality and sparsity. If the paper had varied the category count or included more kinds of alternatives, the rankings might hold more broadly. As it stands, the advice is reliable within those bounds but should be applied cautiously elsewhere.

This kind of study is useful for anyone who needs to compare categorical datasets in practice, such as in data integration or model validation tasks. A reader who wants guidance on method selection without running their own experiments will get value from it. I would recommend sending this to peer review. It has enough substance to warrant referee input on the experimental design and to potentially expand the scope in revisions.

Referee Report

2 major / 1 minor

Summary. The paper presents an empirical comparison of methods for quantifying similarity between categorical datasets, evaluating their statistical power to detect differences and their computational resource consumption (runtime and error rates). Simulations are used to rank methods, leading to the recommendation of the Friedman-Rafsky test as a balanced choice for two-sample problems and the Multi-Sample Mahalanobis Cross-Match (MMCM) test for the multi-sample case.

Significance. If the simulation results are robust within their tested regime, the work fills a gap by providing a neutral, head-to-head evaluation of similarity measures for categorical data, which is relevant for applications in statistics and machine learning. The joint consideration of detection performance and practical resource use strengthens its applied value.

major comments (2)

[Abstract] Abstract: The overall recommendation of the Friedman-Rafsky test for two samples (and MMCM for multiple samples) is derived from simulations restricted to variables with exactly five categories each and only selected types of distributional differences; because graph-based, distance-based, and classifier-based methods are known to be sensitive to category cardinality (which governs effective dimension and sparsity) and to the precise form of shift, the reported ranking is conditional on an untested slice of the problem space rather than generally supported.
[Abstract] Abstract and simulation description: No systematic variation of category cardinality or exhaustive coverage of alternative types (marginal, dependence, tail) is reported, which is load-bearing for the central actionable recommendations; the observed performance ordering can change outside this narrow grid.

minor comments (1)

[Abstract] The repeated use of the vague qualifier 'certain scenarios' without an explicit mapping to the tested difference types reduces clarity for readers who wish to apply the results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment point by point below and outline the revisions we will make to improve the manuscript's transparency regarding the scope of the simulations.

Referee: [Abstract] Abstract: The overall recommendation of the Friedman-Rafsky test for two samples (and MMCM for multiple samples) is derived from simulations restricted to variables with exactly five categories each and only selected types of distributional differences; because graph-based, distance-based, and classifier-based methods are known to be sensitive to category cardinality (which governs effective dimension and sparsity) and to the precise form of shift, the reported ranking is conditional on an untested slice of the problem space rather than generally supported.

Authors: We agree that the recommendations are derived from simulations using a fixed cardinality of five categories per variable and a selection of distributional difference types. This design choice focused on a representative moderate-cardinality regime common in applied categorical data settings. We acknowledge the known sensitivity of graph-based, distance-based, and classifier-based methods to cardinality and shift type. In the revised manuscript, we will qualify the abstract to state that the recommendations apply under the tested conditions, and we will add a dedicated limitations paragraph in the Discussion section that explicitly addresses potential changes in performance ordering with different cardinalities and shift forms, along with suggestions for future work.

Revision: partial

Referee: [Abstract] Abstract and simulation description: No systematic variation of category cardinality or exhaustive coverage of alternative types (marginal, dependence, tail) is reported, which is load-bearing for the central actionable recommendations; the observed performance ordering can change outside this narrow grid.

Authors: The referee is correct that we did not conduct a systematic sweep over category cardinalities or provide exhaustive coverage of every possible difference type. Our simulations did include multiple categories of differences (marginal shifts, dependence changes, and tail behaviors), but always within the fixed five-category setting and without a full factorial crossing. A more exhaustive design was not pursued due to the substantial computational cost of evaluating all methods across additional grids. We will revise the simulation description and abstract to more precisely delineate the tested scenarios, and we will expand the Discussion to consider how results might generalize or differ outside the current grid. These textual changes will improve transparency while preserving the utility of the reported comparisons.

Revision: yes

Circularity Check

0 steps flagged

No circularity in empirical methods comparison

This is a pure simulation-based empirical comparison of pre-existing methods for categorical dataset similarity. No derivations, equations, or predictions are presented that reduce by construction to fitted inputs, self-definitions, or self-citation chains. Recommendations follow directly from measured performance, runtime, and error metrics on independent simulated data; the simulation design, while scoped to specific category cardinalities and difference types, does not create any internal logical loop or renaming of known results as new findings. The work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central recommendations rest on the assumption that the chosen simulation designs capture relevant real-world differences between categorical distributions and that the tested methods were implemented without bias. No new free parameters, axioms beyond standard statistical assumptions, or invented entities are introduced.

axioms (1)

Domain assumption: Simulated differences between categorical distributions adequately represent practical use cases for similarity quantification.
Performance claims and recommendations depend directly on the representativeness of the simulation scenarios used in the study.

Reference graph

Works this paper leans on

2 extracted references · 2 canonical work pages

[1]

5 case, there are 5^2 = 25 distinct values, so it is also very likely that the union of all optimal 5NN graphs is the full graph. In the case where the 5NN, ...

Therefore, when calculating the optimal non-bipartite matching that is used in both methods, there are many optimal solutions. The implemented matching algorithm goes through the observations in the order of the samples and starts looking for a match in the reverse order. Therefore, for the two-sample case with ties, the observations from the first datase...

work page 2022
[2]

skewed” alternatives but a low PESR for the “1 up, 1 down

Again, the scenarios with higher PESR are mostly ones with balanced sample sizes, while most of the unbalanced scenarios are among the lower-performing cases. In contrast to the binary case, for five categories, two types of deviations are considered: the class probability distribution becoming more and more skewed, or the probability of one class going u...

work page
