An Empirical Comparison of Methods for Quantifying the Similarity of Numeric Datasets
Pith reviewed 2026-05-10 15:37 UTC · model grok-4.3
The pith
Combinations of four to six methods quantify numeric dataset similarity nearly as well as the single best method in 90 to 95 percent of scenarios.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that no single method for quantifying dataset similarity performs best across all scenarios, but concrete combinations of four to six methods in the two-sample setting and two to three in the multi-sample setting ensure that in 90% to 95% of the considered simulation scenarios, at least one method from the combination performs almost as well as the best available method, while balancing differentiation power and computational efficiency.
What carries the argument
Empirical evaluation and ranking of 36 similarity quantification methods for continuous numeric data, leading to decision rules and recommended small ensembles.
Load-bearing premise
The selected simulation scenarios involving shifts, scales, and higher moments, along with the chosen performance criteria, adequately represent the challenges in real-world dataset comparisons.
What would settle it
Applying the recommended method combinations to a diverse set of real-world numeric datasets and checking whether at least one method in each combination remains nearly as effective as the single best method in distinguishing distributions.
read the original abstract
Methods for quantifying the similarity of datasets are relevant in applications where two or more datasets, or their underlying distributions, need to be compared, ranging from two- and k-sample testing to applications in machine learning and synthetic data generation. Many methods for quantifying the similarity of datasets are available from the literature, but due to the lack of neutral comparison studies, it is unclear which method to choose when. Here, 36 methods applicable to continuous data are compared across various scenarios, including two or more datasets drawn from different distributions. Several deviations between datasets are considered, including shift and scale alternatives or differences in higher moments. An overall method ranking is established based on the methods' abilities to differentiate between datasets from different distributions, combined with computational aspects. Based on this, concrete decision rules for finding the best method based on characteristics of the datasets are determined. Moreover, combinations of four to six methods are proposed in the two-sample case such that in 90% to 95% of the considered scenarios, at least one of these methods is almost as good as the best method. In the multi-sample case, a combination of two to three methods is proposed analogously.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript conducts an empirical comparison of 36 methods for quantifying similarity between numeric datasets, focusing on two-sample and multi-sample settings. It evaluates the methods on simulated data with distributional deviations including shifts, scale changes, and higher-moment differences, establishing rankings based on discriminatory ability and computational efficiency. Decision rules for method selection are derived, along with recommended combinations of 4-6 methods (two-sample) or 2-3 methods (multi-sample) that achieve near-optimal performance in 90-95% of the considered scenarios.
Significance. If the simulation-derived rankings and combination rules hold under broader conditions, the work supplies concrete, actionable guidance for practitioners selecting similarity measures in hypothesis testing, machine learning, and synthetic data validation. The absence of circularity in the evaluation (forward simulation on independent data) is a strength, as is the focus on both statistical performance and runtime. However, the significance is constrained by the narrow scope of the simulated regimes, which do not address dependence, high dimensionality, or irregular sampling that commonly arise in applications.
major comments (1)
- [Abstract] Abstract and simulation design: the central claim that combinations of 4-6 (two-sample) or 2-3 (multi-sample) methods succeed in 90-95% of scenarios rests on the assumption that the chosen deviations (shift/scale/higher moments) and performance criteria are representative. No details are provided on sample sizes, distribution families, handling of ties, or inclusion of dependence structures and multivariate features; this omission is load-bearing because the proposed decision rules may not generalize when these features are present.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. The comments on the simulation design and scope of the claims are well taken. We have revised the manuscript to improve transparency in the abstract, clarify the simulation parameters, and add an explicit limitations discussion. Our responses to the major comment are provided below.
read point-by-point responses
-
Referee: [Abstract] Abstract and simulation design: the central claim that combinations of 4-6 (two-sample) or 2-3 (multi-sample) methods succeed in 90-95% of scenarios rests on the assumption that the chosen deviations (shift/scale/higher moments) and performance criteria are representative. No details are provided on sample sizes, distribution families, handling of ties, or inclusion of dependence structures and multivariate features; this omission is load-bearing because the proposed decision rules may not generalize when these features are present.
Authors: We agree that the abstract should have provided more explicit information on the simulation setup. In the revised version we have expanded the abstract to state the sample-size range (50 to 500 observations per dataset), the distribution families examined (normal, t, exponential, gamma, and selected mixtures), and the fact that ties in performance rankings are broken by computational cost. The study is restricted to independent samples drawn from continuous distributions; dependence structures, irregular sampling, and high-dimensional regimes are not included. While several of the 36 methods (e.g., energy distance, MMD) are defined for multivariate data, the empirical ranking and combination rules were derived from univariate simulations to isolate the effects of location, scale, and higher-moment shifts. We have added a new limitations paragraph that explicitly cautions readers that the 90–95 % coverage figures and the derived decision rules apply only to the simulated regimes and may not hold under dependence or high dimensionality. Claims in the abstract and conclusion have been qualified accordingly. revision: yes
- Empirical performance of the recommended methods and combinations under dependence structures, high dimensionality, or irregular sampling, because these regimes were outside the scope of the original simulation study.
Circularity Check
No circularity: empirical benchmark of pre-existing methods on independent simulations
full rationale
The paper performs a forward simulation study comparing 36 existing methods across controlled two- and multi-sample scenarios with shift, scale, and higher-moment deviations. Method rankings and the proposed combinations (4-6 methods for two-sample, 2-3 for multi-sample) are computed directly from the simulation performance metrics and runtimes; no equations, parameters, or claims reduce to quantities fitted from the paper's own outputs. No self-citations are load-bearing for the central empirical claims, and the derivation chain consists solely of independent data generation followed by evaluation. This matches the default non-circular case for empirical comparison studies.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Simulated deviations (shift, scale, higher moments) and the chosen differentiation metric capture the relevant aspects of dataset similarity for the target applications.
Reference graph
Works this paper leans on
-
[1]
Geometric Dataset Distances via Optimal Transport
Agarwal, S. M. D., Bhattacharya, B., and Zhang, N. R. (2020):multicross: A Graph-Based Test for Comparing Multivariate Distributions in the Multi Sample Framework, R package version 2.1.0,url:https://CRAN.R-project.org/package=multicross. Alvarez-Melis, D. and Fusi, N. (2020): “Geometric Dataset Distances via Optimal Transport”, in:Advances in Neural Info...
-
[2]
A Framework for Measuring Changes in Data Characteristics
Ganti, V., Gehrke, J., Ramakrishnan, R., and Loh, W.-Y. (1999): “A Framework for Measuring Changes in Data Characteristics”, in:Proceedings of the 18th Symposium on Principles of Database Systems, pp. 126–137. Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., and Smola, A. (2012): “A Kernel Two- Sample Test”, in:Journal of Machine Learning Research13,...
-
[3]
DISCO (F,α = 1.5) DISCO (B,α = 1.5) N p Deviation DGP Balance Balance balanced unbalanced Overall DGP normal t lognormal chisq Overall Deviation shift scale correlation kurtosis skewness and kurtosis normal vs. t Overall p 2 10 50 Overall N 50 100 200 500 1000 Overall 0 0.2 0.4 Figure 28:Mean PESR difference to best method per scenario for shift alternati...
work page 2010
-
[4]
CF (1MST) N p Deviation DGP Balance Balance balanced unbalanced Overall DGP normal t lognormal chisq Overall Deviation shift scale correlation kurtosis skewness and kurtosis normal vs. t Overall p 2 10 50 Overall N 50 100 200 500 1000 Overall 0 0.2 0.4 Figure 33:Mean PESR difference to best method per scenario for shift alternatives andk“2 datasets for al...
work page 1979
-
[5]
DISCO (F,α = 1.5) DISCO (B,α = 1.5) N p Deviation DGP Balance Balance balanced unbalanced Overall DGP normal t lognormal chisq Overall Deviation shift scale correlation kurtosis skewness and kurtosis normal vs. t Overall p 2 10 50 Overall N 50 100 200 500 1000 Overall 0 0.2 0.4 Figure 49:Mean PESR difference to best method per scenario for scale alternati...
work page 2010
-
[6]
ZC (1MST,κ = 1.14) ZC (1MST,κ = 1.31) CF (1MST) CCS (5MST) FR (5MST) FR (1MST) CCS (1MST) N p Deviation DGP Balance Balance balanced unbalanced Overall DGP normal t lognormal chisq Overall Deviation shift scale correlation kurtosis skewness and kurtosis normal vs. t Overall p 2 10 50 Overall N 50 100 200 500 1000 Overall 0 0.2 0.4 0.6 0.8 Figure 54:Mean P...
work page 1979
-
[7]
DISCO (F,α = 1.5) DISCO (B,α = 1.5) N p Deviation DGP Balance Balance balanced unbalanced Overall DGP normal t lognormal chisq Overall Deviation shift scale correlation kurtosis skewness and kurtosis normal vs. t Overall p 2 10 50 Overall N 50 100 200 500 1000 Overall 0.2 0.4 Figure 70:Mean PESR difference to best method per scenario for correlation alter...
work page 2010
-
[8]
CF (1MST) CCS (1MST) N p Deviation DGP Balance Balance balanced unbalanced Overall DGP normal t lognormal chisq Overall Deviation shift scale correlation kurtosis skewness and kurtosis normal vs. t Overall p 2 10 50 Overall N 50 100 200 500 1000 Overall 0.2 0.4 0.6 Figure 75:Mean PESR difference to best method per scenario for correlation alternatives and...
work page 1979
-
[9]
DISCO (F,α = 1.5) DISCO (B,α = 1.5) N p Deviation DGP Balance Balance balanced unbalanced Overall DGP normal t lognormal chisq Overall Deviation shift scale correlation kurtosis skewness and kurtosis normal vs. t Overall p 2 10 50 Overall N 50 100 200 500 1000 Overall 0.2 0.4 0.6 Figure 91:Mean PESR difference to best method per scenario for the normal vs...
work page 2010
-
[10]
FR (5MST) ZC (1MST,κ = 1.14) ZC (1MST,κ = 1.31) CF (1MST) FR (1MST) CCS (5MST) CCS (1MST) N p Deviation DGP Balance Balance balanced unbalanced Overall DGP normal t lognormal chisq Overall Deviation shift scale correlation kurtosis skewness and kurtosis normal vs. t Overall p 2 10 50 Overall N 50 100 200 500 1000 Overall 0 0.2 0.4 0.6 0.8 Figure 96:Mean P...
work page 1979
-
[11]
DISCO (B,α = 1.5) DISCO (F,α = 1.5) N p Deviation DGP Balance Balance balanced unbalanced Overall DGP normal t lognormal chisq Overall Deviation shift scale correlation kurtosis skewness and kurtosis normal vs. t Overall p 2 10 50 Overall N 50 100 200 500 1000 Overall 0.2 0.4 Figure 112:Mean PESR difference to best method per scenario for the kurtosis alt...
work page 2010
-
[12]
ZC (1MST,κ = 1.14) CF (1MST) ZC (1MST,κ = 1.31) CCS (5MST) CCS (1MST) FR (5MST) FR (1MST) N p Deviation DGP Balance Balance balanced unbalanced Overall DGP normal t lognormal chisq Overall Deviation shift scale correlation kurtosis skewness and kurtosis normal vs. t Overall p 2 10 50 Overall N 50 100 200 500 1000 Overall 0.2 0.4 Figure 117:Mean PESR diffe...
work page 1979
-
[13]
DISCO (F,α = 1.5) N p Deviation DGP Balance Balance balanced unbalanced Overall DGP normal t lognormal chisq Overall Deviation shift scale correlation kurtosis skewness and kurtosis normal vs. t Overall p 2 10 50 Overall N 50 100 200 500 1000 Overall 0.2 0.4 0.6 Figure 133:Mean PESR difference to best method per scenario for the skewness and kurtosis alte...
work page 2010
-
[14]
ZC (1MST,κ = 1.14) ZC (1MST,κ = 1.31) CF (5MST) CF (1MST) CCS (5MST) CCS (1MST) FR (5MST) FR (1MST) N p Deviation DGP Balance Balance balanced unbalanced Overall DGP normal t lognormal chisq Overall Deviation shift scale correlation kurtosis skewness and kurtosis normal vs. t Overall p 2 10 50 Overall N 50 100 200 500 1000 Overall 0 0.2 0.4 0.6 0.8 Figure...
work page 1979
-
[15]
Selected variant: DISCO (F,α“0.5)
Energy DISCO (F,α = 1.5) DISCO (B,α = 1.5) N p Deviation DGP Balance Grouping Grouping 3+1 2+2 2+1+1 1+1+1+1 Overall Balance balanced unbalanced Overall DGP normal t lognormal chisq Overall Deviation shift scale correlation kurtosis skewness and kurtosis Overall p 2 10 50 Overall N 100 200 400 Overall 0.2 0.4 0.6 0.8 Figure 157:Mean PESR difference to bes...
work page 2004
-
[16]
N p Deviation DGP Balance Grouping Grouping 3+1 2+2 2+1+1 1+1+1+1 Overall Balance balanced unbalanced Overall DGP normal t lognormal chisq Overall Deviation shift scale correlation kurtosis skewness and kurtosis Overall p 2 10 50 Overall N 100 200 400 Overall 0 0.2 0.4 0.6 0.8 Figure 164:Mean PESR difference to best method per scenario for the shift alter...
work page 2022
-
[17]
Selected variant: DISCO (F,α“0.5)
Energy DISCO (F,α = 1.5) DISCO (B,α = 1.5) N p Deviation DGP Balance Grouping Grouping 3+1 2+2 2+1+1 1+1+1+1 Overall Balance balanced unbalanced Overall DGP normal t lognormal chisq Overall Deviation shift scale correlation kurtosis skewness and kurtosis Overall p 2 10 50 Overall N 100 200 400 Overall 0 0.2 0.4 0.6 0.8 Figure 170:Mean PESR difference to b...
work page 2004
-
[18]
Selected variant: DISCO (F,α“0.5)
Energy DISCO (F,α = 1.5) DISCO (B,α = 1.5) N p Deviation DGP Balance Grouping Grouping 3+1 2+2 2+1+1 1+1+1+1 Overall Balance balanced unbalanced Overall DGP normal t lognormal chisq Overall Deviation shift scale correlation kurtosis skewness and kurtosis Overall p 2 10 50 Overall N 100 200 400 Overall 0.2 0.4 0.6 0.8 Figure 183:Mean PESR difference to bes...
work page 2004
-
[19]
N p Deviation DGP Balance Grouping Grouping 3+1 2+2 2+1+1 1+1+1+1 Overall Balance balanced unbalanced Overall DGP normal t lognormal chisq Overall Deviation shift scale correlation kurtosis skewness and kurtosis Overall p 2 10 50 Overall N 100 200 400 Overall 0 0.2 0.4 0.6 0.8 Figure 190:Mean PESR difference to best method per scenario for the correlation...
work page 2022
-
[20]
Selected variant: DISCO (B,α“0.5)
DISCO (B,α = 1.5) DISCO (F,α = 1.5) N p Deviation DGP Balance Grouping Grouping 3+1 2+2 2+1+1 1+1+1+1 Overall Balance balanced unbalanced Overall DGP normal t lognormal chisq Overall Deviation shift scale correlation kurtosis skewness and kurtosis Overall p 2 10 50 Overall N 100 200 400 Overall 0.2 0.4 Figure 196:Mean PESR difference to best method per sc...
work page 2004
-
[21]
N p Deviation DGP Balance Grouping Grouping 3+1 2+2 2+1+1 1+1+1+1 Overall Balance balanced unbalanced Overall DGP normal t lognormal chisq Overall Deviation shift scale correlation kurtosis skewness and kurtosis Overall p 2 10 50 Overall N 100 200 400 Overall 0.2 0.4 Figure 203:Mean PESR difference to best method per scenario for the kurtosis alternative ...
work page 2022
-
[22]
Selected variant: DISCO (B,α“0.5)
DISCO (F,α = 0.5) N p Deviation DGP Balance Grouping Grouping 3+1 2+2 2+1+1 1+1+1+1 Overall Balance balanced unbalanced Overall DGP normal t lognormal chisq Overall Deviation shift scale correlation kurtosis skewness and kurtosis Overall p 2 10 50 Overall N 100 200 400 Overall 0.2 0.4 0.6 0.8 Figure 209:Mean PESR difference to best method per scenario for...
work page 2004
-
[23]
N p Deviation DGP Balance Grouping Grouping 3+1 2+2 2+1+1 1+1+1+1 Overall Balance balanced unbalanced Overall DGP normal t lognormal chisq Overall Deviation shift scale correlation kurtosis skewness and kurtosis Overall p 2 10 50 Overall N 100 200 400 Overall 0 0.2 0.4 0.6 0.8 Figure 215:Mean PESR difference to best method per scenario for the skewness an...
work page 2022
-
[24]
N p Deviation DGP Balance Grouping Grouping 3+1 2+2 2+1+1 1+1+1+1 Overall Balance balanced unbalanced Overall DGP normal t lognormal chisq Overall Deviation shift scale correlation kurtosis skewness and kurtosis Overall p 2 10 50 Overall N 100 200 400 Overall 0 0.2 0.4 0.6 0.8 Figure 216:Mean PESR difference to best method per scenario for the skewness an...
work page 2022
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.