pith. sign in

arxiv: 2606.10673 · v1 · pith:DFAO6KWBnew · submitted 2026-06-09 · 📊 stat.OT · cs.LG

ClusBench: The Clustering Benchmark Data Resource You've All Been Waiting For (?)

Pith reviewed 2026-06-27 10:56 UTC · model grok-4.3

classification 📊 stat.OT cs.LG
keywords clusteringbenchmarksynthetic datanon-parametric distributionreal-world dataevaluationclustering methodsdata generation
0
0 comments X

The pith

ClusBench creates nearly 3000 synthetic datasets by fitting non-parametric distributions to more than 200 real-world sources for clustering evaluation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard benchmarks for clustering methods rely on either limited real datasets or overly simple simulations that miss real-world complexities. This paper produces a large collection of synthetic datasets derived from public real-world data by fitting flexible non-parametric distributions. The approach keeps much of the original data's nuance while allowing datasets to grow in size. An accompanying R package makes the resource available for download. Researchers can use these to test clustering algorithms more realistically than before.

Core claim

By fitting a flexible non-parametric distribution to each base data set we are able to retain much of the nuance in real-world data which is difficult to reproduce in standard simulations, while also producing data sets whose sizes are sometimes substantially greater than the data sets from which they are derived. The synthetic data sets, plus an accompanying R package, are available for download from https://github.com/DavidHofmeyr/ClusBench.

What carries the argument

Fitting a flexible non-parametric distribution to each base dataset to generate synthetic versions that preserve real-world features.

If this is right

  • Clustering methods can be evaluated on a much larger and more varied set of test cases than previously available.
  • The synthetic datasets maintain structural features from real applications that determine clustering performance.
  • Dataset sizes can exceed the originals, enabling tests on larger scales.
  • Standardized access via an R package facilitates reproducible benchmarking across studies.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Methods that perform well on these datasets may generalize better to practical applications than those tuned on synthetic simulations alone.
  • This resource could serve as a template for creating benchmarks in other machine learning tasks like classification or regression.
  • Future work might compare clustering algorithm rankings on ClusBench versus traditional benchmarks to identify discrepancies.

Load-bearing premise

The non-parametric distribution fitted to each base dataset preserves the structural features that determine clustering difficulty and method performance.

What would settle it

If clustering methods show substantially different performance rankings or behaviors on these synthetic datasets compared to the original real-world base datasets, the preservation of relevant structure would be called into question.

Figures

Figures reproduced from arXiv: 2606.10673 by David P. Hofmeyr.

Figure 1
Figure 1. Figure 1: Two commonly used classification data sets (left), each with one synthetic sample generated using our modified KDE (middle) and one using a stan￾dard KDE (right) modified KDE produces a data set which is pleasing as a representation of what we might expect a larger sample from the same underlying distribution to look like. It is worth pointing out that, of course, choosing a smaller bandwidth will allow th… view at source ↗
Figure 2
Figure 2. Figure 2: PCA projections of some of the data sets removed due to difficulty in ob￾taining clustering solutions which align with the actual classes. Points in different classes are differentiated by colour and point character. These may be seen as arguably the “most clusterable” of those removed. 3.2 Rebalanced Class Proportions As mentioned above some of the data sets removed were challenging to cluster “accu￾ratel… view at source ↗
Figure 3
Figure 3. Figure 3: Samples from two distributions which, after rebalancing of class propor￾tions, become far more reasonably clusterable. We show, in [PITH_FULL_IMAGE:figures/full_fig_p015_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: The 278 batches of ten data sets characterised by dimension and number of classes. In addition points are differentiated by size according to maximum clusterability, and by colour according to the hierarchical model giving the best result. class imbalance; and maximum, mean, and minimum clusterability (by clustering model/linkage method). The dendrogram in the left margin is based on Ward type linkage but … view at source ↗
Figure 5
Figure 5. Figure 5 [PITH_FULL_IMAGE:figures/full_fig_p018_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Two data sets from each of the groups. In each case one of the samples from the “most typical” and “least typical” batch of ten is shown. 2. sampleX: A function to fit and sample from a modified KDE, as described in Section 2. 3. SampleXy: A function to fit and sample from a mixture of modified KDEs, stratified according to a given vector of class labels. 19 [PITH_FULL_IMAGE:figures/full_fig_p019_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: The iris data set, as well as two of the synthetic samples generated from it which are included in the ClusBench repository. for the ubiquity of the iris data set making the generation of larger (and realistic) “versions” of it potentially interesting. To illustrate the functionality of the functions sampleX and sampleXy we revist the Statlog Satellite data set. The following code chunk downloads and impor… view at source ↗
Figure 8
Figure 8. Figure 8: The “satimage” (Statlog Satellite) data set, as well as one of the synthetic variants in the ClusBench repository, plus two alternatives generated with the ClusBench package. References K. Bache and M. Lichman. UCI machine learning repository, 2013. URL http: //archive.ics.uci.edu/ml. James Joseph Balamuta and Philip Truong. ucimlrepo: Explore UCI ML Reposi￾tory Datasets, 2024. URL https://CRAN.R-project.o… view at source ↗
read the original abstract

Although some very common test beds exist for assessing the performance of clustering methods, large scale benchmarking is typically limited to relatively simplistic simulation set-ups. Here we describe the production and curation of close to 3000 synthetic data sets, derived from more than 200 publicly available data sets; the majority of which arose from real-world applications. By fitting a flexible non-parametric distribution to each base data set we are able to retain much of the nuance in real-world data which is difficult to reproduce in standard simulations, while also producing data sets whose sizes are sometimes substantially greater than the data sets from which they are derived. The synthetic data sets, plus an accompanying R package, are available for download from https://github.com/DavidHofmeyr/ClusBench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript describes ClusBench, a curated collection of nearly 3000 synthetic datasets generated from more than 200 publicly available real-world datasets. The generation process fits a flexible non-parametric distribution to each base dataset, with the stated goal of retaining real-world nuances that are hard to capture in standard simulations while also enabling substantially larger sample sizes. The datasets and an accompanying R package are made available via GitHub.

Significance. If the synthetic datasets preserve the structural properties that determine clustering difficulty (separation, overlap, manifold structure, effective dimension), the resource would provide a valuable, scalable benchmark that bridges the gap between simplistic simulations and limited real-world data. The scale (nearly 3000 datasets) and the provision of an R package are practical strengths that could support reproducible large-scale method comparisons.

major comments (2)
  1. [Abstract / generation pipeline] Abstract and generation description: the central claim that non-parametric fitting 'retains much of the nuance in real-world data which is difficult to reproduce in standard simulations' is presented without any empirical validation. No section compares clustering algorithm rankings, error rates, or structural metrics (e.g., cluster separation, overlap, or effective dimension) between the original base datasets and the generated synthetic versions.
  2. [Methods / data-generation description] No validation or sensitivity analysis is supplied for the non-parametric fitting step itself (kernel choice, bandwidth selection, handling of labels or categorical features). When upsampling is performed, it is unclear whether the fitted distribution preserves the features that drive method performance differences; this assumption is load-bearing for the benchmark's claimed utility.
minor comments (2)
  1. [Title] The title ends with a parenthetical question mark; consider rephrasing for a more conventional academic tone.
  2. [Abstract / availability statement] The GitHub link is given only in the abstract; ensure the repository is cited with a stable DOI or version tag in the main text and that the R package documentation clearly describes the generation parameters used for each dataset.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important aspects of validation for the ClusBench resource. We address each major comment below and commit to revisions that will strengthen the manuscript by adding the requested empirical support and methodological details.

read point-by-point responses
  1. Referee: [Abstract / generation pipeline] Abstract and generation description: the central claim that non-parametric fitting 'retains much of the nuance in real-world data which is difficult to reproduce in standard simulations' is presented without any empirical validation. No section compares clustering algorithm rankings, error rates, or structural metrics (e.g., cluster separation, overlap, or effective dimension) between the original base datasets and the generated synthetic versions.

    Authors: We agree that the central claim would be more robust with direct empirical validation, which is absent from the current manuscript. In the revised version we will add a new section that computes and compares structural metrics (including measures of cluster separation, overlap, and effective dimension) between each base dataset and its synthetic counterpart. We will also report a targeted comparison of clustering algorithm performance rankings and error rates on a representative subset of matched original-synthetic pairs to demonstrate preservation of relative difficulty. revision: yes

  2. Referee: [Methods / data-generation description] No validation or sensitivity analysis is supplied for the non-parametric fitting step itself (kernel choice, bandwidth selection, handling of labels or categorical features). When upsampling is performed, it is unclear whether the fitted distribution preserves the features that drive method performance differences; this assumption is load-bearing for the benchmark's claimed utility.

    Authors: The manuscript currently gives only a high-level description of the fitting procedure without the requested sensitivity analysis or implementation specifics. We will expand the methods section to document the kernel type, bandwidth selection approach, and handling of categorical or labeled features. We will further add a sensitivity analysis performed on a subset of datasets that varies these choices and examines the resulting impact on structural properties. This analysis will directly test whether upsampling preserves performance-relevant features and will be linked to the new validation section. revision: yes

Circularity Check

0 steps flagged

No circularity: data-generation pipeline with no derivation chain

full rationale

The paper describes fitting non-parametric distributions to real base datasets to produce synthetic versions, presented as a curation pipeline rather than any mathematical derivation, prediction, or uniqueness result. No equations, self-citations for theorems, fitted inputs renamed as predictions, or ansatzes appear in the provided text. The central claim concerns the practical utility of the generated resource and does not reduce to its inputs by construction; the reader's assessment of score 1.0 is consistent with this self-contained empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the unverified premise that non-parametric fitting from real data produces superior benchmarks; no free parameters, axioms, or invented entities are detailed in the abstract.

pith-pipeline@v0.9.1-grok · 5653 in / 1029 out tokens · 22472 ms · 2026-06-27T10:56:24.934784+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

68 extracted references · 5 canonical work pages · 2 internal anchors

  1. [1]

    Journal of Computational and Graphical Statistics , pages=

    Improving Spectral Clustering Using the Asymptotic Value of the Normalized Cut , author=. Journal of Computational and Graphical Statistics , pages=. 2019 , publisher=

  2. [2]

    Data in brief , volume=

    Clustering benchmark datasets exploiting the fundamental clustering problems , author=. Data in brief , volume=. 2020 , publisher=

  3. [3]

    SoftwareX , volume=

    A framework for benchmarking clustering algorithms , author=. SoftwareX , volume=. 2022 , publisher=

  4. [4]

    2024 , note =

    ucimlrepo: Explore UCI ML Repository Datasets , author =. 2024 , note =

  5. [5]

    Biometrics , pages=

    A method for cluster analysis , author=. Biometrics , pages=. 1965 , publisher=

  6. [6]

    Journal of Statistical Software , volume=

    MixSim: An R package for simulating data to study performance of clustering algorithms , author=. Journal of Statistical Software , volume=

  7. [7]

    The Journal of Machine Learning Research , volume=

    Minimum density hyperplanes , author=. The Journal of Machine Learning Research , volume=. 2016 , publisher=

  8. [8]

    IEEE transactions on pattern analysis and machine intelligence , volume=

    Clustering by minimum cut hyperplanes , author=. IEEE transactions on pattern analysis and machine intelligence , volume=. 2016 , publisher=

  9. [9]

    Advances in Neural Information Processing Systems , pages=

    Small-variance asymptotics for exponential family Dirichlet process mixture models , author=. Advances in Neural Information Processing Systems , pages=

  10. [10]

    , author=

    A density-based algorithm for discovering clusters in large spatial databases with noise. , author=. Kdd , volume=

  11. [11]

    The Annals of Statistics , volume=

    Generalized density clustering , author=. The Annals of Statistics , volume=. 2010 , publisher=

  12. [12]

    Journal of the American statistical Association , volume=

    Model-based clustering, discriminant analysis, and density estimation , author=. Journal of the American statistical Association , volume=. 2002 , publisher=

  13. [13]

    Computational statistics & Data analysis , volume=

    A classification EM algorithm for clustering and two stochastic versions , author=. Computational statistics & Data analysis , volume=. 1992 , publisher=

  14. [14]

    Journal of the American Statistical Association , volume=

    Finding the number of clusters in a dataset: An information-theoretic approach , author=. Journal of the American Statistical Association , volume=. 2003 , publisher=

  15. [15]

    On Differentiating Parameterized Argmin and Argmax Problems with Application to Bi-level Optimization

    On differentiating parameterized argmin and argmax problems with application to bi-level optimization , author=. arXiv preprint arXiv:1607.05447 , year=

  16. [16]

    PLoS computational biology , volume=

    Uncovering a macrophage transcriptional program by integrating evidence from motif scanning and expression dynamics , author=. PLoS computational biology , volume=. 2008 , publisher=

  17. [17]

    , author=

    X-means: Extending k-means with efficient estimation of the number of clusters. , author=. Icml , volume=

  18. [18]

    Information retrieval , volume=

    A comparison of extrinsic clustering evaluation metrics based on formal constraints , author=. Information retrieval , volume=. 2009 , publisher=

  19. [19]

    IEEE Transactions on Pattern Analysis and Machine Intelligence , volume=

    Blobworld: Image segmentation using expectation-maximization and its application to image querying , author=. IEEE Transactions on Pattern Analysis and Machine Intelligence , volume=. 2002 , publisher=

  20. [20]

    Journal of Marketing Research , pages=

    LADI: A latent discriminant model for analyzing marketing research data , author=. Journal of Marketing Research , pages=. 1989 , publisher=

  21. [21]

    Molecular Ecology Resources , volume=

    Spatially explicit Bayesian clustering models in population genetics , author=. Molecular Ecology Resources , volume=. 2010 , publisher=

  22. [22]

    The annals of Statistics , pages=

    Projection pursuit , author=. The annals of Statistics , pages=. 1985 , publisher=

  23. [23]

    Journal of the American statistical Association , volume=

    How biased is the apparent error rate of a prediction rule? , author=. Journal of the American statistical Association , volume=. 1986 , publisher=

  24. [24]

    Journal of the American Statistical association , volume=

    Objective criteria for the evaluation of clustering methods , author=. Journal of the American Statistical association , volume=. 1971 , publisher=

  25. [25]

    Journal of classification , volume=

    Comparing partitions , author=. Journal of classification , volume=. 1985 , publisher=

  26. [26]

    Breitenbach, Markus , url =

  27. [27]

    Bache and M

    K. Bache and M. Lichman , year =

  28. [28]

    Selected papers of Hirotugu Akaike , pages=

    Information theory and an extension of the maximum likelihood principle , author=. Selected papers of Hirotugu Akaike , pages=. 1998 , publisher=

  29. [29]

    2013 , url =

    R: A Language and Environment for Statistical Computing , author =. 2013 , url =

  30. [30]

    The annals of statistics , volume=

    Estimating the dimension of a model , author=. The annals of statistics , volume=. 1978 , publisher=

  31. [31]

    Statistica Sinica , pages=

    Degrees of freedom and model search , author=. Statistica Sinica , pages=. 2015 , publisher=

  32. [32]

    IEEE transactions on signal processing , volume=

    Unbiased risk estimates for singular value thresholding and spectral estimators , author=. IEEE transactions on signal processing , volume=. 2013 , publisher=

  33. [33]

    Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms

    Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms , author=. arXiv preprint arXiv:1708.07747 , year=

  34. [34]

    1993 , howpublished =

    Srinivasan, Ashwin , title =. 1993 , howpublished =

  35. [35]

    2025 , note =

    Fast Unified Random Forests for Survival, Regression, and Classification (RF-SRC) , author =. 2025 , note =

  36. [36]

    Proceedings of the 26th annual international conference on machine learning , pages=

    Information theoretic measures for clusterings comparison: is a correction for chance necessary? , author=. Proceedings of the 26th annual international conference on machine learning , pages=

  37. [37]

    Journal of Classification , volume=

    Comparing partitions , author=. Journal of Classification , volume=. 1985 , publisher=

  38. [38]

    Journal of the American statistical association , volume=

    Hierarchical grouping to optimize an objective function , author=. Journal of the American statistical association , volume=. 1963 , publisher=

  39. [39]

    , title =

    Quinlan, R. , title =. 1993 , howpublished =

  40. [40]

    Silverman, B. W. , title =

  41. [41]

    BioData mining , volume=

    PMLB: a large benchmark suite for machine learning evaluation and comparison , author=. BioData mining , volume=. 2017 , publisher=

  42. [42]

    IEEE Transactions on pattern analysis and machine intelligence , volume=

    Mean shift: A robust approach toward feature space analysis , author=. IEEE Transactions on pattern analysis and machine intelligence , volume=. 2002 , publisher=

  43. [43]

    The annals of Statistics , pages=

    Estimation of the mean of a multivariate normal distribution , author=. The annals of Statistics , pages=. 1981 , publisher=

  44. [44]

    Journal of the Royal Statistical Society: Series C (Applied Statistics) , volume=

    A K-means clustering algorithm , author=. Journal of the Royal Statistical Society: Series C (Applied Statistics) , volume=. 1979 , publisher=

  45. [45]

    Journal of the Royal Statistical Society: Series B (Statistical Methodology) , volume=

    Estimating the number of clusters in a data set via the gap statistic , author=. Journal of the Royal Statistical Society: Series B (Statistical Methodology) , volume=. 2001 , publisher=

  46. [46]

    Proceedings of the Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science , volume=

    Selection of K in K-means clustering , author=. Proceedings of the Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science , volume=. 2005 , publisher=

  47. [47]

    2009 , publisher=

    Finding groups in data: an introduction to cluster analysis , author=. 2009 , publisher=

  48. [48]

    2008 , publisher=

    Introduction to Information Retrieval , author=. 2008 , publisher=

  49. [49]

    Advances in neural information processing systems , pages=

    Learning the k in k-means , author=. Advances in neural information processing systems , pages=

  50. [50]

    2017 , note =

    cluster: Cluster Analysis Basics and Extensions , author =. 2017 , note =

  51. [51]

    Journal of Statistical Software, Articles , volume =

    Malika Charrad and Nadia Ghazzali and Véronique Boiteau and Azam Niknafs , title =. Journal of Statistical Software, Articles , volume =. 2014 , issn =. doi:10.18637/jss.v061.i06 , url =

  52. [52]

    Computational Statistics & Data Analysis , pages=

    Degrees of freedom and model selection for k-means clustering , author=. Computational Statistics & Data Analysis , pages=. 2020 , publisher=

  53. [53]

    Statistics & Probability Letters , volume=

    On Stein’s unbiased risk estimate for reduced rank estimators , author=. Statistics & Probability Letters , volume=. 2018 , publisher=

  54. [54]

    Pattern Recognition , volume=

    Enhancing principal direction divisive clustering , author=. Pattern Recognition , volume=. 2010 , publisher=

  55. [55]

    Data mining and knowledge discovery , volume=

    Principal direction divisive partitioning , author=. Data mining and knowledge discovery , volume=. 1998 , publisher=

  56. [56]

    IEEE Signal Processing Letters , volume=

    Selecting the number of principal components with SURE , author=. IEEE Signal Processing Letters , volume=. 2014 , publisher=

  57. [57]

    Journal of the American Statistical Association , volume=

    The estimation of prediction error: covariance penalties and cross-validation , author=. Journal of the American Statistical Association , volume=. 2004 , publisher=

  58. [58]

    2005 , publisher=

    Multimodal projection pursuit using the dip statistic , author=. 2005 , publisher=

  59. [59]

    2015 IEEE Symposium Series on Computational Intelligence , pages=

    Maximum clusterability divisive clustering , author=. 2015 IEEE Symposium Series on Computational Intelligence , pages=. 2015 , organization=

  60. [60]

    Hellenic Conference on Artificial Intelligence , pages=

    Clustering of high dimensional data streams , author=. Hellenic Conference on Artificial Intelligence , pages=. 2012 , organization=

  61. [61]

    Systems & Control Letters , volume=

    Stochastic approximation with two time scales , author=. Systems & Control Letters , volume=. 1997 , publisher=

  62. [62]

    Statistics and Computing , volume=

    Divisive clustering of high dimensional data streams , author=. Statistics and Computing , volume=. 2016 , publisher=

  63. [63]

    Journal of the American Statistical Association , volume=

    Cluster identification using projections , author=. Journal of the American Statistical Association , volume=. 2001 , publisher=

  64. [64]

    Journal of Computational and Graphical Statistics , volume =

    Projection Pursuit Clustering for Exploratory Data Analysis , author =. Journal of Computational and Graphical Statistics , volume =. 2003 , doi =

  65. [65]

    Hofmeyr and Nicos G

    David P. Hofmeyr and Nicos G. Pavlidis , title =. 2019 , journal =. doi:10.32614/RJ-2019-046 , url =

  66. [66]

    Journal of Statistical Planning and Inference , volume=

    Multiple outlier detection in multivariate data using projection pursuit techniques , author=. Journal of Statistical Planning and Inference , volume=. 2000 , publisher=

  67. [67]

    arXiv preprint arXiv:2003.09960 , year=

    Efficient clustering for stretched mixtures: Landscape and optimality , author=. arXiv preprint arXiv:2003.09960 , year=

  68. [68]

    Introduction to

    Michael Hahsler and Matthew Bola. Introduction to. Journal of Statistical Software , year =