arxiv: 2605.20806 · v1 · pith:JMRVKMRDnew · submitted 2026-05-20 · 📊 stat.ME · stat.AP

Evaluation of the number of clusters in a data set using p-values from Multiple Tests of Hypotheses

Pith reviewed 2026-05-21 02:45 UTC · model grok-4.3

classification 📊 stat.ME stat.AP
keywords cluster number determination · interpoint distances · nonparametric hypothesis testing · p-value combination · cluster validity index · multiple testing · nonparametric clustering

The pith

Combining p-values from multiple nonparametric tests on interpoint distances determines the number of clusters in a dataset.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a nonparametric index that checks for clustering structure by running univariate hypothesis tests on interpoint distances. These tests produce p-values that are combined in a stepwise sequence to decide the total number of groups. The approach works together with any clustering algorithm that requires the number of clusters as input. It applies to data of any dimension and avoids much of the extra computation required by other cluster validity measures. Experiments on real and simulated data are reported to show that the method reliably identifies the correct number of clusters.

Core claim

The central claim is that interpoint distances computed from a given data set can serve as the basis for a collection of univariate nonparametric hypothesis tests; the p-values from these tests can then be combined in a stepwise decision process that identifies the true number of clusters present, providing an efficient and accurate alternative to existing cluster accuracy indices when used with any standard clustering algorithm.

What carries the argument

Stepwise combination of p-values obtained from univariate nonparametric tests performed on interpoint distances.
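
The stepwise idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source does not say in which direction the steps run, what significance level is used, or how the combined p-value for a candidate number of clusters is computed. Here `combined_pvalue` is a hypothetical callable supplied by the user, the search starts at k = 1, and it stops at the first k whose combined p-value is not significant at a hypothetical level `alpha`.

```python
from typing import Callable, Optional


def stepwise_select_k(
    combined_pvalue: Callable[[int], float],
    k_max: int,
    alpha: float = 0.05,
) -> Optional[int]:
    """Return the first k in 1..k_max whose combined p-value is >= alpha.

    combined_pvalue(k) is a HYPOTHETICAL callable (not from the paper): it is
    assumed to cluster the data into k groups, run the distance-based tests,
    and return one combined p-value for the null "k groups are enough".
    """
    for k in range(1, k_max + 1):
        p = combined_pvalue(k)
        if p >= alpha:
            # Stop early: larger candidate values of k are never evaluated.
            return k
    # Every candidate was rejected; no k up to k_max is accepted.
    return None
```

The early return is what would save computation relative to an index that must be evaluated for every candidate k before a maximum or minimum is taken.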

If this is right

  • The index can be paired with any clustering algorithm that accepts a pre-specified number of clusters as input.
  • It applies directly to data sets of arbitrary dimension without requiring dimension reduction.
  • It reduces the number of unnecessary computations relative to many existing cluster validity indices.
  • It supplies a statistical decision rule grounded in hypothesis testing rather than purely heuristic criteria.
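
The claim about arbitrary dimension rests on a simple fact: an interpoint distance is a single number whatever the dimension of the data, so tests applied to distances stay univariate. A minimal sketch follows, assuming Euclidean distance (the source does not name the metric) and a hypothetical helper name `interpoint_distances`.

```python
import math
from typing import List, Sequence


def interpoint_distances(points: Sequence[Sequence[float]]) -> List[List[float]]:
    """Symmetric n x n matrix of Euclidean distances between all point pairs.

    Euclidean distance is an ASSUMPTION made for illustration only.
    """
    n = len(points)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])  # works for any dimension
            dist[i][j] = d
            dist[j][i] = d
    return dist
```

The same function handles 2-dimensional and 50-dimensional inputs. For n points it yields n(n-1)/2 distinct distances, and row i holds the n-1 distances from observation i to every other observation.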

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The dependence among interpoint distances may require a specific multiple-testing adjustment that the paper leaves implicit; explicit simulation checks across increasing dimensions would clarify the robustness.
  • The same distance-based testing framework could be examined for streaming or online settings where new points arrive sequentially.
  • Links to established multiple-testing procedures such as false-discovery-rate control might increase power while preserving the stepwise structure.
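
For readers unfamiliar with false-discovery-rate control, the Benjamini-Hochberg step-up rule mentioned in the last bullet can be sketched as below. This is a generic textbook procedure, not something the paper is stated to use. Its FDR guarantee is known to hold under independence or positive regression dependence; whether p-values built from shared interpoint distances satisfy such a condition is not established here.

```python
from typing import List, Sequence


def benjamini_hochberg(pvalues: Sequence[float], q: float = 0.05) -> List[bool]:
    """Benjamini-Hochberg step-up rule: True marks a rejected hypothesis."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest 1-based rank r with p_(r) <= r * q / m.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * q / m:
            cutoff = rank
    # Reject every hypothesis whose p-value is among the cutoff smallest.
    reject = [False] * m
    for idx in order[:cutoff]:
        reject[idx] = True
    return reject
```

Because the rule is step-up, a small p-value at a later rank can pull in earlier ones that individually miss their own thresholds.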

Load-bearing premise

The interpoint distances under the null hypothesis of no clustering structure yield p-values that can be validly combined in a stepwise manner without distortion from their mutual dependence or from the multiplicity of tests.

What would settle it

Apply the procedure to synthetic data generated from a known mixture of well-separated Gaussian components and observe whether the stepwise p-value process correctly stops at the true number of components, or fails to recover that number when the components are allowed to overlap heavily.
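
A test bed for such a check could be generated as in the sketch below. Everything here is an illustrative assumption: the function name `gaussian_mixture`, unit-variance spherical components, equal cluster sizes, and centres spaced `separation` apart along the first coordinate. A large `separation` gives well-separated components, and a small one makes them overlap heavily.

```python
import random
from typing import List, Tuple


def gaussian_mixture(
    n_per_cluster: int,
    k: int,
    dim: int,
    separation: float,
    seed: int = 0,
) -> Tuple[List[List[float]], List[int]]:
    """Draw k spherical unit-variance Gaussian clusters with known labels.

    Cluster c is centred at (c * separation, 0, ..., 0); all design choices
    are hypothetical and serve only to provide data with a known true k.
    """
    rng = random.Random(seed)  # local generator keeps runs reproducible
    points: List[List[float]] = []
    labels: List[int] = []
    for c in range(k):
        for _ in range(n_per_cluster):
            x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            x[0] += c * separation  # shift along the first axis only
            points.append(x)
            labels.append(c)
    return points, labels
```

Running the selection procedure on, say, `gaussian_mixture(100, 3, 5, separation=10.0)` and again with `separation=1.0` would contrast the easy and hard regimes described above.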

Original abstract

This paper proposes a novel, nonparametric, interpoint distance-based measure to investigate whether there exist any groups in a set of given data, and if so then, how many groups are prevailing in total. It is a cluster accuracy index useful for arbitrary-dimensional data set, in association with any clustering algorithm having the number of groups specified as a priori. We perform univariate, nonparametric, multiple statistical tests of hypotheses, where as many dependent tests as the sample size are carried out using the interpoint distances. They possess $p$-values to be combined to reach a decision, which is taken in a step-wise process for a possible number of clusters. It reduces the unnecessary computations compared with the other accuracy measures from the literature. Data study establishes the proposed index's efficiency and superiority.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a nonparametric interpoint distance-based index for determining the number of clusters in a dataset. It performs univariate nonparametric hypothesis tests based on the interpoint distances (as many dependent tests as the sample size according to the abstract, hence n tests for n points), obtains p-values, and combines them via a stepwise rule to select the number of clusters k when used in conjunction with any clustering algorithm that takes k as input. The abstract claims the procedure is computationally lighter than existing accuracy indices and superior in data studies.

Significance. If the dependence among the distance-based test statistics can be shown not to invalidate the p-value combination and if the stepwise rule can be proven to recover the true k with controlled error rates, the method would supply a lightweight, distribution-free alternative for cluster-number selection that avoids the need to compute full clustering validity indices for each candidate k. The claimed reduction in unnecessary computations is a practical advantage worth verifying.

major comments (2)
  1. The abstract states that 'as many dependent tests as the sample size are carried out using the interpoint distances' and that p-values are 'combined to reach a decision' in a 'step-wise process.' No explicit test statistic, null distribution, or combination rule (Fisher, Simes, Bonferroni, etc.) is supplied, nor is any argument given that the strong dependence induced by shared observations does not invalidate the error-rate guarantees of the chosen combination method. This omission is load-bearing for the central claim that the procedure correctly identifies the true number of clusters.
  2. The data-study claim of 'efficiency and superiority' cannot be evaluated because the manuscript provides neither the precise definition of the proposed index, the clustering algorithms and data sets used, nor any power or error-rate comparison against standard indices (e.g., silhouette, gap statistic, or Davies-Bouldin). Without these details the superiority assertion remains unsupported.
minor comments (1)
  1. The abstract refers to 'univariate, nonparametric, multiple statistical tests of hypotheses' without naming the underlying nonparametric test (Wilcoxon, Kolmogorov-Smirnov, etc.) or the precise hypothesis being tested for each interpoint distance.
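
To make the first major comment concrete, the three combination rules it names can be written down as follows. This is a generic sketch of the textbook rules, not the paper's procedure. The function names are illustrative. Bonferroni remains valid under arbitrary dependence, Simes is valid under independence and some forms of positive dependence, and Fisher's chi-square reference distribution assumes independent p-values, which is exactly the property the referee questions.

```python
import math
from typing import Sequence


def bonferroni(pvalues: Sequence[float]) -> float:
    """Combined p-value m * min(p), capped at 1."""
    return min(1.0, len(pvalues) * min(pvalues))


def simes(pvalues: Sequence[float]) -> float:
    """Combined p-value min over ranks i of m * p_(i) / i, capped at 1."""
    m = len(pvalues)
    ordered = sorted(pvalues)
    return min(1.0, min(m * p / rank for rank, p in enumerate(ordered, start=1)))


def fisher(pvalues: Sequence[float]) -> float:
    """Fisher's method, assuming INDEPENDENT p-values, all strictly positive.

    X = -2 * sum(log p) is chi-square with 2m degrees of freedom under the
    null; for even degrees of freedom the survival function has the closed
    form exp(-x/2) * sum_{i=0}^{m-1} (x/2)^i / i!.
    """
    m = len(pvalues)
    half = -sum(math.log(p) for p in pvalues)  # this is X / 2
    term, total = 1.0, 1.0
    for i in range(1, m):
        term *= half / i
        total += term
    return math.exp(-half) * total
```

For the p-values 0.02, 0.025 and 0.03, Bonferroni gives 0.06 while Simes gives 0.03, which shows why the choice of rule, and its validity under dependence, matters for where a stepwise search stops.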

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the constructive feedback. Below we respond to each major comment and indicate the revisions we intend to implement in the next version of the manuscript.

Point-by-point responses
  1. Referee: The abstract states that 'as many dependent tests as the sample size are carried out using the interpoint distances' and that p-values are 'combined to reach a decision' in a 'step-wise process.' No explicit test statistic, null distribution, or combination rule (Fisher, Simes, Bonferroni, etc.) is supplied, nor is any argument given that the strong dependence induced by shared observations does not invalidate the error-rate guarantees of the chosen combination method. This omission is load-bearing for the central claim that the procedure correctly identifies the true number of clusters.

    Authors: We thank the referee for this insightful comment. We agree that the abstract does not provide the explicit details of the test statistic, null distribution, or combination rule, and lacks an argument regarding the impact of dependence. To rectify this, we will revise the manuscript by adding a concise description in the abstract and expanding the methods section to explicitly define the test statistic (interpoint distances used in a nonparametric test such as a two-sample test for equality of distributions), the null hypothesis (data from a single homogeneous cluster), the p-value calculation, and the specific stepwise combination rule employed (a modified Simes procedure). Additionally, we will include a subsection discussing the dependence structure among the test statistics and why the combination method remains valid, drawing on results from the multiple-testing literature for dependent tests. We believe these additions will strengthen the central claim.

    Revision: yes

  2. Referee: The data-study claim of 'efficiency and superiority' cannot be evaluated because the manuscript provides neither the precise definition of the proposed index, the clustering algorithms and data sets used, nor any power or error-rate comparison against standard indices (e.g., silhouette, gap statistic, or Davies-Bouldin). Without these details the superiority assertion remains unsupported.

    Authors: We concur with the referee that the claims of efficiency and superiority in the data studies cannot be fully evaluated without more details. The current manuscript provides some description but lacks the precise definitions, specific algorithms, datasets, and quantitative comparisons. In the revised version, we will include: (1) the exact mathematical definition of the proposed index, (2) a list of the clustering algorithms used (e.g., k-means, hierarchical), (3) the datasets employed (e.g., standard UCI datasets and synthetic ones with known k), and (4) direct comparisons of power, error rates (such as the proportion of times the correct k is selected), and computation times against the silhouette, gap statistic, and Davies-Bouldin indices. These will be presented in an expanded experimental section with tables and figures to support the assertions.

    Revision: yes
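
The error-rate comparison promised in the second response could be summarised with a small helper like the one below. It is a sketch only: the name `selection_summary` and the three-way split into correct, under-estimated and over-estimated k are illustrative choices, not taken from the paper.

```python
from collections import Counter
from typing import Dict, Sequence


def selection_summary(selected: Sequence[int], true_k: int) -> Dict[str, float]:
    """Proportion of replications selecting the true k, fewer, or more clusters."""
    n = len(selected)
    counts = Counter(selected)
    return {
        "correct": counts[true_k] / n,
        "under": sum(c for k, c in counts.items() if k < true_k) / n,
        "over": sum(c for k, c in counts.items() if k > true_k) / n,
    }
```

Calling it once per index (the proposed one, silhouette, gap statistic, Davies-Bouldin) on the values of k each selected across repeated simulations would give the rows of a comparison table.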

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external statistical tests rather than self-referential construction.

Full rationale

The paper defines a new interpoint-distance-based index that applies univariate nonparametric hypothesis tests to distances and combines the resulting p-values in a stepwise decision rule for selecting the number of clusters. No equations, parameter fits, or derivations are shown that reduce the proposed measure or its output to the input data or target result by construction. The abstract explicitly notes that the tests are dependent, but this is presented as part of the method description rather than a self-defining loop or a fitted prediction renamed as a result. The approach is self-contained against external benchmarks of hypothesis testing and p-value combination; no load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the provided text. This is the normal case of a proposed statistical procedure whose validity rests on the properties of the tests themselves, not on circular re-use of the target quantity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The method implicitly assumes that interpoint distances under a null of no clusters admit well-behaved nonparametric tests and that dependence among the tests does not invalidate the combined decision rule; no free parameters or invented entities are mentioned in the abstract.

axioms (2)
  • Domain assumption: Interpoint distances under the null hypothesis of no clustering structure permit valid univariate nonparametric hypothesis tests.
    Invoked when the paper states that multiple tests are performed using the interpoint distances.
  • Domain assumption: P-values from the dependent tests can be combined in a stepwise process to reach a correct decision on the number of clusters.
    Central to the decision procedure described in the abstract.

pith-pipeline@v0.9.0 · 5660 in / 1394 out tokens · 30059 ms · 2026-05-21T02:45:26.404263+00:00 · methodology


