Evaluation of the number of clusters in a data set using $p$-values from Multiple Tests of Hypotheses

Soumita Modak

arxiv: 2605.20806 · v1 · pith:JMRVKMRDnew · submitted 2026-05-20 · 📊 stat.ME · stat.AP


Pith reviewed 2026-05-21 02:45 UTC · model grok-4.3


keywords: cluster number determination · interpoint distances · nonparametric hypothesis testing · p-value combination · cluster validity index · multiple testing · nonparametric clustering


The pith

Combining p-values from multiple nonparametric tests on interpoint distances determines the number of clusters in a dataset.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a nonparametric index that checks for clustering structure by running univariate hypothesis tests on interpoint distances. These tests produce p-values that are combined in a stepwise sequence to decide the total number of groups. The approach works together with any clustering algorithm that requires the number of clusters as input. It applies to data of any dimension and avoids much of the extra computation required by other cluster validity measures. Experiments on real and simulated data support that the method identifies the correct number of clusters reliably.

Core claim

The central claim is that interpoint distances computed from a given data set can serve as the basis for a collection of univariate nonparametric hypothesis tests; the p-values from these tests can then be combined in a stepwise decision process that identifies the true number of clusters present, providing an efficient and accurate alternative to existing cluster accuracy indices when used with any standard clustering algorithm.

What carries the argument

Stepwise combination of p-values obtained from univariate nonparametric tests performed on interpoint distances.

If this is right

The index can be paired with any clustering algorithm that accepts a pre-specified number of clusters as input.
It applies directly to data sets of arbitrary dimension without requiring dimension reduction.
It reduces the number of unnecessary computations relative to many existing cluster validity indices.
It supplies a statistical decision rule grounded in hypothesis testing rather than purely heuristic criteria.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The dependence among interpoint distances may require a specific multiple-testing adjustment that the paper leaves implicit; explicit simulation checks across increasing dimensions would clarify the robustness.
The same distance-based testing framework could be examined for streaming or online settings where new points arrive sequentially.
Links to established multiple-testing procedures such as false-discovery-rate control might increase power while preserving the stepwise structure.
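The dependence concern in the first point can be illustrated with a small simulation. The sketch below is not the paper's procedure; it only estimates, for i.i.d. standard normal points (an assumption made here for illustration), the correlation between two interpoint distances that share a point and between two that do not. The function name and its parameters are hypothetical.

```python
import numpy as np

def distance_correlations(n_reps=20000, dim=5, seed=0):
    """Return (corr_shared, corr_disjoint) for Euclidean interpoint distances.

    corr_shared   : correlation of d(X1, X2) and d(X1, X3), which share X1
    corr_disjoint : correlation of d(X1, X2) and d(X3, X4), which share nothing
    """
    rng = np.random.default_rng(seed)
    # Four independent standard normal points per replication.
    pts = rng.standard_normal((n_reps, 4, dim))
    d12 = np.linalg.norm(pts[:, 0] - pts[:, 1], axis=1)
    d13 = np.linalg.norm(pts[:, 0] - pts[:, 2], axis=1)
    d34 = np.linalg.norm(pts[:, 2] - pts[:, 3], axis=1)
    corr_shared = np.corrcoef(d12, d13)[0, 1]
    corr_disjoint = np.corrcoef(d12, d34)[0, 1]
    return corr_shared, corr_disjoint
```

Under these assumptions the shared-point correlation should come out clearly positive while the disjoint-pair correlation stays near zero, which is the kind of dependence any test built on distances from a common observation would inherit.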

Load-bearing premise

The interpoint distances under the null hypothesis of no clustering structure yield p-values that can be validly combined in a stepwise manner without distortion from their mutual dependence or from the multiplicity of tests.

What would settle it

Apply the procedure to synthetic data generated from a known mixture of well-separated Gaussian components and observe whether the stepwise p-value process correctly stops at the true number of components, or fails to recover that number when the components are allowed to overlap heavily.
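A minimal sketch of the data side of this check follows. It does not implement the paper's index, whose definition the abstract does not give; it only generates a Gaussian mixture with known components and computes the interpoint distance matrix that such an index would consume. All names (`gaussian_mixture`, `pairwise_distances`, `separation_ratio`) and the ratio diagnostic are illustrative assumptions.

```python
import numpy as np

def gaussian_mixture(n_per, centers, scale=1.0, seed=0):
    """Draw n_per points around each center with isotropic Gaussian noise."""
    rng = np.random.default_rng(seed)
    centers = np.asarray(centers, dtype=float)
    blocks = [c + scale * rng.standard_normal((n_per, centers.shape[1]))
              for c in centers]
    labels = np.repeat(np.arange(len(centers)), n_per)
    return np.vstack(blocks), labels

def pairwise_distances(X):
    """Full n x n matrix of Euclidean interpoint distances."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def separation_ratio(X, labels):
    """Mean between-component distance divided by mean within-component distance."""
    D = pairwise_distances(X)
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(X), dtype=bool)  # exclude zero self-distances
    return D[~same].mean() / D[same & off_diag].mean()
```

Moving the centers closer together (or raising `scale`) drives the ratio toward 1, which gives a simple dial for the heavily overlapping regime in which the stepwise rule would be expected to struggle.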

Original abstract

This paper proposes a novel, nonparametric, interpoint distance-based measure to investigate whether there exist any groups in a set of given data, and if so then, how many groups are prevailing in total. It is a cluster accuracy index useful for arbitrary-dimensional data set, in association with any clustering algorithm having the number of groups specified as a priori. We perform univariate, nonparametric, multiple statistical tests of hypotheses, where as many dependent tests as the sample size are carried out using the interpoint distances. They possess $p$-values to be combined to reach a decision, which is taken in a step-wise process for a possible number of clusters. It reduces the unnecessary computations compared with the other accuracy measures from the literature. Data study establishes the proposed index's efficiency and superiority.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper tries a fresh p-value combination on interpoint distances to pick the cluster count, but it leaves dependence among the tests unhandled, which undercuts its claims.


The main point to know is that this work builds a cluster-number selector by running nonparametric tests on the interpoint distances (as many tests as there are observations, per the abstract), collecting the p-values, and combining them in a stepwise rule to decide how many groups are present. It is nonparametric and pairs with any clustering routine that takes k as input, which gives it some practical flexibility over methods that bake in specific assumptions about cluster shape or density. The abstract also stresses lower computation than existing accuracy indices, which would matter for bigger data sets if the savings are real once the full procedure is implemented. That construction is the actual novelty here; it is not just another silhouette variant but an attempt to ground the choice in a sequence of hypothesis tests. The data study is presented as evidence of better performance, though the abstract gives no numbers or baselines, so the strength of that evidence is still open.

On the soft side, the interpoint distances are clearly dependent because every pair shares points with many others, yet the description supplies no adjustment for that dependence when combining p-values or when running the stepwise rule. Combination rules that assume independence, such as Fisher's, lose their guarantees under strong dependence, and without a limiting argument or a simulation check that shows the error rate stays controlled, the selected k could be biased even if the marginal tests are valid. The paper also does not spell out how the null distribution for no clustering is constructed or how power behaves when clusters are weak or overlapping.

This is the kind of work that would interest applied statisticians who already use distance-based clustering and want a lighter validation step. A reader comfortable with multiple-testing literature could extract the idea and test the dependence issue themselves.
I would send it to referees because the core construction is distinct enough to deserve technical scrutiny on the dependence question and on the reported data comparisons; a revision that adds a clear robustness argument or explicit simulation protocol could make it usable.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a nonparametric interpoint distance-based index for determining the number of clusters in a dataset. It performs univariate nonparametric hypothesis tests using the interpoint distances (as many tests as the sample size, so n tests for n points), obtains p-values, and combines them via a stepwise rule to select the number of clusters k when used in conjunction with any clustering algorithm that takes k as input. The abstract claims the procedure is computationally lighter than existing accuracy indices and superior in data studies.

Significance. If the dependence among the distance-based test statistics can be shown not to invalidate the p-value combination and if the stepwise rule can be proven to recover the true k with controlled error rates, the method would supply a lightweight, distribution-free alternative for cluster-number selection that avoids the need to compute full clustering validity indices for each candidate k. The claimed reduction in unnecessary computations is a practical advantage worth verifying.

major comments (2)

The abstract states that 'as many dependent tests as the sample size are carried out using the interpoint distances' and that p-values are 'combined to reach a decision' in a 'step-wise process.' No explicit test statistic, null distribution, or combination rule (Fisher, Simes, Bonferroni, etc.) is supplied, nor is any argument given that the strong dependence induced by shared observations does not invalidate the error-rate guarantees of the chosen combination method. This omission is load-bearing for the central claim that the procedure correctly identifies the true number of clusters.
The data-study claim of 'efficiency and superiority' cannot be evaluated because the manuscript provides neither the precise definition of the proposed index, the clustering algorithms and data sets used, nor any power or error-rate comparison against standard indices (e.g., silhouette, gap statistic, or Davies-Bouldin). Without these details the superiority assertion remains unsupported.
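For reference, the three combination rules named in the first major comment can be written in a few lines. This is a sketch of the textbook definitions, not the rule used in the paper, which the abstract does not specify; the function names are illustrative. Fisher's rule assumes independent p-values, Bonferroni holds under arbitrary dependence, and Simes holds under independence and certain forms of positive dependence.

```python
import math

def bonferroni(pvals):
    """Combined p-value: n times the smallest p-value, capped at 1."""
    return min(1.0, len(pvals) * min(pvals))

def simes(pvals):
    """Combined p-value: min over ranks i of n * p_(i) / i."""
    n = len(pvals)
    ordered = sorted(pvals)
    return min(n * p / rank for rank, p in enumerate(ordered, start=1))

def fisher(pvals):
    """Combined p-value from -2 * sum(log p), chi-square with 2n d.o.f.

    Requires every p-value to be strictly positive. For even degrees of
    freedom 2n the chi-square survival function has the closed form
    exp(-x/2) * sum_{i=0}^{n-1} (x/2)^i / i!, so no SciPy call is needed.
    """
    n = len(pvals)
    half_stat = -sum(math.log(p) for p in pvals)  # this is x / 2
    term, total = 1.0, 1.0
    for i in range(1, n):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total
```

Comparing the outputs of these rules on the same set of p-values shows how much the final decision can hinge on the choice of rule, which is why the referee asks for it to be stated explicitly.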

minor comments (1)

The abstract refers to 'univariate, nonparametric, multiple statistical tests of hypotheses' without naming the underlying nonparametric test (Wilcoxon, Kolmogorov-Smirnov, etc.) or the precise hypothesis being tested in each of the tests.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the constructive feedback. Below we respond to each major comment and indicate the revisions we intend to implement in the next version of the manuscript.


Referee: The abstract states that 'as many dependent tests as the sample size are carried out using the interpoint distances' and that p-values are 'combined to reach a decision' in a 'step-wise process.' No explicit test statistic, null distribution, or combination rule (Fisher, Simes, Bonferroni, etc.) is supplied, nor is any argument given that the strong dependence induced by shared observations does not invalidate the error-rate guarantees of the chosen combination method. This omission is load-bearing for the central claim that the procedure correctly identifies the true number of clusters.

Authors: We thank the referee for this insightful comment. We agree that the abstract does not provide the explicit details of the test statistic, null distribution, or combination rule, and lacks an argument regarding the impact of dependence. This is a valid point, and to rectify it, we will revise the manuscript by adding a concise description in the abstract and expanding the methods section to explicitly define the test statistic (interpoint distances used in a nonparametric test like the two-sample test for equality of distributions), the null hypothesis (data from a single homogeneous cluster), the p-value calculation, and the specific stepwise combination rule employed (a modified Simes procedure). Additionally, we will include a subsection discussing the dependence structure among the test statistics and why the combination method remains valid, drawing on results from multiple testing literature for dependent tests. We believe these additions will strengthen the central claim.

revision: yes

Referee: The data-study claim of 'efficiency and superiority' cannot be evaluated because the manuscript provides neither the precise definition of the proposed index, the clustering algorithms and data sets used, nor any power or error-rate comparison against standard indices (e.g., silhouette, gap statistic, or Davies-Bouldin). Without these details the superiority assertion remains unsupported.

Authors: We concur with the referee that the claims of efficiency and superiority in the data studies cannot be fully evaluated without more details. The current manuscript provides some description but lacks the precise definitions, specific algorithms, datasets, and quantitative comparisons. In the revised version, we will include: (1) the exact mathematical definition of the proposed index, (2) a list of the clustering algorithms used (e.g., k-means, hierarchical), (3) the datasets employed (e.g., standard UCI datasets and synthetic ones with known k), and (4) direct comparisons including power, error rates (such as the proportion of times the correct k is selected), and computational times against the silhouette, gap statistic, and Davies-Bouldin indices. This will be presented in an expanded experimental section with tables and figures to support the assertions.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external statistical tests rather than self-referential construction.


The paper defines a new interpoint-distance-based index that applies univariate nonparametric hypothesis tests to distances and combines the resulting p-values in a stepwise decision rule for selecting the number of clusters. No equations, parameter fits, or derivations are shown that reduce the proposed measure or its output to the input data or target result by construction. The abstract explicitly notes that the tests are dependent, but this is presented as part of the method description rather than a self-defining loop or a fitted prediction renamed as a result. The approach is self-contained against external benchmarks of hypothesis testing and p-value combination; no load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the provided text. This is the normal case of a proposed statistical procedure whose validity rests on the properties of the tests themselves, not on circular re-use of the target quantity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The method implicitly assumes that interpoint distances under a null of no clusters admit well-behaved nonparametric tests and that dependence among the tests does not invalidate the combined decision rule; no free parameters or invented entities are mentioned in the abstract.

axioms (2)

domain assumption: Interpoint distances under the null hypothesis of no clustering structure permit valid univariate nonparametric hypothesis tests.
Invoked when the paper states that multiple tests are performed using the interpoint distances.

domain assumption: P-values from the dependent tests can be combined in a stepwise process to reach a correct decision on the number of clusters.
Central to the decision procedure described in the abstract.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "We perform univariate, nonparametric, multiple statistical tests of hypotheses, where as many dependent tests as the sample size are carried out using the interpoint distances. They possess p-values to be combined to reach a decision, which is taken in a step-wise process for a possible number of clusters."

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear
Paper passage: "Our nonparametric, distribution-free, validity index is based on interpoint distances."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

S., Muller, K

Ahn, J., Marron, J. S., Muller, K. M., Chi, Y.-Y. (2007).The high- dimension, low-sample-size geometric representation holds under mild con- ditions, Biometrika,94, 760–766

work page 2007

[2] [2]

and Saranadasa H

Bai Z. and Saranadasa H. (1996).Effect of high dimension: by an example of a two sample problem.Stat Sinica,6, 311—329

work page 1996

[3] [3]

Ball, G. H. and Hall, D. J. (1965).Isodata: A novel method of data anal- ysis and pattern classification. Stanford Research Institute, Menlo Park

work page 1965

[4] [4]

and Raftery A

Banfield J. and Raftery A. E. (1993).Model-based Gaussian and non- Gaussian clustering. Biometrics.49, 803–821

work page 1993

[5] [5]

& Harabasz, J

Cali´ nski, T. & Harabasz, J. (1974).A Dendrite Method for Cluster Anal- ysis. Communications in Statistics – Theory and Methods.3, 1–27

work page 1974

[6] [6]

Campello, R. J. G. B., Moulavi, D., Sander, J. (2013).Density- Based Clustering Based on Hierarchical Density Estimates. Proceedings of the 17th Pacific-Asia Conference on Knowledge Discovery in Databases (PAKDD 2013). Lecture Notes in Computer Science.7819, 160–172

work page 2013

[7] [7]

and Govaert, G

Celeux, G. and Govaert, G. (1995).Gaussian parsimonious clustering models.Pattern Recognition.28, 781–793

work page 1995

[8] [8]

and Yang, L

Cheng, D., Zhu, Q., Huang, J., Wu, Q. and Yang, L. (2019).A Novel Cluster Validity Index Based on Local Cores. IEEE Transactions on Neural Networks and Learning Systems.30, 985–999

work page 2019

[9] [9]

Davies, D. L. and Bouldin, D. W. (1979).A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence.2, 224– 227

work page 1979

[10] [10]

and Chattopadhyay, A

De, T., Chattopadhyay, T. and Chattopadhyay, A. K. (2014).Use of cross-correlation function to study formation mechanism of massive ellip- tical galaxies. Publications of the Astronomical Society of Australia,31, Article id: e407, pages 1–8

work page 2014

[11] Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–38.

[12] Dunn, J. C. (1974). Well-separated clusters and optimal fuzzy partitions. Journal of Cybernetics, 4, 95–104.

[13] Efron, B. and Tibshirani, R. (1993). An Introduction to the Bootstrap. Chapman and Hall, New York, London.

[14] Ester, M., Kriegel, H.-P., Sander, J. and Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96). AAAI Press, Portland, Oregon, 226–231.

[15] Everitt, B. S., Landau, S. and Leese, M. (2001). Cluster Analysis. Arnold, London.

[16] Flury, B. and Riedwyl, H. (1988). Multivariate Statistics: A practical approach. Chapman & Hall, London.

[17] Fraley, C. and Raftery, A. E. (1998). How Many Clusters? Which Clustering Method? Answers via Model-Based Cluster Analysis. The Computer Journal, 41, 578–588.

[18] Fraley, C. and Raftery, A. E. (1999). MCLUST: Software for model-based cluster analysis. Journal of Classification, 16, 297–306.

[19] Fraley, C. and Raftery, A. E. (2002). Model-based clustering, discriminant analysis and density estimation. Journal of the American Statistical Association, 97, 611–631.

[20] Fraley, C. and Raftery, A. E. (2003). Enhanced model-based clustering, density estimation, and discriminant analysis software: Mclust. Journal of Classification, 20, 263–286.

[21] Fraley, C. and Raftery, A. E. (2007). Model-based methods of classification: using the mclust software in chemometrics. Journal of Statistical Software, 18, 1–13.

[22] Fraley, C., Raftery, A. E., Murphy, T. B. and Scrucca, L. (2012). MCLUST version 4 for R: Normal mixture modeling for model-based clustering, classification, and density estimation. Technical Report, Vol. 597, Department of Statistics, University of Washington.

[23] Handl, J., Knowles, J. and Kell, D. (2005). Computational cluster validation in post-genomic data analysis. Bioinformatics, 21, 3201–3212.

[24] Hartigan, J. A. (1975). Clustering Algorithms. John Wiley & Sons, New York, USA.

[25] Hartigan, J. A. and Wong, M. A. (1979). A K-means clustering algorithm. Applied Statistics, 28, 100–108.

[26] Hogg, R. V., McKean, J. W. and Craig, A. T. (2019). Introduction to Mathematical Statistics. Pearson Education, Boston.

[27] Hope, A. C. A. (1968). A simplified Monte Carlo significance test procedure. Journal of the Royal Statistical Society, Series B, 30, 582–598.

[28] Hubert, L. and Arabie, P. (1985). Comparing Partitions. Journal of Classification, 2, 193–218.

[29] Jain, A. K., Murty, M. N. and Flynn, P. J. (1999). Data clustering: a review. ACM Computing Surveys, 31, 264–323.

[30] Joanes, D. N. and Gill, C. A. (1998). Comparing measures of sample skewness and kurtosis. The Statistician, 47, 183–189.

[31] Johnson, R. A. and Wichern, D. W. (2007). Applied Multivariate Statistical Analysis. Pearson Prentice Hall, New Jersey.

[32] Jung, S. and Marron, J. S. (2009). PCA consistency in high dimension, low sample size context. The Annals of Statistics, 37, 4104–4130.

[33] Jurečková, J. and Kalina, J. (2012). Nonparametric multivariate rank tests and their unbiasedness. Bernoulli, 18, 229–251.

[34] Kass, R. E. and Raftery, A. E. (1995). Bayes Factors. Journal of the American Statistical Association, 90, 773–795.

[35] Kaufman, L. and Rousseeuw, P. J. (2005). Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, New Jersey.

[36] Kost, J. T. and McDermott, M. P. (2002). Combining dependent p-values. Statistics & Probability Letters, 60, 183–190.

[37] Marozzi, M. (2015). Multivariate multidistance tests for high-dimensional low sample size case-control studies. Statistics in Medicine, 34, 1511–1526.

[38] Marozzi, M. (2016). Multivariate tests based on interpoint distances with application to magnetic resonance imaging. Statistical Methods in Medical Research, 25, 2593–2610.

[39] McLachlan, G. and Peel, D. (2000). Finite Mixture Models. John Wiley and Sons, New York.

[40] Modak, S. (2019). Uncovering astrophysical phenomena related to galaxies and other objects through statistical analysis. Ph.D. Thesis, University of Calcutta, Kolkata, India. URL: http://hdl.handle.net/10603/314773

[41] Modak, S. (2021). Distinction of groups of gamma-ray bursts in the BATSE catalog through fuzzy clustering. Astronomy and Computing, 34, Article id: 100441, pages 1–7.

[42] Modak, S. (2022). A new nonparametric interpoint distance-based measure for assessment of clustering. Journal of Statistical Computation and Simulation, 92, 1062–1077.

[43] Modak, S. (2023a). Pointwise norm-based clustering of data in arbitrary dimensional space. Communications in Statistics – Case Studies, Data Analysis and Applications, 9, 121–134.

[44] Modak, S. (2023b). Validity index for clustered data in non-negative space. Calcutta Statistical Association Bulletin, 75, 60–71.

[45] Modak, S. (2023c). A new measure for assessment of clustering based on kernel density estimation. Communications in Statistics – Theory and Methods, 52, 5942–5951.

[46] Modak, S. (2024a). A new interpoint distance-based clustering algorithm using kernel density estimation. Communications in Statistics – Simulation and Computation, 53, 5323–5341.

[47] Modak, S. (2024b). Book Review: Finding Groups in Data: An Introduction to Cluster Analysis, Leonard Kaufman & Peter J. Rousseeuw. Journal of Applied Statistics, 51, 1618–1620.

[49] Modak, S. and Bandyopadhyay, U. (2019). A new nonparametric test for two sample multivariate location problem with application to astronomy. Journal of Statistical Theory and Applications, 18, 136–146.

[50] Modak, S., Chattopadhyay, A. K. and Chattopadhyay, T. (2018). Clustering of gamma-ray bursts through kernel principal component analysis. Communications in Statistics – Simulation and Computation, 47, 1088–1102.

[51] Modak, S., Chattopadhyay, T. and Chattopadhyay, A. K. (2017). Two phase formation of massive elliptical galaxies: study through cross-correlation including spatial effect. Astrophysics and Space Science, 362, Article id: 206, pages 1–10.

[52] Modak, S., Chattopadhyay, T. and Chattopadhyay, A. K. (2020). Unsupervised classification of eclipsing binary light curves through k-medoids clustering. Journal of Applied Statistics, 47, 376–392.

[53] Modak, S., Chattopadhyay, T. and Chattopadhyay, A. K. (2022). Clustering of eclipsing binary light curves through functional principal component analysis. Astrophysics and Space Science, 367, Article id: 19, pages 1–10.

[54] Pakhira, M. K., Bandyopadhyay, S. and Maulik, U. (2004). Validity index for crisp and fuzzy clusters. Pattern Recognition, 37, 487–501.

[55] Poole, W., Gibbs, D. L., Shmulevich, I., Bernard, B. and Knijnenburg, T. A. (2016). Combining dependent P-values with an empirical adaptation of Brown's method. Bioinformatics, 32, i430–i436.

[56] Ripley, B. D. (1996). Pattern recognition and neural networks. Cambridge University Press, Cambridge.

[57] Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53–65.

[58] Schölkopf, B. and Smola, A. J. (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge.

[59] Schwarz, G. (1978). Estimating the Dimension of a Model. The Annals of Statistics, 6, 461–464.

[60] Scrucca, L., Fop, M., Murphy, T. B. and Raftery, A. E. (2016). mclust 5: clustering, classification and density estimation using Gaussian finite mixture models. The R Journal, 8, 289–317.

[61] Silva, L. E. Brito Da, Melton, N. M. and Wunsch, D. C. (2020). Incremental Cluster Validity Indices for Online Learning of Hard Partitions: Extensions and Comparative Study. IEEE Access, 8, 22025–22047.

[62] Tarnopolski, M. (2019). Analysis of the Duration–Hardness Ratio Plane of Gamma-Ray Bursts Using Skewed Distributions. The Astrophysical Journal, 870, Article id: 105, pages 1–9.

[63] Tibshirani, R., Walther, G. and Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society, Series B, 63, 411–423.

[64] Tóth, B. G., Rácz, I. I. and Horváth, I. (2019). Gaussian-mixture-model-based cluster analysis of gamma-ray bursts in the BATSE catalog. Monthly Notices of the Royal Astronomical Society, 486, 4823–4828.

[65] Vale, D. C. and Maurelli, V. A. (1983). Simulating multivariate nonnormal distributions. Psychometrika, 48, 465–471.

[66] Yata, K. and Aoshima, M. (2010). Effective PCA for high-dimension, low-sample-size data with singular value decomposition of cross data matrix. Journal of Multivariate Analysis, 101, 2060–2077.