Data compression for fast dimension reduction and clustering of high-dimensional discrete data

Michael Fop; Silvia D'Angelo

arxiv: 2606.10593 · v1 · pith:OAWKW5G3new · submitted 2026-06-09 · 📊 stat.ME · stat.CO

Pith reviewed 2026-06-27 12:30 UTC · model grok-4.3

keywords: data compression, dimension reduction, clustering, high-dimensional discrete data, model-based clustering, positional encoding, injectivity

The pith

A scaled positional encoding compresses high-dimensional discrete data into low dimensions while keeping observations distinct and cluster centroids separable.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a deterministic compression that turns each high-dimensional discrete observation into a low-dimensional continuous vector via weighted sums defined by a scaled positional encoding. This mapping is injective, so distinct input points stay distinct, and under mild regularity conditions the compressed values admit an approximate Gaussian form while centroid separations between clusters are preserved. A sympathetic reader would care because the result supplies a theoretical basis for running standard model-based clustering directly on the reduced data, with the approach applying to binary, categorical, and count-valued observations and delivering clear computational speed gains. Simulations recover clusters accurately across varied settings, and the method is demonstrated on baby-name records and microbiome counts.

Core claim

The central claim is that the compression mapping is injective, ensuring distinct observations remain distinct; that under mild regularity conditions the compressed variables admit an approximate Gaussian representation; and that separation between cluster centroids is preserved, so that location-driven cluster structure remains identifiable after dimension reduction.

What carries the argument

The scaled positional encoding that defines the weighted sums producing the low-dimensional continuous representation of each discrete observation.
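As a rough illustration of what a weighted-sum compression with positional weights could look like, the sketch below splits a discrete vector into contiguous blocks and maps each block to one number in [0, 1). The geometric weights `n_levels ** -j`, the contiguous blocking, and the names `compress`, `block_len`, and `n_levels` are assumptions made for illustration; the abstract does not specify the paper's actual scaled positional encoding.

```python
from fractions import Fraction


def compress(x, block_len, n_levels):
    """Map a discrete vector x (entries in 0..n_levels-1) to one rational per block.

    Hypothetical encoding: each block is read as the digits of a base-n_levels
    fraction, so every compressed coordinate lies in [0, 1).
    """
    if any(v < 0 or v >= n_levels for v in x):
        raise ValueError("entries must lie in 0..n_levels-1")
    z = []
    for start in range(0, len(x), block_len):
        block = x[start:start + block_len]
        # Weighted sum with positional weights n_levels**(-j), j = 1, 2, ...
        z.append(sum(Fraction(v, n_levels ** j) for j, v in enumerate(block, start=1)))
    return z


# Eight binary variables compressed to two coordinates.
print(compress([1, 0, 1, 1, 0, 1, 0, 0], block_len=4, n_levels=2))
```

Exact rationals are used here so that distinct inputs cannot collide through rounding; with floating-point weights, long blocks would exhaust the available precision, which is presumably one reason a scaling step matters in practice.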

If this is right

Distinct observations remain distinct after compression because the mapping is injective.
Model-based clustering is justified in the compressed space by the approximate Gaussian representation.
Location-driven cluster structure stays identifiable because centroid separation is preserved.
The transformation applies stably to binary, categorical, and count-valued data.
Cluster recovery remains accurate in simulations while computation is substantially faster than standard dimension-reduction methods.
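The centroid claim can be made concrete with a short derivation for a generic weighted sum. The notation below (weights w_j, a single compressed coordinate z_i, clusters C_g of size n_g) is assumed for illustration and is not taken from the paper:

```latex
z_i = \sum_{j=1}^{p} w_j x_{ij}, \qquad
\bar z_g = \frac{1}{n_g} \sum_{i \in C_g} z_i = \sum_{j=1}^{p} w_j \bar x_{gj}, \qquad
\bar z_1 - \bar z_2 = \sum_{j=1}^{p} w_j \left( \bar x_{1j} - \bar x_{2j} \right).
```

Because the map is linear, the compressed centroid is the compression of the original centroid. The gap between two compressed centroids vanishes only when the original centroid difference is orthogonal to the weight vector, so any separation guarantee has to rule that case out, presumably through the paper's regularity conditions.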

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the regularity conditions prove sensitive to extreme sparsity, the compression could be adapted by changing the scaling of the positional encoding for those regimes.
The same injective compression might support other location-based tasks such as regression or outlier detection without needing new theory.
Because the method is deterministic and parameter-light, it could serve as a preprocessing step before many existing discrete-data pipelines beyond clustering.
Testing whether the approximate Gaussian property holds for count data with very heavy tails would clarify the practical scope of the mild conditions.
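One informal way to probe the approximate Gaussian property on a given dataset is to look at the sample skewness and excess kurtosis of each compressed coordinate, both of which are near zero for Gaussian data. The helper below is a minimal sketch of that diagnostic; it is not a procedure described in the paper, and the name `skew_kurtosis` is hypothetical.

```python
from statistics import fmean


def skew_kurtosis(values):
    """Return (skewness, excess kurtosis) computed from population moments."""
    n = len(values)
    mean = fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0:
        raise ValueError("sample is constant; moments are undefined")
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    # Both quantities are 0 for an exact Gaussian distribution.
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


skew, kurt = skew_kurtosis([1, 2, 3, 4, 5])
print(skew, kurt)
```

Large absolute values on heavy-tailed count data would suggest that the regularity conditions are strained in that regime, although a formal test would be needed to draw firm conclusions.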

Load-bearing premise

The mild regularity conditions that justify the approximate Gaussian representation and the preservation of centroid separation must actually hold for the data being compressed.

What would settle it

Finding two distinct high-dimensional discrete observations that map to the exact same compressed vector, or a dataset of known clusters whose centroids become inseparable after compression while satisfying the stated regularity conditions, would disprove the core theoretical properties.
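For small problems, the first kind of counterexample can be searched for exhaustively. The sketch below enumerates every discrete vector of a given length and reports the first pair that shares a code. The two encoders are hypothetical stand-ins: `positional` uses integer weights `base ** j`, and `plain_sum` uses equal weights as a deliberately non-injective control. Neither is the paper's encoding.

```python
from itertools import product


def find_collision(encode, n_vars, n_levels):
    """Return two distinct vectors with the same code, or None if none exist."""
    seen = {}
    for x in product(range(n_levels), repeat=n_vars):
        code = encode(x)
        if code in seen:
            return seen[code], x
        seen[code] = x
    return None


def positional(x, base=3):
    # Hypothetical positional weights base**j; exact in integer arithmetic.
    return sum(v * base ** j for j, v in enumerate(x))


def plain_sum(x):
    # Equal weights: distinct vectors with the same total collide.
    return sum(x)


print(find_collision(positional, n_vars=4, n_levels=3))
print(find_collision(plain_sum, n_vars=2, n_levels=2))
```

The search grows as `n_levels ** n_vars`, so it only serves as a sanity check in low dimensions; it cannot replace a proof of injectivity.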

Figures

Figures reproduced from arXiv: 2606.10593 by Michael Fop, Silvia D'Angelo.

**Figure 2.** Comparison of variable orderings in the data compression. ARI between true and estimated class

**Figure 3.** Simulation study – Scenario 1. ARI between true and estimated cluster memberships across methods.

**Figure 4.** Simulation study – Scenario 2. ARI between true and estimated cluster memberships across methods

**Figure 5.** Simulation study – Scenario 3. ARI between true and estimated cluster memberships across methods

**Figure 6.** Simulation study – Scenario 4. ARI between true and estimated cluster memberships across methods

**Figure 7.** Simulation study – Scenario 5. ARI between true and estimated cluster memberships across methods

**Figure 8.** Simulation study – Scenario 1. Computational times for the different methods.

**Figure 9.** Simulation study – Scenario 1. Example of low-dimensional representation for

**Figure 10.** Simulation study – Scenario 2. ARI values between simulated true and estimated class memberships

**Figure 11.** Simulation study – Scenario 2. Example of low-dimensional representation for

**Figure 12.** Simulation study – Scenario 3. ARI values between simulated true and estimated class memberships

**Figure 13.** Simulation study – Scenario 3. Example of low-dimensional representation for

**Figure 14.** Simulation study – Scenario 4. ARI values between simulated true and estimated class memberships

**Figure 15.** Simulation study – Scenario 4. ARI values between simulated true and estimated class memberships

**Figure 16.** Simulation study – Scenario 4. Example of low-dimensional representation for

**Figure 17.** Simulation study – Scenario 5. ARI values between simulated true and estimated class memberships

**Figure 18.** Simulation study – Scenario 5. ARI values between simulated true and estimated class memberships

**Figure 19.** Simulation study – Scenario 5. Example of low-dimensional representation for

**Figure 20.** Irish female baby names. Co-occurrence matrix obtained fitting k-means on compressed data, with

**Figure 21.** Irish female baby names. Co-occurrence matrix obtained fitting k-means on data reduced with

**Figure 22.** Irish female baby names. Co-occurrence matrix obtained fitting

**Figure 23.** Irish female baby names. Co-occurrence matrix obtained fitting

**Figure 24.** Irish female baby names. Observed names’ yearly distributions, colored according to the number of

**Figure 25.** Irish male baby names. Co-occurrence matrix obtained fitting k-means on compressed data, with

**Figure 26.** Irish male baby names. Co-occurrence matrix obtained fitting k-means on data reduced with multidi

**Figure 27.** Irish male baby names. Co-occurrence matrix obtained fitting

**Figure 28.** Irish male baby names. Co-occurrence matrix obtained fitting

**Figure 29.** Irish male baby names. Observed names’ yearly distributions, colored according to the number of

**Figure 30.** Schnorr microbiome data. Co-occurrence matrix obtained fitting k-means on compressed data, with

**Figure 31.** Schnorr microbiome data. Co-occurrence matrix obtained fitting

**Figure 32.** Schnorr microbiome data. Co-occurrence matrix obtained fitting k-means on data reduced with

**Figure 33.** Schnorr microbiome data. Co-occurrence matrix obtained fitting

Original abstract

High-dimensional discrete data arise in many contemporary applications, including genomics, microbiome research, survey studies, and digital behavioral analysis. Clustering such data remains challenging because existing methods are often computationally demanding, sensitive to sparsity and discreteness, or designed for specific data types. We propose a deterministic dimension-reduction framework for clustering high-dimensional discrete observations. The method compresses each observation into a low-dimensional continuous representation through weighted sums defined by a scaled positional encoding, yielding a numerically stable transformation applicable to binary, categorical, and count-valued data. We establish several theoretical properties of the proposed compression. The mapping is injective, ensuring that distinct observations remain distinct after compression. Under mild regularity conditions, the compressed variables admit an approximate Gaussian representation, providing a theoretical basis for model-based clustering in the compressed space. We further show that separation between cluster centroids is preserved under compression, implying that location-driven cluster structure remains identifiable after dimension reduction. Extensive simulation studies demonstrate accurate cluster recovery across a wide range of realistic settings. The proposed approach is also computationally efficient, providing substantial speed improvements over commonly used dimension-reduction techniques often used in conjunction with clustering. Applications to Irish baby-name records and microbiome data further illustrate its practical utility. The proposed framework offers a scalable, computationally efficient, and broadly applicable approach to clustering high-dimensional discrete data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The compression gives a fast, deterministic way to cluster high-dimensional discrete data, but the Gaussian and centroid-separation claims rest on unspecified regularity conditions that need checking against sparse data.

The core contribution is a weighted-sum compression built on scaled positional encodings that turns high-dimensional discrete vectors into a low-dimensional continuous space while staying injective. That construction is new in this context and directly targets the computational bottleneck in clustering binary, categorical, or count data from genomics or microbiome work.

The paper shows the method is numerically stable, runs faster than common dimension-reduction steps, and recovers clusters accurately in simulations across varied sparsity and dimension settings. The two real-data examples (Irish baby names and microbiome counts) add concrete evidence that the approach is usable in practice.
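The figure captions report cluster recovery as the ARI (adjusted Rand index) between true and estimated memberships. For readers unfamiliar with the measure, the following is a minimal standard-library sketch of the usual pair-counting formula; the function name `adjusted_rand_index` is ours, and the paper's own computations may rely on a different implementation.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions given as label sequences."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    n = len(labels_a)
    if n < 2:
        raise ValueError("at least two observations are required")
    # Pairs of observations placed together in both partitions.
    together = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / comb(n, 2)
    maximum = (pairs_a + pairs_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (together - expected) / (maximum - expected)


print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # same partition, relabeled
```

A value of 1 indicates identical partitions up to relabeling, values near 0 indicate agreement no better than chance, and negative values are possible.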

The soft spot is the theoretical justification for model-based clustering after compression. The claims of an approximate Gaussian representation and preserved centroid separation both invoke unspecified mild regularity conditions. Without their exact form it is unclear whether they survive the heavy sparsity and zero-inflation typical of these datasets; if the conditions require lower bounds on cell probabilities or sub-Gaussian tails that fail here, the clustering rationale weakens even though injectivity itself looks fine.

This is aimed at applied statisticians and computational biologists who already use mixture models or k-means on discrete high-dimensional data and want a quicker preprocessing route. A reader facing exactly that workflow will find the empirical results and speed claims useful.

It deserves peer review. The practical problem is real, the method is straightforward to implement, and the simulations provide a starting point for evaluation; referees can press for the missing regularity statements and targeted checks on sparse regimes.

Referee Report

2 major / 1 minor

Summary. The paper proposes a deterministic compression framework for high-dimensional discrete data (binary, categorical, count) that maps each observation to a low-dimensional continuous vector via weighted sums with a scaled positional encoding. It claims the mapping is injective, that the compressed variables admit an approximate Gaussian representation under mild regularity conditions (providing justification for model-based clustering), and that separation between cluster centroids is preserved. Simulation studies and applications to Irish baby-name records and microbiome data are used to illustrate performance and computational efficiency.

Significance. If the injectivity, approximate-Gaussian, and centroid-separation results hold once the regularity conditions are stated explicitly, the method would supply a fast, broadly applicable preprocessing step that enables standard Gaussian-mixture clustering on compressed discrete data while preserving location-driven structure. This could be practically significant for large-scale genomics and microbiome analyses where existing dimension-reduction-plus-clustering pipelines are computationally heavy.

major comments (2)

[Abstract / theoretical properties] Abstract and theoretical-properties section: the claims that 'under mild regularity conditions, the compressed variables admit an approximate Gaussian representation' and that 'separation between cluster centroids is preserved' are load-bearing for the clustering justification, yet the precise statement of those conditions (moment requirements, lower bounds on cell probabilities, dependence on sparsity or ambient dimension, etc.) is not supplied. Without an explicit theorem or set of assumptions, it is impossible to verify whether the conditions hold for the sparse discrete data the method targets.
[Theoretical properties] Theoretical-properties section (likely around the injectivity and Gaussian claims): the manuscript must supply the derivation steps or regularity-condition statements that convert the deterministic weighted-sum construction into the stated approximate-Gaussian and centroid-preservation results; the current abstract supplies none.

minor comments (1)

[Abstract] Abstract: the phrase 'extensive simulation studies demonstrate accurate cluster recovery across a wide range of realistic settings' would benefit from a one-sentence indication of the range of dimensions, sparsity levels, and cluster-separation regimes examined.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments highlighting the need for explicit theoretical details. We will revise the manuscript to address these points by adding formal statements of conditions and derivations.

Referee: [Abstract / theoretical properties] Abstract and theoretical-properties section: the claims that 'under mild regularity conditions, the compressed variables admit an approximate Gaussian representation' and that 'separation between cluster centroids is preserved' are load-bearing for the clustering justification, yet the precise statement of those conditions (moment requirements, lower bounds on cell probabilities, dependence on sparsity or ambient dimension, etc.) is not supplied. Without an explicit theorem or set of assumptions, it is impossible to verify whether the conditions hold for the sparse discrete data the method targets.

Authors: We agree that explicit statements of the regularity conditions are required. In the revised manuscript we will add a dedicated theorem specifying the assumptions, including moment bounds, minimum cell-probability thresholds, and their dependence on sparsity and ambient dimension. This will allow direct verification for the sparse discrete data targeted by the method.
Revision: yes

Referee: [Theoretical properties] Theoretical-properties section (likely around the injectivity and Gaussian claims): the manuscript must supply the derivation steps or regularity-condition statements that convert the deterministic weighted-sum construction into the stated approximate-Gaussian and centroid-preservation results; the current abstract supplies none.

Authors: We will include the derivation steps in the theoretical-properties section of the revision. These will detail how the scaled positional encoding and weighted sums yield the approximate-Gaussian property and centroid separation under the stated conditions. The abstract will be updated to reference the new theorem explicitly.
Revision: yes

Circularity Check

0 steps flagged

No circularity; properties derived directly from deterministic mapping definition

The paper defines a deterministic compression via weighted sums with scaled positional encoding. It then states that the mapping is injective, that compressed variables admit an approximate Gaussian representation under mild regularity conditions, and that cluster-centroid separation is preserved. These are presented as mathematical consequences of the construction itself rather than reductions to fitted parameters, self-citations, or renamings. No load-bearing self-citation chains, ansatzes smuggled via prior work, or predictions that equal their inputs by construction appear in the abstract or context. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies insufficient detail to enumerate free parameters, axioms, or invented entities; the 'scaled positional encoding' and 'mild regularity conditions' are referenced but not formalized.

Reference graph

Works this paper leans on

78 extracted references · 21 canonical work pages

[1]

Bioinformatics , volume =

Chen, Guanhua and Wang, Xinyue and Sun, Qiang and Tang, Zheng-Zheng , title =. Bioinformatics , volume =. 2025 , month =

2025
[2]

Polynomials and Polynomial Inequalities , series =

Peter Borwein and Tam. Polynomials and Polynomial Inequalities , series =. 1995 , isbn =

1995
[3]

William Feller , title =
[4]

Shannon , title =

Claude E. Shannon , title =. Bell System Technical Journal , volume =
[5]

Serge Lang , title =
[6]

Ingwer Borg and Patrick J. F. Groenen , title =. 2005 , isbn =

2005
[7]

Journal of Machine Learning Research , volume =

Laurens van der Maaten and Geoffrey Hinton , title =. Journal of Machine Learning Research , volume =. 2008 , issn =

2008
[8]

Statistics and Computing , volume =

Ulrike von Luxburg , title =. Statistics and Computing , volume =. 2007 , doi =

2007
[9]

Jolliffe , title =

Ian T. Jolliffe , title =. 2002 , isbn =

2002
[10]

Bishop , title =

Christopher M. Bishop , title =. 2006 , isbn =

2006
[11]

The Gut Microbiota of Rural Papua New Guineans: Composition, Diversity Patterns, and Ecological Processes , journal =

Mart. The Gut Microbiota of Rural Papua New Guineans: Composition, Diversity Patterns, and Ecological Processes , journal =. 2015 , doi =

2015
[12]

Irish Babies' Names , year =
[13]

and Scrucca, L

Casa, A. and Scrucca, L. and Menardi, G. , title =. Advances in Data Analysis and Classification , volume =. 2021 , doi =

2021
[14]

and Brodley, C.E

Fern, X.Z. and Brodley, C.E. , title =. Proceedings of the 20th international conference on machine learning , volume =. 2003 , doi =

2003
[15]

and Candela, M

Schnorr, S.L. and Candela, M. and Rampelli, S. and Centanni, M. and Consolandi, C. and Basaglia, G. and Turroni, S. and Biagi, E. and Peano, C. and Severgnini, M. and others , title =. Nature Communications , volume =. 2014 , doi =

2014
[16]

and Zhang, L

Shi, Y. and Zhang, L. and Peterson, C.B. and others , title =. Microbiome , volume =. 2022 , doi =

2022
[17]

and Chen, Z

Wang, C. and Chen, Z. and Xi, R. , title =. Annals of Applied Statistics , volume =. 2025 , doi =

2025
[18]

2026 , note =

vegan: Community Ecology Package , author =. 2026 , note =

2026
[19]

Journal of classification , volume=

Comparing partitions , author=. Journal of classification , volume=. 1985 , publisher=

1985
[20]

2024 , note =

kernlab: Kernel-Based Machine Learning Lab , author =. 2024 , note =

2024
[21]

kernlab -- An

Alexandros Karatzoglou and Alex Smola and Kurt Hornik and Achim Zeileis , journal =. kernlab -- An. 2004 , volume =

2004
[22]

Krijthe , year =

Jesse H. Krijthe , year =
[23]

Journal of Open Source Software , year =

simstudy: Illuminating research methods through data generation , author =. Journal of Open Source Software , year =
[24]

2013 , issn =

An extensive comparative study of cluster validity indices , journal =. 2013 , issn =. doi:https://doi.org/10.1016/j.patcog.2012.07.021 , url =

work page doi:10.1016/j.patcog.2012.07.021 2013
[25]

2021 , issn =

Clustering with the Average Silhouette Width , journal =. 2021 , issn =. doi:https://doi.org/10.1016/j.csda.2021.107190 , url =

work page doi:10.1016/j.csda.2021.107190 2021
[26]

Lausser, Ludwig and Schmid, Florian and Schirra, Lyn-Rouven and Wilhelm, Adalbert F. X. and Kestler, Hans A. , title=. Advances in Data Analysis and Classification , year=. doi:10.1007/s11634-016-0277-3 , url=

work page doi:10.1007/s11634-016-0277-3
[27]

Priebe and Youngser Park and David J

Congyuan Yang and Carey E. Priebe and Youngser Park and David J. Marchette , title =. Journal of Computational and Graphical Statistics , volume =. 2021 , publisher =

2021
[28]

, title =

MacQueen, James B. , title =. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability , volume =. 1967 , publisher =

1967
[29]

, title =

Lloyd, Stuart P. , title =. IEEE Transactions on Information Theory , volume =. 1982 , doi =

1982
[30]

, title =

Rousseeuw, Peter J. , title =. Journal of Computational and Applied Mathematics , volume =. 1987 , doi =

1987
[31]

Journal of the Royal Statistical Society Series B: Statistical Methodology , volume =

Tibshirani, Robert and Walther, Guenther and Hastie, Trevor , title =. Journal of the Royal Statistical Society Series B: Statistical Methodology , volume =. 2001 , month =. doi:10.1111/1467-9868.00293 , url =

work page doi:10.1111/1467-9868.00293 2001
[32]

Brendan and Raftery, Adrian E

Bouveyron, Charles and Celeux, Gilles and Murphy, T. Brendan and Raftery, Adrian E. , year=. Model-Based Clustering and Classification for Data Science: With Applications in R , publisher=
[33]

https://doi.org/10.1201/9781003277965

Luca Scrucca and Chris Fraley and T. Brendan Murphy and Adrian E. Raftery , publisher =. Model-Based Clustering, Classification, and Density Estimation Using. doi:10.1201/9781003277965 , year =

work page doi:10.1201/9781003277965
[34]

2026 , url =

R: A Language and Environment for Statistical Computing , author =. 2026 , url =

2026
[35]

McLachlan and Suren Rathnayake

McLachlan, Geoffrey J. and Rathnayake, Suren , title =. WIREs Data Mining and Knowledge Discovery , volume =. doi:https://doi.org/10.1002/widm.1135 , url =. https://wires.onlinelibrary.wiley.com/doi/pdf/10.1002/widm.1135 , year =

work page doi:10.1002/widm.1135
[36]

Journal of the American Statistical Association , volume =

Chris Fraley and Adrian E Raftery , title =. Journal of the American Statistical Association , volume =. 2002 , publisher =

2002
[37]

Journal of Classification , year=

Anderlucci, Laura and Fortunato, Francesca and Montanari, Angela , title=. Journal of Classification , year=. doi:10.1007/s00357-021-09403-7 , url=

work page doi:10.1007/s00357-021-09403-7
[38]

and McNicholas, Paul D

Payne, Andrea and Silva, Anjali and Rothstein, Steven J. and McNicholas, Paul D. and Subedi, Sanjeena , title=. Statistics and Computing , year=. doi:10.1007/s11222-025-10720-9 , url=

work page doi:10.1007/s11222-025-10720-9
[39]

Journal of Classification , year=

Tu, Wangshu and Subedi, Sanjeena , title=. Journal of Classification , year=
[40]

Scientific Reports , year=

Fang, Yuan and Subedi, Sanjeena , title=. Scientific Reports , year=. doi:10.1038/s41598-023-41318-8 , url=

work page doi:10.1038/s41598-023-41318-8
[41]

2014 , issn =

Computational Statistics & Data Analysis , volume =. 2014 , issn =. doi:https://doi.org/10.1016/j.csda.2012.12.008 , url =

work page doi:10.1016/j.csda.2012.12.008 2014
[42]

Journal of Machine Learning Research , year =

Dapeng Yao and Fangzheng Xie and Yanxun Xu , title =. Journal of Machine Learning Research , year =
[43]

Gallivan and Adrian Barbu , title =

Yijia Zhou and Kyle A. Gallivan and Adrian Barbu , title =. Journal of Computational and Graphical Statistics , volume =. 2025 , publisher =

2025
[44]

Clarke and Jennifer L

Saeid Amiri and Bertrand S. Clarke and Jennifer L. Clarke , title =. Journal of Computational and Graphical Statistics , volume =. 2018 , publisher =

2018
[45]

Journal of Nonparametric Statistics , volume =

Yong Wang and Reza Modarres , title =. Journal of Nonparametric Statistics , volume =. 2025 , publisher =

2025
[46]

Journal of the American Statistical Association , volume =

Xiaoxia Champon and Ana-Maria Staicu and Anthony Weishampel and Chathura Jayalah and William Rand , title =. Journal of the American Statistical Association , volume =
[47]

Journal of the American Statistical Association , volume =

Zhiyi Tian and Jiaming Xu and Jen Tang , title =. Journal of the American Statistical Association , volume =. 2024 , publisher =

2024
[48]

Journal of the American Statistical Association , volume =

Raffaele Argiento and Edoardo Filippi-Mazzola and Lucia Paci , title =. Journal of the American Statistical Association , volume =. 2025 , publisher =

2025
[49]

Biometrika , volume =

Ghilotti, L and Beraha, M and Guglielmi, A , title =. Biometrika , volume =. 2025 , month =

2025
[50]

Bioinformatics Advances , volume =

Rao, Jackie and Kirk, Paul D W , title =. Bioinformatics Advances , volume =. 2025 , month =

2025
[51]

The R Journal , year =

Papastamoulis, Panagiotis and Rattray, Magnus , title =. The R Journal , year =. doi:10.32614/RJ-2017-022 , volume =

work page doi:10.32614/rj-2017-022 2017
[52]

2009 , issn =

Discrete data clustering using finite mixture models , journal =. 2009 , issn =

2009
[53]

2025 , author =

Review of Post-Clustering Inference Methods , journal =. 2025 , author =

2025
[54]

and Hemanth, Duraisamy Jude and Sethi, Jasleen K

Mittal, Mamta and Goyal, Lalit M. and Hemanth, Duraisamy Jude and Sethi, Jasleen K. , title =. WIREs Data Mining and Knowledge Discovery , volume =
[55]

Advances in Data Analysis and Classification , year=

Mori, Matteo and Anderlucci, Laura , title=. Advances in Data Analysis and Classification , year=
[56]

Advances in Data Analysis and Classification , year=

Papastamoulis, Panagiotis , title=. Advances in Data Analysis and Classification , year=
[57]

2015 , issn =

Model based clustering of high-dimensional binary data , journal =. 2015 , issn =. doi:https://doi.org/10.1016/j.csda.2014.12.009 , url =

work page doi:10.1016/j.csda.2014.12.009 2015
[58]

van den Heuvel , title =

Alberto Brini and Abu Manju and Edwin R. van den Heuvel , title =. Communications in Statistics - Simulation and Computation , volume =. 2025 , publisher =

2025
[59]

Statistics and Computing , year=

Failli, Dalila and Marino, Maria Francesca and Arpino, Bruno , title=. Statistics and Computing , year=
[60]

Model-Based Clustering

Gormley, Isobel Claire and Murphy, Thomas Brendan and Raftery, Adrian E. Model-Based Clustering. Annual Review of Statistics and Its Application. 2023. doi:https://doi.org/10.1146/annurev-statistics-033121-115326

work page doi:10.1146/annurev-statistics-033121-115326 2023
[61]

Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences , author =

Wade, S. , title =. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences , volume =. 2023 , month =. doi:10.1098/rsta.2022.0149 , url =

work page doi:10.1098/rsta.2022.0149 2023
[62]

Dunson , title =

Noirrit Kiran Chandra and Antonio Canale and David B. Dunson , title =. Journal of Machine Learning Research , year =
[63]

Chen and Daniela M

Yiqun T. Chen and Daniela M. Witten , title =. Journal of Machine Learning Research , year =
[64]

Tony Cai and Rong Ma , title =

T. Tony Cai and Rong Ma , title =. Journal of Machine Learning Research , year =
[65]

2009 , issn =

A simple method for screening variables before clustering microarray data , journal =. 2009 , issn =. doi:https://doi.org/10.1016/j.csda.2009.02.001 , url =

work page doi:10.1016/j.csda.2009.02.001 2009
[66]

Classification of social media users with generalized functional data analysis , journal =

Anthony Weishampel and Ana-Maria Staicu and William Rand , keywords =. Classification of social media users with generalized functional data analysis , journal =. 2023 , issn =. doi:https://doi.org/10.1016/j.csda.2022.107647 , url =

work page doi:10.1016/j.csda.2022.107647 2023
[67]

and Andrews, Tallulah S

Kiselev, Vladimir Yu. and Andrews, Tallulah S. and Hemberg, Martin , title =. Nature Reviews Genetics , volume =. 2019 , doi =

2019
[68]

and Street, Kelly and Irizarry, Rafael A

Grabski, Isabella N. and Street, Kelly and Irizarry, Rafael A. , title =. Nature Methods , volume =. 2023 , doi =

2023
[69]

and Bowen, Natasha K

Weller, Bridget E. and Bowen, Natasha K. and Faubert, Sarah J. , title =. Journal of Black Psychology , volume =. 2020 , doi =

2020
[70]

Engineering Applications of Artificial Intelligence , volume =

A comprehensive survey of clustering algorithms:. Engineering Applications of Artificial Intelligence , volume =. 2022 , issn =. doi:https://doi.org/10.1016/j.engappai.2022.104743 , url =

work page doi:10.1016/j.engappai.2022.104743 2022
[71]

Zhongyuan Lyu, Ling Chen, and Yuqi Gu. Journal of the American Statistical Association, 2025.
[72]

Categorical data clustering: 25 years beyond K-modes. 2025. doi:10.1016/j.eswa.2025.126608
[73]

Spectral Clustering with Likelihood Refinement for High-Dimensional Latent Class Recovery. Psychometrika, 2026. doi:10.1017/psy.2026.10095
[74]

A Tensor-... Psychometrika, 2023. doi:10.1007/s11336-022-09887-1
[75]

Zhexue Huang. Data Mining and Knowledge Discovery.
[76]

Rock: A robust clustering algorithm for categorical attributes. 2000. doi:10.1016/S0306-4379(00)00022-3
[77]

Periklis Andritsos, Panayiotis Tsaparas, Renée J. Miller, and Kenneth C. Sevcik. LIMBO: Scalable Clustering of Categorical Data. Advances in Database Technology - EDBT 2004. 2004.
[78]

Variable selection methods for model-based clustering. Statistics Surveys, 2018.
