A Simple Algorithm for Clustering Discrete Distributions
Pith reviewed 2026-05-08 05:40 UTC · model grok-4.3
The pith
A projection-based algorithm clusters mixtures of discrete distributions by applying k-means to their best low-rank approximation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is that for mixtures of discrete distributions, projecting samples onto centers obtained from k-means on the best rank-k approximation of the data matrix correctly recovers the clusters under a separation condition on the centers. The same procedure works for Gaussian mixtures, providing a unified geometric approach.
What carries the argument
The projection onto approximate centers computed via k-means on the rank-k singular value decomposition approximation of the data matrix.
If this is right
- The algorithm is rotationally invariant and does not depend on coordinate-specific combinatorial projections.
- It resolves McSherry's conjecture on the existence of such geometric algorithms for discrete distributions.
- The method applies equally to continuous distributions like high-dimensional Gaussians.
- Clustering succeeds whenever the mixture components have sufficiently separated centers.
Where Pith is reading between the lines
- This geometric approach could extend naturally to other mixture models where low-rank structure captures the centers.
- In practice, the method might allow clustering in regimes where dimension is much larger than the number of samples.
- If separation is only marginally satisfied, adaptive choice of the rank k could improve robustness.
Load-bearing premise
The cluster centers must be separated by a sufficient distance relative to the noise level so that the low-rank approximation and k-means yield projections that distinguish the components correctly.
What would settle it
A concrete counterexample where the separation condition holds yet the projections lead to incorrect clustering, or an experiment showing misclustering when the rank-k approximation distorts the centers.
Figures
read the original abstract
We propose a simple, projection-based algorithm for clustering mixtures of discrete (Bernoulli) distributions. Unlike previous approaches that rely on coordinate-specific ``combinatorial projections,'' our algorithm is rotationally invariant and works by projecting samples onto approximate centers obtained via a $k$-means computation on the best rank-$k$ approximation of the data matrix. This resolves a conjecture of McSherry on the existence of such geometric algorithms for discrete distributions. The same algorithm also applies to continuous distributions such as high-dimensional Gaussians, providing a unified approach across distribution types. We prove that the algorithm succeeds under a natural separation condition on the cluster centers.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a simple, rotationally invariant projection-based algorithm for clustering mixtures of discrete Bernoulli distributions (and extends it to Gaussians). The algorithm computes the best rank-k approximation of the data matrix, runs k-means on the resulting low-rank matrix to obtain approximate centers, and projects samples onto these centers to assign clusters. It claims to resolve McSherry's conjecture on the existence of such geometric algorithms for discrete distributions and proves correctness under a natural separation condition on the cluster centers.
Significance. If the proof holds, the result supplies a unified geometric method that avoids coordinate-specific combinatorial projections, offering a simpler and more invariant approach to mixture clustering across discrete and continuous settings. This could streamline both theoretical analysis and practical implementation for high-dimensional data.
major comments (2)
- Main theorem (proof of success under separation): the argument that the rank-k SVD approximation plus k-means yields centers accurate enough for the subsequent projection step to recover the clusters relies on error bounds that are not fully derived or quantified in the provided analysis; explicit comparison of the projection error to the separation distance is needed to confirm the guarantee holds for Bernoulli mixtures.
- Algorithm section (projection onto approximate centers): the precise definition of how samples are projected onto the k-means centers obtained from the rank-k matrix, and why this step is rotationally invariant and free of coordinate bias for discrete distributions, requires a supporting lemma or explicit calculation showing invariance under orthogonal transformations.
minor comments (3)
- Add pseudocode for the full algorithm, including the rank-k approximation, k-means, and projection steps, to improve reproducibility and clarity.
- Include a reference to the exact statement of McSherry's conjecture being resolved, along with a brief discussion of how the separation condition compares to prior assumptions in the literature.
- Clarify notation for the data matrix and its rank-k approximation at first use; ensure all parameters (e.g., k, separation threshold) are defined before appearing in the theorem statement.
Simulated Author's Rebuttal
We thank the referee for their careful reading of our manuscript and for providing constructive feedback. We address each of the major comments below and will incorporate the suggested improvements in the revised version.
read point-by-point responses
-
Referee: Main theorem (proof of success under separation): the argument that the rank-k SVD approximation plus k-means yields centers accurate enough for the subsequent projection step to recover the clusters relies on error bounds that are not fully derived or quantified in the provided analysis; explicit comparison of the projection error to the separation distance is needed to confirm the guarantee holds for Bernoulli mixtures.
Authors: We agree that the error bounds require more explicit quantification to fully establish the main theorem. In the revised manuscript, we will expand the proof section to derive the error bounds from the rank-k SVD approximation and the k-means clustering step in detail. We will also include an explicit comparison showing that the projection error is smaller than half the minimum separation distance between centers, thereby confirming the correctness of the clustering for Bernoulli mixtures under the stated separation condition. revision: yes
-
Referee: Algorithm section (projection onto approximate centers): the precise definition of how samples are projected onto the k-means centers obtained from the rank-k matrix, and why this step is rotationally invariant and free of coordinate bias for discrete distributions, requires a supporting lemma or explicit calculation showing invariance under orthogonal transformations.
Authors: Thank you for this suggestion. The projection is performed by assigning each data point to the nearest of the approximate centers obtained from k-means on the rank-k approximation. To clarify the invariance, we will add a new lemma in the algorithm section that proves the rotational invariance of the SVD-based rank-k approximation and the k-means step under orthogonal transformations. This will demonstrate that the projection step is free of coordinate bias, distinguishing it from combinatorial projection methods that depend on specific coordinates for discrete distributions. revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper describes a projection-based clustering algorithm relying on standard rank-k SVD approximation followed by k-means, which are external, well-established techniques not derived from the paper's own fitted parameters or definitions. The abstract and available description present the method as resolving an external conjecture of McSherry via a geometrically invariant approach under a stated separation condition on centers. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided text. The derivation chain remains independent of its target clustering outputs and is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
D. Achlioptas and F. McSherry. On spectral learning of mixtures of distributions. InConference on Learning Theory (COLT), pages 458–469, 2005
work page 2005
-
[2]
N. Alon and N. Kahale. A spectral technique for coloring random 3-colorable graphs.SIAM J. Comput., 26(6):1733–1748, 1997
work page 1997
-
[3]
N. Alon, M. Krivelevich, and B. Sudakov. Finding a large hidden clique in a random graph. InAnnual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 594–598, 1998
work page 1998
-
[4]
R. Boppana. Eigenvalues and graph bisection: an average case analysis. InIEEE Symposium on Foundations of Computer Science (FOCS), pages 280–285, 1987
work page 1987
-
[5]
A. Dasgupta, J. Hopcroft, R. Kannan, and P. Mitra. Spectral clustering with limited indepen- dence. InAnnual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1036–1045, 2007. 11
work page 2007
-
[6]
A. Dasgupta, J. Hopcroft, and F. McSherry. Spectral analysis of random graphs with skewed degree distributions. InIEEE Symposium on Foundations of Computer Science (FOCS), pages 602–610, 2004
work page 2004
-
[7]
P. Drineas, A. Frieze, R. Kannan, S. Vempala, and V. Vinay. Clustering in large graphs and matrices. InAnnual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 291–299, 1999
work page 1999
-
[8]
K. Jain and V. Vazirani. Approximation algorithms for metric facility location and median problems using the primal-dual schema and lagrangian relaxation.J. ACM, 48(2):274–296, 2001
work page 2001
- [9]
-
[10]
T. Kanungo, D. Mount, N. Netanyahu, C. Piatko, R. Silverman, and A. Wu. A local search approximation algorithm for k-means clustering.Comput. Geom., 28(2-3):89–112, 2004
work page 2004
- [11]
-
[12]
Petrov.Sums of independent random variables
V. Petrov.Sums of independent random variables. Springer, 1975
work page 1975
- [13]
-
[14]
J. Schmidt, A. Siegel, and A. Srinivasan. Chernoff-Hoeffding bounds for applications with limited independence.SIAM J. Discret. Math., 8(2):223–250, 1995
work page 1995
-
[15]
S. Vempala and G. Wang. A spectral algorithm for learning mixture models.Journal of Com- puter and System Sciences, 68(4):841–860, 2004
work page 2004
-
[16]
V. Vu. Spectral norm of random matrices. InACM Symposium on Theory of computing (STOC), pages 619–626, 2005
work page 2005
-
[17]
Mitra.Clustering algorithms for random and pseudo-random structures
P. Mitra.Clustering algorithms for random and pseudo-random structures. PhD thesis, Yale University, 2008
work page 2008
-
[18]
S. Neumann. Bipartite stochastic block models with tiny clusters. InAdvances in Neural In- formation Processing Systems (NeurIPS), 2018
work page 2018
-
[19]
M. Seif and Y. Chen. Clustering mixtures of discrete distributions: a note on Mitra’s algorithm. arXiv preprint arXiv:2405.19559, 2024. 12
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.