pith. sign in

arxiv: 2604.23083 · v1 · submitted 2026-04-25 · 📊 stat.ML · cs.LG· stat.ME

Turtle shell clustering: A mixture approach to discriminative clustering with applications to flow cytometry and other data

Pith reviewed 2026-05-08 07:28 UTC · model grok-4.3

classification 📊 stat.ML cs.LGstat.ME
keywords clusteringdiscriminative clusteringmixture modelsmutual informationflow cytometryunsupervised learningGaussian mixturescomponent selection
0
0 comments X

The pith

A mixture of Gaussians and uniform distributions under a regularized mutual information objective draws non-linear cluster boundaries and selects the number of groups without labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a clustering procedure that combines generative ideas about cluster geometry with discriminative ideas about boundaries between groups. It optimizes a regularized mutual information objective using a mixture model that places Gaussian components on cluster interiors and uniform components on noise or background regions. The regularization term plus an explicit merge step lets the algorithm decide how many components to keep, producing a fully unsupervised procedure. Tests on simulated data and flow cytometry experiments show the method recovering intuitive groupings even when clusters are irregular or contaminated by noise.

Core claim

The turtle shell method is a probabilistic discriminative clustering procedure based on a regularized mutual information objective function. It employs a mixture of mixtures of Gaussian and uniform distributions to model the conditional distribution, enabling the estimation of non-linear boundaries. Automatic component selection is achieved through the regularizing term and a merge step analogous to reversible jump MCMC techniques.

What carries the argument

The regularized mutual information objective function applied to a mixture model of Gaussian and uniform distributions, together with a merge step for automatic component number selection.

If this is right

  • The method estimates non-linear decision boundaries between clusters without any supervision.
  • The regularizer and merge step together determine the number of clusters automatically.
  • Clusters with irregular shapes or embedded noise are still recovered as intuitive groups.
  • The approach applies directly to flow cytometry data to separate distinct cell populations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same regularized objective could be paired with other base distributions to handle different types of noise beyond uniforms.
  • The uniform components offer a built-in mechanism for identifying points that do not belong to any cluster.
  • Embedding the method inside a dimensionality-reduction pipeline might extend its use to higher-dimensional biological or image datasets.

Load-bearing premise

The regularized mutual information objective combined with a mixture of Gaussian and uniform distributions will produce meaningful, non-linear boundaries and automatic component selection in a fully unsupervised setting without requiring labeled data.

What would settle it

A dataset with known ground-truth clusters of highly irregular non-convex shapes where the method either selects the wrong number of components or fails to recover the true assignments better than a standard Gaussian mixture model.

Figures

Figures reproduced from arXiv: 2604.23083 by Arthur White, Mackenzie R. Neal, Paul D. McNicholas.

Figure 1
Figure 1. Figure 1: Clusters from the EM estimation of a GMM when (a) the BIC is used to select view at source ↗
Figure 2
Figure 2. Figure 2: Histogram of data generated from a mixture of Gaussian and uniform distributions. view at source ↗
Figure 3
Figure 3. Figure 3: Example of RIM clustering results when a multi-logit is assumed. view at source ↗
Figure 4
Figure 4. Figure 4: Estimated number of clusters for each initialization method. view at source ↗
Figure 4
Figure 4. Figure 4: Estimated number of clusters for each initialization method. view at source ↗
Figure 5
Figure 5. Figure 5: An example result on a simulated dataset from Section 3.3. view at source ↗
Figure 6
Figure 6. Figure 6: An example result from each tested method on a simulated dataset from the cross view at source ↗
Figure 7
Figure 7. Figure 7: An example result from each tested method on a simulated dataset from the view at source ↗
Figure 8
Figure 8. Figure 8: An example results from each tested method on a simulated dataset from the view at source ↗
Figure 9
Figure 9. Figure 9: ARI values obtained from each method on each benchmark clustering dataset view at source ↗
Figure 10
Figure 10. Figure 10: ARI values obtained from each method on each flow cytometry dataset under view at source ↗
read the original abstract

Generative approaches to clustering provide information on geometric properties of clusters, whereas discriminative approaches provide boundaries between clusters. Ideas from both approaches are incorporated to present a fully unsupervised, probabilistic, and discriminative clustering method via a regularized mutual information objective function, wherein a mixture of mixtures of Gaussian and uniform distributions is used for formulation of the conditional model. Automatic selection of the number of components is established with the introduction of the regularizing term and a merge step, similar to those applied in reversible jump Markov chain Monte Carlo methods used in Bayesian clustering. Consequently, the turtle shell method -- a fully unsupervised clustering method capable of estimating non-linear boundary lines, automatically selecting the number of components, and capturing intuitive clusters in the presence of data abnormalities such as noise and/or irregular cluster shapes -- is introduced. We test this method on various simulated and real datasets commonly explored in clustering research, and extend the analysis to datasets arising from flow cytometry experiments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces the turtle shell clustering method as a fully unsupervised probabilistic discriminative clustering approach. It formulates the problem via a regularized mutual information objective that employs a mixture-of-mixtures model consisting of Gaussian and uniform distributions to capture cluster geometry and boundaries. A regularizing term together with a merge step modeled after reversible-jump MCMC is used to achieve automatic selection of the number of components. The method is claimed to recover non-linear boundaries and to remain robust to noise and irregular shapes; it is evaluated on simulated data, standard clustering benchmarks, and flow-cytometry datasets.

Significance. If the central claims are substantiated, the work offers a principled bridge between generative mixture modeling and discriminative boundary estimation within a single unsupervised objective. Automatic component selection without labeled data or exhaustive hyper-parameter search would be a practical advance for noisy, high-dimensional applications such as flow cytometry. The explicit use of uniform components to model background or outliers is a concrete modeling choice that could generalize to other domains with contamination.

major comments (3)
  1. [§3.2] §3.2, Eq. (7): the regularized mutual-information objective contains a free regularization parameter λ whose selection procedure is not fully specified; the text states that λ is 'chosen once' yet provides neither a data-driven rule nor a sensitivity analysis showing that downstream cluster count and boundaries remain stable across a plausible range of λ.
  2. [§4.3] §4.3, Algorithm 1 (merge step): the acceptance probability for the merge operation is stated to be 'analogous to reversible-jump MCMC' but the precise Metropolis-Hastings ratio, proposal distribution, and Jacobian term are not derived; without these quantities it is impossible to verify that the merge step yields a consistent estimator of the number of components rather than an ad-hoc post-processing rule.
  3. [Table 2] Table 2 and Figure 4: the reported ARI and NMI values on the flow-cytometry data are given for a single run; no standard errors across random initializations or cross-validation folds are supplied, making it difficult to assess whether the apparent superiority over k-means and GMM is statistically reliable.
minor comments (3)
  1. [§2.1] Notation for the uniform component density is introduced in §2.1 but never given an explicit functional form; adding the support and normalization constant would remove ambiguity.
  2. [Figure 3] The caption of Figure 3 does not indicate the value of λ used for the displayed partition; this information should be added for reproducibility.
  3. [§3.1] Several references to 'mutual information' in §3.1 omit the base of the logarithm; consistency with the information-theoretic literature would be improved by stating whether nats or bits are used.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the constructive comments on our paper. We respond to each major comment in turn and indicate the revisions we plan to make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§3.2] §3.2, Eq. (7): the regularized mutual-information objective contains a free regularization parameter λ whose selection procedure is not fully specified; the text states that λ is 'chosen once' yet provides neither a data-driven rule nor a sensitivity analysis showing that downstream cluster count and boundaries remain stable across a plausible range of λ.

    Authors: We agree that the selection procedure for the regularization parameter λ requires more detail. In the revised manuscript, we will specify a data-driven rule for choosing λ, such as selecting the value that maximizes the objective on a held-out subset or a default based on data dimensionality, and include a sensitivity analysis demonstrating that the cluster count and boundaries are stable over a range of λ values. revision: yes

  2. Referee: [§4.3] §4.3, Algorithm 1 (merge step): the acceptance probability for the merge operation is stated to be 'analogous to reversible-jump MCMC' but the precise Metropolis-Hastings ratio, proposal distribution, and Jacobian term are not derived; without these quantities it is impossible to verify that the merge step yields a consistent estimator of the number of components rather than an ad-hoc post-processing rule.

    Authors: The merge step is a deterministic post-processing rule applied after the main optimization to automatically select the number of components by merging those that do not improve the objective. It is inspired by but not a direct implementation of reversible-jump MCMC. We will revise the manuscript to remove the MCMC analogy, explicitly state that the acceptance is based on whether the regularized mutual information increases after the merge, and clarify that this is a heuristic procedure rather than a theoretically consistent MCMC estimator. revision: yes

  3. Referee: [Table 2] Table 2 and Figure 4: the reported ARI and NMI values on the flow-cytometry data are given for a single run; no standard errors across random initializations or cross-validation folds are supplied, making it difficult to assess whether the apparent superiority over k-means and GMM is statistically reliable.

    Authors: We acknowledge the need for measures of variability. In the revised version, we will repeat the flow-cytometry experiments across multiple random initializations (e.g., 20 runs) and report the mean ARI and NMI along with standard errors in Table 2. Error bars will be added to Figure 4 to reflect this variability. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper defines a new unsupervised clustering procedure by combining a regularized mutual information objective with a mixture-of-mixtures model (Gaussians plus uniforms) and an explicit merge step for component selection. These elements are introduced as part of the method construction itself rather than derived from or fitted to a target quantity that is then re-labeled as a prediction. No load-bearing self-citation chains, self-definitional loops, or renamings of known results appear in the abstract or high-level description; the merge step is presented as an algorithmic addition analogous to existing RJMCMC techniques but not claimed to be forced by prior author work. The derivation therefore remains self-contained and independent of its own outputs.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The central claim rests on the modeling choice that data clusters can be represented as mixtures of Gaussians plus uniforms and that regularization plus merging suffices for automatic unsupervised selection; these are domain assumptions without independent evidence supplied in the abstract.

free parameters (1)
  • regularization parameter
    Controls the strength of the regularizing term in the mutual information objective; its value must be chosen or tuned to achieve automatic component selection.
axioms (2)
  • domain assumption Data can be adequately modeled by a mixture of mixtures of Gaussian and uniform distributions for the conditional model.
    Used to formulate the probabilistic discriminative clustering objective.
  • ad hoc to paper A merge step analogous to reversible jump MCMC will correctly determine the number of components without supervision.
    Introduced to enable automatic selection of the number of components.

pith-pipeline@v0.9.0 · 5468 in / 1468 out tokens · 31032 ms · 2026-05-08T07:28:45.090699+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages

  1. [1]

    Chattopadhyay, M

    Aghaeepour, N., P. Chattopadhyay, M. Chikina, T. Dhaene, S. Van Gassen, M. Kursa, B. N. Lambrecht, M. Malek, G. J. McLachlan, Y. Qian, P. Qiu, Y. Saeys, R. Stanton, D. Tong, C. Vens, S. Walkowiak, K. Wang, G. Finak, R. Gottardo, T. Mosmann, G. P. Nolan, R. H. Scheuermann, and R. R. Brinkman (2016). A benchmark for evaluation of algorithms for identificati...

  2. [2]

    Finak, F

    Aghaeepour, N., G. Finak, F. Consortium, D. Consortium, H. Hoos, T. R. Mosmann, R. Brinkman, R. Gottardo, and R. H. Scheuermann (2013). Critical assessment of automated flow cytometry data analysis techniques. Nature methods\/ 10\/ (3), 228--238

  3. [3]

    Baudry, J.-P. (2015). Estimation and model selection for model-based clustering with the conditional classification likelihood. Electronic Journal of Statistics\/ 9 , 1041--1077

  4. [4]

    Celeux, and G

    Biernacki, C., G. Celeux, and G. Govaert (2002). Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Transactions on Pattern Analysis and Machine Intelligence\/ 22\/ (7), 719--725

  5. [5]

    D., J.-L

    Blondel, V. D., J.-L. Guillaume, R. Lambiotte, and E. Lefebvre (2008). Fast unfolding of communities in large networks. Journal of statistical mechanics: theory and experiment\/ 2008\/ (10), P10008

  6. [6]

    Heading, and D

    Bridle, J., A. Heading, and D. MacKay (1991). Unsupervised classifiers, mutual information and phantom targets. Advances in neural information processing systems\/ 4

  7. [7]

    Browne, R. P., P. D. McNicholas, and M. D. Sparling (2011). Model-based learning using a mixture of mixtures of G aussian and uniform distributions. IEEE Transactions on Pattern Analysis and Machine Intelligence\/ 34\/ (4), 814--817

  8. [8]

    Byrd, R. H., P. Lu, J. Nocedal, and C. Zhu (1995). A limited memory algorithm for bound constrained optimization. SIAM Journal on scientific computing\/ 16\/ (5), 1190--1208

  9. [9]

    Cardoso, M. (2013). Wholesale customers . UCI Machine Learning Repository. DOI : https://doi.org/10.24432/C5030X

  10. [10]

    Carnell, R. (2024). lhs: Latin Hypercube Samples . R package version 1.2.0

  11. [11]

    Niewczas, P

    Charytanowicz, M., J. Niewczas, P. Kulczycki, P. Kowalski, and S. Lukasik (2010). Seeds . UCI Machine Learning Repository. DOI : https://doi.org/10.24432/C5H30K

  12. [12]

    Alkhassim, R

    Commenges, D., C. Alkhassim, R. Gottardo, B. Hejblum, and R. Thi \'e baut (2018). cytometree: A binary tree algorithm for automatic gating in cytometry analysis. Cytometry Part A\/ 93\/ (11), 1132--1140

  13. [13]

    Alkhassim, R

    Commenges, D., C. Alkhassim, R. Gottardo, B. P. Hejblum, and Rodolphe Thi\'ebaut (2018). cytometree: a binary tree algorithm for automatic gating in cytometry analysis [software]. Cytometry Part A\/ 93\/ (11), 1132--1140. Describes the R package version 2.0.6

  14. [14]

    Nepusz, V

    Csárdi, G., T. Nepusz, V. Traag, S. Horvát, F. Zanini, D. Noom, K. Müller, D. Schoch, and M. Salmon (2026). igraph : Network Analysis and Visualization in R . R package version 2.2.1

  15. [15]

    Dempster, A. P., N. M. Laird, and D. B. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B\/ 39\/ (1), 1--38

  16. [16]

    Doherty, U. P., R. M. McLoughlin, and A. White (2025). Challenges and adaptations of model-based clustering for flow and mass cytometry. WIREs Computational Statistics\/ 17\/ (1), e70017

  17. [17]

    Grandvalet, Y. and Y. Bengio (2004). Semi-supervised learning by entropy minimization. Advances in neural information processing systems\/ 17

  18. [18]

    Hejblum, B. P., C. Alkhassim, R. Gottardo, F. Caron, and R. Thi \'e baut (2019). Sequential Dirichlet process mixtures of multivariate skew t -distributions for model-based clustering of flow cytometry data . The Annals of Applied Statistics\/ 13\/ (1), 638 -- 660

  19. [19]

    Hubert, L. and P. Arabie (1985). Comparing partitions. Journal of Classification\/ 2\/ (1), 193--218

  20. [20]

    Hung, Y., Y. Wang, V. Zarnitsyna, C. Zhu, and C. J. Wu (2013). Hidden M arkov models with applications in cell adhesion experiments. Journal of the American Statistical Association\/ 108\/ (504), 1469--1479

  21. [21]

    Hurley, C. (2025). gclus: Clustering Graphics . R package version 1.3.3

  22. [22]

    Pucella, H

    Khodadadi-Jamayran, A., J. Pucella, H. Zhou, N. Doudican, J. C. D. driven phenotypic dissection of aml reveals progenitor-like cells that correlate with prognosisand Adriana Heguy, B. Reizis, and A. Tsirigos (2020). i C ell R : Combined coverage correction and principal component alignment for batch alignment in single-cell sequencing analysis. bioRxiv\/

  23. [23]

    Perona, and R

    Krause, A., P. Perona, and R. Gomes (2010). Discriminative clustering by regularized information maximization. Advances in neural information processing systems\/ 23

  24. [24]

    Levine, J. H., E. F. Simonds, S. C. Bendall, K. L. Davis, D. A. El-ad, M. D. Tadmor, O. Litvin, H. G. Fienberg, A. Jager, E. R. Zunder, et al. (2015). Data-driven phenotypic dissection of aml reveals progenitor-like cells that correlate with prognosis. Cell\/ 162\/ (1), 184--197

  25. [25]

    Liu, X., W. Song, B. Wong, T. Zhang, S. Yu, G. Lin, and X. Ding (2019). A comparison framework and guideline of clustering methods for mass cytometry data. Genome Biology\/ 20\/ (297)

  26. [26]

    Lun, A. (2025). bluster: Clustering Algorithms for Bioconductor . R package version 1.18.0

  27. [27]

    Marin, D., M. Tang, I. B. Ayed, and Y. Boykov (2017). Kernel clustering: Density biases and solutions. IEEE Transactions on Pattern Analysis and Machine Intelligence\/ 41\/ (1), 136--147

  28. [28]

    McNicholas, P. D. (2016a). Mixture Model-Based Classification . Boca Raton: Chapman & Hall/CRC Press

  29. [29]

    McNicholas, P. D. (2016b). Model-based clustering. Journal of Classification\/ 33\/ (3), 331--373

  30. [30]

    Moulavi, D., P. A. Jaskowiak, R. J. Campello, A. Zimek, and J. Sander (2014). Density-based clustering validation. In Proceedings of the 2014 SIAM international conference on data mining , pp.\ 839--847. SIAM

  31. [31]

    Mattei, C

    Ohl, L., P.-A. Mattei, C. Bouveyron, W. Harchaoui, M. Leclercq, A. Droit, and F. Precioso (2022). Generalised mutual information for discriminative clustering. Advances in Neural Information Processing Systems\/ 35 , 3377--3390

  32. [32]

    Qian, Y., C. Wei, F. Eun-Hyung Lee, J. Campbell, J. Halliley, J. A. Lee, J. Cai, Y. M. Kong, E. Sadat, E. Thomson, et al. (2010). Elucidation of seventeen human peripheral blood b-cell subsets and quantification of the tetanus response using a density-based method for the automated identification of cell populations in multidimensional flow cytometry data...

  33. [33]

    R: A Language and Environment for Statistical Computing

    R Core Team (2025). R: A Language and Environment for Statistical Computing . Vienna, Austria: R Foundation for Statistical Computing

  34. [34]

    Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics\/ 6\/ (2), 461--464

  35. [35]

    Fraley, T

    Scrucca, L., C. Fraley, T. B. Murphy, and A. E. Raftery (2023). Model-Based Clustering, Classification, and Density Estimation Using mclust in R . Chapman and Hall/CRC

  36. [36]

    Tortora, C., R. P. Browne, A. ElSherbiny, B. C. Franczak, and P. D. McNicholas (2021). Model-based clustering, classification, and discriminant analysis using the generalized hyperbolic distribution: MixGHD R package. Journal of Statistical Software\/ 98\/ (3), 1--24

  37. [37]

    Callebaut, M

    Van Gassen, S., B. Callebaut, M. J. Van Helden, B. N. Lambrecht, P. Demeester, T. Dhaene, and Y. Saeys (2015). Flowsom: Using self-organizing maps for visualization and interpretation of cytometry data. Cytometry Part A\/ 87\/ (7), 636--645

  38. [38]

    Zhang, Z., K. L. Chan, Y. Wu, and C. Chen (2004). Learning a multivariate gaussian mixture model with the reversible jump mcmc algorithm. Statistics and Computing\/ 14\/ (4), 343--355

  39. [39]

    Zhang, Z., C. Chen, J. Sun, and K. L. Chan (2003). EM algorithms for G aussian mixtures with split-and-merge operation. Pattern recognition\/ 36\/ (9), 1973--1983

  40. [40]

    Lin, and X

    Zou, Y., Y. Lin, and X. Song (2024). Bayesian heterogeneous hidden M arkov models with an unknown number of states. Journal of Computational and Graphical Statistics\/ 33\/ (1), 15--24