Turtle shell clustering: A mixture approach to discriminative clustering with applications to flow cytometry and other data
Pith reviewed 2026-05-08 07:28 UTC · model grok-4.3
The pith
A mixture of Gaussians and uniform distributions under a regularized mutual information objective draws non-linear cluster boundaries and selects the number of groups without labels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The turtle shell method is a probabilistic discriminative clustering procedure based on a regularized mutual information objective function. It employs a mixture of mixtures of Gaussian and uniform distributions to model the conditional distribution, enabling the estimation of non-linear boundaries. Automatic component selection is achieved through the regularizing term and a merge step analogous to reversible jump MCMC techniques.
What carries the argument
The regularized mutual information objective function applied to a mixture model of Gaussian and uniform distributions, together with a merge step for automatic component number selection.
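As a rough illustration of the kind of quantity being maximized, the sketch below computes the standard empirical mutual information between points and cluster labels from a table of soft assignments p(y | x_i), then subtracts a penalty. The function names, the `penalty` argument, and the `lam * penalty` form are assumptions made for illustration; the paper's Eq. (7) is not reproduced on this page and may differ.

```python
import math


def entropy(p):
    """Shannon entropy, in nats, of a probability vector."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def empirical_mutual_information(posteriors):
    """Estimate I(X; Y) from rows p(y | x_i).

    Uses the identity I = H(average row) - average of H(row).
    """
    n = len(posteriors)
    k = len(posteriors[0])
    marginal = [sum(row[j] for row in posteriors) / n for j in range(k)]
    conditional = sum(entropy(row) for row in posteriors) / n
    return entropy(marginal) - conditional


def regularized_objective(posteriors, penalty, lam):
    """ASSUMED form: mutual information minus lam times a penalty value."""
    return empirical_mutual_information(posteriors) - lam * penalty
```

For two points assigned with certainty to two different clusters the estimate equals log 2 nats; for two points split evenly across both clusters it is zero, which is why confident and balanced assignments are favored.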
If this is right
- The method estimates non-linear decision boundaries between clusters without any supervision.
- The regularizer and merge step together determine the number of clusters automatically.
- Clusters with irregular shapes or embedded noise are still recovered as intuitive groups.
- The approach applies directly to flow cytometry data to separate distinct cell populations.
Where Pith is reading between the lines
- The same regularized objective could be paired with other base distributions to handle different types of noise beyond uniforms.
- The uniform components offer a built-in mechanism for identifying points that do not belong to any cluster.
- Embedding the method inside a dimensionality-reduction pipeline might extend its use to higher-dimensional biological or image datasets.
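The point about uniform components acting as a noise detector can be sketched in one dimension. The example below pairs a single Gaussian with a single uniform background and returns the posterior probability that a point belongs to the background. All names and parameter values are hypothetical, and this is a simplified two-part density, not the paper's mixture-of-mixtures model.

```python
import math


def gaussian_pdf(x, mean, sd):
    """Density of a univariate normal distribution."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def uniform_pdf(x, low, high):
    """Density of a uniform distribution on [low, high]."""
    return 1.0 / (high - low) if low <= x <= high else 0.0


def noise_responsibility(x, weight_uniform, mean, sd, low, high):
    """Posterior probability that x came from the uniform (background) part."""
    g = (1.0 - weight_uniform) * gaussian_pdf(x, mean, sd)
    u = weight_uniform * uniform_pdf(x, low, high)
    total = g + u
    return u / total if total > 0.0 else 0.0
```

With a background weight of 0.1 on [-10, 10] and a standard normal cluster, a point at 0 is attributed almost entirely to the Gaussian, while a point at 6 is attributed almost entirely to the uniform part and could be flagged as noise.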
Load-bearing premise
The regularized mutual information objective combined with a mixture of Gaussian and uniform distributions will produce meaningful, non-linear boundaries and automatic component selection in a fully unsupervised setting without requiring labeled data.
What would settle it
A dataset with known ground-truth clusters of highly irregular non-convex shapes where the method either selects the wrong number of components or fails to recover the true assignments better than a standard Gaussian mixture model.
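Scoring such a test needs a measure of agreement between recovered and true assignments. The referee report on this page mentions the adjusted Rand index (ARI), so a minimal standard-library sketch of it follows. The function name and the handling of degenerate inputs (returning 1.0 when both partitions are trivially identical) are choices made here, not taken from the paper.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(truth, predicted):
    """Adjusted Rand index between two labelings of the same points."""
    n = len(truth)
    total = comb(n, 2)
    if total == 0:
        return 1.0  # fewer than two points: nothing to disagree on

    # Pair counts within contingency cells, row clusters, and column clusters.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(truth, predicted)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(truth).values())
    sum_cols = sum(comb(c, 2) for c in Counter(predicted).values())

    expected = sum_rows * sum_cols / total
    maximum = 0.5 * (sum_rows + sum_cols)
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings are one cluster
    return (sum_cells - expected) / (maximum - expected)
```

The index is invariant to relabeling, so `adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])` is 1.0, while a labeling that cuts across the truth, such as `[0, 1, 0, 1]`, scores below zero.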
Original abstract
Generative approaches to clustering provide information on geometric properties of clusters, whereas discriminative approaches provide boundaries between clusters. Ideas from both approaches are incorporated to present a fully unsupervised, probabilistic, and discriminative clustering method via a regularized mutual information objective function, wherein a mixture of mixtures of Gaussian and uniform distributions is used for formulation of the conditional model. Automatic selection of the number of components is established with the introduction of the regularizing term and a merge step, similar to those applied in reversible jump Markov chain Monte Carlo methods used in Bayesian clustering. Consequently, the turtle shell method -- a fully unsupervised clustering method capable of estimating non-linear boundary lines, automatically selecting the number of components, and capturing intuitive clusters in the presence of data abnormalities such as noise and/or irregular cluster shapes -- is introduced. We test this method on various simulated and real datasets commonly explored in clustering research, and extend the analysis to datasets arising from flow cytometry experiments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the turtle shell clustering method as a fully unsupervised probabilistic discriminative clustering approach. It formulates the problem via a regularized mutual information objective that employs a mixture-of-mixtures model consisting of Gaussian and uniform distributions to capture cluster geometry and boundaries. A regularizing term together with a merge step modeled after reversible-jump MCMC is used to achieve automatic selection of the number of components. The method is claimed to recover non-linear boundaries and to remain robust to noise and irregular shapes; it is evaluated on simulated data, standard clustering benchmarks, and flow-cytometry datasets.
Significance. If the central claims are substantiated, the work offers a principled bridge between generative mixture modeling and discriminative boundary estimation within a single unsupervised objective. Automatic component selection without labeled data or exhaustive hyper-parameter search would be a practical advance for noisy, high-dimensional applications such as flow cytometry. The explicit use of uniform components to model background or outliers is a concrete modeling choice that could generalize to other domains with contamination.
major comments (3)
- [§3.2] §3.2, Eq. (7): the regularized mutual-information objective contains a free regularization parameter λ whose selection procedure is not fully specified; the text states that λ is 'chosen once' yet provides neither a data-driven rule nor a sensitivity analysis showing that downstream cluster count and boundaries remain stable across a plausible range of λ.
- [§4.3] §4.3, Algorithm 1 (merge step): the acceptance probability for the merge operation is stated to be 'analogous to reversible-jump MCMC' but the precise Metropolis-Hastings ratio, proposal distribution, and Jacobian term are not derived; without these quantities it is impossible to verify that the merge step yields a consistent estimator of the number of components rather than an ad-hoc post-processing rule.
- [Table 2] Table 2 and Figure 4: the reported ARI and NMI values on the flow-cytometry data are given for a single run; no standard errors across random initializations or cross-validation folds are supplied, making it difficult to assess whether the apparent superiority over k-means and GMM is statistically reliable.
minor comments (3)
- [§2.1] Notation for the uniform component density is introduced in §2.1 but never given an explicit functional form; adding the support and normalization constant would remove ambiguity.
- [Figure 3] The caption of Figure 3 does not indicate the value of λ used for the displayed partition; this information should be added for reproducibility.
- [§3.1] Several references to 'mutual information' in §3.1 omit the base of the logarithm; consistency with the information-theoretic literature would be improved by stating whether nats or bits are used.
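To make the first and third minor comments concrete, the block below writes out one common explicit form for a uniform density and the nats-to-bits conversion for mutual information. The axis-aligned box support with bounds a and b is an assumption; the paper may define the support of its uniform components differently.

```latex
% Assumed form: uniform density on an axis-aligned box [a_1, b_1] x ... x [a_d, b_d]
f_U(x \mid a, b) = \prod_{j=1}^{d} \frac{\mathbb{1}\{a_j \le x_j \le b_j\}}{b_j - a_j},
\qquad a_j < b_j .

% Mutual information in nats (natural log) and its conversion to bits
I_{\text{nats}}(X; Y) = \sum_{x, y} p(x, y) \ln \frac{p(x, y)}{p(x)\, p(y)},
\qquad
I_{\text{bits}}(X; Y) = \frac{I_{\text{nats}}(X; Y)}{\ln 2} .
```

Stating the support and the normalizing volume removes the ambiguity the referee points to, and fixing the logarithm base makes reported objective values comparable across papers.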
Simulated Author's Rebuttal
Thank you for the constructive comments on our paper. We respond to each major comment in turn and indicate the revisions we plan to make to strengthen the manuscript.
- Referee: [§3.2] §3.2, Eq. (7): the regularized mutual-information objective contains a free regularization parameter λ whose selection procedure is not fully specified; the text states that λ is 'chosen once' yet provides neither a data-driven rule nor a sensitivity analysis showing that downstream cluster count and boundaries remain stable across a plausible range of λ.
  Authors: We agree that the selection procedure for the regularization parameter λ requires more detail. In the revised manuscript, we will specify a data-driven rule for choosing λ, such as selecting the value that maximizes the objective on a held-out subset or a default based on data dimensionality, and include a sensitivity analysis demonstrating that the cluster count and boundaries are stable over a range of λ values.
  revision: yes
- Referee: [§4.3] §4.3, Algorithm 1 (merge step): the acceptance probability for the merge operation is stated to be 'analogous to reversible-jump MCMC' but the precise Metropolis-Hastings ratio, proposal distribution, and Jacobian term are not derived; without these quantities it is impossible to verify that the merge step yields a consistent estimator of the number of components rather than an ad-hoc post-processing rule.
  Authors: The merge step is a deterministic post-processing rule applied after the main optimization to automatically select the number of components by merging those that do not improve the objective. It is inspired by but not a direct implementation of reversible-jump MCMC. We will revise the manuscript to remove the MCMC analogy, explicitly state that the acceptance is based on whether the regularized mutual information increases after the merge, and clarify that this is a heuristic procedure rather than a theoretically consistent MCMC estimator.
  revision: yes
- Referee: [Table 2] Table 2 and Figure 4: the reported ARI and NMI values on the flow-cytometry data are given for a single run; no standard errors across random initializations or cross-validation folds are supplied, making it difficult to assess whether the apparent superiority over k-means and GMM is statistically reliable.
  Authors: We acknowledge the need for measures of variability. In the revised version, we will repeat the flow-cytometry experiments across multiple random initializations (e.g., 20 runs) and report the mean ARI and NMI along with standard errors in Table 2. Error bars will be added to Figure 4 to reflect this variability.
  revision: yes
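The variability summary promised in the last response amounts to a mean and a standard error over repeated runs. A minimal sketch follows, assuming the per-run scores (for example, ARI values from different random initializations) have already been collected into a list; the function name is hypothetical and no scores from the paper are used.

```python
import math
import statistics


def mean_and_standard_error(scores):
    """Return (mean, standard error of the mean) for per-run scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a standard error")
    mean = statistics.mean(scores)
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
    se = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, se
```

The standard error shrinks with the square root of the number of runs, so going from a single run to 20 is what makes a comparison against k-means or GMM interpretable at all.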
Circularity Check
No significant circularity identified
The paper defines a new unsupervised clustering procedure by combining a regularized mutual information objective with a mixture-of-mixtures model (Gaussians plus uniforms) and an explicit merge step for component selection. These elements are introduced as part of the method's construction rather than derived from, or fitted to, a target quantity that is then relabeled as a prediction. No load-bearing self-citation chains, self-definitional loops, or renamings of known results appear in the abstract or high-level description; the merge step is presented as an algorithmic addition analogous to existing reversible-jump MCMC techniques, but it is not claimed to be forced by the authors' prior work. The derivation therefore remains self-contained and independent of its own outputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- regularization parameter λ
axioms (2)
- domain assumption: Data can be adequately modeled by a mixture of mixtures of Gaussian and uniform distributions for the conditional model.
- ad hoc to paper: A merge step analogous to reversible jump MCMC will correctly determine the number of components without supervision.
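The simulated rebuttal on this page describes the merge step as a deterministic rule that accepts a merge when the regularized mutual information increases. A greedy sketch of that rule is given below. It operates on a table of soft assignments and pools two components by adding their columns. The penalty `lam * number_of_components` is an assumption chosen for illustration, as are all function names; the paper's regularizer and Algorithm 1 may differ.

```python
import math
from itertools import combinations


def entropy(p):
    """Shannon entropy, in nats, of a probability vector."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def mutual_information(posteriors):
    """Empirical I(X; Y) from rows p(y | x_i)."""
    n, k = len(posteriors), len(posteriors[0])
    marginal = [sum(row[j] for row in posteriors) / n for j in range(k)]
    return entropy(marginal) - sum(entropy(row) for row in posteriors) / n


def objective(posteriors, lam):
    # ASSUMPTION: the penalty is lam times the number of components.
    return mutual_information(posteriors) - lam * len(posteriors[0])


def merge_columns(posteriors, a, b):
    """Pool components a and b (with a < b) by adding their columns."""
    merged = []
    for row in posteriors:
        new_row = [p for j, p in enumerate(row) if j != b]
        new_row[a] = row[a] + row[b]
        merged.append(new_row)
    return merged


def greedy_merge(posteriors, lam):
    """Apply the best pairwise merge while it strictly raises the objective."""
    current = [list(row) for row in posteriors]
    while len(current[0]) > 1:
        pairs = combinations(range(len(current[0])), 2)
        best = max(
            (merge_columns(current, a, b) for a, b in pairs),
            key=lambda candidate: objective(candidate, lam),
        )
        if objective(best, lam) <= objective(current, lam):
            break
        current = best
    return current
```

Pooling two labels cannot increase mutual information, so under this assumed penalty a merge is accepted only when the information lost is smaller than `lam`. That makes the final component count depend directly on λ, which is the sensitivity the referee asks the authors to document.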