pith. sign in

arxiv: 1906.10094 · v1 · pith:CJZ3HLRMnew · submitted 2019-06-24 · 📊 stat.ML · cs.LG

Density-based Clustering with Best-scored Random Forest

Pith reviewed 2026-05-25 16:48 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords density-based clusteringrandom forestbest-scored selectionconsistencyconvergence ratesensemble clusteringunsupervised learning
0
0 comments X

The pith

The best-scored clustering forest selects the random tree with best empirical performance to achieve consistent density-based clustering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes an algorithm called best-scored clustering forest that generates multiple purely random trees and selects the single tree with the best empirical performance to set the optimal density level and identify clusters. This single-level density-based approach aims to avoid manual choice of the level parameter while recovering target clusters from data. Theoretical results establish that the procedure is consistent, recovering the true clusters with probability approaching one as the sample size grows. Under mild restrictions on the densities and clusters, the method attains fast convergence rates. Experiments on synthetic data and real benchmark sets show competitive accuracy against other clustering techniques.

Core claim

The best-scored clustering forest obtains the optimal level and determines corresponding clusters by selecting one random tree with the best empirical performance out of a certain number of purely random tree candidates. Consistency of the proposed algorithm can be guaranteed. Moreover, under certain mild restrictions on the underlying density functions and target clusters, even fast convergence rates can be achieved.

What carries the argument

The best-scored selection that chooses the single random tree with the best empirical performance from multiple candidates to fix the density level for clustering.

If this is right

  • The algorithm is consistent and recovers target clusters as sample size increases.
  • Fast convergence rates hold when the densities and clusters obey the mild restrictions.
  • The method achieves competitive accuracy on synthetic data and real benchmark data sets.
  • It compares favorably with state-of-the-art clustering methods in numerical tests.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The selection step may reduce sensitivity to the choice of density level in practice.
  • The same best-scored idea could extend to other ensemble-based density estimators.
  • High-dimensional data sets might still require additional restrictions beyond those stated for the rates to hold.

Load-bearing premise

The underlying density functions and target clusters satisfy certain mild restrictions that enable the fast convergence rates.

What would settle it

Empirical clustering error that fails to decrease at the claimed rate on data sets where the densities and clusters meet the stated mild restrictions would falsify the fast-rate result.

Figures

Figures reproduced from arXiv: 1906.10094 by Hanfang Yang, Hanyuan Hang, Yuchao Cai.

Figure 1
Figure 1. Figure 1: Topologically relevant changes on set of measure zero. Left: The thick solid [PITH_FULL_IMAGE:figures/full_fig_p008_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Definition of clusters. Left: A one-dimensional mixture of three Guassians with [PITH_FULL_IMAGE:figures/full_fig_p010_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Thick level sets. Left: The thick solid line presents a level set [PITH_FULL_IMAGE:figures/full_fig_p012_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Possible construction procedures of three-split axis-parallel purely random par [PITH_FULL_IMAGE:figures/full_fig_p014_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Illustration of Algorithm 1. Left: The density presented by solid line has two [PITH_FULL_IMAGE:figures/full_fig_p019_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Separation and flatness. These two figures illustrate two possible shapes of the [PITH_FULL_IMAGE:figures/full_fig_p029_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Left: Adaptive method vs. Right: Purely random method Finally, it’s worth mentioning that in order to improve the efficiency and accuracy of our clustering algorithm, we also employ the adaptive splitting method (see [PITH_FULL_IMAGE:figures/full_fig_p033_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Synthetic data. The picture at the upper left shows [PITH_FULL_IMAGE:figures/full_fig_p037_8.png] view at source ↗
read the original abstract

Single-level density-based approach has long been widely acknowledged to be a conceptually and mathematically convincing clustering method. In this paper, we propose an algorithm called "best-scored clustering forest" that can obtain the optimal level and determine corresponding clusters. The terminology "best-scored" means to select one random tree with the best empirical performance out of a certain number of purely random tree candidates. From the theoretical perspective, we first show that consistency of our proposed algorithm can be guaranteed. Moreover, under certain mild restrictions on the underlying density functions and target clusters, even fast convergence rates can be achieved. Last but not least, comparisons with other state-of-the-art clustering methods in the numerical experiments demonstrate accuracy of our algorithm on both synthetic data and several benchmark real data sets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes a 'best-scored clustering forest' algorithm for single-level density-based clustering. It selects the single random tree with the best empirical performance from a pool of purely random tree candidates, claims to guarantee consistency of the resulting clustering, and asserts that fast convergence rates are achievable under mild restrictions on the underlying density functions and target clusters. Numerical experiments on synthetic data and benchmark real datasets are reported to demonstrate competitive accuracy against state-of-the-art methods.

Significance. A rigorously proven consistent density-based clustering procedure with explicit fast rates would be a useful addition to the literature on random-forest-style clustering. The experimental comparisons, if conducted with standard protocols, would provide practical evidence of utility. However, the absence of any derivation, assumption list, or error analysis in the supplied text prevents assessment of whether these theoretical claims are actually established.

major comments (1)
  1. [Abstract] Abstract: the central claims that 'consistency of our proposed algorithm can be guaranteed' and that 'even fast convergence rates can be achieved' under 'certain mild restrictions' are asserted without any reference to a theorem statement, assumption list, or proof sketch. Because these claims constitute the primary theoretical contribution, their verifiability is load-bearing for the paper's value.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their review. The primary concern raised is the lack of explicit references in the abstract to the supporting theorems, assumptions, and proofs. We address this below and note that the full manuscript contains the requested details in dedicated sections.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claims that 'consistency of our proposed algorithm can be guaranteed' and that 'even fast convergence rates can be achieved' under 'certain mild restrictions' are asserted without any reference to a theorem statement, assumption list, or proof sketch. Because these claims constitute the primary theoretical contribution, their verifiability is load-bearing for the paper's value.

    Authors: We agree that the abstract would benefit from explicit pointers to the relevant results for immediate verifiability. The consistency guarantee is stated as Theorem 3.1 (under Assumptions 3.1–3.3 on the density and cluster boundaries), with the full proof given in Section 3.2. The fast convergence rates appear as Theorem 4.1 (under the additional mild restrictions listed in Section 4.1), with the proof sketch and rate derivation in Section 4.2. We will revise the abstract to cite these theorems directly. revision: yes

Circularity Check

0 steps flagged

No circularity identified: only abstract supplied with no equations or derivation chain visible

full rationale

The provided document contains solely the abstract, which states claims of consistency and fast convergence rates under mild restrictions on densities and clusters, plus empirical accuracy on data sets. No equations, proofs, fitting procedures, self-citations, or derivation steps are present to inspect. Per the hard rules, circularity requires quoting specific reductions (e.g., a prediction equaling a fitted input by construction); none exist here. The central claims therefore cannot be shown to reduce to inputs by definition or self-citation, making the derivation self-contained against external benchmarks by default. This is the expected honest non-finding when technical content is unavailable.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, background axioms, or new postulated entities.

pith-pipeline@v0.9.0 · 5654 in / 1008 out tokens · 33109 ms · 2026-05-25T16:48:26.384362+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 1 internal anchor

  1. [1]

    Breiman, L. (2000). Some infinite theory for predictor ensembles. University of California at Berkeley Papers . 50

  2. [2]

    Chaudhuri, K. and S. Dasgupta (2010). Rates of convergence for the cluster tree. In In Advances in Neural Information Processing Systems

  3. [3]

    Cuevas, A. and R. Fraiman (1997). A plug-in approach to support estimation. Annals of Statistics 25 (6), 2300–2312

  4. [4]

    Defays, D. (1977). An efficient algorithm for a complete link method. Computer Jour- nal 20 (4), 364–366

  5. [5]

    Dempster, A. P., N. M. Laird, and D. B. Rubin (1977). Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society 39 (1), 1–38

  6. [6]

    Donath, W. E. and A. J. Hoffman (1973). Lower bounds for the partitioning of graphs. IBM Journal of Research and Development 17 (5), 420–425

  7. [7]

    Ester, M., H. P. Kriegel, and X. Xu (1996). A density-based algorithm for discovering clusters a density-based algorithm for discovering clusters in large spatial databases with noise. In International Conference on Knowledge Discovery and Data Mining

  8. [8]

    Filipovych, R., S. M. Resnick, and C. Davatzikos (2011). Semi-supervised cluster analysis of imaging data. NeuroImage 54 (3), 2185–2197

  9. [9]

    Hang, H. and H. Wen (2018). Best-scored random forest density estimation. arXiv preprint arXiv:1905.03729

  10. [10]

    Hartigan, J. A. (1975). Clustering Algorithms. Wiley, New York

  11. [11]

    Hartigan, J. A. (1981). Consistency of single linkage for high-density clusters. Publications of the American Statistical Association 76 (374), 388–394. 51

  12. [12]

    Kpotufe, S. (2011). Pruning nearest neighbor cluster trees. Proceedings of the 28th International Conference on Machine Learning , 225–232

  13. [13]

    Luxburg, U. V. (2007). A tutorial on spectral clustering. Statistics and Computing 17 (4), 395–416

  14. [14]

    Macqueen, J. (1967). Some methods for classification and analysis of multivariate obser- vations. In Fifth Berkeley Symposium on Mathematical Statistics and Probability , pp. 281–297

  15. [15]

    Hein, and U

    Maier, M., M. Hein, and U. Von Luxburg (2012). Optimal construction of k-nearest neighbor graphs for identifying noisy clusters. Theoretical Computer Science 410 (19), 1749–1764

  16. [16]

    Marco, A. D. and R. Navigli (2013). Clustering and diversifying web search results with graph-based word sense induction. Computational Linguistics 39 (3), 709–754

  17. [17]

    Menardi, G. and A. Azzalini (2014). An advancement in clustering via nonparametric density estimation. Statistics and Computing 24 (5), 753–767

  18. [18]

    Polonik, W. (1995). Measuring mass concentrations and estimating density contour clusters— an excess mass approach. The Annals of Statistics 23 (3), 855–881

  19. [19]

    Rigollet, P. (2006). Generalization error bounds in semi-supervised classification under the cluster assumption. Journal of Machine Learning Research 8 (3), 1369–1392

  20. [20]

    Rinaldo, A. and L. Wasserman (2010). Generalized density clustering. The Annals of Statistics 38 (5), 2678–2722

  21. [21]

    Sibson, R. (1973). Slink: An optimally efficient algorithm for the single-link cluster method. Computer Journal 16 (1), 30–34. 52

  22. [22]

    Sriperumbudur, B. and I. Steinwart (2012). Consistency and rates for clustering with DBSCAN. In Artificial Intelligence and Statistics , pp. 1090–1098

  23. [23]

    Steinwart, I. (2011). Adaptive density level set clustering. In Proceedings of the 24th Annual Conference on Learning Theory, pp. 703–738

  24. [24]

    Steinwart, I., B. K. Sriperumbudur, and P. Thomann (2017). Adaptive clustering using kernel density estimators. arXiv preprint arXiv:1708.05254

  25. [25]

    Stuetzle, W. (2003). Estimating the cluster tree of a density by analyzing the minimal spanning tree of a sample. Journal of Classification 20 (1), 25–47

  26. [26]

    Ward, J. H. (1963). Hierarchical grouping to optimize an objective function. Publications of the American Statistical Association 58 (301), 236–244

  27. [27]

    Weiss, S. M. and C. A. Kulikowski (1991). Computer systems that learn: classification and prediction methods from statistics, neural nets, machine learning, and expert systems . Morgan Kaufmann Publishers Inc

  28. [28]

    and X.-F

    Xu, Y. and X.-F. Wang (2016). Fast clustering using adaptive density peak detection. Statistical Methods in Medical Research 26 (5), 2800–2811. 53