pith. sign in

arxiv: 2604.13816 · v1 · submitted 2026-04-15 · 💻 cs.LG

Composite Silhouette: A Subsampling-based Aggregation Strategy

Pith reviewed 2026-05-10 14:06 UTC · model grok-4.3

classification 💻 cs.LG
keywords cluster validationsilhouette coefficientnumber of clusterssubsamplingmicro-averagingmacro-averaginginternal criterionunsupervised learning
0
0 comments X

The pith

Composite Silhouette aggregates micro- and macro-averaged scores from subsampled clusterings to select the true number of clusters more accurately.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Composite Silhouette as a new internal validation criterion for determining the number of clusters when ground-truth labels are unavailable. Standard micro-averaged Silhouette tends to favor larger clusters under size imbalance, while macro-averaging can amplify noise from small groups; the new method combines the two for each subsample through an adaptive convex weight based on their normalized discrepancy, smoothed by a bounded nonlinearity, then averages the results across subsamples. This approach aims to reconcile the strengths of both averaging styles while providing finite-sample concentration guarantees. A sympathetic reader would care because cluster-count selection is a core unsupervised task, and biased metrics lead to unreliable partitions in real data with uneven group sizes.

Core claim

Composite Silhouette aggregates evidence across repeated subsampled clusterings rather than a single partition. For each subsample, micro- and macro-averaged Silhouette scores are combined through an adaptive convex weight determined by their normalized discrepancy and smoothed by a bounded nonlinearity; the final score is obtained by averaging these subsample-level composites. The criterion reconciles the strengths of micro- and macro-averaging and improves recovery of the ground-truth number of clusters on both synthetic and real-world datasets.

What carries the argument

The adaptive convex weight, set by the normalized discrepancy between micro- and macro-averaged Silhouette scores and smoothed by a bounded nonlinearity, applied across repeated subsampled clusterings.

If this is right

  • The method yields more accurate recovery of the ground-truth number of clusters than standard Silhouette on synthetic and real-world data.
  • Finite-sample concentration guarantees hold for the subsampling-based estimate.
  • Key mathematical properties of the composite criterion are established.
  • Bias from cluster size imbalance is reduced without overemphasizing noise in small groups.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The subsampling aggregation could be tested on streaming data where full clustering is infeasible.
  • Similar adaptive weighting might stabilize other internal metrics such as Davies-Bouldin or Calinski-Harabasz.
  • If the discrepancy-based rule proves robust, the same pattern could apply to ensemble validation across different clustering algorithms.

Load-bearing premise

The adaptive convex weight based on normalized discrepancy between micro- and macro-scores, together with the bounded nonlinearity, produces a stable and unbiased aggregate that improves cluster-count selection.

What would settle it

Run the method on multiple datasets with known ground-truth cluster counts and controlled size imbalance; if it does not recover the true number more frequently than standard micro-averaged Silhouette, the advantage claim fails.

read the original abstract

Determining the number of clusters is a central challenge in unsupervised learning, where ground-truth labels are unavailable. The Silhouette coefficient is a widely used internal validation metric for this task, yet its standard micro-averaged form tends to favor larger clusters under size imbalance. Macro-averaging mitigates this bias by weighting clusters equally, but may overemphasize noise from under-represented groups. We introduce Composite Silhouette, an internal criterion for cluster-count selection that aggregates evidence across repeated subsampled clusterings rather than relying on a single partition. For each subsample, micro- and macro-averaged Silhouette scores are combined through an adaptive convex weight determined by their normalized discrepancy and smoothed by a bounded nonlinearity; the final score is then obtained by averaging these subsample-level composites. We establish key properties of the criterion and derive finite-sample concentration guarantees for its subsampling estimate. Experiments on synthetic and real-world datasets show that Composite Silhouette effectively reconciles the strengths of micro- and macro-averaging, yielding more accurate recovery of the ground-truth number of clusters.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Composite Silhouette, a subsampling-based aggregation strategy for cluster-count selection. For each subsample, it computes an adaptive convex combination of micro- and macro-averaged Silhouette scores, where the weight is determined by the normalized discrepancy between the two scores and smoothed via a bounded nonlinearity. The final criterion is the average of these composite scores over subsamples. The authors claim to establish key properties of this criterion and derive finite-sample concentration guarantees for the subsampling estimator. Experiments on synthetic and real-world data are said to show superior recovery of the ground-truth number of clusters.

Significance. If the central claims hold, the work offers a principled way to mitigate the cluster-size bias in standard Silhouette while avoiding the noise sensitivity of pure macro-averaging. The adaptive weighting and subsampling aggregation represent a potentially useful advance for internal cluster validation in imbalanced datasets. The provision of concentration bounds, if rigorously established, would strengthen the method's theoretical foundation beyond purely empirical proposals.

major comments (2)
  1. [§4] §4 (finite-sample concentration guarantees): The derivation applies standard concentration inequalities (e.g., bounded differences) directly to the composite score. Because the adaptive convex weight is computed from the micro- and macro-Silhouette values on the identical subsample, the composite is a data-dependent function of the data. No Lipschitz constant on the discrepancy-to-weight map or separate bias/variance decomposition for the adaptive step is supplied, so the guarantees do not automatically transfer; this is load-bearing for the stability and unbiasedness claims under imbalance.
  2. [§5] §5 (experimental evaluation): The claim that Composite Silhouette yields more accurate ground-truth k recovery rests on experiments whose design is not fully specified (number of subsamples, imbalance regimes tested, exact synthetic generation process, number of independent runs, and statistical testing). Without these details the empirical superiority over micro- and macro-Silhouette cannot be verified and the weakest assumption (stable unbiased aggregate) remains untested.
minor comments (1)
  1. [Method] The definition of the normalized discrepancy and the precise form of the bounded nonlinearity are stated only in prose; an explicit equation or algorithm box would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We respond point-by-point to the major comments below, providing clarifications on the theoretical analysis and committing to expand experimental details for reproducibility.

read point-by-point responses
  1. Referee: [§4] §4 (finite-sample concentration guarantees): The derivation applies standard concentration inequalities (e.g., bounded differences) directly to the composite score. Because the adaptive convex weight is computed from the micro- and macro-Silhouette values on the identical subsample, the composite is a data-dependent function of the data. No Lipschitz constant on the discrepancy-to-weight map or separate bias/variance decomposition for the adaptive step is supplied, so the guarantees do not automatically transfer; this is load-bearing for the stability and unbiasedness claims under imbalance.

    Authors: We acknowledge that the adaptive weighting renders the per-subsample composite a data-dependent function. However, because both the micro- and macro-averaged Silhouette scores lie in [-1,1] and the weight is obtained via a bounded, continuous nonlinearity of their normalized discrepancy (itself in [0,1]), the resulting composite score remains bounded in [-1,1] irrespective of the realized weight. This boundedness permits direct application of McDiarmid’s inequality to the average over independent subsamples. In the revision we will explicitly derive and state the Lipschitz constant of the discrepancy-to-weight map (which is finite because the nonlinearity is smooth and bounded) and include a short bias-variance discussion showing that any adaptivity-induced bias vanishes with increasing subsample size. These additions will make the transfer of the concentration guarantees fully rigorous. revision: yes

  2. Referee: [§5] §5 (experimental evaluation): The claim that Composite Silhouette yields more accurate ground-truth k recovery rests on experiments whose design is not fully specified (number of subsamples, imbalance regimes tested, exact synthetic generation process, number of independent runs, and statistical testing). Without these details the empirical superiority over micro- and macro-Silhouette cannot be verified and the weakest assumption (stable unbiased aggregate) remains untested.

    Authors: We agree that the experimental protocol requires fuller specification. In the revised manuscript we will add: the number of subsamples (50), the imbalance regimes examined (size ratios from 1:1 to 1:100), the precise synthetic data generator (Gaussian mixtures with controlled means, covariances, and per-cluster sample sizes), the number of independent Monte-Carlo runs (200), and the statistical tests employed (Wilcoxon signed-rank tests with reported p-values). These details were present in the supplementary code and appendix but will be moved into the main text. The synthetic experiments, which use known ground-truth labels, directly evaluate recovery accuracy and consistency across runs and imbalance levels, thereby providing empirical evidence for the stability of the aggregated criterion. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The Composite Silhouette criterion is explicitly constructed by combining the two standard (micro- and macro-averaged) Silhouette scores via an adaptive convex weight computed from their normalized discrepancy and smoothed by a bounded nonlinearity, then averaged over subsamples. The paper states that it establishes key properties of this criterion and derives finite-sample concentration guarantees for the resulting subsampling estimator. No step in the provided description reduces a claimed result to its own inputs by construction, renames a fitted quantity as a prediction, or relies on a load-bearing self-citation whose content is itself unverified; the adaptive weighting rule is a deliberate definitional choice rather than a self-referential equation or statistical tautology. The derivation therefore remains self-contained against the standard Silhouette definitions and standard concentration tools.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the validity of the adaptive weighting rule and the finite-sample concentration guarantees for the subsampled estimator; these are asserted but not derived in the provided abstract.

axioms (1)
  • domain assumption Finite-sample concentration inequalities apply to the subsampling-based composite estimator
    Invoked to support the theoretical properties of the criterion

pith-pipeline@v0.9.0 · 5481 in / 1252 out tokens · 59302 ms · 2026-05-10T14:06:21.053975+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages

  1. [1]

    ACM computing surveys (CSUR)31(3), 264–323 (1999)

    Jain, A.K., Murty, M.N., Flynn, P.J.: Data clustering: a review. ACM computing surveys (CSUR)31(3), 264–323 (1999)

  2. [2]

    Journal of computational and applied mathematics20, 53–65 (1987)

    Rousseeuw, P.J.: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of computational and applied mathematics20, 53–65 (1987)

  3. [3]

    Journal of Machine Learning Research12, 2825–2830 (2011)

    Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, ´E.: Scikit-learn: Machine learning in python. Journal of Machine Learning Research12, 2825–2830 (2011)

  4. [4]

    Computa- tional Statistics & Data Analysis158, 107190 (2021)

    Batool, F., Hennig, C.: Clustering with the average silhouette width. Computa- tional Statistics & Data Analysis158, 107190 (2021)

  5. [5]

    In: Pedreschi, D., Monreale, A., Guidotti, R., Pellungrini, R., Naretto, F

    Pavlopoulos, J., Vardakas, G., Likas, A.: Revisiting silhouette aggregation. In: Pedreschi, D., Monreale, A., Guidotti, R., Pellungrini, R., Naretto, F. (eds.) Discovery Science, pp. 354–368. Springer, Cham (2025)

  6. [6]

    Wang, Y., Zhao, Y., Therneau, T., Atkinson, E., P. Tafti, A., Zhang, N., Amin, S., Limper, A., Khosla, S., Liu, H.: Unsupervised machine learning for the discovery 16 of latent disease clusters and patient subgroups using electronic health records. Journal of Biomedical Informatics102, 103364 (2019)

  7. [7]

    Analytics2, 809–823 (2023)

    John, J., Shobayo, O., Ogunleye, B.: An exploration of clustering algorithms for customer segmentation in the uk retail market. Analytics2, 809–823 (2023)

  8. [8]

    Information Retrieval Journal25, 239–268 (2022)

    Yuan, M., Zobel, J., Lin, P.: Measurement of clustering effectiveness for document collections. Information Retrieval Journal25, 239–268 (2022)

  9. [9]

    SIGMOD Record31(2002)

    Halkidi, M., Batistakis, Y., Vazirgiannis, M.: Cluster validity methods: Part i. SIGMOD Record31(2002)

  10. [10]

    ACM SIGMOD Record31(2002)

    Halkidi, M., Batistakis, Y., Vazirgiannis, M.: Clustering validity checking meth- ods: Part ii. ACM SIGMOD Record31(2002)

  11. [11]

    Pattern Recognit.46, 243–256 (2013)

    Arbelaitz, O., Gurrutxaga, I., Muguerza, J., P´ erez, J.M., Perona, I.: An exten- sive comparative study of cluster validity indices. Pattern Recognit.46, 243–256 (2013)

  12. [12]

    Journal of Cybernetics3(3), 32–57 (1973)

    Dunn, J.C.: A fuzzy relative of the isodata process and its use in detecting compact well-separated clusters. Journal of Cybernetics3(3), 32–57 (1973)

  13. [13]

    Communica- tions in Statistics-theory and Methods3(1), 1–27 (1974)

    Cali´ nski, T., Harabasz, J.: A dendrite method for cluster analysis. Communica- tions in Statistics-theory and Methods3(1), 1–27 (1974)

  14. [14]

    IEEE transactions on pattern analysis and machine intelligence (2), 224–227 (1979)

    Davies, D.L., Bouldin, D.W.: A cluster separation measure. IEEE transactions on pattern analysis and machine intelligence (2), 224–227 (1979)

  15. [15]

    In: Classification and Data Analysis: Theory and Applications 28, pp

    Dudek, A.: Silhouette index as clustering evaluation tool. In: Classification and Data Analysis: Theory and Applications 28, pp. 19–33 (2020). Springer

  16. [16]

    Statistical Analysis and Data Mining: The ASA Data Science Journal3(2010)

    Vendramin, L., Campello, R.J.G.B., Hruschka, E.R.: Relative clustering validity criteria: A comparative overview. Statistical Analysis and Data Mining: The ASA Data Science Journal3(2010)

  17. [17]

    ArXivabs/2407.20246 (2024)

    Hassan, B.A., Tayfor, N., Hassan, A.A., Ahmed, A.M., Hamad, R.K., Abdalla, N.N.: From a-to-z review of clustering validation indices. ArXivabs/2407.20246 (2024)

  18. [18]

    Thorndike, R.L.: Who belongs in the family? Psychometrika18(4), 267–276 (1953)

  19. [19]

    EURASIP Journal on Wireless Communications and Networking2021(2020)

    Shi, C., Wei, B., Wei, S., Wang, W., Liu, H., Liu, J.: A quantitative discriminant method of elbow point for the optimal number of clusters in clustering algorithm. EURASIP Journal on Wireless Communications and Networking2021(2020)

  20. [20]

    Journal of the Royal Statistical Society: Series B 17 (Statistical Methodology)63(2001)

    Tibshirani, R., Walther, G., Hastie, T.: Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B 17 (Statistical Methodology)63(2001)

  21. [21]

    In: 2020 IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA), pp

    Shahapure, K.R., Nicholas, C.: Cluster quality analysis using silhouette score. In: 2020 IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA), pp. 747–748 (2020). IEEE

  22. [22]

    Neural Computation16, 1299–1323 (2004)

    Lange, T., Roth, V., Braun, M.L., Buhmann, J.M.: Stability-based validation of clustering solutions. Neural Computation16, 1299–1323 (2004)

  23. [23]

    In: 2010 IEEE International Conference on Data Mining, pp

    Liu, Y., Li, Z., Xiong, H., Gao, X., Wu, J.: Understanding of internal clustering validation measures. In: 2010 IEEE International Conference on Data Mining, pp. 911–916 (2010). IEEE

  24. [24]

    In: Inui, K., Jiang, J., Ng, V., Wan, X

    Reimers, N., Gurevych, I.: Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In: Inui, K., Jiang, J., Ng, V., Wan, X. (eds.) Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP- IJCNLP), pp. 3982–3992. Association for Computati...