arxiv: 2605.22110 · v1 · pith:N5DUVK2Znew · submitted 2026-05-21 · 📊 stat.ME

Two-stage Ensemble Clustering of Functional Data Using Random Projections

Pith reviewed 2026-05-22 04:09 UTC · model grok-4.3

classification 📊 stat.ME
keywords functional data · clustering · random projections · Gaussian processes · ensemble methods · MADD dissimilarity · two-stage algorithm

The pith

A two-stage method based on Gaussian random projections clusters functional data and, in the authors' simulations and real-data applications, is more accurate than many state-of-the-art approaches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a clustering technique for functional data that first projects each curve onto many independent Gaussian process realizations and groups the resulting high-dimensional vectors using a dissimilarity measure called the Mean Absolute Difference of Distances (MADD). This initial partition is then used to estimate a covariance operator, which generates more targeted random projections for a second, refining stage of clustering. Finally, a normalized cost function selects the best clustering among the candidate solutions. The approach handles irregular and partially observed curves, and the authors report from simulations and real-data applications that it outperforms many state-of-the-art methods across a range of functional data settings.

Core claim

The central claim is that an ensemble of random projections drawn from Gaussian processes can capture distributional differences between functional populations, and that refining these projections in a second stage using labels from the first stage improves clustering performance across a range of settings.

What carries the argument

The two-stage clustering procedure that employs prespecified Gaussian random projections and the MADD dissimilarity for initial grouping, followed by covariance-driven projections for refinement.
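As a rough illustration of the Stage I machinery, the sketch below projects curves sampled on a common equispaced grid onto random Gaussian process paths and then computes a MADD matrix. Several choices here are assumptions and are not taken from the paper: Brownian motion stands in for the prespecified projection families, the inner product is a plain trapezoid rule, and MADD is written in the form commonly used in the high-dimensional clustering literature, rho(i, j) = (1 / (n - 2)) * sum over k not in {i, j} of |d(i, k) - d(j, k)|, with d the Euclidean distance scaled by the square root of the dimension. The function names are hypothetical.

```python
import math
import random


def brownian_path(n_points, rng):
    """One Brownian-motion realization on an equispaced grid over [0, 1]."""
    step_sd = math.sqrt(1.0 / (n_points - 1))
    path = [0.0]
    for _ in range(n_points - 1):
        path.append(path[-1] + rng.gauss(0.0, step_sd))
    return path


def inner_product(f, g):
    """Trapezoid-rule approximation of the L2 inner product on [0, 1]."""
    h = 1.0 / (len(f) - 1)
    prods = [a * b for a, b in zip(f, g)]
    return h * (sum(prods) - 0.5 * (prods[0] + prods[-1]))


def project(curves, n_proj, rng):
    """Map each curve to a vector of n_proj random projections."""
    n_points = len(curves[0])
    directions = [brownian_path(n_points, rng) for _ in range(n_proj)]
    return [[inner_product(c, g) for g in directions] for c in curves]


def madd_matrix(features):
    """Pairwise MADD from scaled Euclidean distances (needs n >= 3 rows)."""
    n, p = len(features), len(features[0])
    dist = [[math.dist(u, v) / math.sqrt(p) for v in features] for u in features]
    madd = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            total = sum(abs(dist[i][k] - dist[j][k])
                        for k in range(n) if k != i and k != j)
            madd[i][j] = madd[j][i] = total / (n - 2)
    return madd
```

The resulting matrix can be passed to any clustering routine that accepts a dissimilarity matrix; which routine the authors use is not stated on this page.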

If this is right

  • The method applies to irregular and partially observed functional data without special adjustments.
  • Extensive simulations demonstrate superior accuracy compared to many existing clustering techniques.
  • Real-data applications confirm the method's practical effectiveness.
  • Population-level analysis explains why the random projections distinguish distributional differences.
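The page does not say how projections are computed for irregular or partially observed curves. One plausible route, sketched here purely as an assumption, is to represent each random direction as a function that can be evaluated at any time point and to integrate over each curve's own observation times. The random cosine series below is a hypothetical stand-in for a Gaussian process direction (a finite series with Gaussian coefficients is a Gaussian process), and both function names are illustrative.

```python
import math
import random


def random_cosine_direction(n_terms, rng):
    """A Gaussian-process direction as a callable: a cosine series with
    independent Gaussian coefficients that decay like 1/k."""
    coefs = [rng.gauss(0.0, 1.0) / k for k in range(1, n_terms + 1)]
    return lambda t: sum(c * math.cos(k * math.pi * t)
                         for k, c in enumerate(coefs, start=1))


def irregular_inner_product(times, values, direction):
    """Trapezoid-rule inner product over a curve's own observation times.

    `direction` is a callable t -> g(t), so it can be evaluated wherever
    the curve happens to be observed. For a partially observed curve the
    integral covers only the observed range.
    """
    prods = [v * direction(t) for t, v in zip(times, values)]
    return sum(0.5 * (prods[i] + prods[i + 1]) * (times[i + 1] - times[i])
               for i in range(len(times) - 1))
```

With this design, curves observed on different grids still map to feature vectors of the same length, one entry per direction, which is what a MADD-type dissimilarity needs.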

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the first stage provides reasonable starting labels, the second stage can significantly improve separation by focusing projections on data-specific structures.
  • This framework might generalize to clustering other high-dimensional objects by adapting the projection families.
  • Selecting the optimal clustering via the normalized cost function could be tested on larger datasets to assess scalability.

Load-bearing premise

The first stage clustering with fixed projection families produces initial labels accurate enough that the covariance estimated from them improves the separation achieved by the second stage projections.
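To make the dependence on Stage I labels concrete, here is a minimal sketch of how label-driven projection directions could be drawn. It assumes curves on a common grid and uses the pooled within-cluster sample covariance; the paper's actual covariance estimator may differ, for example by smoothing. The sketch relies on a standard fact: if r_1, ..., r_m are the centered residuals and xi_1, ..., xi_m are independent standard normals, then (1 / sqrt(m)) * sum_i xi_i r_i is, conditionally on the data, Gaussian with covariance (1 / m) * sum_i r_i r_i^T. Function names are hypothetical.

```python
import math
import random


def within_cluster_residuals(curves, labels):
    """Subtract each curve's cluster mean, pointwise on the common grid."""
    residuals = []
    for c in sorted(set(labels)):
        members = [f for f, lab in zip(curves, labels) if lab == c]
        mean = [sum(vals) / len(members) for vals in zip(*members)]
        residuals += [[x - m for x, m in zip(f, mean)] for f in members]
    return residuals


def covariance_driven_direction(residuals, rng):
    """Draw one Gaussian direction whose covariance is the pooled
    within-cluster sample covariance (1/m) * sum_i r_i r_i^T."""
    m = len(residuals)
    weights = [rng.gauss(0.0, 1.0) / math.sqrt(m) for _ in range(m)]
    return [sum(w * r[t] for w, r in zip(weights, residuals))
            for t in range(len(residuals[0]))]
```

Because the cluster means are computed from the Stage I labels, mislabeled curves inflate the residuals along the direction separating the true groups, which is one way label errors could propagate into Stage II.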

What would settle it

A simulation study in which the first-stage clusters are forced to be random or incorrect, followed by checking whether the second stage still yields better final clustering than a single-stage approach.
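Two building blocks for such a study are sketched below: a routine that corrupts a known labeling at a controlled rate, and an agreement score between two partitions. The Rand index is used here as one common choice of score; treating it as the evaluation metric is an assumption, as are the function names.

```python
import random
from itertools import combinations


def corrupt_labels(labels, rate, n_clusters, rng):
    """Reassign a fraction `rate` of labels to a different cluster at random."""
    labels = list(labels)
    n_flip = round(rate * len(labels))
    for idx in rng.sample(range(len(labels)), n_flip):
        others = [c for c in range(n_clusters) if c != labels[idx]]
        labels[idx] = rng.choice(others)
    return labels


def rand_index(a, b):
    """Fraction of pairs on which two partitions agree (same vs. different)."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)
```

One caveat when designing the experiment: with two clusters, flipping every label reproduces the original partition under a different naming, so corruption rates above one half are not "more wrong" than rates below it.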

Figures

Figures reproduced from arXiv: 2605.22110 by Anirvan Chakraborty, Shyamal K. De, Sourav Chakrabarty.

Figure 1: Plots of Satellite and BME datasets showing the different clusters
Figure 2: Plots of Berkeley and Medfly datasets showing the different clusters
Figure 3: Plots of Meatspectrum and Flours datasets (manually fragmented)
Original abstract

We propose a computationally simple framework for clustering functional data based on Gaussian-process-generated random projections. In this approach, each curve is first projected onto a large collection of independent Gaussian process realizations. The resulting high-dimensional representations are clustered using the Mean Absolute Difference of Distances (MADD), a dissimilarity measure well suited for high-dimensional settings. A population-level analysis of this dissimilarity provides insight into how random projections help capture distributional differences between functional populations. We introduce a second stage of clustering to additionally leverage data-driven projection directions. Thus, in Stage I, an initial clustering is obtained using a set of prespecified projection families. In Stage II, this partition is refined by constructing Gaussian random projections based on an estimated covariance operator that uses the first stage of cluster labels. Finally, a normalized cost function is used to select the optimal clustering among candidate solutions. The proposed clustering algorithm is broadly applicable to diverse functional data regimes including irregular and partially observed data. Through extensive simulations and real-data applications, we show that the proposed method achieves a high degree of accuracy and outperforms many of the state-of-the-art methods across a wide range of functional data settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a two-stage clustering framework for functional data. Stage I projects each curve onto realizations from prespecified Gaussian process families and applies the Mean Absolute Difference of Distances (MADD) dissimilarity for initial clustering. Stage II refines the partition by estimating a covariance operator from the Stage I labels to produce data-driven projections. A normalized cost function selects the final clustering. The method is presented as applicable to irregular and partially observed data, with claims of high accuracy and outperformance of state-of-the-art methods based on extensive simulations and real-data examples. A population-level analysis of MADD is used to justify the projection approach.

Significance. If the two-stage refinement reliably improves upon Stage I without excessive sensitivity to label errors, the framework would offer a practical, computationally simple tool for functional data clustering that handles missingness and high dimensionality better than many existing approaches. The explicit use of both fixed and adaptive random projections, together with the MADD analysis, provides a clear algorithmic contribution that could be adopted in applied settings where functional observations are incomplete.

major comments (2)
  1. [Stage II] Section on Stage II and population-level analysis of MADD: The central accuracy claim rests on Stage II improving separation by using labels from Stage I to estimate the covariance operator. However, no sensitivity analysis, breakdown-point bounds, or simulations with controlled Stage I error rates are provided to show when this refinement yields gains versus degradation. This is load-bearing because Stage II is built directly on the Stage I labels and the population analysis assumes sufficiently accurate initial labels.
  2. [Simulations and real-data applications] Simulations and real-data sections: The abstract and description assert outperformance across a wide range of settings, yet the visible material lacks explicit reporting of baseline implementations, number of Monte Carlo replications, error-bar statistics, or exact parameter choices for the competing methods; without these, the quantitative support for the superiority claim cannot be fully evaluated.
minor comments (2)
  1. [Methods] The notation and definition of the normalized cost function used for final selection could be stated more explicitly, including its dependence on the projection dimension and number of clusters.
  2. [Figures] Figure captions for the simulation results should include the exact functional forms and noise levels used in each scenario to improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the detailed and constructive feedback on our paper. The comments highlight important aspects that will enhance the presentation and validation of our two-stage clustering method. We respond to each major comment below and outline the planned revisions.

  1. Referee: [Stage II] The central accuracy claim rests on Stage II improving on the Stage I partition, yet no sensitivity analysis, breakdown-point bounds, or simulations with controlled Stage I error rates show when the refinement yields gains versus degradation.

    Authors: We agree that demonstrating the robustness of Stage II to potential errors in the initial clustering is crucial for the method's reliability. While the population-level analysis of MADD assumes sufficiently accurate labels to justify the projection approach, we will strengthen the manuscript by adding a sensitivity analysis. This will include simulations with controlled misclassification rates in Stage I labels and an examination of when the refinement step improves or degrades performance. We will also include a brief discussion of the breakdown behavior in the revised text. revision: yes

  2. Referee: [Simulations and real-data applications] The outperformance claim cannot be fully evaluated because the visible material does not report baseline implementations, the number of Monte Carlo replications, error-bar statistics, or exact parameter choices for the competing methods.

    Authors: We acknowledge that these details are essential for reproducibility and evaluation. In the revised manuscript, we will explicitly report the number of Monte Carlo replications performed, include error bars or standard error statistics in the simulation results, provide descriptions or references for the baseline implementations, and specify the exact parameter choices and settings used for the competing methods. revision: yes
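The reporting promised in the second response amounts to summarizing a per-replication score by its mean and standard error. A minimal sketch of that summary follows; the function name and the example scores are hypothetical and do not come from the paper.

```python
import math
import statistics


def summarize(scores):
    """Mean and standard error of a per-replication accuracy score."""
    mean = statistics.fmean(scores)
    se = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, se
```

Reporting the standard error alongside the number of replications lets a reader judge whether a gap between two methods exceeds Monte Carlo noise.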

Circularity Check

0 steps flagged

No circularity: explicit two-stage algorithm with external empirical validation

The paper presents an algorithmic procedure rather than a mathematical derivation of a target quantity. Stage I applies prespecified GP projections and MADD dissimilarity to obtain initial labels; Stage II then estimates the covariance operator from those labels to generate data-driven projections. This dependence is an explicit sequential design choice, not a self-definitional loop or a fitted parameter renamed as a prediction. Population-level analysis of MADD is invoked to motivate why projections capture distributional differences, but the analysis is presented as interpretive support rather than a load-bearing uniqueness theorem. All performance claims rest on simulations and real-data applications that serve as external benchmarks, not on internal reduction to the method's own inputs. No self-citation chains, ansatz smuggling, or renaming of known results appear in the provided description.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on standard functional data analysis assumptions plus the suitability of Gaussian processes for generating useful random projections; no new free parameters or invented entities are introduced beyond the algorithmic choices.

axioms (2)
  • domain assumption Gaussian process realizations provide sufficiently rich random directions to capture distributional differences between functional populations.
    Invoked in the description of Stage I projections and the population-level analysis of MADD.
  • domain assumption Initial cluster labels from prespecified projections are accurate enough to yield a useful covariance estimate for Stage II refinement.
    Central to the justification of the two-stage procedure.

pith-pipeline@v0.9.0 · 5733 in / 1458 out tokens · 45522 ms · 2026-05-22T04:09:30.820970+00:00 · methodology
