Two-stage Ensemble Clustering of Functional Data Using Random Projections
Pith reviewed 2026-05-22 04:09 UTC · model grok-4.3
The pith
A two-stage method using random projections clusters functional data with higher accuracy than many state-of-the-art approaches in the authors' simulations and real-data applications.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that an ensemble of random projections drawn from Gaussian processes captures distributional differences between functional populations, as supported by a population-level analysis, and that a second stage that refines these projections using the first-stage labels improves clustering performance across a range of settings.
What carries the argument
The argument rests on a two-stage clustering procedure: prespecified Gaussian random projections and the MADD dissimilarity produce an initial grouping, and covariance-driven projections then refine it.
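The first stage can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes curves observed on a common uniform grid, a squared-exponential kernel for the Gaussian process family, and the MADD form of Sarkar and Ghosh built on a scaled Euclidean distance. The function names `gp_directions`, `project`, and `madd` and the `length_scale` parameter are hypothetical.

```python
import numpy as np

def gp_directions(grid, n_proj, length_scale=0.2, rng=None):
    """Draw n_proj zero-mean GP sample paths on `grid`.

    Assumption: squared-exponential kernel; the paper's projection
    families are not specified in this summary.
    """
    rng = np.random.default_rng(rng)
    diff = grid[:, None] - grid[None, :]
    cov = np.exp(-0.5 * (diff / length_scale) ** 2)
    cov += 1e-8 * np.eye(len(grid))  # jitter for numerical stability
    return rng.multivariate_normal(np.zeros(len(grid)), cov, size=n_proj)

def project(curves, directions, grid):
    """Approximate the inner products <x, g> by a Riemann sum on a uniform grid."""
    dt = grid[1] - grid[0]
    return curves @ directions.T * dt

def madd(features):
    """MADD matrix: for each pair (i, j), the mean over k not in {i, j}
    of |d(i, k) - d(j, k)|, with d a scaled Euclidean distance."""
    n, p = features.shape
    sq = np.sum(features ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * features @ features.T
    d = np.sqrt(np.maximum(d2, 0.0) / p)
    np.fill_diagonal(d, 0.0)
    out = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            gap = np.abs(d[i] - d[j])
            total = gap.sum() - gap[i] - gap[j]  # drop the k = i and k = j terms
            out[i, j] = out[j, i] = total / (n - 2)
    return out
```

The resulting matrix can be passed to any clustering routine that accepts a precomputed dissimilarity; which routine the paper uses is not stated in this summary.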
If this is right
- The method applies to irregular and partially observed functional data without special adjustments.
- Extensive simulations demonstrate superior accuracy compared to many existing clustering techniques.
- Real-data applications confirm the method's practical effectiveness.
- Population-level analysis explains why the random projections distinguish distributional differences.
Where Pith is reading between the lines
- If the first stage provides reasonable starting labels, the second stage can significantly improve separation by focusing projections on data-specific structures.
- This framework might generalize to clustering other high-dimensional objects by adapting the projection families.
- Selecting the optimal clustering via the normalized cost function could be tested on larger datasets to assess scalability.
Load-bearing premise
The first stage clustering with fixed projection families produces initial labels accurate enough that the covariance estimated from them improves the separation achieved by the second stage projections.
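A rough sketch of why this premise matters, under stated assumptions: curves sit on a common grid, the covariance operator is discretized as a matrix, and the estimate is a pooled within-cluster covariance. The paper's actual estimator may differ, and `pooled_within_covariance` and `data_driven_directions` are hypothetical names.

```python
import numpy as np

def pooled_within_covariance(curves, labels):
    """Pooled within-cluster covariance on the grid, given Stage I labels.

    Assumption: each cluster is centered by its own mean curve before pooling.
    """
    curves = np.asarray(curves, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    residuals = np.empty_like(curves)
    for c in clusters:
        mask = labels == c
        residuals[mask] = curves[mask] - curves[mask].mean(axis=0)
    dof = max(len(curves) - len(clusters), 1)
    return residuals.T @ residuals / dof

def data_driven_directions(cov, n_proj, rng=None):
    """Draw zero-mean Gaussian directions whose covariance is `cov`.

    Sampling through the eigen-decomposition tolerates rank-deficient estimates.
    """
    rng = np.random.default_rng(rng)
    vals, vecs = np.linalg.eigh(cov)
    vals = np.clip(vals, 0.0, None)  # remove tiny negative eigenvalues
    z = rng.standard_normal((n_proj, len(vals)))
    return (z * np.sqrt(vals)) @ vecs.T
```

With correct labels the estimate reflects only within-cluster variation. With wrong labels, differences between cluster means leak into the estimate and change the directions that Stage II samples, which is the dependence the premise describes.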
What would settle it
A simulation study in which the first-stage clusters are forced to be random or incorrect, followed by checking whether the second stage still yields better final clustering than a single-stage approach.
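The corruption step of such a study could look like the sketch below. It assumes labels coded as integers 0 to K-1 with K at least 2 and uses the Rand index as one common agreement measure; `corrupt_labels` and `rand_index` are hypothetical helper names, and the Stage II routine itself is left out.

```python
import numpy as np

def corrupt_labels(labels, error_rate, n_clusters, rng=None):
    """Move a fixed fraction of labels to a different, randomly chosen cluster.

    Assumption: labels are integers in 0..n_clusters-1 and n_clusters >= 2.
    """
    rng = np.random.default_rng(rng)
    labels = np.asarray(labels).copy()
    n_flip = int(round(error_rate * len(labels)))
    idx = rng.choice(len(labels), size=n_flip, replace=False)
    shift = rng.integers(1, n_clusters, size=n_flip)  # 1..K-1, so the label always changes
    labels[idx] = (labels[idx] + shift) % n_clusters
    return labels

def rand_index(a, b):
    """Fraction of point pairs on which two partitions agree (same or different cluster)."""
    a, b = np.asarray(a), np.asarray(b)
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    upper = np.triu_indices(len(a), k=1)
    return float(np.mean(same_a[upper] == same_b[upper]))
```

Feeding `corrupt_labels(true_labels, rate, K)` into the second stage for a grid of rates, and comparing the Rand index of the refined partition with that of the corrupted input, would show where refinement stops helping.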
Original abstract
We propose a computationally simple framework for clustering functional data based on Gaussian-process-generated random projections. In this approach, each curve is first projected onto a large collection of independent Gaussian process realizations. The resulting high-dimensional representations are clustered using the Mean Absolute Difference of Distances (MADD), a dissimilarity measure well suited for high-dimensional settings. A population-level analysis of this dissimilarity provides insight into how random projections help capture distributional differences between functional populations. We introduce a second stage of clustering to additionally leverage on data-driven projection directions. Thus, in Stage I, an initial clustering is obtained using a set of prespecified projection families. In Stage II, this partition is refined by constructing Gaussian random projections based on an estimated covariance operator that uses the first stage of cluster labels. Finally, a normalized cost function is used to select the optimal clustering among candidate solutions. The proposed clustering algorithm is broadly applicable to diverse functional data regimes including irregular and partially observed data. Through extensive simulations and real-data applications, we show that the proposed method achieves a high degree of accuracy and outperforms many of the state-of-the-art methods across a wide range of functional data settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a two-stage clustering framework for functional data. Stage I projects each curve onto realizations from prespecified Gaussian process families and applies the Mean Absolute Difference of Distances (MADD) dissimilarity for initial clustering. Stage II refines the partition by estimating a covariance operator from the Stage I labels to produce data-driven projections. A normalized cost function selects the final clustering. The method is presented as applicable to irregular and partially observed data, with claims of high accuracy and outperformance of state-of-the-art methods based on extensive simulations and real-data examples. A population-level analysis of MADD is used to justify the projection approach.
Significance. If the two-stage refinement reliably improves upon Stage I without excessive sensitivity to label errors, the framework would offer a practical, computationally simple tool for functional data clustering that handles missingness and high dimensionality better than many existing approaches. The explicit use of both fixed and adaptive random projections, together with the MADD analysis, provides a clear algorithmic contribution that could be adopted in applied settings where functional observations are incomplete.
major comments (2)
- [Stage II] Section on Stage II and population-level analysis of MADD: The central accuracy claim rests on Stage II improving separation by using labels from Stage I to estimate the covariance operator. However, no sensitivity analysis, breakdown-point bounds, or simulations with controlled Stage I error rates are provided to show when this refinement yields gains versus degradation; this is load-bearing because the method is explicitly iterative and the population analysis assumes sufficiently accurate initial labels.
- [Simulations and real-data applications] Simulations and real-data sections: The abstract and description assert outperformance across a wide range of settings, yet the visible material lacks explicit reporting of baseline implementations, number of Monte Carlo replications, error-bar statistics, or exact parameter choices for the competing methods; without these, the quantitative support for the superiority claim cannot be fully evaluated.
minor comments (2)
- [Methods] The notation and definition of the normalized cost function used for final selection could be stated more explicitly, including its dependence on the projection dimension and number of clusters.
- [Figures] Figure captions for the simulation results should include the exact functional forms and noise levels used in each scenario to improve reproducibility.
Simulated Author's Rebuttal
We are grateful to the referee for the detailed and constructive feedback on our paper. The comments highlight important aspects that will enhance the presentation and validation of our two-stage clustering method. We respond to each major comment below and outline the planned revisions.
- Referee: [Stage II] Section on Stage II and population-level analysis of MADD: The central accuracy claim rests on Stage II improving separation by using labels from Stage I to estimate the covariance operator. However, no sensitivity analysis, breakdown-point bounds, or simulations with controlled Stage I error rates are provided to show when this refinement yields gains versus degradation; this is load-bearing because the method is explicitly iterative and the population analysis assumes sufficiently accurate initial labels.
  Authors: We agree that demonstrating the robustness of Stage II to potential errors in the initial clustering is crucial for the method's reliability. While the population-level analysis of MADD assumes sufficiently accurate labels to justify the projection approach, we will strengthen the manuscript by adding a sensitivity analysis. This will include simulations with controlled misclassification rates in Stage I labels and an examination of when the refinement step improves or degrades performance. We will also include a brief discussion of the breakdown behavior in the revised text.
  Revision: yes
- Referee: [Simulations and real-data applications] Simulations and real-data sections: The abstract and description assert outperformance across a wide range of settings, yet the visible material lacks explicit reporting of baseline implementations, number of Monte Carlo replications, error-bar statistics, or exact parameter choices for the competing methods; without these, the quantitative support for the superiority claim cannot be fully evaluated.
  Authors: We acknowledge that these details are essential for reproducibility and evaluation. In the revised manuscript, we will explicitly report the number of Monte Carlo replications performed, include error bars or standard error statistics in the simulation results, provide descriptions or references for the baseline implementations, and specify the exact parameter choices and settings used for the competing methods.
  Revision: yes
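As an illustration of the kind of reporting requested, the sketch below summarizes an accuracy score over Monte Carlo replications as a mean with its standard error. The helper name `summarize_replications` is hypothetical and the example scores are made up for illustration, not results from the paper.

```python
import numpy as np

def summarize_replications(scores):
    """Return (mean, standard error) of a score across Monte Carlo replications."""
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean()
    # Sample standard deviation (ddof=1) divided by sqrt of the replication count
    se = scores.std(ddof=1) / np.sqrt(len(scores))
    return float(mean), float(se)

# Illustrative values only
mean, se = summarize_replications([0.8, 0.9, 1.0])
print(f"{mean:.3f} +/- {se:.3f} over 3 replications")
```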
Circularity Check
No circularity: explicit two-stage algorithm with external empirical validation
The paper presents an algorithmic procedure rather than a mathematical derivation of a target quantity. Stage I applies prespecified GP projections and MADD dissimilarity to obtain initial labels; Stage II then estimates the covariance operator from those labels to generate data-driven projections. This dependence is an explicit iterative design choice, not a self-definitional loop or a fitted parameter renamed as a prediction. Population-level analysis of MADD is invoked to motivate why projections capture distributional differences, but the analysis is presented as interpretive support rather than a load-bearing uniqueness theorem. All performance claims rest on simulations and real-data applications that serve as external benchmarks, not on internal reduction to the method's own inputs. No self-citation chains, ansatz smuggling, or renaming of known results appear in the provided description.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Gaussian process realizations provide sufficiently rich random directions to capture distributional differences between functional populations.
- domain assumption: Initial cluster labels from prespecified projections are accurate enough to yield a useful covariance estimate for Stage II refinement.