A New Framework for Convex Clustering in Kernel Spaces: Finite Sample Bounds, Consistency and Performance Insights

Debolina Paul; Kushal Bose; Saptarshi Chakraborty; Shubhayan Pan; Swagatam Das

arxiv: 2511.05159 · v2 · pith:ZLFRJKGXnew · submitted 2025-11-07 · 📊 stat.ML · cs.LG

A New Framework for Convex Clustering in Kernel Spaces: Finite Sample Bounds, Consistency and Performance Insights

Shubhayan Pan , Kushal Bose , Debolina Paul , Saptarshi Chakraborty , Swagatam Das This is my paper

Pith reviewed 2026-05-18 00:29 UTC · model grok-4.3

classification 📊 stat.ML cs.LG

keywords convex clusteringkernel methodsRKHSfinite sample boundsclustering consistencynon-convex datamachine learning

0 comments

The pith

Mapping data to a kernel Hilbert space lets convex clustering succeed on non-convex shapes while supplying finite-sample bounds and convergence guarantees.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a kernelized extension of convex clustering by projecting each data point into a reproducing kernel Hilbert space through a feature map. This lift turns non-linear and non-convex cluster structures into ones where the standard centroid-merging process can operate directly. The authors prove that the resulting algorithm converges and derive finite-sample bounds on the cluster estimates. A sympathetic reader would care because many practical datasets, such as images or sensor readings, display precisely the shapes that defeat ordinary convex clustering.

Core claim

By embedding the data in an RKHS via a feature map, convex clustering can be carried out in the transformed space to produce an embedding in finite dimensions; the procedure converges algorithmically and admits finite-sample performance bounds, with experiments indicating better results than existing clustering methods on both synthetic and real data.

What carries the argument

The feature map that projects points into a reproducing kernel Hilbert space, allowing convex clustering operations to remain valid on originally non-convex data.

If this is right

No pre-specified number of clusters is required for the procedure to run.
Finite-sample bounds apply directly to the quality of the recovered centroids and assignments.
The algorithm is guaranteed to converge for any fixed kernel and regularization parameter.
Empirical results show gains over state-of-the-art methods on non-linear synthetic and real datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same lifting idea could be applied to other centroid-style methods that currently struggle with non-convex data.
Automatic kernel selection or multiple-kernel combinations might reduce the burden of manual kernel choice in practice.
The finite-dimensional embedding produced by the method could serve as a preprocessing step for downstream supervised tasks.

Load-bearing premise

There exists a kernel whose feature map turns the original non-convex data into a space where convex clustering produces meaningful clusters without excessive distortion.

What would settle it

On a benchmark dataset with known non-convex clusters, if the kernelized method recovers partitions no better than ordinary convex clustering or k-means, the claimed advantage would be refuted.

Figures

Figures reproduced from arXiv: 2511.05159 by Debolina Paul, Kushal Bose, Saptarshi Chakraborty, Shubhayan Pan, Swagatam Das.

**Figure 2.** Figure 2: Scatter plots of the synthetic dataset for (a) ground truth labels, (b) KCC, (c) convex clustering, [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

**Figure 3.** Figure 3: The impact on NMI with varying numbers of clusters is presented. Our method KCC performs consistently compared to other methods. On each of the datasets, we apply KCC and different clustering methods, and compare the results using the NMI score. We graphically summarise the effect of increasing the number of clusters on the NMI score in [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗

**Figure 4.** Figure 4: Elbow plot of Lymphoma dataset. The study reveals that the optimal number of clusters is 7. Though the data contains 9 clusters but some of them contain a very small number of points, and KCC merges them. Kernel Convex Clustering was then applied on the datasets. Using agglomerative clustering and an elbow plot, we get that the optimal number of clusters in this case is 7, as shown in [PITH_FULL_IMAGE:fig… view at source ↗

**Figure 5.** Figure 5: t-SNE plots of Lymphoma dataset for (a) ground truth labels, (b) KCC, (c) spectral clustering, [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗

**Figure 6.** Figure 6: Elbowplot of synthetic dataset with σ1 = 1, σ2 = 100, γ = 1, ρ = 0.001. This set of values gives the optimal clustering with NMI of 1. (a) Variation of number of clusters with changing σ1 keeping others fixed (b) Variation of number of clusters with changing σ2 keeping others fixed (c) Variation of number of clusters with changing γ keeping others fixed (d) Variation of number of clusters with changing ρ k… view at source ↗

**Figure 7.** Figure 7: Variation of the number of clusters with each individual hyperparameter fixing others. For checking the dependence with respect to a hyperparameter, various triplets corresponding to the remaining hyperparameters were chosen; then for each triplet, the main hyperparameter was varied over a long range of values, of which we have illustrated just a few. The total number of clusters remains 5 across all four … view at source ↗

read the original abstract

Convex clustering is a well-regarded clustering method, resembling the similar centroid-based approach of Lloyd's $k$-means, without requiring a predefined cluster count. It starts with each data point as its centroid and iteratively merges them. Despite its advantages, this method can fail when dealing with data exhibiting linearly non-separable or non-convex structures. To mitigate the limitations, we propose a kernelized extension of the convex clustering method. This approach projects the data points into a Reproducing Kernel Hilbert Space (RKHS) using a feature map, enabling convex clustering in this transformed space. This kernelization not only allows for better handling of complex data distributions but also produces an embedding in a finite-dimensional vector space. We provide a comprehensive theoretical underpinning for our kernelized approach, proving algorithmic convergence and establishing finite sample bounds for our estimates. The effectiveness of our method is demonstrated through extensive experiments on both synthetic and real-world datasets, showing superior performance compared to state-of-the-art clustering techniques. This work marks a significant advancement in the field, offering an effective solution for clustering in non-linear and non-convex data scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper kernelizes convex clustering with finite-sample bounds but the finite-dimensional embedding claim clashes with standard infinite-dimensional RKHS kernels.

read the letter

The main thing here is that this work extends convex clustering into a kernel space to handle non-linear and non-convex data, then supplies convergence proofs and finite-sample bounds while reporting better empirical results than some baselines. That direction makes sense because ordinary convex clustering struggles once the data stops being linearly separable. They project points via a feature map into RKHS, run the merging process there, and claim the whole thing lands in a finite-dimensional vector space. The experiments on synthetic and real datasets are presented as showing clear gains, which is the practical hook. The theory is the part they emphasize most. The soft spot is exactly the dimensionality step. Most kernels that actually separate non-convex clusters, such as Gaussian, produce infinite-dimensional RKHS. If the bounds are written directly in that space without truncation, random-feature error terms, or explicit finite-rank restrictions, they cannot hold uniformly. The abstract states the embedding is finite-dimensional, so either they restrict to a special kernel family or the proofs skip the approximation gap. Either way, that needs checking in the full derivations. The experimental section would also benefit from clearer reporting on kernel parameters, how the infinite case is handled in code, and whether error bars or multiple runs are included. This paper is aimed at researchers who work on convex optimization approaches to clustering and want some theory attached. Someone already familiar with kernel methods and finite-sample analysis in statistical ML would get the most out of it. The core idea is straightforward enough that it deserves a serious referee to verify the bounds and see whether the finite-dimensional claim can be made rigorous or needs to be reframed as an approximation result.

Referee Report

2 major / 2 minor

Summary. The paper proposes a kernelized extension of convex clustering by mapping data points into a Reproducing Kernel Hilbert Space (RKHS) via a feature map. This is intended to handle linearly non-separable and non-convex data structures. The authors claim that the kernelization produces a finite-dimensional embedding, prove algorithmic convergence together with finite-sample bounds and consistency, and report superior empirical performance versus state-of-the-art clustering methods on synthetic and real-world datasets.

Significance. A correctly derived set of finite-sample bounds and consistency results for kernelized convex clustering would constitute a useful theoretical contribution, extending a convex-optimization-based method to nonlinear settings while retaining interpretability. The empirical claims, if supported by reproducible controls and error bars, would indicate practical value; however, the significance is currently limited by the unresolved tension between the finite-dimensional claim and standard infinite-dimensional RKHS constructions.

major comments (2)

[Abstract] Abstract: the assertion that kernelization 'produces an embedding in a finite-dimensional vector space' is not generally true for nonlinear kernels (e.g., Gaussian) whose RKHS is infinite-dimensional. If the convergence and finite-sample bound proofs (presumably in the theoretical sections) operate directly in this space without explicit finite-rank restriction, random-feature approximation, or truncation-error terms, the bounds cannot hold uniformly and the central theoretical claim is undermined.
[Theoretical development] Theoretical development (likely §3–4): the finite-sample bounds and consistency statements must explicitly account for the approximation error incurred when moving from the (possibly infinite) feature map to the finite-dimensional setting used in both the analysis and the reported experiments. Without such terms the bounds are not valid for the kernels that enable non-convex separation.

minor comments (2)

[Experiments] Experiments section: include error bars, explicit data-exclusion rules, and the precise kernel and bandwidth choices used; without these the superiority claims cannot be fully assessed.
[Notation] Notation and formulation: clarify how the convex clustering objective is rewritten in the RKHS and whether the feature map is assumed to be explicitly computable or only accessed via the kernel trick.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful and constructive review. The comments highlight an important precision issue in our presentation of the kernelized framework. We address each major comment below and describe the revisions we will make.

read point-by-point responses

Referee: [Abstract] Abstract: the assertion that kernelization 'produces an embedding in a finite-dimensional vector space' is not generally true for nonlinear kernels (e.g., Gaussian) whose RKHS is infinite-dimensional. If the convergence and finite-sample bound proofs (presumably in the theoretical sections) operate directly in this space without explicit finite-rank restriction, random-feature approximation, or truncation-error terms, the bounds cannot hold uniformly and the central theoretical claim is undermined.

Authors: We agree that the abstract statement is imprecise for kernels whose RKHS is infinite-dimensional. Our theoretical analysis and finite-sample bounds are derived under an explicit finite-dimensional feature map (corresponding to finite-degree polynomial kernels or other finite-rank constructions). For infinite-dimensional kernels the stated bounds would indeed require additional approximation-error terms. We will revise the abstract to remove the unqualified claim of a finite-dimensional embedding and instead state that the method operates in a finite-dimensional RKHS or via finite-dimensional approximations. revision: yes
Referee: [Theoretical development] Theoretical development (likely §3–4): the finite-sample bounds and consistency statements must explicitly account for the approximation error incurred when moving from the (possibly infinite) feature map to the finite-dimensional setting used in both the analysis and the reported experiments. Without such terms the bounds are not valid for the kernels that enable non-convex separation.

Authors: We concur that a fully general treatment for arbitrary infinite-dimensional RKHS would necessitate explicit approximation-error terms. The current proofs are developed directly inside a finite-dimensional feature space, where no truncation or random-feature error arises. To address the referee’s concern we will add a dedicated subsection (in §4) that (i) states the finite-dimensional assumption clearly, (ii) sketches how the bounds extend to infinite-dimensional kernels via random Fourier features or Nyström approximation, and (iii) supplies the corresponding additive error terms. These additions will be included in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected; derivation remains self-contained.

full rationale

The paper starts from the standard convex clustering objective, introduces a kernel feature map to RKHS, and states that this yields a finite-dimensional embedding along with new proofs of convergence and finite-sample bounds. These theoretical results are presented as derived from the kernelized formulation rather than presupposed by it. No equation reduces to a self-definition of its own output, no fitted parameter is relabeled as a prediction, and no load-bearing premise rests solely on a self-citation chain. The finite-dimensional claim is asserted as a direct consequence of the chosen feature map rather than derived circularly from the bounds. The overall chain is therefore independent of its target results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides insufficient detail to enumerate specific free parameters, axioms, or invented entities; kernel choice and regularization parameters are likely present but unstated here.

pith-pipeline@v0.9.0 · 5515 in / 1025 out tokens · 29331 ms · 2026-05-18T00:29:40.958363+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

This observation turns out to be helpful, as we can just substitute ϕ⊤αi for every ui ... Cholesky decomposition of the kernel matrix K = Z⊤Z aids us in reducing KCC to the well-known convex clustering problem.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

5 extracted references · 5 canonical work pages

[1]

, author Boyd, S

ISSN 2167-3888. doi: 10.1561/2400000003. URLhttp://dx.doi.org/10.1561/2400000003. Saptarshi Chakraborty and Jason Xu. Biconvex clustering.Journal of Computational and Graphical Statistics, 32(4):1524–1536, 2023. doi: 10.1080/10618600.2023.2197474. URLhttps://doi.org/10.1080/10618600. 2023.2197474. Xiaohui Chen and Yun Yang. Hanson–wright inequality in hil...

work page doi:10.1561/2400000003 2023
[2]

−2 exp[−Cmin( σ4 log(n) L4∥Γ∥2 HS , σ2√ log(n) L2∥Γ∥2 op )]for some constantC. Proof. Recall the matrixD defined such that the row ofD corresponding to the(i, j)th pair is Dij = ei −e j, where ei is the canonical basis element ofRn whose ith entry is 1, and the remaining entries are all 0. Since e1 −e 2,e 2 −e 3, . . . ,en−1 −e n span the rows ofD and the...

work page
[3]

So, P[ϵ⊤atat ⊤ϵ≥(1 +δ 0)σ2∥at∥2]≤ 2 n 2 2 Letz 2 0 = maxt=1,...,( n 2)(1 +δ 0)σ2∥at∥2

It is easy to see thatδ0 >0. So, P[ϵ⊤atat ⊤ϵ≥(1 +δ 0)σ2∥at∥2]≤ 2 n 2 2 Letz 2 0 = maxt=1,...,( n 2)(1 +δ 0)σ2∥at∥2. Then, for anyt∈ {1, . . . , n 2 }, ϵ⊤atat ⊤ϵ≥z 2 0 ≥(1 +δ 0)σ2∥at∥2 and hence, P[ϵ⊤atat ⊤ϵ≥z 2 0]≤P[ϵ ⊤atat ⊤ϵ ≥(1 +δ 0)σ2∥at∥2]≤ 2 n 2 2 Also, by union bound P[ max t=1,...,( n 2) ∥at ⊤ϵ∥2 ≥z 2 0]≤ ( n 2)X t=1 P[∥at ⊤ϵ∥2 ≥z 2 0] ≤ 2n 2 Ifγ ...

work page
[4]

That is, γ ′ wmin 2 ( n 2)X t=1 ∥At∗(β− ˆβ)∥ ≤ γ ′ 2 X i<j wij∥Aij∗(β− ˆβ)∥ Triangle inequality can finally be employed to get the final result as mentioned in the main paper

−2 exp[−Cmin( σ4 logn L4∥Γ∥2 HS , σ2√logn L2∥Γ∥2 op )] We finally use the fact thatwmin < w ij for all pairsi, j to get thewij terms inside the summation. That is, γ ′ wmin 2 ( n 2)X t=1 ∥At∗(β− ˆβ)∥ ≤ γ ′ 2 X i<j wij∥Aij∗(β− ˆβ)∥ Triangle inequality can finally be employed to get the final result as mentioned in the main paper. C Sensitivity Analysis of ...

work page
[5]

This process is repeated for all three remaining hyperparameters

To check the variation with respect to a particular hyperparameter, sayσ1, we select various other triplets corresponding to( σ2, ρ, γ); now for each such triplet we varyσ1, get the centroids, construct the elbow plots and finally the number of clusters, which turns out to be 5. This process is repeated for all three remaining hyperparameters. The number ...

work page

[1] [1]

, author Boyd, S

ISSN 2167-3888. doi: 10.1561/2400000003. URLhttp://dx.doi.org/10.1561/2400000003. Saptarshi Chakraborty and Jason Xu. Biconvex clustering.Journal of Computational and Graphical Statistics, 32(4):1524–1536, 2023. doi: 10.1080/10618600.2023.2197474. URLhttps://doi.org/10.1080/10618600. 2023.2197474. Xiaohui Chen and Yun Yang. Hanson–wright inequality in hil...

work page doi:10.1561/2400000003 2023

[2] [2]

−2 exp[−Cmin( σ4 log(n) L4∥Γ∥2 HS , σ2√ log(n) L2∥Γ∥2 op )]for some constantC. Proof. Recall the matrixD defined such that the row ofD corresponding to the(i, j)th pair is Dij = ei −e j, where ei is the canonical basis element ofRn whose ith entry is 1, and the remaining entries are all 0. Since e1 −e 2,e 2 −e 3, . . . ,en−1 −e n span the rows ofD and the...

work page

[3] [3]

So, P[ϵ⊤atat ⊤ϵ≥(1 +δ 0)σ2∥at∥2]≤ 2 n 2 2 Letz 2 0 = maxt=1,...,( n 2)(1 +δ 0)σ2∥at∥2

It is easy to see thatδ0 >0. So, P[ϵ⊤atat ⊤ϵ≥(1 +δ 0)σ2∥at∥2]≤ 2 n 2 2 Letz 2 0 = maxt=1,...,( n 2)(1 +δ 0)σ2∥at∥2. Then, for anyt∈ {1, . . . , n 2 }, ϵ⊤atat ⊤ϵ≥z 2 0 ≥(1 +δ 0)σ2∥at∥2 and hence, P[ϵ⊤atat ⊤ϵ≥z 2 0]≤P[ϵ ⊤atat ⊤ϵ ≥(1 +δ 0)σ2∥at∥2]≤ 2 n 2 2 Also, by union bound P[ max t=1,...,( n 2) ∥at ⊤ϵ∥2 ≥z 2 0]≤ ( n 2)X t=1 P[∥at ⊤ϵ∥2 ≥z 2 0] ≤ 2n 2 Ifγ ...

work page

[4] [4]

That is, γ ′ wmin 2 ( n 2)X t=1 ∥At∗(β− ˆβ)∥ ≤ γ ′ 2 X i<j wij∥Aij∗(β− ˆβ)∥ Triangle inequality can finally be employed to get the final result as mentioned in the main paper

−2 exp[−Cmin( σ4 logn L4∥Γ∥2 HS , σ2√logn L2∥Γ∥2 op )] We finally use the fact thatwmin < w ij for all pairsi, j to get thewij terms inside the summation. That is, γ ′ wmin 2 ( n 2)X t=1 ∥At∗(β− ˆβ)∥ ≤ γ ′ 2 X i<j wij∥Aij∗(β− ˆβ)∥ Triangle inequality can finally be employed to get the final result as mentioned in the main paper. C Sensitivity Analysis of ...

work page

[5] [5]

This process is repeated for all three remaining hyperparameters

To check the variation with respect to a particular hyperparameter, sayσ1, we select various other triplets corresponding to( σ2, ρ, γ); now for each such triplet we varyσ1, get the centroids, construct the elbow plots and finally the number of clusters, which turns out to be 5. This process is repeated for all three remaining hyperparameters. The number ...

work page