pith. sign in

arxiv: 2401.11258 · v2 · submitted 2024-01-20 · 🪐 quant-ph · cs.ET

Adaptive Quantum Optimized Centroid Initialization

Pith reviewed 2026-05-24 04:45 UTC · model grok-4.3

classification 🪐 quant-ph cs.ET
keywords centroid initializationQUBOquantum annealingk-means clusteringV-measuremalware classificationsimulated annealingadaptive refinement
0
0 comments X

The pith

AQOCI formulates k-means centroid initialization as a QUBO problem and recovers real-valued solutions via iterative refinement from quantum or quantum-inspired solvers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AQOCI as an extension of QOCI that turns the choice of starting cluster centers into a binary optimization task solvable by quantum annealing or similar methods. An added refinement loop, drawing on Gauss-Seidel and Jacobi ideas, adjusts scale and offset after each solver call so that the binary answers map back to usable real coordinates. Tests on controlled synthetic data and the MOTIF malware set show the resulting initializations produce clusterings that match or exceed those from k-means++ when clusters overlap or when sample sizes stay small, with V-measure gains reaching 26 percent in the latter case. The work matters because poor starting points are a known source of slow convergence and local-minima traps in prototype-based clustering, so any reliable alternative initialization route could improve practical results without changing the downstream algorithm.

Core claim

AQOCI casts centroid initialization as a QUBO whose binary variables encode candidate center locations, solves the QUBO with TABU search, simulated annealing, or D-Wave HybridBQM, then applies an adaptive iterative refinement step to convert the binary output into real-valued coordinates; on the MOTIF dataset this yields V-measure scores competitive with or superior to k-means++ at smaller sample sizes, while on heavily overlapping synthetic clusters AQOCI-SimAnn also improves over k-means++ and the method scales to at least ten dimensions.

What carries the argument

The iterative refinement mechanism that repeatedly adjusts scaling and offset parameters to map binary QUBO solver outputs onto accurate real-valued centroid coordinates.

If this is right

  • On data with heavily overlapping clusters AQOCI can deliver higher V-measure than k-means++ initialization.
  • At reduced sample sizes on real classification tasks such as MOTIF, AQOCI initializations produce clusterings up to 26 percent better in V-measure than k-means++.
  • The approach remains effective at least up to ten dimensions with no observed degradation.
  • On well-separated clusters the binary encoding introduces a performance ceiling that k-means++ does not share.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same QUBO-plus-refinement pattern could be tested on other distance-based clustering objectives that also suffer from initialization sensitivity.
  • If higher-resolution binary encodings or hybrid continuous-binary solvers remove the observed plateau, the method could become competitive even on cleanly separated data.
  • Because the refinement loop is solver-agnostic, classical heuristic QUBO solvers alone might already deliver most of the reported gains on modest hardware.

Load-bearing premise

The iterative refinement step can extract accurate real-valued centroid positions from the binary solver outputs without the limited resolution of the binary encoding itself capping solution quality on the target data.

What would settle it

A controlled experiment that increases the bit resolution of the binary encoding on well-separated synthetic data and still observes the same performance plateau relative to k-means++ would show that the refinement step cannot overcome the encoding limit.

Figures

Figures reproduced from arXiv: 2401.11258 by Ajinkya Borle, Charles K. Nicholas, Nicholas R. Allgood.

Figure 1
Figure 1. Figure 1: Gaussian: Inertia [PITH_FULL_IMAGE:figures/full_fig_p008_1.png] view at source ↗
Figure 3
Figure 3. Figure 3: Gaussian: Silhouette [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 5
Figure 5. Figure 5: Gaussian: Completeness [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 7
Figure 7. Figure 7: MOTIF: Inertia [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗
Figure 9
Figure 9. Figure 9: MOTIF: Silhouette [PITH_FULL_IMAGE:figures/full_fig_p009_9.png] view at source ↗
Figure 11
Figure 11. Figure 11: MOTIF: Completeness [PITH_FULL_IMAGE:figures/full_fig_p009_11.png] view at source ↗
read the original abstract

Prototype-based clustering algorithms such as k-means are sensitive to the selection of initial cluster centroids, with poor initialization leading to slower convergence and suboptimal solutions trapped in local minima. We present Adaptive Quantum Optimized Centroid Initialization (AQOCI), a method that formulates the centroid initialization problem as a Quadratic Unconstrained Binary Optimization (QUBO) problem and solves it using quantum annealing or quantum-inspired solvers. AQOCI extends a prior method (QOCI) by introducing an iterative refinement mechanism inspired by the Gauss-Seidel and Jacobi methods, enabling the recovery of real-valued centroid coordinates from binary solver outputs through adaptive scaling and offset adjustments. We evaluate AQOCI using three solver backends: TABU search, simulated annealing, and D-Wave's HybridBQM on synthetic Gaussian data with controlled sweeps over cluster separation, cluster count, dimensionality, and sample size, as well as on the MOTIF malware classification dataset, comparing against standard k-means with random initialization and k-means++ initialization. On the MOTIF dataset, AQOCI produces clusterings that are competitive with and, at smaller sample sizes, superior to k-means++, with V-measure improvements of up to 26\%. On synthetic data with heavily overlapping clusters, AQOCI--SimAnn outperforms k-means++ in V-measure. On well-separated synthetic data, k-means++ is clearly superior, and AQOCI exhibits a consistent performance plateau attributable to the binary encoding resolution. The dimensionality sweep demonstrates scalability to at least $d = 10$ without degradation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces Adaptive Quantum Optimized Centroid Initialization (AQOCI), an extension of prior QOCI work that formulates k-means centroid initialization as a QUBO problem solved by quantum annealing or quantum-inspired solvers (TABU, simulated annealing, HybridBQM). It adds an iterative refinement step inspired by Gauss-Seidel and Jacobi methods that recovers real-valued centroid coordinates from binary outputs via adaptive scaling and offset adjustments. The method is evaluated on synthetic Gaussian data with sweeps over cluster separation, number of clusters, dimensionality, and sample size, plus the MOTIF malware dataset, with comparisons to random and k-means++ initializations; it reports competitiveness with k-means++ on MOTIF (up to 26% V-measure gain at small N) and outperformance on overlapping synthetic clusters, while noting a plateau on well-separated data due to binary encoding resolution.

Significance. If the empirical gains are shown to be robust, the work provides a concrete demonstration that quantum-inspired QUBO solvers combined with iterative refinement can yield useful initializations for prototype-based clustering in regimes with overlapping clusters or limited samples. The controlled sweeps over multiple parameters and use of independent classical baselines (random, k-means++) constitute a strength of the evaluation design.

major comments (3)
  1. [Abstract] Abstract: the reported V-measure improvements (including up to 26% on MOTIF) are given without error bars, confidence intervals, or any statistical significance tests, despite the stochastic nature of the solvers and the multiple random seeds implicit in the sweeps; this omission makes it impossible to determine whether the claimed superiority at small sample sizes is reproducible.
  2. [Abstract] Abstract: the performance plateau on well-separated synthetic Gaussians is explicitly attributed to binary encoding resolution, yet the manuscript supplies no quantitative assessment of how that same resolution limit affects the MOTIF results (whose cluster geometry is unknown), leaving open the possibility that the reported gains arise from solver heuristics or evaluation choices rather than the adaptive refinement mechanism.
  3. [Method] Method: the iterative refinement procedure (adaptive scaling and offset adjustments) is described only at a high level; without an explicit equation or pseudocode showing how the real-valued centroid coordinates are recovered from the binary QUBO solution at each iteration, it is difficult to verify that the procedure reliably overcomes the discretization error that produces the observed plateau.
minor comments (1)
  1. [Abstract] The abstract states that scalability is demonstrated to d=10 without degradation, but does not indicate whether this holds uniformly across all solver backends or only for selected ones.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and will revise the manuscript to incorporate clarifications and additional details where feasible.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the reported V-measure improvements (including up to 26% on MOTIF) are given without error bars, confidence intervals, or any statistical significance tests, despite the stochastic nature of the solvers and the multiple random seeds implicit in the sweeps; this omission makes it impossible to determine whether the claimed superiority at small sample sizes is reproducible.

    Authors: We agree that error bars and statistical tests are necessary given the stochastic solvers. In the revised version we will report means and standard deviations over multiple independent runs (with different random seeds) for all V-measure results, and include paired statistical significance tests comparing AQOCI variants to k-means++ on the MOTIF dataset at small sample sizes. revision: yes

  2. Referee: [Abstract] Abstract: the performance plateau on well-separated synthetic Gaussians is explicitly attributed to binary encoding resolution, yet the manuscript supplies no quantitative assessment of how that same resolution limit affects the MOTIF results (whose cluster geometry is unknown), leaving open the possibility that the reported gains arise from solver heuristics or evaluation choices rather than the adaptive refinement mechanism.

    Authors: The attribution of the plateau to binary resolution on synthetic data is based on controlled sweeps where separation is known. For MOTIF, unknown geometry prevents a direct quantitative isolation of the resolution effect. We will expand the discussion to note this limitation explicitly and to emphasize that the comparative advantage on overlapping synthetic clusters (where resolution is not the limiting factor) supports the contribution of the adaptive refinement step. revision: partial

  3. Referee: [Method] Method: the iterative refinement procedure (adaptive scaling and offset adjustments) is described only at a high level; without an explicit equation or pseudocode showing how the real-valued centroid coordinates are recovered from the binary QUBO solution at each iteration, it is difficult to verify that the procedure reliably overcomes the discretization error that produces the observed plateau.

    Authors: We accept that the current description is insufficient for reproducibility. The revised manuscript will include explicit equations for the adaptive scaling and offset computation at each iteration, together with pseudocode for the full refinement loop, demonstrating how the procedure iteratively reduces discretization error. revision: yes

standing simulated objections not resolved
  • A quantitative assessment of binary encoding resolution impact specifically on the MOTIF dataset cannot be provided because the true cluster geometry is unknown.

Circularity Check

0 steps flagged

No circularity in derivation chain; empirical claims rest on external baselines

full rationale

The paper formulates centroid initialization as a QUBO problem solved by external solvers (TABU, simulated annealing, D-Wave HybridBQM) and adds an iterative refinement step inspired by Gauss-Seidel/Jacobi methods. All load-bearing claims are empirical performance comparisons (V-measure on MOTIF and synthetic Gaussians) against independent classical baselines (random k-means and k-means++). No equation or result is shown to reduce by construction to a fitted parameter defined from the target metric, nor does any central premise rest on a self-citation chain that itself lacks independent verification. The abstract explicitly notes the binary-resolution performance plateau on well-separated data, confirming the evaluation is falsifiable against external benchmarks rather than self-referential.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Abstract-only review yields a partial ledger; the method rests on the domain assumption that centroid quality can be encoded as a QUBO objective and recovered via adaptive scaling, with no explicit free parameters or invented entities named.

free parameters (1)
  • adaptive scaling and offset parameters
    Parameters adjusted iteratively to map binary solver outputs to real-valued centroids; their functional form is not specified in the abstract.
axioms (1)
  • domain assumption The centroid initialization objective can be expressed as a QUBO without essential loss of information about cluster quality.
    This mapping is the prerequisite for using quantum or quantum-inspired solvers.

pith-pipeline@v0.9.0 · 5799 in / 1408 out tokens · 29396 ms · 2026-05-24T04:45:25.892657+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear
    ?
    unclear

    Relation between the paper passage and the cited Recognition theorem.

    formulates the centroid initialization problem as a Quadratic Unconstrained Binary Optimization (QUBO) problem... iterative refinement mechanism inspired by the Gauss-Seidel and Jacobi methods, enabling the recovery of real-valued centroid coordinates from binary solver outputs through adaptive scaling and offset adjustments

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · 2 internal anchors

  1. [1]

    Allgood, Ajinkya Borle & Charles K

    Nicholas R. Allgood, Ajinkya Borle & Charles K. Nicholas (2023): Quantum Optimized Centroid Initial- ization. In Kohei Arai, editor: Proceedings of the Future Technologies Conference (FTC) 2023, V olume 2, Springer Nature Switzerland, Cham, pp. 71–85

  2. [2]

    Christian Bauckhage (2015): k-Means Clustering Is Matrix Factorization, doi:10.48550/ARXIV .1512.07548

  3. [3]

    Analyzing the Quantum Annealing Approach for Solving Linear Least Squares Problems

    Ajinkya Borle & Samuel J. Lomonaco (2018): Analyzing the Quantum Annealing Approach for Solving Linear Least Squares Problems. arXiv 1809.07649

  4. [4]

    Cryptology ePrint Archive, Report 2021/620

    El ˙zbieta Burek, Michał Misztal & Michał Wro´nski (2021): Algebraic attacks on block ciphers using quantum annealing. Cryptology ePrint Archive, Report 2021/620. https://ia.cr/2021/620

  5. [5]

    Casaña-Eslava, Paulo J.G

    Raúl V . Casaña-Eslava, Paulo J.G. Lisboa, Sandra Ortega-Martorell, Ian H. Jarman & José D. Martín- Guerrero (2020): Probabilistic quantum clustering . Knowledge-Based Systems 194, p. 105567, doi:https://doi.org/10.1016/j.knosys.2020.105567. Available at https://www.sciencedirect.com/ science/article/pii/S0950705120300587

  6. [6]

    Humble & Shigetoshi Sota (2019):Quantum annealing for sys- tems of polynomial equations

    Chia Cheng Chang, Arjun Gambhir, Travis S. Humble & Shigetoshi Sota (2019):Quantum annealing for sys- tems of polynomial equations. Scientific Reports9(1), p. 10258, doi:10.1038/s41598-019-46729-0. Available at https://doi.org/10.1038/s41598-019-46729-0

  7. [7]

    Available at https://www.dwave.com

    D-Wave: D-wave. Available at https://www.dwave.com

  8. [8]

    Forgy (1965): Cluster analysis of multivariate data: efficiency versus interpretability of classifi- cations

    Edward W. Forgy (1965): Cluster analysis of multivariate data: efficiency versus interpretability of classifi- cations. Biometrics 21(3), pp. 768–769

  9. [9]

    Joyce, Dev Amlani, Charles Nicholas & Edward Raff (2023): MOTIF: A Large Malware Reference Dataset with Ground Truth Family Labels

    Robert J. Joyce, Dev Amlani, Charles Nicholas & Edward Raff (2023): MOTIF: A Large Malware Reference Dataset with Ground Truth Family Labels. Computers and Security 124, Issue C

  10. [10]

    Jain A. K. & Dubes R. C. (1988): Algorithms for Clustering Data. Prentice-Hall

  11. [11]

    Rousseeuw PJ Kaufman L (1987): Clustering by means of medoids. Proc. Statistical Data Analysis Based on the L1 Norm Conference, Neuchatel, 1987, pp. 405–416

  12. [12]

    Technical Report GT-CSE-08-01, Georgia Institute of Technology

    Jingu Kim & Haesun Park (2008): Sparse Nonnegative Matrix Factorization for Clustering. Technical Report GT-CSE-08-01, Georgia Institute of Technology. Available athttps://smartech.gatech.edu/handle/ 1853/20058. Allgood, Borle, and Nicholas 11

  13. [13]

    Daniel D. Lee & H. Sebastian Seung (2000): Algorithms for Non-Negative Matrix Factorization . In: Pro- ceedings of the 13th International Conference on Neural Information Processing Systems , NIPS’00, MIT Press, Cambridge, MA, USA, p. 535–541

  14. [14]

    IEEE Transactions on Information Theory 28(2), pp

    S Lloyd (1982): Least squares quantization in PCM . IEEE Transactions on Information Theory 28(2), pp. 129–137

  15. [15]

    J. B. MacQueen (1967): Some Methods for classification and Analysis of Multivariate Observations . In: Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability: 281-297, University of California Press

  16. [16]

    Nielsen & Isaac L

    Michael A. Nielsen & Isaac L. Chuang (2000): Quantum Computation and Quantum Information. Cambridge University Press

  17. [17]

    In: Annals of Operations Research

    Gintaras Palubeckis (2004): Multistart Tabu Search Strategies for the Unconstrained Binary Quadratic Op- timization Problem. In: Annals of Operations Research . Available at https://doi.org/10.1023/B: ANOR.0000039522.58036.68

  18. [18]

    Pollachini, Juan P

    Giovani G. Pollachini, Juan P. L. C. Salazar, Caio B. D. Goes, Thiago O. Maciel & Eduardo I. Duzzioni (2021): Hybrid classical-quantum approach to solve the heat equation using quantum annealers . Phys- ical Review A 104(3), doi:10.1103/physreva.104.032426. Available at https://doi.org/10.1103% 2Fphysreva.104.032426

  19. [19]

    Expert Systems with Applications 37(7), pp

    Jing Xiao, YuPing Yan, Jun Zhang & Yong Tang (2010): A quantum-inspired genetic al- gorithm for k-means clustering . Expert Systems with Applications 37(7), pp. 4966–4973, doi:https://doi.org/10.1016/j.eswa.2009.12.017. Available at https://www.sciencedirect.com/ science/article/pii/S095741740901063X