pith. sign in

arxiv: 2605.29165 · v1 · pith:W4JKLY44new · submitted 2026-05-27 · 💻 cs.DS

An Improved Greedy Approximation for (Metric) k-Means

Pith reviewed 2026-06-29 08:59 UTC · model grok-4.3

classification 💻 cs.DS
keywords k-meansapproximation algorithmfacility locationmetric clusteringgreedy algorithmLMP approximationsquared distances
0
0 comments X

The pith

A polynomial-time algorithm achieves a 3 + 2√2 + ε approximation for metric k-Means.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a new approximation algorithm for the k-Means clustering problem in general metric spaces by adapting the greedy LMP 2-approximation for facility location to squared distance costs. The resulting method yields a ratio of 3 + 2√2 + ε which is less than 5.83, improving on the prior 9 + ε bound for general metrics and the 5.92 bound for Euclidean instances. The adaptation is then combined with an existing framework for the related k-Median problem. A sympathetic reader cares because k-Means is a core task in data analysis, and tighter approximation guarantees translate directly into more reliable performance bounds for clustering large datasets.

Core claim

The paper shows that the LMP 2-approximation greedy algorithm for facility location admits an adaptation and analysis when connection costs are squared distances. This new LMP approximation is combined with the framework from the related k-Median paper to produce a polynomial-time 3 + 2√2 + ε < 5.83 approximation algorithm for k-Means in arbitrary metrics.

What carries the argument

Adapted LMP 2-approximation greedy algorithm for facility location with squared metric connection costs, which preserves the Lagrangian multiplier property.

Load-bearing premise

The LMP 2-approximation greedy algorithm for facility location admits an adaptation and analysis when connection costs are squared distances rather than linear metric distances.

What would settle it

A concrete set of points in a metric space on which the adapted algorithm returns a solution whose cost exceeds (3 + 2√2) times the optimal k-Means cost.

Figures

Figures reproduced from arXiv: 2605.29165 by Ernest van Wijland, Euiwoong Lee, Fabrizio Grandoni, Moses Charikar, Ruiquan Gao, Vincent Cohen-Addad.

Figure 1
Figure 1. Figure 1: Counterexample for the naïve variant of JMMSV algorithm. The example is on a line. Red circles represent the key clients, while blue squares represent the key facilities. The facility opening cost is f. Counterexample for the Naïve Extension. A naïve idea is to adapt the above algorithm to the case of squared distances by simply replacing d(·, ·) with d 2 (·, ·). It is instructive to see why this attempt f… view at source ↗
read the original abstract

Clustering is a basic task in data analysis and machine learning, and the optimization of clustering objectives are well-studied optimization problems; amongst these, the $k$-Means objective is arguably the most well known. Given a collection of points in a metric space, the goal is to partition them into $k$ clusters, each with an associated center, so as to minimize the sum of squared distances of points to their cluster centers. In this paper, we present a polynomial-time $3+2\sqrt{2}+\epsilon<5.83$-approximation algorithm for $k$-Means in general metrics. This substantially improves on the current-best $(9+\epsilon)$-approximation in [Ahmadian, Norouzi-Fard, Svensson, Ward - FOCS'17, SICOMP'20], and even slightly improves on the $5.92$-approximation in [Cohen-Addad, Esfandiari, Mirrokni, Narayanan - STOC'22] for the Euclidean special case. A natural approach for $k$-Means is to leverage Lagrangian Multiplier Preserving (LMP) approximations for the facility location problem. The previous best results for $k$-Means build upon an adaptation of an LMP $3$-approximation for facility location with metric connection costs in [Jain, Vazirani - J.ACM'01] based on a primal-dual method, rather than on the improved LMP greedy $2$-approximation for the same problem in [Jain, Mahdian, Markakis, Saberi, Vazirani - J.ACM'03]. The barrier to using the improved LMP algorithm was that no adaptation of this algorithm and its analysis to the case of squared metric connection costs was known (since squared distances violate triangle inequality). Our main contribution is overcoming this barrier by providing such an adaptation. This new LMP approximation algorithm is then combined with the framework recently introduced in [Cohen-Addad, Grandoni, Lee, Schwiegelshohn, Svensson - STOC'25] for the related (metric) $k$-Median problem.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper claims a polynomial-time (3 + 2√2 + ε)-approximation for metric k-Means by adapting the LMP 2-approximation greedy algorithm of Jain et al. (JACM 2003) to squared connection costs (which violate the triangle inequality) while preserving the LMP property, then plugging the result into the STOC'25 k-Median framework; this yields the stated ratio and improves on the prior 9+ε bound for general metrics.

Significance. If the adaptation holds, the result advances the approximation ratio for a core clustering problem from 9+ε down to <5.83 in general metrics (and slightly below the prior 5.92 Euclidean bound). The explicit handling of squared distances in the greedy LMP analysis is a concrete technical step that removes a known barrier; the manuscript supplies the adaptation as its central contribution rather than relying on self-referential or fitted parameters.

minor comments (2)
  1. [§1] §1 (Introduction): the statement that the new LMP algorithm 'is then combined with' the STOC'25 framework would benefit from an explicit sentence confirming that no additional loss factor is incurred beyond the ε term already stated.
  2. The notation for the squared-distance connection costs should be introduced once with a clear definition (e.g., c(i,j) = d(i,j)^2) before its first use in the algorithm description.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the manuscript and their recommendation to accept. We are pleased that the significance of adapting the greedy LMP 2-approximation to squared distances (while preserving the LMP property) and combining it with the STOC'25 framework is recognized as advancing the state of the art for metric k-Means.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper's derivation chain centers on a new adaptation of the LMP 2-approximation greedy algorithm (from Jain et al. 2003) to squared distances, explicitly presented as the main contribution overcoming a previously unknown barrier. This adapted LMP is then plugged into the independently cited STOC'25 framework for k-Median. No equation, parameter fit, or claim reduces by construction to its own inputs; the 3+2√2+ε bound is obtained from the supplied analysis rather than tautology or renaming. The overlapping-author citation to the framework is not load-bearing for the core adaptation step and does not create a self-referential loop.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper introduces no new free parameters, invented entities, or ad-hoc axioms; it relies on standard definitions of metric spaces, squared distances, and the existence of the cited LMP 2-approximation and framework.

axioms (2)
  • domain assumption Existence of the LMP 2-approximation greedy algorithm for metric facility location from Jain et al. (J.ACM 2003)
    Invoked as the base algorithm whose adaptation is the main contribution.
  • domain assumption The framework from Cohen-Addad et al. (STOC 2025) applies to the adapted LMP solution for k-means
    Used to obtain the final ratio from the LMP solution.

pith-pipeline@v0.9.1-grok · 5945 in / 1309 out tokens · 25535 ms · 2026-06-29T08:59:58.534528+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

4 extracted references · 2 canonical work pages

  1. [1]

    SIAM Journal on Computing , Year =

    [AGK+04] Vijay Arya, Naveen Garg, Rohit Khandekar, Adam Meyerson, Kamesh Munagala, and Vinayaka Pandit. “Local Search Heuristics for k-Median and Facility Location Problems”. In: SIAM J. Comput. 33.3 (2004), pp. 544–562. doi: 10.1137/S0097539702416402. url: https://doi.org/10.1137/S0097539702416402. [ANSW20] Sara Ahmadian, Ashkan Norouzi-Fard, Ola Svensso...

  2. [2]

    Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2019, 42:1– 42:14. doi: 10 . 4230 / LIPICS . ICALP . 2019

  3. [3]

    A (2+ ε)-Approximation Algorithm for Metric k-Median

    url: https : / / doi . org / 10 . 4230 / LIPIcs.ICALP.2019.42. [CGL+25] Vincent Cohen-Addad, Fabrizio Grandoni, Euiwoong Lee, Chris Schwiegelshohn, and Ola Svensson. “A (2+ ε)-Approximation Algorithm for Metric k-Median”. In: Proceedings of the 57th ACM Symposium on Theory of Computing, STOC 2025 . To appear

  4. [4]

    Breaching the 2 LMP Approximation Barrier for Facility Location with Applications to k-Median

    [CGLS23] Vincent Cohen-Addad, Fabrizio Grandoni, Euiwoong Lee, and Chris Schwiegelshohn. “Breaching the 2 LMP Approximation Barrier for Facility Location with Applications to k-Median”. In: Proceedings of the 2023 ACM-SIAM Symposium on Discrete Algo- rithms, SODA 2023, Florence, Italy, January 22-25, 2023 . SIAM, 2023, pp. 940–986. doi: 10.1137/1.97816119...