Deterministic Coreset for Lp Subspace

Anirban Dasgupta; Dan Feldman; Rachit Chhaya; Supratim Shit

arxiv: 2601.00361 · v3 · pith:UWYLA7WEnew · submitted 2026-01-01 · 💻 cs.DS · cs.LG

Deterministic Coreset for Lp Subspace

Rachit Chhaya , Anirban Dasgupta , Dan Feldman , Supratim Shit This is my paper

Pith reviewed 2026-05-21 16:45 UTC · model grok-4.3

classification 💻 cs.DS cs.LG

keywords coresetsubspace embeddingLp normdeterministic algorithmregressionapproximation algorithmiterative construction

0 comments

The pith

An iterative algorithm produces the first deterministic ε-coresets for ℓ_p subspace embeddings of any p, with optimal size O(d^{max{1,p/2}}/ε²) and no log factors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops an iterative procedure to pick a small weighted subset of rows from a tall matrix X so that the subset preserves the p-norm of Xq for every direction q up to a (1±ε) factor. This yields a deterministic guarantee for the subspace embedding property because each iteration keeps the loss on the maintained rows bounded above and below by scaled versions of the loss on the full data. The resulting coreset has size O(d^{max{1,p/2}}/ε²), which matches the known lower bound and removes the logarithmic overhead that had persisted as an open question. The construction applies uniformly for every p from 1 to infinity and runs in polynomial time. It also supplies a deterministic route to approximate solutions of ℓ_p regression.

Core claim

The authors introduce an iterative algorithm that builds an (ε, ℓ_p)-subspace embedding coreset X' as a weighted subset of rows of X. In each iteration the loss on the current subset is maintained between appropriate lower and upper scalings of the loss on the original matrix, which directly enforces the deterministic embedding inequality (1-ε)‖Xq‖_p^p ≤ ‖X'q‖_p^p ≤ (1+ε)‖Xq‖_p^p for every q. The procedure returns a coreset whose size is O(d^{max{1,p/2}}/ε²), works for any p in [1,∞), and is tight with the existing lower bound.

What carries the argument

An iterative row-selection loop that keeps the p-norm loss on the maintained subset between scaled versions of the loss on the full dataset, thereby enforcing the embedding property without randomness.

If this is right

The coreset supplies a deterministic approximate solver for the ℓ_p regression problem.
The size bound holds for every p in [1,∞) and every ε>0 and matches the lower bound.
The construction runs in time polynomial in n, d, and ε^{-1}.
No logarithmic factors appear in the coreset size.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The deterministic nature may allow the coreset to be safely composed with other non-randomized steps in larger computational pipelines.
The removal of log factors could improve practical running time when dimension d is large and randomness must be avoided.
The same iterative bounding idea might be adapted to dynamic or streaming models while preserving the size guarantee.

Load-bearing premise

In every iteration the algorithm can always locate rows to add or remove so that the p-norm loss on the current subset stays between appropriate multiplicative scalings of the loss on the entire original matrix.

What would settle it

A concrete matrix X together with a vector q for which the output coreset X' violates the (1±ε) bound on ‖X'q‖_p^p, or an instance whose smallest deterministic ε-coreset for some p must exceed size O(d^{max{1,p/2}}/ε²).

read the original abstract

We introduce the first iterative algorithm for constructing a $\varepsilon$-coreset that guarantees deterministic $\ell_p$ subspace embedding for any $p \in [1,\infty)$ and any $\varepsilon > 0$. For a given full rank matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$ where $n \gg d$, $\mathbf{X}' \in \mathbb{R}^{m \times d}$ is an $(\varepsilon,\ell_p)$-subspace embedding of $\mathbf{X}$, if for every $\mathbf{q} \in \mathbb{R}^d$, $(1-\varepsilon)\|\mathbf{Xq}\|_{p}^{p} \leq \|\mathbf{X'q}\|_{p}^{p} \leq (1+\varepsilon)\|\mathbf{Xq}\|_{p}^{p}$. Specifically, in this paper, $\mathbf{X}'$ is a weighted subset of rows of $\mathbf{X}$ which is commonly known in the literature as a coreset. In every iteration, the algorithm ensures that the loss on the maintained set is upper and lower bounded by the loss on the original dataset with appropriate scalings. So, unlike typical coreset guarantees, due to bounded loss, our coreset gives a deterministic guarantee for the $\ell_p$ subspace embedding. For an error parameter $\varepsilon$, our algorithm takes $O(\mathrm{poly}(n,d,\varepsilon^{-1}))$ time and returns a deterministic $\varepsilon$-coreset, for $\ell_p$ subspace embedding whose size is $O\left(\frac{d^{\max\{1,p/2\}}}{\varepsilon^{2}}\right)$. Here, we remove the $\log$ factors in the coreset size, which had been a long-standing open problem. Our coresets are optimal as they are tight with the lower bound. As an application, our coreset can also be used for approximately solving the $\ell_p$ regression problem in a deterministic manner.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript presents the first iterative algorithm for constructing an ε-coreset that guarantees deterministic ℓ_p subspace embedding for any p ∈ [1,∞) and ε > 0. For a full-rank matrix X ∈ ℝ^{n×d} with n ≫ d, the algorithm returns a weighted subset X' of rows with size O(d^{max{1,p/2}}/ε²) such that for every q ∈ ℝ^d, (1−ε)‖Xq‖_p^p ≤ ‖X'q‖_p^p ≤ (1+ε)‖Xq‖_p^p. The algorithm runs in O(poly(n,d,ε^{-1})) time and achieves this by ensuring per-iteration that the loss on the maintained set is upper and lower bounded by the loss on the original dataset with appropriate scalings. This removes log factors from the coreset size, matching the lower bound, and can be used for deterministic approximate ℓ_p regression.

Significance. If the claims hold, this work would represent a major advance in the field of coresets and subspace embeddings by resolving a long-standing open problem regarding the removal of logarithmic factors in deterministic ℓ_p subspace embeddings. The optimal size bound and deterministic guarantee without randomization would have broad implications for efficient algorithms in high-dimensional data processing and regression problems.

major comments (1)

[Abstract] The abstract states that the iterative algorithm ensures loss bounds leading to deterministic guarantees, but provides no details on the algorithm, pseudocode, convergence analysis, or derivation of the size bound O(d^{max{1,p/2}}/ε²). This makes it impossible to verify the central claim that the per-iteration bounding produces the stated deterministic and optimal-size coreset.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their careful review and for acknowledging the potential significance of our work in resolving the open problem of log-free deterministic ℓ_p subspace embeddings. We address the major comment below.

read point-by-point responses

Referee: [Abstract] The abstract states that the iterative algorithm ensures loss bounds leading to deterministic guarantees, but provides no details on the algorithm, pseudocode, convergence analysis, or derivation of the size bound O(d^{max{1,p/2}}/ε²). This makes it impossible to verify the central claim that the per-iteration bounding produces the stated deterministic and optimal-size coreset.

Authors: We agree that the abstract provides only a high-level summary and omits the technical details. The full manuscript contains the complete description of the iterative algorithm (including pseudocode in Section 3), the per-iteration loss bounding argument that yields the deterministic guarantee, the convergence analysis, and the derivation of the coreset size bound O(d^{max{1,p/2}}/ε²) without logarithmic factors (in Section 4). These sections rigorously establish how the maintained-set loss bounds imply the overall embedding property and optimality. If the referee believes a high-level sketch of the iteration would improve the abstract, we are happy to add one in revision. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected from abstract

full rationale

The abstract presents an iterative construction where per-iteration loss upper and lower bounds on the maintained subset versus the original data are used to obtain the deterministic ℓ_p subspace embedding guarantee. The claimed coreset size O(d^{max{1,p/2}}/ε²) is stated as following from this process and matching a known lower bound, with no equations, fitted parameters, or self-citations shown that would reduce the final result to the inputs by construction. The bounding mechanism is described as structurally independent of typical coreset guarantees, rendering the derivation self-contained on the basis of the provided text alone.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based solely on the abstract, the claim rests on the iterative loss-bounding process and the full-rank assumption for the input matrix; without the full paper it is unclear what additional mathematical lemmas or domain assumptions are used in the proof.

axioms (1)

domain assumption X is a full rank matrix in R^{n×d} with n ≫ d
Explicitly stated as the input setting for the subspace embedding definition.

pith-pipeline@v0.9.0 · 5869 in / 1429 out tokens · 59727 ms · 2026-05-21T16:45:18.508296+00:00 · methodology

Deterministic Coreset for Lp Subspace

Core claim

What carries the argument

If this is right

Where Pith is reading between the lines

Load-bearing premise

What would settle it

discussion (0)