A Faster Generalized Two-Stage Approximate Top-K

Praneeth Netrapalli; Prateek Jain; Spandana Raj Babbula; Varun Yerram; Yashas Samaga

arxiv: 2506.04165 · v3 · pith:FQ6YTG6Qnew · submitted 2025-06-04 · 💻 cs.LG · cs.DS

A Faster Generalized Two-Stage Approximate Top-K

Yashas Samaga , Varun Yerram , Spandana Raj Babbula , Prateek Jain , Praneeth Netrapalli This is my paper

Pith reviewed 2026-05-19 10:53 UTC · model grok-4.3

classification 💻 cs.LG cs.DS

keywords approximate top-ktwo-stage algorithmexpected recallTPU implementationmachine learning accelerators

0 comments

The pith

Generalizing the first stage to select top K' elements per partition reduces the second-stage input size while preserving expected recall for approximate top-K selection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper extends the two-stage approximate top-K algorithm by allowing each partition in the first stage to contribute the top K' elements instead of just the top one. With K' greater than one, the number of partitions can be reduced, which shrinks the candidate set for the second-stage exact sort more efficiently than the original method while keeping the expected recall unchanged. The authors provide a closed-form expression for this expected recall assuming random partitioning and show that it matches the original for equivalent settings. They also tighten the bound on recall for the baseline algorithm by a factor of two and demonstrate a Cloud TPUv5e implementation that delivers roughly tenfold speedup without recall loss.

Core claim

By selecting the top K' > 1 elements from each of fewer partitions in the first stage, the algorithm reduces the size of the subset passed to the second stage while maintaining the same expected recall as selecting top-1 from more partitions under random partitioning.

What carries the argument

The expected recall formula derived from random partitioning into equal-sized chunks, which equates different (number of partitions, K') configurations.

If this is right

The original algorithm's expected recall has a tighter provable bound than previously reported.
Fewer partitions with higher K' per partition more effectively reduces input size to the second stage.
The TPUv5e implementation achieves approximately an order of magnitude speedup over the original without sacrificing recall.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This generalization might improve top-K operations in other accelerator-based ML pipelines such as large language model inference.
Deterministic partitioning strategies could be explored to remove reliance on randomness while preserving performance.
The method may extend to related problems like approximate nearest neighbor search or top-K in sparse matrices.

Load-bearing premise

The input array is randomly partitioned into equal-sized chunks for the expected-recall calculation to hold.

What would settle it

Observing that the actual recall falls significantly below the predicted expected value on a non-randomly partitioned array or on data with strong correlations between elements.

read the original abstract

We consider the Top-$K$ selection problem, which aims to identify the largest $K$ elements in an array. Top-$K$ selection arises in many machine learning algorithms and often becomes a bottleneck on accelerators, which are optimized for dense matrix multiplications. To address this problem, Chern et al. (2022) proposed a fast two-stage approximate Top-$K$ algorithm that: (i) partitions the input array into equal-sized chunks and selects the top-$1$ element from each partition; and (ii) sorts the resulting smaller subset and returns the top $K$ elements. In this paper, we generalize the first stage so that each partition selects the top $K'$ elements (for $1 \leq K' \leq K$). Our contributions include: (i) an expression for the expected recall of this generalized algorithm under random partitioning, and a demonstration that choosing $K' > 1$ with fewer partitions in the first stage more effectively reduces the input size to the second stage while maintaining the same expected recall as the original algorithm; (ii) a bound on the expected recall of the original algorithm as a function of the algorithm parameters that is provably tighter by a factor of $2$ than the bound reported by Chern et al. (2022); and (iii) an implementation of our algorithm on Cloud TPUv5e that achieves approximately an order of magnitude speedup over the original algorithm without sacrificing recall.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This generalizes two-stage top-K to K' per chunk with a tighter bound and TPU speedups, but the recall claims hinge on random partitioning that real data may violate.

read the letter

The colleague should know two things about this paper. First, it generalizes the two-stage approximate top-K by pulling the top K' elements from each partition instead of just one, and shows that this lets you use fewer partitions while keeping expected recall the same under the random model. Second, they deliver a closed-form expression for that recall, a bound that is twice as tight as before, and real TPUv5e speedups of about 10x. What stands out as new is the expected-recall formula for arbitrary K' and the demonstration that K' greater than 1 reduces the second-stage input size more effectively. The factor-of-two tightening of the Chern et al. bound is a solid technical step. The hardware implementation on Cloud TPUv5e is concrete and shows the speedup without recall sacrifice in their setup. These pieces add up to a practical advance for accelerator kernels. The soft spots are mostly around the modeling assumptions. Everything rests on the input being randomly partitioned into equal-sized chunks so that top elements land uniformly. The stress-test note is right that structured data, like sorted activations or clustered embeddings, breaks the independence and can lower actual recall for the same parameters. The paper would be stronger with tests on real ML arrays rather than relying only on the probabilistic model. The abstract summarizes the claims but the full derivations need checking for any overlooked edge cases. This paper is for people who implement fast top-K in ML systems on TPUs or GPUs. A reader looking for better approximate selection with hardware awareness will get value from the formula and the measured results. The thinking is clear and the work engages the prior literature directly, so it deserves a serious referee to confirm the math and suggest experiments on non-random inputs. I recommend sending it for peer review.

Referee Report

2 major / 1 minor

Summary. The manuscript generalizes the two-stage approximate Top-K algorithm of Chern et al. (2022) by selecting the top K' elements (for K' > 1) from each of fewer equal-sized partitions in the first stage. It derives an expression for expected recall under random partitioning, shows that suitable (K', partition count) pairs reduce the second-stage input size while preserving the same expected recall, provides a bound on expected recall that is provably tighter by a factor of 2 than the prior bound, and reports a Cloud TPUv5e implementation achieving approximately an order of magnitude speedup with no recall loss.

Significance. If the expected-recall claims hold, the work offers a practical route to faster Top-K on accelerators without accuracy degradation. The TPUv5e implementation and the tighter bound are concrete strengths that could influence ML systems practice. The central justification for preferring K' > 1, however, rests on the random-partitioning model, so the practical significance depends on how well the size-reduction benefit survives realistic ML data distributions.

major comments (2)

[§4] §4 (expected-recall derivation): the claimed equivalence of expected recall between the generalized (K' > 1, fewer partitions) and original (K' = 1) algorithms is derived under the assumption that the input array is randomly partitioned into equal-sized chunks. Because this model makes inclusion probabilities uniform and independent, the manuscript must either prove robustness or supply experiments on structured inputs (sorted arrays, clustered embeddings, attention scores) to confirm that the same (K', partition count) pair still reduces second-stage size while maintaining recall; without such evidence the algorithmic recommendation is not yet load-bearing.
[§5] §5 (TPU experiments): the abstract states that recall is maintained, yet the quantitative comparison (recall values or plots for matched expected-recall parameter pairs) is only summarized. Explicit tables or figures showing recall for the generalized versus original algorithm across the tested workloads would allow direct verification of the size-reduction claim.

minor comments (1)

[Abstract] The abstract refers to 'an expression for the expected recall' without indicating whether it is closed-form or involves a summation; a brief clarification in the introduction would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our work. We address the major comments point by point below, indicating the revisions we will make to strengthen the manuscript.

read point-by-point responses

Referee: [§4] §4 (expected-recall derivation): the claimed equivalence of expected recall between the generalized (K' > 1, fewer partitions) and original (K' = 1) algorithms is derived under the assumption that the input array is randomly partitioned into equal-sized chunks. Because this model makes inclusion probabilities uniform and independent, the manuscript must either prove robustness or supply experiments on structured inputs (sorted arrays, clustered embeddings, attention scores) to confirm that the same (K', partition count) pair still reduces second-stage size while maintaining recall; without such evidence the algorithmic recommendation is not yet load-bearing.

Authors: We agree that the derivation of expected recall equivalence relies on the random partitioning model. This assumption is explicitly stated in the manuscript and is common in the analysis of approximate top-k algorithms to obtain closed-form expressions. While a general proof of robustness to arbitrary distributions may be challenging due to the dependence on data structure, we will strengthen the paper by adding experiments on structured inputs. Specifically, in the revised version, we will include results on sorted arrays, clustered embeddings from real ML models, and attention score matrices to verify that the recommended (K', partition count) pairs maintain the expected recall and size reduction benefits. These experiments will provide empirical support for the practical applicability beyond the random model. revision: yes
Referee: [§5] §5 (TPU experiments): the abstract states that recall is maintained, yet the quantitative comparison (recall values or plots for matched expected-recall parameter pairs) is only summarized. Explicit tables or figures showing recall for the generalized versus original algorithm across the tested workloads would allow direct verification of the size-reduction claim.

Authors: We appreciate this suggestion for improving clarity. Although the current manuscript summarizes the recall results, we will revise Section 5 to include explicit tables and figures that report the recall values for both the generalized algorithm (with K' > 1) and the original (K' = 1) across all tested workloads. These will be presented for parameter pairs that achieve matched expected recall, allowing readers to directly verify the size-reduction benefits without loss in recall. revision: yes

Circularity Check

0 steps flagged

No significant circularity; expected-recall derivation is self-contained under external random-partitioning model

full rationale

The paper derives an explicit probabilistic expression for expected recall from the assumption that the input is randomly partitioned into equal-sized chunks, then shows mathematically that K'>1 with fewer partitions preserves that expectation while shrinking the second-stage input. This is a direct calculation from the stated model rather than a self-referential definition, fitted parameter renamed as prediction, or load-bearing self-citation. The tighter bound is proven relative to the external Chern et al. (2022) result. The TPU implementation and empirical speedup are separate from the analysis. No step reduces by construction to its own inputs.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The central claims rest on a domain assumption of random equal-sized partitioning together with two tunable algorithm parameters whose values are chosen to trade off recall against second-stage size.

free parameters (2)

K'
Number of top elements retained per partition; chosen to maintain target recall while shrinking the candidate set passed to the second stage.
number of partitions
Reduced when K' > 1 so that the total number of candidates fed to the final sort remains comparable to the original algorithm.

axioms (1)

domain assumption The input array is randomly partitioned into equal-sized chunks.
Invoked to derive the closed-form expression for expected recall.

pith-pipeline@v0.9.0 · 5809 in / 1437 out tokens · 60202 ms · 2026-05-19T10:53:20.315248+00:00 · methodology

A Faster Generalized Two-Stage Approximate Top-K

Core claim

What carries the argument

If this is right

Where Pith is reading between the lines

Load-bearing premise

What would settle it

discussion (0)