HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM Inference

Chong You; Praneeth Netrapalli; Prateek Jain; Sanjiv Kumar; Srinadh Bhojanapalli; Varun Yerram; Yashas Samaga B L

arxiv: 2402.09360 · v1 · pith:PS7G6OQXnew · submitted 2024-02-14 · 💻 cs.LG · cs.AI

Yashas Samaga B L, Varun Yerram, Chong You, Srinadh Bhojanapalli, Sanjiv Kumar, Prateek Jain, Praneeth Netrapalli

keywords high · hire · model · top- · approximate · columns · latency · recall

Autoregressive decoding with generative Large Language Models (LLMs) on accelerators (GPUs/TPUs) is often memory-bound: most of the time is spent transferring model parameters from high-bandwidth memory (HBM) to cache. On the other hand, recent works show that LLMs can maintain quality with significant sparsity/redundancy in the feedforward (FFN) layers by appropriately training the model to operate on a top-$k$ fraction of rows/columns (where $k \approx 0.05$), thereby suggesting a way to reduce the transfer of model parameters, and hence latency. However, exploiting this sparsity to improve latency is hindered by the fact that identifying the top rows/columns is data-dependent and is usually performed using full matrix operations, severely limiting potential gains. To address these issues, we introduce HiRE (High Recall Approximate Top-$k$ Estimation). HiRE comprises two novel components: (i) a compression scheme to cheaply predict top-$k$ rows/columns with high recall, followed by full computation restricted to the predicted subset, and (ii) DA-TOP-$k$: an efficient multi-device approximate top-$k$ operator. We demonstrate that on a one-billion-parameter model, HiRE, applied to both the softmax and feedforward layers, achieves almost matching pretraining and downstream accuracy and speeds up inference by $1.47\times$ on a single TPUv5e device.
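
The two-stage idea in component (i) can be illustrated with a small sketch: score all rows with a cheap compressed copy of the weights, keep more candidates than needed so that recall stays high, then run the exact computation only on those candidates. The abstract does not say which compression scheme HiRE uses, so the coarse quantization below is purely an assumption for illustration, and every name here (`quantize`, `approx_top_k`, `overcollect`) is hypothetical rather than taken from the paper.

```python
import random

def matvec(W, x):
    """Dense product W @ x for a matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def top_indices(scores, k):
    """Indices of the k largest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def quantize(W, step=0.5):
    """Assumed stand-in for a cheap compressed copy: round weights to a coarse grid."""
    return [[round(w / step) * step for w in row] for row in W]

def approx_top_k(W, W_cheap, x, k, overcollect=2):
    """Two-stage top-k: cheap prediction of candidates, then exact scoring on them."""
    # Stage 1: predict candidates from the compressed weights. Collecting
    # overcollect * k rows (more than k) trades extra work for higher recall.
    n_candidates = min(len(W), overcollect * k)
    candidates = top_indices(matvec(W_cheap, x), n_candidates)
    # Stage 2: exact computation restricted to the predicted subset of rows.
    exact = {i: sum(w * xi for w, xi in zip(W[i], x)) for i in candidates}
    return sorted(candidates, key=lambda i: exact[i], reverse=True)[:k]

def recall(approx, truth):
    """Fraction of the true top-k rows recovered by the approximation."""
    return len(set(approx) & set(truth)) / len(truth)

random.seed(0)
n_rows, dim, k = 200, 16, 10  # k / n_rows = 0.05, the fraction quoted in the abstract
W = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_rows)]
x = [random.gauss(0, 1) for _ in range(dim)]
W_cheap = quantize(W)
truth = top_indices(matvec(W, x), k)

for c in (1, 2, 4):
    found = approx_top_k(W, W_cheap, x, k, overcollect=c)
    print(f"overcollect={c}: recall={recall(found, truth):.2f}")
```

In this sketch the candidate sets are nested as `overcollect` grows, so recall can only stay the same or improve, while the exact stage touches only `overcollect * k` rows of the full matrix instead of all of them.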

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Stochastic Sparse Attention for Memory-Bound Inference
cs.LG 2026-05 accept novelty 6.0

SANTA sparsifies post-softmax value aggregation via stratified sampling of S << n_k indices to produce an unbiased estimator, delivering 1.5x decode attention speedup on RTX 6000 Ada at 32k contexts while matching bas...

Stochastic Sparse Attention for Memory-Bound Inference
cs.LG 2026-05 unverdicted novelty 6.0

SANTA replaces full value-cache multiply-accumulates with stochastic gather-and-add sampling from the attention distribution to reduce memory bandwidth while preserving an unbiased estimator.

Sustainable Code Generation Using Large Language Models: A Systematic Literature Review
cs.SE 2026-03 unverdicted novelty 3.0

A systematic review finds research on the sustainability of LLM-generated code to be limited, fragmented, and without accepted frameworks for measurement or benchmarking.