TokenButler: Token Importance is Predictable

Ahmed F AbouElhamayed; Chi-Chih Chang; Mohamed S. Abdelfattah; Nilesh Jain; Sameh Gobriel; Yash Akhauri; Yifei Gao

arxiv: 2503.07518 · v2 · pith:UT2HHMIYnew · submitted 2025-03-10 · 💻 cs.CL · cs.AI · cs.LG

Yash Akhauri, Ahmed F AbouElhamayed, Yifei Gao, Chi-Chih Chang, Sameh Gobriel, Nilesh Jain, Mohamed S. Abdelfattah

Pith reviewed 2026-05-23 00:34 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords KV cache sparsity · token importance prediction · attention distillation · dynamic token selection · LLM inference optimization · long context processing · query-aware pruning

The pith

Token importance for KV-cache selection can be predicted from low-dimensional queries and projected keys to enable dynamic budgeted selection while keeping the full cache.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that a small learned predictor can forecast which past tokens will matter most for the next attention computation by distilling the model's own masked attention patterns into low-dimensional importance queries. These queries are combined with a cheap projection of the actual stored keys to rank tokens on the fly. A sympathetic reader would care because this approach keeps every token in memory yet only pays attention compute on the ones that score high, avoiding both permanent eviction errors and the overhead of retrieving large chunks. The method operates at fixed depth strides and a fixed token budget, with the predictor trained once and reused across inputs.

Core claim

TokenButler learns to predict low-dimensional importance queries at fixed depth strides; these are dotted with a learned projection of the real KV-cache keys to produce cheap per-token importance scores. The scores drive dynamic selection of a budgeted subset of tokens for attention while the full KV cache remains resident. Training distills the original model's masked causal attention distributions into the lightweight predictor with minimal added parameters.
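
As a rough illustration of the scoring path described above, the sketch below dots a low-dimensional importance query with projected keys and keeps the top-scoring tokens under a fixed budget. The function names (`project_keys`, `importance_scores`, `select_tokens`), the dimensions, and the random matrices standing in for the learned projection and the predicted query are all assumptions made for illustration; this is not the paper's implementation.

```python
import random

def project_keys(keys, proj):
    """Project each full-size key (length d) to r dimensions with a d x r matrix."""
    r = len(proj[0])
    return [[sum(k[i] * proj[i][j] for i in range(len(k))) for j in range(r)]
            for k in keys]

def importance_scores(query_low, keys_low):
    """Dot the low-dimensional importance query with every projected key."""
    return [sum(q * k for q, k in zip(query_low, key)) for key in keys_low]

def select_tokens(scores, budget):
    """Return indices of the `budget` highest-scoring tokens, in positional order."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:budget]
    return sorted(top)

# Toy setup: 8 cached tokens, key dim 16, low dim 4 (all values hypothetical).
rng = random.Random(0)
d, r, n_tokens, budget = 16, 4, 8, 3
keys = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n_tokens)]
proj = [[rng.gauss(0, 1) for _ in range(r)] for _ in range(d)]  # stand-in for the learned projection
query_low = [rng.gauss(0, 1) for _ in range(r)]                 # stand-in for a predicted importance query

keys_low = project_keys(keys, proj)
scores = importance_scores(query_low, keys_low)
selected = select_tokens(scores, budget)
# Attention would then run only over the tokens in `selected`; the other
# tokens stay in the cache and remain eligible at later decoding steps.
```

The point of the low dimension `r` is that scoring costs roughly `n_tokens * r` multiply-adds per query instead of `n_tokens * d`, which is where the "cheap" in the claim comes from under these assumptions.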

What carries the argument

Low-dimensional importance query predictor at fixed depth stride, combined with learned linear projection of KV keys to produce per-token scores for budgeted selection.

If this is right

Dynamic per-token selection under a fixed budget is possible without permanent token eviction or chunked retrieval.
Near-oracle accuracy is achieved on a synthetic small-context co-referential retrieval task where prior methods fail.
Competitive or better results hold on RULER and LongBench, with up to 1.6x on-GPU speedup from a prediction interval with neighbor fetching.
Up to 7.6x latency reduction is obtained relative to dense attention with CPU offloading.
Prediction cost can be amortized by computing the predictor only at fixed intervals while accuracy stays within 1.1%.
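
The amortization idea in the last item can be sketched as follows. This page does not spell out how neighbor fetching works, so the rule used here (expand each selected index to its positional neighbors within a `radius`, then reuse that set until the next refresh) is an assumption, as are the names `with_neighbors`, `decode_with_interval`, and the toy `score_fn`.

```python
def with_neighbors(indices, n_tokens, radius):
    """Expand each selected index to include positions within `radius` of it."""
    expanded = set()
    for i in indices:
        for j in range(max(0, i - radius), min(n_tokens, i + radius + 1)):
            expanded.add(j)
    return sorted(expanded)

def decode_with_interval(score_fn, n_steps, n_tokens, budget, interval, radius):
    """Call the (hypothetical) predictor every `interval` steps and reuse its selection."""
    selections, predictor_calls, cached = [], 0, []
    for step in range(n_steps):
        if step % interval == 0:
            scores = score_fn(step)  # the only place the predictor runs
            predictor_calls += 1
            top = sorted(range(n_tokens), key=lambda i: scores[i], reverse=True)[:budget]
            cached = with_neighbors(top, n_tokens, radius)
        selections.append(cached)  # intermediate steps reuse the cached set
    return selections, predictor_calls

# Toy score function: importance peaks at a position that drifts with the step.
n_tokens = 16
score_fn = lambda step: [-abs(i - (step % n_tokens)) for i in range(n_tokens)]
selections, calls = decode_with_interval(score_fn, n_steps=8, n_tokens=n_tokens,
                                         budget=2, interval=4, radius=1)
```

In this sketch the fetched set can grow to `budget * (2 * radius + 1)` tokens, so a real system would have to decide whether neighbors count against the budget; that choice is left open here.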

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same distillation approach could be tested on other attention-based architectures that maintain growing state.
If the predictor generalizes across model scales, it could be applied to reduce per-step compute in very long generations without retraining the base model.
Neighbor fetching might be replaced by learned retrieval of nearby tokens to further reduce accuracy variance on structured long contexts.
The fixed-stride design suggests an opportunity to combine TokenButler with static sparsity patterns that operate at different granularities.

Load-bearing premise

Attention distributions obtained under causal masking during distillation remain an accurate and stable proxy for which tokens actually matter at inference time on new inputs and tasks.
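
To make the premise concrete, here is a minimal sketch of a distillation objective of the kind described: the teacher is the base model's causally masked attention distribution, and the student is the predictor's score distribution over the same positions. The page does not state which loss the authors use, so the KL divergence below is an assumption, and all function names are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_distribution(logits_row, query_pos):
    """Softmax over positions <= query_pos only (the causal mask)."""
    return softmax(logits_row[: query_pos + 1])

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def distillation_loss(teacher_logits, predictor_scores):
    """Mean KL(teacher || predictor) over query positions, each causally masked."""
    losses = []
    for t in range(len(teacher_logits)):
        p = causal_distribution(teacher_logits[t], t)
        q = causal_distribution(predictor_scores[t], t)
        losses.append(kl_divergence(p, q))
    return sum(losses) / len(losses)

# Toy 3-token example: row t holds attention logits of query t over all keys.
teacher = [[1.0, 0.0, 0.0], [0.5, 2.0, 0.0], [0.1, 0.2, 3.0]]
uniform_student = [[0.0, 0.0, 0.0] for _ in range(3)]
loss = distillation_loss(teacher, uniform_student)
```

The premise above amounts to assuming that a low value of a loss like this on training data implies the predictor's ranking still matches the model's attention on unseen inputs.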

What would settle it

Run the full unpruned model and the TokenButler-pruned version on the same long inputs, then measure whether the selected token sets diverge from the tokens the full model attends to by enough to produce a measurable drop in next-token prediction accuracy or downstream task score.
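
One way to quantify the divergence mentioned above is to compare the predictor's top-k set with the top-k set under the full model's attention. The two metrics below (`selection_recall` and `attention_mass_covered`) are illustrative choices, not measurements reported on this page, and the toy numbers are made up.

```python
def top_k(values, k):
    """Indices of the k largest values, as a set."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return set(order[:k])

def selection_recall(true_attention, predicted_scores, k):
    """Fraction of the oracle top-k tokens that the predictor also selects."""
    oracle = top_k(true_attention, k)
    chosen = top_k(predicted_scores, k)
    return len(oracle & chosen) / k

def attention_mass_covered(true_attention, predicted_scores, k):
    """Share of the full model's attention mass that lands on the selected tokens."""
    chosen = top_k(predicted_scores, k)
    return sum(true_attention[i] for i in chosen) / sum(true_attention)

# Toy example: the predictor gets the top token right but misses the second.
true_attention = [0.5, 0.3, 0.1, 0.1]
predicted_scores = [2.0, 0.1, 1.5, 0.0]
recall = selection_recall(true_attention, predicted_scores, k=2)
mass = attention_mass_covered(true_attention, predicted_scores, k=2)
```

Tracking either metric alongside task accuracy would show whether accuracy drops coincide with the selected sets drifting away from the full model's attention.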

Original abstract

Large Language Models (LLMs) rely on the Key-Value (KV) Cache to store token history, enabling efficient decoding of tokens. As the KV-Cache grows, it becomes a major memory and computation bottleneck. However, there is an opportunity to alleviate this bottleneck: prior research has shown that only a small subset of tokens contribute meaningfully to each decoding step. A key challenge in finding these critical tokens is that they are dynamic, and heavily input query-dependent. Existing methods either risk quality by evicting tokens permanently, or retain the full KV-Cache but rely on retrieving chunks of tokens, and many existing KV-Cache sparsity methods rely on inaccurate proxies for token importance. To address these limitations, we introduce TokenButler, a high-granularity, query-aware predictor that learns to identify these critical tokens. TokenButler predicts low-dimensional importance queries at a fixed depth stride, and combines them with a learned projection of the real KV-cache keys to score tokens cheaply, enabling dynamic per-token selection under a fixed budget while preserving the full KV cache. We train TokenButler by distilling the model's masked causal attention distributions, optimizing a lightweight predictor with minimal parameter overhead. We evaluate TokenButler on a novel synthetic small-context co-referential retrieval task, demonstrating near-oracle accuracy where existing methods fail. Furthermore, TokenButler achieves competitive or superior performance on long-context benchmarks (RULER, LongBench), up to $\approx 1.6\times$ on-GPU speedup using our proposed *prediction interval with neighbor fetching* that amortizes predictor cost while maintaining accuracy within $\approx 1.1\%$, and up to $7.6\times$ reduction in latency compared to Dense Attention with CPU offloading. Code is available: https://github.com/abdelfattah-lab/TokenButler

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TokenButler gives a concrete, low-overhead way to predict which KV tokens matter using distilled attention plus stride queries and neighbor amortization, but the validation is too thin to trust the accuracy claims yet.

The core idea is to train a small predictor on the base model's causal attention scores at fixed depth strides, then use low-dimensional queries plus a learned projection of the actual KV keys to score tokens on the fly and attend only to a budgeted subset, while keeping the full cache resident. They add a prediction-interval trick with neighbor fetching to cut the predictor cost. That combination is new enough compared to prior eviction or chunk-retrieval work, and they ship code, which helps. On their synthetic co-referential task it gets near-oracle results; on RULER/LongBench it is competitive or better, and the amortized variant holds accuracy within roughly 1.1% while showing GPU speedups and big latency wins versus dense attention with CPU offload. Those are the practical bits worth noting. The weak part is the lack of any direct check on whether the distilled scores stay aligned with real attention once selection is active on held-out data or tasks. The abstract gives no error bars, no ablation on how the synthetic task was built, and no correlation numbers under distribution shift, so the 1.1% tolerance claim is hard to evaluate. The stress-test worry about the proxy degrading is reasonable given what's shown. This is aimed at people building long-context serving systems who need something lighter than full attention but more query-aware than static eviction. It is coherent enough on its own terms to go to a serious referee, mainly because the architecture is spelled out and the benchmarks are standard, even if more validation on the proxy would be needed in revision.

Referee Report

3 major / 2 minor

Summary. The paper introduces TokenButler, a lightweight query-aware predictor for dynamic KV-cache token selection in LLMs. It distills the base model's masked causal attention distributions at fixed depth strides to predict low-dimensional importance queries, which are combined with a learned projection of real KV keys to score tokens cheaply under a fixed budget while retaining the full cache. A 'prediction interval with neighbor fetching' amortizes predictor cost. It reports near-oracle accuracy on a novel synthetic small-context co-referential retrieval task (where prior methods fail), competitive or superior results on RULER and LongBench, up to 1.6x on-GPU speedup with accuracy held within ~1.1% under amortized prediction, and up to 7.6x latency reduction vs. dense attention with CPU offloading. Code is released.

Significance. If the distilled attention proxy remains faithful under inference-time dynamic selection and distribution shift, the method offers a promising route to high-granularity KV sparsity without permanent eviction or chunk-based retrieval, addressing a core scaling bottleneck in long-context models with low overhead. Code availability strengthens reproducibility.

major comments (3)

[Abstract / Experiments] Abstract and experimental results: near-oracle accuracy is claimed on the synthetic co-referential task and a 1.1% tolerance on real benchmarks, but no error bars, no ablation on synthetic task construction, and no direct post-selection correlation between predicted scores and full-model attention on held-out inputs are provided; this leaves the stability of the distillation proxy unverified under the exact conditions of dynamic selection.
[Method (distillation and inference-time scoring)] Method section on predictor training and scoring: the central selection logic relies on the learned projection of KV keys combined with distilled queries, yet the manuscript provides no measurement of how well these scores correlate with true importance once selection is active on unseen tasks; the amortization claims and 1.1% tolerance rest on this untested assumption.
[Experiments / Results] Results on RULER/LongBench: competitive numbers are reported, but without ablations that isolate the contribution of the fixed-stride prediction plus neighbor fetching versus simpler baselines, it is unclear whether the core novelty (low-dimensional query prediction) is load-bearing for the observed performance.

minor comments (2)

[Method] Clarify in the method whether the prediction stride is a free hyperparameter or derived, and how it interacts with the depth stride during distillation.
[Experiments] Ensure all reported speedups and accuracy figures include the exact model sizes, context lengths, and hardware details for reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and outline revisions to strengthen the manuscript.

Referee: [Abstract / Experiments] Abstract and experimental results: near-oracle accuracy is claimed on the synthetic co-referential task and a 1.1% tolerance on real benchmarks, but no error bars, no ablation on synthetic task construction, and no direct post-selection correlation between predicted scores and full-model attention on held-out inputs are provided; this leaves the stability of the distillation proxy unverified under the exact conditions of dynamic selection.

Authors: We agree that error bars, task ablations, and explicit correlation measurements would strengthen the claims. In the revision we will report standard deviations across multiple random seeds for all key metrics on both the synthetic task and the real benchmarks. We will add an ablation varying the co-referential task parameters (e.g., number of referents and context length) to demonstrate robustness of the near-oracle result. For post-selection correlation, we will include a new analysis that computes Spearman rank correlation between TokenButler scores and the base model's full attention weights on held-out inputs after dynamic selection has occurred; this directly tests the proxy under the operating conditions of the method. revision: yes
Referee: [Method (distillation and inference-time scoring)] Method section on predictor training and scoring: the central selection logic relies on the learned projection of KV keys combined with distilled queries, yet the manuscript provides no measurement of how well these scores correlate with true importance once selection is active on unseen tasks; the amortization claims and 1.1% tolerance rest on this untested assumption.

Authors: The original submission presents end-to-end accuracy on RULER and LongBench (which exercise dynamic selection on unseen data) as the primary validation of the scoring mechanism. We acknowledge that an explicit correlation study under active selection would provide stronger direct evidence. We will therefore add, in a new subsection, quantitative measurements of score-to-attention correlation on held-out tasks after the budgeted selection step, together with an analysis of how the prediction-interval amortization affects this correlation. revision: yes
Referee: [Experiments / Results] Results on RULER/LongBench: competitive numbers are reported, but without ablations that isolate the contribution of the fixed-stride prediction plus neighbor fetching versus simpler baselines, it is unclear whether the core novelty (low-dimensional query prediction) is load-bearing for the observed performance.

Authors: The current experiments compare TokenButler against prior KV-sparsity methods that lack the query-prediction component, thereby providing indirect evidence of the novelty's contribution. To make the isolation explicit, we will add an ablation table that replaces the learned low-dimensional query predictor with (i) uniform random scoring and (ii) a static key-norm baseline while keeping the fixed-stride schedule and neighbor-fetching logic identical. This will quantify the incremental benefit of the distilled query prediction. revision: yes
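
The correlation analysis and the two ablation baselines promised in the responses above could look roughly like the sketch below. It computes a Spearman rank correlation in plain Python and defines uniform-random and key-norm scorers to compare against. Every name here is hypothetical, and the exact protocol the authors would use (which layers, heads, and inputs) is not given on this page.

```python
import math
import random

def ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))

def key_norm_scores(keys):
    """Static baseline: score each token by the L2 norm of its key."""
    return [math.sqrt(sum(v * v for v in k)) for k in keys]

def random_scores(n, seed=0):
    """Uniform random baseline scores."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]

# Toy comparison against made-up full-attention weights for one query.
attention = [0.40, 0.05, 0.30, 0.10, 0.15]
predicted = [2.1, -0.3, 1.7, 0.2, 0.1]
rho_predictor = spearman(predicted, attention)
rho_random = spearman(random_scores(len(attention)), attention)
```

Reporting the correlation for the learned predictor next to the same number for each baseline would show how much of the ranking quality comes from the distilled queries rather than from the stride schedule or neighbor fetching.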

Circularity Check

0 steps flagged

No circularity: predictor trained on independent attention targets; selection logic not self-referential

full rationale

The derivation trains a lightweight predictor to match the base LLM's masked causal attention distributions (an external computation) and then applies the resulting queries plus a fitted KV projection for scoring. This training target is computed once from the full model and is not redefined by the predictor's own outputs at inference. No self-citations, uniqueness theorems, or ansatzes from prior author work are invoked as load-bearing premises. The central claim therefore rests on an independently verifiable distillation objective plus benchmark results rather than reducing to its own fitted values by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The approach rests on distilling from attention as a proxy and on the assumption that the synthetic co-referential task is diagnostic; no new physical entities or ungrounded constants are introduced.

free parameters (2)

prediction stride
Fixed depth stride at which the predictor is invoked; chosen to amortize cost.
token budget
Fixed per-step token retention budget used for selection.

axioms (1)

domain assumption: Masked causal attention distributions serve as a reliable training signal for token importance at inference
Used to train the predictor via distillation

pith-pipeline@v0.9.0 · 5885 in / 1311 out tokens · 42397 ms · 2026-05-23T00:34:04.475530+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Compute Where it Counts: Self Optimizing Language Models
cs.LG 2026-05 unverdicted novelty 6.0

SOL trains a policy to dynamically control multiple efficiency mechanisms per token via group-relative policy optimization on teacher-forced episodes, yielding better quality at matched average budget than static or r...
Computational Challenges in Token Economics: Bridging Economic Theory and AI System Design
cs.AI 2026-05 unverdicted novelty 4.0

The paper defines Computational Token Economics and introduces the Token Economics Trilemma as a framework for trade-offs in granularity, real-time performance, and optimality, while outlining a research agenda for th...