Dustin: Draft-Augmented Sparse Verification for Efficient Long-Context Generation with Speculative Decoding

Chi-Chih Chang; Chun-Che Yang; Grace Li Zhang; Jian-Jia Chen; Kai-Chiang Wu; Ning-Chi Huang; Pei-Shuo Wang; WenHung Lee; Xiaolin Lin

arxiv: 2606.24957 · v1 · pith:U2ICW3ADnew · submitted 2026-06-23 · 💻 cs.CL · cs.LG


Pith reviewed 2026-06-26 00:10 UTC · model grok-4.3


keywords: speculative decoding, sparse verification, long-context LLMs, KV cache, draft model, attention sparsity, inference acceleration


The pith

Draft-augmented sparse verification lets speculative decoding handle 32k contexts with a reported 9.17x end-to-end speedup and negligible accuracy loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that the main slowdown in long-context speculative decoding comes from repeatedly loading the full KV cache during verification. Dustin addresses this by combining lookahead predictions from the draft model with the target model's earlier attention patterns to locate the tokens that matter most across several verification steps. It then scores importance on only a small number of attention heads instead of all of them. On Qwen2.5-72B at a 32k sequence length, the paper reports a 27.85x self-attention speedup and a 9.17x end-to-end decoding speedup, with negligible accuracy degradation on PG-19 and LongBench.

Core claim

Dustin integrates lookahead signals from the draft model with historical attention from the target model to identify critical tokens with high fidelity across multi-step verification windows and employs a sparse estimation scheme that restricts importance scoring to a minimal subset of attention heads.

What carries the argument

Draft-augmented sparse verification, which merges draft-model lookahead with target-model attention history and limits importance scoring to a few heads to avoid full recomputation.
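The selection step described above can be sketched roughly as follows. This is a minimal illustration, not the paper's Eq. 5 or Eq. 6, which are not reproduced on this page: the function names, the single per-token score vectors, and the mixing weight `alpha` are all assumptions made for the example.

```python
from typing import List


def normalize(scores: List[float]) -> List[float]:
    """Scale non-negative scores so they sum to 1 (all-zero input is returned unchanged)."""
    total = sum(scores)
    return [s / total for s in scores] if total > 0 else list(scores)


def hybrid_importance(target_hist: List[float],
                      draft_look: List[float],
                      alpha: float = 0.5) -> List[float]:
    """Blend target-history and draft-lookahead attention into one score per token.

    alpha is a hypothetical mixing weight chosen for illustration only.
    """
    if len(target_hist) != len(draft_look):
        raise ValueError("both signals must cover the same tokens")
    t = normalize(target_hist)
    d = normalize(draft_look)
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(t, d)]


def top_k_indices(scores: List[float], k: int) -> List[int]:
    """Return the positions of the k largest scores, sorted by position."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Made-up attention mass over six cached tokens.
target_hist = [0.50, 0.05, 0.05, 0.30, 0.05, 0.05]  # what the target attended to so far
draft_look = [0.10, 0.05, 0.60, 0.05, 0.15, 0.05]   # what the draft attends to while speculating

verify_set = top_k_indices(hybrid_importance(target_hist, draft_look), k=3)
print(verify_set)  # token 2 is kept only because the draft signal flags it
```

In this toy input, token 2 carries little historical attention but a large share of draft lookahead attention, so a history-only selector would miss it while the blended score keeps it.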

If this is right

Self-attention latency drops by a factor of 27.85 at 32k context.
Full decoding throughput rises by a factor of 9.17.
Accuracy loss on long-context benchmarks remains negligible.
The same token-selection logic works across multi-batch inference.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Hardware that is memory-bandwidth limited could run much longer contexts without proportional slowdown.
The same draft-plus-history signal might reduce recomputation in other verification-heavy LLM stages.
Repeating the head-subset choice on different model families would test whether the sparse scoring stays reliable.

Load-bearing premise

Draft lookahead plus attention history from only a few heads will keep selecting the right tokens accurately over successive verification steps.
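One way to probe this premise is an attention recovery rate of the kind plotted in the paper's figures. The sketch below assumes a simple definition (the share of future attention mass that falls on the retained tokens); the paper's exact formula is not shown on this page, and all numbers are made up.

```python
from typing import Iterable, List


def top_k(scores: List[float], k: int) -> List[int]:
    """Positions of the k largest scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def attention_recovery_rate(future_attn: List[float], kept: Iterable[int]) -> float:
    """Fraction of future attention mass covered by the kept token positions."""
    total = sum(future_attn)
    if total == 0:
        return 0.0
    return sum(future_attn[i] for i in set(kept)) / total


# Hypothetical attention over six cached tokens.
historical = [0.40, 0.20, 0.05, 0.25, 0.05, 0.05]  # observed in past steps
future = [0.35, 0.05, 0.10, 0.30, 0.15, 0.05]      # what upcoming steps will actually use

k = 3
oracle_arr = attention_recovery_rate(future, top_k(future, k))          # best possible choice
history_arr = attention_recovery_rate(future, top_k(historical, k))     # choice from history only
print(f"oracle={oracle_arr:.2f} history={history_arr:.2f}")
```

A small gap between the two values across many verification steps would support the premise; a widening gap would indicate that the selected tokens are drifting away from the ones that matter.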

What would settle it

At 32k length on Qwen2.5-72B, measure accuracy on LongBench after running the method; a drop larger than a few percent or an end-to-end speedup below 5x would show the token selection is not faithful enough.

Figures

Figures reproduced from arXiv: 2606.24957 by Chi-Chih Chang, Chun-Che Yang, Grace Li Zhang, Jian-Jia Chen, Kai-Chiang Wu, Ning-Chi Huang, Pei-Shuo Wang, WenHung Lee, Xiaolin Lin.

**Figure 1.** Latency breakdown of a single speculative decoding step. Experiments are measured with a 32k input length and batch size 16. We compare classic Speculative Decoding (SD), MagicDec (MDec) (Sadhukhan et al., 2024), and our proposed Dustin. The x-axis notation Target(Draft) indicates the specific target and draft model pair used.

**Figure 2.** Attention recovery rate analysis (Historical Score). Comparison of attention recovery rates using the future average attention (Oracle) versus historical attention scores on the Qwen2.5-32B model. The minimal gap indicates high temporal stability.

**Figure 4.** Layer-wise attention recovery analysis. Comparison of target-history, draft-lookahead, and hybrid strategies on Qwen2.5-32B. The results highlight structural complementarity: target-historical signals dominate in deeper layers, while draft-lookahead signals excel in early layers (via lookahead).

**Figure 5.** Overview of our sparse verification approach. The process begins with hybrid attention aggregation (Eq. 5) to compute a global importance map, followed by Top-K selection (Eq. 6) to determine the final verification set I_verify.

**Figure 6.** Figure 6 illustrates our SRH selection pipeline. Following CompressKV (Lin et al., 2025), we identify SRHs using a layer-wise selection strategy. Specifically, based on profiling scores, only a small subset of heads is retained in each layer to provide their attention scores for KV selection.

**Figure 7.** Figure 7 breaks down the self-attention cost in Dustin's target-model verification into online importance estimation and sparse verification attention. With a fixed KV budget of 512 tokens, the speedup scales with verification workload: 9.35× (16K, batch 8) and 27.85× (32K, batch 16). More detailed results are provided in Appendix G.

**Figure 8.** Normalized latency of online importance estimation.

**Figure 9.** Relationship between Attention Recovery Rate (ARR) and output-logit KL divergence. The left and right panels show results for Llama-3.1-8B-Instruct and Qwen2.5-7B-Instruct, respectively. Each point corresponds to one sampled KV subset, with ARR and KL divergence averaged across decoding steps. The negative trend indicates that higher ARR generally leads to lower output distribution distortion.

**Figure 10.** Detailed latency breakdown of the target model verification phase on Qwen2.5-72B. The charts compare the latency of Full Cache (Baseline) against Dustin, decomposing the latter into Criticality Estimation (light blue) and Approximate Attention (dark blue). Our estimation overhead remains negligible across all settings, while the sparse attention yields massive latency reductions, particularly at longer co…

**Figure 11.** Additional cross-model ARR analysis across Qwen2.5 and Llama3 target–draft pairs. We compare oracle ARR with draft-lookahead ARR using Qwen2.5-0.5B, 1.5B and Llama-3.2-1B, 3B as draft models across their corresponding larger target models. The results further characterize the model-pair-dependent reliability of draft-lookahead signals.

**Figure 12.** Layer-wise comparison of target-history, draft-lookahead, and hybrid selection on high-gap Qwen2.5 target–draft pairs. These results show that draft-lookahead and target-history provide complementary signals across layers, motivating the hybrid construction.

**Figure 13.** Impact of the budget tuning parameter m on accuracy recovery across different benchmarks. Blue indicates Target-Historical only, green indicates Draft-Lookahead only, and the orange/red curve represents the Hybrid approach. It is important to note that the optimal m used in our main experiments (Dustin-H) was determined via Bayesian optimization using Optuna on the LongReward dataset, acting as a proxy fo…
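The output-logit KL divergence that Figure 9 plots against ARR can be computed along these lines. This is a hedged sketch: the direction of the divergence (full-cache distribution as the reference) and the toy logits are assumptions, since the page does not state how the paper defines it.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical next-token logits over a three-word vocabulary.
full_cache_logits = [2.0, 1.0, 0.1]   # target model attending to the whole KV cache
sparse_logits = [1.9, 1.1, 0.1]       # same step with only the selected KV subset

p_full = softmax(full_cache_logits)
p_sparse = softmax(sparse_logits)
distortion = kl_divergence(p_full, p_sparse)
print(f"KL(full || sparse) = {distortion:.6f}")
```

A value close to zero means the sparse KV subset barely changes the output distribution, which is the behavior the negative ARR/KL trend in the caption describes.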

Original abstract

While speculative decoding improves inference throughput for multi-batch long-context Large Language Models (LLMs), its efficiency is often limited by a verification bottleneck where Key-Value (KV) cache loading dominates latency. Existing compression methods fail in this regime: static eviction incurs accuracy loss due to saliency shift, while dynamic selection introduces prohibitive computational overhead during the verification path. We propose Dustin, a sparse verification framework designed for long-context speculative decoding. Dustin integrates lookahead signals from the draft model with historical attention from the target model to identify critical tokens with high fidelity across multi-step verification windows. To reduce recomputation latency, this approach further employs a sparse estimation scheme that restricts importance scoring to a minimal subset of attention heads. Evaluations on PG-19 and LongBench with Qwen2.5-72B demonstrate that Dustin achieves a 27.85x speedup in self-attention and a 9.17x end-to-end decoding speedup at a 32k sequence length, all with negligible accuracy degradation.
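To see why KV cache loading can dominate at this scale, a back-of-the-envelope footprint estimate helps. The sketch below is not from the paper: the 80-layer depth matches the overhead formula quoted later on this page, but the 8 KV heads, head dimension of 128, and 2-byte elements are assumptions about a Qwen2.5-72B-style configuration, and the 512-token budget is the one quoted in the Figure 7 caption.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to hold keys and values for every layer, head, and token."""
    keys_and_values = 2
    return (keys_and_values * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


GIB = 1024 ** 3

# Assumed model shape (see lead-in): 80 layers, 8 KV heads, head_dim 128, fp16/bf16.
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

full = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, seq_len=32 * 1024, batch_size=16)
budgeted = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, seq_len=512, batch_size=16)

print(f"full cache:     {full / GIB:.1f} GiB read per verification pass")
print(f"512-token cache: {budgeted / GIB:.1f} GiB read per verification pass")
print(f"ratio of KV tokens touched: {full // budgeted}x")
```

Under these assumptions the ratio only bounds the reduction in KV reads; it says nothing about importance-estimation overhead or the rest of the decoding step, so it should not be read as a prediction of the measured speedups.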

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Dustin combines draft lookahead with target historical attention and sparse-head scoring to cut verification latency in long-context speculative decoding, but the accuracy preservation at 32k rests on limited visible evidence.


The main point is that this paper gives a concrete way to speed up the verification step in speculative decoding when contexts hit 32k. It mixes signals from the draft model with the target model's past attention, then scores importance only on a small set of heads to avoid full recomputation.

What stands out as new is that particular signal combination plus the sparse-head restriction for multi-step windows. Earlier work on dynamic KV selection or static eviction is cited, but the integration here targets the exact latency spot where KV cache loading dominates.

The experiments on Qwen2.5-72B with PG-19 and LongBench report 27.85x self-attention speedup and 9.17x end-to-end decoding at 32k length, with the claim of negligible accuracy loss. That kind of number matters for anyone running long-context serving, and the setup uses real models rather than toy scales.

The soft spot is exactly the one in the stress-test note. If head saliency shifts across layers or verification steps, restricting scoring to a minimal head subset could drop tokens that only look important later, leading to a gradual drop in quality even if single-step checks look fine. The abstract asserts that accuracy holds, but without per-layer ablations or token-overlap statistics across steps, that part is hard to judge from what is shown.

This is worth a serious referee for groups working on inference efficiency and KV cache methods. Readers who care about measured speedups on 70B-class models will find usable numbers here, even if they want to see the missing checks. I would send it to review rather than desk reject.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces Dustin, a sparse verification framework for long-context speculative decoding. It combines draft-model lookahead signals with target-model historical attention to identify critical KV tokens and applies a sparse estimation scheme that restricts importance scoring to a minimal subset of attention heads, thereby reducing recomputation latency during verification. On PG-19 and LongBench with Qwen2.5-72B at 32k context, the method is reported to deliver 27.85× self-attention speedup and 9.17× end-to-end decoding speedup while incurring negligible accuracy degradation.

Significance. If the accuracy preservation holds under the claimed conditions, Dustin would directly address the KV-cache loading bottleneck that limits speculative decoding throughput for long contexts, providing a practical route to higher inference efficiency without requiring changes to the underlying model architecture.

major comments (1)

[Abstract; §4 (Experiments) and §5 (Analysis)] The central claim of negligible accuracy degradation rests on the fidelity of the sparse head-restricted importance scorer across multi-step verification windows. No per-layer head-ablation results or multi-step token-overlap statistics are supplied to demonstrate that the minimal head subset continues to recover the same critical tokens that full-head scoring would select when saliency shifts occur.
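A multi-step token-overlap statistic of the kind requested here could be computed as below. This is only one possible formulation (Jaccard overlap per verification step), and the selected token sets are invented for illustration; nothing here is taken from the manuscript.

```python
from typing import List, Sequence


def jaccard(a: Sequence[int], b: Sequence[int]) -> float:
    """Intersection-over-union of two token index sets (1.0 if both are empty)."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def overlap_per_step(full_head_sets: List[Sequence[int]],
                     sparse_head_sets: List[Sequence[int]]) -> List[float]:
    """Overlap between full-head and sparse-head selections at each verification step."""
    if len(full_head_sets) != len(sparse_head_sets):
        raise ValueError("need one selection per step from each scorer")
    return [jaccard(f, s) for f, s in zip(full_head_sets, sparse_head_sets)]


# Hypothetical selections over three consecutive verification steps.
full_head = [[0, 3, 7, 9], [0, 3, 8, 9], [0, 4, 8, 9]]    # scoring with every head
sparse_head = [[0, 3, 7, 9], [0, 3, 7, 9], [0, 3, 7, 9]]  # scoring with a few heads

per_step = overlap_per_step(full_head, sparse_head)
mean_overlap = sum(per_step) / len(per_step)
print([round(x, 3) for x in per_step], round(mean_overlap, 3))
```

An overlap that stays high across steps would answer the concern; a decaying sequence like the made-up one here is the failure pattern the comment asks the authors to rule out.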

minor comments (1)

[§3 (Method)] The notation used to define the minimal head subset and the exact combination of draft lookahead and target historical attention scores should be formalized with equations for reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger evidence on the fidelity of the sparse head-restricted importance scorer. We will revise the manuscript accordingly.


Referee: [Abstract; §4 (Experiments) and §5 (Analysis)] The central claim of negligible accuracy degradation rests on the fidelity of the sparse head-restricted importance scorer across multi-step verification windows. No per-layer head-ablation results or multi-step token-overlap statistics are supplied to demonstrate that the minimal head subset continues to recover the same critical tokens that full-head scoring would select when saliency shifts occur.

Authors: We agree that explicit verification of the sparse scorer's fidelity is important for supporting the accuracy claim. While the end-to-end results on PG-19 and LongBench already show negligible degradation, we will add per-layer head-ablation results and multi-step token-overlap statistics to Section 5 (Analysis) in the revision. These will demonstrate that the minimal head subset recovers the same critical tokens as full-head scoring across verification windows despite saliency shifts. Revision: yes.

Circularity Check

0 steps flagged

No significant circularity; empirical claims only


The paper presents Dustin as an empirical engineering framework for sparse verification in speculative decoding. It reports measured speedups (27.85x self-attention, 9.17x end-to-end) on PG-19 and LongBench with Qwen2.5-72B at 32k length, framed as experimental outcomes rather than any derivation, fitted parameter, or self-referential prediction. No equations, ansatzes, uniqueness theorems, or self-citations that reduce claims to inputs by construction appear in the abstract or described content. The central assertions rest on benchmark measurements, which are externally falsifiable and independent of the method's internal definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; all technical details are absent.

pith-pipeline@v0.9.1-grok · 5733 in / 1123 out tokens · 18259 ms · 2026-06-26T00:10:18.233122+00:00 · methodology


Reference graph

Works this paper leans on

19 extracted references · 17 canonical work pages · 8 internal anchors

[1] Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. GPT-4 Technical Report. arXiv preprint arXiv:2303.08774.

[2] Cai, Z., Zhang, Y., Gao, B., Liu, Y., Li, Y., Liu, T., Lu, K., Xiong, W., Dong, Y., Hu, J., et al. PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling. arXiv preprint arXiv:2406.02069.

[3] Chen, C., Borgeaud, S., Irving, G., Lespiau, J.-B., Sifre, L., and Jumper, J. Accelerating Large Language Model Decoding with Speculative Sampling. arXiv preprint arXiv:2302.01318.

[4] Chen, Z., May, A., Svirschevski, R., Huang, Y., Ryabinin, M., Jia, Z., and Chen, B. Sequoia: Scalable, robust, and hardware-aware speculative decoding. arXiv preprint arXiv:2402.12374.

[5] Dao, T. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. arXiv preprint arXiv:2307.08691.

[6] Lin, X., Wang, J., Kondrateva, O., Shi, Y., Li, B., and Zhang, G. L. CompressKV: Semantic retrieval heads know what tokens are not important before generation. arXiv preprint arXiv:2508.02401.

[7] Oren, M., Hassid, M., Yarden, N., Adi, Y., and Schwartz, R. Transformers are multi-state RNNs. arXiv preprint arXiv:2401.06104.

[8] Rae, J. W., Potapenko, A., Jayakumar, S. M., and Lillicrap, T. P. Compressive Transformers for Long-Range Sequence Modelling. arXiv preprint arXiv:1911.05507.

[9] Sadhukhan, R., Chen, J., Chen, Z., Tiwari, V., Lai, R., Shi, J., Yen, I. E.-H., May, A., Chen, T., and Chen, B. MagicDec: Breaking the latency-throughput tradeoff for long context generation with speculative decoding. arXiv preprint arXiv:2408.11049.

[10] Shah, H. SpecAttn: Speculating sparse attention. arXiv preprint arXiv:2510.27641.

[11] Sun, H., Chen, Z., Yang, X., Tian, Y., and Chen, B. TriForce: Lossless acceleration of long sequence generation with hierarchical speculative decoding. arXiv preprint arXiv:2404.11912.

[12] Tang, J., Zhao, Y., Zhu, K., Xiao, G., Kasikci, B., and Han, S. Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference. arXiv preprint arXiv:2406.10774.

[13] Tiwari, R., Xi, H., Tomar, A., Hooper, C., Kim, S., Horton, M., Najibi, M., Mahoney, M. W., Keutzer, K., and Gholami, A. QuantSpec: Self-speculative decoding with hierarchical quantized KV cache. arXiv preprint arXiv:2502.10424.

[14] Xiao, G., Tian, Y., Chen, B., Han, S., and Lewis, M. Efficient Streaming Language Models with Attention Sinks. arXiv preprint arXiv:2309.17453.

[15] Yang, Q. A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Dong, G., Wei, H., Lin, H., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Lin, J., Dang, K., Lu, K., Bao, K., Yang, K., Yu, L., Li, M., Xue, M., Zhang, P., Zhu, Q., Men, R., Lin, R., Li, T., Xia, T., Ren, X., Ren, X., Fan, Y., Su, Y., Zhang, Y.-C., Wa...

[16] Yuan, Z., Shang, Y., Zhou, Y., Dong, Z., Zhou, Z., Xue, C., Wu, B., Li, Z., Gu, Q., Lee, Y. J., et al. LLM inference unveiled: Survey and roofline model insights. arXiv preprint arXiv:2402.16363.

[17] Zhao, Y., Peng, Y., Nguyen, C.-T., Li, Z., Wang, X., Zhao, H., and Fu, X. SmallKV: Small model assisted compensation of KV cache compression for efficient LLM inference. arXiv preprint arXiv:2508.02751.
[18]

speculating four tokens ($\Gamma = 4$). By downsizing the estimator to 1 target layer with 4 heads and 3 draft layers with 4 heads (denoted as $L^*$, $H^*$), the overhead ratio relative to the full hybrid calculation is derived as follows:

$$\text{Overhead Ratio} = \frac{(H_d^* \cdot L_d^* \cdot \Gamma) + (H_t^* \cdot L_t^* \cdot 1)}{(H_d \cdot L_d \cdot \Gamma) + (H_t \cdot L_t \cdot 1)} = \frac{(4 \cdot 3 \cdot 4) + (4 \cdot 1 \cdot 1)}{(14 \cdot 24 \cdot 4) + (64 \cdot 80 \cdot 1)} = 4...$$
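The excerpt above is cut off before the final value. As a plain arithmetic check using only the numbers quoted in it, the ratio can be evaluated as follows; the helper name `estimator_cost` and the variable names are illustrative, and the result is this sketch's computation rather than a figure quoted from the paper.

```python
def estimator_cost(heads: int, layers: int, tokens: int) -> int:
    """Relative scoring cost: one unit per (head, layer, query token) combination."""
    return heads * layers * tokens


GAMMA = 4  # speculated tokens per step, as stated in the excerpt

# Reduced estimator: 3 draft layers x 4 heads over GAMMA tokens, 1 target layer x 4 heads over 1 token.
reduced = estimator_cost(4, 3, GAMMA) + estimator_cost(4, 1, 1)   # 48 + 4 = 52

# Full hybrid calculation: draft with 14 heads x 24 layers, target with 64 heads x 80 layers.
full = estimator_cost(14, 24, GAMMA) + estimator_cost(64, 80, 1)  # 1344 + 5120 = 6464

overhead_ratio = reduced / full
print(f"{reduced} / {full} = {overhead_ratio:.4%}")
```

With the quoted head and layer counts, the reduced estimator costs under one percent of the full hybrid calculation in these relative units.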
[19]

Importantly, this procedure is performed once per model pair and does not require per-task or per-dataset recalibration. SRH identification cost. We follow the SRH identification procedure of CompressKV (Lin et al., 2025), where retrieval-oriented heads are identified once for a model and then reused across downstream tasks. To quantify the practical cost...
