STS: Efficient Sparse Attention with Speculative Token Sparsity

Ceyu Xu; Jiangnan Yu; Yongji Wu; Yuan Xie

arxiv: 2605.15508 · v2 · pith:X6HMPEMEnew · submitted 2026-05-15 · 💻 cs.LG · cs.CL

Pith reviewed 2026-05-20 21:02 UTC · model grok-4.3

classification 💻 cs.LG cs.CL

keywords sparse attention · speculative decoding · LLM inference · token sparsity · attention pruning · long context · efficiency · draft model

The pith

A smaller draft model can identify which tokens matter for a larger LLM's attention, enabling 90 percent sparsity and 2.67 times faster inference without retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes STS, a sparse attention method that works by integrating a small draft model from speculative decoding into the inference process of a larger target model. It establishes that attention scores from the draft model reliably indicate which tokens the target model should focus on, allowing construction of a dynamic sparsity mask that skips most attention computations. This approach requires no model retraining and directly addresses the quadratic cost of attention for long sequences. A reader would care because it targets emerging agentic applications that process multi-million token contexts while keeping accuracy close to that of dense attention. The result is a new point on the sparsity-accuracy trade-off that outperforms earlier sparse attention techniques.

Core claim

STS constructs a token-and-head-wise sparsity mask by repurposing the attention scores computed by a smaller draft model during speculative decoding; this mask prunes the quadratic attention computation inside the larger target LLM to roughly 10 percent of its original cost. On the NarrativeQA benchmark the method delivers a 2.67 times speedup at approximately 90 percent sparsity while incurring negligible accuracy degradation relative to full dense attention.

What carries the argument

The token-and-head-wise sparsity mask built from the draft model's attention scores, which selectively prunes attention operations in the target model.
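
As a rough illustration of the mask described above, the sketch below keeps only the highest-scoring context tokens for each head. It is not the paper's implementation: `build_sparsity_mask`, the `keep_ratio` parameter, the top-k selection rule, and the toy scores are all assumptions, and it presumes the draft scores have already been mapped to one row per target head, a step this page does not describe.

```python
from typing import List


def build_sparsity_mask(draft_scores: List[List[float]],
                        keep_ratio: float = 0.1) -> List[List[bool]]:
    """Return mask[h][t] == True where head h should attend to token t.

    draft_scores[h][t] is the attention weight that (hypothetical) head h
    assigns to context token t for the current query.
    """
    mask = []
    for head_scores in draft_scores:
        n_tokens = len(head_scores)
        n_keep = max(1, round(keep_ratio * n_tokens))
        # Indices of the n_keep highest-scoring tokens for this head.
        ranked = sorted(range(n_tokens), key=lambda t: head_scores[t], reverse=True)
        keep = set(ranked[:n_keep])
        mask.append([t in keep for t in range(n_tokens)])
    return mask


# Toy scores for two heads over ten context tokens (made-up values).
scores = [
    [0.50, 0.02, 0.03, 0.01, 0.30, 0.02, 0.02, 0.05, 0.03, 0.02],
    [0.01, 0.01, 0.60, 0.01, 0.01, 0.01, 0.01, 0.01, 0.32, 0.01],
]
mask = build_sparsity_mask(scores, keep_ratio=0.2)
kept = [[t for t, keep in enumerate(row) if keep] for row in mask]
print(kept)  # [[0, 4], [2, 8]]
```

Each head keeps a different pair of tokens here, which is the sense in which the mask is head-wise as well as token-wise. With `keep_ratio=0.1` the same function would correspond to the roughly 90 percent sparsity quoted above.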

If this is right

Higher sparsity levels become achievable for any given accuracy target compared with prior sparse-attention methods.
Multi-million-token sequences can be processed with substantially lower memory and compute during inference.
The technique slots directly into existing speculative-decoding pipelines with no extra training.
Attention cost in the target model drops to roughly a tenth of the dense cost, while accuracy on long-context tasks stays close to the dense baseline.
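
One way to relate a sparsity level to an end-to-end speedup is an Amdahl-style estimate. The sketch below is back-of-the-envelope arithmetic, not a model taken from the paper: `attn_fraction` (the share of dense runtime spent in attention) and `overhead_fraction` (extra cost of building the mask) are hypothetical parameters, and the 0.7 used in the example is an illustrative guess.

```python
def estimated_speedup(attn_fraction: float, sparsity: float,
                      overhead_fraction: float = 0.0) -> float:
    """Amdahl-style estimate of end-to-end speedup from sparse attention.

    Runtime is normalised so that the dense baseline costs 1.0.
    """
    non_attention = 1.0 - attn_fraction                  # unaffected by sparsity
    sparse_attention = attn_fraction * (1.0 - sparsity)  # attention work that remains
    return 1.0 / (non_attention + sparse_attention + overhead_fraction)


# Hypothetical workload: 70% of dense runtime is attention, 90% of it is pruned.
speedup = estimated_speedup(attn_fraction=0.7, sparsity=0.9)
print(f"{speedup:.2f}x")  # 2.70x
```

Under this simplified model the speedup reaches 10x at 90 percent sparsity only when attention is the entire runtime and the mask costs nothing, so a smaller end-to-end figure is expected whenever non-attention work or mask overhead is present.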

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same draft-to-target predictability might let other inference accelerators, such as KV-cache compression, be guided by the draft model.
Agentic systems that maintain very long interaction histories could gain real-time responsiveness without changing model weights.
Varying the size ratio between draft and target models would test how robust the important-token prediction remains.
Energy use per generated token could drop proportionally to the observed speedup on hardware with attention bottlenecks.

Load-bearing premise

The tokens the smaller draft model flags as important are the same ones the larger target model needs to attend to.
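
This premise can be probed directly by comparing the two models' attention scores for the same query. The sketch below shows two generic agreement measures, top-k set overlap (Jaccard) and Spearman rank correlation; the function names and the toy scores are assumptions for illustration, and nothing here reproduces a measurement from the paper.

```python
from typing import List, Set


def top_k(scores: List[float], k: int) -> Set[int]:
    """Indices of the k highest-scoring tokens."""
    ranked = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
    return set(ranked[:k])


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b)


def ranks(scores: List[float]) -> List[int]:
    """Rank of each token (0 = lowest score). Assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda t: scores[t])
    result = [0] * len(scores)
    for rank, t in enumerate(order):
        result[t] = rank
    return result


def spearman(x: List[float], y: List[float]) -> float:
    """Spearman rank correlation via the no-ties formula."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


# Made-up attention weights of one head over five context tokens.
draft_scores = [0.40, 0.05, 0.25, 0.10, 0.20]
target_scores = [0.35, 0.10, 0.20, 0.05, 0.30]

overlap = jaccard(top_k(draft_scores, 2), top_k(target_scores, 2))
correlation = spearman(draft_scores, target_scores)
print(overlap, correlation)
```

In this toy case the two models agree on the single most important token but disagree on the second, so the top-2 overlap is 1/3 even though the rank correlation is 0.8. The gap between the two numbers is why reporting both kinds of metric is informative.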

What would settle it

Apply the sparsity mask generated by the draft model to the target model on NarrativeQA and measure whether accuracy falls more than a few percent below the dense baseline.
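
A toy, single-head stand-in for that check is sketched below: it computes softmax attention for one query with and without a mask and reports how far the sparse output drifts from the dense one. The `attend` function, the vectors, and the mask are illustrative assumptions; a real test would measure task accuracy on NarrativeQA rather than output error on hand-picked numbers.

```python
import math
from typing import List, Optional


def attend(query: List[float], keys: List[List[float]],
           values: List[List[float]],
           mask: Optional[List[bool]] = None) -> List[float]:
    """Scaled dot-product attention for one query, skipping masked-out tokens.

    Assumes the mask keeps at least one token.
    """
    scale = 1.0 / math.sqrt(len(query))
    kept = [t for t in range(len(keys)) if mask is None or mask[t]]
    logits = [scale * sum(q * k for q, k in zip(query, keys[t])) for t in kept]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * values[t][j] for w, t in zip(weights, kept)) / total
            for j in range(dim)]


# Made-up inputs: tokens 0 and 3 align with the query, tokens 1 and 2 do not.
query = [4.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [1.0, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [1.0, 1.0]]
mask = [True, False, False, True]  # hypothetical draft-derived mask, 50% sparse

dense = attend(query, keys, values)
sparse = attend(query, keys, values, mask)
max_error = max(abs(d - s) for d, s in zip(dense, sparse))
print(dense, sparse, max_error)
```

The error stays small here only because the mask happens to keep the two tokens that carry almost all of the softmax weight; a mask that dropped token 0 or token 3 would move the output much further from the dense result.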

Figures

Figures reproduced from arXiv: 2605.15508 by Ceyu Xu, Jiangnan Yu, Yongji Wu, Yuan Xie.

**Figure 2.** STS vs. Other Token Sparsity Works. The key difference is what “proxy” is used for selecting the important tokens.

**Figure 3.** Overview of the STS Workflow.

… algorithm overcomes this by leveraging a key similarity of the Transformer architecture: the attention mechanism. While models vary in size, the attention weight matrix for any head, regardless of whether it comes from the small draft model or the large target model, is consistently shaped N × N, where N is the sequence length. A model’s size only determines the number of attent…

**Figure 4.** [System Performance]: End-to-end speedup analysis. (a) STS scales efficiently up to 100K context length. (b) STS maintains performance advantages on realistic benchmarks like LongBench and SwBench.

… as the sequence length increases. This widening gap stems from the overhead of "Just-in-Time" selection. Methods like Quest require loading coarse-grained Q and K metadata from HBM to compute importance scores…

**Figure 5.** [System Optimization]: Latency of KV-cache offloading strategies.

[Plot: Token/s (y-axis) against Concurrency (x-axis, 8 to 128); series: Llama 3.2 1B, Llama 3.1 8B, Llama 3.1 70B]

**Figure 6.** [Draft Overhead Analysis]: Throughput scalability under concurrency.

[Plot: Accuracy (%) on ARC Challenge, GSM-8K, Hellaswag, MMLU, TruthfulQA, and Winogrande; panel (a) 8B, panel (b) 70B; series: Dense, STS]

**Figure 7.** [General Benchmark Fidelity]: Zero-shot accuracy on standard reasoning benchmarks. STS incurs negligible accuracy loss across 8B and 70B models, demonstrating that sparsity patterns generalize effectively across model scales.

… of memory transfer latency, making CPU offloading a viable option for serving long-context workloads when GPU memory is limited. From a resource utilization perspective, this strat…

**Figure 8.** [LongBench Fidelity]: LongBench results on Llama-3.1-8B. STS (red) consistently outperforms sparse baselines (Quest, Tidal) and matches Dense performance across diverse tasks.

… GSM8K [7] (math), and ARC [6] (reasoning). STS tracks the Dense baseline within 0.5% variance across all tasks for both 8B and 70B models. Implication: This result is significant because it proves that the sparsity patterns learned …

**Figure 9.** [LongBench Fidelity]: LongBench results on Llama-3.1-70B. STS scales effectively to larger models, preserving long-range dependencies even in multi-document QA tasks.

[Plot: Perplexity (y-axis) against Input Length in tokens (x-axis); series: Full Attention, STS, Quest, TidalDecode]

**Figure 10.** [Fidelity]: Language modeling perplexity on Wiki-103. STS maintains perplexity within 0.01 of the dense baseline, whereas heuristic methods degrade at long contexts.

… layer, it represents the maximum fraction of attention entries that can be discarded without exceeding a strict perplexity degradation threshold (∆PPL < 0.01). This sensitivity analysis reveals the inherent heterogeneity of the model: some l…

**Figure 12.** [Micro-analysis]: Layer-wise sparsity patterns. The draft-guided mask (STS) closely aligns with Oracle’s distribution.

…tal attention topology as large models, making them excellent predictors for sparse attention without the need for expensive gradients or full-model computations.

7 Related Work. We classify efficient long-context inference into three paradigms: state eviction (KV compression), heuristic a…

Original abstract

The quadratic complexity of attention imposes severe memory and computational bottlenecks on Large Language Model (LLM) inference. This challenge is particularly acute for emerging agentic applications that require processing multi-million token sequences. We propose STS, a sparse attention mechanism that requires no model retraining. STS leverages the key insight that tokens identified as important by a smaller draft model are highly predictive of important tokens for a larger target model. By integrating into speculative decoding frameworks, STS repurposes the draft model's attention scores to dynamically construct a token-and-head-wise sparsity mask. This mask effectively prunes the expensive attention computation in the target LLM. Our evaluation shows that STS achieves a 2.67x speedup operating at approximately 90% sparsity on representative benchmark NarrativeQA, maintaining negligible accuracy degradation compared to dense attention. STS establishes a new state-of-the-art on the sparsity-accuracy trade-off, outperforming prior techniques by enabling higher sparsity levels for a given accuracy budget.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

STS uses draft-model attention scores to build a dynamic token-and-head sparsity mask for the target LLM inside a speculative decoding setup, claiming 2.67x speedup at 90% sparsity on NarrativeQA with little accuracy drop, but the draft-to-target transfer is only supported by end-to-end numbers.

The punchline is that STS repurposes attention scores from a draft model in speculative decoding to create a token-and-head sparsity mask for the target LLM, reporting a 2.67x speedup at around 90% sparsity on NarrativeQA with negligible accuracy loss. This approach is new in how it directly uses the draft model's scores to prune the target's attention computation without retraining. It combines two lines of work that have mostly stayed separate until now.

The paper does well at explaining the memory and compute issues with long sequences and at showing a concrete performance gain on a representative benchmark. The no-retraining aspect makes it immediately applicable to existing models.

The soft spots are around the evidence for the key assumption. The abstract says tokens important to the draft are predictive for the target, but there's no reported overlap, correlation, or agreement metric between the draft-derived mask and what the target model would have selected on its own. We only get the end-to-end accuracy number. Without that direct check, it's possible the accuracy holds for reasons other than accurate importance transfer, such as task redundancy. The evaluation also appears limited to one dataset with no error bars or ablations mentioned, which leaves the robustness unclear.

This work is aimed at researchers focused on efficient inference for large-context models, especially in agent applications. Someone already working on speculative decoding or sparse attention techniques would find the combination useful to explore. It deserves a serious referee. The idea is worth checking in detail, and review would likely strengthen the claims by requiring those missing validation steps.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes STS, a training-free sparse attention mechanism for large language models that integrates with speculative decoding frameworks. It repurposes attention scores from a smaller draft model to dynamically construct a token-and-head-wise sparsity mask for the target model, pruning quadratic attention computations. On the NarrativeQA benchmark, STS reports a 2.67x speedup at approximately 90% sparsity while maintaining negligible accuracy degradation relative to dense attention, and claims a new state-of-the-art on the sparsity-accuracy trade-off.

Significance. If the draft-to-target importance transfer assumption holds, STS could meaningfully advance efficient long-context inference for LLMs in agentic settings by reducing memory and compute bottlenecks without retraining. The reuse of draft-model computations already performed in speculative decoding is a practical strength that could facilitate adoption. The reported numbers on NarrativeQA suggest a favorable operating point, but broader significance hinges on demonstrating that the sparsity mask reliably identifies target-critical tokens rather than relying solely on end-to-end accuracy preservation.

major comments (2)

[Abstract and key insight paragraph] The central claim that 'tokens identified as important by a smaller draft model are highly predictive of important tokens for a larger target model' is load-bearing for the sparsity mask construction, yet the manuscript provides only end-to-end accuracy on NarrativeQA. No direct metrics (token overlap, rank correlation, or per-head mask agreement) between draft-derived and target-optimal masks are reported; without these, it remains possible that accuracy is preserved by dataset redundancy or conservative masking rather than the claimed transferability.
[Evaluation section] The 2.67x speedup at ~90% sparsity is presented without error bars, detailed dataset statistics, or ablations testing the draft-target correlation assumption across model pairs or tasks. This weakens confidence that the negligible accuracy degradation generalizes, as the central claim depends on the mask correctly identifying target-critical tokens.

minor comments (1)

[Method] The manuscript would benefit from a clearer description of how the sparsity threshold is chosen and whether it is fixed or adaptive across layers or heads.
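
To make the fixed-versus-adaptive distinction concrete, the sketch below contrasts two generic selection rules for a single head: a fixed token budget and a cumulative-attention-mass threshold. Both rules, the function names `keep_fixed` and `keep_adaptive`, and the toy scores are assumptions for illustration; the page does not say which rule, if either, STS uses.

```python
from typing import List


def keep_fixed(scores: List[float], budget: int) -> List[int]:
    """Keep the `budget` highest-scoring tokens, regardless of score shape."""
    ranked = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
    return sorted(ranked[:budget])


def keep_adaptive(scores: List[float], mass: float) -> List[int]:
    """Keep the fewest top tokens whose scores cover `mass` of the total."""
    ranked = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
    target = mass * sum(scores)
    kept, covered = [], 0.0
    for t in ranked:
        kept.append(t)
        covered += scores[t]
        if covered >= target:
            break
    return sorted(kept)


# Made-up attention weights: one sharply peaked head, one nearly flat head.
peaked = [0.90, 0.02, 0.02, 0.02, 0.02, 0.02]
flat = [0.20, 0.18, 0.17, 0.16, 0.15, 0.14]

print(keep_fixed(peaked, 2), keep_fixed(flat, 2))            # [0, 1] [0, 1]
print(keep_adaptive(peaked, 0.5), keep_adaptive(flat, 0.5))  # [0] [0, 1, 2]
```

The fixed budget spends the same two tokens on both heads, while the mass threshold keeps one token for the peaked head and three for the flat one. That difference is what a per-layer or per-head adaptive threshold would buy, at the cost of a less predictable compute budget.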

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, indicating where revisions will be made to strengthen the presentation of our results.

Referee: [Abstract and key insight paragraph] The central claim that 'tokens identified as important by a smaller draft model are highly predictive of important tokens for a larger target model' is load-bearing for the sparsity mask construction, yet the manuscript provides only end-to-end accuracy on NarrativeQA. No direct metrics (token overlap, rank correlation, or per-head mask agreement) between draft-derived and target-optimal masks are reported; without these, it remains possible that accuracy is preserved by dataset redundancy or conservative masking rather than the claimed transferability.

Authors: We agree that direct metrics would provide stronger support for the draft-to-target transfer assumption underlying the sparsity mask. End-to-end accuracy and speedup are the primary practical metrics, but we recognize that intermediate validation would address potential alternative explanations such as dataset redundancy. In the revised manuscript we will add a new subsection reporting token overlap, rank correlation, and per-head mask agreement between draft-derived masks and target-model attention scores on a representative sample of NarrativeQA examples. This analysis will be included to directly substantiate the key insight. Revision: yes.

Referee: [Evaluation section] The 2.67x speedup at ~90% sparsity is presented without error bars, detailed dataset statistics, or ablations testing the draft-target correlation assumption across model pairs or tasks. This weakens confidence that the negligible accuracy degradation generalizes, as the central claim depends on the mask correctly identifying target-critical tokens.

Authors: We acknowledge that the current evaluation lacks error bars and broader ablations, which limits claims about generalization. We will revise the evaluation section to include error bars computed over multiple random seeds for the reported speedup and accuracy figures, along with additional dataset statistics for NarrativeQA. We will also add a brief discussion of preliminary results on one additional task and model pair to illustrate the correlation assumption. Comprehensive ablations across all possible model pairs and tasks are beyond the scope of the current work but will be noted as future directions. Revision: partial.
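
Error bars over seeds, as promised in the second response, amount to a mean plus a spread statistic per reported number. The sketch below shows one conventional way to compute them with the Python standard library; `mean_with_error` is a hypothetical helper and the three speedup values are placeholders, not measurements from the paper.

```python
import math
import statistics
from typing import List, Tuple


def mean_with_error(runs: List[float]) -> Tuple[float, float, float]:
    """Return (mean, sample standard deviation, standard error of the mean)."""
    mean = statistics.mean(runs)
    std = statistics.stdev(runs)  # sample std, divides by n - 1
    sem = std / math.sqrt(len(runs))
    return mean, std, sem


# Placeholder per-seed speedups, chosen only to exercise the function.
placeholder_speedups = [2.6, 2.7, 2.8]
mean, std, sem = mean_with_error(placeholder_speedups)
print(f"speedup = {mean:.2f} ± {std:.2f} (std), SEM {sem:.3f}, n = {len(placeholder_speedups)}")
```

Reporting the number of seeds alongside the spread matters: with only a handful of runs the standard error is itself a noisy estimate.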

Circularity Check

0 steps flagged

No circularity: empirical assumption tested via end-to-end results

The paper asserts that draft-model attention scores are predictive of target-model token importance as a key insight, then constructs a sparsity mask from the draft run and measures end-to-end speedup and accuracy on NarrativeQA. No equations, fitted parameters, or self-citations are shown that would make the reported 2.67x speedup or 90% sparsity level equivalent to the input mask by construction. The predictive relationship is presented as an empirical premise rather than a derived result that loops back to itself, and the evaluation remains independent of any internal fitting to the target model's own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The method rests on one central domain assumption and introduces no new free parameters or invented entities beyond standard LLM components.

axioms (1)

domain assumption: Tokens identified as important by a smaller draft model are highly predictive of important tokens for a larger target model.
Stated as the key insight enabling the sparsity mask construction.

pith-pipeline@v0.9.0 · 5693 in / 1093 out tokens · 27422 ms · 2026-05-20T21:02:29.897105+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

Relation between the paper passage and the cited Recognition theorem.

tokens identified as important by a smaller draft model are highly predictive of important tokens for a larger target model... repurposes the draft model’s attention scores to dynamically construct a token-and-head-wise sparsity mask

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 2 internal anchors

[1] Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. Longbench: A bilingual, multitask benchmark for long context understanding, 2024
[2] Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer, 2020
[3] Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318, 2023
[4] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers, 2019
[5] Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, and Adrian Weller. Rethinking attention with performers, 2022
[6] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018
[7] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021
[8] Qichen Fu, Minsik Cho, Thomas Merth, Sachin Mehta, Mohammad Rastegari, and Mahyar Najibi. Lazyllm: Dynamic token pruning for efficient long context llm inference, 2024
[9] Yizhao Gao, Zhichen Zeng, Dayou Du, Shijie Cao, Peiyuan Zhou, Jiaxing Qi, Junjie Lai, Hayden Kwok-Hay So, Ting Cao, Fan Yang, and Mao Yang. Seerattention: Learning intrinsic sparse attention in your llms, 2025
[10] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces, 2024
[11] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding, 2021
[12] Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. Minference 1.0: Accelerating pre-filling for long-context llms via dynamic sparse attention, 2024
[13] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?, 2024
[14] Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer, 2020
[15] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, SOSP ’23, page 611–626, New York, NY, USA, 2023. Association for Computing Machinery
[16] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention, 2023
[17] Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding, 2023
[18] Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. Snapkv: Llm knows what you are looking for before generation, 2024
[19] Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts, 2023
[20] Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. Agentbench: Evaluating llms as agents, 2025
[21] Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava. Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time, 2023
[22] Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re, and Beidi Chen. Deja vu: Contextual sparsity for efficient llms at inference time, 2023
[23] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models, 2016
[24] Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish,... In-context learning and induction heads, 2022
[25] Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. Flexgen: High-throughput generative inference of large language models with a single gpu, 2023
[26] Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, and Song Han. Quest: Query-aware sparsity for efficient long-context llm inference, 2024
[27] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017
[28] Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity, 2020
[29] Xingyao Wang, Zihan Wang, Jiateng Liu, Yangyi Chen, Lifan Yuan, Hao Peng, and Heng Ji. Mint: Evaluating llms in multi-turn interaction with tools and language feedback, 2024
[30] Chaojun Xiao, Pengle Zhang, Xu Han, Guangxuan Xiao, Yankai Lin, Zhengyan Zhang, Zhiyuan Liu, and Maosong Sun. Infllm: Training-free long-context extrapolation for llms with an efficient context memory, 2024
[31] Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, and Song Han. Duoattention: Efficient long-context llm inference with retrieval and streaming heads, 2024
[32] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks, 2024
[33] Dongjie Yang, XiaoDong Han, Yan Gao, Yao Hu, Shilin Zhang, and Hai Zhao. Pyramidinfer: Pyramid kv cache compression for high-throughput llm inference, 2024
[34] Lijie Yang, Zhihao Zhang, Zhuofu Chen, Zikun Li, and Zhihao Jia. Tidaldecode: Fast and accurate llm decoding with position persistent sparse attention, 2024
[35] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models, 2023
[36] Zihao Ye et al. Flashinfer: Kernel library for llm serving. https://github.com/flashinfer-ai/flashinfer, 2024
[37] Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Y. X. Wei, Lean Wang, Zhiping Xiao, Yuqing Wang, Chong Ruan, Ming Zhang, Wenfeng Liang, and Wangding Zeng. Native sparse attention: Hardware-aligned and natively trainable sparse attention, 2025
[38] Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. Big bird: Transformers for longer sequences, 2021
[39] Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang Wang, and Beidi Chen. H2O: Heavy-hitter oracle for efficient generative inference of large language models, 2023

[1] [1]

Longbench: A bilingual, multitask benchmark for long context understanding, 2024

Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. Longbench: A bilingual, multitask benchmark for long context understanding, 2024

work page 2024

[2] [2]

Peters, and Arman Cohan

Iz Beltagy, Matthew E. Peters, and Arman Cohan. Long- former: The long-document transformer, 2020

work page 2020

[3] [3]

Accelerating Large Language Model Decoding with Speculative Sampling

Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model decoding with specu- lative sampling.arXiv preprint arXiv:2302.01318, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[4] [4]

Generating long sequences with sparse trans- formers, 2019

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse trans- formers, 2019

work page 2019

[5] [5]

Rethinking attention with performers, 2022

Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, and Adrian Weller. Rethinking attention with performers, 2022

work page 2022

[6] [6]

Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018

work page 2018

[7] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021

[8] Qichen Fu, Minsik Cho, Thomas Merth, Sachin Mehta, Mohammad Rastegari, and Mahyar Najibi. Lazyllm: Dynamic token pruning for efficient long context llm inference, 2024

[9] Yizhao Gao, Zhichen Zeng, Dayou Du, Shijie Cao, Peiyuan Zhou, Jiaxing Qi, Junjie Lai, Hayden Kwok-Hay So, Ting Cao, Fan Yang, and Mao Yang. Seerattention: Learning intrinsic sparse attention in your llms, 2025

[10] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces, 2024

[11] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding, 2021

[12] Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. Minference 1.0: Accelerating pre-filling for long-context llms via dynamic sparse attention, 2024

[13] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?, 2024

[14] Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer, 2020

[15] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, SOSP '23, pages 611–626, New York, NY, USA, 2023. Association for Computing Machinery

[16] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention, 2023

[17] Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding, 2023

[18] Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. Snapkv: Llm knows what you are looking for before generation, 2024

[19] Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts, 2023

[20] Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. Agentbench: Evaluating llms as agents, 2025

[21] Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava. Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time, 2023

[22] Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re, and Beidi Chen. Deja vu: Contextual sparsity for efficient llms at inference time, 2023

[23] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models, 2016

[24] Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish,... In-context learning and induction heads, 2022

[25] Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. Flexgen: High-throughput generative inference of large language models with a single gpu, 2023

[26] Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, and Song Han. Quest: Query-aware sparsity for efficient long-context llm inference, 2024

[27] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017

[28] Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity, 2020

[29] Xingyao Wang, Zihan Wang, Jiateng Liu, Yangyi Chen, Lifan Yuan, Hao Peng, and Heng Ji. Mint: Evaluating llms in multi-turn interaction with tools and language feedback, 2024

[30] Chaojun Xiao, Pengle Zhang, Xu Han, Guangxuan Xiao, Yankai Lin, Zhengyan Zhang, Zhiyuan Liu, and Maosong Sun. Infllm: Training-free long-context extrapolation for llms with an efficient context memory, 2024

[31] Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, and Song Han. Duoattention: Efficient long-context llm inference with retrieval and streaming heads, 2024

[32] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks, 2024

[33] Dongjie Yang, XiaoDong Han, Yan Gao, Yao Hu, Shilin Zhang, and Hai Zhao. Pyramidinfer: Pyramid kv cache compression for high-throughput llm inference, 2024

[34] Lijie Yang, Zhihao Zhang, Zhuofu Chen, Zikun Li, and Zhihao Jia. Tidaldecode: Fast and accurate llm decoding with position persistent sparse attention, 2024

[35] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models, 2023

[36] Zihao Ye et al. Flashinfer: Kernel library for llm serving. https://github.com/flashinfer-ai/flashinfer, 2024

[37] Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Y. X. Wei, Lean Wang, Zhiping Xiao, Yuqing Wang, Chong Ruan, Ming Zhang, Wenfeng Liang, and Wangding Zeng. Native sparse attention: Hardware-aligned and natively trainable sparse attention, 2025

[38] Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. Big bird: Transformers for longer sequences, 2021

[39] Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang Wang, and Beidi Chen. H2O: Heavy-hitter oracle for efficient generative inference of large language models, 2023