STS: Efficient Sparse Attention with Speculative Token Sparsity
Pith reviewed 2026-05-20 21:02 UTC · model grok-4.3
The pith
A smaller draft model can identify which tokens matter for a larger LLM's attention, enabling 90 percent sparsity and 2.67 times faster inference without retraining.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
STS constructs a token-and-head-wise sparsity mask by repurposing the attention scores computed by a smaller draft model during speculative decoding; this mask prunes the quadratic attention computation inside the larger target LLM to roughly 10 percent of its original cost. On the NarrativeQA benchmark the method delivers a 2.67 times speedup at approximately 90 percent sparsity while incurring negligible accuracy degradation relative to full dense attention.
What carries the argument
The token-and-head-wise sparsity mask built from the draft model's attention scores, which selectively prunes attention operations in the target model.
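The mask construction described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the top-k selection rule, the function names `build_mask` and `sparse_attention`, the `keep_ratio` parameter, and the one-to-one pairing of draft heads with target heads are all assumptions made for the example.

```python
import math

def build_mask(draft_scores, keep_ratio):
    """Build a token-and-head-wise mask from draft attention scores.

    draft_scores[h][t] is the attention weight draft head h assigns to
    context token t. Returns, per head, the sorted token indices to keep.
    Assumption: a simple top-k rule; the paper's rule may differ.
    """
    mask = []
    for head_scores in draft_scores:
        k = max(1, math.ceil(keep_ratio * len(head_scores)))
        ranked = sorted(range(len(head_scores)),
                        key=lambda t: head_scores[t], reverse=True)
        mask.append(sorted(ranked[:k]))
    return mask

def sparse_attention(query, keys, values, kept):
    """Single-head scaled dot-product attention over the kept tokens only."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, keys[t])) for t in kept]
    peak = max(logits)  # subtract the max for a numerically stable softmax
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * values[t][d] for w, t in zip(weights, kept)) / total
            for d in range(dim)]
```

With `keep_ratio=0.1` each head scores only a tenth of the context, which corresponds to the roughly 90 percent sparsity the paper reports. How draft heads map onto target heads when the two models have different head counts is left open here.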
If this is right
- Higher sparsity levels become achievable for any given accuracy target compared with prior sparse-attention methods.
- Multi-million-token sequences can be processed with substantially lower memory and compute during inference.
- The technique slots directly into existing speculative-decoding pipelines with no extra training.
- Attention cost drops roughly tenfold at 90 percent sparsity while long-context accuracy stays close to the dense baseline.
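One way to relate the reported 2.67x end-to-end speedup to 90 percent sparsity is Amdahl's law. The sketch below is a back-of-envelope reading, not a figure from the paper: it assumes the attention kernel speeds up by exactly 1 / (1 - sparsity), that nothing else in the pipeline changes, and that building the mask costs nothing. The function names are hypothetical.

```python
def end_to_end_speedup(attention_fraction, attention_speedup):
    """Amdahl's law: only the attention share of runtime gets faster."""
    return 1.0 / ((1.0 - attention_fraction)
                  + attention_fraction / attention_speedup)

def implied_attention_fraction(observed_speedup, attention_speedup):
    """Invert Amdahl's law to get the runtime share attention must have had."""
    return (1.0 - 1.0 / observed_speedup) / (1.0 - 1.0 / attention_speedup)

# Assumption: 90% sparsity translates into an ideal ~10x attention kernel speedup.
kernel_speedup = 1.0 / (1.0 - 0.90)
fraction = implied_attention_fraction(2.67, kernel_speedup)
print(f"implied attention share of dense runtime: {fraction:.3f}")
```

Under these idealized assumptions, attention would need to account for a bit under 70 percent of dense runtime to yield 2.67x. Real kernels rarely reach the ideal sparse speedup, so the true share is likely higher.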
Where Pith is reading between the lines
- The same draft-to-target predictability might let other inference accelerators, such as KV-cache compression, be guided by the draft model.
- Agentic systems that maintain very long interaction histories could gain real-time responsiveness without changing model weights.
- Varying the size ratio between draft and target models would test how robust the important-token prediction remains.
- Energy use per generated token could drop proportionally to the observed speedup on hardware with attention bottlenecks.
Load-bearing premise
The tokens the smaller draft model flags as important are the same ones the larger target model needs to attend to.
What would settle it
Apply the sparsity mask generated by the draft model to the target model on NarrativeQA and measure whether accuracy falls more than a few percent below the dense baseline.
Original abstract
The quadratic complexity of attention imposes severe memory and computational bottlenecks on Large Language Model (LLM) inference. This challenge is particularly acute for emerging agentic applications that require processing multi-million token sequences. We propose STS, a sparse attention mechanism that requires no model retraining. STS leverages the key insight that tokens identified as important by a smaller draft model are highly predictive of important tokens for a larger target model. By integrating into speculative decoding frameworks, STS repurposes the draft model's attention scores to dynamically construct a token-and-head-wise sparsity mask. This mask effectively prunes the expensive attention computation in the target LLM. Our evaluation shows that STS achieves a 2.67x speedup operating at approximately 90% sparsity on representative benchmark NarrativeQA, maintaining negligible accuracy degradation compared to dense attention. STS establishes a new state-of-the-art on the sparsity-accuracy trade-off, outperforming prior techniques by enabling higher sparsity levels for a given accuracy budget.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes STS, a training-free sparse attention mechanism for large language models that integrates with speculative decoding frameworks. It repurposes attention scores from a smaller draft model to dynamically construct a token-and-head-wise sparsity mask for the target model, pruning quadratic attention computations. On the NarrativeQA benchmark, STS reports a 2.67x speedup at approximately 90% sparsity while maintaining negligible accuracy degradation relative to dense attention, and claims a new state-of-the-art on the sparsity-accuracy trade-off.
Significance. If the draft-to-target importance transfer assumption holds, STS could meaningfully advance efficient long-context inference for LLMs in agentic settings by reducing memory and compute bottlenecks without retraining. The reuse of draft-model computations already performed in speculative decoding is a practical strength that could facilitate adoption. The reported numbers on NarrativeQA suggest a favorable operating point, but broader significance hinges on demonstrating that the sparsity mask reliably identifies target-critical tokens rather than relying solely on end-to-end accuracy preservation.
major comments (2)
- [Abstract and key insight paragraph] The central claim that 'tokens identified as important by a smaller draft model are highly predictive of important tokens for a larger target model' is load-bearing for the sparsity mask construction, yet the manuscript provides only end-to-end accuracy on NarrativeQA. No direct metrics (token overlap, rank correlation, or per-head mask agreement) between draft-derived and target-optimal masks are reported; without these, it remains possible that accuracy is preserved by dataset redundancy or conservative masking rather than the claimed transferability.
- [Evaluation section] The 2.67x speedup at ~90% sparsity is presented without error bars, detailed dataset statistics, or ablations testing the draft-target correlation assumption across model pairs or tasks. This weakens confidence that the negligible accuracy degradation generalizes, as the central claim depends on the mask correctly identifying target-critical tokens.
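The direct metrics named in the first comment (token overlap, rank correlation, per-head mask agreement) could be computed along these lines. This is a sketch under stated assumptions: the function names are hypothetical, scores within a head are assumed to have no ties, and draft head h is assumed to pair with target head h, which the paper does not specify.

```python
def top_k(scores, k):
    """Indices of the k highest-scoring tokens."""
    ranked = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
    return set(ranked[:k])

def overlap(draft, target, k):
    """Fraction of the target's top-k tokens also in the draft's top-k."""
    return len(top_k(draft, k) & top_k(target, k)) / k

def jaccard(draft, target, k):
    """Mask agreement: intersection over union of the two top-k sets."""
    a, b = top_k(draft, k), top_k(target, k)
    return len(a & b) / len(a | b)

def spearman(draft, target):
    """Spearman rank correlation; assumes no tied scores in either list."""
    n = len(draft)

    def ranks(scores):
        order = sorted(range(n), key=lambda t: scores[t])
        result = [0] * n
        for rank, t in enumerate(order):
            result[t] = rank
        return result

    squared = sum((a - b) ** 2 for a, b in zip(ranks(draft), ranks(target)))
    return 1.0 - 6.0 * squared / (n * (n * n - 1))

def per_head_agreement(draft_heads, target_heads, k):
    """Jaccard agreement for each (draft head, target head) pair."""
    return [jaccard(d, t, k) for d, t in zip(draft_heads, target_heads)]
```

High end-to-end accuracy with low overlap would point toward the alternative explanations the referee raises, such as dataset redundancy, rather than genuine draft-to-target transfer.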
minor comments (1)
- [Method] The manuscript would benefit from a clearer description of how the sparsity threshold is chosen and whether it is fixed or adaptive across layers or heads.
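To make the fixed-versus-adaptive distinction in this comment concrete, the sketch below contrasts two plausible rules. Neither is confirmed as the paper's method: `fixed_budget` keeps the same share of tokens for every head, while `adaptive_budget` keeps the smallest set of tokens covering a target share of attention mass, so peaked heads keep fewer tokens than diffuse ones. Both names and both parameters are assumptions.

```python
import math

def fixed_budget(scores, keep_ratio):
    """Keep the same fraction of tokens regardless of the score distribution."""
    k = max(1, math.ceil(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
    return sorted(ranked[:k])

def adaptive_budget(scores, mass):
    """Keep the fewest top tokens whose scores sum to `mass` of the total."""
    threshold = mass * sum(scores)
    ranked = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
    kept, running = [], 0.0
    for t in ranked:
        kept.append(t)
        running += scores[t]
        if running >= threshold:
            break
    return sorted(kept)

peaked = [0.9, 0.05, 0.03, 0.02]   # one dominant token
flat = [0.25, 0.25, 0.25, 0.25]    # attention spread evenly
print(fixed_budget(peaked, 0.5), fixed_budget(flat, 0.5))
print(adaptive_budget(peaked, 0.8), adaptive_budget(flat, 0.8))
```

An adaptive rule makes the realized sparsity vary per head and per layer, which is why the manuscript would need to state which rule produces the reported ~90% figure.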
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, indicating where revisions will be made to strengthen the presentation of our results.
- Referee: [Abstract and key insight paragraph] The central claim that 'tokens identified as important by a smaller draft model are highly predictive of important tokens for a larger target model' is load-bearing for the sparsity mask construction, yet the manuscript provides only end-to-end accuracy on NarrativeQA. No direct metrics (token overlap, rank correlation, or per-head mask agreement) between draft-derived and target-optimal masks are reported; without these, it remains possible that accuracy is preserved by dataset redundancy or conservative masking rather than the claimed transferability.
  Authors: We agree that direct metrics would provide stronger support for the draft-to-target transfer assumption underlying the sparsity mask. End-to-end accuracy and speedup are the primary practical metrics, but we recognize that intermediate validation would address potential alternative explanations such as dataset redundancy. In the revised manuscript we will add a new subsection reporting token overlap, rank correlation, and per-head mask agreement between draft-derived masks and target-model attention scores on a representative sample of NarrativeQA examples. This analysis will be included to directly substantiate the key insight.
  Revision: yes
- Referee: [Evaluation section] The 2.67x speedup at ~90% sparsity is presented without error bars, detailed dataset statistics, or ablations testing the draft-target correlation assumption across model pairs or tasks. This weakens confidence that the negligible accuracy degradation generalizes, as the central claim depends on the mask correctly identifying target-critical tokens.
  Authors: We acknowledge that the current evaluation lacks error bars and broader ablations, which limits claims about generalization. We will revise the evaluation section to include error bars computed over multiple random seeds for the reported speedup and accuracy figures, along with additional dataset statistics for NarrativeQA. We will also add a brief discussion of preliminary results on one additional task and model pair to illustrate the correlation assumption. Comprehensive ablations across all possible model pairs and tasks are beyond the scope of the current work but will be noted as future directions.
  Revision: partial
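Error bars over multiple seeds, as promised in this response, could be reported with a helper like the one below. It is a minimal sketch: the function name is hypothetical, and the interval uses a normal approximation (z = 1.96), which understates the uncertainty when only a handful of seeds are run; a Student-t quantile would be the more careful choice.

```python
import math
import statistics

def mean_with_error_bar(samples, z=1.96):
    """Return (mean, sample std, half-width of an approximate 95% interval)."""
    mean = statistics.mean(samples)
    std = statistics.stdev(samples)  # sample standard deviation (n - 1)
    half_width = z * std / math.sqrt(len(samples))
    return mean, std, half_width

# Placeholder values for illustration only; these are NOT measurements
# from the paper.
placeholder_runs = [2.0, 3.0, 4.0]
mean, std, half_width = mean_with_error_bar(placeholder_runs)
print(f"{mean:.2f} +/- {half_width:.2f} (std {std:.2f}, n={len(placeholder_runs)})")
```

The same helper applies to both the speedup and the accuracy figures, one list of per-seed values for each.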
Circularity Check
No circularity: empirical assumption tested via end-to-end results
The paper asserts that draft-model attention scores are predictive of target-model token importance as a key insight, then constructs a sparsity mask from the draft run and measures end-to-end speedup and accuracy on NarrativeQA. No equations, fitted parameters, or self-citations are shown that would make the reported 2.67x speedup or 90% sparsity level equivalent to the input mask by construction. The predictive relationship is presented as an empirical premise rather than a derived result that loops back to itself, and the evaluation remains independent of any internal fitting to the target model's own outputs.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: tokens identified as important by a smaller draft model are highly predictive of important tokens for a larger target model.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "tokens identified as important by a smaller draft model are highly predictive of important tokens for a larger target model... repurposes the draft model’s attention scores to dynamically construct a token-and-head-wise sparsity mask"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.