
arxiv: 2605.15508 · v2 · pith:X6HMPEMEnew · submitted 2026-05-15 · 💻 cs.LG · cs.CL

STS: Efficient Sparse Attention with Speculative Token Sparsity

Pith reviewed 2026-05-20 21:02 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords sparse attention · speculative decoding · LLM inference · token sparsity · attention pruning · long context · efficiency · draft model

The pith

A smaller draft model can identify which tokens matter for a larger LLM's attention, enabling 90 percent sparsity and 2.67 times faster inference without retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes STS, a sparse attention method that works by integrating a small draft model from speculative decoding into the inference process of a larger target model. It establishes that attention scores from the draft model reliably indicate which tokens the target model should focus on, allowing construction of a dynamic sparsity mask that skips most attention computations. This approach requires no model retraining and directly addresses the quadratic cost of attention for long sequences. A reader would care because it targets emerging agentic applications that process multi-million token contexts while keeping accuracy close to that of dense attention. The result is a new point on the sparsity-accuracy trade-off that outperforms earlier sparse attention techniques.

Core claim

STS constructs a token-and-head-wise sparsity mask by repurposing the attention scores computed by a smaller draft model during speculative decoding; this mask prunes the quadratic attention computation inside the larger target LLM to roughly 10 percent of its original cost. On the NarrativeQA benchmark the method delivers a 2.67 times speedup at approximately 90 percent sparsity while incurring negligible accuracy degradation relative to full dense attention.
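
A minimal sketch of how such a mask could be built, assuming the draft model exposes per-head attention weights for the current query and that each head keeps a fixed top-k share of tokens. The function name `topk_mask`, the fixed `sparsity` parameter, and the toy scores are illustrative assumptions; the paper's actual selection rule and its mapping from draft heads to target heads are not described on this page.

```python
from typing import List


def topk_mask(draft_scores: List[List[float]], sparsity: float) -> List[List[bool]]:
    """Build a per-head keep-mask from draft attention scores.

    draft_scores[h][t] is the attention weight that (hypothetical) draft head h
    assigns to context token t for the current query. The result has
    mask[h][t] == True for every token that head h keeps.
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    mask = []
    for head_scores in draft_scores:
        n = len(head_scores)
        keep = max(1, round(n * (1.0 - sparsity)))  # always keep at least one token
        # Indices of the `keep` highest-scoring tokens for this head.
        top = sorted(range(n), key=lambda t: head_scores[t], reverse=True)[:keep]
        kept = set(top)
        mask.append([t in kept for t in range(n)])
    return mask


# Toy scores for two heads over ten context tokens (made-up values).
scores = [
    [0.50, 0.02, 0.03, 0.01, 0.30, 0.02, 0.02, 0.04, 0.03, 0.03],  # head 0
    [0.01, 0.01, 0.60, 0.02, 0.02, 0.02, 0.02, 0.25, 0.03, 0.02],  # head 1
]
mask = topk_mask(scores, sparsity=0.8)
kept_indices = [[t for t, k in enumerate(row) if k] for row in mask]
print(kept_indices)  # [[0, 4], [2, 7]]
```

Each head keeps its own token set, which is what "token-and-head-wise" suggests; a threshold on cumulative attention mass would be an equally plausible selection rule.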

What carries the argument

The token-and-head-wise sparsity mask built from the draft model's attention scores, which selectively prunes attention operations in the target model.

If this is right

  • Higher sparsity levels become achievable for any given accuracy target compared with prior sparse-attention methods.
  • Multi-million-token sequences can be processed with substantially lower memory and compute during inference.
  • The technique slots directly into existing speculative-decoding pipelines with no extra training.
  • Attention cost drops to a small fraction of the dense cost while accuracy on long-context tasks stays close to that of dense attention.
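
As a back-of-the-envelope check on how 90 percent sparsity could relate to a 2.67 times end-to-end speedup, the sketch below uses an Amdahl-style estimate in which only the attention share of runtime shrinks. This model is an assumption made here, not the paper's analysis: it ignores draft-model overhead, mask construction cost, and kernel efficiency, and the function names are hypothetical.

```python
def end_to_end_speedup(attn_fraction: float, sparsity: float) -> float:
    """Amdahl-style estimate: only the attention share of runtime shrinks."""
    remaining = (1.0 - attn_fraction) + attn_fraction * (1.0 - sparsity)
    return 1.0 / remaining


def implied_attention_fraction(speedup: float, sparsity: float) -> float:
    """Invert the estimate: the attention share that would explain a speedup."""
    return (1.0 - 1.0 / speedup) / sparsity


# Under this simplified model, what attention share matches the reported numbers?
f = implied_attention_fraction(2.67, 0.90)
print(round(f, 3))                            # 0.695
print(round(end_to_end_speedup(f, 0.90), 2))  # 2.67
```

Under these assumptions, attention would need to account for roughly 70 percent of dense runtime for the reported figures to line up; the same model caps the speedup near 10 times at 90 percent sparsity even if attention were the entire runtime.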

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same draft-to-target predictability might let other inference accelerators, such as KV-cache compression, be guided by the draft model.
  • Agentic systems that maintain very long interaction histories could gain real-time responsiveness without changing model weights.
  • Varying the size ratio between draft and target models would test how robust the important-token prediction remains.
  • Energy use per generated token could drop proportionally to the observed speedup on hardware with attention bottlenecks.

Load-bearing premise

The tokens the smaller draft model flags as important are the same ones the larger target model needs to attend to.
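
One way to probe this premise directly is to measure how many of the target model's top-k tokens the draft model also ranks in its own top-k. The sketch below is a hypothetical check with made-up scores; `topk_recall` is a name chosen here, not a metric reported in the paper.

```python
from typing import Sequence, Set


def topk_indices(scores: Sequence[float], k: int) -> Set[int]:
    """Indices of the k highest-scoring tokens."""
    order = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
    return set(order[:k])


def topk_recall(draft_scores: Sequence[float], target_scores: Sequence[float], k: int) -> float:
    """Fraction of the target's top-k tokens that also appear in the draft's top-k."""
    if k <= 0:
        raise ValueError("k must be positive")
    overlap = topk_indices(target_scores, k) & topk_indices(draft_scores, k)
    return len(overlap) / k


# Made-up attention weights over five context tokens for one head.
draft = [0.40, 0.10, 0.30, 0.05, 0.15]
target = [0.35, 0.05, 0.10, 0.20, 0.30]
print(topk_recall(draft, target, k=2))  # 0.5
```

A recall near 1.0 across heads, layers, and inputs would support the premise; a low recall combined with preserved accuracy would point to redundancy in the data rather than transferable importance.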

What would settle it

Apply the sparsity mask generated by the draft model to the target model on NarrativeQA and measure whether accuracy falls more than a few percent below the dense baseline.

Figures

Figures reproduced from arXiv: 2605.15508 by Ceyu Xu, Jiangnan Yu, Yongji Wu, Yuan Xie.

Figure 1. A taxonomy of methods for reducing the computa…
Figure 2. STS vs. Other Token Sparsity Works. The key difference is what “proxy” is used for selecting the important tokens.
Figure 3. Overview of the STS Workflow.
… algorithm overcomes this by leveraging a key similarity of the Transformer architecture: the attention mechanism. While models vary in size, the attention weight matrix for any head, regardless of whether it comes from the small draft model or the large target model, is consistently shaped N × N, where N is the sequence length. A model’s size only determines the number of attent…
Figure 4. [System Performance]: End-to-end speedup analysis. (a) STS scales efficiently up to 100K context length. (b) STS maintains performance advantages on realistic benchmarks like LongBench and SwBench.
… as the sequence length increases. This widening gap stems from the overhead of "Just-in-Time" selection. Methods like Quest require loading coarse-grained Q and K metadata from HBM to compute importance scores…
Figure 5. [System Optimization]: Latency of KV-cache offloading strategies.
[Plot: throughput (Token/s) vs. concurrency (8 to 128); series: Llama 3.2 1B, Llama 3.1 8B, Llama 3.1 70B]
Figure 6. [Draft Overhead Analysis]: Throughput scalability under concurrency.
[Plot: accuracy (%) for (a) 8B and (b) 70B on ARC Challenge, GSM-8K, Hellaswag, MMLU, TruthfulQA, Winogrande; series: Dense, STS]
Figure 7. [General Benchmark Fidelity]: Zero-shot accuracy on standard reasoning benchmarks. STS incurs negligible accuracy loss across 8B and 70B models, demonstrating that sparsity patterns generalize effectively across model scales.
… of memory transfer latency, making CPU offloading a viable option for serving long-context workloads when GPU memory is limited. From a resource utilization perspective, this strat…
Figure 8. [LongBench Fidelity]: LongBench results on Llama-3.1-8B. STS (red) consistently outperforms sparse baselines (Quest, Tidal) and matches Dense performance across diverse tasks.
… GSM8K [7] (math), and ARC [6] (reasoning). STS tracks the Dense baseline within 0.5% variance across all tasks for both 8B and 70B models. Implication: This result is significant because it proves that the sparsity patterns learned …
Figure 9. [LongBench Fidelity]: LongBench results on Llama-3.1-70B. STS scales effectively to larger models, preserving long-range dependencies even in multi-document QA tasks.
[Plot: perplexity vs. input length (tokens, 0 to 10000); series: Full Attention, STS, Quest, TidalDecode]
Figure 10. [Fidelity]: Language modeling perplexity on Wiki-103. STS maintains perplexity within 0.01 of the dense baseline, whereas heuristic methods degrade at long contexts.
… layer, it represents the maximum fraction of attention entries that can be discarded without exceeding a strict perplexity degradation threshold (∆PPL < 0.01). This sensitivity analysis reveals the inherent heterogeneity of the model: some l…
Figure 12. [Micro-analysis]: Layer-wise sparsity patterns. The draft-guided mask (STS) closely aligns with Oracle’s distribution.
…tal attention topology as large models, making them excellent predictors for sparse attention without the need for expensive gradients or full-model computations.
7 Related Work. We classify efficient long-context inference into three paradigms: state eviction (KV compression), heuristic a…
Original abstract

The quadratic complexity of attention imposes severe memory and computational bottlenecks on Large Language Model (LLM) inference. This challenge is particularly acute for emerging agentic applications that require processing multi-million token sequences. We propose STS, a sparse attention mechanism that requires no model retraining. STS leverages the key insight that tokens identified as important by a smaller draft model are highly predictive of important tokens for a larger target model. By integrating into speculative decoding frameworks, STS repurposes the draft model's attention scores to dynamically construct a token-and-head-wise sparsity mask. This mask effectively prunes the expensive attention computation in the target LLM. Our evaluation shows that STS achieves a 2.67x speedup operating at approximately 90% sparsity on representative benchmark NarrativeQA, maintaining negligible accuracy degradation compared to dense attention. STS establishes a new state-of-the-art on the sparsity-accuracy trade-off, outperforming prior techniques by enabling higher sparsity levels for a given accuracy budget.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes STS, a training-free sparse attention mechanism for large language models that integrates with speculative decoding frameworks. It repurposes attention scores from a smaller draft model to dynamically construct a token-and-head-wise sparsity mask for the target model, pruning quadratic attention computations. On the NarrativeQA benchmark, STS reports a 2.67x speedup at approximately 90% sparsity while maintaining negligible accuracy degradation relative to dense attention, and claims a new state-of-the-art on the sparsity-accuracy trade-off.

Significance. If the draft-to-target importance transfer assumption holds, STS could meaningfully advance efficient long-context inference for LLMs in agentic settings by reducing memory and compute bottlenecks without retraining. The reuse of draft-model computations already performed in speculative decoding is a practical strength that could facilitate adoption. The reported numbers on NarrativeQA suggest a favorable operating point, but broader significance hinges on demonstrating that the sparsity mask reliably identifies target-critical tokens rather than relying solely on end-to-end accuracy preservation.

major comments (2)
  1. [Abstract and key insight paragraph] The central claim that 'tokens identified as important by a smaller draft model are highly predictive of important tokens for a larger target model' is load-bearing for the sparsity mask construction, yet the manuscript provides only end-to-end accuracy on NarrativeQA. No direct metrics (token overlap, rank correlation, or per-head mask agreement) between draft-derived and target-optimal masks are reported; without these, it remains possible that accuracy is preserved by dataset redundancy or conservative masking rather than the claimed transferability.
  2. [Evaluation section] The 2.67x speedup at ~90% sparsity is presented without error bars, detailed dataset statistics, or ablations testing the draft-target correlation assumption across model pairs or tasks. This weakens confidence that the negligible accuracy degradation generalizes, as the central claim depends on the mask correctly identifying target-critical tokens.
minor comments (1)
  1. [Method] The manuscript would benefit from a clearer description of how the sparsity threshold is chosen and whether it is fixed or adaptive across layers or heads.
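
The rank-correlation and per-head mask-agreement metrics requested in the major comments could be computed along the following lines. This is a sketch under assumptions made here: Spearman correlation over one head's attention weights and Jaccard similarity between keep-masks are one reasonable reading of the request, and all names and numbers are illustrative rather than taken from the paper.

```python
import math
from typing import List, Sequence


def ranks(xs: Sequence[float]) -> List[float]:
    """Average 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r


def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman rank correlation, computed as the Pearson correlation of ranks."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    if va == 0.0 or vb == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(va * vb)


def jaccard(mask_a: Sequence[bool], mask_b: Sequence[bool]) -> float:
    """Intersection over union of the kept-token sets of two masks."""
    inter = sum(1 for x, y in zip(mask_a, mask_b) if x and y)
    union = sum(1 for x, y in zip(mask_a, mask_b) if x or y)
    return inter / union if union else 1.0


# Made-up attention weights for one head of each model.
draft = [0.40, 0.10, 0.30, 0.05, 0.15]
target = [0.35, 0.05, 0.10, 0.20, 0.30]
print(round(spearman(draft, target), 3))  # 0.5

# Made-up keep-masks for two heads of each model.
draft_masks = [[True, False, True, False], [True, True, False, False]]
target_masks = [[True, True, False, False], [True, True, False, False]]
per_head = [jaccard(d, t) for d, t in zip(draft_masks, target_masks)]
print([round(x, 3) for x in per_head])  # [0.333, 1.0]
```

Reporting the distribution of these scores across layers and heads, rather than a single average, would show whether agreement is uniform or concentrated in a few heads.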

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, indicating where revisions will be made to strengthen the presentation of our results.

Point-by-point responses
  1. Referee: [Abstract and key insight paragraph] The central claim that 'tokens identified as important by a smaller draft model are highly predictive of important tokens for a larger target model' is load-bearing for the sparsity mask construction, yet the manuscript provides only end-to-end accuracy on NarrativeQA. No direct metrics (token overlap, rank correlation, or per-head mask agreement) between draft-derived and target-optimal masks are reported; without these, it remains possible that accuracy is preserved by dataset redundancy or conservative masking rather than the claimed transferability.

    Authors: We agree that direct metrics would provide stronger support for the draft-to-target transfer assumption underlying the sparsity mask. End-to-end accuracy and speedup are the primary practical metrics, but we recognize that intermediate validation would address potential alternative explanations such as dataset redundancy. In the revised manuscript we will add a new subsection reporting token overlap, rank correlation, and per-head mask agreement between draft-derived masks and target-model attention scores on a representative sample of NarrativeQA examples. This analysis will be included to directly substantiate the key insight. revision: yes

  2. Referee: [Evaluation section] The 2.67x speedup at ~90% sparsity is presented without error bars, detailed dataset statistics, or ablations testing the draft-target correlation assumption across model pairs or tasks. This weakens confidence that the negligible accuracy degradation generalizes, as the central claim depends on the mask correctly identifying target-critical tokens.

    Authors: We acknowledge that the current evaluation lacks error bars and broader ablations, which limits claims about generalization. We will revise the evaluation section to include error bars computed over multiple random seeds for the reported speedup and accuracy figures, along with additional dataset statistics for NarrativeQA. We will also add a brief discussion of preliminary results on one additional task and model pair to illustrate the correlation assumption. Comprehensive ablations across all possible model pairs and tasks are beyond the scope of the current work but will be noted as future directions. revision: partial
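
The error bars promised in the second response could be reported as a mean with its standard error over repeated runs. The sketch below shows the arithmetic only; the run values are placeholders invented for illustration and are not results from the paper, and `mean_and_sem` is a hypothetical helper name.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_sem(samples: Sequence[float]) -> Tuple[float, float]:
    """Return the sample mean and its standard error (sample std / sqrt(n))."""
    if len(samples) < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.mean(samples)
    sem = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, sem


# Placeholder speedups from five hypothetical seeds (NOT results from the paper).
runs = [2.0, 2.2, 2.4, 2.6, 2.8]
mean, sem = mean_and_sem(runs)
print(f"{mean:.2f} ± {sem:.2f}")  # 2.40 ± 0.14
```

With only a handful of seeds, a t-distribution interval would be wider than the standard error alone suggests, so the number of runs should be stated next to any error bar.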

Circularity Check

0 steps flagged

No circularity: empirical assumption tested via end-to-end results

full rationale

The paper asserts that draft-model attention scores are predictive of target-model token importance as a key insight, then constructs a sparsity mask from the draft run and measures end-to-end speedup and accuracy on NarrativeQA. No equations, fitted parameters, or self-citations are shown that would make the reported 2.67x speedup or 90% sparsity level equivalent to the input mask by construction. The predictive relationship is presented as an empirical premise rather than a derived result that loops back to itself, and the evaluation remains independent of any internal fitting to the target model's own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on one central domain assumption and introduces no new free parameters or invented entities beyond standard LLM components.

axioms (1)
  • domain assumption: Tokens identified as important by a smaller draft model are highly predictive of important tokens for a larger target model.
    Stated as the key insight enabling the sparsity mask construction.




Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 2 internal anchors

  [1] Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. Longbench: A bilingual, multitask benchmark for long context understanding, 2024.
  [2] Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer, 2020.
  [3] Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318, 2023.
  [4] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers, 2019.
  [5] Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, and Adrian Weller. Rethinking attention with performers, 2022.
  [6] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018.
  [7] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021.
  [8] Qichen Fu, Minsik Cho, Thomas Merth, Sachin Mehta, Mohammad Rastegari, and Mahyar Najibi. Lazyllm: Dynamic token pruning for efficient long context llm inference, 2024.
  [9] Yizhao Gao, Zhichen Zeng, Dayou Du, Shijie Cao, Peiyuan Zhou, Jiaxing Qi, Junjie Lai, Hayden Kwok-Hay So, Ting Cao, Fan Yang, and Mao Yang. Seerattention: Learning intrinsic sparse attention in your llms, 2025.
  [10] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces, 2024.
  [11] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding, 2021.
  [12] Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. Minference 1.0: Accelerating pre-filling for long-context llms via dynamic sparse attention, 2024.
  [13] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?, 2024.
  [14] Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer, 2020.
  [15] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, SOSP ’23, page 611–626, New York, NY, USA, 2023. Association for Computing Machinery.
  [16] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention, 2023.
  [17] Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding, 2023.
  [18] Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. Snapkv: Llm knows what you are looking for before generation, 2024.
  [19] Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts, 2023.
  [20] Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. Agentbench: Evaluating llms as agents, 2025.
  [21] Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava. Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time, 2023.
  [22] Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re, and Beidi Chen. Deja vu: Contextual sparsity for efficient llms at inference time, 2023.
  [23] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models, 2016.
  [24] Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish,... In-context learning and induction heads, 2022.
  [25] Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. Flexgen: High-throughput generative inference of large language models with a single gpu, 2023.
  [26] Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, and Song Han. Quest: Query-aware sparsity for efficient long-context llm inference, 2024.
  [27] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017.
  [28] Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity, 2020.
  [29] Xingyao Wang, Zihan Wang, Jiateng Liu, Yangyi Chen, Lifan Yuan, Hao Peng, and Heng Ji. Mint: Evaluating llms in multi-turn interaction with tools and language feedback, 2024.
  [30] Chaojun Xiao, Pengle Zhang, Xu Han, Guangxuan Xiao, Yankai Lin, Zhengyan Zhang, Zhiyuan Liu, and Maosong Sun. Infllm: Training-free long-context extrapolation for llms with an efficient context memory, 2024.
  [31] Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, and Song Han. Duoattention: Efficient long-context llm inference with retrieval and streaming heads, 2024.
  [32] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks, 2024.
  [33] Dongjie Yang, XiaoDong Han, Yan Gao, Yao Hu, Shilin Zhang, and Hai Zhao. Pyramidinfer: Pyramid kv cache compression for high-throughput llm inference, 2024.
  [34] Lijie Yang, Zhihao Zhang, Zhuofu Chen, Zikun Li, and Zhihao Jia. Tidaldecode: Fast and accurate llm decoding with position persistent sparse attention, 2024.
  [35] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models, 2023.
  [36] Zihao Ye et al. Flashinfer: Kernel library for llm serving. https://github.com/flashinfer-ai/flashinfer, 2024.
  [37] Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Y. X. Wei, Lean Wang, Zhiping Xiao, Yuqing Wang, Chong Ruan, Ming Zhang, Wenfeng Liang, and Wangding Zeng. Native sparse attention: Hardware-aligned and natively trainable sparse attention, 2025.
  [38] Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. Big bird: Transformers for longer sequences, 2021.
  [39] Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang Wang, and Beidi Chen. H2O: Heavy-hitter oracle for efficient generative inference of large language models, 2023.