SparDA: Sparse Decoupled Attention for Efficient Long-Context LLM Inference

Guangxuan Xiao; Oreste Villa; Song Han; Xin Dong; Yaosheng Fu

arxiv: 2606.04511 · v1 · pith:JJMNOVYWnew · submitted 2026-06-03 · 💻 cs.CL · cs.LG

Yaosheng Fu, Guangxuan Xiao, Xin Dong, Song Han, Oreste Villa

Pith reviewed 2026-06-28 06:26 UTC · model grok-4.3

classification 💻 cs.CL cs.LG

keywords sparse attention · long-context LLM · KV cache offloading · decoupled attention · lookahead selection · prefetch overlap · grouped-query attention · inference efficiency

The pith

SparDA adds a small Forecast projection to sparse attention that predicts the next layer's KV blocks, enabling overlapped CPU prefetch and lower selection cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that sparse attention in long-context LLMs can be accelerated by separating the block-selection logic from the current-layer query computation. A fourth projection called Forecast is added per layer and trained only to reproduce the attention distribution of the original selector. Because it operates independently, the design uses one Forecast head per grouped-query group instead of one per head, keeping the parameter overhead below 0.5 percent. The resulting lookahead selection allows CPU-to-GPU transfers of the next layer's KV cache to run while the current layer executes. On two 8B sparse-pretrained models this yields measured speedups while preserving accuracy.

Core claim

SparDA decouples sparse attention by introducing a Forecast projection that forecasts the KV blocks required by the subsequent layer. The projection is trained solely to match the original selector's attention distribution and employs one head per GQA group, adding less than 0.5 percent parameters. Lookahead selection produced by Forecast overlaps PCIe transfers with current-layer execution, removing the transfer bottleneck that otherwise dominates sparse offload attention at long contexts.
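A minimal sketch of the lookahead selection described above. It assumes each KV block of the next layer is summarized by one vector (for example, mean-pooled keys) and that blocks are ranked by dot product with the Forecast vector. The abstract does not give the scoring rule, so `forecast_select`, `W_f`, and `block_summaries` are hypothetical names used only for illustration.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def matvec(W: Matrix, x: Vector) -> Vector:
    """Multiply matrix W (rows x cols) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def forecast_select(hidden: Vector, W_f: Matrix,
                    block_summaries: Matrix, top_k: int) -> List[int]:
    """Pick the top_k KV blocks the NEXT layer is predicted to need.

    hidden          : current-layer hidden state for one token
    W_f             : hypothetical Forecast projection weights
    block_summaries : one summary vector per KV block of the next layer
                      (assumed here to be mean-pooled keys)
    """
    f = matvec(W_f, hidden)  # Forecast vector, computed without the query
    scores = [sum(a * b for a, b in zip(f, s)) for s in block_summaries]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])


# Toy inputs, for illustration only.
hidden = [1.0, 0.0]
W_f = [[1.0, 0.0], [0.0, 1.0]]  # identity projection
blocks = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5], [0.9, 0.0]]
selected = forecast_select(hidden, W_f, blocks, top_k=2)
```

Because the selection depends only on the hidden state and the Forecast weights, it can run before the next layer's query exists, which is what makes the prefetch possible.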

What carries the argument

The Forecast projection, an additional per-layer linear map that generates KV-block predictions independently of the current query and thereby enables lookahead selection.

If this is right

Accuracy matches or slightly exceeds the sparse-attention offload baseline on the tested 8B models.
Prefill phase reaches up to 1.25 times the speed of the sparse-attention offload baseline.
Decode phase reaches up to 1.7 times the speed of the sparse-attention offload baseline.
Larger feasible batch sizes on one GPU produce up to 5.3 times higher decode throughput than the non-offload sparse baseline.
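The prefill and decode gains listed above come from hiding PCIe transfer time behind compute. A rough per-layer timing model makes the mechanism concrete. It is an idealization: it ignores the cost of the Forecast projection itself and the first layer's unhidden fetch, and the numbers are illustrative, not measurements from the paper.

```python
def layer_time_serial(compute_ms: float, transfer_ms: float) -> float:
    """Baseline: select blocks, fetch them over PCIe, then compute."""
    return compute_ms + transfer_ms


def layer_time_overlapped(compute_ms: float, transfer_ms: float) -> float:
    """Lookahead: the next layer's fetch runs during this layer's compute."""
    return max(compute_ms, transfer_ms)


def speedup(compute_ms: float, transfer_ms: float) -> float:
    """Ratio of serial to overlapped per-layer time."""
    return (layer_time_serial(compute_ms, transfer_ms)
            / layer_time_overlapped(compute_ms, transfer_ms))


# Illustrative numbers only; not measurements from the paper.
example = speedup(compute_ms=1.0, transfer_ms=0.5)
balanced = speedup(compute_ms=1.0, transfer_ms=1.0)
```

Under this model the gain is bounded by 2× and is largest when compute and transfer take equal time, so the measured 1.25× and 1.7× figures are at least compatible with a pure overlap effect.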

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same decoupled-forecast idea could be applied to other memory hierarchies such as NVMe or remote GPU memory.
Because Forecast is trained independently, it might be possible to adapt it to new tasks without retraining the rest of the model.
If Forecast accuracy holds at very long contexts, the method could reduce the number of GPUs required for single-request long-context serving.

Load-bearing premise

A projection trained only to match the original selector's attention distribution will continue to produce sufficiently accurate KV-block predictions at inference time across sequence lengths and tasks.

What would settle it

An experiment that measures attention-block hit rate on a held-out long-context task and shows either a drop in downstream accuracy or no net speedup once the Forecast predictions are used.
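The hit rate in such an experiment could be computed as block-level recall: the share of blocks the original selector would have chosen that the Forecast prediction also fetched. The function name and the block ids below are hypothetical.

```python
from typing import Iterable, Set


def block_hit_rate(predicted: Iterable[int], needed: Iterable[int]) -> float:
    """Fraction of the blocks the real selector wanted that were prefetched."""
    needed_set: Set[int] = set(needed)
    if not needed_set:
        return 1.0  # nothing was needed, so nothing was missed
    return len(needed_set & set(predicted)) / len(needed_set)


# Toy block ids, for illustration only.
rate = block_hit_rate(predicted=[1, 3, 5, 7], needed=[1, 2, 3, 7])
```

Reporting this value per layer and per sequence length would show whether prediction quality degrades as context grows.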

Original abstract

Sparse attention reduces compute and memory bandwidth for long-context LLM inference. However, two key challenges remain: (1) KV cache capacity still grows with sequence length, and offloading to CPU memory introduces a PCIe transfer bottleneck; (2) the sparse selection step itself retains $O(T^2)$ complexity and can dominate attention cost at long contexts. We propose SparDA, a decoupled sparse attention architecture that introduces a fourth per-layer projection, the Forecast, alongside Query, Key, and Value. The Forecast predicts the KV blocks needed by the next layer, enabling lookahead selection that overlaps CPU-to-GPU prefetch with current-layer execution. Because Forecast is decoupled from the attention query, our GQA implementation uses one Forecast head per GQA group, reducing selection overhead versus the original multi-head selector. SparDA adds $<$0.5% parameters and trains only the Forecast projections by matching the original selector's attention distribution. On two sparse-pretrained 8B models, SparDA matches or slightly improves accuracy and delivers up to 1.25$\times$ prefill speedup and 1.7$\times$ decode speedup over the sparse-attention offload baseline. By enabling larger feasible batch sizes on a single GPU, SparDA further reaches up to 5.3$\times$ higher decode throughput than the non-offload sparse baseline. Our source code is available at https://github.com/NVlabs/SparDA.
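The overlap of prefetch and execution described in the abstract can be sketched as a layer loop with one background copy worker. This is a schematic under stated assumptions, not the released implementation: `forecast_blocks`, `prefetch`, and `attend` are hypothetical stand-ins, a Python thread plays the role of an asynchronous CPU-to-GPU copy, and layer 0 is assumed to fetch a fixed set of blocks because no earlier layer can forecast for it.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

# Stand-in for KV blocks held in CPU memory: layer -> block id -> payload.
cpu_kv: Dict[int, Dict[int, str]] = {
    layer: {b: f"kv[{layer}][{b}]" for b in range(4)} for layer in range(3)
}


def forecast_blocks(layer: int) -> List[int]:
    """Hypothetical Forecast output: block ids the NEXT layer will need."""
    return [0, (layer + 1) % 4]


def prefetch(layer: int, block_ids: List[int]) -> Dict[int, str]:
    """Stand-in for an asynchronous CPU-to-GPU copy of the chosen blocks."""
    return {b: cpu_kv[layer][b] for b in block_ids}


def attend(layer: int, gpu_blocks: Dict[int, str]) -> List[str]:
    """Stand-in for sparse attention over the blocks already on the GPU."""
    return [gpu_blocks[b] for b in sorted(gpu_blocks)]


def run(num_layers: int) -> List[List[str]]:
    outputs = []
    with ThreadPoolExecutor(max_workers=1) as copier:
        # Layer 0 has no earlier layer to forecast for it, so fetch directly.
        pending = copier.submit(prefetch, 0, [0, 1])
        for layer in range(num_layers):
            gpu_blocks = pending.result()  # wait for this layer's KV
            if layer + 1 < num_layers:
                ids = forecast_blocks(layer)  # lookahead selection
                pending = copier.submit(prefetch, layer + 1, ids)
            # Attention for this layer runs while the next copy is in flight.
            outputs.append(attend(layer, gpu_blocks))
    return outputs


result = run(3)
```

The key ordering is that the copy for layer `l + 1` is submitted before attention for layer `l` starts, so the transfer and the compute can proceed concurrently.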

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SparDA adds a small decoupled Forecast projection per GQA group to overlap KV prefetch with computation in offloaded sparse attention, but the transfer from distribution matching to accurate block selection at inference remains the main untested piece.

The core addition is a fourth per-layer projection, the Forecast, that runs ahead to pick KV blocks for the next layer. By decoupling it from the actual query and sharing one head per GQA group, they keep the overhead tiny and cut the selection cost that was still quadratic in the original sparse selector. The training is limited to matching the existing selector's attention scores, and they release the code.

On the two 8B models they tested, accuracy holds or improves slightly while they show 1.25× prefill and 1.7× decode speedups over the offload baseline, plus larger throughput gains once bigger batches fit on one GPU. That is the practical payoff they are after.

The soft spot is the one flagged in the stress test. Matching the full attention distribution does not guarantee that the top-K blocks chosen will be the ones the real next-layer queries would have picked, especially when attention is spread out or when sequence length and task shift. If the prefetch misses often, either accuracy slips or the overlap benefit disappears. The abstract gives no numbers on prediction hit rate or how it changes with length, so the speedups rest on an assumption that still needs checking.
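A toy example of that gap, assuming the matching objective is a KL divergence over block-level attention mass (the abstract does not state the exact loss). The two distributions below are invented to show that a small divergence can coexist with a completely different top-K set when attention is diffuse.

```python
import math
from typing import List


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) for two discrete distributions over KV blocks."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def top_k(dist: List[float], k: int) -> List[int]:
    """Indices of the k largest entries, returned in ascending order."""
    order = sorted(range(len(dist)), key=lambda i: dist[i], reverse=True)
    return sorted(order[:k])


# Diffuse attention: the selector barely prefers blocks 0 and 1.
selector = [0.26, 0.26, 0.24, 0.24]
# A Forecast output that is numerically close but ranks blocks differently.
forecast = [0.24, 0.24, 0.26, 0.26]

loss = kl_divergence(selector, forecast)
same_blocks = top_k(selector, 2) == top_k(forecast, 2)
```

Here the divergence is well under 0.01 nats, yet the two top-2 sets do not share a single block, which is the failure mode a hit-rate ablation would expose.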

This is for people who already run sparse attention with CPU offload and want to squeeze more throughput on single-GPU setups. The concrete claims plus released code make it worth sending to referees rather than desk-rejecting, though the experiments will need ablations on the Forecast accuracy itself before the gains look solid.

Referee Report

1 major / 1 minor

Summary. The paper introduces SparDA, a decoupled sparse attention architecture for long-context LLM inference that adds a per-layer Forecast projection (alongside Q/K/V) to predict KV blocks needed by the next layer. This enables lookahead prefetch overlapping CPU-to-GPU transfers with current-layer execution. The Forecast is trained solely by matching the original selector's attention distribution (<0.5% added parameters, one head per GQA group), and experiments on two sparse-pretrained 8B models report accuracy parity or slight gains plus speedups of up to 1.25× prefill / 1.7× decode over the sparse offload baseline and 5.3× decode throughput over the non-offload baseline.

Significance. If the empirical results hold under rigorous controls, SparDA would meaningfully mitigate PCIe transfer and selection overheads in offloaded sparse attention while preserving model quality, with the open-source code (https://github.com/NVlabs/SparDA) providing a clear reproducibility strength. The approach's value rests on whether distribution matching suffices for accurate block-level prefetch across lengths and tasks.

major comments (1)

[Abstract] The reported accuracy preservation and speedups rest on the Forecast projection producing sufficiently accurate top-K KV-block predictions at inference. Training occurs only by matching the original selector's attention distribution; this objective does not directly optimize or guarantee block-level selection correctness when attention mass is diffuse or multi-modal, or when sequence lengths or tasks differ from the training data. That correctness is load-bearing for the prefetch benefit and for the 1.25×/1.7×/5.3× claims.

minor comments (1)

The abstract states that the GQA implementation uses one Forecast head per group to reduce selection overhead, but the precise reduction factor and its interaction with the original multi-head selector should be quantified with an equation or table in the methods section.
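One way the requested reduction factor could be stated, as a sketch only. It assumes that selection cost scales linearly with the number of selector heads and that a Forecast head has the same dimension as an original selector head; the abstract confirms neither. The symbols $H_q$ (query heads), $G$ (GQA groups), $B$ (candidate KV blocks), and $d$ (head dimension) are notation introduced here.

```latex
\[
\frac{C_{\text{per-head}}}{C_{\text{per-group}}}
  = \frac{H_q \cdot B \cdot d}{G \cdot B \cdot d}
  = \frac{H_q}{G}
\]
```

For a hypothetical configuration with $H_q = 32$ and $G = 8$, this would give a 4× reduction in block-scoring work per token.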

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on the training objective and its relation to the reported speedups. We address the point below.

Referee: [Abstract] The reported accuracy preservation and speedups rest on the Forecast projection producing sufficiently accurate top-K KV-block predictions at inference. Training occurs only by matching the original selector's attention distribution; this objective does not directly optimize or guarantee block-level selection correctness when attention mass is diffuse or multi-modal, or when sequence lengths or tasks differ from the training data. That correctness is load-bearing for the prefetch benefit and for the 1.25×/1.7×/5.3× claims.

Authors: We agree that distribution matching is an indirect objective and does not directly optimize or guarantee top-K block selection accuracy, especially under diffuse/multi-modal attention or distribution shift. The manuscript's claims rest on empirical validation rather than theoretical guarantees. On the two evaluated 8B models, the Forecast yields predictions accurate enough to preserve (or slightly improve) accuracy while enabling the reported speedups. We will revise the abstract to explicitly note the distribution-matching objective and that speedups are empirically supported. We will also add a short discussion in the methods section on why this objective suffices in practice for the tested regimes.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical results independent of training objective

The paper introduces Forecast as an additional projection trained to match an existing selector's attention distribution, then reports measured speedups and accuracy on held-out inference workloads. No derivation step reduces a claimed prediction or result to the training objective by construction (no fitted-input-called-prediction pattern). No self-citation is load-bearing for the central claims, no uniqueness theorem is invoked, and no ansatz is smuggled. The reported 1.25×/1.7× speedups and accuracy parity are external measurements, not quantities forced by the paper's own equations or definitions. This is the common case of an independent empirical evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Review performed on abstract only; full details on training hyperparameters, exact loss formulation, and any additional modeling assumptions are unavailable.

invented entities (1)

Forecast projection (no independent evidence)
purpose: Predicts the KV blocks required by the subsequent layer to enable lookahead selection and prefetch overlap
New per-layer projection introduced alongside Q, K, V; trained separately by matching the original selector distribution.


[1] [1]

Self-taught agentic long context understanding

Yufan Zhuang, Xiaodong Yu, Jialian Wu, Ximeng Sun, Ze Wang, Jiang Liu, Yusheng Su, Jingbo Shang, Zicheng Liu, and Emad Barsoum. Self-taught agentic long context understanding. InAnnual Meeting of the Association for Computational Linguistics (ACL), 2025

2025

[2] [2]

LoongRL: Reinforcement learning for advanced reasoning over long contexts

Siyuan Wang, Gaokai Zhang, Li Lyna Zhang, Ning Shang, Fan Yang, Dongyao Chen, and Mao Yang. LoongRL: Reinforcement learning for advanced reasoning over long contexts. InInternational Conference on Learning Representations (ICLR), 2026

2026

[3] [3]

Native sparse attention: Hardware-aligned and natively trainable sparse attention

Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Yuxing Wei, Lean Wang, Zhiping Xiao, Yuqing Wang, Chong Ruan, Ming Zhang, Wenfeng Liang, and Wangding Zeng. Native sparse attention: Hardware-aligned and natively trainable sparse attention. InAnnual Meeting of the Association for Computational Linguistics (ACL), 2025

2025

[4] [4]

InfLLM-V2: Dense-sparse switchable attention for seamless short-to-long adaptation

Weilin Zhao, Zihan Zhou, Zhou Su, Chaojun Xiao, Yuxuan Li, Yanghao Li, Yudi Zhang, Weilun Zhao, Zhen Li, Yuxiang Huang, Ao Sun, Xu Han, and Zhiyuan Liu. InfLLM-V2: Dense-sparse switchable attention for seamless short-to-long adaptation. InInternational Conference on Learning Representations (ICLR), 2026

2026

[5] [5]

Zhang, Zhilin Yang, Xinyu Zhou, Mingxing Zhang, and Jiezhong Qiu

Enzhe Lu, Zhejun Jiang, Jingyuan Liu, Yulun Du, Tao Jiang, Chao Hong, Shaowei Liu, Weiran He, Enming Yuan, Yuzhi Wang, Zhiqi Huang, Huan Yuan, Suting Xu, Xinran Xu, Guokun Lai, Yanru Chen, Huabin Zheng, Junjie Yan, Jianlin Su, Yuxin Wu, Neo Y. Zhang, Zhilin Yang, Xinyu Zhou, Mingxing Zhang, and Jiezhong Qiu. MoBA: Mixture of block attention for long-conte...

2025

[6] [6]

DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

DeepSeek-AI. DeepSeek-V3.2: Pushing the frontier of open large language models.arXiv preprint arXiv:2512.02556, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[7] [7]

GLM-5: from Vibe Coding to Agentic Engineering

GLM-5-Team. GLM-5: from vibe coding to agentic engineering.arXiv preprint arXiv:2602.15763, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[8] [8]

DeepSeek-V4: Technical report.https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/ main/DeepSeek_V4.pdf, 2026

DeepSeek-AI. DeepSeek-V4: Technical report.https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/ main/DeepSeek_V4.pdf, 2026

2026

[9] [9]

Sparseserve: Unlocking parallelism for dynamic sparse attention in long-context llm serving.arXiv preprint arXiv:2509.24626,

Qihui Zhou, Peiqi Yin, Pengfei Zuo, and James Cheng. SparseServe: Unlocking parallelism for dynamic sparse attention in long-context LLM serving.arXiv preprint arXiv:2509.24626, 2025

work page arXiv 2025

[10] [10]

Nosa: Native and offloadable sparse attention.arXiv preprint arXiv:2510.13602,

Yuxiang Huang, Pengjie Wang, Jicheng Han, Weilin Zhao, Zhou Su, Ao Sun, Hongya Lyu, Hengyu Zhao, Yudong Wang, Chaojun Xiao, Xu Han, and Zhiyuan Liu. NOSA: Native and offloadable sparse attention.arXiv preprint arXiv:2510.13602, 2025

work page arXiv 2025

[11] [11]

InfiniGen: Efficient generative inference of large language models with dynamic KV cache management

Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. InfiniGen: Efficient generative inference of large language models with dynamic KV cache management. InUSENIX Symposium on Operating Systems Design and Implementation (OSDI), 2024

2024

[12] [12]

Indexcache: Accelerating sparse attention via cross-layer index reuse.arXiv preprint arXiv:2603.12201,

Yushi Bai, Qian Dong, Ting Jiang, Xin Lv, Zhengxiao Du, Aohan Zeng, Jie Tang, and Juanzi Li. IndexCache: Accelerating sparse attention via cross-layer index reuse.arXiv preprint arXiv:2603.12201, 2026

work page arXiv 2026

[13] [13]

HISA: Efficient Hierarchical Indexing for Fine-Grained Sparse Attention

Yufei Xu, Fanxu Meng, Fan Jiang, Yuxuan Wang, Ruijie Zhou, Zhaohui Wang, Jiexi Wu, Zhixin Pan, Xiaojuan Tang, Wenjie Pei, Tongxuan Liu, Di Yin, Xing Sun, and Muhan Zhang. HISA: Efficient hierarchical indexing for fine-grained sparse attention.arXiv preprint arXiv:2603.28458, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[14] [14]

Minicpm4: Ultra-efficient llms on end devices

MiniCPM Team. MiniCPM4: Ultra-efficient LLMs on end devices.arXiv preprint arXiv:2506.07900, 2025

work page arXiv 2025

[15] [15]

Barrett, Zhangyang Wang, and Beidi Chen

Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark W. Barrett, Zhangyang Wang, and Beidi Chen. H2O: Heavy-hitter oracle for efficient generative inference of large language models. InConference on Neural Information Processing Systems (NeurIPS), 2023

2023

[16] [16]

SnapKV: LLM knows what you are looking for before generation

Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. SnapKV: LLM knows what you are looking for before generation. InConference on Neural Information Processing Systems (NeurIPS), 2024. 10 SparDA: Sparse Decoupled Attention for Efficient Long-Context LLM Inference

2024

[17] [17]

DuoAttention: Efficient long-context LLM inference with retrieval and streaming heads

Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, and Song Han. DuoAttention: Efficient long-context LLM inference with retrieval and streaming heads. InInternational Conference on Learning Representations (ICLR), 2025

2025

[18] [18]

QUEST: Query-aware sparsity for efficient long-context LLM inference

Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, and Song Han. QUEST: Query-aware sparsity for efficient long-context LLM inference. InInternational Conference on Machine Learning (ICML), 2024

2024

[19] [19]

Sparq attention: Bandwidth-efficient LLM inference

Luka Ribar, Ivan Chelombiev, Luke Hudlass-Galley, Charlie Blake, Carlo Luschi, and Douglas Orr. Sparq attention: Bandwidth-efficient LLM inference. InInternational Conference on Machine Learning (ICML), 2024

2024

[20] [20]

RocketKV: Accelerating long-context LLM inference via two-stage KV cache compression

Payman Behnam, Yaosheng Fu, Ritchie Zhao, Po-An Tsai, Zhiding Yu, and Alexey Tumanov. RocketKV: Accelerating long-context LLM inference via two-stage KV cache compression. InInternational Conference on Machine Learning (ICML), 2025

2025

[21] [21]

Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu

Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. MInference 1.0: Accelerating pre-filling for long-context LLMs via dynamic sparse attention. InConference on Neural Information Processing Systems (NeurIPS), 2024

2024

[22] [22]

FlexPrefill: A context-aware sparse attention mechanism for efficient long-sequence inference

Xunhao Lai, Jianqiao Lu, Yao Luo, Yiyuan Ma, and Xun Zhou. FlexPrefill: A context-aware sparse attention mechanism for efficient long-sequence inference. InInternational Conference on Learning Representations (ICLR), 2025

2025

[23] [23]

XAttention: Block sparse attention with antidiagonal scoring

Ruyi Xu, Guangxuan Xiao, Haofeng Huang, Junxian Guo, and Song Han. XAttention: Block sparse attention with antidiagonal scoring. InInternational Conference on Machine Learning (ICML), 2025

2025

[24] [24]

SeerAttention: Self-distilled attention gating for efficient long-context prefilling

Yizhao Gao, Zhichen Zeng, Dayou Du, Shijie Cao, Peiyuan Zhou, Jiaxing Qi, Junjie Lai, Hayden Kwok-Hay So, Ting Cao, Fan Yang, and Mao Yang. SeerAttention: Self-distilled attention gating for efficient long-context prefilling. In Conference on Neural Information Processing Systems (NeurIPS), 2025

2025

[25]

ArkVale: Efficient generative LLM inference with recallable key-value eviction

Renze Chen, Zhuofeng Wang, Beiquan Cao, Tong Wu, Size Zheng, Xiuhong Li, Xuechao Wei, Shengen Yan, Meng Li, and Yun Liang. ArkVale: Efficient generative LLM inference with recallable key-value eviction. In Conference on Neural Information Processing Systems (NeurIPS), 2024

2024

[26]

ShadowKV: KV cache in shadows for high-throughput long-context LLM inference

Hanshi Sun, Li-Wen Chang, Wenlei Bao, Size Zheng, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi, and Beidi Chen. ShadowKV: KV cache in shadows for high-throughput long-context LLM inference. In International Conference on Machine Learning (ICML), 2025

2025

[27]

MagicPIG: LSH sampling for efficient LLM generation

Zhuoming Chen, Ranajoy Sadhukhan, Zihao Ye, Yang Zhou, Jianyu Zhang, Niklas Nolte, Yuandong Tian, Matthijs Douze, Leon Bottou, Zhihao Jia, and Beidi Chen. MagicPIG: LSH sampling for efficient LLM generation. In International Conference on Learning Representations (ICLR), 2025

2025

[28]

HiSparse: Turbocharging sparse attention with hierarchical memory

Zhiqiang Xie, Zhangheng Huang, and Tingwei Huang. HiSparse: Turbocharging sparse attention with hierarchical memory. https://www.lmsys.org/blog/2026-04-10-sglang-hisparse, 2026

2026

[29]

Efficient streaming language models with attention sinks

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. In International Conference on Learning Representations (ICLR), 2024

2024

[30]

HELMET: How to evaluate long-context models effectively and thoroughly

Howard Yen, Tianyu Gao, Minmin Hou, Ke Ding, Daniel Fleischer, Peter Izsak, Moshe Wasserblat, and Danqi Chen. HELMET: How to evaluate long-context models effectively and thoroughly. In International Conference on Learning Representations (ICLR), 2025

2025

[31]

LongBench: A bilingual, multitask benchmark for long context understanding

Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. LongBench: A bilingual, multitask benchmark for long context understanding. In Annual Meeting of the Association for Computational Linguistics (ACL), 2024

2024

[32]

RULER: What’s the real context size of your long-context language models?

Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, and Boris Ginsburg. RULER: What’s the real context size of your long-context language models? In Conference on Language Modeling (COLM), 2024

2024

[33]

Let’s verify step by step

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In International Conference on Learning Representations (ICLR), 2024

2024

[34]

How to train long-context language models (effectively)

Tianyu Gao, Alexander Wettig, Howard Yen, and Danqi Chen. How to train long-context language models (effectively). In Annual Meeting of the Association for Computational Linguistics (ACL), 2025.

2025

[35]

LongRoPE: Extending LLM context window beyond 2 million tokens

Yiran Ding, Li Lyna Zhang, Chengruidong Zhang, Yuanyuan Xu, Ning Shang, Jiahang Xu, Fan Yang, and Mao Yang. LongRoPE: Extending LLM context window beyond 2 million tokens. In International Conference on Machine Learning (ICML), 2024. A. Algorithm Pseudocode Algorithm 1:SparDA pr...

2024