SeerAttention-R: Sparse Attention Adaptation for Long Reasoning

Fan Yang; Hayden Kwok-Hay So; Lei Wang; Li Dong; Lingxiao Ma; Mao Yang; Shijie Cao; Shuming Guo; Tianzhu Ye; Ting Cao

arxiv: 2506.08889 · v1 · pith:QN5FTDJAnew · submitted 2025-06-10 · 💻 cs.LG · cs.AI

Yizhao Gao, Shuming Guo, Shijie Cao, Yuqing Xia, Yu Cheng, Lei Wang, Lingxiao Ma, Yutao Sun, Tianzhu Ye, Li Dong, Hayden Kwok-Hay So, Yu Hua, Ting Cao, Fan Yang, Mao Yang

keywords: seerattention-r, attention, sparse, decoding, reasoning, gating, long, seerattention

We introduce SeerAttention-R, a sparse attention framework specifically tailored for the long decoding of reasoning models. Extended from SeerAttention, SeerAttention-R retains the design of learning attention sparsity through a self-distilled gating mechanism, while removing query pooling to accommodate auto-regressive decoding. With a lightweight plug-in gating, SeerAttention-R is flexible and can be easily integrated into existing pretrained models without modifying the original parameters. We demonstrate that SeerAttention-R, trained on just 0.4B tokens, maintains near-lossless reasoning accuracy with a 4K token budget on the AIME benchmark under large sparse attention block sizes (64/128). Using TileLang, we develop a highly optimized sparse decoding kernel that achieves near-theoretical speedups of up to 9x over FlashAttention-3 on an H100 GPU at 90% sparsity. Code is available at: https://github.com/microsoft/SeerAttention.
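To make the abstract's terms concrete, here is a minimal NumPy sketch of one block-sparse decoding step under a token budget. It is not the authors' implementation: the learned, self-distilled gate is replaced by a stand-in heuristic (the query dotted with each block's mean key), and the names `select_blocks`, `block_sparse_decode_step`, and `ideal_speedup` are hypothetical. The sketch only illustrates how a block size and a token budget together determine how many KV blocks a single decoding query attends to, and why 90% sparsity would bound the speedup near 10x if decoding cost scales with the KV entries read.

```python
import numpy as np


def select_blocks(gate_scores, block_size, token_budget):
    """Return sorted indices of the highest-scoring KV blocks within the budget."""
    num_keep = max(1, token_budget // block_size)
    num_keep = min(num_keep, gate_scores.shape[0])
    top = np.argpartition(-gate_scores, num_keep - 1)[:num_keep]
    return np.sort(top)


def block_sparse_decode_step(q, K, V, block_size, token_budget):
    """Attend one query vector to the selected KV blocks only."""
    seq_len, d = K.shape
    num_blocks = -(-seq_len // block_size)  # ceiling division

    # Stand-in gate (assumption): score each block by q . mean(K_block).
    # The paper instead learns this gate by self-distillation.
    scores = np.array([
        q @ K[b * block_size:(b + 1) * block_size].mean(axis=0)
        for b in range(num_blocks)
    ])
    blocks = select_blocks(scores, block_size, token_budget)

    # Gather the token positions covered by the selected blocks.
    idx = np.concatenate([
        np.arange(b * block_size, min((b + 1) * block_size, seq_len))
        for b in blocks
    ])

    # Standard softmax attention restricted to the gathered positions.
    logits = (K[idx] @ q) / np.sqrt(d)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ V[idx], blocks


def ideal_speedup(sparsity):
    """Upper bound on speedup if cost is proportional to KV entries read."""
    return 1.0 / (1.0 - sparsity)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K = rng.standard_normal((32768, 64))
    V = rng.standard_normal((32768, 64))
    q = rng.standard_normal(64)
    out, blocks = block_sparse_decode_step(q, K, V, block_size=64, token_budget=4096)
    print(len(blocks), "blocks attended out of", 32768 // 64)
    print("ideal speedup at 90% sparsity:", round(ideal_speedup(0.9), 2))
```

With a 4K budget and block size 64, the step reads 64 blocks regardless of how long the decoded sequence grows, which is the property a budgeted sparse decoder relies on. A real kernel would skip the unselected blocks on the GPU rather than gather them on the host as done here.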

Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning
cs.CL 2025-10 conditional novelty 7.0

DELTA partitions layers into full, delta, and sparse groups to select salient tokens via aggregated attention scores, matching full-attention accuracy on AIME and GPQA while cutting attended tokens up to 4.25x and ach...

You Only Index Once: Cross-Layer Sparse Attention with Shared Routing
cs.CL 2026-06 unverdicted novelty 6.0

CLSA shares both KV cache and routing indices across decoder layers to amortize top-k selection, delivering up to 7.6x decoding speedup and 17.1x throughput at 128K context while preserving accuracy.

DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention
cs.CL 2026-05 unverdicted novelty 6.0

DashAttention introduces differentiable adaptive sparse hierarchical attention via α-entmax block selection, achieving full-attention accuracy at 75% sparsity with improved Pareto performance over NSA and InfLLMv2.

Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction
cs.LG 2026-05 unverdicted novelty 6.0

A unified learnable KV eviction policy with cross-layer calibration reduces memory and matches or exceeds full-cache performance on long-context tasks by retaining useful tokens and limiting attention dilution.

Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving
cs.LG 2026-04 unverdicted novelty 6.0

SPIN co-designs sparse attention with hierarchical memory to achieve 1.66-5.66x higher throughput, 7-9x lower TTFT, and up to 58% lower TPOT than vLLM and original sparse implementations.

BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding
cs.CL 2025-12 unverdicted novelty 6.0

BLASST dynamically sparsifies attention by thresholding softmax scores to skip blocks, delivering 1.5x speedups at 70%+ sparsity while preserving benchmark accuracy.

An Efficient Hybrid Sparse Attention with CPU-GPU Parallelism for Long-Context Inference
cs.LG 2026-05 unverdicted novelty 5.0

Fluxion achieves 1.5x-3.7x speedup in long-context LLM inference with CPU KV caches while limiting accuracy degradation to at most 0.26 relative to full attention.

LongAct: Harnessing Intrinsic Activation Patterns for Long-Context Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 5.0

LongAct uses saliency from high-magnitude activations to guide sparse weight updates in long-context RL, yielding about 8% gains on LongBench v2 across multiple algorithms.

Understand and Accelerate Memory Processing Pipeline for Large Language Model Inference
cs.DC 2026-03 unverdicted novelty 5.0

Unifying LLM memory optimizations into a Prepare-Compute-Retrieve-Apply pipeline and accelerating it on GPU-FPGA hardware yields up to 2.2x faster inference and 4.7x less energy than GPU-only baselines.