
arxiv: 2506.08889 · v1 · pith:QN5FTDJAnew · submitted 2025-06-10 · 💻 cs.LG · cs.AI

SeerAttention-R: Sparse Attention Adaptation for Long Reasoning

keywords: seerattention-r, attention, sparse, decoding, reasoning, gating, long, seerattention

We introduce SeerAttention-R, a sparse attention framework specifically tailored for the long decoding of reasoning models. Extended from SeerAttention, SeerAttention-R retains the design of learning attention sparsity through a self-distilled gating mechanism, while removing query pooling to accommodate auto-regressive decoding. With a lightweight plug-in gating mechanism, SeerAttention-R is flexible and can be easily integrated into existing pretrained models without modifying the original parameters. We demonstrate that SeerAttention-R, trained on just 0.4B tokens, maintains near-lossless reasoning accuracy with a 4K token budget on the AIME benchmark under large sparse attention block sizes (64/128). Using TileLang, we develop a highly optimized sparse decoding kernel that achieves near-theoretical speedups of up to 9x over FlashAttention-3 on an H100 GPU at 90% sparsity. Code is available at: https://github.com/microsoft/SeerAttention.
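
To make the token-budget and block-size terms in the abstract concrete, here is a minimal pure-Python sketch of block-sparse decoding for a single query. It is an illustration only, not the authors' gating network or TileLang kernel: the function names (`select_blocks`, `sparse_decode_attention`, `ideal_speedup`) are hypothetical, the gate scores are assumed to be given as one number per KV block, and the speedup bound assumes decode attention cost scales linearly with the number of KV tokens read.

```python
import math

def select_blocks(gate_scores, token_budget, block_size):
    """Return sorted indices of the highest-scoring KV blocks that fit the budget.

    gate_scores: one score per KV block (assumed to come from some learned gate).
    """
    num_keep = max(1, token_budget // block_size)
    ranked = sorted(range(len(gate_scores)),
                    key=lambda i: gate_scores[i], reverse=True)
    return sorted(ranked[:num_keep])

def sparse_decode_attention(query, keys, values, kept_blocks, block_size):
    """Softmax attention for one decode-step query over the kept blocks only."""
    scale = 1.0 / math.sqrt(len(query))
    # Expand kept block indices into the token positions they cover.
    token_ids = [t for b in kept_blocks
                 for t in range(b * block_size,
                                min((b + 1) * block_size, len(keys)))]
    logits = [scale * sum(q * k for q, k in zip(query, keys[t]))
              for t in token_ids]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * values[t][d] for w, t in zip(weights, token_ids)) / total
            for d in range(dim)]

def ideal_speedup(sparsity):
    """Upper bound on speedup if cost is proportional to KV tokens read (assumption)."""
    return 1.0 / (1.0 - sparsity)
```

With a 4K token budget and a block size of 64, `select_blocks` keeps 4096 // 64 = 64 blocks per query. Under the linear-cost assumption, 90% sparsity bounds the speedup at roughly 10x, which is the sense in which the reported 9x can be read as near-theoretical.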



Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning

    cs.CL 2025-10 conditional novelty 7.0

    DELTA partitions layers into full, delta, and sparse groups to select salient tokens via aggregated attention scores, matching full-attention accuracy on AIME and GPQA while cutting attended tokens up to 4.25x and ach...

  2. You Only Index Once: Cross-Layer Sparse Attention with Shared Routing

    cs.CL 2026-06 unverdicted novelty 6.0

    CLSA shares both KV cache and routing indices across decoder layers to amortize top-k selection, delivering up to 7.6x decoding speedup and 17.1x throughput at 128K context while preserving accuracy.

  3. DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention

    cs.CL 2026-05 unverdicted novelty 6.0

    DashAttention introduces differentiable adaptive sparse hierarchical attention via α-entmax block selection, achieving full-attention accuracy at 75% sparsity with improved Pareto performance over NSA and InfLLMv2.

  4. Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction

    cs.LG 2026-05 unverdicted novelty 6.0

    A unified learnable KV eviction policy with cross-layer calibration reduces memory and matches or exceeds full-cache performance on long-context tasks by retaining useful tokens and limiting attention dilution.

  5. Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving

    cs.LG 2026-04 unverdicted novelty 6.0

    SPIN co-designs sparse attention with hierarchical memory to achieve 1.66-5.66x higher throughput, 7-9x lower TTFT, and up to 58% lower TPOT than vLLM and original sparse implementations.

  6. BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding

    cs.CL 2025-12 unverdicted novelty 6.0

    BLASST dynamically sparsifies attention by thresholding softmax scores to skip blocks, delivering 1.5x speedups at 70%+ sparsity while preserving benchmark accuracy.

  7. An Efficient Hybrid Sparse Attention with CPU-GPU Parallelism for Long-Context Inference

    cs.LG 2026-05 unverdicted novelty 5.0

    Fluxion achieves 1.5x-3.7x speedup in long-context LLM inference with CPU KV caches while limiting accuracy degradation to at most 0.26 relative to full attention.

  8. LongAct: Harnessing Intrinsic Activation Patterns for Long-Context Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 5.0

    LongAct uses saliency from high-magnitude activations to guide sparse weight updates in long-context RL, yielding about 8% gains on LongBench v2 across multiple algorithms.

  9. Understand and Accelerate Memory Processing Pipeline for Large Language Model Inference

    cs.DC 2026-03 unverdicted novelty 5.0

    Unifying LLM memory optimizations into a Prepare-Compute-Retrieve-Apply pipeline and accelerating it on GPU-FPGA hardware yields up to 2.2x faster inference and 4.7x less energy than GPU-only baselines.