
arxiv: 2506.04108 · v2 · pith:OW7XPX7Knew · submitted 2025-06-04 · 💻 cs.CL

Rectified Sparse Attention

classification 💻 cs.CL
keywords ReSA, attention, generation, sparse, cache, decoding, dense, efficiency

Efficient long-sequence generation is a critical challenge for Large Language Models. While recent sparse decoding methods improve efficiency, they suffer from KV cache misalignment, where approximation errors accumulate and degrade generation quality. In this work, we propose Rectified Sparse Attention (ReSA), a simple yet effective method that combines block-sparse attention with periodic dense rectification. By refreshing the KV cache at fixed intervals using a dense forward pass, ReSA bounds error accumulation and preserves alignment with the pretraining distribution. Experiments across math reasoning, language modeling, and retrieval tasks demonstrate that ReSA achieves near-lossless generation quality with significantly improved efficiency. Notably, ReSA delivers up to 2.42× end-to-end speedup when decoding at a 256K sequence length, making it a practical solution for scalable long-context inference. Code is available at https://aka.ms/ReSA-LM.
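
The idea of sparse decoding plus periodic dense rectification can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not give the block-selection rule, the block size, or the rectification interval, so everything below is an assumption made for illustration. The two-layer model, the mean-key block scoring in `select_blocks`, and the parameters `block`, `k_blocks`, and `rectify_every` are all hypothetical. The input sequence is fixed in advance rather than generated, which keeps the comparison with a dense run simple.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(q, K, V, keep=None):
    """Single-query attention; `keep` optionally restricts it to some cache rows."""
    if keep is not None:
        K, V = K[keep], V[keep]
    return softmax(K @ q / np.sqrt(q.size)) @ V

def select_blocks(q, K, block, k_blocks):
    """Hypothetical selection rule: score each block by its mean key,
    keep the top-k blocks and always keep the most recent block."""
    n = K.shape[0]
    n_blocks = -(-n // block)  # ceiling division
    means = np.stack([K[b * block:(b + 1) * block].mean(axis=0)
                      for b in range(n_blocks)])
    top = np.argsort(means @ q)[::-1][:k_blocks].tolist()
    chosen = sorted(set(top) | {n_blocks - 1})
    return np.array([i for b in chosen
                     for i in range(b * block, min((b + 1) * block, n))])

def decode(X, W, block=2, k_blocks=1, sparse=True, rectify_every=0):
    """Build the layer-2 KV cache token by token (toy, fixed inputs X).

    Layer-1 keys/values depend only on the inputs, so they are always exact.
    Layer-2 keys/values depend on the layer-1 attention output, so sparse
    attention in layer 1 leaves approximate entries in the layer-2 cache.
    """
    Wq, Wk1, Wv1, Wk2, Wv2 = W
    K1, V1 = X @ Wk1, X @ Wv1
    K2, V2 = np.zeros_like(X), np.zeros_like(X)

    def layer2_kv(t, use_sparse):
        q = X[t] @ Wq
        keep = select_blocks(q, K1[:t + 1], block, k_blocks) if use_sparse else None
        h = X[t] + attend(q, K1[:t + 1], V1[:t + 1], keep)
        return h @ Wk2, h @ Wv2

    for t in range(X.shape[0]):
        K2[t], V2[t] = layer2_kv(t, sparse)
        # Rectification: every `rectify_every` tokens, recompute the most
        # recent window with dense attention and overwrite its cache entries.
        if sparse and rectify_every and (t + 1) % rectify_every == 0:
            for s in range(t + 1 - rectify_every, t + 1):
                K2[s], V2[s] = layer2_kv(s, False)
    return K2, V2

rng = np.random.default_rng(0)
d, n = 8, 16
W = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(5)]
X = rng.standard_normal((n, d))

dense_K2, _ = decode(X, W, sparse=False)
sparse_K2, _ = decode(X, W, sparse=True)
resa_K2, _ = decode(X, W, sparse=True, rectify_every=4)

print("sparse-only cache error:", np.abs(sparse_K2 - dense_K2).max())
print("rectified cache error:  ", np.abs(resa_K2 - dense_K2).max())
```

In this toy the rectified cache ends up identical to the dense one because the sequence length is a multiple of the interval and the inputs are fixed. In real generation the tokens inside each window are still produced from sparse attention, so the rectification step limits how long cache errors can persist rather than removing every approximation. A real system would also run the rectification as one batched dense forward pass over the window instead of the per-token loop used here.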

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. You Only Index Once: Cross-Layer Sparse Attention with Shared Routing

    cs.CL 2026-06 unverdicted novelty 6.0

    CLSA shares both KV cache and routing indices across decoder layers to amortize top-k selection, delivering up to 7.6x decoding speedup and 17.1x throughput at 128K context while preserving accuracy.

  2. BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding

    cs.CL 2025-12 unverdicted novelty 6.0

    BLASST dynamically sparsifies attention by thresholding softmax scores to skip blocks, delivering 1.5x speedups at 70%+ sparsity while preserving benchmark accuracy.

  3. Flashlight: PyTorch Compiler Extensions to Accelerate Attention Variants

    cs.LG 2025-11 unverdicted novelty 6.0

    Flashlight is a compiler-native PyTorch framework that generates efficient fused kernels for arbitrary and data-dependent attention variants, supporting more cases than FlexAttention with competitive performance.