Rectified Sparse Attention

Furu Wei; Jian Chen; Jianyong Wang; Li Dong; Shijie Cao; Tianzhu Ye; Yizhao Gao; Yuqing Xia; Yutao Sun

arxiv: 2506.04108 · v2 · pith:OW7XPX7Knew · submitted 2025-06-04 · 💻 cs.CL

classification 💻 cs.CL

keywords resa, attention, generation, sparse, cache, decoding, dense, efficiency

Abstract

Efficient long-sequence generation is a critical challenge for Large Language Models. While recent sparse decoding methods improve efficiency, they suffer from KV cache misalignment, where approximation errors accumulate and degrade generation quality. In this work, we propose Rectified Sparse Attention (ReSA), a simple yet effective method that combines block-sparse attention with periodic dense rectification. By refreshing the KV cache at fixed intervals using a dense forward pass, ReSA bounds error accumulation and preserves alignment with the pretraining distribution. Experiments across math reasoning, language modeling, and retrieval tasks demonstrate that ReSA achieves near-lossless generation quality with significantly improved efficiency. Notably, ReSA delivers up to 2.42$\times$ end-to-end speedup under decoding at 256K sequence length, making it a practical solution for scalable long-context inference. Code is available at https://aka.ms/ReSA-LM.
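
The control flow described in the abstract (sparse decoding with a dense refresh of the KV cache at fixed intervals) can be sketched as a simple schedule. This is only an illustrative sketch, not the authors' implementation: the function names `resa_schedule` and `max_unrectified`, the operation labels, and the interval values are all hypothetical, and the actual attention kernels are omitted. It shows why a fixed interval bounds the number of consecutive tokens whose cache entries come from the sparse approximation.

```python
from typing import List, Tuple

Op = Tuple[str, int, int]  # (operation, start, end) over generated-token indices [start, end)


def resa_schedule(num_new_tokens: int, interval: int) -> List[Op]:
    """Toy schedule: sparse decode every token, dense-rectify every `interval` tokens.

    Hypothetical helper for illustration; the interval is an assumed hyperparameter.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    ops: List[Op] = []
    pending_start = 0  # first generated token whose KV entries are not yet rectified
    for t in range(num_new_tokens):
        # Each new token is produced with (block-)sparse attention.
        ops.append(("sparse_decode", t, t + 1))
        if t + 1 - pending_start == interval:
            # A dense forward pass over the pending tokens refreshes their KV entries.
            ops.append(("dense_rectify", pending_start, t + 1))
            pending_start = t + 1
    return ops


def max_unrectified(ops: List[Op]) -> int:
    """Largest number of consecutive sparse steps between two dense refreshes."""
    worst = current = 0
    for op, _start, _end in ops:
        if op == "sparse_decode":
            current += 1
            worst = max(worst, current)
        else:
            current = 0
    return worst


if __name__ == "__main__":
    for op in resa_schedule(5, 2):
        print(op)
    print(max_unrectified(resa_schedule(100, 16)))
```

In this toy schedule the number of unrectified tokens never exceeds the interval, which mirrors the abstract's claim that periodic rectification bounds error accumulation; how the interval trades speed against quality is a question for the paper itself.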

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

You Only Index Once: Cross-Layer Sparse Attention with Shared Routing
cs.CL · 2026-06 · unverdicted · novelty 6.0

CLSA shares both KV cache and routing indices across decoder layers to amortize top-k selection, delivering up to 7.6x decoding speedup and 17.1x throughput at 128K context while preserving accuracy.

BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding
cs.CL · 2025-12 · unverdicted · novelty 6.0

BLASST dynamically sparsifies attention by thresholding softmax scores to skip blocks, delivering 1.5x speedups at 70%+ sparsity while preserving benchmark accuracy.

Flashlight: PyTorch Compiler Extensions to Accelerate Attention Variants
cs.LG · 2025-11 · unverdicted · novelty 6.0

Flashlight is a compiler-native PyTorch framework that generates efficient fused kernels for arbitrary and data-dependent attention variants, supporting more cases than FlexAttention with competitive performance.