
arxiv: 2508.18224 · v3 · pith:SO62FFJJnew · submitted 2025-08-25 · 💻 cs.DC · cs.LG

FSA: An Alternative Efficient Implementation of Native Sparse Attention Kernel

keywords: attention, sparse, implementation, kernel, llms, average, efficient, group

Recent advances in sparse attention mechanisms have demonstrated strong potential for reducing the computational cost of long-context training and inference in large language models (LLMs). Native Sparse Attention (NSA), one state-of-the-art approach, introduces natively trainable, hardware-aligned sparse attention that delivers substantial system-level performance boosts while maintaining accuracy comparable to full attention. However, the kernel implementation of NSA forces a loop order that is only efficient with a relatively large number of query heads in each Grouped Query Attention (GQA) group, whereas existing LLMs widely adopt a much smaller number of query heads in each GQA group. This mismatch significantly limits the applicability of this sparse algorithmic advance. In this work, we propose Flash Sparse Attention (FSA), an alternative kernel implementation that enables efficient NSA computation across a wide range of popular LLMs with a varied, smaller number of heads in each GQA group on modern GPUs. Compared to the vanilla NSA kernel implementation, our empirical evaluation demonstrates that FSA achieves (i) up to 3.5x and on average 1.6x kernel-level latency reduction, (ii) up to 1.25x and on average 1.09x end-to-end training speedup on state-of-the-art LLMs, and (iii) up to 1.36x and on average 1.11x prefill-phase speedup in LLM generative inference. The source code is publicly available at https://github.com/Relaxed-System-Lab/Flash-Sparse-Attention.
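
To make the GQA group-size point concrete, the following is a minimal Python sketch, not the NSA or FSA kernel. It computes the number of query heads per GQA group and, under an assumed minimum tile height, the fraction of a tile that holds real query heads when a kernel batches one group at a time. The function names and the `min_tile_rows=16` value are hypothetical choices for illustration; the abstract does not state a tile size or describe the kernel's padding behavior.

```python
def gqa_group_size(num_query_heads: int, num_kv_heads: int) -> int:
    """Number of query heads that share one key/value head in GQA."""
    if num_kv_heads <= 0 or num_query_heads % num_kv_heads != 0:
        raise ValueError("num_query_heads must be a positive multiple of num_kv_heads")
    return num_query_heads // num_kv_heads


def tile_utilization(group_size: int, min_tile_rows: int = 16) -> float:
    """Fraction of tile rows occupied by real query heads.

    Assumption (illustrative only): the kernel processes the query heads of
    one GQA group as the rows of a tile and pads the row count up to a
    multiple of `min_tile_rows`. The value 16 is hypothetical.
    """
    if group_size <= 0 or min_tile_rows <= 0:
        raise ValueError("group_size and min_tile_rows must be positive")
    # Round the row count up to the next multiple of min_tile_rows.
    padded_rows = -(-group_size // min_tile_rows) * min_tile_rows
    return group_size / padded_rows


if __name__ == "__main__":
    # Hypothetical head configurations, not taken from the paper.
    for q_heads, kv_heads in [(64, 4), (32, 8), (32, 16)]:
        g = gqa_group_size(q_heads, kv_heads)
        print(f"q={q_heads} kv={kv_heads} group={g} utilization={tile_utilization(g):.2f}")
```

Under these assumptions, a group of 16 query heads fills the tile completely, while a group of 4 fills only a quarter of it, which is one way to picture why a loop order tied to the group size can lose efficiency when groups are small.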



Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. HexAGenT: Efficient Agentic LLM Serving via Workflow- and Heterogeneity-Aware Scheduling

    cs.DC 2026-05 unverdicted novelty 7.0

    HexAGenT reduces the SLO scale required for timely agentic LLM workflow completion by an average of 20.1% at 95% attainment and 33.0% at 99% attainment on heterogeneous A100/H100/H200 clusters.

  2. MiniMax Sparse Attention

    cs.AI 2026-06 unverdicted novelty 6.0

    MiniMax Sparse Attention is a GQA-based block-sparse attention mechanism that selects top-k blocks independently per group and delivers 28.4x per-token compute reduction at 1M context with on-par performance plus 14.2...

  3. Sparrow: Sparse Rollout for Stable and Efficient Long-context RL of Large Language Models

    cs.LG 2026-06 unverdicted novelty 6.0

    Sparrow uses a dynamic sparsity schedule keyed to the lower tail of sparse-to-dense actor-policy mismatch to enable stable and faster rollouts in long-context RL for LLMs.

  4. HexiSeq: Accommodating Long Context Training of LLMs over Heterogeneous Hardware

    cs.DC 2026-05 unverdicted novelty 6.0

    HexiSeq optimizes sequence and head partitioning across mixed GPUs to improve long-context LLM training throughput by up to 1.72x in simulations.

  5. Rethinking Data Curation in LLM Training: Online Reweighting Offers Better Generalization than Offline Methods

    cs.LG 2026-04 unverdicted novelty 5.0

    ADAPT is an online reweighting framework for LLM training that outperforms offline data selection and mixing methods in cross-benchmark generalization under equal compute.