FG-Attn: Leveraging Fine-Grained Sparse Attention in Video Diffusion Models

Ashish Gondimalla; Kavya Sreedhar; Nandita Vijaykumar; Narges Shahidi; Sankeerth Durvasula; Suraj Kothawade; Suvinay Subramanian; Tianlei Pang; Zain Moustafa

arxiv: 2509.16518 · v2 · pith:ZXPWT5DDnew · submitted 2025-09-20 · 💻 cs.CV · cs.AR

Sankeerth Durvasula, Kavya Sreedhar, Zain Moustafa, Suraj Kothawade, Tianlei Pang, Ashish Gondimalla, Suvinay Subramanian, Narges Shahidi, Nandita Vijaykumar

keywords: attention, fg-attn, sparse, generation, diffusion, fine-grained, methods, sparsity

Abstract

Using diffusion transformers for media generation may require evaluating attention over extremely long sequences, with attention layers accounting for the majority of generation latency. Exploiting sparsity in attention maps offers a promising opportunity to reduce this cost. In this work, we show that attention maps in the diffusion transformers of video generation models exhibit significant fine-grained sparsity. Existing sparse attention methods, however, are either too coarse-grained, leaving a large fraction of redundant computation unaddressed, or incur high overheads at finer granularity. We propose FG-Attn, a novel, low-overhead fine-grained sparse attention mechanism that skips score computations at the granularity of an M×N tile, where M>=16 and N>=1, and where each tile holds the query-key dot products between M queries and N keys. FG-Attn addresses the key challenge of hardware underutilization in sparse attention kernels on GPUs, without incurring the overheads of irregular memory access and redundant operations. FG-Attn can fully supersede existing sparse attention methods and extend block sparse attention methods to finer granularities on modern GPUs. At 70% sparsity, FG-Attn is up to 2.45X faster than the state-of-the-art FlashInfer, and reduces attention kernel time by 14.7% on average. FG-Attn speeds up end-to-end video generation by up to 1.40X (1.18X on average) over Flash Attention 3.
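To make the tile granularity concrete, here is a minimal reference sketch in plain Python. It is not the FG-Attn GPU kernel and says nothing about how the paper selects tiles or schedules work on hardware; it only illustrates what skipping an M×N tile of scores means numerically. The function names, the `keep` mask layout, and the toy sizes are all assumptions made for illustration. The sketch assumes every block of M queries keeps at least one tile, and it uses a standard running-softmax accumulation so that skipped tiles are never computed.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def dense_masked_attention(Q, K, V, keep, M, N):
    """Reference: compute every score, then mask skipped tiles to -inf."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for i, q in enumerate(Q):
        scores = [scale * dot(q, k) if keep[i // M][j // N] else -math.inf
                  for j, k in enumerate(K)]
        top = max(scores)
        w = [math.exp(s - top) for s in scores]
        total = sum(w)
        out.append([sum(w[j] * V[j][c] for j in range(len(K))) / total
                    for c in range(len(V[0]))])
    return out


def tile_sparse_attention(Q, K, V, keep, M, N):
    """Hypothetical sketch: visit only the M x N score tiles flagged in keep.

    keep[t][c] is True when the tile pairing query block t (M queries)
    with key block c (N keys) must be computed. Returns the outputs and
    the number of tiles actually evaluated.
    """
    scale = 1.0 / math.sqrt(len(Q[0]))
    dv = len(V[0])
    out, computed = [], 0
    for qb in range(0, len(Q), M):
        rows = range(qb, min(qb + M, len(Q)))
        run_max = [-math.inf] * len(rows)   # running row maximum
        run_sum = [0.0] * len(rows)         # running softmax denominator
        acc = [[0.0] * dv for _ in rows]    # running weighted sum of V
        for kb in range(0, len(K), N):
            if not keep[qb // M][kb // N]:
                continue                    # whole tile skipped: no dot products
            computed += 1
            cols = range(kb, min(kb + N, len(K)))
            for r, i in enumerate(rows):
                s = [scale * dot(Q[i], K[j]) for j in cols]
                new_max = max(run_max[r], max(s))
                corr = math.exp(run_max[r] - new_max)  # rescale old partial sums
                p = [math.exp(x - new_max) for x in s]
                run_sum[r] = run_sum[r] * corr + sum(p)
                acc[r] = [a * corr + sum(pj * V[j][c] for pj, j in zip(p, cols))
                          for c, a in enumerate(acc[r])]
                run_max[r] = new_max
        for r in range(len(rows)):
            out.append([a / run_sum[r] for a in acc[r]])
    return out, computed


# Toy problem (sizes are illustrative assumptions): 32 queries, 8 keys, M=16, N=1.
random.seed(0)
M, N, D = 16, 1, 4
Q = [[random.gauss(0, 1) for _ in range(D)] for _ in range(32)]
K = [[random.gauss(0, 1) for _ in range(D)] for _ in range(8)]
V = [[random.gauss(0, 1) for _ in range(D)] for _ in range(8)]
keep = [[(t + c) % 3 == 0 for c in range(8)] for t in range(2)]

sparse_out, computed = tile_sparse_attention(Q, K, V, keep, M, N)
dense_out = dense_masked_attention(Q, K, V, keep, M, N)
max_err = max(abs(a - b)
              for r1, r2 in zip(sparse_out, dense_out)
              for a, b in zip(r1, r2))
print(computed, "of 16 tiles computed; max abs error:", max_err)
```

In this toy mask 5 of the 16 tiles are kept, so roughly 69% of the score tiles are skipped, close to the 70% sparsity level quoted above. The sketch matches the dense masked reference up to floating-point rounding, which is the property any tile-skipping kernel has to preserve; the speedups reported in the abstract depend on the GPU implementation, which this example does not model.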

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

DFSAttn: Dynamic Fine-grained Sparse Attention for Efficient Video Generation
cs.CV · 2026-05 · unverdicted · novelty 7.0

DFSAttn is a training-free framework for dynamic fine-grained sparse attention in video DiTs that achieves up to 2.1x speedup while preserving generation quality via Hilbert reordering, hierarchical scoring, and adapt...

Efficient Video Diffusion Models: Advancements and Challenges
cs.CV · 2026-04 · unverdicted · novelty 7.0

A survey that groups efficient video diffusion methods into four paradigms (step distillation, efficient attention, model compression, and cache/trajectory optimization) and outlines open challenges for practical use.

Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms
eess.IV · 2026-03 · unverdicted · novelty 6.0

Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.