pith. sign in

arxiv: 2602.04789 · v2 · pith:2H2GUTNWnew · submitted 2026-02-04 · 💻 cs.CV

Light Forcing: Accelerating Autoregressive Video Diffusion via Sparse Attention

classification 💻 cs.CV
keywords attentionsparsegenerationmodelschunkforcinglighttextit
0
0 comments X
read the original abstract

Advanced autoregressive (AR) video generation models have improved visual fidelity and interactivity, but the quadratic complexity of attention remains a primary bottleneck for efficient deployment. While existing sparse attention solutions have shown promise on bidirectional models, we identify that applying these solutions to AR models leads to considerable performance degradation for two reasons: isolated consideration of chunk generation and insufficient utilization of past informative context. Motivated by these observations, we propose \textsc{Light Forcing}, the \textit{first} sparse attention solution tailored for AR video generation models. It incorporates a \textit{Chunk-Aware Growth} mechanism to quantitatively estimate the contribution of each chunk, which determines their sparsity allocation. This progressive sparsity increase strategy enables the current chunk to inherit prior knowledge in earlier chunks during generation. Additionally, we introduce a \textit{Hierarchical Sparse Attention} to capture informative historical and local context in a coarse-to-fine manner. Such two-level mask selection strategy (\ie, frame and block level) can adaptively handle diverse attention patterns. Extensive experiments demonstrate that our method outperforms existing sparse attention in quality (\eg, 84.5 on VBench) and efficiency (\eg, $1.2{\sim}1.3\times$ end-to-end speedup). Combined with FP8 quantization and LightVAE, \textsc{Light Forcing} further achieves a $2.3\times$ speedup and 19.7\,FPS on an RTX~5090 GPU. Code will be released at \href{https://github.com/chengtao-lv/LightForcing}{https://github.com/chengtao-lv/LightForcing}.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Echo-Forcing: A Scene Memory Framework for Interactive Long Video Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    Echo-Forcing decouples stable anchors, compressed history, and recent dynamics in video diffusion KV caches using hierarchical memory, scene recall frames, and difference-aware decay to support interactive long video ...

  2. Forcing-KV: Hybrid KV Cache Compression for Efficient Autoregressive Video Diffusion Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Forcing-KV applies head-specific static and dynamic pruning to KV caches in AR video diffusion models, achieving over 29 fps, 30% memory reduction, and up to 2.82x speedup at maintained quality.