
arxiv: 2605.23445 · v1 · pith:4M4TCNHBnew · submitted 2026-05-22 · 💻 cs.CV

DFSAttn: Dynamic Fine-grained Sparse Attention for Efficient Video Generation

Pith reviewed 2026-05-25 04:41 UTC · model grok-4.3

classification 💻 cs.CV
keywords sparse attention · diffusion transformers · video generation · efficient inference · block sparse attention · Hilbert curve

The pith

DFSAttn applies dynamic fine-grained sparse attention to diffusion transformers for video generation, reaching 2.1x end-to-end speedup at high sparsity while preserving quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that standard block sparse attention fails on diffusion transformers because their attention maps are both dynamic across frames and fine-grained within regions. It introduces a training-free framework that reorders tokens, scores blocks hierarchically, and caches masks with adaptive ratios to exploit this sparsity pattern. If the approach holds, video generation models could run at much higher sparsity ratios without the quality collapse seen in prior methods, directly lowering the quadratic cost of 3D attention. The work focuses on inference efficiency rather than retraining, showing consistent gains over existing sparse baselines in experiments.

Core claim

Existing block sparse attention degrades at high sparsity in DiTs because it cannot capture the dynamic and fine-grained sparsity patterns in their attention maps; DFSAttn overcomes this by combining Hilbert curve token reordering for fine-grained yet GPU-efficient sparsity, hierarchical block scoring for accurate importance estimation, and adaptive sparse mask caching, yielding up to 2.1x speedup with maintained generation quality.
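To make the reordering step concrete, here is a minimal sketch of ordering a (frames, height, width) token grid along a 3D Hilbert curve. It uses Skilling's transpose method, a standard way to compute Hilbert indices; whether DFSAttn uses this exact construction is an assumption, and the names `hilbert_index_3d` and `hilbert_order` are hypothetical.

```python
from itertools import product

def hilbert_index_3d(x, y, z, bits):
    """Position of integer point (x, y, z), each in [0, 2**bits), on a 3D Hilbert curve."""
    X = [x, y, z]
    top = 1 << (bits - 1)
    # Undo the per-level rotations and reflections, from the top bit downwards.
    q = top
    while q > 1:
        p = q - 1
        for i in range(3):
            if X[i] & q:
                X[0] ^= p                      # invert the low bits of axis 0
            else:
                t = (X[0] ^ X[i]) & p          # exchange low bits of axis 0 and axis i
                X[0] ^= t
                X[i] ^= t
        q >>= 1
    # Gray-encode across axes.
    for i in range(1, 3):
        X[i] ^= X[i - 1]
    t = 0
    q = top
    while q > 1:
        if X[2] & q:
            t ^= q - 1
        q >>= 1
    X = [v ^ t for v in X]
    # Interleave the bits of the three axes into a single integer.
    index = 0
    for b in range(bits - 1, -1, -1):
        for v in X:
            index = (index << 1) | ((v >> b) & 1)
    return index

def hilbert_order(f, h, w):
    """Permutation that maps raster-ordered tokens to Hilbert order."""
    bits = max(1, (max(f, h, w) - 1).bit_length())
    coords = list(product(range(f), range(h), range(w)))   # raster order
    return sorted(range(len(coords)),
                  key=lambda i: hilbert_index_3d(*coords[i], bits))

order = hilbert_order(4, 4, 4)   # indices into the flattened 4 x 4 x 4 sequence
```

On a power-of-two cube, consecutive tokens in `order` are spatial neighbours, so a contiguous block of the reordered sequence covers a compact 3D region. For grids that are not power-of-two cubes, this sketch sorts by the index in the enclosing cube, which still yields a valid permutation but no longer guarantees that every consecutive pair is adjacent.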

What carries the argument

Hilbert curve-based token reordering combined with hierarchical block scoring and adaptive-ratio mask caching, which together enable dynamic fine-grained sparsification without retraining.
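A rough sketch of how hierarchical block scoring might produce a block mask follows. It is not the authors' implementation: mean pooling, softmax over pooled keys, sum aggregation, and per-row top-k selection are all assumptions, and `select_blocks` with its parameters is a hypothetical name. It requires NumPy.

```python
import numpy as np

def select_blocks(q, k, block=4, sub=2, sparsity=0.5):
    """Return a boolean (num_blocks, num_blocks) mask of key blocks kept per query block."""
    n, d = q.shape
    assert n % block == 0 and block % sub == 0
    # 1) Pool tokens into small sub-blocks (assumed: mean pooling).
    qs = q.reshape(n // sub, sub, d).mean(axis=1)
    ks = k.reshape(n // sub, sub, d).mean(axis=1)
    # 2) Cheap attention estimate between sub-blocks.
    logits = qs @ ks.T / np.sqrt(d)
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    # 3) Aggregate sub-block scores into block scores (assumed: sum).
    r, nb = block // sub, n // block
    scores = probs.reshape(nb, r, nb, r).sum(axis=(1, 3))
    # 4) Keep the highest-scoring key blocks in each query-block row.
    keep = max(1, int(round((1.0 - sparsity) * nb)))
    top = np.argsort(-scores, axis=1)[:, :keep]
    mask = np.zeros((nb, nb), dtype=bool)
    np.put_along_axis(mask, top, True, axis=1)
    return mask

rng = np.random.default_rng(0)
q = rng.standard_normal((16, 8))
k = rng.standard_normal((16, 8))
mask = select_blocks(q, k, block=4, sub=2, sparsity=0.5)   # 4 x 4 mask, 2 blocks kept per row
```

Scoring at sub-block granularity and then aggregating is one way to let a fine-grained peak inside a block raise that block's importance, which a single coarse pooled score per block would tend to average away.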

If this is right

  • Higher sparsity ratios become usable in DiT video models without the quality drop that previously limited acceleration.
  • End-to-end inference time for long video sequences drops by roughly half while output fidelity stays comparable to dense attention.
  • The three design components can be combined modularly with other attention optimizations that preserve token ordering or importance signals.
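The relation between attention sparsity and end-to-end speedup can be sanity-checked with Amdahl's law. The sketch below is illustrative only: the 0.7 share of runtime spent in attention is a hypothetical value, not a figure from the paper, and the 1 / (1 - sparsity) kernel speedup is an idealized upper bound that ignores scoring and reordering overhead.

```python
def ideal_attention_speedup(sparsity):
    """Upper bound on attention-kernel speedup if skipped blocks cost nothing."""
    return 1.0 / (1.0 - sparsity)

def end_to_end_speedup(attn_fraction, attn_speedup):
    """Amdahl's law: only the attention share of the runtime is accelerated."""
    return 1.0 / ((1.0 - attn_fraction) + attn_fraction / attn_speedup)

# Hypothetical numbers: attention takes 70% of runtime, 80% of blocks are skipped.
kernel = ideal_attention_speedup(0.8)        # about 5x on attention alone
overall = end_to_end_speedup(0.7, kernel)    # about 2.3x end to end
print(f"kernel {kernel:.2f}x, end-to-end {overall:.2f}x")
```

Under these assumed numbers, even an ideal 5x attention kernel yields only a little over 2x overall, which is why an end-to-end figure near 2x is plausible at high sparsity.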

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same token reordering and hierarchical scoring could extend to image-only diffusion models where attention maps also show non-uniform sparsity.
  • Adaptive mask caching might reduce memory traffic on hardware that supports irregular sparse matrix operations, beyond the reported GPU speedups.
  • If the lower-bound analysis on attention recall generalizes, it offers a way to predict the minimum sparsity a given model can tolerate before quality degrades.
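For readers who want to probe the recall argument empirically, one way to measure attention recall for a given block mask is sketched below. The definition used here (the share of softmax attention mass that falls inside kept blocks, averaged over queries) is an assumption, since the abstract does not give the paper's exact definition; `attention_recall` is a hypothetical helper and requires NumPy.

```python
import numpy as np

def attention_recall(q, k, block_mask, block):
    """Mean fraction of each query's attention mass retained by a block mask."""
    n, d = q.shape
    logits = q @ k.T / np.sqrt(d)
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    # Expand the block-level mask to token resolution.
    token_mask = np.repeat(np.repeat(block_mask, block, axis=0), block, axis=1)
    return float((probs * token_mask).sum(axis=1).mean())

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 4))
k = rng.standard_normal((8, 4))
diagonal = np.eye(2, dtype=bool)                     # keep only the diagonal blocks
print(attention_recall(q, k, diagonal, block=4))
```

Sweeping the mask's sparsity and plotting recall against a quality metric would show where a given model starts to degrade.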

Load-bearing premise

Attention maps inside diffusion transformers for video contain inherent dynamic fine-grained sparsity that block-based methods cannot exploit without quality loss.

What would settle it

Measure generation quality with metrics such as FVD or CLIP similarity at 80-90% sparsity on standard video benchmarks. If DFSAttn's quality falls clearly below that of full attention while prior block-sparse methods hold up at the same sparsity, the claim that it handles dynamic sparsity better is falsified.

Figures

Figures reproduced from arXiv: 2605.23445 by Jie Hu, Kun Yuan, Yutong He, Zixiang Gao.

Figure 1. Overview of DFSAttn: a) Video tokens are reordered using a 3D Hilbert curve, and sub-block attention scores are aggregated to estimate block-wise importance. The resulting masks are applied via sparse Flash Attention, while cross-attention remains dense to preserve text–video alignment. b) Sparse masks are cached and reused across diffusion timesteps. c) Structured block sparsity in the reordered sequence …
Figure 2. 3D full attention maps in DiTs exhibit dynamic and fine-grained sparsity patterns.
… (f, h, w), corresponding to the number of frames, height, and width, respectively. This 3D latent is then flattened into a single token sequence before being fed into the transformer. Consequently, the input sequence length for each transformer block is N = f × h × w. For simplicity, we omit text-conditioning tokens here, a…
Figure 3. The mean attention recall (solid line) rises monotonically across diffusion steps, with low variance across various samples (shaded region).
4. Motivation. In this section, we identify the intrinsic sparsity patterns of attention in diffusion transformers and show that the effectiveness of block sparse attention is fundamentally governed by the statistical structure of attention scores. To formalize this c…
Figure 4. The 3D Hilbert curve in 4 × 4 × 4 space.
Figure 5. Examples of videos generated by DFSAttn and other baselines on Wan2.1-T2V-14B.
Figure 6. PSNR (solid, left) and latency (dashed, right) vs. overall sparsity. Main results are evaluated under adaptive 80% sparsity.
Figure 7. Video generation examples from dense attention and DFSAttn on HunyuanVideo.
Figure 8. Video generation examples from dense attention and DFSAttn on Wan2.1-T2V-14B.
Original abstract

Diffusion transformers have achieved remarkable success in high-quality video generation, yet their reliance on spatiotemporal 3D full attention incurs prohibitive computational cost due to the quadratic complexity of attention. Block sparse attention is a common approach to mitigate this by focusing computation on important regions. However, attention maps in DiTs exhibit inherently dynamic and fine-grained sparsity, which causes existing block sparse attention methods to degrade significantly in quality, especially at high sparsity ratios. In this paper, we revisit block sparse attention and derive a theoretical lower bound on attention recall to characterize the key factors governing its effectiveness. Guided by these insights, we propose DFSAttn, a training-free sparse attention framework that enables dynamic, fine-grained sparsification efficiently. DFSAttn incorporates three core designs: Hilbert curve-based token reordering to achieve fine-grained sparsity while preserving efficient GPU execution, hierarchical block scoring for accurate block importance estimation, and sparse mask caching with adaptive ratios to balance accuracy and efficiency. Experimental results demonstrate that DFSAttn consistently outperforms prior methods under high sparsity, achieving up to 2.1× end-to-end speedup while maintaining high generation quality. Our code is open-sourced and available at https://github.com/jessica-hujie/DFSAttn.
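The mask-caching idea in the abstract can be sketched as a small cache that recomputes the block mask only every few diffusion steps and asks for a step-dependent sparsity. Everything specific here is an assumption made for illustration: the fixed refresh interval, the linear ramp from 0.6 to 0.9, and the names `MaskCache` and `sparsity_at` are hypothetical, and the paper's actual caching policy and ratio schedule are not described in the abstract.

```python
from typing import Callable, List, Optional

Mask = List[List[bool]]

def sparsity_at(step: int, total_steps: int,
                start: float = 0.6, end: float = 0.9) -> float:
    """Hypothetical schedule: sparsity ramps linearly from start to end."""
    if total_steps <= 1:
        return end
    return start + (end - start) * step / (total_steps - 1)

class MaskCache:
    """Reuse a block mask across steps; recompute every refresh_every steps."""

    def __init__(self, compute_mask: Callable[[float], Mask],
                 refresh_every: int = 5) -> None:
        self.compute_mask = compute_mask
        self.refresh_every = refresh_every
        self._mask: Optional[Mask] = None
        self._computed_at = -1

    def get(self, step: int, sparsity: float) -> Mask:
        stale = (self._mask is None
                 or step - self._computed_at >= self.refresh_every)
        if stale:
            # Only here is the (comparatively expensive) scoring pass paid for.
            self._mask = self.compute_mask(sparsity)
            self._computed_at = step
        return self._mask

# Usage with a stand-in scoring function that records when it is called.
calls: List[float] = []
def fake_scoring(sparsity: float) -> Mask:
    calls.append(sparsity)
    return [[True]]

cache = MaskCache(fake_scoring, refresh_every=5)
for step in range(20):
    cache.get(step, sparsity_at(step, 20))
print(len(calls))   # scoring runs at steps 0, 5, 10 and 15 only
```

The design trade-off is between scoring cost and mask staleness: a longer refresh interval saves more scoring passes but reuses a mask that may no longer match the current attention pattern.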

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 4 minor

Summary. The paper claims that attention maps in diffusion transformers (DiTs) for video generation exhibit dynamic fine-grained sparsity that defeats standard block-sparse methods at high sparsity ratios. It derives a theoretical lower bound on attention recall, then introduces the training-free DFSAttn framework consisting of Hilbert-curve token reordering, hierarchical block scoring, and adaptive sparse-mask caching. Experiments reportedly show consistent outperformance over prior sparse-attention baselines, with up to 2.1× end-to-end speedup while preserving generation quality; code is released.

Significance. If the lower-bound derivation is correct and the reported speed/quality trade-off holds under rigorous evaluation, the work would be a useful practical advance for accelerating high-resolution video diffusion models. The training-free design and open-sourced implementation are concrete strengths that lower the barrier to adoption and verification.

minor comments (4)
  1. The abstract states a lower bound on recall but does not indicate whether the bound is tight or how the three proposed mechanisms are shown to approach it; a short proof sketch or numerical verification in the main text would strengthen the motivation section.
  2. Generation-quality claims should be supported by standard video metrics (FVD, CLIP similarity, human preference) with error bars over multiple seeds; the current description only mentions “high generation quality.”
  3. The hierarchical scoring and adaptive-ratio caching mechanisms would benefit from an ablation table isolating each component’s contribution to the final speedup/quality numbers.
  4. Figure captions and algorithm boxes should explicitly state the sparsity ratios at which each method is evaluated so readers can directly compare the reported 2.1× figure.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, recognition of the training-free design and open-sourced code as strengths, and the recommendation for minor revision. The referee's description of the paper is accurate.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper derives a theoretical lower bound on attention recall from the properties of block sparse attention in DiTs, then designs DFSAttn (Hilbert reordering, hierarchical scoring, mask caching) guided by that bound. No equation or claim reduces a 'prediction' or result to a fitted parameter, self-defined quantity, or self-citation chain by construction. The method is explicitly training-free, and performance claims (speedup and quality) are externally measurable rather than tautological with the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review performed on abstract only; no full text available to enumerate additional free parameters, axioms, or entities.

axioms (1)
  • domain assumption: Attention maps in DiTs exhibit inherently dynamic and fine-grained sparsity.
    Explicitly stated in the abstract as the reason existing block-sparse methods degrade at high sparsity ratios.

pith-pipeline@v0.9.0 · 5752 in / 1191 out tokens · 19765 ms · 2026-05-25T04:41:46.461132+00:00 · methodology


