DFSAttn: Dynamic Fine-grained Sparse Attention for Efficient Video Generation
Pith reviewed 2026-05-25 04:41 UTC · model grok-4.3
The pith
DFSAttn applies dynamic fine-grained sparse attention to diffusion transformers for video generation, reaching up to 2.1x end-to-end speedup at high sparsity while preserving quality.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Existing block sparse attention degrades at high sparsity in DiTs because it cannot capture the dynamic and fine-grained sparsity patterns in their attention maps; DFSAttn overcomes this by combining Hilbert curve token reordering for fine-grained yet GPU-efficient sparsity, hierarchical block scoring for accurate importance estimation, and adaptive sparse mask caching, yielding up to 2.1x speedup with maintained generation quality.
What carries the argument
Hilbert curve-based token reordering combined with hierarchical block scoring and adaptive-ratio mask caching, which together enable dynamic fine-grained sparsification without retraining.
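To make the reordering idea concrete, here is a minimal sketch of why a Hilbert ordering helps block sparsity. It is an illustration under stated assumptions, not the paper's implementation: the paper uses a 3D Hilbert curve over video tokens, while this sketch uses the standard 2D Hilbert index on a single frame-like grid, and the names `hilbert_index`, `hilbert_order`, and `block_extent` are hypothetical. It compares the spatial footprint of the first contiguous block of 16 tokens under raster order and under Hilbert order.

```python
def hilbert_index(n, x, y):
    """Map cell (x, y) of an n-by-n grid (n a power of two) to its 2D Hilbert-curve position."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the canonical orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def hilbert_order(n):
    """Return all grid cells sorted by their Hilbert index."""
    cells = [(x, y) for y in range(n) for x in range(n)]
    return sorted(cells, key=lambda c: hilbert_index(n, c[0], c[1]))


def block_extent(cells):
    """Bounding-box (width, height) of a group of cells."""
    xs = [c[0] for c in cells]
    ys = [c[1] for c in cells]
    return (max(xs) - min(xs) + 1, max(ys) - min(ys) + 1)


# Assumption for illustration: an 8x8 token grid and blocks of 16 contiguous tokens.
n, block = 8, 16
raster = [(x, y) for y in range(n) for x in range(n)]
hilbert = hilbert_order(n)

print("raster block extent :", block_extent(raster[:block]))   # (8, 2): a thin strip
print("hilbert block extent:", block_extent(hilbert[:block]))  # (4, 4): a compact square
```

Each block stays contiguous in memory in both orderings, which is what block-sparse GPU kernels need. Under the Hilbert ordering the same block also covers a compact neighborhood rather than a thin strip, which is one plausible reading of how reordering yields finer-grained sparsity without giving up efficient execution.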
If this is right
- Higher sparsity ratios become usable in DiT video models without the quality drop that previously limited acceleration.
- End-to-end inference time for long video sequences drops by up to roughly half (the reported maximum speedup is 2.1x) while output fidelity stays comparable to dense attention.
- The three design components can be combined modularly with other attention optimizations that preserve token ordering or importance signals.
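The link between the reported speedup and the inference-time reduction in the list above is simple arithmetic. The symbols $t_{\text{dense}}$ and $t_{\text{sparse}}$ are notation introduced here for end-to-end inference time with dense and sparse attention; they do not come from the paper.

```latex
\[
\frac{t_{\text{sparse}}}{t_{\text{dense}}} = \frac{1}{2.1} \approx 0.476,
\qquad
1 - \frac{1}{2.1} \approx 0.524 .
\]
```

So at the best reported speedup, inference time falls by about 52 percent; lower speedups correspond to smaller reductions.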
Where Pith is reading between the lines
- The same token reordering and hierarchical scoring could extend to image-only diffusion models where attention maps also show non-uniform sparsity.
- Adaptive mask caching might reduce memory traffic on hardware that supports irregular sparse matrix operations, beyond the reported GPU speedups.
- If the lower-bound analysis on attention recall generalizes, it offers a way to predict the maximum sparsity a given model can tolerate before quality degrades.
Load-bearing premise
Attention maps inside diffusion transformers for video contain inherent dynamic fine-grained sparsity that block-based methods cannot exploit without quality loss.
What would settle it
Measure generation quality with metrics such as FVD or CLIP similarity at 80-90% sparsity on standard video benchmarks. If DFSAttn's quality falls below that of full attention while prior block-sparse methods hold up at the same sparsity, the claim about handling dynamic sparsity is falsified.
Original abstract
Diffusion transformers have achieved remarkable success in high-quality video generation, yet their reliance on spatiotemporal 3D full attention incurs prohibitive computational cost due to the quadratic complexity of attention. Block sparse attention is a common approach to mitigate this by focusing computation on important regions. However, attention maps in DiTs exhibit inherently dynamic and fine-grained sparsity, which causes existing block sparse attention methods to degrade significantly in quality, especially at high sparsity ratios. In this paper, we revisit block sparse attention and derive a theoretical lower bound on attention recall to characterize the key factors governing its effectiveness. Guided by these insights, we propose DFSAttn, a training-free sparse attention framework that enables dynamic, fine-grained sparsification efficiently. DFSAttn incorporates three core designs: Hilbert curve-based token reordering to achieve fine-grained sparsity while preserving efficient GPU execution, hierarchical block scoring for accurate block importance estimation, and sparse mask caching with adaptive ratios to balance accuracy and efficiency. Experimental results demonstrate that DFSAttn consistently outperforms prior methods under high sparsity, achieving up to 2.1$\times$ end-to-end speedup while maintaining high generation quality. Our code is open-sourced and available at https://github.com/jessica-hujie/DFSAttn.
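The abstract's notions of block sparse attention and attention recall can be illustrated with a small, dependency-free sketch. Everything here is an assumption made for illustration: recall is taken to be the fraction of dense softmax attention mass that falls inside the kept blocks, block importance is estimated with plain mean pooling (a common baseline, not the paper's hierarchical block scoring), and all function names are hypothetical.

```python
import math
import random


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def dense_attention_probs(Q, K):
    """Row-wise softmax(Q K^T / sqrt(d)) for small list-of-lists inputs."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    return [softmax([scale * sum(a * b for a, b in zip(q, k)) for k in K]) for q in Q]


def mean_pool(X, block):
    """Average each consecutive group of `block` rows into a single row."""
    pooled = []
    for start in range(0, len(X), block):
        chunk = X[start:start + block]
        pooled.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return pooled


def topk_block_mask(Q, K, block, keep):
    """For each query block, keep the `keep` key blocks with the highest pooled score."""
    Qp, Kp = mean_pool(Q, block), mean_pool(K, block)
    mask = []
    for qb in Qp:
        scores = [sum(a * b for a, b in zip(qb, kb)) for kb in Kp]
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        mask.append(set(ranked[:keep]))
    return mask


def attention_recall(P, mask, block):
    """Fraction of dense attention mass that lies inside the kept key blocks."""
    kept = 0.0
    for i, row in enumerate(P):
        for j, p in enumerate(row):
            if j // block in mask[i // block]:
                kept += p
    return kept / len(P)  # every row of P sums to 1


# Synthetic random inputs, used only to exercise the functions.
random.seed(0)
n, d, block = 32, 8, 4
Q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
P = dense_attention_probs(Q, K)
for keep in (2, 4, 8):
    mask = topk_block_mask(Q, K, block, keep)
    print(f"keep {keep}/8 key blocks -> recall {attention_recall(P, mask, block):.3f}")
```

Recall reaches 1 when every block is kept and can only fall as fewer blocks are kept. How quickly it falls depends on how well the block scores track where the attention mass actually sits, which is the gap a better scoring scheme aims to close.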
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that attention maps in diffusion transformers (DiTs) for video generation exhibit dynamic fine-grained sparsity that defeats standard block-sparse methods at high sparsity ratios. It derives a theoretical lower bound on attention recall, then introduces the training-free DFSAttn framework consisting of Hilbert-curve token reordering, hierarchical block scoring, and adaptive sparse-mask caching. Experiments reportedly show consistent outperformance over prior sparse-attention baselines, with up to 2.1× end-to-end speedup while preserving generation quality; code is released.
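The mask-caching component named in the summary can be sketched as a small cache that reuses a block mask across denoising steps. This is a deliberately simplified illustration: it uses a fixed refresh interval, it does not model the adaptive ratios the paper describes, and `MaskCache`, `compute_mask`, and `refresh_every` are hypothetical names rather than the paper's API.

```python
class MaskCache:
    """Reuse a block mask across denoising steps, recomputing it every `refresh_every` steps."""

    def __init__(self, compute_mask, refresh_every):
        self.compute_mask = compute_mask    # callable: step -> mask object
        self.refresh_every = refresh_every  # assumption: a fixed interval, not adaptive
        self._mask = None
        self.recomputes = 0

    def get(self, step):
        # Recompute on the first call and on every refresh boundary; otherwise reuse.
        if self._mask is None or step % self.refresh_every == 0:
            self._mask = self.compute_mask(step)
            self.recomputes += 1
        return self._mask


# Stand-in mask function that only records the step at which it ran.
cache = MaskCache(compute_mask=lambda step: {"computed_at": step}, refresh_every=5)
masks = [cache.get(step) for step in range(20)]
print("mask recomputations over 20 steps:", cache.recomputes)  # 4
```

The trade-off is between the cost of recomputing the mask and the risk that a stale mask no longer matches the current attention pattern.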
Significance. If the lower-bound derivation is correct and the reported speed/quality trade-off holds under rigorous evaluation, the work would be a useful practical advance for accelerating high-resolution video diffusion models. The training-free design and open-sourced implementation are concrete strengths that lower the barrier to adoption and verification.
minor comments (4)
- The abstract states a lower bound on recall but does not indicate whether the bound is tight or how the three proposed mechanisms are shown to approach it; a short proof sketch or numerical verification in the main text would strengthen the motivation section.
- Generation-quality claims should be supported by standard video metrics (FVD, CLIP similarity, human preference) with error bars over multiple seeds; the current description only mentions “high generation quality.”
- The hierarchical scoring and adaptive-ratio caching mechanisms would benefit from an ablation table isolating each component’s contribution to the final speedup/quality numbers.
- Figure captions and algorithm boxes should explicitly state the sparsity ratios at which each method is evaluated so readers can directly compare the reported 2.1× figure.
Simulated Author's Rebuttal
We thank the referee for the positive summary, recognition of the training-free design and open-sourced code as strengths, and the recommendation for minor revision. The referee's description of the paper is accurate.
Circularity Check
No significant circularity detected
full rationale
The paper derives a theoretical lower bound on attention recall from the properties of block sparse attention in DiTs, then designs DFSAttn (Hilbert reordering, hierarchical scoring, mask caching) guided by that bound. No equation or claim reduces a 'prediction' or result to a fitted parameter, self-defined quantity, or self-citation chain by construction. The method is explicitly training-free, and performance claims (speedup and quality) are externally measurable rather than tautological with the inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Attention maps in DiTs exhibit inherently dynamic and fine-grained sparsity.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "We develop a token representation model and derive a theoretical lower bound on attention recall... three key factors: the sparsity budget, the inter-block similarity gap, and the block-level semantic diversity."
- IndisputableMonolith/Foundation/AlexanderDuality.lean, alexander_duality_circle_linking (echoes)
  echoes: This paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Paper passage: "reorder video tokens using a 3D Hilbert curve... preserves spatiotemporal locality"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.