DFSAttn: Dynamic Fine-grained Sparse Attention for Efficient Video Generation

Jie Hu; Kun Yuan; Yutong He; Zixiang Gao

arxiv: 2605.23445 · v1 · pith:4M4TCNHBnew · submitted 2026-05-22 · 💻 cs.CV

Pith reviewed 2026-05-25 04:41 UTC · model grok-4.3

classification 💻 cs.CV

keywords: sparse attention · diffusion transformers · video generation · efficient inference · block sparse attention · Hilbert curve

The pith

DFSAttn applies dynamic fine-grained sparse attention to diffusion transformers for video generation, reaching 2.1x end-to-end speedup at high sparsity while preserving quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that standard block sparse attention fails on diffusion transformers because their attention maps are both dynamic across frames and fine-grained within regions. It introduces a training-free framework that reorders tokens, scores blocks hierarchically, and caches masks with adaptive ratios to exploit this sparsity pattern. If the approach holds, video generation models could run at much higher sparsity ratios without the quality collapse seen in prior methods, directly lowering the quadratic cost of 3D attention. The work focuses on inference efficiency rather than retraining, showing consistent gains over existing sparse baselines in experiments.

Core claim

Existing block sparse attention degrades at high sparsity in DiTs because it cannot capture the dynamic and fine-grained sparsity patterns in their attention maps; DFSAttn overcomes this by combining Hilbert curve token reordering for fine-grained yet GPU-efficient sparsity, hierarchical block scoring for accurate importance estimation, and adaptive sparse mask caching, yielding up to 2.1x speedup with maintained generation quality.

What carries the argument

Hilbert curve-based token reordering combined with hierarchical block scoring and adaptive-ratio mask caching, which together enable dynamic fine-grained sparsification without retraining.
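
The reordering step can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes integer token coordinates (f, h, w), computes a Hilbert index with Skilling's transpose procedure, and sorts tokens by that index. The names `hilbert_index` and `hilbert_order` are hypothetical. On a power-of-two cube such as 4 × 4 × 4, consecutive tokens in the resulting order are always grid neighbours, which is the locality property that lets contiguous blocks of the sequence cover compact spatiotemporal regions.

```python
from itertools import product


def hilbert_index(coords, bits):
    """Map integer coordinates (each < 2**bits) to a Hilbert-curve index.

    Follows Skilling's "axes to transpose" procedure, then interleaves bits.
    """
    x = list(coords)
    n = len(x)
    top = 1 << (bits - 1)

    # Undo the per-level reflections and axis swaps, from the top bit down.
    q = top
    while q > 1:
        p = q - 1
        for i in range(n):
            if x[i] & q:
                x[0] ^= p  # reflect the low bits of axis 0
            else:
                t = (x[0] ^ x[i]) & p  # swap low bits of axis 0 and axis i
                x[0] ^= t
                x[i] ^= t
        q >>= 1

    # Gray-encode across axes.
    for i in range(1, n):
        x[i] ^= x[i - 1]
    t = 0
    q = top
    while q > 1:
        if x[n - 1] & q:
            t ^= q - 1
        q >>= 1
    x = [v ^ t for v in x]

    # Interleave bits: most significant bit first, axis 0 first.
    index = 0
    for bit in range(bits - 1, -1, -1):
        for v in x:
            index = (index << 1) | ((v >> bit) & 1)
    return index


def hilbert_order(frames, height, width):
    """Return (f, h, w) token coordinates sorted along a 3D Hilbert curve."""
    # Assumption: embed the grid in the smallest enclosing power-of-two cube.
    bits = max(1, (max(frames, height, width) - 1).bit_length())
    tokens = product(range(frames), range(height), range(width))
    return sorted(tokens, key=lambda c: hilbert_index(c, bits))


order = hilbert_order(4, 4, 4)
```

For grids that are not power-of-two cubes, this sketch sorts the existing coordinates by their index in the enclosing cube. The order is still a valid permutation, but neighbouring positions in the sequence are no longer guaranteed to be adjacent in the grid.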

If this is right

Higher sparsity ratios become usable in DiT video models without the quality drop that previously limited acceleration.
End-to-end inference time for long video sequences drops by roughly half while output fidelity stays comparable to dense attention.
The three design components can be combined modularly with other attention optimizations that preserve token ordering or importance signals.
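
How an attention-only speedup turns into an end-to-end figure follows Amdahl's law. The sketch below is illustrative only: the 70% attention share is a made-up number, not a measurement from the paper, and `ideal_attention_speedup` assumes that skipped blocks cost nothing and that mask estimation is free. Both function names are hypothetical.

```python
def ideal_attention_speedup(sparsity):
    """Upper bound on attention speedup when skipped blocks cost nothing."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    return 1.0 / (1.0 - sparsity)


def end_to_end_speedup(attention_fraction, attention_speedup):
    """Amdahl's law: only the attention share of the runtime is accelerated."""
    if not 0.0 <= attention_fraction <= 1.0:
        raise ValueError("attention_fraction must be in [0, 1]")
    if attention_speedup <= 0:
        raise ValueError("attention_speedup must be positive")
    remaining = (1.0 - attention_fraction) + attention_fraction / attention_speedup
    return 1.0 / remaining


# Hypothetical inputs: attention is 70% of runtime, 80% of blocks are skipped.
speedup = end_to_end_speedup(
    attention_fraction=0.7,
    attention_speedup=ideal_attention_speedup(0.8),
)
print(f"end-to-end speedup: {speedup:.2f}x")
```

Under these assumed inputs the end-to-end gain is a little under 2.3x even though attention itself is five times cheaper, which is why an end-to-end figure near 2x can still correspond to a high sparsity ratio.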

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same token reordering and hierarchical scoring could extend to image-only diffusion models where attention maps also show non-uniform sparsity.
Adaptive mask caching might reduce memory traffic on hardware that supports irregular sparse matrix operations, beyond the reported GPU speedups.
If the lower-bound analysis on attention recall generalizes, it offers a way to predict the minimum sparsity a given model can tolerate before quality degrades.
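
To make the notion of attention recall concrete, here is a minimal sketch under one common reading: recall is the share of a query's softmax attention mass that falls inside the key blocks kept by the mask. The paper's formal definition may differ, and the helper names `attention_recall` and `top_blocks` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_recall(logits, kept_blocks, block_size):
    """Share of one query's softmax mass that lies inside the kept key blocks."""
    probs = softmax(logits)
    kept = set(kept_blocks)
    return sum(p for j, p in enumerate(probs) if j // block_size in kept)


def top_blocks(logits, block_size, budget):
    """Oracle selection: the `budget` key blocks holding the most true mass."""
    probs = softmax(logits)
    num_blocks = math.ceil(len(probs) / block_size)
    mass = [sum(probs[b * block_size:(b + 1) * block_size]) for b in range(num_blocks)]
    return sorted(range(num_blocks), key=lambda b: mass[b], reverse=True)[:budget]


# Toy row with two isolated peaks that sit in different blocks (fine-grained pattern).
logits = [4.0, 0.0, 0.0, 0.0, 0.0, 0.0, 4.0, 0.0]
recall_one = attention_recall(logits, top_blocks(logits, 2, 1), 2)
recall_two = attention_recall(logits, top_blocks(logits, 2, 2), 2)
```

In this toy row a budget of one block out of four recovers less than half of the attention mass, while two blocks recover most of it. That is the kind of gap a block-selection scheme has to close when important keys are scattered rather than contiguous.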

Load-bearing premise

Attention maps inside diffusion transformers for video contain inherent dynamic fine-grained sparsity that block-based methods cannot exploit without quality loss.

What would settle it

Measure generation quality with metrics such as FVD or CLIP similarity at 80-90% sparsity on standard video benchmarks. If DFSAttn's quality falls clearly below that of full attention while prior block-sparse methods hold up at the same sparsity, the claim that it handles dynamic sparsity better is falsified.

Figures

Figures reproduced from arXiv: 2605.23445 by Jie Hu, Kun Yuan, Yutong He, Zixiang Gao.

**Figure 1.** Overview of DFSAttn: a) Video tokens are reordered using a 3D Hilbert curve, and sub-block attention scores are aggregated to estimate block-wise importance. The resulting masks are applied via sparse Flash Attention, while cross-attention remains dense to preserve text–video alignment. b) Sparse masks are cached and reused across diffusion timesteps. c) Structured block sparsity in the reordered sequence …

**Figure 2.** 3D full attention maps in DiTs exhibit dynamic and fine-grained sparsity patterns. Surrounding text from the paper: … (f, h, w), corresponding to the number of frames, height, and width, respectively. This 3D latent is then flattened into a single token sequence before being fed into the transformer. Consequently, the input sequence length for each transformer block is N = f × h × w. For simplicity, we omit text-conditioning tokens here, a…

**Figure 3.** The mean attention recall (solid line) rises monotonically across diffusion steps, with low variance across various samples (shaded region). Surrounding text from the paper (Section 4, Motivation): In this section, we identify the intrinsic sparsity patterns of attention in diffusion transformers and show that the effectiveness of block sparse attention is fundamentally governed by the statistical structure of attention scores. To formalize this c…

**Figure 4.** The 3D Hilbert curve in 4 × 4 × 4 space.

**Figure 5.** Examples of videos generated by DFSAttn and other baselines on Wan2.1-T2V-14B.

**Figure 6.** PSNR (solid, left) and latency (dashed, right) vs. overall sparsity. Main results are evaluated under adaptive 80% sparsity.

**Figure 7.** Video generation examples from dense attention and DFSAttn on HunyuanVideo.

**Figure 8.** Video generation examples from dense attention and DFSAttn on Wan2.1-T2V-14B.

Abstract

Diffusion transformers have achieved remarkable success in high-quality video generation, yet their reliance on spatiotemporal 3D full attention incurs prohibitive computational cost due to the quadratic complexity of attention. Block sparse attention is a common approach to mitigate this by focusing computation on important regions. However, attention maps in DiTs exhibit inherently dynamic and fine-grained sparsity, which causes existing block sparse attention methods to degrade significantly in quality, especially at high sparsity ratios. In this paper, we revisit block sparse attention and derive a theoretical lower bound on attention recall to characterize the key factors governing its effectiveness. Guided by these insights, we propose DFSAttn, a training-free sparse attention framework that enables dynamic, fine-grained sparsification efficiently. DFSAttn incorporates three core designs: Hilbert curve-based token reordering to achieve fine-grained sparsity while preserving efficient GPU execution, hierarchical block scoring for accurate block importance estimation, and sparse mask caching with adaptive ratios to balance accuracy and efficiency. Experimental results demonstrate that DFSAttn consistently outperforms prior methods under high sparsity, achieving up to 2.1$\times$ end-to-end speedup while maintaining high generation quality. Our code is open-sourced and available at https://github.com/jessica-hujie/DFSAttn.
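
The hierarchical block scoring idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's algorithm: sub-blocks are summarised by mean pooling, sub-block scores are a softmax over pooled dot products, and a key block's score for a query block is the largest mass any of its query sub-blocks assigns to it. The functions `mean_pool`, `block_scores`, and `select_blocks`, and the `block` and `sub` parameters, are hypothetical names.

```python
import math


def mean_pool(vectors, size):
    """Average consecutive groups of `size` vectors into one vector each."""
    pooled = []
    for start in range(0, len(vectors), size):
        group = vectors[start:start + size]
        pooled.append([sum(column) / len(group) for column in zip(*group)])
    return pooled


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def block_scores(queries, keys, block, sub):
    """Score (query block, key block) pairs from pooled sub-block attention."""
    if block % sub != 0:
        raise ValueError("block must be a multiple of sub")
    per_block = block // sub
    q_sub = mean_pool(queries, sub)
    k_sub = mean_pool(keys, sub)
    scale = 1.0 / math.sqrt(len(queries[0]))
    n_q = math.ceil(len(q_sub) / per_block)
    n_k = math.ceil(len(k_sub) / per_block)
    scores = [[0.0] * n_k for _ in range(n_q)]
    for i, q in enumerate(q_sub):
        # Fine level: attention of one pooled query sub-block over key sub-blocks.
        probs = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in k_sub])
        # Coarse level: sum the mass that lands in each key block.
        mass = [0.0] * n_k
        for j, p in enumerate(probs):
            mass[j // per_block] += p
        qb = i // per_block
        # Assumed aggregation: keep the maximum over query sub-blocks.
        scores[qb] = [max(old, new) for old, new in zip(scores[qb], mass)]
    return scores


def select_blocks(scores, sparsity):
    """Keep the top-scoring key blocks per query block at the given sparsity."""
    keep = max(1, round((1.0 - sparsity) * len(scores[0])))
    mask = []
    for row in scores:
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        mask.append(sorted(ranked[:keep]))
    return mask


# Toy input: the first four tokens align with each other, as do the last four.
queries = [[3.0, 0.0] for _ in range(4)] + [[0.0, 3.0] for _ in range(4)]
keys = [[3.0, 0.0] for _ in range(4)] + [[0.0, 3.0] for _ in range(4)]
scores = block_scores(queries, keys, block=4, sub=2)
mask = select_blocks(scores, sparsity=0.5)
```

The resulting `mask` lists, for each query block, the indices of the key blocks that a block-sparse attention kernel would compute. Scoring on pooled sub-blocks keeps the estimation cost far below that of the full attention map, at the price of the pooling approximation.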

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

DFSAttn adds Hilbert-curve reordering plus hierarchical scoring to block-sparse attention for DiT video models and claims a 2.1x speedup at high sparsity.

The main thing to know is that DFSAttn introduces Hilbert-curve token reordering to turn fine-grained attention sparsity into something that still runs efficiently on GPUs, paired with hierarchical block scoring and adaptive mask caching. All of it is training-free and motivated by a derived lower bound on attention recall. This combination looks new relative to the block-sparse baselines they cite. The designs directly target the stated weakness of prior methods when sparsity gets high and patterns stay dynamic and fine-grained. Open-sourcing the code is a clear positive for anyone who wants to test it.

The paper does a clean job laying out the motivation and then mapping each component to a specific part of the problem. The experimental claim of consistent outperformance with up to 2.1x end-to-end speedup is the part that matters most for practitioners.

The soft spots sit in the experimental section. The abstract reports quality preservation but gives no error bars, dataset details, or ablation numbers here, so the size of the actual gains from each new piece is still unclear. The central assumption that attention maps are inherently fine-grained and dynamic needs the full results to land convincingly.

This paper is for people working on inference efficiency in video diffusion transformers. A reader already thinking about sparse attention patterns would get concrete mechanisms worth trying. It deserves peer review because the problem is real, the fixes are specific, and the code is available for verification.

Referee Report

0 major / 4 minor

Summary. The paper claims that attention maps in diffusion transformers (DiTs) for video generation exhibit dynamic fine-grained sparsity that defeats standard block-sparse methods at high sparsity ratios. It derives a theoretical lower bound on attention recall, then introduces the training-free DFSAttn framework consisting of Hilbert-curve token reordering, hierarchical block scoring, and adaptive sparse-mask caching. Experiments reportedly show consistent outperformance over prior sparse-attention baselines, with up to 2.1× end-to-end speedup while preserving generation quality; code is released.
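
Mask caching with an adaptive ratio can be pictured with the following minimal sketch. Everything here is an assumption for illustration: the refresh interval, the linear sparsity ramp across diffusion steps, and the names `MaskCache` and `sparsity_at` are hypothetical, and the paper's actual schedule may be quite different.

```python
class MaskCache:
    """Reuse a sparse mask for several diffusion steps before recomputing it."""

    def __init__(self, refresh_every):
        if refresh_every < 1:
            raise ValueError("refresh_every must be at least 1")
        self.refresh_every = refresh_every
        self._masks = {}
        self.recomputed = 0

    def get(self, step, layer, compute_mask):
        """Return the cached mask for `layer`, refreshing it on schedule."""
        due = step % self.refresh_every == 0
        if due or layer not in self._masks:
            self._masks[layer] = compute_mask()
            self.recomputed += 1
        return self._masks[layer]


def sparsity_at(step, total_steps, low=0.6, high=0.9):
    """Hypothetical schedule: ramp sparsity linearly from `low` to `high`."""
    if total_steps < 2:
        return high
    return low + (high - low) * step / (total_steps - 1)


cache = MaskCache(refresh_every=5)
total_steps = 20
for step in range(total_steps):
    ratio = sparsity_at(step, total_steps)
    for layer in range(2):
        # Stand-in for real mask estimation: record when the mask was built.
        mask = cache.get(step, layer, lambda: ("mask", step, round(ratio, 3)))
```

With a refresh interval of 5, the two layers recompute their masks at steps 0, 5, 10, and 15 only, so mask estimation runs 8 times instead of 40. The trade-off is that a reused mask may lag behind the attention pattern of the current step.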

Significance. If the lower-bound derivation is correct and the reported speed/quality trade-off holds under rigorous evaluation, the work would be a useful practical advance for accelerating high-resolution video diffusion models. The training-free design and open-sourced implementation are concrete strengths that lower the barrier to adoption and verification.

minor comments (4)

The abstract states a lower bound on recall but does not indicate whether the bound is tight or how the three proposed mechanisms are shown to approach it; a short proof sketch or numerical verification in the main text would strengthen the motivation section.
Generation-quality claims should be supported by standard video metrics (FVD, CLIP similarity, human preference) with error bars over multiple seeds; the current description only mentions “high generation quality.”
The hierarchical scoring and adaptive-ratio caching mechanisms would benefit from an ablation table isolating each component’s contribution to the final speedup/quality numbers.
Figure captions and algorithm boxes should explicitly state the sparsity ratios at which each method is evaluated so readers can directly compare the reported 2.1× figure.
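
The request for error bars over multiple seeds is cheap to satisfy. As a rough sketch, assuming frames are available as flat sequences of 8-bit pixel values, PSNR against the dense-attention output can be computed per seed and then summarised by mean and sample standard deviation. The helper names `psnr` and `mean_and_std` and the per-seed values are hypothetical, not results from the paper.

```python
import math
import statistics


def psnr(reference, candidate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(candidate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, candidate)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(peak * peak / mse)


def mean_and_std(values):
    """Mean and sample standard deviation of a metric across seeds."""
    return statistics.mean(values), statistics.stdev(values)


# Toy frame pair that differs by one intensity level at every pixel.
dense_frame = [0, 128, 254, 64]
sparse_frame = [1, 129, 255, 65]
frame_psnr = psnr(dense_frame, sparse_frame)

# Placeholder per-seed scores, purely for illustration.
mean_db, std_db = mean_and_std([30.0, 31.0, 32.0])
```

Reporting `mean_db ± std_db` per method and sparsity ratio would let readers judge whether the gaps between methods exceed seed-to-seed variation.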

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, recognition of the training-free design and open-sourced code as strengths, and the recommendation for minor revision. The referee's description of the paper is accurate.

Circularity Check

0 steps flagged

No significant circularity detected

The paper derives a theoretical lower bound on attention recall from the properties of block sparse attention in DiTs, then designs DFSAttn (Hilbert reordering, hierarchical scoring, mask caching) guided by that bound. No equation or claim reduces a 'prediction' or result to a fitted parameter, self-defined quantity, or self-citation chain by construction. The method is explicitly training-free, and performance claims (speedup and quality) are externally measurable rather than tautological with the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review performed on abstract only; no full text available to enumerate additional free parameters, axioms, or entities.

axioms (1)

Domain assumption: attention maps in DiTs exhibit inherently dynamic and fine-grained sparsity.
Explicitly stated in the abstract as the reason existing block-sparse methods degrade at high sparsity ratios.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
Paper passage: "We develop a token representation model and derive a theoretical lower bound on attention recall... three key factors: the sparsity budget, the inter-block similarity gap, and the block-level semantic diversity."

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · tag: echoes
echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
Paper passage: "reorder video tokens using a 3D Hilbert curve... preserves spatiotemporal locality"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · 5 internal anchors

[1] Dao, T. Flashattention-2: Faster attention with better parallelism and work partitioning. In International Conference on Learning Representations, volume 2024, pp. 35549–35562.

[2] Durvasula, S., Sreedhar, K., Moustafa, Z., Kothawade, S., Gondimalla, A., Subramanian, S., Shahidi, N., and Vijaykumar, N. FG-Attn: Leveraging Fine-Grained Sparsity In Diffusion Transformers. arXiv preprint arXiv:2509.16518.

[3] Kong, W., Tian, Q., Zhang, Z., Min, R., Dai, Z., Zhou, J., Xiong, J., Li, X., Wu, B., Zhang, J., et al. HunyuanVideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603.

[4] Lai, X., Lu, J., Luo, Y., Ma, Y., and Zhou, X. Flexprefill: A context-aware sparse attention mechanism for efficient long-sequence inference. arXiv preprint arXiv:2502.20766.

[5] Li, Y. and Xu, L. Hilbert-Guided Block-Sparse Local Attention. arXiv preprint arXiv:2511.05832.

[6] Liu, Y., Hu, Y., Zhang, Z., Jiang, K., and Yuan, K. Mixture of distributions matters: Dynamic sparse attention for efficient video diffusion transformers. arXiv preprint arXiv:2601.11641, 2026b. Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., and Zhu, J. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in...

[7] Salimans, T. and Ho, J. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512.

[8] Shen, X., Han, C., Zhou, Y., Xie, Y., Gong, Y., Wang, Q., Wang, Y., Wang, Y., Zhao, P., and Gu, J. Draftattention: Fast video diffusion via low-resolution attention guidance. arXiv preprint arXiv:2505.14708.

[9] Song, J., Meng, C., and Ermon, S. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502.

[10] Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.-W., Chen, D., Yu, F., Zhao, H., Yang, J., et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314.

[11] Xi, H., Yang, S., Zhao, Y., Xu, C., Li, M., Li, X., Lin, Y., Cai, H., Zhang, J., Li, D., et al. Sparse videogen: Accelerating video diffusion transformers with spatial-temporal sparsity. arXiv preprint arXiv:2502.01776.

[12] Xiao, G., Tian, Y., Chen, B., Han, S., and Lewis, M. Efficient streaming language models with attention sinks. In International Conference on Learning Representations, volume 2024, pp. 21875–21895.

[13] Xu, R., Xiao, G., Huang, H., Guo, J., and Han, S. Xattention: Block sparse attention with antidiagonal scoring. arXiv preprint arXiv:2503.16428.

[14] Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., Yang, Y., Hong, W., Zhang, X., Feng, G., et al. Cogvideox: Text-to-video diffusion models with an expert transformer. In International Conference on Learning Representations, volume 2025, pp. 83048–83077.

[15] Zhang, J., Xiang, C., Huang, H., Wei, J., Xi, H., Zhu, J., and Chen, J. Spargeattn: Accurate sparse attention accelerating any model inference. In International Conference on Machine Learning (ICML), 2025a. Zhang, P., Chen, Y., Su, R., Ding, H., Stoica, I., Liu, Z., and Zhang, H. Fast video generation with sliding tile attention. arXiv preprint arXiv:250...

[16] Zhao, X., Jin, X., Wang, K., and You, Y. Real-time video generation with pyramid attention broadcast. In International Conference on Learning Representations, volume 2025, pp. 3296–3319.

[17] Zheng, S., Lu, W., Xia, Y., Liu, H., and Wang, S. HilbertA: Hilbert Attention for Image Generation with Diffusion Models. arXiv preprint arXiv:2509.26538.

[18] Zheng, Z., Peng, X., Yang, T., Shen, C., Li, S., Liu, H., Zhou, Y., Li, T., and You, Y. Open-sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404.
[19] … implies that there exists a constant $c'$ such that

$$P\left( z^\top A z - \mathbb{E}[z^\top A z] \le -\frac{\mu_D}{2} \right) \le \exp\left( -c' \min\left( \frac{\mu_D^2}{4 K^4 \|A\|_F^2},\ \frac{\mu_D}{2 K^2 \|A\|} \right) \right), \tag{27}$$

where $K = \max_i \|z_i\|_{\psi_2}$ is the subgaussian norm of $z$, $\|A\|_F$ is the Frobenius norm of the matrix, and $\|A\|$ is the operator norm of the matrix. Since $z$ is Gaussian, the subgaussian norm satisfies $\|z_i\|_{\psi_2} \le C_{\psi_2}\delta$ for all coordinates $z_i$, wher...
