DFSAttn is a training-free framework for dynamic fine-grained sparse attention in video DiTs that achieves up to 2.1x speedup while preserving generation quality via Hilbert reordering, hierarchical scoring, and adaptive caching.
Sparse VideoGen: Accelerating video diffusion transformers with spatial-temporal sparsity. arXiv preprint arXiv:2502.01776
25 Pith papers cite this work. Polarity classification is still indexing.
roles: background 4
polarities: background 4
representative citing papers
HASTE delivers up to 1.93x speedup on Wan2.1 video DiTs via head-wise adaptive sparse attention using temporal mask reuse and error-guided per-head calibration while preserving video quality.
CoReDiT reduces self-attention FLOPs in DiTs by up to 55% via linear-time spatial coherence pruning and neighbor-based reconstruction, delivering 1.33x-1.72x speedups with maintained quality.
A survey that groups efficient video diffusion methods into four paradigms (step distillation, efficient attention, model compression, and cache/trajectory optimization) and outlines open challenges for practical use.
Attention sparsity in video DiTs is an input-stable layer-wise property, enabling offline profiling and online bidirectional QK co-clustering for up to 1.93x speedup with PSNR up to 29 dB.
SyncCache accelerates DiT-based audio-driven portrait animation up to 4.12x via spatially-asymmetric probing and modality-decoupled caching while preserving near-lossless quality and audio sync.
EcoVideo introduces entropy-driven dynamic frame selection for cloud-edge DiT video generation, yielding up to 2.9x speedup with adaptive keyframe budgets.
RhymeFlow is a training-free acceleration framework that decouples denoising trajectories across video frames by dense processing of semantic keyframes and asynchronous skipping for non-keyframes, augmented by a latent trajectory projection module to maintain consistency.
Light Interaction accelerates interactive video world models up to 2.59x via adaptive context management, denoising cache acceleration, and 3D block sparse attention without retraining.
LVSA is a training-free block-sparse attention technique combining structured windows with rotating global anchors that reduces inference compute 2.98-3.33x on video diffusion models at extended horizons while remaining quality-neutral or positive.
Veda formulates tile selection in video diffusion attention as a reconstruction problem from full attention maps, using statistics-aware and head-aware scoring to enable high sparsity with maintained quality and hardware speedups up to 5.1x end-to-end.
RT-Lynx shifts DiT sparsity from weights to activations and reports up to 1.55x linear-layer speedup while preserving generation quality across multiple diffusion models.
SparseSAM achieves 2x faster inference and 2.8x memory reduction in SAM with only 0.004 mIoU loss at 0.4 density via Stripe-Sort Attention and Residual-Consistency MLP.
DynamicRad achieves 1.7x-2.5x inference speedups in long video diffusion with over 80% sparsity by grounding adaptive selection in a radial locality prior, using dual-mode static/dynamic strategies and offline BO with a semantic motion router.
AdaCluster delivers a training-free adaptive query-key clustering framework for sparse attention in video DiTs, yielding 1.67-4.31x inference speedup with negligible quality loss on CogVideoX-2B, HunyuanVideo, and Wan-2.1.
A decoupled memory branch with hybrid cues, cross-attention, and gating improves spatial consistency and data efficiency in long-horizon camera-trajectory video generation.
Hybrid Forcing combines linear temporal attention for long-range retention, block-sparse attention for efficiency, and decoupled distillation to achieve real-time unbounded 832x480 streaming video generation at 29.5 FPS.
LIPAR prunes redundant inter-frame latent patches in video generation and recovers attention to deliver 1.53x speedup at 19.3 FPS with no quality drop or extra training.
S2O uses online permutation and importance-based early stopping to increase effective sparsity in attention, delivering 7.51x attention and 3.81x end-to-end speedups on Llama-3.1-8B at 128K context with preserved accuracy.
Quant VideoGen reduces KV cache memory by up to 7x in autoregressive video diffusion models via semantic-aware smoothing and progressive residual quantization, achieving better quality than baselines with under 4% latency overhead.
SURF accelerates high-resolution video generation up to 12.5x by using noise reshifting for low-res previews from pretrained models and a shifting-window Refiner for efficient upscaling that retains original signatures.
DiSC accelerates DiT and PixArt-Sigma diffusion models 3.47-4.74x over A100 GPUs by reusing cached tokens across denoising steps and reusing sparsity masks in attention, using hash-based bank distribution to run sparse workloads on standard compute units.
OASIS reduces redundancy in diffusion models for real-world video super-resolution via attention specialization routing and progressive training, delivering state-of-the-art quality with 6.2x faster inference than prior one-step baselines.
OSP-Next reports 83.73% VBench score and up to 2.27x speedup via hybrid sparse attention, SSP parallelism, HiF8 quantization, and Mix-GRPO on diffusion transformers.
citing papers explorer
- SURF: Signature-Retained Fast Video Generation
  SURF accelerates high-resolution video generation up to 12.5x by using noise reshifting for low-res previews from pretrained models and a shifting-window Refiner for efficient upscaling that retains original signatures.
- Towards Redundancy Reduction in Diffusion Models for Efficient Video Super-Resolution
  OASIS reduces redundancy in diffusion models for real-world video super-resolution via attention specialization routing and progressive training, delivering state-of-the-art quality with 6.2x faster inference than prior one-step baselines.