AttentionBender applies 2D transforms to cross-attention maps in video diffusion transformers, producing distributed distortions and glitch aesthetics that reveal entangled attention mechanisms while serving as both an XAI probe and creative tool.
VDT: General-Purpose Video Diffusion Transformers via Mask Modeling
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 5roles
background 2polarities
background 2representative citing papers
FrameDiT proposes Matrix Attention for DiTs to achieve SOTA video generation with improved temporal coherence and efficiency comparable to local factorized attention.
RoPeSLR combines 3D RoPE-guided sparse attention with head-wise low-rank parameterization to achieve sub-quadratic complexity in DiTs while preserving distance awareness for efficient ultra-long video synthesis.
LetsTalk combines a multimodal diffusion transformer, noise-regularized memory bank, deep compression autoencoder, and symbiotic/direct fusion schemes to achieve state-of-the-art quality and efficiency in long-duration talking video generation.
This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.
citing papers explorer
-
AttentionBender: Manipulating Cross-Attention in Video Diffusion Transformers as a Creative Probe
AttentionBender applies 2D transforms to cross-attention maps in video diffusion transformers, producing distributed distortions and glitch aesthetics that reveal entangled attention mechanisms while serving as both an XAI probe and creative tool.
-
FrameDiT: Diffusion Transformer with Matrix Attention for Efficient Video Generation
FrameDiT proposes Matrix Attention for DiTs to achieve SOTA video generation with improved temporal coherence and efficiency comparable to local factorized attention.
-
RoPeSLR: 3D RoPE-driven Sparse-LowRank Attention for Efficient Diffusion Transformers
RoPeSLR combines 3D RoPE-guided sparse attention with head-wise low-rank parameterization to achieve sub-quadratic complexity in DiTs while preserving distance awareness for efficient ultra-long video synthesis.
-
Multimodal Diffusion Transformer with Memory Bank for Scalable Long-Duration Talking Video Generation
LetsTalk combines a multimodal diffusion transformer, noise-regularized memory bank, deep compression autoencoder, and symbiotic/direct fusion schemes to achieve state-of-the-art quality and efficiency in long-duration talking video generation.
-
Evolution of Video Generative Foundations
This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.