VDT: General-Purpose Video Diffusion Transformers via Mask Modeling

Haoyu Lu, Guoxing Yang, Nanyi Fei, Yuqi Huo, Zhiwu Lu, Ping Luo, Mingyu Ding · 2023 · arXiv 2305.13311

8 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

AttentionBender: Manipulating Cross-Attention in Video Diffusion Transformers as a Creative Probe

cs.MM · 2026-04-22 · unverdicted · novelty 7.0

AttentionBender applies 2D transforms to cross-attention maps in video diffusion transformers, producing distributed distortions and glitch aesthetics that reveal entangled attention mechanisms while serving as both an XAI probe and creative tool.

FrameDiT: Diffusion Transformer with Matrix Attention for Efficient Video Generation

cs.CV · 2026-03-10 · unverdicted · novelty 7.0

FrameDiT proposes Matrix Attention for DiTs to achieve SOTA video generation with improved temporal coherence and efficiency comparable to local factorized attention.

Lumos-Nexus: Efficient Frequency Bridging with Homogeneous Latent Space for Video Unified Models

cs.CV · 2026-05-29 · unverdicted · novelty 6.0

Lumos-Nexus is a training-efficient video generation framework using two-stage alignment of a lightweight model followed by progressive frequency bridging to a high-fidelity generator in homogeneous latent space, plus the new VR-Bench for reasoning evaluation.

RoPeSLR: 3D RoPE-driven Sparse-LowRank Attention for Efficient Diffusion Transformers

cs.CV · 2026-05-20 · unverdicted · novelty 6.0

RoPeSLR combines 3D RoPE-guided sparse attention with head-wise low-rank parameterization to achieve sub-quadratic complexity in DiTs while preserving distance awareness for efficient ultra-long video synthesis.

Multimodal Diffusion Transformer with Memory Bank for Scalable Long-Duration Talking Video Generation

cs.CV · 2024-11-24 · unverdicted · novelty 6.0

LetsTalk combines a multimodal diffusion transformer, noise-regularized memory bank, deep compression autoencoder, and symbiotic/direct fusion schemes to achieve state-of-the-art quality and efficiency in long-duration talking video generation.

Making Time Editable in Video Diffusion Transformers

cs.CV · 2026-06-08 · unverdicted · novelty 3.0

Introduces a lightweight temporal module that extends pretrained DiT video models with time-editing capabilities while preserving the original generative prior.

Evolution of Video Generative Foundations

cs.CV · 2026-04-07 · unverdicted · novelty 2.0

This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.

GF-DiT: Scheduling Parallelism for Diffusion Transformer Serving

cs.DC · 2026-06-11

citing papers explorer

Showing 7 of 7 citing papers after filters.

AttentionBender: Manipulating Cross-Attention in Video Diffusion Transformers as a Creative Probe

cs.MM · 2026-04-22 · unverdicted · none · ref 41

FrameDiT: Diffusion Transformer with Matrix Attention for Efficient Video Generation

cs.CV · 2026-03-10 · unverdicted · none · ref 30

Lumos-Nexus: Efficient Frequency Bridging with Homogeneous Latent Space for Video Unified Models

cs.CV · 2026-05-29 · unverdicted · none · ref 28

RoPeSLR: 3D RoPE-driven Sparse-LowRank Attention for Efficient Diffusion Transformers

cs.CV · 2026-05-20 · unverdicted · none · ref 13

Making Time Editable in Video Diffusion Transformers

cs.CV · 2026-06-08 · unverdicted · none · ref 7

Evolution of Video Generative Foundations

cs.CV · 2026-04-07 · unverdicted · none · ref 75

GF-DiT: Scheduling Parallelism for Diffusion Transformer Serving

cs.DC · 2026-06-11 · unreviewed · ref 20
