CoMoGen generates controllable interactive video from mask sequences and images by encoding masks into MMDiT via MaskAdapter and LoRA on motion layers, claiming SOTA motion fidelity.
Advances in neural information processing systems35, 8633– 8646 (2022)
4 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.CV 4years
2026 4representative citing papers
Head Forcing assigns tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.
RTR-DiT distills a bidirectional DiT teacher into an autoregressive few-step model using Self Forcing and Distribution Matching Distillation, plus a reference-preserving KV cache, to enable stable real-time text- and reference-guided video stylization.
citing papers explorer
-
CoMoGen: COntrollable MOtion Dynamics and Interactions with Mask-Guided Video GENeration
CoMoGen generates controllable interactive video from mask sequences and images by encoding masks into MMDiT via MaskAdapter and LoRA on motion layers, claiming SOTA motion fidelity.
-
Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity
Head Forcing assigns tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.
-
DiT as Real-Time Rerenderer: Streaming Video Stylization with Autoregressive Diffusion Transformer
RTR-DiT distills a bidirectional DiT teacher into an autoregressive few-step model using Self Forcing and Distribution Matching Distillation, plus a reference-preserving KV cache, to enable stable real-time text- and reference-guided video stylization.
- From Synchrony to Sequence: Exo-to-Ego Generation via Interpolation