CoMoGen generates controllable interactive video from mask sequences and images by encoding masks into MMDiT via MaskAdapter and LoRA on motion layers, claiming SOTA motion fidelity.
arXiv:2412.18597 (2024)
2 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.CV 2verdicts
UNVERDICTED 2representative citing papers
GenHSI is a training-free three-stage pipeline that turns a scene image, character image, and complex HSI prompt into long videos with plausible chained interactions by generating atomic actions, 3D keyframes via 2D inpainting plus optimization, and then feeding them to pre-trained video diffusion.
citing papers explorer
-
CoMoGen: COntrollable MOtion Dynamics and Interactions with Mask-Guided Video GENeration
CoMoGen generates controllable interactive video from mask sequences and images by encoding masks into MMDiT via MaskAdapter and LoRA on motion layers, claiming SOTA motion fidelity.
-
GenHSI: Controllable Generation of Human-Scene Interaction Videos
GenHSI is a training-free three-stage pipeline that turns a scene image, character image, and complex HSI prompt into long videos with plausible chained interactions by generating atomic actions, 3D keyframes via 2D inpainting plus optimization, and then feeding them to pre-trained video diffusion.