VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control

Aliaksandr Siarohin; Andrea Tagliasacchi; Chaoyang Wang; David B. Lindell; Guocheng Qian; Hsin-Ying Lee; Ivan Skorokhodov; Jiaxu Zou; Michael Vasilkovsky; Sergey Tulyakov

arxiv: 2407.12781 · v3 · pith:JMEMFZEDnew · submitted 2024-07-17 · 💻 cs.CV

VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control

Sherwin Bahmani , Ivan Skorokhodov , Aliaksandr Siarohin , Willi Menapace , Guocheng Qian , Michael Vasilkovsky , Hsin-Ying Lee , Chaoyang Wang

show 4 more authors

Jiaxu Zou Andrea Tagliasacchi David B. Lindell Sergey Tulyakov

This is my paper

classification 💻 cs.CV

keywords cameracontrolmodelsvideodiffusiongenerationapproachcontrollable

0 comments

read the original abstract

Modern text-to-video synthesis models demonstrate coherent, photorealistic generation of complex videos from a text description. However, most existing models lack fine-grained control over camera movement, which is critical for downstream applications related to content creation, visual effects, and 3D vision. Recently, new methods demonstrate the ability to generate videos with controllable camera poses these techniques leverage pre-trained U-Net-based diffusion models that explicitly disentangle spatial and temporal generation. Still, no existing approach enables camera control for new, transformer-based video diffusion models that process spatial and temporal information jointly. Here, we propose to tame video transformers for 3D camera control using a ControlNet-like conditioning mechanism that incorporates spatiotemporal camera embeddings based on Pl\"ucker coordinates. The approach demonstrates state-of-the-art performance for controllable video generation after fine-tuning on the RealEstate10K dataset. To the best of our knowledge, our work is the first to enable camera control for transformer-based video diffusion models.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Look-Before-Move: Narrative-Grounded World Visual Attention in Dynamic 3D Story Worlds
cs.AI 2026-06 unverdicted novelty 7.0

Look-Before-Move is a framework that converts narrative intent into Semantic Observation Contracts, uses Monte Carlo Viewpoint Search for feasible viewpoints, and applies Semantic Trajectory Grounding for coherent cam...
Look-Before-Move: Narrative-Grounded World Visual Attention in Dynamic 3D Story Worlds
cs.AI 2026-06 unverdicted novelty 7.0

Look-Before-Move separates narrative observation specification from camera motion via semantic contracts, Monte Carlo viewpoint search, and trajectory grounding, tested on a new 50-story 3D benchmark.
Probing into Camera Control of Video Models
cs.CV 2026-05 unverdicted novelty 7.0

A training-free method reformulates camera control as geometric displacement fields applied via differentiable latent resampling, enabling control and bias probing in video diffusion models.
R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow
cs.CV 2026-05 unverdicted novelty 7.0

R-DMesh generates high-fidelity 4D meshes aligned to video by disentangling base mesh, motion, and a learned rectification jump offset inside a VAE, then using Triflow Attention and rectified-flow diffusion.
MoCam: Unified Novel View Synthesis via Structured Denoising Dynamics
cs.CV 2026-05 unverdicted novelty 7.0

MoCam unifies static and dynamic novel view synthesis by temporally decoupling geometric alignment and appearance refinement within the diffusion denoising process.
MoCam: Unified Novel View Synthesis via Structured Denoising Dynamics
cs.CV 2026-05 unverdicted novelty 7.0

MoCam uses structured denoising dynamics in diffusion models to temporally decouple geometric alignment from appearance refinement, enabling unified novel view synthesis that outperforms prior methods on imperfect poi...
$h$-control: Training-Free Camera Control via Block-Conditional Gibbs Refinement
cs.CV 2026-05 unverdicted novelty 7.0

h-control introduces block-conditional pseudo-Gibbs refinement for training-free camera control in flow-matching video generators, achieving superior FVD scores on RealEstate10K and DAVIS benchmarks.
UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
cs.CV 2026-04 unverdicted novelty 7.0

UniGeo unifies geometric guidance across three levels in video models to reduce geometric drift and improve consistency in camera-controllable image editing.
Semantic-Aware, Physics-Informed, Geometry-Grounded Weather Video Synthesis
cs.CV 2026-06 unverdicted novelty 6.0

A new framework factorizes weather video synthesis into semantic appearance anchoring, physics-informed Gaussian particle simulation under gravity/wind/turbulence, and geometry-grounded alignment to produce diverse re...
Lighting-Consistent Object Transfer Across Radiance Fields
cs.GR 2026-06 unverdicted novelty 6.0

Diffusion-based per-view harmonization for lighting-consistent object transfer between 3DGS scenes, using heterogeneous training data and final 3D consolidation.
TriMotion: Modality-Agnostic Camera Control for Video Generation
cs.CV 2026-06 unverdicted novelty 6.0

TriMotion is a modality-agnostic framework that maps video, pose, and text descriptions of the same camera trajectory into a shared motion embedding space, trained with a new triplet dataset and latent consistency obj...
Streaming Video Generation with Streaming Force Control
cs.CV 2026-06 unverdicted novelty 6.0

StreamForce presents a unified causal model for force-controllable streaming video generation using a new force representation and distillation pipeline, claiming SOTA force adherence and 16.6 FPS performance.
GeoFlow: Enforcing Implicit Geometric Consistency in Video Generation
cs.CV 2026-05 unverdicted novelty 6.0

GeoFlow adds a geometry-consistency reward based on rigid camera flow and object appearance preservation, integrated via reinforcement fine-tuning to improve geometric coherence in video generation.
R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow
cs.CV 2026-05 unverdicted novelty 6.0

R-DMesh proposes a VAE-based disentanglement of base mesh, motion trajectories, and rectification offset plus Triflow Attention and rectified-flow diffusion to produce 4D meshes aligned to video despite initial pose mismatch.
$h$-control: Training-Free Camera Control via Block-Conditional Gibbs Refinement
cs.CV 2026-05 unverdicted novelty 6.0

h-control augments hard-replacement guidance with block-conditional pseudo-Gibbs refinement on unobserved latent sites and adaptive 3D patch freezing to achieve superior FVD on RealEstate10K and DAVIS.
UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
cs.CV 2026-04 unverdicted novelty 6.0

UniGeo adds unified geometric guidance at three levels in video models to reduce geometric drift and improve structural fidelity in camera-controllable image editing.
INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling
cs.CV 2026-04 unverdicted novelty 6.0

INSPATIO-WORLD is a real-time framework for high-fidelity 4D scene generation and navigation from monocular videos via STAR architecture with implicit caching, explicit geometric constraints, and distribution-matching...
ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis
cs.CV 2024-09 unverdicted novelty 6.0

ViewCrafter tames video diffusion models with point-based 3D guidance and iterative trajectory planning to produce high-fidelity novel views from single or sparse images.
CameraCtrl: Enabling Camera Control for Text-to-Video Generation
cs.CV 2024-04 unverdicted novelty 6.0

CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.
R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow
cs.CV 2026-05 unverdicted novelty 5.0

R-DMesh uses a VAE with a learned rectification jump offset and Triflow Attention inside a rectified-flow diffusion transformer to produce video-aligned 4D meshes despite initial pose misalignment.