
arxiv: 2406.05630 · v3 · pith:3BPW4TZ7new · submitted 2024-06-09 · 💻 cs.CV

Ctrl-V: Higher Fidelity Video Generation with Bounding-Box Controlled Object Motion

classification 💻 cs.CV
keywords video, object, generation, boxes, control, diffusion, bounding, controllable

Controllable video generation has attracted significant attention, largely due to advances in video diffusion models. In domains such as autonomous driving, it is essential to develop highly accurate predictions for object motions. This paper tackles a crucial challenge: how to exert precise control over object motion for realistic video synthesis. To accomplish this, we 1) control object movements using bounding boxes and extend this control to the renderings of 2D or 3D boxes in pixel space, 2) employ a distinct, specialized model to forecast the trajectories of object bounding boxes based on their previous and, if desired, future positions, and 3) adapt and enhance a separate video diffusion network to create video content based on these high-quality trajectory forecasts. Our method, Ctrl-V, leverages modified and fine-tuned Stable Video Diffusion (SVD) models for both trajectory prediction and video generation. Extensive experiments conducted on the KITTI, Virtual-KITTI 2, BDD100k, and nuScenes datasets validate the effectiveness of our approach in producing realistic and controllable video generation.
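To make the idea of box control "in pixel space" concrete, here is a minimal sketch that draws a 2D bounding-box trajectory into a stack of RGB frames, the kind of image-like signal a video model could be conditioned on. It is not the paper's implementation: the function names `interpolate_boxes` and `render_box_frames`, the `(x0, y0, x1, y1)` box format, and the frame size are all hypothetical, and plain linear interpolation between a first and a last box stands in for the learned trajectory model described in the abstract.

```python
import numpy as np

def interpolate_boxes(first_box, last_box, num_frames):
    """Linearly interpolate (x0, y0, x1, y1) boxes between two keyframes.

    Assumption: a simple stand-in for a learned trajectory predictor.
    """
    first = np.asarray(first_box, dtype=float)
    last = np.asarray(last_box, dtype=float)
    weights = np.linspace(0.0, 1.0, num_frames)[:, None]
    return (1.0 - weights) * first + weights * last  # shape (num_frames, 4)

def render_box_frames(boxes, height, width, color=(255, 0, 0)):
    """Draw one box outline per frame into an RGB uint8 array."""
    frames = np.zeros((len(boxes), height, width, 3), dtype=np.uint8)
    upper = [width - 1, height - 1, width - 1, height - 1]
    for t, box in enumerate(boxes):
        # Round to pixel coordinates and keep the box inside the frame.
        x0, y0, x1, y1 = np.clip(np.round(box).astype(int), 0, upper)
        frames[t, y0, x0:x1 + 1] = color   # top edge
        frames[t, y1, x0:x1 + 1] = color   # bottom edge
        frames[t, y0:y1 + 1, x0] = color   # left edge
        frames[t, y0:y1 + 1, x1] = color   # right edge
    return frames

# Hypothetical example: one object moving right and slightly down over 5 frames.
boxes = interpolate_boxes((2, 2, 6, 6), (10, 4, 14, 8), num_frames=5)
frames = render_box_frames(boxes, height=16, width=24)
```

The resulting `frames` array has shape `(num_frames, height, width, 3)`, so it can be handled like any other short video clip.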

This paper has not been read by Pith yet.


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LooseControlVideo: Directorial Video Control using Spatial Blocking

    cs.CV 2026-06 unverdicted novelty 6.0

    LooseControlVideo fine-tunes a video model on DNOCS-annotated data to enable layout and trajectory control via oriented 3D boxes, reporting 1.2-3x gains in trajectory accuracy over 2D baselines on nuScenes, HO-3D and BEHAVE.