CamCo: Camera-Controllable 3D-Consistent Image-to-Video Generation

Dejia Xu, Weili Nie, Chao Liu, Sifei Liu, Jan Kautz, Zhangyang Wang · 2024 · cs.CV · arXiv 2406.02509

Canonical reference. 21 Pith papers cite this work, and 100% of the classified citations cite it as background.

abstract

Recently video diffusion models have emerged as expressive generative tools for high-quality video content creation readily available to general users. However, these models often do not offer precise control over camera poses for video generation, limiting the expression of cinematic language and user control. To address this issue, we introduce CamCo, which allows fine-grained Camera pose Control for image-to-video generation. We equip a pre-trained image-to-video generator with accurately parameterized camera pose input using Plücker coordinates. To enhance 3D consistency in the videos produced, we integrate an epipolar attention module in each attention block that enforces epipolar constraints to the feature maps. Additionally, we fine-tune CamCo on real-world videos with camera poses estimated through structure-from-motion algorithms to better synthesize object motion. Our experiments show that CamCo significantly improves 3D consistency and camera control capabilities compared to previous models while effectively generating plausible object motion. Project page: https://ir1d.github.io/CamCo/
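
The Plücker parameterization of camera pose mentioned in the abstract can be illustrated with a minimal sketch. It assumes a pinhole camera, a camera-to-world rotation matrix, and the common convention of encoding each pixel ray as (o × d, d), where o is the camera center and d is the unit ray direction. The function names, argument order, and the half-pixel offset are illustrative assumptions and are not taken from the CamCo code.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def matvec(m, v):
    """Multiply a 3x3 matrix (nested sequences) by a 3-vector."""
    return tuple(sum(m[i][j] * v[j] for j in range(3)) for i in range(3))


def normalize(v):
    """Scale a 3-vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)


def plucker_ray(u, v, fx, fy, cx, cy, rot_c2w, origin):
    """Return the 6-D Plücker coordinates (o x d, d) of the ray through pixel (u, v).

    Assumed inputs (hypothetical names): pinhole intrinsics fx, fy, cx, cy,
    a camera-to-world rotation rot_c2w, and the camera center origin in
    world coordinates.
    """
    d_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)   # back-project the pixel
    d_world = normalize(matvec(rot_c2w, d_cam))   # rotate into the world frame
    moment = cross(origin, d_world)               # o x d
    return moment + d_world                       # tuple of length 6


def plucker_embedding(height, width, fx, fy, cx, cy, rot_c2w, origin):
    """Build a height x width grid of 6-D ray embeddings, sampled at pixel centers."""
    return [
        [
            plucker_ray(x + 0.5, y + 0.5, fx, fy, cx, cy, rot_c2w, origin)
            for x in range(width)
        ]
        for y in range(height)
    ]
```

Under these assumptions, every pixel of every frame gets a 6-channel value, so a camera trajectory becomes a dense map that can be concatenated with image features. The moment o × d is always orthogonal to d, which gives a quick sanity check on any implementation.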

citation-role summary

background 9

citation-polarity summary

background 9

representative citing papers

PicoEyes: Unified Gaze Estimation Framework for Mixed Reality with a Large-Scale Multi-View Dataset

cs.CV · 2026-05-08 · unverdicted · novelty 7.0 · 2 refs

PicoEyes delivers a unified end-to-end model for full 3D gaze estimation including eye parameters, axes, segmentation and depth from monocular or binocular near-eye images, supported by a new large-scale multi-view dataset.

Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting

cs.CV · 2026-04-23 · unverdicted · novelty 7.0

Reshoot-Anything trains a diffusion transformer on pseudo multi-view triplets created by cropping and warping monocular videos to achieve temporally consistent video reshooting with robust camera control on dynamic scenes.

UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models

cs.CV · 2026-04-19 · unverdicted · novelty 7.0 · 2 refs

UniGeo unifies geometric guidance across three levels in video models to reduce geometric drift and improve consistency in camera-controllable image editing.

Geometrically Consistent Multi-View Scene Generation from Freehand Sketches

cs.CV · 2026-04-15 · unverdicted · novelty 7.0

A framework generates consistent multi-view scenes from one freehand sketch via a ~9k-sample dataset, Parallel Camera-Aware Attention Adapters, and Sparse Correspondence Supervision Loss, outperforming baselines in realism and consistency.

OmniCamera: A Unified Framework for Multi-task Video Generation with Arbitrary Camera Control

cs.CV · 2026-04-07 · unverdicted · novelty 7.0

OmniCamera disentangles video content and camera motion for multi-task generation with arbitrary camera control via the OmniCAM hybrid dataset and Dual-level Curriculum Co-Training.

SparseCam4D: Spatio-Temporally Consistent 4D Reconstruction from Sparse Cameras

cs.CV · 2026-03-27 · unverdicted · novelty 7.0

SparseCam4D achieves spatio-temporally consistent high-fidelity 4D reconstruction from sparse cameras via a Spatio-Temporal Distortion Field that corrects inconsistencies in generative observations.

StereoSpace: Depth-Free Synthesis of Stereo Geometry via End-to-End Diffusion in a Canonical Space

cs.CV · 2025-12-11 · unverdicted · novelty 7.0

A viewpoint-conditioned diffusion model generates stereo image pairs from monocular input in a canonical rectified space without using depth or explicit warping.

CamPVG: Camera-Controlled Panoramic Video Generation with Epipolar-Aware Diffusion

cs.CV · 2025-09-24 · unverdicted · novelty 7.0

CamPVG is the first diffusion-based framework for generating geometrically consistent panoramic videos from camera pose inputs using a panoramic Plücker embedding and spherical epipolar attention module.

RealCam: Real-Time Novel-View Video Generation with Interactive Camera Control

cs.CV · 2026-05-07 · unverdicted · novelty 6.0

RealCam is a causal autoregressive model for real-time camera-controlled video-to-video generation, using cross-frame in-context teacher distillation and loop-closed data augmentation to achieve high fidelity and consistency.

PhyCo: Learning Controllable Physical Priors for Generative Motion

cs.CV · 2026-04-30 · unverdicted · novelty 6.0

PhyCo adds continuous physical control to video diffusion models via physics-supervised fine-tuning on a large simulation dataset and VLM-guided rewards, yielding measurable gains in physical realism on the Physics-IQ benchmark.

Lyra 2.0: Explorable Generative 3D Worlds

cs.CV · 2026-04-14 · unverdicted · novelty 6.0

Lyra 2.0 produces persistent 3D-consistent video sequences for large explorable worlds by using per-frame geometry for information routing and self-augmented training to correct temporal drift.

Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories

cs.CV · 2026-04-10 · unverdicted · novelty 6.0

A video diffusion model learns a joint distribution over videos and camera trajectories by representing cameras as pixel-aligned ray encodings (raxels) denoised jointly with video frames via decoupled attention.

INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling

cs.CV · 2026-04-08 · unverdicted · novelty 6.0

INSPATIO-WORLD is a real-time framework for high-fidelity 4D scene generation and navigation from monocular videos via STAR architecture with implicit caching, explicit geometric constraints, and distribution-matching distillation.

Pixel-to-4D: Camera-Controlled Image-to-Video Generation with Dynamic 3D Gaussians

cs.CV · 2026-01-02 · unverdicted · novelty 6.0 · 2 refs

Pixel-to-4D builds a dynamic 3D Gaussian representation from one image and samples object motion in a single forward pass to produce camera-controlled videos with claimed state-of-the-art quality and speed on KITTI, Waymo, RealEstate10K and DL3DV-10K.

Diff4Splat: Controllable 4D Scene Generation with Latent Dynamic Reconstruction Models

cs.CV · 2025-11-01 · unverdicted · novelty 6.0

A feed-forward video latent transformer that predicts time-varying 3D Gaussian primitives from one image to produce controllable 4D scenes with appearance, geometry, and motion.

ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis

cs.CV · 2024-09-03 · unverdicted · novelty 6.0

ViewCrafter tames video diffusion models with point-based 3D guidance and iterative trajectory planning to produce high-fidelity novel views from single or sparse images.

CameraCtrl: Enabling Camera Control for Text-to-Video Generation

cs.CV · 2024-04-02 · unverdicted · novelty 6.0

CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.

SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer

cs.CV · 2026-05-14 · unverdicted · novelty 5.0

SANA-WM is a 2.6B-parameter efficient world model that synthesizes minute-scale 720p videos with 6-DoF camera control, trained on 213K public clips in 15 days on 64 H100s and runnable on single GPUs at 36x higher throughput than prior open baselines.

Pose-Aware Diffusion for 3D Generation

cs.CV · 2026-05-01 · unverdicted · novelty 5.0

PAD synthesizes 3D geometry in observation space via depth unprojection as anchor to eliminate pose ambiguity in image-to-3D generation.

Image-to-Video Diffusion: From Foundations to Open Frontiers

cs.CV · 2026-05-17 · unverdicted · novelty 3.0

A survey that organizes diffusion image-to-video methods into a taxonomy, distills core designs in condition encoding, temporal modeling, noise prior, and upsampling, and discusses applications plus challenges.

Cosmos World Foundation Model Platform for Physical AI

cs.CV · 2025-01-07 · unverdicted · novelty 3.0

The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.

citing papers explorer

PicoEyes: Unified Gaze Estimation Framework for Mixed Reality with a Large-Scale Multi-View Dataset · cs.CV · 2026-05-08 · unverdicted · none · ref 42 · 2 links · internal anchor
Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting · cs.CV · 2026-04-23 · unverdicted · none · ref 43 · internal anchor
UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models · cs.CV · 2026-04-19 · unverdicted · none · ref 79 · 2 links · internal anchor
Geometrically Consistent Multi-View Scene Generation from Freehand Sketches · cs.CV · 2026-04-15 · unverdicted · none · ref 53 · internal anchor
OmniCamera: A Unified Framework for Multi-task Video Generation with Arbitrary Camera Control · cs.CV · 2026-04-07 · unverdicted · none · ref 37 · internal anchor
SparseCam4D: Spatio-Temporally Consistent 4D Reconstruction from Sparse Cameras · cs.CV · 2026-03-27 · unverdicted · none · ref 42 · internal anchor
StereoSpace: Depth-Free Synthesis of Stereo Geometry via End-to-End Diffusion in a Canonical Space · cs.CV · 2025-12-11 · unverdicted · none · ref 96 · internal anchor
CamPVG: Camera-Controlled Panoramic Video Generation with Epipolar-Aware Diffusion · cs.CV · 2025-09-24 · unverdicted · none · ref 6 · internal anchor
RealCam: Real-Time Novel-View Video Generation with Interactive Camera Control · cs.CV · 2026-05-07 · unverdicted · none · ref 24 · internal anchor
PhyCo: Learning Controllable Physical Priors for Generative Motion · cs.CV · 2026-04-30 · unverdicted · none · ref 42 · internal anchor
Lyra 2.0: Explorable Generative 3D Worlds · cs.CV · 2026-04-14 · unverdicted · none · ref 120 · internal anchor
Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories · cs.CV · 2026-04-10 · unverdicted · none · ref 19 · internal anchor
INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling · cs.CV · 2026-04-08 · unverdicted · none · ref 93 · internal anchor
Pixel-to-4D: Camera-Controlled Image-to-Video Generation with Dynamic 3D Gaussians · cs.CV · 2026-01-02 · unverdicted · none · ref 38 · 2 links · internal anchor
Diff4Splat: Controllable 4D Scene Generation with Latent Dynamic Reconstruction Models · cs.CV · 2025-11-01 · unverdicted · none · ref 93 · internal anchor
ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis · cs.CV · 2024-09-03 · unverdicted · none · ref 57 · internal anchor
CameraCtrl: Enabling Camera Control for Text-to-Video Generation · cs.CV · 2024-04-02 · unverdicted · none · ref 158 · internal anchor
SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer · cs.CV · 2026-05-14 · unverdicted · none · ref 61 · internal anchor
Pose-Aware Diffusion for 3D Generation · cs.CV · 2026-05-01 · unverdicted · none · ref 54 · internal anchor
Image-to-Video Diffusion: From Foundations to Open Frontiers · cs.CV · 2026-05-17 · unverdicted · none · ref 16 · internal anchor
Cosmos World Foundation Model Platform for Physical AI · cs.CV · 2025-01-07 · unverdicted · none · ref 225 · internal anchor
