ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis

Wangbo Yu, Jinbo Xing, Li Yuan, Wenbo Hu, Xiaoyu Li, Zhipeng Huang · 2024 · cs.CV · arXiv 2409.02048

Canonical reference. 86% of the classified citations from Pith papers cite this work as background.

39 Pith papers citing it

Background 86% of classified citations

abstract

Despite recent advancements in neural 3D reconstruction, the dependence on dense multi-view captures restricts their broader applicability. In this work, we propose ViewCrafter, a novel method for synthesizing high-fidelity novel views of generic scenes from single or sparse images with the prior of video diffusion model. Our method takes advantage of the powerful generation capabilities of video diffusion model and the coarse 3D clues offered by point-based representation to generate high-quality video frames with precise camera pose control. To further enlarge the generation range of novel views, we tailored an iterative view synthesis strategy together with a camera trajectory planning algorithm to progressively extend the 3D clues and the areas covered by the novel views. With ViewCrafter, we can facilitate various applications, such as immersive experiences with real-time rendering by efficiently optimizing a 3D-GS representation using the reconstructed 3D points and the generated novel views, and scene-level text-to-3D generation for more imaginative content creation. Extensive experiments on diverse datasets demonstrate the strong generalization capability and superior performance of our method in synthesizing high-fidelity and consistent novel views.

citation-role summary

background 19 baseline 3
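
The 86% background share quoted for this paper appears to follow from the role counts above, taken over classified citations only (22 of the 39 citing papers) rather than over all citing papers. The sketch below reproduces that arithmetic. The variable names are illustrative, and the use of whole-number rounding is an assumption, not something the page documents.

```python
# Counts copied from the citation-role summary on this page.
role_counts = {"background": 19, "baseline": 3}

# Only citations with an assigned role are counted (assumption:
# the page computes its percentage over this subset).
classified = sum(role_counts.values())

# Share of each role, rounded to a whole percent (rounding rule assumed).
shares = {role: round(100 * n / classified) for role, n in role_counts.items()}

print(classified)
print(shares)
```

Under these assumptions the background share comes out at 86%, matching the figure shown for the paper.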

citation-polarity summary

background 19 baseline 3

representative citing papers

Preserve, Reveal, Expand: Faithful 4D Video Editing with Region-Aware Conditioning

cs.CV · 2026-05-20 · unverdicted · novelty 7.0

PREX decomposes target 4D video volumes into Preserve, Reveal, and Expand roles with a region-aware adapter on a frozen diffusion backbone, trained via proxy tasks, and introduces the PREBench benchmark to reduce region-structured editing failures.

GTA: Advancing Image-to-3D World Generation via Geometry Then Appearance Video Diffusion

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

GTA generates 3D worlds from single images via a two-stage video diffusion process that prioritizes geometry before appearance to improve structural consistency.

3D-Belief: Embodied Belief Inference via Generative 3D World Modeling

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

3D-Belief maintains and updates explicit 3D beliefs about partially observed environments to enable multi-hypothesis imagination and improved performance on embodied tasks.

MultiWorld: Scalable Multi-Agent Multi-View Video World Models

cs.CV · 2026-04-20 · unverdicted · novelty 7.0

MultiWorld is a scalable framework for multi-agent multi-view video world models that improves controllability and consistency over single-agent baselines in game and robot tasks.

UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models

cs.CV · 2026-04-19 · unverdicted · novelty 7.0 · 2 refs

UniGeo unifies geometric guidance across three levels in video models to reduce geometric drift and improve consistency in camera-controllable image editing.

Geometrically Consistent Multi-View Scene Generation from Freehand Sketches

cs.CV · 2026-04-15 · unverdicted · novelty 7.0

A framework generates consistent multi-view scenes from one freehand sketch via a ~9k-sample dataset, Parallel Camera-Aware Attention Adapters, and Sparse Correspondence Supervision Loss, outperforming baselines in realism and consistency.

DreamStereo: Towards Real-Time Stereo Inpainting for HD Videos

cs.CV · 2026-04-14 · unverdicted · novelty 7.0

DreamStereo uses GAPW, PBDP, and SASI to enable real-time stereo video inpainting at 25 FPS for HD videos by reducing over 70% redundant computation while maintaining quality.

Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale

cs.CV · 2026-04-13 · unverdicted · novelty 7.0

A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.

Novel View Synthesis as Video Completion

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

Video diffusion models can be adapted into permutation-invariant generators for sparse novel view synthesis by treating the problem as video completion and removing temporal order cues.

OmniCamera: A Unified Framework for Multi-task Video Generation with Arbitrary Camera Control

cs.CV · 2026-04-07 · unverdicted · novelty 7.0

OmniCamera disentangles video content and camera motion for multi-task generation with arbitrary camera control via the OmniCAM hybrid dataset and Dual-level Curriculum Co-Training.

ProDiG: Progressive Diffusion-Guided Gaussian Splatting for Aerial to Ground Reconstruction

cs.CV · 2026-04-02 · unverdicted · novelty 7.0

ProDiG progressively transforms aerial Gaussian splats into coherent ground-level 3D reconstructions via diffusion guidance and specialized attention modules.

SparseCam4D: Spatio-Temporally Consistent 4D Reconstruction from Sparse Cameras

cs.CV · 2026-03-27 · unverdicted · novelty 7.0

SparseCam4D achieves spatio-temporally consistent high-fidelity 4D reconstruction from sparse cameras via a Spatio-Temporal Distortion Field that corrects inconsistencies in generative observations.

ChopGrad: Pixel-Wise Losses for Latent Video Diffusion via Truncated Backpropagation

cs.CV · 2026-03-18 · unverdicted · novelty 7.0

ChopGrad truncates backpropagation to local frame windows in video diffusion models, reducing memory from linear in frame count to constant while enabling pixel-wise loss fine-tuning.

FreeOrbit4D: Training-Free Arbitrary Camera Redirection for Monocular Videos via Foreground-Complete 4D Reconstruction

cs.CV · 2026-01-26 · unverdicted · novelty 7.0

FreeOrbit4D recovers a foreground-complete 4D proxy via decoupled background and object-centric reconstruction to provide geometric guidance for large-angle camera redirection in monocular videos using conditional video diffusion.

GeoFlow: Enforcing Implicit Geometric Consistency in Video Generation

cs.CV · 2026-05-18 · unverdicted · novelty 6.0

GeoFlow adds a geometry-consistency reward based on rigid camera flow and object appearance preservation, integrated via reinforcement fine-tuning to improve geometric coherence in video generation.

Warp-as-History: Generalizable Camera-Controlled Video Generation from One Training Video

cs.CV · 2026-05-14 · unverdicted · novelty 6.0

Warp-as-History enables zero-shot camera trajectory following in frozen video models by supplying camera-warped pseudo-history, with single-video LoRA fine-tuning improving generalization to unseen videos.

UniFixer: A Universal Reference-Guided Fixer for Diffusion-Based View Synthesis

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

UniFixer is a universal reference-guided framework that fixes spatial, temporal, and backbone-related degradations in diffusion-based view synthesis via coarse-to-fine modules and achieves zero-shot SOTA results on novel view synthesis and stereo conversion.

$h$-control: Training-Free Camera Control via Block-Conditional Gibbs Refinement

cs.CV · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

h-control augments hard-replacement guidance with block-conditional pseudo-Gibbs refinement on unobserved latent sites and adaptive 3D patch freezing to achieve superior FVD on RealEstate10K and DAVIS.

AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model

cs.CV · 2026-04-21 · unverdicted · novelty 6.0

AnyRecon enables scalable 3D reconstruction from arbitrary sparse unordered views by combining video diffusion with explicit global geometric memory and retrieval to maintain consistency across large viewpoint changes.

CityRAG: Stepping Into a City via Spatially-Grounded Video Generation

cs.CV · 2026-04-21 · unverdicted · novelty 6.0

CityRAG generates minutes-long 3D-consistent videos of real-world cities by grounding outputs in geo-registered data and using temporally unaligned training to disentangle fixed scenes from transient elements like weather.

Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation

cs.CV · 2026-04-20 · unverdicted · novelty 6.0

A decoupled memory branch with hybrid cues, cross-attention, and gating improves spatial consistency and data efficiency in long-horizon camera-trajectory video generation.

Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective

cs.CV · 2026-04-15 · unverdicted · novelty 6.0

The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temporal-aware modeling.

Lyra 2.0: Explorable Generative 3D Worlds

cs.CV · 2026-04-14 · unverdicted · novelty 6.0

Lyra 2.0 produces persistent 3D-consistent video sequences for large explorable worlds by using per-frame geometry for information routing and self-augmented training to correct temporal drift.

Rein3D: Reinforced 3D Indoor Scene Generation with Panoramic Video Diffusion Models

cs.CV · 2026-04-12 · unverdicted · novelty 6.0

Rein3D generates photorealistic, globally consistent 3D indoor scenes by using a restore-and-refine process where radial panoramic videos are restored via diffusion models and then used to update a 3D Gaussian field.

citing papers explorer

Showing 39 of 39 citing papers.

Preserve, Reveal, Expand: Faithful 4D Video Editing with Region-Aware Conditioning cs.CV · 2026-05-20 · unverdicted · none · ref 38 · internal anchor
PREX decomposes target 4D video volumes into Preserve, Reveal, and Expand roles with a region-aware adapter on a frozen diffusion backbone, trained via proxy tasks, and introduces the PREBench benchmark to reduce region-structured editing failures.
GTA: Advancing Image-to-3D World Generation via Geometry Then Appearance Video Diffusion cs.CV · 2026-05-13 · unverdicted · none · ref 28 · internal anchor
GTA generates 3D worlds from single images via a two-stage video diffusion process that prioritizes geometry before appearance to improve structural consistency.
3D-Belief: Embodied Belief Inference via Generative 3D World Modeling cs.CV · 2026-05-12 · unverdicted · none · ref 22 · internal anchor
3D-Belief maintains and updates explicit 3D beliefs about partially observed environments to enable multi-hypothesis imagination and improved performance on embodied tasks.
MultiWorld: Scalable Multi-Agent Multi-View Video World Models cs.CV · 2026-04-20 · unverdicted · none · ref 65 · internal anchor
MultiWorld is a scalable framework for multi-agent multi-view video world models that improves controllability and consistency over single-agent baselines in game and robot tasks.
UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models cs.CV · 2026-04-19 · unverdicted · none · ref 86 · 2 links · internal anchor
UniGeo unifies geometric guidance across three levels in video models to reduce geometric drift and improve consistency in camera-controllable image editing.
Geometrically Consistent Multi-View Scene Generation from Freehand Sketches cs.CV · 2026-04-15 · unverdicted · none · ref 54 · internal anchor
A framework generates consistent multi-view scenes from one freehand sketch via a ~9k-sample dataset, Parallel Camera-Aware Attention Adapters, and Sparse Correspondence Supervision Loss, outperforming baselines in realism and consistency.
DreamStereo: Towards Real-Time Stereo Inpainting for HD Videos cs.CV · 2026-04-14 · unverdicted · none · ref 42 · internal anchor
DreamStereo uses GAPW, PBDP, and SASI to enable real-time stereo video inpainting at 25 FPS for HD videos by reducing over 70% redundant computation while maintaining quality.
Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale cs.CV · 2026-04-13 · unverdicted · none · ref 90 · internal anchor
A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.
Novel View Synthesis as Video Completion cs.CV · 2026-04-09 · unverdicted · none · ref 48 · internal anchor
Video diffusion models can be adapted into permutation-invariant generators for sparse novel view synthesis by treating the problem as video completion and removing temporal order cues.
OmniCamera: A Unified Framework for Multi-task Video Generation with Arbitrary Camera Control cs.CV · 2026-04-07 · unverdicted · none · ref 43 · internal anchor
OmniCamera disentangles video content and camera motion for multi-task generation with arbitrary camera control via the OmniCAM hybrid dataset and Dual-level Curriculum Co-Training.
ProDiG: Progressive Diffusion-Guided Gaussian Splatting for Aerial to Ground Reconstruction cs.CV · 2026-04-02 · unverdicted · none · ref 44 · internal anchor
ProDiG progressively transforms aerial Gaussian splats into coherent ground-level 3D reconstructions via diffusion guidance and specialized attention modules.
SparseCam4D: Spatio-Temporally Consistent 4D Reconstruction from Sparse Cameras cs.CV · 2026-03-27 · unverdicted · none · ref 48 · internal anchor
SparseCam4D achieves spatio-temporally consistent high-fidelity 4D reconstruction from sparse cameras via a Spatio-Temporal Distortion Field that corrects inconsistencies in generative observations.
ChopGrad: Pixel-Wise Losses for Latent Video Diffusion via Truncated Backpropagation cs.CV · 2026-03-18 · unverdicted · none · ref 76 · internal anchor
ChopGrad truncates backpropagation to local frame windows in video diffusion models, reducing memory from linear in frame count to constant while enabling pixel-wise loss fine-tuning.
FreeOrbit4D: Training-Free Arbitrary Camera Redirection for Monocular Videos via Foreground-Complete 4D Reconstruction cs.CV · 2026-01-26 · unverdicted · none · ref 47 · internal anchor
FreeOrbit4D recovers a foreground-complete 4D proxy via decoupled background and object-centric reconstruction to provide geometric guidance for large-angle camera redirection in monocular videos using conditional video diffusion.
GeoFlow: Enforcing Implicit Geometric Consistency in Video Generation cs.CV · 2026-05-18 · unverdicted · none · ref 88 · internal anchor
GeoFlow adds a geometry-consistency reward based on rigid camera flow and object appearance preservation, integrated via reinforcement fine-tuning to improve geometric coherence in video generation.
Warp-as-History: Generalizable Camera-Controlled Video Generation from One Training Video cs.CV · 2026-05-14 · unverdicted · none · ref 14 · internal anchor
Warp-as-History enables zero-shot camera trajectory following in frozen video models by supplying camera-warped pseudo-history, with single-video LoRA fine-tuning improving generalization to unseen videos.
UniFixer: A Universal Reference-Guided Fixer for Diffusion-Based View Synthesis cs.CV · 2026-05-12 · unverdicted · none · ref 54 · internal anchor
UniFixer is a universal reference-guided framework that fixes spatial, temporal, and backbone-related degradations in diffusion-based view synthesis via coarse-to-fine modules and achieves zero-shot SOTA results on novel view synthesis and stereo conversion.
$h$-control: Training-Free Camera Control via Block-Conditional Gibbs Refinement cs.CV · 2026-05-12 · unverdicted · none · ref 44 · 2 links · internal anchor
h-control augments hard-replacement guidance with block-conditional pseudo-Gibbs refinement on unobserved latent sites and adaptive 3D patch freezing to achieve superior FVD on RealEstate10K and DAVIS.
AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model cs.CV · 2026-04-21 · unverdicted · none · ref 29 · internal anchor
AnyRecon enables scalable 3D reconstruction from arbitrary sparse unordered views by combining video diffusion with explicit global geometric memory and retrieval to maintain consistency across large viewpoint changes.
CityRAG: Stepping Into a City via Spatially-Grounded Video Generation cs.CV · 2026-04-21 · unverdicted · none · ref 69 · internal anchor
CityRAG generates minutes-long 3D-consistent videos of real-world cities by grounding outputs in geo-registered data and using temporally unaligned training to disentangle fixed scenes from transient elements like weather.
Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation cs.CV · 2026-04-20 · unverdicted · none · ref 58 · internal anchor
A decoupled memory branch with hybrid cues, cross-attention, and gating improves spatial consistency and data efficiency in long-horizon camera-trajectory video generation.
Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective cs.CV · 2026-04-15 · unverdicted · none · ref 255 · internal anchor
The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temporal-aware modeling.
Lyra 2.0: Explorable Generative 3D Worlds cs.CV · 2026-04-14 · unverdicted · none · ref 130 · internal anchor
Lyra 2.0 produces persistent 3D-consistent video sequences for large explorable worlds by using per-frame geometry for information routing and self-augmented training to correct temporal drift.
Rein3D: Reinforced 3D Indoor Scene Generation with Panoramic Video Diffusion Models cs.CV · 2026-04-12 · unverdicted · none · ref 62 · internal anchor
Rein3D generates photorealistic, globally consistent 3D indoor scenes by using a restore-and-refine process where radial panoramic videos are restored via diffusion models and then used to update a 3D Gaussian field.
NavCrafter: Exploring 3D Scenes from a Single Image cs.CV · 2026-04-03 · unverdicted · none · ref 17 · internal anchor
NavCrafter generates controllable novel-view videos from one image via video diffusion, geometry-aware expansion, and enhanced 3D Gaussian Splatting to achieve state-of-the-art synthesis under large viewpoint changes.
WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling cs.CV · 2025-12-16 · unverdicted · none · ref 77 · internal anchor
WorldPlay uses dual action representation, reconstituted context memory, and context forcing distillation to produce consistent 720p streaming video at 24 FPS for interactive world modeling.
Less is More: Data-Efficient Adaptation for Controllable Text-to-Video Generation cs.CV · 2025-11-21 · unverdicted · none · ref 51 · internal anchor
Fine-tuning text-to-video models on sparse low-quality synthetic data for physical camera controls outperforms fine-tuning on photorealistic data.
Diff4Splat: Controllable 4D Scene Generation with Latent Dynamic Reconstruction Models cs.CV · 2025-11-01 · unverdicted · none · ref 102 · internal anchor
A feed-forward video latent transformer that predicts time-varying 3D Gaussian primitives from one image to produce controllable 4D scenes with appearance, geometry, and motion.
Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling cs.CV · 2025-07-10 · unverdicted · none · ref 89 · internal anchor
Geometry Forcing aligns video diffusion representations with geometric foundation model features via angular cosine and scale regression objectives to improve 3D consistency in generated videos.
BulletGen: Improving 4D Reconstruction with Bullet-Time Generation cs.GR · 2025-06-23 · unverdicted · none · ref 88 · internal anchor
BulletGen enhances 4D dynamic scene reconstruction from monocular videos by supervising Gaussian optimization with diffusion-generated frames aligned at a bullet-time step, achieving SOTA on novel-view synthesis and tracking.
DissolveStereo: Coarse Depth Injection for Zero-Shot Stereo Video Generation cs.CV · 2024-11-21 · unverdicted · none · ref 45 · internal anchor
DissolveStereo injects coarse dissolved depth maps into video diffusion latents via noisy restart and iterative refinement to produce temporally coherent stereo videos zero-shot.
Can These Views Be One Scene? Evaluating Multiview 3D Consistency when 3D Foundation Models Hallucinate cs.CV · 2026-05-18 · unverdicted · none · ref 34 · internal anchor
Introduces a robustness benchmark for multiview 3D consistency and COLMAP-based metrics that better detect hallucinations in 3D foundation models than existing neural metrics.
SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer cs.CV · 2026-05-14 · unverdicted · none · ref 62 · internal anchor
SANA-WM is a 2.6B-parameter efficient world model that synthesizes minute-scale 720p videos with 6-DoF camera control, trained on 213K public clips in 15 days on 64 H100s and runnable on single GPUs at 36x higher throughput than prior open baselines.
Pose-Aware Diffusion for 3D Generation cs.CV · 2026-05-01 · unverdicted · none · ref 61 · internal anchor
PAD synthesizes 3D geometry in observation space via depth unprojection as anchor to eliminate pose ambiguity in image-to-3D generation.
KFC-W: Generating 3D-Consistent Videos from Unposed Internet Photos cs.CV · 2024-11-20 · unverdicted · none · ref 89 · internal anchor
KFC-W is a self-supervised 3D-aware video model trained on videos and multiview internet photos that produces geometrically consistent interpolations between unposed input images without any 3D annotations.
Image-to-Video Diffusion: From Foundations to Open Frontiers cs.CV · 2026-05-17 · unverdicted · none · ref 127 · internal anchor
A survey that organizes diffusion image-to-video methods into a taxonomy, distills core designs in condition encoding, temporal modeling, noise prior, and upsampling, and discusses applications plus challenges.
A Survey on 3D Gaussian Splatting Applications: Segmentation, Editing, and Generation cs.CV · 2025-08-13 · unverdicted · none · ref 244 · internal anchor
A survey that categorizes and summarizes methods applying 3D Gaussian Splatting to segmentation, editing, generation, and related tasks, including datasets and evaluation protocols.
Evolution of Video Generative Foundations cs.CV · 2026-04-07 · unverdicted · none · ref 285 · internal anchor
This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.
World-R1: Reinforcing 3D Constraints for Text-to-Video Generation cs.CV · 2026-04-27 · unreviewed · ref 57 · 2 links · internal anchor