arxiv: 2409.02048 · v1 · submitted 2024-09-03 · 💻 cs.CV

ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis

Wangbo Yu, Jinbo Xing, Li Yuan, Wenbo Hu, Xiaoyu Li, Zhipeng Huang, Xiangjun Gao, Tien-Tsin Wong, Ying Shan, Yonghong Tian

Pith reviewed 2026-05-13 22:55 UTC · model grok-4.3

keywords: novel view synthesis, video diffusion, point-based representation, iterative synthesis, camera trajectory planning, 3D Gaussian splatting, single-image 3D

The pith

ViewCrafter steers a pre-trained video diffusion model with coarse point clouds and planned trajectories to synthesize consistent high-fidelity novel views from single or sparse images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ViewCrafter as a method that combines the generative power of video diffusion models with simple 3D point information to create accurate new viewpoints of ordinary scenes. It starts from one or a few images, conditions the diffusion process on point-based clues, and generates video frames that follow exact camera paths. An iterative loop then plans new trajectories, adds more points from the synthesized views, and expands coverage without needing dense input captures. The approach targets applications such as optimizing 3D Gaussian splatting for real-time rendering and enabling text-to-3D scene creation. If the steering works reliably, it reduces the data demands of traditional neural 3D reconstruction while preserving visual quality and geometric consistency.

Core claim

ViewCrafter uses a video diffusion model conditioned on point-based 3D clues and explicit camera trajectories to generate sequences of high-quality novel views. An iterative synthesis procedure with a dedicated trajectory planning algorithm progressively enlarges the set of reconstructed points and the spatial extent of the synthesized views, allowing high-fidelity results from minimal input images.

What carries the argument

Iterative view synthesis loop that conditions a video diffusion model on coarse point clouds and planned camera trajectories to extend 3D coverage.
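The loop described above can be sketched in a few lines. This is only an illustrative skeleton under stated assumptions, not the authors' implementation: the four callables (`plan_trajectory`, `render_points`, `generate_views`, `lift_to_points`) are hypothetical stand-ins for the trajectory planner, the point-cloud renderer, the conditioned video diffusion model, and the step that turns a generated frame back into 3D points. Updating the cloud from the last frame of each clip is likewise an assumption made for brevity.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]
Pose = Tuple[float, float, float]  # hypothetical: camera position only
Frame = Dict[str, object]          # hypothetical placeholder for an image


def iterative_view_synthesis(
    initial_points: Sequence[Point],
    start_pose: Pose,
    plan_trajectory: Callable[[List[Point], Pose], List[Pose]],
    render_points: Callable[[List[Point], Pose], Frame],
    generate_views: Callable[[List[Frame]], List[Frame]],
    lift_to_points: Callable[[Frame, Pose], List[Point]],
    num_iterations: int = 3,
) -> Tuple[List[Point], List[Frame]]:
    """Alternate between planning, conditioned generation, and point-cloud growth."""
    points = list(initial_points)
    views: List[Frame] = []
    pose = start_pose
    for _ in range(num_iterations):
        # 1. Plan a short camera path from the current pose.
        trajectory = plan_trajectory(points, pose)
        # 2. Render the coarse point cloud along that path as conditioning.
        renders = [render_points(points, p) for p in trajectory]
        # 3. Let the generative model turn the coarse renders into frames.
        frames = generate_views(renders)
        views.extend(frames)
        # 4. Grow the point cloud from the last generated frame (assumption).
        for candidate in lift_to_points(frames[-1], trajectory[-1]):
            if candidate not in points:
                points.append(candidate)
        pose = trajectory[-1]
    return points, views


# Toy stand-ins so the skeleton runs end to end; none of these model real components.
def toy_plan(points: List[Point], pose: Pose) -> List[Pose]:
    x, y, z = pose
    return [(x + step, y, z) for step in (1.0, 2.0)]


def toy_render(points: List[Point], pose: Pose) -> Frame:
    return {"pose": pose, "visible": len(points)}


def toy_generate(renders: List[Frame]) -> List[Frame]:
    return [dict(r, generated=True) for r in renders]


def toy_lift(frame: Frame, pose: Pose) -> List[Point]:
    return [(pose[0], 0.0, 1.0)]


points, views = iterative_view_synthesis(
    [(0.0, 0.0, 1.0)], (0.0, 0.0, 0.0),
    toy_plan, toy_render, toy_generate, toy_lift,
    num_iterations=3,
)
```

With the toy stand-ins, each iteration contributes two frames and one new point, so the structure of the loop (rather than any image quality) is what the example exercises.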

If this is right

The generated views and points can be used to optimize a 3D Gaussian splatting representation that supports real-time rendering.
The same pipeline enables scene-level text-to-3D generation by first creating consistent views and then fitting a 3D model.
The method works on generic scenes and shows strong generalization across diverse datasets without retraining the diffusion model.
It reduces reliance on dense multi-view captures that currently limit practical 3D reconstruction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the diffusion model already encodes strong 3D priors, adding explicit point conditioning may become unnecessary for short trajectories.
The trajectory planner could be replaced by learned policies that adapt to scene content rather than following fixed heuristics.
Extending the loop to handle dynamic objects would require the underlying video model to maintain temporal coherence beyond static geometry.

Load-bearing premise

A pre-trained video diffusion model can be steered by coarse point clouds and planned trajectories without accumulating geometric drift or view inconsistencies across repeated synthesis steps.

What would settle it

Running the iterative process around a full 360-degree orbit of a known scene and measuring whether the final set of generated views produces a 3D reconstruction whose projected points deviate measurably from the initial input points or exhibit visible seams.
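One way to make "deviate measurably" concrete is a mean reprojection drift in pixels. The sketch below is an assumption-laden illustration, not a protocol taken from the paper: it presumes known one-to-one correspondences between the initial points and the points recovered after the orbit, a pinhole camera at the origin looking down +z, and hypothetical names (`project`, `mean_reprojection_drift`, `focal`, `cx`, `cy`).

```python
import math
from typing import Sequence, Tuple

Point3 = Tuple[float, float, float]
Pixel = Tuple[float, float]


def project(point: Point3, focal: float, cx: float, cy: float) -> Pixel:
    """Pinhole projection for a camera at the origin looking down +z."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point is behind the camera")
    return (focal * x / z + cx, focal * y / z + cy)


def mean_reprojection_drift(
    initial: Sequence[Point3],
    reconstructed: Sequence[Point3],
    focal: float,
    cx: float,
    cy: float,
) -> float:
    """Average pixel distance between corresponding projected points."""
    if len(initial) != len(reconstructed) or not initial:
        raise ValueError("need two non-empty point lists of equal length")
    total = 0.0
    for p, q in zip(initial, reconstructed):
        u0, v0 = project(p, focal, cx, cy)
        u1, v1 = project(q, focal, cx, cy)
        total += math.hypot(u1 - u0, v1 - v0)
    return total / len(initial)


# Two points; the second has shifted by 0.1 units in x after the full orbit.
drift = mean_reprojection_drift(
    [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)],
    [(0.0, 0.0, 2.0), (1.1, 0.0, 2.0)],
    focal=100.0, cx=0.0, cy=0.0,
)
```

In this toy case the second point moves by about 5 pixels and the first not at all, so the mean drift is about 2.5 pixels. Plotting such a number against iteration count would show whether errors compound.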

Original abstract

Despite recent advancements in neural 3D reconstruction, the dependence on dense multi-view captures restricts their broader applicability. In this work, we propose ViewCrafter, a novel method for synthesizing high-fidelity novel views of generic scenes from single or sparse images with the prior of video diffusion model. Our method takes advantage of the powerful generation capabilities of video diffusion model and the coarse 3D clues offered by point-based representation to generate high-quality video frames with precise camera pose control. To further enlarge the generation range of novel views, we tailored an iterative view synthesis strategy together with a camera trajectory planning algorithm to progressively extend the 3D clues and the areas covered by the novel views. With ViewCrafter, we can facilitate various applications, such as immersive experiences with real-time rendering by efficiently optimizing a 3D-GS representation using the reconstructed 3D points and the generated novel views, and scene-level text-to-3D generation for more imaginative content creation. Extensive experiments on diverse datasets demonstrate the strong generalization capability and superior performance of our method in synthesizing high-fidelity and consistent novel views.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ViewCrafter gives a workable way to condition video diffusion on point clouds and run an iterative camera-planning loop for novel views from sparse inputs, though the consistency claims rest on unshown experimental details.

The main point is that this paper takes a pre-trained video diffusion model, feeds it coarse point clouds from the input views, and adds an iterative loop that plans new camera trajectories to synthesize more frames and grow the covered area. That specific combination of diffusion prior, point conditioning, and explicit planning is not in the earlier work they cite, so the engineering synthesis is the actual contribution here. They also show how to feed the outputs straight into 3D Gaussian splatting for real-time rendering and into text-to-3D pipelines, which are practical next steps.

The abstract says the method generalizes across diverse datasets and produces high-fidelity consistent views, and the full text apparently backs this with experiments. That part is useful for anyone who needs to bootstrap 3D from one or a few images without dense capture.

The soft spot is the iterative loop itself. The stress-test note is right to flag possible geometric drift: nothing in the abstract describes extra mechanisms like depth-aware losses or reprojection checks that would keep errors from compounding when the point cloud is updated from generated frames. If the experiments only show short-range qualitative results without ablations on iteration count, baseline distance, or quantitative consistency metrics, the “precise camera pose control” and “strong generalization” claims remain hard to judge.

This is for people already working on diffusion-based 3D or sparse-view reconstruction who want a concrete recipe they can try. It deserves a serious referee because the method is spelled out clearly enough to implement and the applications are timely, even if the quantitative support needs tightening.

Referee Report

2 major / 2 minor

Summary. The paper proposes ViewCrafter, a method that conditions pre-trained video diffusion models on coarse point-based 3D representations derived from single or sparse input images to synthesize high-fidelity novel views with explicit camera-pose control. An iterative synthesis loop combined with a camera-trajectory planner progressively expands the covered 3D region, after which the generated views and points are used to optimize a 3D Gaussian Splatting representation or to support text-to-3D generation. The authors claim strong generalization and superior performance across diverse scenes.

Significance. If the iterative conditioning scheme maintains geometric consistency, the work would offer a practical route to high-quality novel-view synthesis from minimal captures by repurposing large-scale video diffusion priors, thereby lowering the data barrier for immersive rendering and scene-level generative 3D pipelines.

major comments (2)

[Section 3.2] Iterative view synthesis: the method relies on the diffusion prior respecting expanding point clouds and planned trajectories, yet supplies no explicit 3D-consistency mechanism (depth-aware attention, reprojection loss, or latent-space 3D injection) that would bound stochastic drift. Without such a safeguard, early-frame pose or depth errors will corrupt subsequent point clouds, directly threatening the central claim of “precise camera pose control” for large view ranges.
[Experimental section] The abstract asserts “superior performance” and “strong generalization” but the manuscript text provides neither quantitative tables (PSNR, SSIM, LPIPS on DTU/LLFF/RealEstate10K) nor ablation studies isolating the contribution of the point conditioner versus the trajectory planner. These omissions render the performance claims unverifiable from the given material.
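For readers less familiar with the metrics the referee asks for, PSNR is the simplest of the three and is defined as 10 · log10(MAX² / MSE). The sketch below is a plain-Python illustration on flattened pixel lists, not an evaluation script from the paper; the function name `psnr` and the assumption that intensities lie in [0, 1] are choices made for this example. SSIM and LPIPS need considerably more machinery and are not shown.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], rendered: Sequence[float], max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(rendered) or not reference:
        raise ValueError("need two non-empty pixel lists of equal length")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Every pixel is off by 0.1 on a [0, 1] scale, so MSE is about 0.01.
psnr_value = psnr([0.0, 0.5, 1.0, 0.5], [0.1, 0.4, 0.9, 0.6])
```

Here the value comes out at roughly 20 dB. An ablation of the kind requested would report this number per held-out view, with and without each component.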

minor comments (2)

[Section 3.1] Clarify the precise form of point-cloud conditioning (e.g., whether points are rasterized into the latent space or injected via cross-attention) and state the number of diffusion steps used at inference.
[Figure 4] The figure and the accompanying text should include failure cases (e.g., thin structures or reflective surfaces) to illustrate the practical limits of the iterative loop.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comments point-by-point below and will incorporate revisions to strengthen the presentation of 3D consistency and to provide verifiable quantitative results.

Referee: [Section 3.2] Iterative view synthesis: the method relies on the diffusion prior respecting expanding point clouds and planned trajectories, yet supplies no explicit 3D-consistency mechanism (depth-aware attention, reprojection loss, or latent-space 3D injection) that would bound stochastic drift. Without such a safeguard, early-frame pose or depth errors will corrupt subsequent point clouds, directly threatening the central claim of “precise camera pose control” for large view ranges.

Authors: We appreciate the referee's concern about potential stochastic drift in the iterative synthesis loop. In ViewCrafter, the coarse point cloud is rendered into each target view's camera frustum and injected as an additional conditioning signal into the video diffusion model's latent space at every denoising step; this provides ongoing geometric guidance that the pre-trained prior respects. The camera-trajectory planner further mitigates drift by selecting short, overlapping trajectories that keep new views anchored to the current point cloud before the cloud is updated. While we do not add an auxiliary reprojection loss or depth-aware attention layers, the combination of explicit point conditioning and incremental planning empirically limits error accumulation, as evidenced by the consistent novel-view sequences in our experiments. To make this mechanism clearer, we will expand Section 3.2 with a dedicated paragraph on implicit consistency enforcement and include additional visualizations of point-cloud evolution and pose-error accumulation in the revision.
Revision: partial

Referee: [Experimental section] The abstract asserts “superior performance” and “strong generalization” but the manuscript text provides neither quantitative tables (PSNR, SSIM, LPIPS on DTU/LLFF/RealEstate10K) nor ablation studies isolating the contribution of the point conditioner versus the trajectory planner. These omissions render the performance claims unverifiable from the given material.

Authors: We acknowledge that the current manuscript version focuses on qualitative demonstrations and downstream applications. To render the claims of superior performance and strong generalization verifiable, we will add a new quantitative evaluation subsection that reports PSNR, SSIM, and LPIPS on DTU, LLFF, and RealEstate10K, together with comparisons against recent single-view and sparse-view baselines. We will also include ablation studies that isolate the point conditioner from the trajectory planner by measuring performance when each component is removed. These additions will appear in the revised experimental section.
Revision: yes
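The first response above turns on rendering the coarse point cloud into each target view. A minimal sketch of that idea follows, under simplifying assumptions that are not taken from the paper: a pinhole camera that only translates (no rotation), nearest-pixel splatting, and a z-buffer that keeps the closest point per pixel. The function `render_point_cloud` and its parameters are hypothetical names for this illustration.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point3 = Tuple[float, float, float]
DepthMap = List[List[Optional[float]]]


def render_point_cloud(
    points: Sequence[Point3],
    camera_position: Point3,
    focal: float,
    width: int,
    height: int,
) -> DepthMap:
    """Splat points into a depth map for an axis-aligned camera looking down +z."""
    depth: DepthMap = [[None] * width for _ in range(height)]
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    for x, y, z in points:
        # World to camera coordinates (translation only, by assumption).
        xc = x - camera_position[0]
        yc = y - camera_position[1]
        zc = z - camera_position[2]
        if zc <= 0:
            continue  # behind the camera
        u = math.floor(focal * xc / zc + cx + 0.5)
        v = math.floor(focal * yc / zc + cy + 0.5)
        if 0 <= u < width and 0 <= v < height:
            # Z-buffer test: keep only the nearest point per pixel.
            if depth[v][u] is None or zc < depth[v][u]:
                depth[v][u] = zc
    return depth


cloud = [(0.0, 0.0, 2.0), (0.0, 0.0, 4.0), (1.0, 0.0, 2.0), (0.0, 0.0, -1.0)]
front = render_point_cloud(cloud, (0.0, 0.0, 0.0), focal=2.0, width=5, height=5)
shifted = render_point_cloud(cloud, (1.0, 0.0, 0.0), focal=2.0, width=5, height=5)
```

Pixels that no point lands on stay `None`. In a pipeline like the one described, those holes are presumably where the generative prior has to fill in content, which is also where drift could enter once generated frames feed back into the cloud.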

Circularity Check

0 steps flagged

No significant circularity; relies on external pre-trained video diffusion prior

The paper conditions an external pre-trained video diffusion model on coarse point clouds and planned trajectories, then applies an iterative synthesis loop with camera planning. No derivation, equation, or central claim reduces by construction to a quantity the authors themselves fitted or defined in terms of the output. The video diffusion weights are treated as a fixed external prior rather than a self-derived component. Self-citations, if present, are not load-bearing for the core claim of consistency under iteration. This yields a normal low score with the method remaining self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract does not enumerate explicit free parameters, axioms, or new entities; the approach inherits the inductive biases of a pre-trained video diffusion model and assumes that point clouds provide sufficient coarse 3D guidance.

Forward citations

Cited by 26 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

GTA: Advancing Image-to-3D World Generation via Geometry Then Appearance Video Diffusion
cs.CV 2026-05 unverdicted novelty 7.0
GTA generates 3D worlds from single images via a two-stage video diffusion process that prioritizes geometry before appearance to improve structural consistency.

$h$-control: Training-Free Camera Control via Block-Conditional Gibbs Refinement
cs.CV 2026-05 unverdicted novelty 7.0
h-control introduces block-conditional pseudo-Gibbs refinement for training-free camera control in flow-matching video generators, achieving superior FVD scores on RealEstate10K and DAVIS benchmarks.

3D-Belief: Embodied Belief Inference via Generative 3D World Modeling
cs.CV 2026-05 unverdicted novelty 7.0
3D-Belief maintains and updates explicit 3D beliefs about partially observed environments to enable multi-hypothesis imagination and improved performance on embodied tasks.

MultiWorld: Scalable Multi-Agent Multi-View Video World Models
cs.CV 2026-04 unverdicted novelty 7.0
MultiWorld is a scalable framework for multi-agent multi-view video world models that improves controllability and consistency over single-agent baselines in game and robot tasks.

UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
cs.CV 2026-04 unverdicted novelty 7.0
UniGeo unifies geometric guidance across three levels in video models to reduce geometric drift and improve consistency in camera-controllable image editing.

Geometrically Consistent Multi-View Scene Generation from Freehand Sketches
cs.CV 2026-04 unverdicted novelty 7.0
A framework generates consistent multi-view scenes from one freehand sketch via a ~9k-sample dataset, Parallel Camera-Aware Attention Adapters, and Sparse Correspondence Supervision Loss, outperforming baselines in re...

DreamStereo: Towards Real-Time Stereo Inpainting for HD Videos
cs.CV 2026-04 unverdicted novelty 7.0
DreamStereo uses GAPW, PBDP, and SASI to enable real-time stereo video inpainting at 25 FPS for HD videos by reducing over 70% redundant computation while maintaining quality.

Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale
cs.CV 2026-04 unverdicted novelty 7.0
A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.

Novel View Synthesis as Video Completion
cs.CV 2026-04 unverdicted novelty 7.0
Video diffusion models can be adapted into permutation-invariant generators for sparse novel view synthesis by treating the problem as video completion and removing temporal order cues.

OmniCamera: A Unified Framework for Multi-task Video Generation with Arbitrary Camera Control
cs.CV 2026-04 unverdicted novelty 7.0
OmniCamera disentangles video content and camera motion for multi-task generation with arbitrary camera control via the OmniCAM hybrid dataset and Dual-level Curriculum Co-Training.

ProDiG: Progressive Diffusion-Guided Gaussian Splatting for Aerial to Ground Reconstruction
cs.CV 2026-04 unverdicted novelty 7.0
ProDiG progressively transforms aerial Gaussian splats into coherent ground-level 3D reconstructions via diffusion guidance and specialized attention modules.

SparseCam4D: Spatio-Temporally Consistent 4D Reconstruction from Sparse Cameras
cs.CV 2026-03 unverdicted novelty 7.0
SparseCam4D achieves spatio-temporally consistent high-fidelity 4D reconstruction from sparse cameras via a Spatio-Temporal Distortion Field that corrects inconsistencies in generative observations.

ChopGrad: Pixel-Wise Losses for Latent Video Diffusion via Truncated Backpropagation
cs.CV 2026-03 unverdicted novelty 7.0
ChopGrad truncates backpropagation to local frame windows in video diffusion models, reducing memory from linear in frame count to constant while enabling pixel-wise loss fine-tuning.

Warp-as-History: Generalizable Camera-Controlled Video Generation from One Training Video
cs.CV 2026-05 unverdicted novelty 6.0
Warp-as-History enables zero-shot camera trajectory following in frozen video models by supplying camera-warped pseudo-history, with single-video LoRA fine-tuning improving generalization to unseen videos.

UniFixer: A Universal Reference-Guided Fixer for Diffusion-Based View Synthesis
cs.CV 2026-05 unverdicted novelty 6.0
UniFixer is a universal reference-guided framework that fixes spatial, temporal, and backbone-related degradations in diffusion-based view synthesis via coarse-to-fine modules and achieves zero-shot SOTA results on no...

AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model
cs.CV 2026-04 unverdicted novelty 6.0
AnyRecon enables scalable 3D reconstruction from arbitrary sparse unordered views by combining video diffusion with explicit global geometric memory and retrieval to maintain consistency across large viewpoint changes.

CityRAG: Stepping Into a City via Spatially-Grounded Video Generation
cs.CV 2026-04 unverdicted novelty 6.0
CityRAG generates minutes-long 3D-consistent videos of real-world cities by grounding outputs in geo-registered data and using temporally unaligned training to disentangle fixed scenes from transient elements like weather.

Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation
cs.CV 2026-04 unverdicted novelty 6.0
A decoupled memory branch with hybrid cues, cross-attention, and gating improves spatial consistency and data efficiency in long-horizon camera-trajectory video generation.

UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
cs.CV 2026-04 unverdicted novelty 6.0
UniGeo adds unified geometric guidance at three levels in video models to reduce geometric drift and improve structural fidelity in camera-controllable image editing.

Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective
cs.CV 2026-04 unverdicted novelty 6.0
The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temp...

Lyra 2.0: Explorable Generative 3D Worlds
cs.CV 2026-04 unverdicted novelty 6.0
Lyra 2.0 produces persistent 3D-consistent video sequences for large explorable worlds by using per-frame geometry for information routing and self-augmented training to correct temporal drift.

Rein3D: Reinforced 3D Indoor Scene Generation with Panoramic Video Diffusion Models
cs.CV 2026-04 unverdicted novelty 6.0
Rein3D generates photorealistic, globally consistent 3D indoor scenes by using a restore-and-refine process where radial panoramic videos are restored via diffusion models and then used to update a 3D Gaussian field.

NavCrafter: Exploring 3D Scenes from a Single Image
cs.CV 2026-04 unverdicted novelty 6.0
NavCrafter generates controllable novel-view videos from one image via video diffusion, geometry-aware expansion, and enhanced 3D Gaussian Splatting to achieve state-of-the-art synthesis under large viewpoint changes.

Pose-Aware Diffusion for 3D Generation
cs.CV 2026-05 unverdicted novelty 5.0
PAD synthesizes 3D geometry in observation space via depth unprojection as anchor to eliminate pose ambiguity in image-to-3D generation.

World-R1: Reinforcing 3D Constraints for Text-to-Video Generation
cs.CV 2026-04 unverdicted novelty 4.0
World-R1 uses RL with 3D model feedback and a new text dataset to improve geometric consistency in text-to-video generation while keeping the base model unchanged.

Evolution of Video Generative Foundations
cs.CV 2026-04 unverdicted novelty 2.0
This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.

Reference graph

Works this paper leans on

79 extracted references · 79 canonical work pages · cited by 25 Pith papers · 6 internal anchors

[1]

Nerf: Representing scenes as neural radiance fields for view synthesis,

B. Mildenhall, P . P . Srinivasan, M. Tancik, J. T. Barron, R. Ra- mamoorthi, and R. Ng, “Nerf: Representing scenes as neural radiance fields for view synthesis,” in ECCV, 2020

work page 2020
[2]

3d gaus- sian splatting for real-time radiance field rendering,

B. Kerbl, G. Kopanas, T. Leimk ¨uhler, and G. Drettakis, “3d gaus- sian splatting for real-time radiance field rendering,” ACM TOG , 2023

work page 2023
[3]

Synsin: End-to- end view synthesis from a single image,

O. Wiles, G. Gkioxari, R. Szeliski, and J. Johnson, “Synsin: End-to- end view synthesis from a single image,” in CVPR, 2020

work page 2020
[4]

Geometry-free view syn- thesis: Transformers and no 3d priors,

R. Rombach, P . Esser, and B. Ommer, “Geometry-free view syn- thesis: Transformers and no 3d priors,” in ICCV, 2021

work page 2021
[5]

Pixelsynth: Generating a 3d-consistent experience from a single image,

C. Rockwell, D. F. Fouhey, and J. Johnson, “Pixelsynth: Generating a 3d-consistent experience from a single image,” in ICCV, 2021

work page 2021
[6]

Bridging implicit and explicit geometric transformation for single-image view synthesis,

B. Park, H. Go, and C. Kim, “Bridging implicit and explicit geometric transformation for single-image view synthesis,” IEEE TP AMI, 2024

work page 2024
[7]

Stereo magnification: Learning view synthesis using multiplane images,

T. Zhou, R. Tucker, J. Flynn, G. Fyffe, and N. Snavely, “Stereo magnification: Learning view synthesis using multiplane images,” ACM TOG, 2018

work page 2018
[8]

Single-view view synthesis in the wild with learned adaptive multiplane images,

Y. Han, R. Wang, and J. Yang, “Single-view view synthesis in the wild with learned adaptive multiplane images,” in SIGGRAPH Conference, 2022

work page 2022
[9]

Single-view view synthesis with mul- tiplane images,

R. Tucker and N. Snavely, “Single-view view synthesis with mul- tiplane images,” in CVPR, 2020

work page 2020
[10]

pixelnerf: Neural radiance fields from one or few images,

A. Yu, V . Ye, M. Tancik, and A. Kanazawa, “pixelnerf: Neural radiance fields from one or few images,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2021, pp. 4578–4587

work page 2021
[11]

Zero-1-to-3: Zero-shot one image to 3d object,

R. Liu, R. Wu, B. Van Hoorick, P . Tokmakov, S. Zakharov, and C. Vondrick, “Zero-1-to-3: Zero-shot one image to 3d object,” in ICCV, 2023

work page 2023
[12]

ZeroNVS: Zero- shot 360-degree view synthesis from a single real image,

K. Sargent, Z. Li, T. Shah, C. Herrmann, H.-X. Yu, Y. Zhang, E. R. Chan, D. Lagun, L. Fei-Fei, D. Sun, and J. Wu, “ZeroNVS: Zero- shot 360-degree view synthesis from a single real image,” inCVPR, 2024

work page 2024
[13]

Motionctrl: A unified and flexible motion controller for video generation,

Z. Wang, Z. Yuan, X. Wang, T. Chen, M. Xia, P . Luo, and Y. Shan, “Motionctrl: A unified and flexible motion controller for video generation,” in SIGGRAPH Conference, 2024

work page 2024
[14]

Lu- ciddreamer: Domain-free generation of 3d gaussian splatting scenes, 2023

J. Chung, S. Lee, H. Nam, J. Lee, and K. M. Lee, “Luciddreamer: Domain-free generation of 3d gaussian splatting scenes,” arXiv preprint arXiv:2311.13384, 2023

work page arXiv 2023
[15]

Realm- dreamer: Text-driven 3d scene generation with inpainting and depth diffusion,

J. Shriram, A. Trevithick, L. Liu, and R. Ramamoorthi, “Realm- dreamer: Text-driven 3d scene generation with inpainting and depth diffusion,” arXiv preprint arXiv:2404.07199 , 2024

work page arXiv 2024
[16]

Open-sora-plan,

P .-Y. Lab and T. A. etc., “Open-sora-plan,” https://github.com/ PKU-YuanGroup/Open-Sora-Plan, 2024. [Online]. Available: https://doi.org/10.5281/zenodo.10948109

work page doi:10.5281/zenodo.10948109 2024
[17]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

A. Blattmann, T. Dockhorn, S. Kulal, D. Mendelevitch, M. Kilian, D. Lorenz, Y. Levi, Z. English, V . Voleti, A. Lettset al., “Stable video diffusion: Scaling latent video diffusion models to large datasets,” arXiv preprint arXiv:2311.15127 , 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[18]

Dynamicrafter: Animating open-domain images with video diffusion priors,

J. Xing, M. Xia, Y. Zhang, H. Chen, X. Wang, T.-T. Wong, and Y. Shan, “Dynamicrafter: Animating open-domain images with video diffusion priors,” arXiv preprint arXiv:2310.12190 , 2023

work page arXiv 2023
[19]

Dust3r: Geometric 3d vision made easy,

S. Wang, V . Leroy, Y. Cabon, B. Chidlovskii, and J. Revaud, “Dust3r: Geometric 3d vision made easy,” in CVPR, 2024

work page 2024
[20]

Grounding image matching in 3d with mast3r, 2024

V . Leroy, Y. Cabon, and J. Revaud, “Grounding image matching in 3d with mast3r,” arXiv:2406.09756, 2024

work page arXiv 2024
[21]

Tanks and temples: Benchmarking large-scale scene reconstruction,

A. Knapitsch, J. Park, Q.-Y. Zhou, and V . Koltun, “Tanks and temples: Benchmarking large-scale scene reconstruction,” ACM TOG, 2017

work page 2017
[23]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in NeurIPS, 2017

work page 2017
[24]

Real-time radiance fields for single-image portrait view synthe- sis,

A. Trevithick, M. Chan, M. Stengel, E. Chan, C. Liu, Z. Yu, S. Khamis, M. Chandraker, R. Ramamoorthi, and K. Nagano, “Real-time radiance fields for single-image portrait view synthe- sis,” ACM TOG, 2023

work page 2023
[25]

Nofa: Nerf-based one-shot facial avatar reconstruction,

W. Yu, Y. Fan, Y. Zhang, X. Wang, F. Yin, Y. Bai, Y.-P . Cao, Y. Shan, Y. Wu, Z. Sun et al. , “Nofa: Nerf-based one-shot facial avatar reconstruction,” in SIGGRAPH Conference, 2023

work page 2023
[26]

Lrm: Large reconstruction model for single image to 3d,

Y. Hong, K. Zhang, J. Gu, S. Bi, Y. Zhou, D. Liu, F. Liu, K. Sunkavalli, T. Bui, and H. Tan, “Lrm: Large reconstruction model for single image to 3d,” in ICLR, 2024

work page 2024
[27]

pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction,

D. Charatan, S. L. Li, A. Tagliasacchi, and V . Sitzmann, “pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction,” in CVPR, 2024

work page 2024
[28]

Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images,

Y. Chen, H. Xu, C. Zheng, B. Zhuang, M. Pollefeys, A. Geiger, T.-J. Cham, and J. Cai, “Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images,” in ECCV, 2024

work page 2024
[29]

Denoising diffusion probabilistic models,

J. Ho, A. Jain, and P . Abbeel, “Denoising diffusion probabilistic models,” in NeurIPS, 2020

work page 2020
[30]

Denoising diffusion implicit models,

J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” in ICLR, 2021

work page 2021
[32]

Hifi-123: Towards high-fidelity one image to 3d content generation,

W. Yu, L. Yuan, Y.-P . Cao, X. Gao, X. Li, L. Quan, Y. Shan, and Y. Tian, “Hifi-123: Towards high-fidelity one image to 3d content generation,” in ECCV, 2024

work page 2024
[33]

Make-it-3d: High-fidelity 3d creation from a single image with diffusion prior,

J. Tang, T. Wang, B. Zhang, T. Zhang, R. Yi, L. Ma, and D. Chen, “Make-it-3d: High-fidelity 3d creation from a single image with diffusion prior,” in ICCV, 2023

work page 2023
[34]

Dreamcraft3d: Hierarchical 3d generation with bootstrapped dif- fusion prior,

J. Sun, B. Zhang, R. Shao, L. Wang, W. Liu, Z. Xie, and Y. Liu, “Dreamcraft3d: Hierarchical 3d generation with bootstrapped dif- fusion prior,” in ICLR, 2024

work page 2024
[35]

Gen- erative novel view synthesis with 3d-aware diffusion models,

E. R. Chan, K. Nagano, M. A. Chan, A. W. Bergman, J. J. Park, A. Levy, M. Aittala, S. De Mello, T. Karras, and G. Wetzstein, “Gen- erative novel view synthesis with 3d-aware diffusion models,” in ICCV, 2023

work page 2023
[36]

Objaverse: A universe of annotated 3d objects,

M. Deitke, D. Schwenk, J. Salvador, L. Weihs, O. Michel, E. Van- derBilt, L. Schmidt, K. Ehsani, A. Kembhavi, and A. Farhadi, “Objaverse: A universe of annotated 3d objects,” in CVPR, 2023

work page 2023
[37]

ShapeNet: An Information-Rich 3D Model Repository

A. X. Chang, T. Funkhouser, L. Guibas, P . Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su et al. , “Shapenet: An information-rich 3d model repository,” arXiv preprint arXiv:1512.03012, 2015. 13

work page internal anchor Pith review Pith/arXiv arXiv 2015
[38]

Novel view synthesis with diffusion models,

D. Watson, W. Chan, R. M. Brualla, J. Ho, A. Tagliasacchi, and M. Norouzi, “Novel view synthesis with diffusion models,” in ICLR, 2023

work page 2023
[39]

Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction,

J. Reizenstein, R. Shapovalov, P . Henzler, L. Sbordone, P . Labatut, and D. Novotny, “Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction,” in ICCV, 2021

work page 2021
[40]

Mvimgnet: A large-scale dataset of multi- view images,

X. Yu, M. Xu, Y. Zhang, H. Liu, C. Ye, Y. Wu, Z. Yan, C. Zhu, Z. Xiong, T. Liang et al., “Mvimgnet: A large-scale dataset of multi- view images,” in CVPR, 2023

work page 2023
[41]

Reconfusion: 3d reconstruction with diffusion priors,

R. Wu, B. Mildenhall, P . Henzler, K. Park, R. Gao, D. Watson, P . P . Srinivasan, D. Verbin, J. T. Barron, B. Poole et al., “Reconfusion: 3d reconstruction with diffusion priors,” in CVPR, 2024

work page 2024
[42]

Text2nerf: Text- driven 3d scene generation with neural radiance fields,

J. Zhang, X. Li, Z. Wan, C. Wang, and J. Liao, “Text2nerf: Text- driven 3d scene generation with neural radiance fields,” IEEE TVCG, 2024

work page 2024
[43]

High-resolution image synthesis with latent diffusion models,

R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in CVPR, 2022

work page 2022
[44]

Adding conditional control to text-to-image diffusion models,

L. Zhang, A. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” in ICCV, 2023

work page 2023
[45]

T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models,

C. Mou, X. Wang, L. Xie, Y. Wu, J. Zhang, Z. Qi, and Y. Shan, “T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models,” in AAAI, 2024

work page 2024
[46]

Gligen: Open-set grounded text-to-image generation,

Y. Li, H. Liu, Q. Wu, F. Mu, J. Yang, J. Gao, C. Li, and Y. J. Lee, “Gligen: Open-set grounded text-to-image generation,” in CVPR, 2023

work page 2023
[47]

Tooncrafter: Generative cartoon interpolation,

J. Xing, H. Liu, M. Xia, Y. Zhang, X. Wang, Y. Shan, and T.-T. Wong, “Tooncrafter: Generative cartoon interpolation,” arXiv preprint arXiv:2405.17933, 2024

work page arXiv 2024
[48]

Make-your-video: Customized video generation using textual and structural guidance,

J. Xing, M. Xia, Y. Liu, Y. Zhang, Y. Zhang, Y. He, H. Liu, H. Chen, X. Cun, X. Wang et al., “Make-your-video: Customized video generation using textual and structural guidance,” IEEE TVCG, 2024

work page 2024
[49]

Structure and content-guided video synthesis with diffusion models,

P. Esser, J. Chiu, P. Atighehchian, J. Granskog, and A. Germanidis, “Structure and content-guided video synthesis with diffusion models,” in ICCV, 2023

work page 2023
[50]

Dragnuwa: Fine-grained control in video generation by integrating text, image, and trajectory,

S. Yin, C. Wu, J. Liang, J. Shi, H. Li, G. Ming, and N. Duan, “Dragnuwa: Fine-grained control in video generation by integrating text, image, and trajectory,” arXiv preprint arXiv:2308.08089, 2023

work page arXiv 2023
[51]

Mofa-video: Controllable image animation via generative motion field adaptions in frozen image-to-video diffusion model,

M. Niu, X. Cun, X. Wang, Y. Zhang, Y. Shan, and Y. Zheng, “Mofa-video: Controllable image animation via generative motion field adaptions in frozen image-to-video diffusion model,” arXiv preprint arXiv:2405.20222, 2024

work page arXiv 2024
[52]

Vase: Object-centric appearance and shape manipulation of real videos,

E. Peruzzo, V. Goel, D. Xu, X. Xu, Y. Jiang, Z. Wang, H. Shi, and N. Sebe, “Vase: Object-centric appearance and shape manipulation of real videos,” arXiv preprint arXiv:2401.02473, 2024

work page arXiv 2024
[53]

AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

Y. Guo, C. Yang, A. Rao, Y. Wang, Y. Qiao, D. Lin, and B. Dai, “Animatediff: Animate your personalized text-to-image diffusion models without specific tuning,” arXiv preprint arXiv:2307.04725, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[54]

Lora: Low-rank adaptation of large language models,

E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “Lora: Low-rank adaptation of large language models,” in ICLR, 2022

work page 2022
[55]

Multidiff: Consistent novel view synthesis from a single image,

N. Müller, K. Schwarz, B. Rössle, L. Porzi, S. R. Bulò, M. Nießner, and P. Kontschieder, “Multidiff: Consistent novel view synthesis from a single image,” in CVPR, 2024

work page 2024
[56]

Scannet: Richly-annotated 3d reconstructions of indoor scenes,

A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner, “Scannet: Richly-annotated 3d reconstructions of indoor scenes,” in CVPR, 2017

work page 2017
[57]

Camco: Camera-controllable 3d-consistent image-to-video generation,

D. Xu, W. Nie, C. Liu, S. Liu, J. Kautz, Z. Wang, and A. Vahdat, “Camco: Camera-controllable 3d-consistent image-to-video generation,” arXiv preprint arXiv:2406.02509, 2024

work page arXiv 2024
[58]

CameraCtrl: Enabling Camera Control for Text-to-Video Generation

H. He, Y. Xu, Y. Guo, G. Wetzstein, B. Dai, H. Li, and C. Yang, “Cameractrl: Enabling camera control for text-to-video generation,” arXiv preprint arXiv:2404.02101, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[59]

Light field networks: Neural scene representations with single-evaluation rendering,

V. Sitzmann, S. Rezchikov, B. Freeman, J. Tenenbaum, and F. Durand, “Light field networks: Neural scene representations with single-evaluation rendering,” in NeurIPS, 2021

work page 2021
[60]

Latent-nerf for shape-guided generation of 3d shapes and textures,

G. Metzer, E. Richardson, O. Patashnik, R. Giryes, and D. Cohen-Or, “Latent-nerf for shape-guided generation of 3d shapes and textures,” in CVPR, 2023

work page 2023
[61]

The Weiszfeld algorithm: proof, amendments and extensions,

F. Plastria, “The Weiszfeld algorithm: proof, amendments and extensions,” in Foundations of Location Analysis, H. A. Eiselt and V. Marianov, Eds., International Series in Operations Research and Management Science, 2011

work page 2011
[62]

Learning transferable visual models from natural language supervision,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in ICML, 2021

work page 2021
[63]

Cat3d: Create anything in 3d with multi-view diffusion models,

R. Gao, A. Holynski, P. Henzler, A. Brussee, R. Martin-Brualla, P. Srinivasan, J. T. Barron, and B. Poole, “Cat3d: Create anything in 3d with multi-view diffusion models,” arXiv preprint arXiv:2405.10314, 2024

work page arXiv 2024
[64]

View planning in robot active vision: A survey of systems, algorithms, and applications,

R. Zeng, Y. Wen, W. Zhao, and Y.-J. Liu, “View planning in robot active vision: A survey of systems, algorithms, and applications,” Computational Visual Media, vol. 6, 2020

work page 2020
[65]

Pred-nbv: Prediction-guided next-best-view planning for 3d object reconstruction,

H. Dhami, V. D. Sharma, and P. Tokekar, “Pred-nbv: Prediction-guided next-best-view planning for 3d object reconstruction,” in IROS, 2023

work page 2023
[66]

Neu-nbv: Next best view planning using uncertainty estimation in image-based neural rendering,

L. Jin, X. Chen, J. Rückin, and M. Popović, “Neu-nbv: Next best view planning using uncertainty estimation in image-based neural rendering,” in IROS, 2023

work page 2023
[67]

An information gain formulation for active volumetric 3d reconstruction,

S. Isler, R. Sabzevari, J. Delmerico, and D. Scaramuzza, “An information gain formulation for active volumetric 3d reconstruction,” in ICRA, 2016

work page 2016
[68]

Next-best view policy for 3d reconstruction,

D. Peralta, J. Casimiro, A. M. Nilles, J. A. Aguilar, R. Atienza, and R. Cajote, “Next-best view policy for 3d reconstruction,” in ECCVW, 2020

work page 2020
[69]

3d photography using context-aware layered depth inpainting,

M.-L. Shih, S.-Y. Su, J. Kopf, and J.-B. Huang, “3d photography using context-aware layered depth inpainting,” in CVPR, 2020

work page 2020
[70]

Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision,

L. Ling, Y. Sheng, Z. Tu, W. Zhao, C. Xin, K. Wan, L. Yu, Q. Guo, Z. Yu, Y. Lu et al., “Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision,” in CVPR, 2024

work page 2024
[71]

Accelerating 3D Deep Learning with PyTorch3D

N. Ravi, J. Reizenstein, D. Novotny, T. Gordon, W.-Y. Lo, J. Johnson, and G. Gkioxari, “Accelerating 3d deep learning with pytorch3d,” arXiv:2007.08501, 2020

work page internal anchor Pith review arXiv 2007
[72]

Classifier-Free Diffusion Guidance

J. Ho and T. Salimans, “Classifier-free diffusion guidance,” arXiv preprint arXiv:2207.12598, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[73]

Image quality assessment: from error visibility to structural similarity,

Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE TIP, vol. 13, no. 4, pp. 600–612, 2004

work page 2004
[74]

The unreasonable effectiveness of deep features as a perceptual metric,

R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in CVPR, 2018

work page 2018
[75]

Gans trained by a two time-scale update rule converge to a local nash equilibrium,

M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,” in NeurIPS, 2017

work page 2017
[76]

Structure-from-motion revisited,

J. L. Schönberger and J.-M. Frahm, “Structure-from-motion revisited,” in CVPR, 2016

work page 2016
[77]

Dngaussian: Optimizing sparse-view 3d gaussian radiance fields with global-local depth normalization,

J. Li, J. Zhang, X. Bai, J. Zheng, X. Ning, J. Zhou, and L. Gu, “Dngaussian: Optimizing sparse-view 3d gaussian radiance fields with global-local depth normalization,” in CVPR, 2024

work page 2024
[78]

Fsgs: Real-time few-shot view synthesis using gaussian splatting,

Z. Zhu, Z. Fan, Y. Jiang, and Z. Wang, “Fsgs: Real-time few-shot view synthesis using gaussian splatting,” in ECCV, 2024

work page 2024
[79]

Instantsplat: Unbounded sparse-view pose-free gaussian splatting in 40 seconds,

Z. Fan, W. Cong, K. Wen, K. Wang, J. Zhang, X. Ding, D. Xu, B. Ivanovic, M. Pavone, G. Pavlakos et al., “Instantsplat: Unbounded sparse-view pose-free gaussian splatting in 40 seconds,” arXiv:2403.20309, 2024

work page arXiv 2024
[80]

Vd3d: Taming large video diffusion transformers for 3d camera control,

S. Bahmani, I. Skorokhodov, A. Siarohin, W. Menapace, G. Qian, M. Vasilkovsky, H.-Y. Lee, C. Wang, J. Zou, A. Tagliasacchi et al., “Vd3d: Taming large video diffusion transformers for 3d camera control,” arXiv preprint arXiv:2407.12781, 2024

work page arXiv 2024
[81]

Plücker coordinates for lines in the space,

Y.-B. Jia, “Plücker coordinates for lines in the space,” Problem Solver Techniques for Applied Computer Science, Com-S-477/577 Course Handout, 2020

work page 2020