ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis
Pith reviewed 2026-05-13 22:55 UTC · model grok-4.3
The pith
ViewCrafter steers a pre-trained video diffusion model with coarse point clouds and planned trajectories to synthesize consistent high-fidelity novel views from single or sparse images.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ViewCrafter uses a video diffusion model conditioned on point-based 3D clues and explicit camera trajectories to generate sequences of high-quality novel views. An iterative synthesis procedure with a dedicated trajectory planning algorithm progressively enlarges the set of reconstructed points and the spatial extent of the synthesized views, allowing high-fidelity results from minimal input images.
What carries the argument
Iterative view synthesis loop that conditions a video diffusion model on coarse point clouds and planned camera trajectories to extend 3D coverage.
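The control flow this loop implies can be sketched as follows. This is a minimal illustration under stated assumptions, not ViewCrafter's implementation: `plan_trajectory`, `synthesize_views`, and `lift_to_points` are hypothetical placeholders for the trajectory planner, the point-conditioned video diffusion model, and the point-cloud update, and the toy stubs reduce a scene to a set of integer azimuth angles so the loop can actually run.

```python
from typing import Callable, List, Set

Angle = int  # toy stand-in for a camera pose: azimuth in degrees


def iterative_view_synthesis(
    seen: Set[Angle],
    plan_trajectory: Callable[[Set[Angle]], List[Angle]],
    synthesize_views: Callable[[Set[Angle], List[Angle]], List[Angle]],
    lift_to_points: Callable[[List[Angle]], Set[Angle]],
    num_iters: int,
) -> Set[Angle]:
    """Grow the covered region by alternating planning, synthesis, and lifting."""
    covered = set(seen)
    for _ in range(num_iters):
        trajectory = plan_trajectory(covered)          # decide where to look next
        if not trajectory:                             # nothing left to reveal
            break
        views = synthesize_views(covered, trajectory)  # conditioned generation (stubbed)
        covered |= lift_to_points(views)               # extend the 3D clues
    return covered


# --- hypothetical toy stubs, for illustration only ---
def toy_planner(covered: Set[Angle], step: int = 30, frames: int = 5) -> List[Angle]:
    """Start at the edge of current coverage and step into unseen angles."""
    edge = max(covered)
    candidates = range(edge + step, edge + step * frames + 1, step)
    return [a for a in candidates if a < 360]


def toy_synthesizer(covered: Set[Angle], trajectory: List[Angle]) -> List[Angle]:
    """Pretend the model returns one view per requested pose."""
    return list(trajectory)


def toy_lifter(views: List[Angle]) -> Set[Angle]:
    """Pretend each view contributes points at its own angle."""
    return set(views)


if __name__ == "__main__":
    result = iterative_view_synthesis(
        {0}, toy_planner, toy_synthesizer, toy_lifter, num_iters=3
    )
    print(sorted(result))
```

Passing the three stages in as callables keeps the loop itself independent of any particular planner or generative model, which is also the shape an ablation of either component would need.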
If this is right
- The generated views and points can be used to optimize a 3D Gaussian splatting representation that supports real-time rendering.
- The same pipeline enables scene-level text-to-3D generation by first creating consistent views and then fitting a 3D model.
- The method works on generic scenes and shows strong generalization across diverse datasets without retraining the diffusion model.
- It reduces reliance on dense multi-view captures that currently limit practical 3D reconstruction.
Where Pith is reading between the lines
- If the diffusion model already encodes strong 3D priors, adding explicit point conditioning may become unnecessary for short trajectories.
- The trajectory planner could be replaced by learned policies that adapt to scene content rather than following fixed heuristics.
- Extending the loop to handle dynamic objects would require the underlying video model to maintain temporal coherence beyond static geometry.
Load-bearing premise
A pre-trained video diffusion model can be steered by coarse point clouds and planned trajectories without accumulating geometric drift or view inconsistencies across repeated synthesis steps.
What would settle it
Run the iterative process around a full 360-degree orbit of a known scene, then check whether the 3D reconstruction built from the generated views has projected points that deviate measurably from the initial input points, or shows visible seams.
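One way the deviation in this test might be quantified is a mean reprojection distance in pixels. The sketch below assumes a simple pinhole camera (focal length `f`, principal point `cx`, `cy`), points already expressed in camera coordinates, and known one-to-one correspondences between initial and reconstructed points. None of these choices come from the paper, and all function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point3 = Tuple[float, float, float]


def project(p: Point3, f: float, cx: float, cy: float) -> Tuple[float, float]:
    """Pinhole projection of a camera-space point (z > 0) to pixel coordinates."""
    x, y, z = p
    if z <= 0:
        raise ValueError("point must lie in front of the camera")
    return (f * x / z + cx, f * y / z + cy)


def mean_reprojection_deviation(
    initial: Sequence[Point3],
    reconstructed: Sequence[Point3],
    f: float,
    cx: float,
    cy: float,
) -> float:
    """Mean pixel distance between projections of corresponding points."""
    if len(initial) != len(reconstructed) or not initial:
        raise ValueError("need two non-empty point lists of equal length")
    total = 0.0
    for a, b in zip(initial, reconstructed):
        ua, va = project(a, f, cx, cy)
        ub, vb = project(b, f, cx, cy)
        total += math.hypot(ua - ub, va - vb)
    return total / len(initial)


if __name__ == "__main__":
    before = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)]
    after = [(0.0, 0.0, 2.0), (1.02, 0.0, 2.0)]  # second point drifted by 2 cm
    print(mean_reprojection_deviation(before, after, 500.0, 320.0, 240.0))
```

A per-point histogram of the same distances, rather than the mean alone, would show whether drift is uniform or concentrated where the orbit closes.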
Original abstract
Despite recent advancements in neural 3D reconstruction, the dependence on dense multi-view captures restricts their broader applicability. In this work, we propose ViewCrafter, a novel method for synthesizing high-fidelity novel views of generic scenes from single or sparse images with the prior of video diffusion model. Our method takes advantage of the powerful generation capabilities of video diffusion model and the coarse 3D clues offered by point-based representation to generate high-quality video frames with precise camera pose control. To further enlarge the generation range of novel views, we tailored an iterative view synthesis strategy together with a camera trajectory planning algorithm to progressively extend the 3D clues and the areas covered by the novel views. With ViewCrafter, we can facilitate various applications, such as immersive experiences with real-time rendering by efficiently optimizing a 3D-GS representation using the reconstructed 3D points and the generated novel views, and scene-level text-to-3D generation for more imaginative content creation. Extensive experiments on diverse datasets demonstrate the strong generalization capability and superior performance of our method in synthesizing high-fidelity and consistent novel views.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes ViewCrafter, a method that conditions pre-trained video diffusion models on coarse point-based 3D representations derived from single or sparse input images to synthesize high-fidelity novel views with explicit camera-pose control. An iterative synthesis loop combined with a camera-trajectory planner progressively expands the covered 3D region, after which the generated views and points are used to optimize a 3D Gaussian Splatting representation or to support text-to-3D generation. The authors claim strong generalization and superior performance across diverse scenes.
Significance. If the iterative conditioning scheme maintains geometric consistency, the work would offer a practical route to high-quality novel-view synthesis from minimal captures by repurposing large-scale video diffusion priors, thereby lowering the data barrier for immersive rendering and scene-level generative 3D pipelines.
major comments (2)
- [Section 3.2] Section 3.2 (iterative view synthesis): the method relies on the diffusion prior respecting expanding point clouds and planned trajectories, yet supplies no explicit 3D-consistency mechanism (depth-aware attention, reprojection loss, or latent-space 3D injection) that would bound stochastic drift. Without such a safeguard, early-frame pose or depth errors will corrupt subsequent point clouds, directly threatening the central claim of “precise camera pose control” for large view ranges.
- [Experimental section] Experimental section: the abstract asserts “superior performance” and “strong generalization” but the manuscript text provides neither quantitative tables (PSNR, SSIM, LPIPS on DTU/LLFF/RealEstate10K) nor ablation studies isolating the contribution of the point conditioner versus the trajectory planner. These omissions render the performance claims unverifiable from the given material.
minor comments (2)
- [Section 3.1] Clarify the precise form of point-cloud conditioning (e.g., whether points are rasterized into the latent space or injected via cross-attention) and state the number of diffusion steps used at inference.
- [Figure 4] Figure 4 and the accompanying text should include failure cases (e.g., thin structures or reflective surfaces) to illustrate the practical limits of the iterative loop.
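Of the metrics requested in the second major comment, PSNR is simple enough to sketch directly from its definition, 10·log10(MAX² / MSE). The version below is an illustration only: it assumes images are supplied as flat sequences of pixel intensities with a peak value of 255, and the function name is hypothetical. SSIM (which needs windowed local statistics) and LPIPS (which needs a learned network) are not sketched here.

```python
import math
from typing import Sequence


def psnr(
    reference: Sequence[float],
    rendered: Sequence[float],
    max_value: float = 255.0,
) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(rendered) or not reference:
        raise ValueError("need two non-empty sequences of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    ground_truth = [100.0, 100.0, 100.0, 100.0]
    novel_view = [101.0, 99.0, 101.0, 99.0]  # every pixel off by one level
    print(psnr(ground_truth, novel_view))
```

Higher is better. Because PSNR compares pixels one by one, it penalizes plausible but misaligned generated content heavily, which is one reason a perceptual metric is usually reported alongside it for generative view synthesis.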
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the major comments point-by-point below and will incorporate revisions to strengthen the presentation of 3D consistency and to provide verifiable quantitative results.
Point-by-point responses
-
Referee: [Section 3.2] Section 3.2 (iterative view synthesis): the method relies on the diffusion prior respecting expanding point clouds and planned trajectories, yet supplies no explicit 3D-consistency mechanism (depth-aware attention, reprojection loss, or latent-space 3D injection) that would bound stochastic drift. Without such a safeguard, early-frame pose or depth errors will corrupt subsequent point clouds, directly threatening the central claim of “precise camera pose control” for large view ranges.
Authors: We appreciate the referee's concern about potential stochastic drift in the iterative synthesis loop. In ViewCrafter, the coarse point cloud is rendered into each target view's camera frustum and injected as an additional conditioning signal into the video diffusion model's latent space at every denoising step; this provides ongoing geometric guidance that the pre-trained prior respects. The camera-trajectory planner further mitigates drift by selecting short, overlapping trajectories that keep new views anchored to the current point cloud before the cloud is updated. While we do not add an auxiliary reprojection loss or depth-aware attention layers, the combination of explicit point conditioning and incremental planning empirically limits error accumulation, as evidenced by the consistent novel-view sequences in our experiments. To make this mechanism clearer, we will expand Section 3.2 with a dedicated paragraph on implicit consistency enforcement and include additional visualizations of point-cloud evolution and pose-error accumulation in the revision. revision: partial
-
Referee: [Experimental section] Experimental section: the abstract asserts “superior performance” and “strong generalization” but the manuscript text provides neither quantitative tables (PSNR, SSIM, LPIPS on DTU/LLFF/RealEstate10K) nor ablation studies isolating the contribution of the point conditioner versus the trajectory planner. These omissions render the performance claims unverifiable from the given material.
Authors: We acknowledge that the current manuscript version focuses on qualitative demonstrations and downstream applications. To render the claims of superior performance and strong generalization verifiable, we will add a new quantitative evaluation subsection that reports PSNR, SSIM, and LPIPS on DTU, LLFF, and RealEstate10K, together with comparisons against recent single-view and sparse-view baselines. We will also include ablation studies that isolate the point conditioner from the trajectory planner by measuring performance when each component is removed. These additions will appear in the revised experimental section. revision: yes
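The first response above describes rendering the coarse point cloud into each target view before it is used as a conditioning signal. A minimal sketch of such a rendering step is given here under explicit assumptions: a pinhole camera, points already in the target camera's coordinates, one-pixel splats, and a z-buffer so the nearest point wins. How the render is then fed to the diffusion model is not modeled, and every name is hypothetical rather than taken from the paper's code.

```python
from typing import List, Optional, Sequence, Tuple

# x, y, z in camera space plus a gray value carried by the point
ColoredPoint = Tuple[float, float, float, int]


def render_points(
    points: Sequence[ColoredPoint],
    width: int,
    height: int,
    f: float,
    cx: float,
    cy: float,
) -> List[List[Optional[int]]]:
    """Splat points into an image; None marks pixels no point landed on."""
    depth = [[float("inf")] * width for _ in range(height)]
    image: List[List[Optional[int]]] = [[None] * width for _ in range(height)]
    for x, y, z, value in points:
        if z <= 0:
            continue  # behind the camera
        u = int(round(f * x / z + cx))
        v = int(round(f * y / z + cy))
        inside = 0 <= u < width and 0 <= v < height
        if inside and z < depth[v][u]:  # keep only the nearest point per pixel
            depth[v][u] = z
            image[v][u] = value
    return image


def count_holes(image: List[List[Optional[int]]]) -> int:
    """Number of pixels with no geometric evidence."""
    return sum(pixel is None for row in image for pixel in row)


if __name__ == "__main__":
    cloud = [(0.0, 0.0, 1.0, 100), (0.0, 0.0, 2.0, 200), (0.5, 0.0, 1.0, 50)]
    img = render_points(cloud, width=4, height=4, f=2.0, cx=2.0, cy=2.0)
    print(img[2], count_holes(img))
```

The `None` pixels are the regions where the point cloud offers no guidance, so the fraction of holes along a candidate trajectory is a rough proxy for how much the generative prior has to invent there.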
Circularity Check
No significant circularity; relies on external pre-trained video diffusion prior
Full rationale
The paper conditions an external pre-trained video diffusion model on coarse point clouds and planned trajectories, then applies an iterative synthesis loop with camera planning. No derivation, equation, or central claim reduces by construction to a quantity the authors themselves fitted or defined in terms of the output. The video diffusion weights are treated as a fixed external prior rather than a self-derived component. Self-citations, if present, are not load-bearing for the core claim of consistency under iteration. The circularity score is therefore low, as is normal for a method evaluated against external benchmarks.
Forward citations
Cited by 26 Pith papers
-
GTA: Advancing Image-to-3D World Generation via Geometry Then Appearance Video Diffusion
GTA generates 3D worlds from single images via a two-stage video diffusion process that prioritizes geometry before appearance to improve structural consistency.
-
$h$-control: Training-Free Camera Control via Block-Conditional Gibbs Refinement
h-control introduces block-conditional pseudo-Gibbs refinement for training-free camera control in flow-matching video generators, achieving superior FVD scores on RealEstate10K and DAVIS benchmarks.
-
3D-Belief: Embodied Belief Inference via Generative 3D World Modeling
3D-Belief maintains and updates explicit 3D beliefs about partially observed environments to enable multi-hypothesis imagination and improved performance on embodied tasks.
-
MultiWorld: Scalable Multi-Agent Multi-View Video World Models
MultiWorld is a scalable framework for multi-agent multi-view video world models that improves controllability and consistency over single-agent baselines in game and robot tasks.
-
UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
UniGeo unifies geometric guidance across three levels in video models to reduce geometric drift and improve consistency in camera-controllable image editing.
-
Geometrically Consistent Multi-View Scene Generation from Freehand Sketches
A framework generates consistent multi-view scenes from one freehand sketch via a ~9k-sample dataset, Parallel Camera-Aware Attention Adapters, and Sparse Correspondence Supervision Loss, outperforming baselines in re...
-
DreamStereo: Towards Real-Time Stereo Inpainting for HD Videos
DreamStereo uses GAPW, PBDP, and SASI to enable real-time stereo video inpainting at 25 FPS for HD videos by reducing over 70% redundant computation while maintaining quality.
-
Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale
A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.
-
Novel View Synthesis as Video Completion
Video diffusion models can be adapted into permutation-invariant generators for sparse novel view synthesis by treating the problem as video completion and removing temporal order cues.
-
OmniCamera: A Unified Framework for Multi-task Video Generation with Arbitrary Camera Control
OmniCamera disentangles video content and camera motion for multi-task generation with arbitrary camera control via the OmniCAM hybrid dataset and Dual-level Curriculum Co-Training.
-
ProDiG: Progressive Diffusion-Guided Gaussian Splatting for Aerial to Ground Reconstruction
ProDiG progressively transforms aerial Gaussian splats into coherent ground-level 3D reconstructions via diffusion guidance and specialized attention modules.
-
SparseCam4D: Spatio-Temporally Consistent 4D Reconstruction from Sparse Cameras
SparseCam4D achieves spatio-temporally consistent high-fidelity 4D reconstruction from sparse cameras via a Spatio-Temporal Distortion Field that corrects inconsistencies in generative observations.
-
ChopGrad: Pixel-Wise Losses for Latent Video Diffusion via Truncated Backpropagation
ChopGrad truncates backpropagation to local frame windows in video diffusion models, reducing memory from linear in frame count to constant while enabling pixel-wise loss fine-tuning.
-
Warp-as-History: Generalizable Camera-Controlled Video Generation from One Training Video
Warp-as-History enables zero-shot camera trajectory following in frozen video models by supplying camera-warped pseudo-history, with single-video LoRA fine-tuning improving generalization to unseen videos.
-
UniFixer: A Universal Reference-Guided Fixer for Diffusion-Based View Synthesis
UniFixer is a universal reference-guided framework that fixes spatial, temporal, and backbone-related degradations in diffusion-based view synthesis via coarse-to-fine modules and achieves zero-shot SOTA results on no...
-
AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model
AnyRecon enables scalable 3D reconstruction from arbitrary sparse unordered views by combining video diffusion with explicit global geometric memory and retrieval to maintain consistency across large viewpoint changes.
-
CityRAG: Stepping Into a City via Spatially-Grounded Video Generation
CityRAG generates minutes-long 3D-consistent videos of real-world cities by grounding outputs in geo-registered data and using temporally unaligned training to disentangle fixed scenes from transient elements like weather.
-
Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation
A decoupled memory branch with hybrid cues, cross-attention, and gating improves spatial consistency and data efficiency in long-horizon camera-trajectory video generation.
-
UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
UniGeo adds unified geometric guidance at three levels in video models to reduce geometric drift and improve structural fidelity in camera-controllable image editing.
-
Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective
The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temp...
-
Lyra 2.0: Explorable Generative 3D Worlds
Lyra 2.0 produces persistent 3D-consistent video sequences for large explorable worlds by using per-frame geometry for information routing and self-augmented training to correct temporal drift.
-
Rein3D: Reinforced 3D Indoor Scene Generation with Panoramic Video Diffusion Models
Rein3D generates photorealistic, globally consistent 3D indoor scenes by using a restore-and-refine process where radial panoramic videos are restored via diffusion models and then used to update a 3D Gaussian field.
-
NavCrafter: Exploring 3D Scenes from a Single Image
NavCrafter generates controllable novel-view videos from one image via video diffusion, geometry-aware expansion, and enhanced 3D Gaussian Splatting to achieve state-of-the-art synthesis under large viewpoint changes.
-
Pose-Aware Diffusion for 3D Generation
PAD synthesizes 3D geometry in observation space via depth unprojection as anchor to eliminate pose ambiguity in image-to-3D generation.
-
World-R1: Reinforcing 3D Constraints for Text-to-Video Generation
World-R1 uses RL with 3D model feedback and a new text dataset to improve geometric consistency in text-to-video generation while keeping the base model unchanged.
-
Evolution of Video Generative Foundations
This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.
Reference graph
Works this paper leans on
-
[1]
Nerf: Representing scenes as neural radiance fields for view synthesis,
B. Mildenhall, P . P . Srinivasan, M. Tancik, J. T. Barron, R. Ra- mamoorthi, and R. Ng, “Nerf: Representing scenes as neural radiance fields for view synthesis,” in ECCV, 2020
work page 2020
-
[2]
3d gaus- sian splatting for real-time radiance field rendering,
B. Kerbl, G. Kopanas, T. Leimk ¨uhler, and G. Drettakis, “3d gaus- sian splatting for real-time radiance field rendering,” ACM TOG , 2023
work page 2023
-
[3]
Synsin: End-to- end view synthesis from a single image,
O. Wiles, G. Gkioxari, R. Szeliski, and J. Johnson, “Synsin: End-to- end view synthesis from a single image,” in CVPR, 2020
work page 2020
-
[4]
Geometry-free view syn- thesis: Transformers and no 3d priors,
R. Rombach, P . Esser, and B. Ommer, “Geometry-free view syn- thesis: Transformers and no 3d priors,” in ICCV, 2021
work page 2021
-
[5]
Pixelsynth: Generating a 3d-consistent experience from a single image,
C. Rockwell, D. F. Fouhey, and J. Johnson, “Pixelsynth: Generating a 3d-consistent experience from a single image,” in ICCV, 2021
work page 2021
-
[6]
Bridging implicit and explicit geometric transformation for single-image view synthesis,
B. Park, H. Go, and C. Kim, “Bridging implicit and explicit geometric transformation for single-image view synthesis,” IEEE TP AMI, 2024
work page 2024
-
[7]
Stereo magnification: Learning view synthesis using multiplane images,
T. Zhou, R. Tucker, J. Flynn, G. Fyffe, and N. Snavely, “Stereo magnification: Learning view synthesis using multiplane images,” ACM TOG, 2018
work page 2018
-
[8]
Single-view view synthesis in the wild with learned adaptive multiplane images,
Y. Han, R. Wang, and J. Yang, “Single-view view synthesis in the wild with learned adaptive multiplane images,” in SIGGRAPH Conference, 2022
work page 2022
-
[9]
Single-view view synthesis with mul- tiplane images,
R. Tucker and N. Snavely, “Single-view view synthesis with mul- tiplane images,” in CVPR, 2020
work page 2020
-
[10]
pixelnerf: Neural radiance fields from one or few images,
A. Yu, V . Ye, M. Tancik, and A. Kanazawa, “pixelnerf: Neural radiance fields from one or few images,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2021, pp. 4578–4587
work page 2021
-
[11]
Zero-1-to-3: Zero-shot one image to 3d object,
R. Liu, R. Wu, B. Van Hoorick, P . Tokmakov, S. Zakharov, and C. Vondrick, “Zero-1-to-3: Zero-shot one image to 3d object,” in ICCV, 2023
work page 2023
-
[12]
ZeroNVS: Zero- shot 360-degree view synthesis from a single real image,
K. Sargent, Z. Li, T. Shah, C. Herrmann, H.-X. Yu, Y. Zhang, E. R. Chan, D. Lagun, L. Fei-Fei, D. Sun, and J. Wu, “ZeroNVS: Zero- shot 360-degree view synthesis from a single real image,” inCVPR, 2024
work page 2024
-
[13]
Motionctrl: A unified and flexible motion controller for video generation,
Z. Wang, Z. Yuan, X. Wang, T. Chen, M. Xia, P . Luo, and Y. Shan, “Motionctrl: A unified and flexible motion controller for video generation,” in SIGGRAPH Conference, 2024
work page 2024
-
[14]
Lu- ciddreamer: Domain-free generation of 3d gaussian splatting scenes, 2023
J. Chung, S. Lee, H. Nam, J. Lee, and K. M. Lee, “Luciddreamer: Domain-free generation of 3d gaussian splatting scenes,” arXiv preprint arXiv:2311.13384, 2023
-
[15]
Realm- dreamer: Text-driven 3d scene generation with inpainting and depth diffusion,
J. Shriram, A. Trevithick, L. Liu, and R. Ramamoorthi, “Realm- dreamer: Text-driven 3d scene generation with inpainting and depth diffusion,” arXiv preprint arXiv:2404.07199 , 2024
-
[16]
P .-Y. Lab and T. A. etc., “Open-sora-plan,” https://github.com/ PKU-YuanGroup/Open-Sora-Plan, 2024. [Online]. Available: https://doi.org/10.5281/zenodo.10948109
-
[17]
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
A. Blattmann, T. Dockhorn, S. Kulal, D. Mendelevitch, M. Kilian, D. Lorenz, Y. Levi, Z. English, V . Voleti, A. Lettset al., “Stable video diffusion: Scaling latent video diffusion models to large datasets,” arXiv preprint arXiv:2311.15127 , 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[18]
Dynamicrafter: Animating open-domain images with video diffusion priors,
J. Xing, M. Xia, Y. Zhang, H. Chen, X. Wang, T.-T. Wong, and Y. Shan, “Dynamicrafter: Animating open-domain images with video diffusion priors,” arXiv preprint arXiv:2310.12190 , 2023
-
[19]
Dust3r: Geometric 3d vision made easy,
S. Wang, V . Leroy, Y. Cabon, B. Chidlovskii, and J. Revaud, “Dust3r: Geometric 3d vision made easy,” in CVPR, 2024
work page 2024
-
[20]
Grounding image matching in 3d with mast3r, 2024
V . Leroy, Y. Cabon, and J. Revaud, “Grounding image matching in 3d with mast3r,” arXiv:2406.09756, 2024
-
[21]
Tanks and temples: Benchmarking large-scale scene reconstruction,
A. Knapitsch, J. Park, Q.-Y. Zhou, and V . Koltun, “Tanks and temples: Benchmarking large-scale scene reconstruction,” ACM TOG, 2017
work page 2017
-
[23]
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in NeurIPS, 2017
work page 2017
-
[24]
Real-time radiance fields for single-image portrait view synthe- sis,
A. Trevithick, M. Chan, M. Stengel, E. Chan, C. Liu, Z. Yu, S. Khamis, M. Chandraker, R. Ramamoorthi, and K. Nagano, “Real-time radiance fields for single-image portrait view synthe- sis,” ACM TOG, 2023
work page 2023
-
[25]
Nofa: Nerf-based one-shot facial avatar reconstruction,
W. Yu, Y. Fan, Y. Zhang, X. Wang, F. Yin, Y. Bai, Y.-P . Cao, Y. Shan, Y. Wu, Z. Sun et al. , “Nofa: Nerf-based one-shot facial avatar reconstruction,” in SIGGRAPH Conference, 2023
work page 2023
-
[26]
Lrm: Large reconstruction model for single image to 3d,
Y. Hong, K. Zhang, J. Gu, S. Bi, Y. Zhou, D. Liu, F. Liu, K. Sunkavalli, T. Bui, and H. Tan, “Lrm: Large reconstruction model for single image to 3d,” in ICLR, 2024
work page 2024
-
[27]
pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction,
D. Charatan, S. L. Li, A. Tagliasacchi, and V . Sitzmann, “pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction,” in CVPR, 2024
work page 2024
-
[28]
Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images,
Y. Chen, H. Xu, C. Zheng, B. Zhuang, M. Pollefeys, A. Geiger, T.-J. Cham, and J. Cai, “Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images,” in ECCV, 2024
work page 2024
-
[29]
Denoising diffusion probabilistic models,
J. Ho, A. Jain, and P . Abbeel, “Denoising diffusion probabilistic models,” in NeurIPS, 2020
work page 2020
-
[30]
Denoising diffusion implicit models,
J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” in ICLR, 2021
work page 2021
-
[32]
Hifi-123: Towards high-fidelity one image to 3d content generation,
W. Yu, L. Yuan, Y.-P . Cao, X. Gao, X. Li, L. Quan, Y. Shan, and Y. Tian, “Hifi-123: Towards high-fidelity one image to 3d content generation,” in ECCV, 2024
work page 2024
-
[33]
Make-it-3d: High-fidelity 3d creation from a single image with diffusion prior,
J. Tang, T. Wang, B. Zhang, T. Zhang, R. Yi, L. Ma, and D. Chen, “Make-it-3d: High-fidelity 3d creation from a single image with diffusion prior,” in ICCV, 2023
work page 2023
-
[34]
Dreamcraft3d: Hierarchical 3d generation with bootstrapped dif- fusion prior,
J. Sun, B. Zhang, R. Shao, L. Wang, W. Liu, Z. Xie, and Y. Liu, “Dreamcraft3d: Hierarchical 3d generation with bootstrapped dif- fusion prior,” in ICLR, 2024
work page 2024
-
[35]
Gen- erative novel view synthesis with 3d-aware diffusion models,
E. R. Chan, K. Nagano, M. A. Chan, A. W. Bergman, J. J. Park, A. Levy, M. Aittala, S. De Mello, T. Karras, and G. Wetzstein, “Gen- erative novel view synthesis with 3d-aware diffusion models,” in ICCV, 2023
work page 2023
-
[36]
Objaverse: A universe of annotated 3d objects,
M. Deitke, D. Schwenk, J. Salvador, L. Weihs, O. Michel, E. Van- derBilt, L. Schmidt, K. Ehsani, A. Kembhavi, and A. Farhadi, “Objaverse: A universe of annotated 3d objects,” in CVPR, 2023
work page 2023
-
[37]
ShapeNet: An Information-Rich 3D Model Repository
A. X. Chang, T. Funkhouser, L. Guibas, P . Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su et al. , “Shapenet: An information-rich 3d model repository,” arXiv preprint arXiv:1512.03012, 2015. 13
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[38]
Novel view synthesis with diffusion models,
D. Watson, W. Chan, R. M. Brualla, J. Ho, A. Tagliasacchi, and M. Norouzi, “Novel view synthesis with diffusion models,” in ICLR, 2023
work page 2023
-
[39]
Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction,
J. Reizenstein, R. Shapovalov, P . Henzler, L. Sbordone, P . Labatut, and D. Novotny, “Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction,” in ICCV, 2021
work page 2021
-
[40]
Mvimgnet: A large-scale dataset of multi- view images,
X. Yu, M. Xu, Y. Zhang, H. Liu, C. Ye, Y. Wu, Z. Yan, C. Zhu, Z. Xiong, T. Liang et al., “Mvimgnet: A large-scale dataset of multi- view images,” in CVPR, 2023
work page 2023
-
[41]
Reconfusion: 3d reconstruction with diffusion priors,
R. Wu, B. Mildenhall, P . Henzler, K. Park, R. Gao, D. Watson, P . P . Srinivasan, D. Verbin, J. T. Barron, B. Poole et al., “Reconfusion: 3d reconstruction with diffusion priors,” in CVPR, 2024
work page 2024
-
[42]
Text2nerf: Text- driven 3d scene generation with neural radiance fields,
J. Zhang, X. Li, Z. Wan, C. Wang, and J. Liao, “Text2nerf: Text- driven 3d scene generation with neural radiance fields,” IEEE TVCG, 2024
work page 2024
-
[43]
High-resolution image synthesis with latent diffusion models,
R. Rombach, A. Blattmann, D. Lorenz, P . Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in CVPR, 2022
work page 2022
-
[44]
Adding conditional control to text-to-image diffusion models,
L. Zhang, A. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” in ICCV, 2023
work page 2023
-
[45]
C. Mou, X. Wang, L. Xie, Y. Wu, J. Zhang, Z. Qi, and Y. Shan, “T2i- adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models,” in AAAI, 2024
work page 2024
-
[46]
Gligen: Open-set grounded text-to-image generation,
Y. Li, H. Liu, Q. Wu, F. Mu, J. Yang, J. Gao, C. Li, and Y. J. Lee, “Gligen: Open-set grounded text-to-image generation,” in CVPR, 2023
work page 2023
-
[47]
Tooncrafter: Generative cartoon interpolation,
J. Xing, H. Liu, M. Xia, Y. Zhang, X. Wang, Y. Shan, and T.- T. Wong, “Tooncrafter: Generative cartoon interpolation,” arXiv preprint arXiv:2405.17933, 2024
-
[48]
Make-your-video: Customized video generation using textual and structural guidance,
J. Xing, M. Xia, Y. Liu, Y. Zhang, Y. Zhang, Y. He, H. Liu, H. Chen, X. Cun, X. Wang et al. , “Make-your-video: Customized video generation using textual and structural guidance,” IEEE TVCG , 2024
work page 2024
-
[49]
Structure and content-guided video synthesis with diffusion models,
P . Esser, J. Chiu, P . Atighehchian, J. Granskog, and A. Germani- dis, “Structure and content-guided video synthesis with diffusion models,” in ICCV, 2023
work page 2023
-
[50]
arXiv preprint arXiv:2308.08089 , year=
S. Yin, C. Wu, J. Liang, J. Shi, H. Li, G. Ming, and N. Duan, “Drag- nuwa: Fine-grained control in video generation by integrating text, image, and trajectory,” arXiv preprint arXiv:2308.08089 , 2023
-
[51]
M. Niu, X. Cun, X. Wang, Y. Zhang, Y. Shan, and Y. Zheng, “Mofa-video: Controllable image animation via generative motion field adaptions in frozen image-to-video diffusion model,” arXiv preprint arXiv:2405.20222, 2024
-
[52]
Vase: Object-centric appearance and shape manipulation of real videos,
E. Peruzzo, V . Goel, D. Xu, X. Xu, Y. Jiang, Z. Wang, H. Shi, and N. Sebe, “Vase: Object-centric appearance and shape manipulation of real videos,” arXiv preprint arXiv:2401.02473 , 2024
-
[53]
AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
Y. Guo, C. Yang, A. Rao, Y. Wang, Y. Qiao, D. Lin, and B. Dai, “Animatediff: Animate your personalized text-to-image diffusion models without specific tuning,” arXiv preprint arXiv:2307.04725 , 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[54]
Lora: Low-rank adaptation of large language models,
E. J. Hu, Y. Shen, P . Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “Lora: Low-rank adaptation of large language models,” in ICLR, 2022
work page 2022
-
[55]
Multidiff: Consistent novel view synthesis from a single image,
N. M ¨uller, K. Schwarz, B. R ¨ossle, L. Porzi, S. R. Bul `o, M. Nießner, and P . Kontschieder, “Multidiff: Consistent novel view synthesis from a single image,” in CVPR, 2024
work page 2024
-
[56]
Scannet: Richly-annotated 3d reconstructions of in- door scenes,
A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner, “Scannet: Richly-annotated 3d reconstructions of in- door scenes,” in CVPR, 2017
work page 2017
-
[57]
D. Xu, W. Nie, C. Liu, S. Liu, J. Kautz, Z. Wang, and A. Vahdat, “Camco: Camera-controllable 3d-consistent image-to-video gener- ation,” arXiv preprint arXiv:2406.02509 , 2024
-
[58]
CameraCtrl: Enabling Camera Control for Text-to-Video Generation
H. He, Y. Xu, Y. Guo, G. Wetzstein, B. Dai, H. Li, and C. Yang, “Cameractrl: Enabling camera control for text-to-video genera- tion,” arXiv preprint arXiv:2404.02101 , 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[59]
Light field networks: Neural scene representations with single-evaluation rendering,
V . Sitzmann, S. Rezchikov, B. Freeman, J. Tenenbaum, and F. Du- rand, “Light field networks: Neural scene representations with single-evaluation rendering,” in NeurIPS, 2021
work page 2021
-
[60]
Latent-nerf for shape-guided generation of 3d shapes and textures,
G. Metzer, E. Richardson, O. Patashnik, R. Giryes, and D. Cohen- Or, “Latent-nerf for shape-guided generation of 3d shapes and textures,” in CVPR, 2023
work page 2023
-
[61]
Plastria, “The weiszfeld algorithm: proof, amendments and extensions, ha eiselt and v
F. Plastria, “The weiszfeld algorithm: proof, amendments and extensions, ha eiselt and v. marianov (eds.) foundations of location analysis, international series in operations research and manage- ment science,” 2011
work page 2011
-
[62]
Learning transfer- able visual models from natural language supervision,
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P . Mishkin, J. Clark et al., “Learning transfer- able visual models from natural language supervision,” in ICML, 2021
work page 2021
-
[63]
Cat3d: Create any- thing in 3d with multi-view diffusion models,
R. Gao, A. Holynski, P . Henzler, A. Brussee, R. Martin-Brualla, P . Srinivasan, J. T. Barron, and B. Poole, “Cat3d: Create any- thing in 3d with multi-view diffusion models,” arXiv preprint arXiv:2405.10314, 2024
-
[64]
R. Zeng, Y. Wen, W. Zhao, and Y.-J. Liu, “View planning in robot active vision: A survey of systems, algorithms, and applications,” Computational Visual Media, vol. 6, 2020
-
[65]
H. Dhami, V. D. Sharma, and P. Tokekar, “Pred-nbv: Prediction-guided next-best-view planning for 3d object reconstruction,” in IROS, 2023
-
[66]
L. Jin, X. Chen, J. Rückin, and M. Popović, “Neu-nbv: Next best view planning using uncertainty estimation in image-based neural rendering,” in IROS, 2023
-
[67]
S. Isler, R. Sabzevari, J. Delmerico, and D. Scaramuzza, “An information gain formulation for active volumetric 3d reconstruction,” in ICRA, 2016
-
[68]
D. Peralta, J. Casimiro, A. M. Nilles, J. A. Aguilar, R. Atienza, and R. Cajote, “Next-best view policy for 3d reconstruction,” in ECCVW, 2020
-
[69]
M.-L. Shih, S.-Y. Su, J. Kopf, and J.-B. Huang, “3d photography using context-aware layered depth inpainting,” in CVPR, 2020
-
[70]
L. Ling, Y. Sheng, Z. Tu, W. Zhao, C. Xin, K. Wan, L. Yu, Q. Guo, Z. Yu, Y. Lu et al., “Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision,” in CVPR, 2024
-
[71]
N. Ravi, J. Reizenstein, D. Novotny, T. Gordon, W.-Y. Lo, J. Johnson, and G. Gkioxari, “Accelerating 3d deep learning with pytorch3d,” arXiv:2007.08501, 2020
-
[72]
J. Ho and T. Salimans, “Classifier-free diffusion guidance,” arXiv preprint arXiv:2207.12598, 2022
-
[73]
Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE TIP, vol. 13, no. 4, pp. 600–612, 2004
-
[74]
R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in CVPR, 2018
-
[75]
M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,” in NeurIPS, 2017
-
[76]
J. L. Schönberger and J.-M. Frahm, “Structure-from-motion revisited,” in CVPR, 2016
-
[77]
J. Li, J. Zhang, X. Bai, J. Zheng, X. Ning, J. Zhou, and L. Gu, “Dngaussian: Optimizing sparse-view 3d gaussian radiance fields with global-local depth normalization,” in CVPR, 2024
-
[78]
Z. Zhu, Z. Fan, Y. Jiang, and Z. Wang, “Fsgs: Real-time few-shot view synthesis using gaussian splatting,” in ECCV, 2024
-
[79]
Z. Fan, W. Cong, K. Wen, K. Wang, J. Zhang, X. Ding, D. Xu, B. Ivanovic, M. Pavone, G. Pavlakos et al., “Instantsplat: Unbounded sparse-view pose-free gaussian splatting in 40 seconds,” arXiv:2403.20309, 2024
-
[80]
S. Bahmani, I. Skorokhodov, A. Siarohin, W. Menapace, G. Qian, M. Vasilkovsky, H.-Y. Lee, C. Wang, J. Zou, A. Tagliasacchi et al., “Vd3d: Taming large video diffusion transformers for 3d camera control,” arXiv preprint arXiv:2407.12781, 2024
-
[81]
Y.-B. Jia, “Plücker coordinates for lines in the space,” Problem Solver Techniques for Applied Computer Science, Com-S-477/577 Course Handout, 2020