CAT3D: Create Anything in 3D with Multi-View Diffusion Models
Pith reviewed 2026-05-19 21:24 UTC · model grok-4.3
The pith
A multi-view diffusion model generates consistent novel views from any inputs to drive fast, high-quality 3D scene reconstruction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CAT3D employs a multi-view diffusion model that, conditioned on an arbitrary set of input images and a collection of target novel viewpoints, synthesizes a set of highly consistent novel views of the scene. These generated views are then passed directly to robust 3D reconstruction methods to obtain representations that support real-time rendering from arbitrary viewpoints. The resulting pipeline creates entire 3D scenes in as little as one minute and achieves better results than existing approaches for single-image and few-view 3D scene creation.
What carries the argument
A multi-view diffusion model that jointly synthesizes geometrically consistent images across multiple user-specified target viewpoints given any number of input images.
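As a rough illustration of how the two stages fit together, the following Python sketch wires a view generator into a reconstruction step. Both callables (`generate_views`, `reconstruct`), the `create_scene` name, and the 4x4 camera-to-world pose convention are hypothetical placeholders rather than the paper's API. Passing the input images to reconstruction alongside the generated ones is also an assumption.

```python
from typing import Any, Callable, List, Sequence

import numpy as np

Image = np.ndarray  # H x W x 3 array (assumed layout)
Pose = np.ndarray   # 4 x 4 camera-to-world matrix (assumed convention)


def create_scene(
    input_images: Sequence[Image],
    input_poses: Sequence[Pose],
    target_poses: Sequence[Pose],
    generate_views: Callable[[Sequence[Image], Sequence[Pose], Sequence[Pose]], List[Image]],
    reconstruct: Callable[[List[Image], List[Pose]], Any],
) -> Any:
    """Two-stage flow: synthesize views at the target poses, then reconstruct."""
    if len(input_images) != len(input_poses):
        raise ValueError("each input image needs exactly one pose")

    # Stage 1 (hypothetical): the multi-view model produces one image per target pose.
    generated = generate_views(input_images, input_poses, target_poses)
    if len(generated) != len(target_poses):
        raise ValueError("generator must return one image per target pose")

    # Stage 2 (hypothetical): hand observed and generated views to a reconstruction method.
    all_images = list(input_images) + list(generated)
    all_poses = list(input_poses) + list(target_poses)
    return reconstruct(all_images, all_poses)
```

Keeping the generator and the reconstructor as injected callables mirrors the claim that the generated views plug into existing reconstruction methods: either stage can be swapped without touching the other.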
If this is right
- 3D scenes can be reconstructed from just one or a few input images instead of hundreds.
- The generated views integrate directly with existing reconstruction pipelines without added constraints.
- Complete scenes become available for real-time rendering shortly after the diffusion step.
- The method outperforms prior work on single-image and few-view 3D creation benchmarks.
- Full scene creation completes in approximately one minute.
Where Pith is reading between the lines
- The consistency property could enable reliable 3D capture using only casual smartphone snapshots.
- The same conditioning mechanism might extend to generating views for dynamic or time-varying scenes.
- Generated view sets could serve as synthetic training data to improve other 3D models.
- Applying the pipeline to uncontrolled outdoor environments with changing light would test its robustness beyond controlled settings.
Load-bearing premise
The novel views produced by the model maintain enough geometric and photometric consistency that off-the-shelf 3D reconstruction algorithms succeed without extra regularization or filtering even on sparse inputs or complex scenes.
What would settle it
Feed the model's generated views from a single real-world photo of a scene with fine geometry and varying illumination into a standard reconstruction method such as 3D Gaussian splatting and measure whether the resulting model renders without visible distortions from viewpoints outside the generated set.
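One way to make this test quantitative, sketched below in Python, is to compare renders from held-out viewpoints against reference photos using PSNR. This assumes reference photos exist at those viewpoints and that images are float arrays in [0, 1]. The function names are illustrative, and PSNR is a stand-in for "visible distortions", not a metric the text above prescribes.

```python
import numpy as np


def psnr(rendered, reference, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    rendered = np.asarray(rendered, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    if rendered.shape != reference.shape:
        raise ValueError("images must have the same shape")
    mse = np.mean((rendered - reference) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))


def mean_heldout_psnr(renders, references):
    """Average PSNR over paired held-out renders and reference photos."""
    renders, references = list(renders), list(references)
    if not renders or len(renders) != len(references):
        raise ValueError("need equally many, and at least one, renders and references")
    return float(np.mean([psnr(r, g) for r, g in zip(renders, references)]))
```

A low average on viewpoints far from the generated set, alongside a high average near it, would point to the drift the premise rules out.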
Original abstract
Advances in 3D reconstruction have enabled high-quality 3D capture, but require a user to collect hundreds to thousands of images to create a 3D scene. We present CAT3D, a method for creating anything in 3D by simulating this real-world capture process with a multi-view diffusion model. Given any number of input images and a set of target novel viewpoints, our model generates highly consistent novel views of a scene. These generated views can be used as input to robust 3D reconstruction techniques to produce 3D representations that can be rendered from any viewpoint in real-time. CAT3D can create entire 3D scenes in as little as one minute, and outperforms existing methods for single image and few-view 3D scene creation. See our project page for results and interactive demos at https://cat3d.github.io.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CAT3D, a multi-view diffusion model that takes any number of input images plus target novel viewpoints and generates highly consistent novel views of a scene. These views are then fed directly into standard 3D reconstruction pipelines (e.g., COLMAP or NeRF) to produce renderable 3D representations in as little as one minute, with reported outperformance over prior single-image and few-view methods.
Significance. If the consistency and reconstruction claims hold under sparse or out-of-distribution inputs, the work would meaningfully reduce the data burden for high-quality 3D capture and enable rapid scene creation for real-time rendering applications. It demonstrates practical utility of conditioned diffusion models for view synthesis that integrates with existing reconstruction tools.
major comments (2)
- [§4 (Experiments)] The central claim that generated views are sufficiently geometrically and photometrically consistent for direct use in off-the-shelf reconstruction without extra regularization or filtering is load-bearing, yet the reported benchmarks focus on overall outperformance rather than explicit cross-view consistency metrics (e.g., multi-view depth variance or edge alignment error) on sparse inputs or scenes with complex lighting/geometry outside the training distribution.
- [§3 (Method)] The description relies on standard diffusion training plus view-conditioning without an explicit cross-view consistency loss or post-processing step; this makes the assumption that outputs remain globally consistent (rather than locally plausible but drifting) an empirical outcome that must be directly verified for the downstream reconstruction claim to be supported.
minor comments (2)
- [Abstract] The statement that the method 'outperforms existing methods' should name the specific metrics and baselines for immediate clarity.
- Figure captions in the qualitative results could include more detail on camera poses and input sparsity levels to help readers assess consistency.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments. We address each major point below and have revised the manuscript to strengthen the presentation of consistency evidence.
Point-by-point responses
- Referee: [§4 (Experiments)] The central claim that generated views are sufficiently geometrically and photometrically consistent for direct use in off-the-shelf reconstruction without extra regularization or filtering is load-bearing, yet the reported benchmarks focus on overall outperformance rather than explicit cross-view consistency metrics (e.g., multi-view depth variance or edge alignment error) on sparse inputs or scenes with complex lighting/geometry outside the training distribution.
  Authors: We agree that explicit cross-view consistency metrics provide valuable additional support for the central claim. While end-to-end reconstruction success with COLMAP and NeRF already serves as a strong indirect indicator (inconsistent views would cause reconstruction failure), we have added direct quantitative evaluations in the revised Section 4. These include multi-view depth variance and photometric consistency measures computed across generated views for sparse-input cases. We have also included results on additional scenes with complex geometry and lighting. The new metrics and figures confirm low variance and alignment, reinforcing that the generated views are suitable for direct use in standard pipelines.
  Revision: yes
- Referee: [§3 (Method)] The description relies on standard diffusion training plus view-conditioning without an explicit cross-view consistency loss or post-processing step; this makes the assumption that outputs remain globally consistent (rather than locally plausible but drifting) an empirical outcome that must be directly verified for the downstream reconstruction claim to be supported.
  Authors: The referee correctly notes that our approach uses standard diffusion training with view conditioning and does not introduce an auxiliary consistency loss. Global consistency is indeed an emergent property learned from large-scale multi-view training data. To directly verify this for the reconstruction claim, the revised manuscript adds an ablation study in Section 3 and new results in Section 4 comparing reconstructions obtained from views generated with versus without multi-view conditioning. The ablations show clear degradation in both consistency and final 3D quality when conditioning is removed. We have also added qualitative consistency visualizations in the supplement. We view these empirical verifications as sufficient support while acknowledging that an explicit consistency term remains an interesting direction for future work.
  Revision: yes
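The exchange above names "multi-view depth variance" without defining it. The Python sketch below shows one possible reading, under stated assumptions: each generated view's depth has already been back-projected to world-space 3D points, and correspondences across views are known, so that row `p` of every view refers to the same scene point. The function name and input layout are illustrative and not taken from the paper or the rebuttal.

```python
import numpy as np


def multiview_point_variance(points_world):
    """Mean positional variance of corresponding 3D points across views.

    points_world: array of shape (V, P, 3), where points_world[v, p] is view v's
    world-space estimate of scene point p (assumed layout, known correspondences).
    Returns 0.0 when all views agree exactly; larger values mean more drift.
    """
    pts = np.asarray(points_world, dtype=np.float64)
    if pts.ndim != 3 or pts.shape[2] != 3:
        raise ValueError("expected an array of shape (V, P, 3)")
    if pts.shape[0] < 2:
        raise ValueError("need at least two views to measure consistency")
    # Population variance across views for each coordinate, summed over x, y, z.
    per_point = pts.var(axis=0).sum(axis=1)  # shape (P,)
    return float(per_point.mean())
```

Because the score is expressed in squared scene units, it is only comparable between scenes after normalizing for scene scale.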
Circularity Check
No significant circularity; derivation is empirical and externally validated
Full rationale
The paper presents an empirical multi-view diffusion model trained to generate novel views from sparse inputs, with consistency and downstream 3D reconstruction success demonstrated via held-out test scenes and off-the-shelf pipelines (COLMAP/NeRF) rather than any closed-form reduction of outputs to training losses or self-defined quantities. No equations or claims reduce a 'prediction' to a fitted parameter by construction, and no load-bearing uniqueness theorem or ansatz is imported via self-citation. The central claim rests on external evaluation benchmarks, making the method self-contained against independent data and reconstruction algorithms.
Axiom & Free-Parameter Ledger
free parameters (1)
- Diffusion model weights
axioms (1)
- [standard math] Standard diffusion model training objective and sampling procedure apply without modification to the multi-view conditioning setting.
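For orientation, the widely used noise-prediction form of the diffusion training objective, extended with a generic conditioning signal, is sketched below. The symbols are assumptions introduced here, not notation from the source: $x_0$ denotes the clean target views, $c$ the conditioning (input images plus camera poses), $\epsilon_\theta$ the denoising network, and $\bar{\alpha}_t$ the cumulative noise schedule. The paper's actual parameterization may differ (for example, operating in a latent space or predicting a different target).

```latex
\mathcal{L}(\theta) =
\mathbb{E}_{x_0,\, c,\, t,\, \epsilon \sim \mathcal{N}(0, I)}
\left[
  \left\|
    \epsilon - \epsilon_\theta\!\left(
      \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\; t,\; c
    \right)
  \right\|_2^2
\right]
```

Under this reading, the axiom amounts to saying that only $c$ changes in the multi-view setting, while the loss and the sampler keep their usual form.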
Lean theorems connected to this paper
- Foundation.DimensionForcing · alexander_duality_circle_linking (tag: echoes)
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: "Given any number of input images and a set of target novel viewpoints, our model generates highly consistent novel views of a scene. These generated views can be used as input to robust 3D reconstruction techniques"
- Foundation.HierarchyEmergence · hierarchy_emergence_forces_phi (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Passage: "CAT3D can create entire 3D scenes in as little as one minute, and outperforms existing methods for single image and few-view 3D scene creation"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 20 Pith papers
- CRePE: Curved Ray Expectation Positional Encoding for Unified-Camera-Controlled Video Generation
  CRePE supplies depth-aware positional distributions along curved rays for stable unified-camera control in frozen video DiT models.
- GSCompleter: A Distillation-Free Plugin for Metric-Aware 3D Gaussian Splatting Completion in Seconds
  GSCompleter completes sparse 3D Gaussian Splatting scenes via a distillation-free generate-then-register pipeline using Stereo-Anchor lifting and Ray-Constrained Registration, delivering SOTA results on three benchmarks.
- UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
  UniGeo unifies geometric guidance across three levels in video models to reduce geometric drift and improve consistency in camera-controllable image editing.
- Geometrically Consistent Multi-View Scene Generation from Freehand Sketches
  A framework generates consistent multi-view scenes from one freehand sketch via a ~9k-sample dataset, Parallel Camera-Aware Attention Adapters, and Sparse Correspondence Supervision Loss, outperforming baselines in re...
- Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors
  A video generation approach conditions a base model with multi-scale 3D latent features and a cross-attention adapter to produce geometrically realistic and consistent orbital videos from one image.
- Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale
  A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.
- Novel View Synthesis as Video Completion
  Video diffusion models can be adapted into permutation-invariant generators for sparse novel view synthesis by treating the problem as video completion and removing temporal order cues.
- HAD: Hallucination-Aware Diffusion Priors for 3D Reconstruction
  HAD uses multi-view reasoning from a pre-trained feedforward NVS network to estimate and mask hallucination scores in diffusion priors, reducing artifacts and achieving SOTA novel view synthesis in sparse-view 3D reco...
- FurnSet: Exploiting Repeats for 3D Scene Reconstruction
  FurnSet improves single-view 3D scene reconstruction by using per-object CLS tokens and set-aware self-attention to group and jointly reconstruct repeated object instances, with added scene-object conditioning and lay...
- UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
  UniGeo adds unified geometric guidance at three levels in video models to reduce geometric drift and improve structural fidelity in camera-controllable image editing.
- Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective
  The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temp...
- NavCrafter: Exploring 3D Scenes from a Single Image
  NavCrafter generates controllable novel-view videos from one image via video diffusion, geometry-aware expansion, and enhanced 3D Gaussian Splatting to achieve state-of-the-art synthesis under large viewpoint changes.
- ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis
  ViewCrafter tames video diffusion models with point-based 3D guidance and iterative trajectory planning to produce high-fidelity novel views from single or sparse images.
- DreamEdit3D: Personalization of Multi-View Diffusion Models for 3D Editing
  DreamEdit3D learns separate token embeddings for segmented object components via two-phase multi-view optimization to enable text-guided 3D editing with consistent image generation and mesh reconstruction.
- DecoRec: Decomposed 3D Scene Reconstruction from Single-View Images via Object-Level Diffusion
  DecoRec decomposes single-view 3D scene reconstruction into per-object diffusion reconstructions followed by a differentiable rendering and diffusion-guided merging pipeline.
- GeoRect4D: Geometry-Compatible Generative Rectification for Dynamic Sparse-View 3D Reconstruction
  GeoRect4D couples 3D Gaussian splatting with a single-step diffusion rectifier via degradation-aware feedback and progressive optimization to improve fidelity and consistency in sparse-view dynamic 3D reconstruction.
- Asset Harvester: Extracting 3D Assets from Autonomous Driving Logs for Simulation
  Asset Harvester converts sparse in-the-wild object observations from AV driving logs into complete simulation-ready 3D assets via data curation, geometry-aware preprocessing, and a SparseViewDiT model that couples spa...
- InSpatio-WorldFM: An Open-Source Real-Time Generative Frame Model
  InSpatio-WorldFM is a frame-independent generative model that uses explicit 3D anchors and spatial memory to deliver real-time multi-view consistent spatial intelligence via a three-stage training pipeline from pretra...
- ViPE: Video Pose Engine for 3D Geometric Perception
  ViPE estimates camera intrinsics, motion, and dense near-metric depth from uncalibrated videos, outperforming baselines on TUM and KITTI while releasing annotations for 96M frames across real and generated videos.
- Learning World Models for Interactive Video Generation
  The work introduces video retrieval augmented generation (VRAG) with explicit global state conditioning to reduce compounding errors and improve spatiotemporal consistency in interactive video world models.
Reference graph
Works this paper leans on
-
[1]
Srinivasan, Matthew Tancik, Jonathan T
Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. ECCV, 2020
work page 2020
-
[2]
Instant neural graphics primitives with a multiresolution hash encoding
Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. SIGGRAPH, 2022. 10
work page 2022
-
[3]
3D Gaussian Splatting for Real-Time Radiance Field Rendering
Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D Gaussian Splatting for Real-Time Radiance Field Rendering. SIGGRAPH, 2023
work page 2023
-
[4]
FreeNeRF: Improving Few-shot Neural Rendering with Free Frequency Regularization
Jiawei Yang, Marco Pavone, and Yue Wang. FreeNeRF: Improving Few-shot Neural Rendering with Free Frequency Regularization. CVPR, 2023
work page 2023
-
[5]
SimpleNeRF: Regulariz- ing Sparse Input Neural Radiance Fields with Simpler Solutions
Nagabhushan Somraj, Adithyan Karanayil, and Rajiv Soundararajan. SimpleNeRF: Regulariz- ing Sparse Input Neural Radiance Fields with Simpler Solutions. SIGGRAPH Asia, 2023
work page 2023
-
[6]
LRM: Large Reconstruction Model for Single Image to 3D
Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. LRM: Large Reconstruction Model for Single Image to 3D. arXiv:2311.04400, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[7]
Srinivasan, Dor Verbin, Jonathan T
Rundi Wu, Ben Mildenhall, Philipp Henzler, Keunhong Park, Ruiqi Gao, Daniel Watson, Pratul P. Srinivasan, Dor Verbin, Jonathan T. Barron, Ben Poole, and Aleksander Holynski. Reconfusion: 3d reconstruction with diffusion priors, 2023
work page 2023
-
[8]
Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. DreamFusion: Text-to-3D using 2D Diffusion. ICLR, 2022
work page 2022
-
[9]
Imagedream: Image-prompt multi-view diffusion for 3d generation
Peng Wang and Yichun Shi. Imagedream: Image-prompt multi-view diffusion for 3d generation. arXiv:2312.02201, 2023
-
[10]
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Do- minik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv:2311.15127, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[11]
Align your latents: High-resolution video synthesis with latent diffusion models
Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. CVPR, 2023
work page 2023
-
[12]
Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning
Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, and Ishan Misra. Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning. arXiv:2311.10709, 2023
-
[13]
Lumiere: A space-time diffusion model for video generation
Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Yuanzhen Li, Tomer Michaeli, et al. Lumiere: A space-time diffusion model for video generation. arXiv, 2024
work page 2024
-
[14]
Photorealistic video generation with diffusion models, 2023
Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Li Fei-Fei, Irfan Essa, Lu Jiang, and José Lezama. Photorealistic video generation with diffusion models, 2023
work page 2023
-
[15]
Video generation models as world simulators
Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. 2024
work page 2024
-
[16]
State of the art on diffusion models for visual computing
Ryan Po, Wang Yifan, Vladislav Golyanik, Kfir Aberman, Jonathan T Barron, Amit H Bermano, Eric Ryan Chan, Tali Dekel, Aleksander Holynski, Angjoo Kanazawa, et al. State of the art on diffusion models for visual computing. arXiv:2310.07204, 2023
-
[17]
IM-3D: Iterative Multiview Diffusion and Reconstruction for High-Quality 3D Generation, 2024
Luke Melas-Kyriazi, Iro Laina, Christian Rupprecht, Natalia Neverova, Andrea Vedaldi, Oran Gafni, and Filippos Kokkinos. IM-3D: Iterative Multiview Diffusion and Reconstruction for High-Quality 3D Generation, 2024
work page 2024
-
[18]
Dream- Time: An Improved Optimization Strategy for Text-to-3D Content Creation
Yukun Huang, Jianan Wang, Yukai Shi, Xianbiao Qi, Zheng-Jun Zha, and Lei Zhang. Dream- Time: An Improved Optimization Strategy for Text-to-3D Content Creation. arXiv, 2023
work page 2023
-
[19]
SteinDreamer: Variance Reduction for Text-to-3D Score Distillation via Stein Identity
Peihao Wang, Zhiwen Fan, Dejia Xu, Dilin Wang, Sreyas Mohan, Forrest Iandola, Rakesh Ranjan, Yilei Li, Qiang Liu, Zhangyang Wang, et al. SteinDreamer: Variance Reduction for Text-to-3D Score Distillation via Stein Identity. arXiv, 2023
work page 2023
-
[20]
Collaborative score distillation for consistent visual editing
Subin Kim, Kyungmin Lee, June Suk Choi, Jongheon Jeong, Kihyuk Sohn, and Jinwoo Shin. Collaborative score distillation for consistent visual editing. NeurIPS, 36, 2024. 11
work page 2024
-
[21]
ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation
Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation. NeurIPS, 2023
work page 2023
-
[22]
Instruct-nerf2nerf: Editing 3d scenes with instructions
Ayaan Haque, Matthew Tancik, Alexei A Efros, Aleksander Holynski, and Angjoo Kanazawa. Instruct-nerf2nerf: Editing 3d scenes with instructions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19740–19750, 2023
work page 2023
-
[23]
Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation
Rui Chen, Yongwei Chen, Ningxin Jiao, and Kui Jia. Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation. ICCV, 2023
work page 2023
-
[24]
Magic3D: High-Resolution Text-to-3D Content Creation
Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3D: High-Resolution Text-to-3D Content Creation. CVPR, 2023
work page 2023
-
[25]
DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation
Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. Dreamgaussian: Generative gaussian splatting for efficient 3d content creation. arXiv:2309.16653, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[26]
Gaussiandreamer: Fast generation from text to 3d gaussian splatting with point cloud priors
Taoran Yi, Jiemin Fang, Guanjun Wu, Lingxi Xie, Xiaopeng Zhang, Wenyu Liu, Qi Tian, and Xinggang Wang. Gaussiandreamer: Fast generation from text to 3d gaussian splatting with point cloud priors. arXiv:2310.08529, 2023
-
[27]
Disentan- gled 3d scene generation with layout learning
Dave Epstein, Ben Poole, Ben Mildenhall, Alexei A Efros, and Aleksander Holynski. Disentan- gled 3d scene generation with layout learning. arXiv preprint arXiv:2402.16936, 2024
-
[28]
ATT3D: Amortized Text-to-3D Object Synthesis
Jonathan Lorraine, Kevin Xie, Xiaohui Zeng, Chen-Hsuan Lin, Towaki Takikawa, Nicholas Sharp, Tsung-Yi Lin, Ming-Yu Liu, Sanja Fidler, and James Lucas. ATT3D: Amortized Text-to-3D Object Synthesis. ICCV, 2023
work page 2023
-
[29]
Realfusion: 360deg reconstruction of any object from a single image
Luke Melas-Kyriazi, Iro Laina, Christian Rupprecht, and Andrea Vedaldi. Realfusion: 360deg reconstruction of any object from a single image. CVPR, 2023
work page 2023
-
[30]
Guocheng Qian, Jinjie Mai, Abdullah Hamdi, Jian Ren, Aliaksandr Siarohin, Bing Li, Hsin- Ying Lee, Ivan Skorokhodov, Peter Wonka, Sergey Tulyakov, et al. Magic123: One Image to High-Quality 3D Object Generation Using Both 2D and 3D Diffusion Priors.arXiv:2306.17843, 2023
-
[31]
Make-It-3D: High-Fidelity 3D Creation from A Single Image with Diffusion Prior
Junshu Tang, Tengfei Wang, Bo Zhang, Ting Zhang, Ran Yi, Lizhuang Ma, and Dong Chen. Make-It-3D: High-Fidelity 3D Creation from A Single Image with Diffusion Prior. ICCV, 2023
work page 2023
-
[32]
Text2Room: Extracting Textured 3D Meshes from 2D Text-to-Image Models
Lukas Höllein, Ang Cao, Andrew Owens, Justin Johnson, and Matthias Nießner. Text2Room: Extracting Textured 3D Meshes from 2D Text-to-Image Models. ICCV, 2023
work page 2023
-
[33]
Monocular depth estimation using diffusion models
Saurabh Saxena, Abhishek Kar, Mohammad Norouzi, and David J Fleet. Monocular depth estimation using diffusion models. arXiv:2302.14816, 2023
-
[34]
WonderJourney: Going from Anywhere to Everywhere
Hong-Xing Yu, Haoyi Duan, Junhwa Hur, Kyle Sargent, Michael Rubinstein, William T Freeman, Forrester Cole, Deqing Sun, Noah Snavely, Jiajun Wu, et al. WonderJourney: Going from Anywhere to Everywhere. arXiv:2312.03884, 2023
-
[35]
Nerfiller: Completing scenes via generative 3d inpainting
Ethan Weber, Aleksander Hoły´nski, Varun Jampani, Saurabh Saxena, Noah Snavely, Abhishek Kar, and Angjoo Kanazawa. Nerfiller: Completing scenes via generative 3d inpainting. arXiv preprint arXiv:2312.04560, 2023
-
[36]
Zero-1-to-3: Zero-Shot One Image to 3D Object
Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl V ondrick. Zero-1-to-3: Zero-Shot One Image to 3D Object. arXiv, 2023
work page 2023
-
[37]
Novel view synthesis with diffusion models
Daniel Watson, William Chan, Ricardo Martin-Brualla, Jonathan Ho, Andrea Tagliasacchi, and Mohammad Norouzi. Novel view synthesis with diffusion models. arXiv:2210.04628, 2022
-
[38]
DreamBooth3D: Subject-Driven Text-to-3D Generation
Amit Raj, Srinivas Kaza, Ben Poole, Michael Niemeyer, Ben Mildenhall, Nataniel Ruiz, Shiran Zada, Kfir Aberman, Michael Rubenstein, Jonathan Barron, Yuanzhen Li, and Varun Jampani. DreamBooth3D: Subject-Driven Text-to-3D Generation. ICCV, 2023. 12
work page 2023
-
[39]
NerfDiff: Single-image View Synthesis with NeRF-guided Distillation from 3D-aware Diffusion
Jiatao Gu, Alex Trevithick, Kai-En Lin, Joshua M Susskind, Christian Theobalt, Lingjie Liu, and Ravi Ramamoorthi. NerfDiff: Single-image View Synthesis with NeRF-guided Distillation from 3D-aware Diffusion. ICML, 2023
work page 2023
-
[40]
GeNVS: Generative novel view synthesis with 3D-aware diffusion models
Eric R Chan, Koki Nagano, Matthew A Chan, Alexander W Bergman, Jeong Joon Park, Axel Levy, Miika Aittala, Shalini De Mello, Tero Karras, and Gordon Wetzstein. GeNVS: Generative novel view synthesis with 3D-aware diffusion models. arXiv, 2023
work page 2023
-
[41]
ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Image
Kyle Sargent, Zizhang Li, Tanmay Shah, Charles Herrmann, Hong-Xing Yu, Yunzhi Zhang, Eric Ryan Chan, Dmitry Lagun, Li Fei-Fei, Deqing Sun, et al. ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Image. CVPR, 2024
work page 2024
-
[42]
One-2-3-45: Any Single Image to 3D Mesh in 45 Seconds without Per-Shape Optimization
Minghua Liu, Chao Xu, Haian Jin, Linghao Chen, Mukund Varma T, Zexiang Xu, and Hao Su. One-2-3-45: Any Single Image to 3D Mesh in 45 Seconds without Per-Shape Optimization. arXiv, 2023
work page 2023
-
[43]
MVDream: Multi-view Diffusion for 3D Generation
Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, and Xiao Yang. MVDream: Multi-view Diffusion for 3D Generation. arXiv, 2023
work page 2023
-
[44]
Zero123++: a single image to consistent multi-view diffusion base model, 2023
Ruoxi Shi, Hansheng Chen, Zhuoyang Zhang, Minghua Liu, Chao Xu, Xinyue Wei, Linghao Chen, Chong Zeng, and Hao Su. Zero123++: a single image to consistent multi-view diffusion base model, 2023
work page 2023
-
[45]
ConsistNet: Enforcing 3D Consistency for Multi-view Images Diffusion
Jiayu Yang, Ziang Cheng, Yunfei Duan, Pan Ji, and Hongdong Li. ConsistNet: Enforcing 3D Consistency for Multi-view Images Diffusion. arXiv:2310.10343, 2023
-
[46]
SyncDreamer: Generating Multiview-consistent Images from a Single-view Image
Yuan Liu, Cheng Lin, Zijiao Zeng, Xiaoxiao Long, Lingjie Liu, Taku Komura, and Wenping Wang. SyncDreamer: Generating Multiview-consistent Images from a Single-view Image. arXiv, 2023
work page 2023
-
[47]
Viewdiff: 3d-consistent image generation with text-to-image models, 2024
Lukas Höllein, Aljaž Boži ˇc, Norman Müller, David Novotny, Hung-Yu Tseng, Christian Richardt, Michael Zollhöfer, and Matthias Nießner. Viewdiff: 3d-consistent image generation with text-to-image models, 2024
work page 2024
-
[48]
Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. arXiv:2204.03458, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[49]
Imagen Video: High Definition Video Generation with Diffusion Models
Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv:2210.02303, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[50]
Video interpolation with diffusion models
Siddhant Jain, Daniel Watson, Eric Tabellion, Aleksander Hoły ´nski, Ben Poole, and Janne Kontkanen. Video interpolation with diffusion models. arXiv preprint arXiv:2404.01203, 2024
-
[51]
AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
Yuwei Guo, Ceyuan Yang, Anyi Rao, Yaohui Wang, Yu Qiao, Dahua Lin, and Bo Dai. Ani- matediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[52]
Motionctrl: A unified and flexible motion controller for video generation
Zhouxia Wang, Ziyang Yuan, Xintao Wang, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. Motionctrl: A unified and flexible motion controller for video generation. arXiv preprint arXiv:2312.03641, 2023
-
[53]
ViVid-1-to-3: Novel View Synthesis with Video Diffusion Models.arXiv:2312.01305, 2023
Jeong-gi Kwak, Erqun Dong, Yuhe Jin, Hanseok Ko, Shweta Mahajan, and Kwang Moo Yi. ViVid-1-to-3: Novel View Synthesis with Video Diffusion Models.arXiv:2312.01305, 2023
-
[54]
Vikram V oleti, Chun-Han Yao, Mark Boss, Adam Letts, David Pankratz, Dmitry Tochilkin, Christian Laforte, Robin Rombach, and Varun Jampani. SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image using Latent Video Diffusion, 2024
work page 2024
-
[55]
3DGen: Triplane Latent Diffusion for Textured Mesh Generation
Anchit Gupta, Wenhan Xiong, Yixin Nie, Ian Jones, and Barlas O˘guz. 3DGen: Triplane Latent Diffusion for Textured Mesh Generation. arXiv:2303.05371, 2023
-
[56]
Viewset Diffusion: (0-)Image-Conditioned 3D Generative Models from 2D Data
Stanislaw Szymanowicz, Christian Rupprecht, and Andrea Vedaldi. Viewset Diffusion: (0-)Image-Conditioned 3D Generative Models from 2D Data. ICCV, 2023
-
[57]
DMV3D: Denoising Multi-View Diffusion using 3D Large Reconstruction Model
Yinghao Xu, Hao Tan, Fujun Luan, Sai Bi, Peng Wang, Jiahao Li, Zifan Shi, Kalyan Sunkavalli, Gordon Wetzstein, Zexiang Xu, and Kai Zhang. DMV3D: Denoising Multi-View Diffusion using 3D Large Reconstruction Model, 2023
-
[58]
Instant3D: Fast Text-to-3D with Sparse-View Generation and Large Reconstruction Model
Jiahao Li, Hao Tan, Kai Zhang, Zexiang Xu, Fujun Luan, Yinghao Xu, Yicong Hong, Kalyan Sunkavalli, Greg Shakhnarovich, and Sai Bi. Instant3D: Fast Text-to-3D with Sparse-View Generation and Large Reconstruction Model. arXiv:2311.06214, 2023
-
[59]
Splatter image: Ultra-fast single-view 3d reconstruction
Stanislaw Szymanowicz, Christian Rupprecht, and Andrea Vedaldi. Splatter image: Ultra-fast single-view 3d reconstruction. arXiv:2312.13150, 2023
-
[60]
GS-LRM: Large Reconstruction Model for 3D Gaussian Splatting
Kai Zhang, Sai Bi, Hao Tan, Yuanbo Xiangli, Nanxuan Zhao, Kalyan Sunkavalli, and Zexiang Xu. GS-LRM: Large Reconstruction Model for 3D Gaussian Splatting. arXiv:2404.19702, 2024
-
[61]
Auto-Encoding Variational Bayes
Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv:1312.6114, 2013
-
[62]
High-Resolution Image Synthesis with Latent Diffusion Models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models. CVPR, 2022
-
[63]
pixelNeRF: Neural Radiance Fields from One or Few Images
Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelNeRF: Neural Radiance Fields from One or Few Images. CVPR, 2021
-
[64]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. ICML, 2021
-
[65]
Flashattention: Fast and memory-efficient exact attention with io-awareness
Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. NeurIPS, 35, 2022
-
[66]
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv:2307.08691, 2023
-
[67]
Simple diffusion: End-to-end diffusion for high resolution images
Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. Simple diffusion: End-to-end diffusion for high resolution images. ICML, 2023
-
[68]
Mehdi SM Sajjadi, Henning Meyer, Etienne Pot, Urs Bergmann, Klaus Greff, Noha Radwan, Suhani Vora, Mario Lučić, Daniel Duckworth, Alexey Dosovitskiy, et al. Scene representation transformer: Geometry-free novel view synthesis through set-latent scene representations. CVPR, 2022
-
[69]
k-means++: the advantages of careful seeding
David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding. In ACM-SIAM Symposium on Discrete Algorithms, 2007
-
[70]
Shadows Don’t Lie and Lines Can’t Bend! Generative Models don’t know Projective Geometry... for now
Ayush Sarkar, Hanlin Mai, Amitabh Mahapatra, Svetlana Lazebnik, David A Forsyth, and Anand Bhattad. Shadows Don’t Lie and Lines Can’t Bend! Generative Models don’t know Projective Geometry... for now. arXiv:2311.17138, 2023
-
[71]
Zip-NeRF: Anti-Aliased Grid-Based Neural Radiance Fields
Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Zip-NeRF: Anti-Aliased Grid-Based Neural Radiance Fields. ICCV, 2023
-
[72]
The unreasonable effectiveness of deep features as a perceptual metric
Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. CVPR, 2018
-
[73]
Mip-NeRF 360: Unbounded Anti-Aliased Neural Radiance Fields
Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Mip-NeRF 360: Unbounded Anti-Aliased Neural Radiance Fields. CVPR, 2022
-
[74]
Common Objects in 3D: Large-Scale Learning and Evaluation of Real-life 3D Category Reconstruction
Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common Objects in 3D: Large-Scale Learning and Evaluation of Real-life 3D Category Reconstruction. ICCV, 2021
-
[75]
Objaverse: A universe of annotated 3d objects
Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. CVPR, 2023
-
[76]
Stereo magnification: Learning view synthesis using multiplane images
Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images. SIGGRAPH, 2018
-
[77]
MVImgNet: A Large-scale Dataset of Multi-view Images
Xianggang Yu, Mutian Xu, Yidan Zhang, Haolin Liu, Chongjie Ye, Yushuang Wu, Zizheng Yan, Chenming Zhu, Zhangyang Xiong, Tianyou Liang, et al. MVImgNet: A Large-scale Dataset of Multi-view Images. CVPR, 2023
-
[78]
Large scale multi-view stereopsis evaluation
Rasmus Jensen, Anders Dahl, George Vogiatzis, Engin Tola, and Henrik Aanæs. Large scale multi-view stereopsis evaluation. CVPR, 2014
-
[79]
Local Light Field Fusion: Practical View Synthesis with Prescriptive Sampling Guidelines
Ben Mildenhall, Pratul P Srinivasan, Rodrigo Ortiz-Cayon, Nima Khademi Kalantari, Ravi Ramamoorthi, Ren Ng, and Abhishek Kar. Local Light Field Fusion: Practical View Synthesis with Prescriptive Sampling Guidelines. SIGGRAPH, 2019
-
[80]
RealmDreamer: Text-Driven 3D Scene Generation with Inpainting and Depth Diffusion
Jaidev Shriram, Alex Trevithick, Lingjie Liu, and Ravi Ramamoorthi. RealmDreamer: Text-Driven 3D Scene Generation with Inpainting and Depth Diffusion, 2024