arXiv: 2405.10314 · v1 · pith:XAR2GEW7 · submitted 2024-05-16 · 💻 cs.CV

CAT3D: Create Anything in 3D with Multi-View Diffusion Models

Pith reviewed 2026-05-19 21:24 UTC · model grok-4.3

classification: cs.CV
keywords: multi-view diffusion · 3D reconstruction · novel view synthesis · 3D scene generation · view consistency · diffusion models

The pith

A multi-view diffusion model generates consistent novel views from any inputs to drive fast, high-quality 3D scene reconstruction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a method that uses a diffusion model to simulate the dense image capture process required for 3D reconstruction. Given any number of input photos and chosen target viewpoints, the model produces new views that stay geometrically and photometrically consistent with each other and the originals. These synthetic views then serve as input to standard reconstruction algorithms, which build complete 3D models that render in real time from any angle. The approach reduces the practical barrier of collecting hundreds of images, enabling scene creation in roughly one minute while surpassing prior single-image and few-view techniques.

Core claim

CAT3D employs a multi-view diffusion model that, conditioned on an arbitrary set of input images and a collection of target novel viewpoints, synthesizes a set of highly consistent novel views of the scene. These generated views are then passed directly to robust 3D reconstruction methods to obtain representations that support real-time rendering from arbitrary viewpoints. The resulting pipeline creates entire 3D scenes in as little as one minute and achieves better results than existing approaches for single-image and few-view 3D scene creation.

What carries the argument

A multi-view diffusion model that jointly synthesizes geometrically consistent images across multiple user-specified target viewpoints given any number of input images.
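The "user-specified target viewpoints" can be made concrete with a small sketch. The code below is an illustration only, not the paper's camera sampling scheme: it builds a ring of camera-to-world poses orbiting a scene center, which is one plausible way to specify target views. The function names `look_at` and `orbit_poses`, the OpenGL-style axis convention, and the default radius and height are all assumptions made for this example.

```python
import numpy as np

def look_at(eye, target, up=(0.0, 0.0, 1.0)):
    """Return a 4x4 camera-to-world matrix for a camera at `eye` facing `target`.

    Assumed convention: camera x = right, y = up, z = backward (OpenGL-style),
    so the camera looks along its own -z axis. World up is +z by default.
    """
    eye = np.asarray(eye, dtype=float)
    target = np.asarray(target, dtype=float)
    forward = target - eye
    forward /= np.linalg.norm(forward)
    right = np.cross(forward, np.asarray(up, dtype=float))
    right /= np.linalg.norm(right)  # fails if the view direction is parallel to `up`
    true_up = np.cross(right, forward)

    c2w = np.eye(4)
    c2w[:3, 0] = right
    c2w[:3, 1] = true_up
    c2w[:3, 2] = -forward
    c2w[:3, 3] = eye
    return c2w

def orbit_poses(num_views, radius=2.0, height=0.5, target=(0.0, 0.0, 0.0)):
    """Evenly spaced poses on a horizontal circle, offset relative to `target`."""
    center = np.asarray(target, dtype=float)
    poses = []
    for angle in np.linspace(0.0, 2.0 * np.pi, num_views, endpoint=False):
        offset = np.array([radius * np.cos(angle), radius * np.sin(angle), height])
        poses.append(look_at(center + offset, center))
    return np.stack(poses)  # shape: (num_views, 4, 4)

# Hypothetical usage: eight target viewpoints around the origin.
target_poses = orbit_poses(8)
```

A set of poses like `target_poses`, together with the input images and their own poses, is the kind of conditioning the summary describes; how the model actually encodes cameras is not specified on this page.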

If this is right

  • 3D scenes can be reconstructed from just one or a few input images instead of hundreds.
  • The generated views integrate directly with existing reconstruction pipelines without added constraints.
  • Complete scenes become available for real-time rendering shortly after the diffusion step.
  • The method outperforms prior work on single-image and few-view 3D creation benchmarks.
  • Full scene creation completes in approximately one minute.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The consistency property could enable reliable 3D capture using only casual smartphone snapshots.
  • The same conditioning mechanism might extend to generating views for dynamic or time-varying scenes.
  • Generated view sets could serve as synthetic training data to improve other 3D models.
  • Applying the pipeline to uncontrolled outdoor environments with changing light would test its robustness beyond controlled settings.

Load-bearing premise

The novel views produced by the model maintain enough geometric and photometric consistency that off-the-shelf 3D reconstruction algorithms succeed without extra regularization or filtering even on sparse inputs or complex scenes.

What would settle it

Feed the model's generated views from a single real-world photo of a scene with fine geometry and varying illumination into a standard reconstruction method such as 3D Gaussian splatting and measure whether the resulting model renders without visible distortions from viewpoints outside the generated set.
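One hedged way to turn "renders without visible distortions" into a number is to compare renders against real photos taken from held-out viewpoints. The sketch below computes PSNR for that comparison. It assumes such held-out ground-truth photos exist and that images are float arrays scaled to [0, 1]; the helper names `psnr` and `mean_heldout_psnr` are hypothetical, and PSNR is only one of several metrics that could be used here.

```python
import numpy as np

def psnr(rendered, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images of identical shape."""
    rendered = np.asarray(rendered, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    if rendered.shape != reference.shape:
        raise ValueError("images must have the same shape")
    mse = np.mean((rendered - reference) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))

def mean_heldout_psnr(renders, references):
    """Average PSNR over paired (render, held-out photo) images."""
    return float(np.mean([psnr(r, g) for r, g in zip(renders, references)]))
```

A low average over viewpoints far from the generated set, relative to viewpoints near it, would point toward the failure mode this section describes.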

Original abstract

Advances in 3D reconstruction have enabled high-quality 3D capture, but require a user to collect hundreds to thousands of images to create a 3D scene. We present CAT3D, a method for creating anything in 3D by simulating this real-world capture process with a multi-view diffusion model. Given any number of input images and a set of target novel viewpoints, our model generates highly consistent novel views of a scene. These generated views can be used as input to robust 3D reconstruction techniques to produce 3D representations that can be rendered from any viewpoint in real-time. CAT3D can create entire 3D scenes in as little as one minute, and outperforms existing methods for single image and few-view 3D scene creation. See our project page for results and interactive demos at https://cat3d.github.io .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CAT3D, a multi-view diffusion model that takes any number of input images plus target novel viewpoints and generates highly consistent novel views of a scene. These views are then fed directly into standard 3D reconstruction pipelines (e.g., COLMAP or NeRF) to produce renderable 3D representations in as little as one minute, with reported outperformance over prior single-image and few-view methods.

Significance. If the consistency and reconstruction claims hold under sparse or out-of-distribution inputs, the work would meaningfully reduce the data burden for high-quality 3D capture and enable rapid scene creation for real-time rendering applications. It demonstrates practical utility of conditioned diffusion models for view synthesis that integrates with existing reconstruction tools.

major comments (2)
  1. §4 (Experiments): The central claim that generated views are sufficiently geometrically and photometrically consistent for direct use in off-the-shelf reconstruction without extra regularization or filtering is load-bearing, yet the reported benchmarks focus on overall outperformance rather than explicit cross-view consistency metrics (e.g., multi-view depth variance or edge alignment error) on sparse inputs or scenes with complex lighting/geometry outside the training distribution.
  2. §3 (Method): The description relies on standard diffusion training plus view-conditioning without an explicit cross-view consistency loss or post-processing step; this makes the assumption that outputs remain globally consistent (rather than locally plausible but drifting) an empirical outcome that must be directly verified for the downstream reconstruction claim to be supported.
minor comments (2)
  1. Abstract: The statement that the method 'outperforms existing methods' should name the specific metrics and baselines for immediate clarity.
  2. Figure captions in the qualitative results could include more detail on camera poses and input sparsity levels to help readers assess consistency.
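The cross-view consistency metrics mentioned in the first major comment can be illustrated with a simple reprojection check. This is a sketch under stated assumptions, not a metric taken from the paper: it supposes a depth map is available for each generated view (for example from a depth estimator or from the fitted 3D model), that both views share pinhole intrinsics `K`, and that poses are camera-to-world matrices in the OpenCV convention (x right, y down, z forward). The function name `reprojection_depth_error` is hypothetical.

```python
import numpy as np

def reprojection_depth_error(depth_a, depth_b, K, c2w_a, c2w_b):
    """Mean relative depth disagreement after reprojecting view A into view B.

    depth_a, depth_b: (H, W) z-depth maps; K: 3x3 intrinsics shared by both views;
    c2w_a, c2w_b: 4x4 camera-to-world poses (OpenCV axes, an assumption).
    Returns NaN if no pixel of A lands inside B's image.
    """
    h, w = depth_a.shape
    v, u = np.mgrid[0:h, 0:w]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(float)

    # Unproject A's pixels into camera-A space, then into world space.
    cam_a = (np.linalg.inv(K) @ pix.T) * depth_a.reshape(1, -1)  # (3, N)
    world = c2w_a[:3, :3] @ cam_a + c2w_a[:3, 3:4]

    # Transform world points into camera-B space and project them.
    w2c_b = np.linalg.inv(c2w_b)
    cam_b = w2c_b[:3, :3] @ world + w2c_b[:3, 3:4]
    z_b = cam_b[2]
    in_front = z_b > 1e-6
    safe_z = np.where(in_front, z_b, 1.0)
    proj = K @ cam_b
    ub = np.rint(proj[0] / safe_z).astype(int)
    vb = np.rint(proj[1] / safe_z).astype(int)

    valid = in_front & (ub >= 0) & (ub < w) & (vb >= 0) & (vb < h)
    if not np.any(valid):
        return float("nan")

    # Compare the depth B reports at each landing pixel with the depth implied by A.
    observed = depth_b[vb[valid], ub[valid]]
    predicted = z_b[valid]
    return float(np.mean(np.abs(observed - predicted) / predicted))
```

Averaging this value over view pairs gives a rough consistency score. Note that occluded pixels also register as disagreement, so a practical version would likely need an occlusion mask or a robust statistic such as the median.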

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address each major point below and have revised the manuscript to strengthen the presentation of consistency evidence.

Point-by-point responses
  1. Referee: §4 (Experiments): The central claim that generated views are sufficiently geometrically and photometrically consistent for direct use in off-the-shelf reconstruction without extra regularization or filtering is load-bearing, yet the reported benchmarks focus on overall outperformance rather than explicit cross-view consistency metrics (e.g., multi-view depth variance or edge alignment error) on sparse inputs or scenes with complex lighting/geometry outside the training distribution.

    Authors: We agree that explicit cross-view consistency metrics provide valuable additional support for the central claim. While end-to-end reconstruction success with COLMAP and NeRF already serves as a strong indirect indicator (inconsistent views would cause reconstruction failure), we have added direct quantitative evaluations in the revised Section 4. These include multi-view depth variance and photometric consistency measures computed across generated views for sparse-input cases. We have also included results on additional scenes with complex geometry and lighting. The new metrics and figures confirm low variance and close alignment, reinforcing that the generated views are suitable for direct use in standard pipelines.
    Revision: yes

  2. Referee: §3 (Method): The description relies on standard diffusion training plus view-conditioning without an explicit cross-view consistency loss or post-processing step; this makes the assumption that outputs remain globally consistent (rather than locally plausible but drifting) an empirical outcome that must be directly verified for the downstream reconstruction claim to be supported.

    Authors: The referee correctly notes that our approach uses standard diffusion training with view conditioning and does not introduce an auxiliary consistency loss. Global consistency is indeed an emergent property learned from large-scale multi-view training data. To directly verify this for the reconstruction claim, the revised manuscript adds an ablation study in Section 3 and new results in Section 4 comparing reconstructions obtained from views generated with versus without multi-view conditioning. The ablations show clear degradation in both consistency and final 3D quality when conditioning is removed. We have also added qualitative consistency visualizations in the supplement. We view these empirical verifications as sufficient support while acknowledging that an explicit consistency term remains an interesting direction for future work.
    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is empirical and externally validated

Full rationale

The paper presents an empirical multi-view diffusion model trained to generate novel views from sparse inputs, with consistency and downstream 3D reconstruction success demonstrated via held-out test scenes and off-the-shelf pipelines (COLMAP/NeRF) rather than any closed-form reduction of outputs to training losses or self-defined quantities. No equations or claims reduce a 'prediction' to a fitted parameter by construction, and no load-bearing uniqueness theorem or ansatz is imported via self-citation. The central claim rests on external evaluation benchmarks, making the method self-contained against independent data and reconstruction algorithms.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the empirical performance of a large diffusion model whose weights are learned from data rather than derived from first principles. No new physical axioms or invented entities are introduced; the main unstated premise is that the training distribution covers the target scenes sufficiently for generalization.

free parameters (1)
  • Diffusion model weights
    Learned parameters of the multi-view conditioned diffusion network; these are fitted to large image datasets and constitute the primary learned component.
axioms (1)
  • standard math: Standard diffusion model training objective and sampling procedure apply without modification to the multi-view conditioning setting.
    Invoked implicitly when describing the model architecture and generation process.
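
For reference, the "standard diffusion model training objective" named in this axiom is commonly written as a denoising loss. The form below is the generic noise-prediction version, given as an assumption about what "standard" means here; the paper's exact parameterization, loss weighting, and latent-space details may differ. The symbols are introduced for this illustration: $x_0$ is the set of target views (or their latents), $c$ is the conditioning (input images plus camera poses for all views), and $\alpha_t, \sigma_t$ define the noise schedule.

```latex
x_t = \alpha_t x_0 + \sigma_t \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)

\mathcal{L}(\theta) = \mathbb{E}_{x_0,\, c,\, t,\, \epsilon}
  \left[ \left\lVert \epsilon - \epsilon_\theta(x_t, t, c) \right\rVert_2^2 \right]
```

Nothing in this loss compares one generated view with another, which is why the referee treats cross-view consistency as an empirical outcome rather than a built-in guarantee.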

pith-pipeline@v0.9.0 · 5703 in / 1346 out tokens · 36801 ms · 2026-05-19T21:24:12.604391+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation.DimensionForcing · alexander_duality_circle_linking · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    Given any number of input images and a set of target novel viewpoints, our model generates highly consistent novel views of a scene. These generated views can be used as input to robust 3D reconstruction techniques

  • Foundation.HierarchyEmergence · hierarchy_emergence_forces_phi · unclear

    Relation between the paper passage and the cited Recognition theorem.

    CAT3D can create entire 3D scenes in as little as one minute, and outperforms existing methods for single image and few-view 3D scene creation

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CRePE: Curved Ray Expectation Positional Encoding for Unified-Camera-Controlled Video Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    CRePE supplies depth-aware positional distributions along curved rays for stable unified-camera control in frozen video DiT models.

  2. GSCompleter: A Distillation-Free Plugin for Metric-Aware 3D Gaussian Splatting Completion in Seconds

    cs.CV 2026-04 unverdicted novelty 7.0

    GSCompleter completes sparse 3D Gaussian Splatting scenes via a distillation-free generate-then-register pipeline using Stereo-Anchor lifting and Ray-Constrained Registration, delivering SOTA results on three benchmarks.

  3. UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models

    cs.CV 2026-04 unverdicted novelty 7.0

    UniGeo unifies geometric guidance across three levels in video models to reduce geometric drift and improve consistency in camera-controllable image editing.

  4. Geometrically Consistent Multi-View Scene Generation from Freehand Sketches

    cs.CV 2026-04 unverdicted novelty 7.0

    A framework generates consistent multi-view scenes from one freehand sketch via a ~9k-sample dataset, Parallel Camera-Aware Attention Adapters, and Sparse Correspondence Supervision Loss, outperforming baselines in re...

  5. Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors

    cs.CV 2026-04 unverdicted novelty 7.0

    A video generation approach conditions a base model with multi-scale 3D latent features and a cross-attention adapter to produce geometrically realistic and consistent orbital videos from one image.

  6. Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale

    cs.CV 2026-04 unverdicted novelty 7.0

    A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.

  7. Novel View Synthesis as Video Completion

    cs.CV 2026-04 unverdicted novelty 7.0

    Video diffusion models can be adapted into permutation-invariant generators for sparse novel view synthesis by treating the problem as video completion and removing temporal order cues.

  8. HAD: Hallucination-Aware Diffusion Priors for 3D Reconstruction

    cs.CV 2026-05 unverdicted novelty 6.0

    HAD uses multi-view reasoning from a pre-trained feedforward NVS network to estimate and mask hallucination scores in diffusion priors, reducing artifacts and achieving SOTA novel view synthesis in sparse-view 3D reco...

  9. FurnSet: Exploiting Repeats for 3D Scene Reconstruction

    cs.CV 2026-04 unverdicted novelty 6.0

    FurnSet improves single-view 3D scene reconstruction by using per-object CLS tokens and set-aware self-attention to group and jointly reconstruct repeated object instances, with added scene-object conditioning and lay...

  10. UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models

    cs.CV 2026-04 unverdicted novelty 6.0

    UniGeo adds unified geometric guidance at three levels in video models to reduce geometric drift and improve structural fidelity in camera-controllable image editing.

  11. Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective

    cs.CV 2026-04 unverdicted novelty 6.0

    The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temp...

  12. NavCrafter: Exploring 3D Scenes from a Single Image

    cs.CV 2026-04 unverdicted novelty 6.0

    NavCrafter generates controllable novel-view videos from one image via video diffusion, geometry-aware expansion, and enhanced 3D Gaussian Splatting to achieve state-of-the-art synthesis under large viewpoint changes.

  13. ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis

    cs.CV 2024-09 unverdicted novelty 6.0

    ViewCrafter tames video diffusion models with point-based 3D guidance and iterative trajectory planning to produce high-fidelity novel views from single or sparse images.

  14. DreamEdit3D: Personalization of Multi-View Diffusion Models for 3D Editing

    cs.CV 2026-05 unverdicted novelty 5.0

    DreamEdit3D learns separate token embeddings for segmented object components via two-phase multi-view optimization to enable text-guided 3D editing with consistent image generation and mesh reconstruction.

  15. DecoRec: Decomposed 3D Scene Reconstruction from Single-View Images via Object-Level Diffusion

    cs.CV 2026-05 unverdicted novelty 5.0

    DecoRec decomposes single-view 3D scene reconstruction into per-object diffusion reconstructions followed by a differentiable rendering and diffusion-guided merging pipeline.

  16. GeoRect4D: Geometry-Compatible Generative Rectification for Dynamic Sparse-View 3D Reconstruction

    cs.CV 2026-04 unverdicted novelty 5.0

    GeoRect4D couples 3D Gaussian splatting with a single-step diffusion rectifier via degradation-aware feedback and progressive optimization to improve fidelity and consistency in sparse-view dynamic 3D reconstruction.

  17. Asset Harvester: Extracting 3D Assets from Autonomous Driving Logs for Simulation

    cs.CV 2026-04 unverdicted novelty 5.0

    Asset Harvester converts sparse in-the-wild object observations from AV driving logs into complete simulation-ready 3D assets via data curation, geometry-aware preprocessing, and a SparseViewDiT model that couples spa...

  18. InSpatio-WorldFM: An Open-Source Real-Time Generative Frame Model

    cs.CV 2026-03 unverdicted novelty 5.0

    InSpatio-WorldFM is a frame-independent generative model that uses explicit 3D anchors and spatial memory to deliver real-time multi-view consistent spatial intelligence via a three-stage training pipeline from pretra...

  19. ViPE: Video Pose Engine for 3D Geometric Perception

    cs.CV 2025-08 unverdicted novelty 5.0

    ViPE estimates camera intrinsics, motion, and dense near-metric depth from uncalibrated videos, outperforming baselines on TUM and KITTI while releasing annotations for 96M frames across real and generated videos.

  20. Learning World Models for Interactive Video Generation

    cs.CV 2025-05 unverdicted novelty 5.0

    The work introduces video retrieval augmented generation (VRAG) with explicit global state conditioning to reduce compounding errors and improve spatiotemporal consistency in interactive video world models.

Reference graph

Works this paper leans on

86 extracted references · 86 canonical work pages · cited by 19 Pith papers · 10 internal anchors

  1. [1]

    Srinivasan, Matthew Tancik, Jonathan T

    Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. ECCV, 2020

  2. [2]

    Instant neural graphics primitives with a multiresolution hash encoding

    Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. SIGGRAPH, 2022. 10

  3. [3]

    3D Gaussian Splatting for Real-Time Radiance Field Rendering

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D Gaussian Splatting for Real-Time Radiance Field Rendering. SIGGRAPH, 2023

  4. [4]

    FreeNeRF: Improving Few-shot Neural Rendering with Free Frequency Regularization

    Jiawei Yang, Marco Pavone, and Yue Wang. FreeNeRF: Improving Few-shot Neural Rendering with Free Frequency Regularization. CVPR, 2023

  5. [5]

    SimpleNeRF: Regulariz- ing Sparse Input Neural Radiance Fields with Simpler Solutions

    Nagabhushan Somraj, Adithyan Karanayil, and Rajiv Soundararajan. SimpleNeRF: Regulariz- ing Sparse Input Neural Radiance Fields with Simpler Solutions. SIGGRAPH Asia, 2023

  6. [6]

    LRM: Large Reconstruction Model for Single Image to 3D

    Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. LRM: Large Reconstruction Model for Single Image to 3D. arXiv:2311.04400, 2023

  7. [7]

    Srinivasan, Dor Verbin, Jonathan T

    Rundi Wu, Ben Mildenhall, Philipp Henzler, Keunhong Park, Ruiqi Gao, Daniel Watson, Pratul P. Srinivasan, Dor Verbin, Jonathan T. Barron, Ben Poole, and Aleksander Holynski. Reconfusion: 3d reconstruction with diffusion priors, 2023

  8. [8]

    Barron, and Ben Mildenhall

    Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. DreamFusion: Text-to-3D using 2D Diffusion. ICLR, 2022

  9. [9]

    Imagedream: Image-prompt multi-view diffusion for 3d generation

    Peng Wang and Yichun Shi. Imagedream: Image-prompt multi-view diffusion for 3d generation. arXiv:2312.02201, 2023

  10. [10]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Do- minik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv:2311.15127, 2023

  11. [11]

    Align your latents: High-resolution video synthesis with latent diffusion models

    Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. CVPR, 2023

  12. [12]

    Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning

    Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, and Ishan Misra. Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning. arXiv:2311.10709, 2023

  13. [13]

    Lumiere: A space-time diffusion model for video generation

    Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Yuanzhen Li, Tomer Michaeli, et al. Lumiere: A space-time diffusion model for video generation. arXiv, 2024

  14. [14]

    Photorealistic video generation with diffusion models, 2023

    Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Li Fei-Fei, Irfan Essa, Lu Jiang, and José Lezama. Photorealistic video generation with diffusion models, 2023

  15. [15]

    Video generation models as world simulators

    Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. 2024

  16. [16]

    State of the art on diffusion models for visual computing

    Ryan Po, Wang Yifan, Vladislav Golyanik, Kfir Aberman, Jonathan T Barron, Amit H Bermano, Eric Ryan Chan, Tali Dekel, Aleksander Holynski, Angjoo Kanazawa, et al. State of the art on diffusion models for visual computing. arXiv:2310.07204, 2023

  17. [17]

    IM-3D: Iterative Multiview Diffusion and Reconstruction for High-Quality 3D Generation, 2024

    Luke Melas-Kyriazi, Iro Laina, Christian Rupprecht, Natalia Neverova, Andrea Vedaldi, Oran Gafni, and Filippos Kokkinos. IM-3D: Iterative Multiview Diffusion and Reconstruction for High-Quality 3D Generation, 2024

  18. [18]

    Dream- Time: An Improved Optimization Strategy for Text-to-3D Content Creation

    Yukun Huang, Jianan Wang, Yukai Shi, Xianbiao Qi, Zheng-Jun Zha, and Lei Zhang. Dream- Time: An Improved Optimization Strategy for Text-to-3D Content Creation. arXiv, 2023

  19. [19]

    SteinDreamer: Variance Reduction for Text-to-3D Score Distillation via Stein Identity

    Peihao Wang, Zhiwen Fan, Dejia Xu, Dilin Wang, Sreyas Mohan, Forrest Iandola, Rakesh Ranjan, Yilei Li, Qiang Liu, Zhangyang Wang, et al. SteinDreamer: Variance Reduction for Text-to-3D Score Distillation via Stein Identity. arXiv, 2023

  20. [20]

    Collaborative score distillation for consistent visual editing

    Subin Kim, Kyungmin Lee, June Suk Choi, Jongheon Jeong, Kihyuk Sohn, and Jinwoo Shin. Collaborative score distillation for consistent visual editing. NeurIPS, 36, 2024. 11

  21. [21]

    ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation

    Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation. NeurIPS, 2023

  22. [22]

    Instruct-nerf2nerf: Editing 3d scenes with instructions

    Ayaan Haque, Matthew Tancik, Alexei A Efros, Aleksander Holynski, and Angjoo Kanazawa. Instruct-nerf2nerf: Editing 3d scenes with instructions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19740–19750, 2023

  23. [23]

    Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation

    Rui Chen, Yongwei Chen, Ningxin Jiao, and Kui Jia. Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation. ICCV, 2023

  24. [24]

    Magic3D: High-Resolution Text-to-3D Content Creation

    Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3D: High-Resolution Text-to-3D Content Creation. CVPR, 2023

  25. [25]

    DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation

    Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. Dreamgaussian: Generative gaussian splatting for efficient 3d content creation. arXiv:2309.16653, 2023

  26. [26]

    Gaussiandreamer: Fast generation from text to 3d gaussian splatting with point cloud priors

    Taoran Yi, Jiemin Fang, Guanjun Wu, Lingxi Xie, Xiaopeng Zhang, Wenyu Liu, Qi Tian, and Xinggang Wang. Gaussiandreamer: Fast generation from text to 3d gaussian splatting with point cloud priors. arXiv:2310.08529, 2023

  27. [27]

    Disentan- gled 3d scene generation with layout learning

    Dave Epstein, Ben Poole, Ben Mildenhall, Alexei A Efros, and Aleksander Holynski. Disentan- gled 3d scene generation with layout learning. arXiv preprint arXiv:2402.16936, 2024

  28. [28]

    ATT3D: Amortized Text-to-3D Object Synthesis

    Jonathan Lorraine, Kevin Xie, Xiaohui Zeng, Chen-Hsuan Lin, Towaki Takikawa, Nicholas Sharp, Tsung-Yi Lin, Ming-Yu Liu, Sanja Fidler, and James Lucas. ATT3D: Amortized Text-to-3D Object Synthesis. ICCV, 2023

  29. [29]

    Realfusion: 360deg reconstruction of any object from a single image

    Luke Melas-Kyriazi, Iro Laina, Christian Rupprecht, and Andrea Vedaldi. Realfusion: 360deg reconstruction of any object from a single image. CVPR, 2023

  30. [30]

    Magic123: One Image to High-Quality 3D Object Generation Using Both 2D and 3D Diffusion Priors.arXiv:2306.17843, 2023

    Guocheng Qian, Jinjie Mai, Abdullah Hamdi, Jian Ren, Aliaksandr Siarohin, Bing Li, Hsin- Ying Lee, Ivan Skorokhodov, Peter Wonka, Sergey Tulyakov, et al. Magic123: One Image to High-Quality 3D Object Generation Using Both 2D and 3D Diffusion Priors.arXiv:2306.17843, 2023

  31. [31]

    Make-It-3D: High-Fidelity 3D Creation from A Single Image with Diffusion Prior

    Junshu Tang, Tengfei Wang, Bo Zhang, Ting Zhang, Ran Yi, Lizhuang Ma, and Dong Chen. Make-It-3D: High-Fidelity 3D Creation from A Single Image with Diffusion Prior. ICCV, 2023

  32. [32]

    Text2Room: Extracting Textured 3D Meshes from 2D Text-to-Image Models

    Lukas Höllein, Ang Cao, Andrew Owens, Justin Johnson, and Matthias Nießner. Text2Room: Extracting Textured 3D Meshes from 2D Text-to-Image Models. ICCV, 2023

  33. [33]

    Monocular depth estimation using diffusion models

    Saurabh Saxena, Abhishek Kar, Mohammad Norouzi, and David J Fleet. Monocular depth estimation using diffusion models. arXiv:2302.14816, 2023

  34. [34]

    WonderJourney: Going from Anywhere to Everywhere

    Hong-Xing Yu, Haoyi Duan, Junhwa Hur, Kyle Sargent, Michael Rubinstein, William T Freeman, Forrester Cole, Deqing Sun, Noah Snavely, Jiajun Wu, et al. WonderJourney: Going from Anywhere to Everywhere. arXiv:2312.03884, 2023

  35. [35]

    Nerfiller: Completing scenes via generative 3d inpainting

    Ethan Weber, Aleksander Hoły´nski, Varun Jampani, Saurabh Saxena, Noah Snavely, Abhishek Kar, and Angjoo Kanazawa. Nerfiller: Completing scenes via generative 3d inpainting. arXiv preprint arXiv:2312.04560, 2023

  36. [36]

    Zero-1-to-3: Zero-Shot One Image to 3D Object

    Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl V ondrick. Zero-1-to-3: Zero-Shot One Image to 3D Object. arXiv, 2023

  37. [37]

    Novel view synthesis with diffusion models

    Daniel Watson, William Chan, Ricardo Martin-Brualla, Jonathan Ho, Andrea Tagliasacchi, and Mohammad Norouzi. Novel view synthesis with diffusion models. arXiv:2210.04628, 2022

  38. [38]

    DreamBooth3D: Subject-Driven Text-to-3D Generation

    Amit Raj, Srinivas Kaza, Ben Poole, Michael Niemeyer, Ben Mildenhall, Nataniel Ruiz, Shiran Zada, Kfir Aberman, Michael Rubenstein, Jonathan Barron, Yuanzhen Li, and Varun Jampani. DreamBooth3D: Subject-Driven Text-to-3D Generation. ICCV, 2023. 12

  39. [39]

    NerfDiff: Single-image View Synthesis with NeRF-guided Distillation from 3D-aware Diffusion

    Jiatao Gu, Alex Trevithick, Kai-En Lin, Joshua M Susskind, Christian Theobalt, Lingjie Liu, and Ravi Ramamoorthi. NerfDiff: Single-image View Synthesis with NeRF-guided Distillation from 3D-aware Diffusion. ICML, 2023

  40. [40]

    GeNVS: Generative novel view synthesis with 3D-aware diffusion models

    Eric R Chan, Koki Nagano, Matthew A Chan, Alexander W Bergman, Jeong Joon Park, Axel Levy, Miika Aittala, Shalini De Mello, Tero Karras, and Gordon Wetzstein. GeNVS: Generative novel view synthesis with 3D-aware diffusion models. arXiv, 2023

  41. [41]

    ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Image

    Kyle Sargent, Zizhang Li, Tanmay Shah, Charles Herrmann, Hong-Xing Yu, Yunzhi Zhang, Eric Ryan Chan, Dmitry Lagun, Li Fei-Fei, Deqing Sun, et al. ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Image. CVPR, 2024

  42. [42]

    One-2-3-45: Any Single Image to 3D Mesh in 45 Seconds without Per-Shape Optimization

    Minghua Liu, Chao Xu, Haian Jin, Linghao Chen, Mukund Varma T, Zexiang Xu, and Hao Su. One-2-3-45: Any Single Image to 3D Mesh in 45 Seconds without Per-Shape Optimization. arXiv, 2023

  43. [43]

    MVDream: Multi-view Diffusion for 3D Generation

    Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, and Xiao Yang. MVDream: Multi-view Diffusion for 3D Generation. arXiv, 2023

  44. [44]

    Zero123++: a single image to consistent multi-view diffusion base model

    Ruoxi Shi, Hansheng Chen, Zhuoyang Zhang, Minghua Liu, Chao Xu, Xinyue Wei, Linghao Chen, Chong Zeng, and Hao Su. Zero123++: a single image to consistent multi-view diffusion base model, 2023

  45. [45]

    ConsistNet: Enforcing 3D Consistency for Multi-view Images Diffusion

    Jiayu Yang, Ziang Cheng, Yunfei Duan, Pan Ji, and Hongdong Li. ConsistNet: Enforcing 3D Consistency for Multi-view Images Diffusion. arXiv:2310.10343, 2023

  46. [46]

    SyncDreamer: Generating Multiview-consistent Images from a Single-view Image

    Yuan Liu, Cheng Lin, Zijiao Zeng, Xiaoxiao Long, Lingjie Liu, Taku Komura, and Wenping Wang. SyncDreamer: Generating Multiview-consistent Images from a Single-view Image. arXiv, 2023

  47. [47]

    Viewdiff: 3d-consistent image generation with text-to-image models

    Lukas Höllein, Aljaž Božič, Norman Müller, David Novotny, Hung-Yu Tseng, Christian Richardt, Michael Zollhöfer, and Matthias Nießner. Viewdiff: 3d-consistent image generation with text-to-image models, 2024

  48. [48]

    Video Diffusion Models

    Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. arXiv:2204.03458, 2022

  49. [49]

    Imagen Video: High Definition Video Generation with Diffusion Models

    Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv:2210.02303, 2022

  50. [50]

    Video interpolation with diffusion models

    Siddhant Jain, Daniel Watson, Eric Tabellion, Aleksander Hołyński, Ben Poole, and Janne Kontkanen. Video interpolation with diffusion models. arXiv preprint arXiv:2404.01203, 2024

  51. [51]

    AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

    Yuwei Guo, Ceyuan Yang, Anyi Rao, Yaohui Wang, Yu Qiao, Dahua Lin, and Bo Dai. AnimateDiff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023

  52. [52]

    Motionctrl: A unified and flexible motion controller for video generation

    Zhouxia Wang, Ziyang Yuan, Xintao Wang, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. Motionctrl: A unified and flexible motion controller for video generation. arXiv preprint arXiv:2312.03641, 2023

  53. [53]

    ViVid-1-to-3: Novel View Synthesis with Video Diffusion Models

    Jeong-gi Kwak, Erqun Dong, Yuhe Jin, Hanseok Ko, Shweta Mahajan, and Kwang Moo Yi. ViVid-1-to-3: Novel View Synthesis with Video Diffusion Models. arXiv:2312.01305, 2023

  54. [54]

    SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image using Latent Video Diffusion

    Vikram Voleti, Chun-Han Yao, Mark Boss, Adam Letts, David Pankratz, Dmitry Tochilkin, Christian Laforte, Robin Rombach, and Varun Jampani. SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image using Latent Video Diffusion, 2024

  55. [55]

    3DGen: Triplane Latent Diffusion for Textured Mesh Generation

    Anchit Gupta, Wenhan Xiong, Yixin Nie, Ian Jones, and Barlas Oğuz. 3DGen: Triplane Latent Diffusion for Textured Mesh Generation. arXiv:2303.05371, 2023

  56. [56]

    Viewset Diffusion: (0-)Image-Conditioned 3D Generative Models from 2D Data

    Stanislaw Szymanowicz, Christian Rupprecht, and Andrea Vedaldi. Viewset Diffusion: (0-)Image-Conditioned 3D Generative Models from 2D Data. ICCV, 2023

  57. [57]

    DMV3D: Denoising Multi-View Diffusion using 3D Large Reconstruction Model

    Yinghao Xu, Hao Tan, Fujun Luan, Sai Bi, Peng Wang, Jiahao Li, Zifan Shi, Kalyan Sunkavalli, Gordon Wetzstein, Zexiang Xu, and Kai Zhang. DMV3D: Denoising Multi-View Diffusion using 3D Large Reconstruction Model, 2023

  58. [58]

    Instant3D: Fast Text-to-3D with Sparse-View Generation and Large Reconstruction Model

    Jiahao Li, Hao Tan, Kai Zhang, Zexiang Xu, Fujun Luan, Yinghao Xu, Yicong Hong, Kalyan Sunkavalli, Greg Shakhnarovich, and Sai Bi. Instant3D: Fast Text-to-3D with Sparse-View Generation and Large Reconstruction Model. arXiv:2311.06214, 2023

  59. [59]

    Splatter image: Ultra-fast single-view 3d reconstruction

    Stanislaw Szymanowicz, Christian Rupprecht, and Andrea Vedaldi. Splatter image: Ultra-fast single-view 3d reconstruction. arXiv:2312.13150, 2023

  60. [60]

    GS-LRM: Large Reconstruction Model for 3D Gaussian Splatting

    Kai Zhang, Sai Bi, Hao Tan, Yuanbo Xiangli, Nanxuan Zhao, Kalyan Sunkavalli, and Zexiang Xu. GS-LRM: Large Reconstruction Model for 3D Gaussian Splatting. arXiv:2404.19702, 2024

  61. [61]

    Auto-Encoding Variational Bayes

    Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv:1312.6114, 2013

  62. [62]

    High-Resolution Image Synthesis with Latent Diffusion Models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models. CVPR, 2022

  63. [63]

    pixelNeRF: Neural Radiance Fields from One or Few Images

    Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelNeRF: Neural Radiance Fields from One or Few Images. CVPR, 2021

  64. [64]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. ICML, 2021

  65. [65]

    Flashattention: Fast and memory-efficient exact attention with io-awareness

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. NeurIPS, 35, 2022

  66. [66]

    FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

    Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv:2307.08691, 2023

  67. [67]

    Simple diffusion: End-to-end diffusion for high resolution images

    Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. Simple diffusion: End-to-end diffusion for high resolution images. ICML, 2023

  68. [68]

    Scene representation transformer: Geometry-free novel view synthesis through set-latent scene representations

    Mehdi SM Sajjadi, Henning Meyer, Etienne Pot, Urs Bergmann, Klaus Greff, Noha Radwan, Suhani Vora, Mario Lučić, Daniel Duckworth, Alexey Dosovitskiy, et al. Scene representation transformer: Geometry-free novel view synthesis through set-latent scene representations. CVPR, 2022

  69. [69]

    k-means++: the advantages of careful seeding

    David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding. In ACM-SIAM Symposium on Discrete Algorithms, 2007

  70. [70]

    Shadows Don’t Lie and Lines Can’t Bend! Generative Models don’t know Projective Geometry... for now

    Ayush Sarkar, Hanlin Mai, Amitabh Mahapatra, Svetlana Lazebnik, David A Forsyth, and Anand Bhattad. Shadows Don’t Lie and Lines Can’t Bend! Generative Models don’t know Projective Geometry... for now. arXiv:2311.17138, 2023

  71. [71]

    Zip-NeRF: Anti-Aliased Grid-Based Neural Radiance Fields

    Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Zip-NeRF: Anti-Aliased Grid-Based Neural Radiance Fields. ICCV, 2023

  72. [72]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. CVPR, 2018

  73. [73]

    Mip-NeRF 360: Unbounded Anti-Aliased Neural Radiance Fields

    Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Mip-NeRF 360: Unbounded Anti-Aliased Neural Radiance Fields. CVPR, 2022

  74. [74]

    Common Objects in 3D: Large-Scale Learning and Evaluation of Real-life 3D Category Reconstruction

    Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common Objects in 3D: Large-Scale Learning and Evaluation of Real-life 3D Category Reconstruction. ICCV, 2021

  75. [75]

    Objaverse: A universe of annotated 3d objects

    Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. CVPR, 2023

  76. [76]

    Stereo magnification: Learning view synthesis using multiplane images

    Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images. SIGGRAPH, 2018

  77. [77]

    MVImgNet: A Large-scale Dataset of Multi-view Images

    Xianggang Yu, Mutian Xu, Yidan Zhang, Haolin Liu, Chongjie Ye, Yushuang Wu, Zizheng Yan, Chenming Zhu, Zhangyang Xiong, Tianyou Liang, et al. MVImgNet: A Large-scale Dataset of Multi-view Images. CVPR, 2023

  78. [78]

    Large scale multi-view stereopsis evaluation

    Rasmus Jensen, Anders Dahl, George Vogiatzis, Engin Tola, and Henrik Aanæs. Large scale multi-view stereopsis evaluation. CVPR, 2014

  79. [79]

    Local Light Field Fusion: Practical View Synthesis with Prescriptive Sampling Guidelines

    Ben Mildenhall, Pratul P Srinivasan, Rodrigo Ortiz-Cayon, Nima Khademi Kalantari, Ravi Ramamoorthi, Ren Ng, and Abhishek Kar. Local Light Field Fusion: Practical View Synthesis with Prescriptive Sampling Guidelines. SIGGRAPH, 2019

  80. [80]

    RealmDreamer: Text-Driven 3D Scene Generation with Inpainting and Depth Diffusion

    Jaidev Shriram, Alex Trevithick, Lingjie Liu, and Ravi Ramamoorthi. RealmDreamer: Text-Driven 3D Scene Generation with Inpainting and Depth Diffusion, 2024

Showing first 80 references.