CAT3D: Create Anything in 3D with Multi-View Diffusion Models

arXiv: 2405.10314 · v1 · pith:XAR2GEW7 · submitted 2024-05-16 · 💻 cs.CV

Ruiqi Gao, Aleksander Holynski, Philipp Henzler, Arthur Brussee, Ricardo Martin-Brualla, Pratul Srinivasan, Jonathan T. Barron, Ben Poole

Pith reviewed 2026-05-19 21:24 UTC · model grok-4.3

classification 💻 cs.CV

keywords multi-view diffusion · 3D reconstruction · novel view synthesis · 3D scene generation · view consistency · diffusion models

pith:XAR2GEW7 Add to your LaTeX paper

\usepackage{pith}
\pithnumber{XAR2GEW7}

Prints a linked pith:XAR2GEW7 badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files.

The pith

A multi-view diffusion model generates consistent novel views from any number of input images, and those views drive fast, high-quality 3D scene reconstruction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a method that uses a diffusion model to simulate the dense image capture process required for 3D reconstruction. Given any number of input photos and a set of chosen target viewpoints, the model produces new views that stay geometrically and photometrically consistent with each other and with the originals. These synthetic views then serve as input to standard reconstruction algorithms, which build complete 3D models that render in real time from any angle. The approach lowers the practical barrier of collecting hundreds of images: a scene can be created in as little as one minute, and the authors report better results than prior single-image and few-view techniques.

Core claim

CAT3D employs a multi-view diffusion model that, conditioned on an arbitrary set of input images and a collection of target novel viewpoints, synthesizes a set of highly consistent novel views of the scene. These generated views are then passed directly to robust 3D reconstruction methods to obtain representations that support real-time rendering from arbitrary viewpoints. The resulting pipeline creates entire 3D scenes in as little as one minute and achieves better results than existing approaches for single-image and few-view 3D scene creation.
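The "collection of target novel viewpoints" in the claim above has to come from somewhere. A minimal sketch of one way to produce it follows, assuming a simple circular orbit around the scene with a z-up world; the function name `orbit_poses`, the orbit shape, and the pose dictionary layout are illustrative assumptions, not the camera trajectories the paper uses.

```python
import math


def normalize(v):
    """Scale a 3-vector to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def orbit_poses(num_views, radius, height, target=(0.0, 0.0, 0.0)):
    """Place num_views cameras evenly on a circle, each looking at target.

    Hypothetical helper: assumes a z-up world and radius > 0, so the viewing
    direction is never parallel to the world up axis.
    """
    if num_views < 1 or radius <= 0:
        raise ValueError("need num_views >= 1 and radius > 0")
    world_up = (0.0, 0.0, 1.0)
    poses = []
    for i in range(num_views):
        angle = 2.0 * math.pi * i / num_views
        eye = (target[0] + radius * math.cos(angle),
               target[1] + radius * math.sin(angle),
               target[2] + height)
        forward = normalize(tuple(t - e for t, e in zip(target, eye)))
        right = normalize(cross(forward, world_up))
        up = cross(right, forward)  # unit length: right and forward are orthonormal
        poses.append({"eye": eye, "forward": forward, "right": right, "up": up})
    return poses


# Eight target viewpoints on a ring of radius 2, raised 0.5 above the target.
target_views = orbit_poses(num_views=8, radius=2.0, height=0.5)
```

Each pose is an orthonormal camera frame plus a position, which is the kind of information a view-conditioned model would need per target view; how the paper encodes poses for its model is not described on this page.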

What carries the argument

A multi-view diffusion model that jointly synthesizes geometrically consistent images across multiple user-specified target viewpoints given any number of input images.

If this is right

3D scenes can be reconstructed from just one or a few input images instead of hundreds.
The generated views integrate directly with existing reconstruction pipelines without added constraints.
Complete scenes become available for real-time rendering shortly after the diffusion step.
The method outperforms prior work on single-image and few-view 3D creation benchmarks.
Full scene creation can complete in as little as one minute.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The consistency property could enable reliable 3D capture using only casual smartphone snapshots.
The same conditioning mechanism might extend to generating views for dynamic or time-varying scenes.
Generated view sets could serve as synthetic training data to improve other 3D models.
Applying the pipeline to uncontrolled outdoor environments with changing light would test its robustness beyond controlled settings.

Load-bearing premise

The novel views produced by the model maintain enough geometric and photometric consistency that off-the-shelf 3D reconstruction algorithms succeed without extra regularization or filtering even on sparse inputs or complex scenes.

What would settle it

Feed the model's generated views from a single real-world photo of a scene with fine geometry and varying illumination into a standard reconstruction method such as 3D Gaussian splatting and measure whether the resulting model renders without visible distortions from viewpoints outside the generated set.
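One way to turn "renders without visible distortions" into a number is to compare renders against reference photos taken at viewpoints outside the generated set. The sketch below uses PSNR for that comparison. It assumes such reference photos exist, that images are flattened to lists of floats in [0, 1], and that a pass/fail threshold has been chosen by the evaluator; the names `psnr`, `held_out_report`, and `threshold_db` are hypothetical.

```python
import math


def psnr(rendered, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two flat pixel sequences."""
    if len(rendered) != len(reference) or len(rendered) == 0:
        raise ValueError("images must be non-empty and the same size")
    mse = sum((a - b) ** 2 for a, b in zip(rendered, reference)) / len(rendered)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def held_out_report(pairs, threshold_db):
    """Summarise PSNR over (rendered, reference) pairs at held-out viewpoints.

    threshold_db is an evaluator-chosen cutoff, not a value from the paper.
    """
    scores = [psnr(rendered, reference) for rendered, reference in pairs]
    return {
        "mean_psnr": sum(scores) / len(scores),
        "worst_psnr": min(scores),
        "num_below_threshold": sum(1 for s in scores if s < threshold_db),
    }
```

PSNR is only a proxy for visible distortion; a perceptual metric could be substituted in `held_out_report` without changing its structure.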

Original abstract

Advances in 3D reconstruction have enabled high-quality 3D capture, but require a user to collect hundreds to thousands of images to create a 3D scene. We present CAT3D, a method for creating anything in 3D by simulating this real-world capture process with a multi-view diffusion model. Given any number of input images and a set of target novel viewpoints, our model generates highly consistent novel views of a scene. These generated views can be used as input to robust 3D reconstruction techniques to produce 3D representations that can be rendered from any viewpoint in real-time. CAT3D can create entire 3D scenes in as little as one minute, and outperforms existing methods for single image and few-view 3D scene creation. See our project page for results and interactive demos at https://cat3d.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CAT3D, a multi-view diffusion model that takes any number of input images plus target novel viewpoints and generates highly consistent novel views of a scene. These views are then fed directly into standard 3D reconstruction pipelines (e.g., COLMAP or NeRF) to produce renderable 3D representations in as little as one minute, with reported outperformance over prior single-image and few-view methods.

Significance. If the consistency and reconstruction claims hold under sparse or out-of-distribution inputs, the work would meaningfully reduce the data burden for high-quality 3D capture and enable rapid scene creation for real-time rendering applications. It demonstrates practical utility of conditioned diffusion models for view synthesis that integrates with existing reconstruction tools.

major comments (2)

§4 (Experiments): The central claim that generated views are sufficiently geometrically and photometrically consistent for direct use in off-the-shelf reconstruction without extra regularization or filtering is load-bearing, yet the reported benchmarks focus on overall outperformance rather than explicit cross-view consistency metrics (e.g., multi-view depth variance or edge alignment error) on sparse inputs or scenes with complex lighting/geometry outside the training distribution.
§3 (Method): The description relies on standard diffusion training plus view-conditioning without an explicit cross-view consistency loss or post-processing step; this makes the assumption that outputs remain globally consistent (rather than locally plausible but drifting) an empirical outcome that must be directly verified for the downstream reconstruction claim to be supported.
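The "multi-view depth variance" named in the first major comment is not defined on this page. A minimal sketch of one plausible reading follows: for each surface point seen in several views, unproject it along each view's ray using that view's depth and measure how far the resulting 3D points scatter. The function names, the track format, and the assumption that ray directions are unit length are all illustrative choices.

```python
def unproject(origin, direction, depth):
    """3D point at distance depth along a ray (direction assumed unit length)."""
    return tuple(o + depth * d for o, d in zip(origin, direction))


def point_spread(points):
    """Mean squared distance of 3D points from their centroid."""
    n = len(points)
    centroid = tuple(sum(p[k] for p in points) / n for k in range(3))
    return sum(
        sum((p[k] - centroid[k]) ** 2 for k in range(3)) for p in points
    ) / n


def multi_view_depth_variance(tracks):
    """Average spread over correspondence tracks.

    Each track is a list of (origin, direction, depth) observations, one per
    view, that are believed to see the same surface point. Perfectly
    consistent views give 0.0; larger values mean the views disagree on
    where the surface is.
    """
    spreads = [
        point_spread([unproject(o, d, z) for o, d, z in track])
        for track in tracks
    ]
    return sum(spreads) / len(spreads)
```

A score of this kind depends on where the correspondences and depths come from (for example, a matcher plus a depth estimator), so it measures the consistency of that whole measurement chain, not of the generated images alone.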

minor comments (2)

Abstract: The statement that the method 'outperforms existing methods' should name the specific metrics and baselines for immediate clarity.
Figure captions in the qualitative results could include more detail on camera poses and input sparsity levels to help readers assess consistency.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address each major point below and have revised the manuscript to strengthen the presentation of consistency evidence.

Point-by-point responses

Referee: §4 (Experiments): The central claim that generated views are sufficiently geometrically and photometrically consistent for direct use in off-the-shelf reconstruction without extra regularization or filtering is load-bearing, yet the reported benchmarks focus on overall outperformance rather than explicit cross-view consistency metrics (e.g., multi-view depth variance or edge alignment error) on sparse inputs or scenes with complex lighting/geometry outside the training distribution.

Authors: We agree that explicit cross-view consistency metrics provide valuable additional support for the central claim. While end-to-end reconstruction success with COLMAP and NeRF already serves as a strong indirect indicator (inconsistent views would cause reconstruction failure), we have added direct quantitative evaluations in the revised Section 4. These include multi-view depth variance and photometric consistency measures computed across generated views for sparse-input cases. We have also included results on additional scenes with complex geometry and lighting. The new metrics and figures confirm low variance and alignment, reinforcing that the generated views are suitable for direct use in standard pipelines.

revision: yes

Referee: §3 (Method): The description relies on standard diffusion training plus view-conditioning without an explicit cross-view consistency loss or post-processing step; this makes the assumption that outputs remain globally consistent (rather than locally plausible but drifting) an empirical outcome that must be directly verified for the downstream reconstruction claim to be supported.

Authors: The referee correctly notes that our approach uses standard diffusion training with view conditioning and does not introduce an auxiliary consistency loss. Global consistency is indeed an emergent property learned from large-scale multi-view training data. To directly verify this for the reconstruction claim, the revised manuscript adds an ablation study in Section 3 and new results in Section 4 comparing reconstructions obtained from views generated with versus without multi-view conditioning. The ablations show clear degradation in both consistency and final 3D quality when conditioning is removed. We have also added qualitative consistency visualizations in the supplement. We view these empirical verifications as sufficient support while acknowledging that an explicit consistency term remains an interesting direction for future work.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is empirical and externally validated

Full rationale

The paper presents an empirical multi-view diffusion model trained to generate novel views from sparse inputs, with consistency and downstream 3D reconstruction success demonstrated via held-out test scenes and off-the-shelf pipelines (COLMAP/NeRF) rather than any closed-form reduction of outputs to training losses or self-defined quantities. No equations or claims reduce a 'prediction' to a fitted parameter by construction, and no load-bearing uniqueness theorem or ansatz is imported via self-citation. The central claim rests on external evaluation benchmarks, making the method self-contained against independent data and reconstruction algorithms.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the empirical performance of a large diffusion model whose weights are learned from data rather than derived from first principles. No new physical axioms or invented entities are introduced; the main unstated premise is that the training distribution covers the target scenes sufficiently for generalization.

free parameters (1)

Diffusion model weights
Learned parameters of the multi-view conditioned diffusion network; these are fitted to large image datasets and constitute the primary learned component.

axioms (1)

[standard math] Standard diffusion model training objective and sampling procedure apply without modification to the multi-view conditioning setting.
Invoked implicitly when describing the model architecture and generation process.
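
For orientation, the "standard diffusion model training objective" referred to here is commonly written as a denoising loss. The form below is the generic noise-prediction version with conditioning added, offered as a reminder of the textbook formulation. The symbols are assumptions made for illustration: x_0 stands for the set of target views (or their latents), c for the conditioning (input images plus camera poses), and alpha_t, sigma_t for the noise schedule. This page does not state which parameterization, loss weighting, or latent space the paper uses.

```latex
\mathcal{L}(\theta)
  = \mathbb{E}_{x_0,\, c,\; \epsilon \sim \mathcal{N}(0, I),\; t}
    \left[ \left\| \epsilon - \epsilon_\theta(x_t, t, c) \right\|_2^2 \right],
\qquad
x_t = \alpha_t x_0 + \sigma_t \epsilon .
```

Under this reading, nothing in the loss ties one generated view to another explicitly; any cross-view agreement has to be learned through the joint conditioning, which is the point the second major referee comment raises.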

pith-pipeline@v0.9.0 · 5703 in / 1346 out tokens · 36801 ms · 2026-05-19T21:24:12.604391+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Foundation.DimensionForcing alexander_duality_circle_linking (echoes)
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Given any number of input images and a set of target novel viewpoints, our model generates highly consistent novel views of a scene. These generated views can be used as input to robust 3D reconstruction techniques

Foundation.HierarchyEmergence hierarchy_emergence_forces_phi (unclear)
Relation between the paper passage and the cited Recognition theorem.

CAT3D can create entire 3D scenes in as little as one minute, and outperforms existing methods for single image and few-view 3D scene creation

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CRePE: Curved Ray Expectation Positional Encoding for Unified-Camera-Controlled Video Generation
cs.CV 2026-05 unverdicted novelty 7.0
CRePE supplies depth-aware positional distributions along curved rays for stable unified-camera control in frozen video DiT models.

GSCompleter: A Distillation-Free Plugin for Metric-Aware 3D Gaussian Splatting Completion in Seconds
cs.CV 2026-04 unverdicted novelty 7.0
GSCompleter completes sparse 3D Gaussian Splatting scenes via a distillation-free generate-then-register pipeline using Stereo-Anchor lifting and Ray-Constrained Registration, delivering SOTA results on three benchmarks.

UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
cs.CV 2026-04 unverdicted novelty 7.0
UniGeo unifies geometric guidance across three levels in video models to reduce geometric drift and improve consistency in camera-controllable image editing.

Geometrically Consistent Multi-View Scene Generation from Freehand Sketches
cs.CV 2026-04 unverdicted novelty 7.0
A framework generates consistent multi-view scenes from one freehand sketch via a ~9k-sample dataset, Parallel Camera-Aware Attention Adapters, and Sparse Correspondence Supervision Loss, outperforming baselines in re...

Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors
cs.CV 2026-04 unverdicted novelty 7.0
A video generation approach conditions a base model with multi-scale 3D latent features and a cross-attention adapter to produce geometrically realistic and consistent orbital videos from one image.

Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale
cs.CV 2026-04 unverdicted novelty 7.0
A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.

Novel View Synthesis as Video Completion
cs.CV 2026-04 unverdicted novelty 7.0
Video diffusion models can be adapted into permutation-invariant generators for sparse novel view synthesis by treating the problem as video completion and removing temporal order cues.

HAD: Hallucination-Aware Diffusion Priors for 3D Reconstruction
cs.CV 2026-05 unverdicted novelty 6.0
HAD uses multi-view reasoning from a pre-trained feedforward NVS network to estimate and mask hallucination scores in diffusion priors, reducing artifacts and achieving SOTA novel view synthesis in sparse-view 3D reco...

FurnSet: Exploiting Repeats for 3D Scene Reconstruction
cs.CV 2026-04 unverdicted novelty 6.0
FurnSet improves single-view 3D scene reconstruction by using per-object CLS tokens and set-aware self-attention to group and jointly reconstruct repeated object instances, with added scene-object conditioning and lay...

UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
cs.CV 2026-04 unverdicted novelty 6.0
UniGeo adds unified geometric guidance at three levels in video models to reduce geometric drift and improve structural fidelity in camera-controllable image editing.

Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective
cs.CV 2026-04 unverdicted novelty 6.0
The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temp...

NavCrafter: Exploring 3D Scenes from a Single Image
cs.CV 2026-04 unverdicted novelty 6.0
NavCrafter generates controllable novel-view videos from one image via video diffusion, geometry-aware expansion, and enhanced 3D Gaussian Splatting to achieve state-of-the-art synthesis under large viewpoint changes.

ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis
cs.CV 2024-09 unverdicted novelty 6.0
ViewCrafter tames video diffusion models with point-based 3D guidance and iterative trajectory planning to produce high-fidelity novel views from single or sparse images.

DreamEdit3D: Personalization of Multi-View Diffusion Models for 3D Editing
cs.CV 2026-05 unverdicted novelty 5.0
DreamEdit3D learns separate token embeddings for segmented object components via two-phase multi-view optimization to enable text-guided 3D editing with consistent image generation and mesh reconstruction.

DecoRec: Decomposed 3D Scene Reconstruction from Single-View Images via Object-Level Diffusion
cs.CV 2026-05 unverdicted novelty 5.0
DecoRec decomposes single-view 3D scene reconstruction into per-object diffusion reconstructions followed by a differentiable rendering and diffusion-guided merging pipeline.

GeoRect4D: Geometry-Compatible Generative Rectification for Dynamic Sparse-View 3D Reconstruction
cs.CV 2026-04 unverdicted novelty 5.0
GeoRect4D couples 3D Gaussian splatting with a single-step diffusion rectifier via degradation-aware feedback and progressive optimization to improve fidelity and consistency in sparse-view dynamic 3D reconstruction.

Asset Harvester: Extracting 3D Assets from Autonomous Driving Logs for Simulation
cs.CV 2026-04 unverdicted novelty 5.0
Asset Harvester converts sparse in-the-wild object observations from AV driving logs into complete simulation-ready 3D assets via data curation, geometry-aware preprocessing, and a SparseViewDiT model that couples spa...

InSpatio-WorldFM: An Open-Source Real-Time Generative Frame Model
cs.CV 2026-03 unverdicted novelty 5.0
InSpatio-WorldFM is a frame-independent generative model that uses explicit 3D anchors and spatial memory to deliver real-time multi-view consistent spatial intelligence via a three-stage training pipeline from pretra...

ViPE: Video Pose Engine for 3D Geometric Perception
cs.CV 2025-08 unverdicted novelty 5.0
ViPE estimates camera intrinsics, motion, and dense near-metric depth from uncalibrated videos, outperforming baselines on TUM and KITTI while releasing annotations for 96M frames across real and generated videos.

Learning World Models for Interactive Video Generation
cs.CV 2025-05 unverdicted novelty 5.0
The work introduces video retrieval augmented generation (VRAG) with explicit global state conditioning to reduce compounding errors and improve spatiotemporal consistency in interactive video world models.

Reference graph

Works this paper leans on

86 extracted references · 86 canonical work pages · cited by 19 Pith papers · 10 internal anchors

[1]

Srinivasan, Matthew Tancik, Jonathan T

Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. ECCV, 2020

work page 2020
[2]

Instant neural graphics primitives with a multiresolution hash encoding

Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. SIGGRAPH, 2022. 10

work page 2022
[3]

3D Gaussian Splatting for Real-Time Radiance Field Rendering

Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D Gaussian Splatting for Real-Time Radiance Field Rendering. SIGGRAPH, 2023

work page 2023
[4]

FreeNeRF: Improving Few-shot Neural Rendering with Free Frequency Regularization

Jiawei Yang, Marco Pavone, and Yue Wang. FreeNeRF: Improving Few-shot Neural Rendering with Free Frequency Regularization. CVPR, 2023

work page 2023
[5]

SimpleNeRF: Regulariz- ing Sparse Input Neural Radiance Fields with Simpler Solutions

Nagabhushan Somraj, Adithyan Karanayil, and Rajiv Soundararajan. SimpleNeRF: Regulariz- ing Sparse Input Neural Radiance Fields with Simpler Solutions. SIGGRAPH Asia, 2023

work page 2023
[6]

LRM: Large Reconstruction Model for Single Image to 3D

Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. LRM: Large Reconstruction Model for Single Image to 3D. arXiv:2311.04400, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[7]

Srinivasan, Dor Verbin, Jonathan T

Rundi Wu, Ben Mildenhall, Philipp Henzler, Keunhong Park, Ruiqi Gao, Daniel Watson, Pratul P. Srinivasan, Dor Verbin, Jonathan T. Barron, Ben Poole, and Aleksander Holynski. Reconfusion: 3d reconstruction with diffusion priors, 2023

work page 2023
[8]

Barron, and Ben Mildenhall

Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. DreamFusion: Text-to-3D using 2D Diffusion. ICLR, 2022

work page 2022
[9]

Imagedream: Image-prompt multi-view diffusion for 3d generation

Peng Wang and Yichun Shi. Imagedream: Image-prompt multi-view diffusion for 3d generation. arXiv:2312.02201, 2023

work page arXiv 2023
[10]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Do- minik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv:2311.15127, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[11]

Align your latents: High-resolution video synthesis with latent diffusion models

Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. CVPR, 2023

work page 2023
[12]

Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning

Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, and Ishan Misra. Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning. arXiv:2311.10709, 2023

work page arXiv 2023
[13]

Lumiere: A space-time diffusion model for video generation

Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Yuanzhen Li, Tomer Michaeli, et al. Lumiere: A space-time diffusion model for video generation. arXiv, 2024

work page 2024
[14]

Photorealistic video generation with diffusion models, 2023

Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Li Fei-Fei, Irfan Essa, Lu Jiang, and José Lezama. Photorealistic video generation with diffusion models, 2023

work page 2023
[15]

Video generation models as world simulators

Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. 2024

work page 2024
[16]

State of the art on diffusion models for visual computing

Ryan Po, Wang Yifan, Vladislav Golyanik, Kfir Aberman, Jonathan T Barron, Amit H Bermano, Eric Ryan Chan, Tali Dekel, Aleksander Holynski, Angjoo Kanazawa, et al. State of the art on diffusion models for visual computing. arXiv:2310.07204, 2023

work page arXiv 2023
[17]

IM-3D: Iterative Multiview Diffusion and Reconstruction for High-Quality 3D Generation, 2024

Luke Melas-Kyriazi, Iro Laina, Christian Rupprecht, Natalia Neverova, Andrea Vedaldi, Oran Gafni, and Filippos Kokkinos. IM-3D: Iterative Multiview Diffusion and Reconstruction for High-Quality 3D Generation, 2024

work page 2024
[18]

Dream- Time: An Improved Optimization Strategy for Text-to-3D Content Creation

Yukun Huang, Jianan Wang, Yukai Shi, Xianbiao Qi, Zheng-Jun Zha, and Lei Zhang. Dream- Time: An Improved Optimization Strategy for Text-to-3D Content Creation. arXiv, 2023

work page 2023
[19]

SteinDreamer: Variance Reduction for Text-to-3D Score Distillation via Stein Identity

Peihao Wang, Zhiwen Fan, Dejia Xu, Dilin Wang, Sreyas Mohan, Forrest Iandola, Rakesh Ranjan, Yilei Li, Qiang Liu, Zhangyang Wang, et al. SteinDreamer: Variance Reduction for Text-to-3D Score Distillation via Stein Identity. arXiv, 2023

work page 2023
[20]

Collaborative score distillation for consistent visual editing

Subin Kim, Kyungmin Lee, June Suk Choi, Jongheon Jeong, Kihyuk Sohn, and Jinwoo Shin. Collaborative score distillation for consistent visual editing. NeurIPS, 36, 2024. 11

work page 2024
[21]

ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation

Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation. NeurIPS, 2023

work page 2023
[22]

Instruct-nerf2nerf: Editing 3d scenes with instructions

Ayaan Haque, Matthew Tancik, Alexei A Efros, Aleksander Holynski, and Angjoo Kanazawa. Instruct-nerf2nerf: Editing 3d scenes with instructions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19740–19750, 2023

work page 2023
[23]

Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation

Rui Chen, Yongwei Chen, Ningxin Jiao, and Kui Jia. Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation. ICCV, 2023

work page 2023
[24]

Magic3D: High-Resolution Text-to-3D Content Creation

Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3D: High-Resolution Text-to-3D Content Creation. CVPR, 2023

work page 2023
[25]

DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation

Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. Dreamgaussian: Generative gaussian splatting for efficient 3d content creation. arXiv:2309.16653, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[26]

Gaussiandreamer: Fast generation from text to 3d gaussian splatting with point cloud priors

Taoran Yi, Jiemin Fang, Guanjun Wu, Lingxi Xie, Xiaopeng Zhang, Wenyu Liu, Qi Tian, and Xinggang Wang. Gaussiandreamer: Fast generation from text to 3d gaussian splatting with point cloud priors. arXiv:2310.08529, 2023

work page arXiv 2023
[27]

Disentan- gled 3d scene generation with layout learning

Dave Epstein, Ben Poole, Ben Mildenhall, Alexei A Efros, and Aleksander Holynski. Disentan- gled 3d scene generation with layout learning. arXiv preprint arXiv:2402.16936, 2024

work page arXiv 2024
[28]

ATT3D: Amortized Text-to-3D Object Synthesis

Jonathan Lorraine, Kevin Xie, Xiaohui Zeng, Chen-Hsuan Lin, Towaki Takikawa, Nicholas Sharp, Tsung-Yi Lin, Ming-Yu Liu, Sanja Fidler, and James Lucas. ATT3D: Amortized Text-to-3D Object Synthesis. ICCV, 2023

work page 2023
[29]

Realfusion: 360deg reconstruction of any object from a single image

Luke Melas-Kyriazi, Iro Laina, Christian Rupprecht, and Andrea Vedaldi. Realfusion: 360deg reconstruction of any object from a single image. CVPR, 2023

work page 2023
[30]

Magic123: One Image to High-Quality 3D Object Generation Using Both 2D and 3D Diffusion Priors.arXiv:2306.17843, 2023

Guocheng Qian, Jinjie Mai, Abdullah Hamdi, Jian Ren, Aliaksandr Siarohin, Bing Li, Hsin- Ying Lee, Ivan Skorokhodov, Peter Wonka, Sergey Tulyakov, et al. Magic123: One Image to High-Quality 3D Object Generation Using Both 2D and 3D Diffusion Priors.arXiv:2306.17843, 2023

work page arXiv 2023
[31]

Make-It-3D: High-Fidelity 3D Creation from A Single Image with Diffusion Prior

Junshu Tang, Tengfei Wang, Bo Zhang, Ting Zhang, Ran Yi, Lizhuang Ma, and Dong Chen. Make-It-3D: High-Fidelity 3D Creation from A Single Image with Diffusion Prior. ICCV, 2023

work page 2023
[32]

Text2Room: Extracting Textured 3D Meshes from 2D Text-to-Image Models

Lukas Höllein, Ang Cao, Andrew Owens, Justin Johnson, and Matthias Nießner. Text2Room: Extracting Textured 3D Meshes from 2D Text-to-Image Models. ICCV, 2023

work page 2023
[33]

Monocular depth estimation using diffusion models

Saurabh Saxena, Abhishek Kar, Mohammad Norouzi, and David J Fleet. Monocular depth estimation using diffusion models. arXiv:2302.14816, 2023

work page arXiv 2023
[34]

WonderJourney: Going from Anywhere to Everywhere

Hong-Xing Yu, Haoyi Duan, Junhwa Hur, Kyle Sargent, Michael Rubinstein, William T Freeman, Forrester Cole, Deqing Sun, Noah Snavely, Jiajun Wu, et al. WonderJourney: Going from Anywhere to Everywhere. arXiv:2312.03884, 2023

work page arXiv 2023
[35]

Nerfiller: Completing scenes via generative 3d inpainting

Ethan Weber, Aleksander Hoły´nski, Varun Jampani, Saurabh Saxena, Noah Snavely, Abhishek Kar, and Angjoo Kanazawa. Nerfiller: Completing scenes via generative 3d inpainting. arXiv preprint arXiv:2312.04560, 2023

work page arXiv 2023
[36]

Zero-1-to-3: Zero-Shot One Image to 3D Object

Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl V ondrick. Zero-1-to-3: Zero-Shot One Image to 3D Object. arXiv, 2023

work page 2023
[37]

Novel view synthesis with diffusion models

Daniel Watson, William Chan, Ricardo Martin-Brualla, Jonathan Ho, Andrea Tagliasacchi, and Mohammad Norouzi. Novel view synthesis with diffusion models. arXiv:2210.04628, 2022

work page arXiv 2022
[38]

DreamBooth3D: Subject-Driven Text-to-3D Generation

Amit Raj, Srinivas Kaza, Ben Poole, Michael Niemeyer, Ben Mildenhall, Nataniel Ruiz, Shiran Zada, Kfir Aberman, Michael Rubenstein, Jonathan Barron, Yuanzhen Li, and Varun Jampani. DreamBooth3D: Subject-Driven Text-to-3D Generation. ICCV, 2023. 12

work page 2023
[39]

NerfDiff: Single-image View Synthesis with NeRF-guided Distillation from 3D-aware Diffusion

Jiatao Gu, Alex Trevithick, Kai-En Lin, Joshua M Susskind, Christian Theobalt, Lingjie Liu, and Ravi Ramamoorthi. NerfDiff: Single-image View Synthesis with NeRF-guided Distillation from 3D-aware Diffusion. ICML, 2023

work page 2023
[40]

GeNVS: Generative novel view synthesis with 3D-aware diffusion models

Eric R Chan, Koki Nagano, Matthew A Chan, Alexander W Bergman, Jeong Joon Park, Axel Levy, Miika Aittala, Shalini De Mello, Tero Karras, and Gordon Wetzstein. GeNVS: Generative novel view synthesis with 3D-aware diffusion models. arXiv, 2023

work page 2023
[41]

ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Image

Kyle Sargent, Zizhang Li, Tanmay Shah, Charles Herrmann, Hong-Xing Yu, Yunzhi Zhang, Eric Ryan Chan, Dmitry Lagun, Li Fei-Fei, Deqing Sun, et al. ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Image. CVPR, 2024

work page 2024
[42]

One-2-3-45: Any Single Image to 3D Mesh in 45 Seconds without Per-Shape Optimization

Minghua Liu, Chao Xu, Haian Jin, Linghao Chen, Mukund Varma T, Zexiang Xu, and Hao Su. One-2-3-45: Any Single Image to 3D Mesh in 45 Seconds without Per-Shape Optimization. arXiv, 2023

[43]

MVDream: Multi-view Diffusion for 3D Generation

Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, and Xiao Yang. MVDream: Multi-view Diffusion for 3D Generation. arXiv, 2023

[44]

Zero123++: a single image to consistent multi-view diffusion base model

Ruoxi Shi, Hansheng Chen, Zhuoyang Zhang, Minghua Liu, Chao Xu, Xinyue Wei, Linghao Chen, Chong Zeng, and Hao Su. Zero123++: a single image to consistent multi-view diffusion base model, 2023

[45]

ConsistNet: Enforcing 3D Consistency for Multi-view Images Diffusion

Jiayu Yang, Ziang Cheng, Yunfei Duan, Pan Ji, and Hongdong Li. ConsistNet: Enforcing 3D Consistency for Multi-view Images Diffusion. arXiv:2310.10343, 2023

[46]

SyncDreamer: Generating Multiview-consistent Images from a Single-view Image

Yuan Liu, Cheng Lin, Zijiao Zeng, Xiaoxiao Long, Lingjie Liu, Taku Komura, and Wenping Wang. SyncDreamer: Generating Multiview-consistent Images from a Single-view Image. arXiv, 2023

[47]

Viewdiff: 3d-consistent image generation with text-to-image models

Lukas Höllein, Aljaž Božič, Norman Müller, David Novotny, Hung-Yu Tseng, Christian Richardt, Michael Zollhöfer, and Matthias Nießner. Viewdiff: 3d-consistent image generation with text-to-image models, 2024

[48]

Video Diffusion Models

Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. arXiv:2204.03458, 2022

[49]

Imagen Video: High Definition Video Generation with Diffusion Models

Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv:2210.02303, 2022

[50]

Video interpolation with diffusion models

Siddhant Jain, Daniel Watson, Eric Tabellion, Aleksander Hołyński, Ben Poole, and Janne Kontkanen. Video interpolation with diffusion models. arXiv preprint arXiv:2404.01203, 2024

[51]

AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

Yuwei Guo, Ceyuan Yang, Anyi Rao, Yaohui Wang, Yu Qiao, Dahua Lin, and Bo Dai. AnimateDiff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023

[52]

Motionctrl: A unified and flexible motion controller for video generation

Zhouxia Wang, Ziyang Yuan, Xintao Wang, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. Motionctrl: A unified and flexible motion controller for video generation. arXiv preprint arXiv:2312.03641, 2023

[53]

ViVid-1-to-3: Novel View Synthesis with Video Diffusion Models

Jeong-gi Kwak, Erqun Dong, Yuhe Jin, Hanseok Ko, Shweta Mahajan, and Kwang Moo Yi. ViVid-1-to-3: Novel View Synthesis with Video Diffusion Models. arXiv:2312.01305, 2023

[54]

SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image using Latent Video Diffusion

Vikram Voleti, Chun-Han Yao, Mark Boss, Adam Letts, David Pankratz, Dmitry Tochilkin, Christian Laforte, Robin Rombach, and Varun Jampani. SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image using Latent Video Diffusion, 2024

[55]

3DGen: Triplane Latent Diffusion for Textured Mesh Generation

Anchit Gupta, Wenhan Xiong, Yixin Nie, Ian Jones, and Barlas Oğuz. 3DGen: Triplane Latent Diffusion for Textured Mesh Generation. arXiv:2303.05371, 2023

[56]

Viewset Diffusion: (0-)Image-Conditioned 3D Generative Models from 2D Data

Stanislaw Szymanowicz, Christian Rupprecht, and Andrea Vedaldi. Viewset Diffusion: (0-)Image-Conditioned 3D Generative Models from 2D Data. ICCV, 2023

[57]

DMV3D: Denoising Multi-View Diffusion using 3D Large Reconstruction Model

Yinghao Xu, Hao Tan, Fujun Luan, Sai Bi, Peng Wang, Jiahao Li, Zifan Shi, Kalyan Sunkavalli, Gordon Wetzstein, Zexiang Xu, and Kai Zhang. DMV3D: Denoising Multi-View Diffusion using 3D Large Reconstruction Model, 2023

[58]

Instant3D: Fast Text-to-3D with Sparse-View Generation and Large Reconstruction Model

Jiahao Li, Hao Tan, Kai Zhang, Zexiang Xu, Fujun Luan, Yinghao Xu, Yicong Hong, Kalyan Sunkavalli, Greg Shakhnarovich, and Sai Bi. Instant3D: Fast Text-to-3D with Sparse-View Generation and Large Reconstruction Model. arXiv:2311.06214, 2023

[59]

Splatter image: Ultra-fast single-view 3d reconstruction

Stanislaw Szymanowicz, Christian Rupprecht, and Andrea Vedaldi. Splatter image: Ultra-fast single-view 3d reconstruction. arXiv:2312.13150, 2023

[60]

GS-LRM: Large Reconstruction Model for 3D Gaussian Splatting

Kai Zhang, Sai Bi, Hao Tan, Yuanbo Xiangli, Nanxuan Zhao, Kalyan Sunkavalli, and Zexiang Xu. GS-LRM: Large Reconstruction Model for 3D Gaussian Splatting. arXiv:2404.19702, 2024

[61]

Auto-Encoding Variational Bayes

Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv:1312.6114, 2013

[62]

High-Resolution Image Synthesis with Latent Diffusion Models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models. CVPR, 2022

[63]

pixelNeRF: Neural Radiance Fields from One or Few Images

Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelNeRF: Neural Radiance Fields from One or Few Images. CVPR, 2021

[64]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. ICML, 2021

[65]

Flashattention: Fast and memory-efficient exact attention with io-awareness

Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. NeurIPS, 35, 2022

[66]

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv:2307.08691, 2023

[67]

Simple diffusion: End-to-end diffusion for high resolution images

Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. Simple diffusion: End-to-end diffusion for high resolution images. ICML, 2023

[68]

Scene representation transformer: Geometry-free novel view synthesis through set-latent scene representations

Mehdi SM Sajjadi, Henning Meyer, Etienne Pot, Urs Bergmann, Klaus Greff, Noha Radwan, Suhani Vora, Mario Lučić, Daniel Duckworth, Alexey Dosovitskiy, et al. Scene representation transformer: Geometry-free novel view synthesis through set-latent scene representations. CVPR, 2022

[69]

k-means++: the advantages of careful seeding

David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding. In ACM-SIAM Symposium on Discrete Algorithms, 2007

[70]

Shadows Don’t Lie and Lines Can’t Bend! Generative Models don’t know Projective Geometry... for now

Ayush Sarkar, Hanlin Mai, Amitabh Mahapatra, Svetlana Lazebnik, David A Forsyth, and Anand Bhattad. Shadows Don’t Lie and Lines Can’t Bend! Generative Models don’t know Projective Geometry... for now. arXiv:2311.17138, 2023

[71]

Zip-NeRF: Anti-Aliased Grid-Based Neural Radiance Fields

Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Zip-NeRF: Anti-Aliased Grid-Based Neural Radiance Fields. ICCV, 2023

[72]

The unreasonable effectiveness of deep features as a perceptual metric

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. CVPR, 2018

[73]

Mip-NeRF 360: Unbounded Anti-Aliased Neural Radiance Fields

Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Mip-NeRF 360: Unbounded Anti-Aliased Neural Radiance Fields. CVPR, 2022

[74]

Common Objects in 3D: Large-Scale Learning and Evaluation of Real-life 3D Category Reconstruction

Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common Objects in 3D: Large-Scale Learning and Evaluation of Real-life 3D Category Reconstruction. ICCV, 2021

[75]

Objaverse: A universe of annotated 3d objects

Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. CVPR, 2023

[76]

Stereo magnification: Learning view synthesis using multiplane images

Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images. SIGGRAPH, 2018

[77]

MVImgNet: A Large-scale Dataset of Multi-view Images

Xianggang Yu, Mutian Xu, Yidan Zhang, Haolin Liu, Chongjie Ye, Yushuang Wu, Zizheng Yan, Chenming Zhu, Zhangyang Xiong, Tianyou Liang, et al. MVImgNet: A Large-scale Dataset of Multi-view Images. CVPR, 2023

[78]

Large scale multi-view stereopsis evaluation

Rasmus Jensen, Anders Dahl, George Vogiatzis, Engin Tola, and Henrik Aanæs. Large scale multi-view stereopsis evaluation. CVPR, 2014

[79]

Local Light Field Fusion: Practical View Synthesis with Prescriptive Sampling Guidelines

Ben Mildenhall, Pratul P Srinivasan, Rodrigo Ortiz-Cayon, Nima Khademi Kalantari, Ravi Ramamoorthi, Ren Ng, and Abhishek Kar. Local Light Field Fusion: Practical View Synthesis with Prescriptive Sampling Guidelines. SIGGRAPH, 2019

[80]

RealmDreamer: Text-Driven 3D Scene Generation with Inpainting and Depth Diffusion

Jaidev Shriram, Alex Trevithick, Lingjie Liu, and Ravi Ramamoorthi. RealmDreamer: Text-Driven 3D Scene Generation with Inpainting and Depth Diffusion, 2024


Showing first 80 references.
