arxiv: 2308.16512 · v4 · submitted 2023-08-31 · 💻 cs.CV


MVDream: Multi-view Diffusion for 3D Generation

Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, Xiao Yang


Pith reviewed 2026-05-15 08:31 UTC · model grok-4.3


keywords multi-view diffusion · 3D generation · text-to-3D · score distillation sampling · consistent multi-view images · 3D prior · few-shot 3D learning


The pith

A multi-view diffusion model trained on both 2D and 3D data acts as a generalizable 3D prior that improves consistency in text-to-3D generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MVDream, a diffusion model designed to produce consistent images across multiple viewpoints from a single text prompt. It achieves this by training jointly on 2D image-text pairs and 3D data, merging the wide coverage of standard 2D diffusion with the geometric coherence of rendered 3D views. This trained model functions as an implicit 3D prior that works independently of any particular 3D shape format. When plugged into score distillation sampling, it yields more stable and consistent 3D outputs than methods relying solely on 2D diffusion. The same model also supports personalizing new 3D concepts from a small number of 2D reference images.

Core claim

MVDream shows that a multi-view diffusion model learned from both 2D and 3D data is implicitly a generalizable 3D prior agnostic to 3D representations. Applied via Score Distillation Sampling, it markedly improves the consistency and stability of existing 2D-lifting approaches to 3D generation. It further enables few-shot concept learning from 2D examples for 3D output, analogous to DreamBooth but in the 3D setting.

What carries the argument

The multi-view diffusion model trained jointly on 2D and 3D data, which generates viewpoint-consistent images and thereby encodes an implicit 3D prior usable in score distillation sampling.
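
For orientation, the score distillation sampling gradient can be written in its standard DreamFusion form, extended here with a camera condition. This is a sketch and not an equation quoted from the paper: the symbols θ (3D parameters), g (differentiable renderer), c (camera), y (text prompt), ε_φ (the diffusion model's noise prediction), and w(t) (timestep weighting) are notational assumptions.

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}(\theta)
  = \mathbb{E}_{t,\,c,\,\epsilon}\!\left[
      w(t)\,\bigl(\epsilon_\phi(x_t;\, y,\, c,\, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right],
\qquad x = g(\theta, c),
\qquad x_t = \sqrt{\bar\alpha_t}\,x + \sqrt{1-\bar\alpha_t}\,\epsilon .
```

With a multi-view prior, x would stand for a set of views rendered from several cameras and denoised jointly, rather than a single image.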

If this is right

Existing 2D-lifting pipelines for text-to-3D can be upgraded to higher consistency simply by swapping in the multi-view diffusion prior.
Few-shot personalization of 3D objects becomes feasible from ordinary 2D photographs without explicit 3D data.
The same prior can be used with any 3D representation because it does not depend on a specific geometry format.
Training cost for new 3D generators decreases because the model already supplies multi-view consistency.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be extended to video or dynamic scenes by adding temporal consistency as another training signal.
If the implicit prior holds across domains, similar joint 2D-3D training might improve consistency in other generative tasks such as novel-view synthesis.
Downstream applications could combine this prior with faster inference methods to make real-time 3D content creation more practical.

Load-bearing premise

Joint training on 2D and 3D data yields a prior that generalizes to new text prompts and shapes without overfitting to the specific training renderings or degrading single-view image quality.

What would settle it

A direct comparison showing that score distillation sampling with MVDream produces no measurable gain in multi-view consistency or output stability over standard 2D diffusion baselines on a fixed set of text prompts would falsify the central claim.
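
One way such a comparison could be scored is sketched below, assuming each rendered view has already been mapped to an embedding vector (for example by an image encoder such as CLIP). The function names `cross_view_consistency` and `mean_gain` are hypothetical, and embedding similarity is only a proxy for semantic agreement between views, not a measure of geometric consistency.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cross_view_consistency(view_embeddings):
    """Mean pairwise cosine similarity over all pairs of views of one object."""
    pairs = list(combinations(view_embeddings, 2))
    if not pairs:
        raise ValueError("need at least two views")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def mean_gain(scores_a, scores_b):
    """Average per-prompt difference (a - b) on a fixed, shared prompt set."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and aligned by prompt")
    return sum(a - b for a, b in zip(scores_a, scores_b)) / len(scores_a)
```

Under this sketch, a mean gain indistinguishable from zero across the fixed prompt set would be the kind of null result described above.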

Original abstract

We introduce MVDream, a diffusion model that is able to generate consistent multi-view images from a given text prompt. Learning from both 2D and 3D data, a multi-view diffusion model can achieve the generalizability of 2D diffusion models and the consistency of 3D renderings. We demonstrate that such a multi-view diffusion model is implicitly a generalizable 3D prior agnostic to 3D representations. It can be applied to 3D generation via Score Distillation Sampling, significantly enhancing the consistency and stability of existing 2D-lifting methods. It can also learn new concepts from a few 2D examples, akin to DreamBooth, but for 3D generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

MVDream trains a multi-view diffusion model on 2D and 3D data to act as a 3D prior for SDS and claims better consistency, but the abstract gives no metrics or ablations to back the generalizability claim.


The main thing to know is that MVDream trains a diffusion model on both 2D images and 3D data to generate consistent multi-view images from text prompts. They then use this model as an implicit 3D prior inside Score Distillation Sampling to improve the consistency and stability of text-to-3D generation.

What the paper does well is show a concrete way to combine the strengths of 2D diffusion models with 3D consistency through joint training. The multi-view conditioning during diffusion training is the key step, and framing the result as a representation-agnostic 3D prior is a nice way to think about it. The few-shot concept learning extension is also a useful addition, extending DreamBooth-style personalization to 3D. The approach looks like it could be a direct upgrade for many existing 2D-lifting pipelines that struggle with view inconsistencies.

The soft spots are around the strength of the evidence. The abstract states that the model significantly enhances consistency, but there are no reported metrics, no details on ablations for the 2D versus 3D data mix, and no tests showing performance on novel shapes or out-of-distribution prompts. The stress-test note correctly points out that generalizability hinges on not overfitting to the specific 3D renderings used in training, and nothing in the provided summary rules that out. This makes the central claim harder to assess without the full experiments.

This work is aimed at researchers building text-to-3D systems in graphics and vision. A reader who wants to experiment with better priors for SDS would get practical value from the model construction details. It deserves a serious referee because it tackles an important bottleneck with a clear method, even if more rigorous validation is needed. I recommend sending it to peer review rather than a desk reject.

Referee Report

2 major / 2 minor

Summary. The paper introduces MVDream, a multi-view diffusion model trained jointly on 2D and 3D data to generate consistent multi-view images from text prompts. It claims this model functions as an implicit generalizable 3D prior agnostic to representations, which can be applied via Score Distillation Sampling (SDS) to enhance consistency and stability in existing 2D-lifting 3D generation methods, and extended to few-shot 3D concept learning akin to DreamBooth.

Significance. If validated, the work offers a practical bridge between the generalizability of 2D diffusion models and the multi-view consistency of 3D renderings, potentially improving text-to-3D pipelines without explicit 3D representations. The empirical demonstrations of SDS-based generation and few-shot adaptation provide concrete value for 3D content creation applications, though the strength depends on rigorous quantitative support for the transfer claims.

major comments (2)

§4.1 (3D Generation via SDS): The central claim that MVDream is 'significantly enhancing the consistency and stability of existing 2D-lifting methods' lacks load-bearing quantitative evidence; no metrics (e.g., multi-view consistency scores, CLIP similarity across views, or direct comparisons to DreamFusion baselines) and no ablation isolating the multi-view prior's contribution are reported, leaving the enhancement unverified.
§5 (Few-shot 3D Concept Learning): The generalizability of the 3D prior to novel text prompts and shapes rests on the untested assumption that joint 2D+3D training avoids overfitting to the specific 3D training renderings; no ablation on 3D data contribution, no diversity statistics for the 3D corpus, and no out-of-distribution shape/prompt tests are provided to support the transfer claim.

minor comments (2)

§3.2 (Model Architecture): The definition and range of the 'multi-view conditioning strength' hyperparameter are introduced without explicit notation or sensitivity analysis, complicating reproducibility of the reported results.
Figure 3 (Qualitative Results): The caption does not specify the exact text prompts or camera poses used for the multi-view generations, reducing clarity for readers attempting to interpret the consistency improvements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback. We address the major comments point-by-point below and will revise the manuscript accordingly to strengthen the quantitative support for our claims.


Referee: §4.1 (3D Generation via SDS): The central claim that MVDream is 'significantly enhancing the consistency and stability of existing 2D-lifting methods' lacks load-bearing quantitative evidence; no metrics (e.g., multi-view consistency scores, CLIP similarity across views, or direct comparisons to DreamFusion baselines) and no ablation isolating the multi-view prior's contribution are reported, leaving the enhancement unverified.

Authors: We agree that the current manuscript relies primarily on qualitative results for this claim. In the revision we will add quantitative metrics including multi-view consistency scores and average CLIP similarity across generated views, plus direct numerical comparisons against DreamFusion baselines. We will also include an ablation that isolates the multi-view prior's contribution by comparing against a 2D-only diffusion baseline under identical SDS settings.
revision: yes

Referee: §5 (Few-shot 3D Concept Learning): The generalizability of the 3D prior to novel text prompts and shapes rests on the untested assumption that joint 2D+3D training avoids overfitting to the specific 3D training renderings; no ablation on 3D data contribution, no diversity statistics for the 3D corpus, and no out-of-distribution shape/prompt tests are provided to support the transfer claim.

Authors: We acknowledge that additional controls are needed to substantiate the transfer claim. The revised version will report (1) an ablation measuring performance with and without the 3D training data, (2) basic diversity statistics (e.g., object category coverage and viewpoint distribution) for the 3D corpus, and (3) qualitative and quantitative results on out-of-distribution shapes and prompts not seen during training.
revision: yes
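
The diversity statistics mentioned in the response could be computed along the following lines. This is a minimal sketch: the inputs (one category label per object, one azimuth angle in degrees per rendered view) and the function names are assumptions made for illustration, not a description of the authors' data format.

```python
import math
from collections import Counter


def category_coverage(labels):
    """Number of distinct categories and the entropy (in nats) of their frequencies."""
    counts = Counter(labels)
    total = sum(counts.values())
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    return {"num_categories": len(counts), "entropy_nats": entropy}


def azimuth_histogram(azimuths_deg, num_bins=8):
    """Count rendered views per equal-width azimuth bin covering [0, 360) degrees."""
    width = 360.0 / num_bins
    hist = [0] * num_bins
    for angle in azimuths_deg:
        # Wrap into [0, 360); the final modulo guards against float rounding to 360.0.
        hist[int((angle % 360.0) // width) % num_bins] += 1
    return hist
```

A strongly skewed histogram or a low category entropy would suggest that the 3D corpus covers a narrow slice of shapes or viewpoints, which is the overfitting risk the referee raises.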

Circularity Check

0 steps flagged

No significant circularity; core claim is empirical training result


The paper claims that joint training on 2D and 3D data yields a multi-view diffusion model that acts as a generalizable 3D prior, demonstrated via application to SDS for 3D generation. This rests on external datasets and standard diffusion training rather than any self-definitional loop, fitted parameter renamed as prediction, or load-bearing self-citation chain. No equations or derivations reduce the claimed prior to its inputs by construction. Citations to prior work (e.g., DreamBooth and SDS) are not central to the derivation and do not force the result. The argument does not depend on its own conclusions and can be checked against external baselines.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach assumes standard diffusion training dynamics extend to multi-view conditioning and that SDS can leverage the learned prior without additional 3D-specific losses.

free parameters (1)

multi-view conditioning strength
Weighting between 2D and 3D training signals is chosen to balance consistency and generalizability.
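
One simple way such a weighting could be realized is to choose the data source for each training step at random. The sketch below is illustrative only: the parameter `p_3d` and both function names are hypothetical, and no value is taken from the paper.

```python
import random


def sample_batch_source(rng, p_3d):
    """Return "3d" with probability p_3d, otherwise "2d"."""
    return "3d" if rng.random() < p_3d else "2d"


def mixing_schedule(num_steps, p_3d, seed=0):
    """Draw the data source for each of num_steps training steps, reproducibly."""
    if not 0.0 <= p_3d <= 1.0:
        raise ValueError("p_3d must lie in [0, 1]")
    rng = random.Random(seed)
    return [sample_batch_source(rng, p_3d) for _ in range(num_steps)]
```

Raising `p_3d` would lean training toward multi-view consistency, while lowering it would lean toward the broader coverage of 2D image-text data, which is the trade-off this ledger entry describes.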

axioms (1)

domain assumption: Joint 2D-3D training yields a prior that is agnostic to explicit 3D representations.
Invoked when claiming the model can be used directly with SDS for any 3D representation.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.DAlembert.Inevitability · bilinear_family_forced · unclear

Paper passage: "We demonstrate that such a multi-view diffusion model is implicitly a generalizable 3D prior agnostic to 3D representations. It can be applied to 3D generation via Score Distillation Sampling, significantly enhancing the consistency and stability of existing 2D-lifting methods."

IndisputableMonolith.Foundation.HierarchyEmergence · hierarchy_emergence_forces_phi · unclear

Paper passage: "Learning from both 2D and 3D data, a multi-view diffusion model can achieve the generalizability of 2D diffusion models and the consistency of 3D renderings."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Mind the Gap: Geometrically Accurate Generative Reconstruction from Disjoint Views
cs.CV 2026-05 unverdicted novelty 8.0

GLADOS reconstructs 3D geometry from disjoint views by generating intermediate perspectives, performing robust coarse alignment that tolerates generative inconsistencies, and iteratively expanding context for consistency.
R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow
cs.CV 2026-05 unverdicted novelty 7.0

R-DMesh generates high-fidelity 4D meshes aligned to video by disentangling base mesh, motion, and a learned rectification jump offset inside a VAE, then using Triflow Attention and rectified-flow diffusion.
Geometrically Consistent Multi-View Scene Generation from Freehand Sketches
cs.CV 2026-04 unverdicted novelty 7.0

A framework generates consistent multi-view scenes from one freehand sketch via a ~9k-sample dataset, Parallel Camera-Aware Attention Adapters, and Sparse Correspondence Supervision Loss, outperforming baselines in re...
SparseCam4D: Spatio-Temporally Consistent 4D Reconstruction from Sparse Cameras
cs.CV 2026-03 unverdicted novelty 7.0

SparseCam4D achieves spatio-temporally consistent high-fidelity 4D reconstruction from sparse cameras via a Spatio-Temporal Distortion Field that corrects inconsistencies in generative observations.
GeoQuery: Geometry-Query Diffusion for Sparse-View Reconstruction
cs.CV 2026-05 unverdicted novelty 6.0

GeoQuery replaces corrupted rendering features with geometry-aligned proxy queries and restricts cross-view attention to local windows, enabling robust diffusion-based refinement under extreme view sparsity.
Beyond Thinking: Imagining in 360° for Humanoid Visual Search
cs.CV 2026-05 unverdicted novelty 6.0

Imagining in 360° decouples visual search into a single-step probabilistic semantic layout predictor and an actor, removing the need for multi-turn CoT reasoning and trajectory annotations while improving efficiency i...
Velox: Learning Representations of 4D Geometry and Appearance
cs.CV 2026-05 unverdicted novelty 6.0

Velox compresses dynamic point clouds into latent tokens that support geometry via 4D surface modeling and appearance via 3D Gaussians, showing strong results on video-to-4D generation, tracking, and image-to-4D cloth...
Structured 3D Latents Are Surprisingly Powerful: Unleashing Generalizable Style with 2D Diffusion
cs.CV 2026-05 unverdicted novelty 6.0

DiLAST optimizes 3D latents via guidance from a 2D diffusion model to enable generalizable style transfer for OOD styles in 3D asset generation.
REVIVE 3D: Refinement via Encoded Voluminous Inflated prior for Volume Enhancement
cs.CV 2026-04 unverdicted novelty 6.0

REVIVE 3D generates voluminous 3D assets from flat 2D images via an inflated prior construction followed by latent-space refinement, plus new metrics for volume and flatness validated by user study.
Camera Control for Text-to-Image Generation via Learning Viewpoint Tokens
cs.CV 2026-04 unverdicted novelty 6.0

Viewpoint tokens learned on a mixed 3D-rendered and photorealistic dataset enable precise camera control in text-to-image generation while factorizing geometry from appearance and transferring to unseen object categories.
AnyLift: Scaling Motion Reconstruction from Internet Videos via 2D Diffusion
cs.CV 2026-04 unverdicted novelty 6.0

A two-stage method synthesizes multi-view 2D motion data from internet video keypoints and trains a camera-conditioned diffusion model to recover globally consistent 3D human motion and HOI in world space.
HandDreamer: Zero-Shot Text to 3D Hand Model Generation using Corrective Hand Shape Guidance
cs.CV 2026-04 unverdicted novelty 6.0

HandDreamer is the first zero-shot text-to-3D method for hands that uses MANO initialization, skeleton-guided diffusion, and corrective shape guidance to produce view-consistent models.
Stepper: Stepwise Immersive Scene Generation with Multiview Panoramas
cs.CV 2026-03 unverdicted novelty 6.0

Stepper uses stepwise panoramic expansion with a multi-view 360-degree diffusion model and geometry reconstruction to produce high-fidelity, structurally consistent immersive 3D scenes from text.
InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models
cs.CV 2024-04 unverdicted novelty 6.0

InstantMesh produces diverse, high-quality 3D meshes from single images in seconds by combining a multi-view diffusion model with a sparse-view large reconstruction model and optimizing directly on meshes.
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
cs.CV 2023-11 conditional novelty 6.0

Stable Video Diffusion scales latent video diffusion models via text-to-image pretraining, video pretraining on curated data, and high-quality finetuning to produce competitive text-to-video and image-to-video results...
R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow
cs.CV 2026-05 unverdicted novelty 5.0

R-DMesh uses a VAE with a learned rectification jump offset and Triflow Attention inside a rectified-flow diffusion transformer to produce video-aligned 4D meshes despite initial pose misalignment.
Pose-Aware Diffusion for 3D Generation
cs.CV 2026-05 unverdicted novelty 5.0

PAD synthesizes 3D geometry in observation space via depth unprojection as anchor to eliminate pose ambiguity in image-to-3D generation.
Asset Harvester: Extracting 3D Assets from Autonomous Driving Logs for Simulation
cs.CV 2026-04 unverdicted novelty 5.0

Asset Harvester converts sparse in-the-wild object observations from AV driving logs into complete simulation-ready 3D assets via data curation, geometry-aware preprocessing, and a SparseViewDiT model that couples spa...
AnimateAnyMesh++: A Flexible 4D Foundation Model for High-Fidelity Text-Driven Mesh Animation
cs.CV 2026-04 unverdicted novelty 4.0

AnimateAnyMesh++ animates arbitrary 3D meshes from text using an expanded 300K-identity DyMesh-XL dataset, a power-law topology-aware DyMeshVAE-Flex, and a variable-length rectified-flow generator to produce semantica...
Cosmos World Foundation Model Platform for Physical AI
cs.CV 2025-01 unverdicted novelty 3.0

The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.

Reference graph

Works this paper leans on

161 extracted references · 161 canonical work pages · cited by 19 Pith papers · 7 internal anchors

[1]

https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0

stable-diffusion-xl-base-1.0. https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0. Accessed: 2023-08-29

work page 2023
[2]

https://sketchfab.com/3d-models/popular

Sketchfab. https://sketchfab.com/3d-models/popular. Accessed: 2023-08-30

work page 2023
[3]

https://huggingface.co/DeepFloyd

Deepfloyd. https://huggingface.co/DeepFloyd. Accessed: 2023-08-25

work page 2023
[4]

https://lumalabs.ai/dashboard/imagine

Luma.ai. https://lumalabs.ai/dashboard/imagine. Accessed: 2023-08-25

work page 2023
[5]

https://huggingface.co/spaces/lambdalabs/stable-diffusion-image-variations

Stable diffusion image variation. https://huggingface.co/spaces/lambdalabs/stable-diffusion-image-variations

work page
[6]

https://huggingface.co/stabilityai/stable-diffusion-2-1-base

Stable diffusion 2.1 base. https://huggingface.co/stabilityai/stable-diffusion-2-1-base. Accessed: 2023-07-14

work page 2023
[7]

https://github.com/threestudio-project/threestudio

Threestudio project. https://github.com/threestudio-project/threestudio. Accessed: 2023-08-25

work page 2023
[9]

Barron, Ben Mildenhall, Dor Verbin, Pratul P

Jonathan T. Barron, Ben Mildenhall, Dor Verbin, Pratul P. Srinivasan, and Peter Hedman. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. CVPR, 2022

work page 2022
[10]

Align your latents: High-resolution video synthesis with latent diffusion models

Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In CVPR, 2023

work page 2023
[11]

Efficient geometry-aware 3d generative adversarial networks

Eric R Chan, Connor Z Lin, Matthew A Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas J Guibas, Jonathan Tremblay, Sameh Khamis, et al. Efficient geometry-aware 3d generative adversarial networks. In CVPR, 2022

work page 2022
[12]

Chan, Koki Nagano, Matthew A

Eric R. Chan, Koki Nagano, Matthew A. Chan, Alexander W. Bergman, Jeong Joon Park, Axel Levy, Miika Aittala, Shalini De Mello, Tero Karras, and Gordon Wetzstein. GeNVS : Generative novel view synthesis with 3D -aware diffusion models. In arXiv, 2023

work page 2023
[14]

Objaverse: A universe of annotated 3d objects

Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In CVPR, pp.\ 13142--13153, 2023

work page 2023
[15]

Gram: Generative radiance manifolds for 3d-aware image generation

Yu Deng, Jiaolong Yang, Jianfeng Xiang, and Xin Tong. Gram: Generative radiance manifolds for 3d-aware image generation. In CVPR, pp.\ 10673--10683, 2022

work page 2022
[16]

Get3d: A generative model of high quality 3d textured shapes learned from images

Jun Gao, Tianchang Shen, Zian Wang, Wenzheng Chen, Kangxue Yin, Daiqing Li, Or Litany, Zan Gojcic, and Sanja Fidler. Get3d: A generative model of high quality 3d textured shapes learned from images. NeurIPS, 2022

work page 2022
[17]

Learning single-image 3d reconstruction by generative modelling of shape, pose and shading

Paul Henderson and Vittorio Ferrari. Learning single-image 3d reconstruction by generative modelling of shape, pose and shading. International Journal of Computer Vision, 2020

work page 2020
[18]

Leveraging 2d data to learn textured 3d mesh generation

Paul Henderson, Vagia Tsiminaki, and Christoph H Lampert. Leveraging 2d data to learn textured 3d mesh generation. In CVPR, 2020

work page 2020
[19]

Gans trained by a two time-scale update rule converge to a local nash equilibrium

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. NeurIPS, 2017

work page 2017
[23]

Holodiffusion: Training a 3d diffusion model using 2d images

Animesh Karnewar, Andrea Vedaldi, David Novotny, and Niloy J Mitra. Holodiffusion: Training a 3d diffusion model using 2d images. In CVPR, 2023

work page 2023
[24]

Adam: A method for stochastic optimization

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2014

work page 2014
[25]

Auto-encoding variational bayes

Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In ICLR, 2014

work page 2014
[26]

Magic3d: High-resolution text-to-3d content creation

Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3d: High-resolution text-to-3d content creation. In CVPR, 2023 a

work page 2023
[29]

Nerf: Representing scenes as neural radiance fields for view synthesis

Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2021

work page 2021
[30]

Instant neural graphics primitives with a multiresolution hash encoding

Thomas M\"uller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph., 2022

work page 2022
[31]

Hologan: Unsupervised learning of 3d representations from natural images

Thu Nguyen-Phuoc, Chuan Li, Lucas Theis, Christian Richardt, and Yong-Liang Yang. Hologan: Unsupervised learning of 3d representations from natural images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019

work page 2019
[32]

Blockgan: Learning 3d object-aware scene representations from unlabelled images

Thu H Nguyen-Phuoc, Christian Richardt, Long Mai, Yongliang Yang, and Niloy Mitra. Blockgan: Learning 3d object-aware scene representations from unlabelled images. NeurIPS, 2020

work page 2020
[34]

Giraffe: Representing scenes as compositional generative neural feature fields

Michael Niemeyer and Andreas Geiger. Giraffe: Representing scenes as compositional generative neural feature fields. In CVPR, 2021

work page 2021
[36]

Barron, and Ben Mildenhall

Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. In ICLR, 2023

work page 2023
[37]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021

work page 2021
[39]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj \" o rn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022

work page 2022
[40]

Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation

Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In CVPR, 2023

work page 2023
[41]

Photorealistic text-to-image diffusion models with deep language understanding

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. NeurIPS, 2022

work page 2022
[42]

Improved techniques for training gans

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. NeurIPS, 2016

work page 2016
[43]

Laion-5b: An open large-scale dataset for training next generation image-text models

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. NeurIPS, 2022

work page 2022
[44]

Deep marching tetrahedra: a hybrid representation for high-resolution 3d shape synthesis

Tianchang Shen, Jun Gao, Kangxue Yin, Ming-Yu Liu, and Sanja Fidler. Deep marching tetrahedra: a hybrid representation for high-resolution 3d shape synthesis. In NeurIPS, 2021

work page 2021
[45]

3d neural field generation using triplane diffusion

J Ryan Shue, Eric Ryan Chan, Ryan Po, Zachary Ankner, Jiajun Wu, and Gordon Wetzstein. 3d neural field generation using triplane diffusion. In CVPR, 2023

work page 2023
[47]

Scene representation networks: Continuous 3d-structure-aware neural scene representations

Vincent Sitzmann, Michael Zollh \"o fer, and Gordon Wetzstein. Scene representation networks: Continuous 3d-structure-aware neural scene representations. NeurIPS, 32, 2019

work page 2019
[49]

Viewset diffusion:(0-) image-conditioned 3d generative models from 2d data, 2023

Stanislaw Szymanowicz, Christian Rupprecht, and Andrea Vedaldi. Viewset diffusion:(0-) image-conditioned 3d generative models from 2d data, 2023

work page 2023
[53]

Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation

Haochen Wang, Xiaodan Du, Jiahao Li, Raymond A Yeh, and Greg Shakhnarovich. Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In CVPR, 2023 a

work page 2023
[54]

Rodin: A generative model for sculpting 3d digital avatars using diffusion

Tengfei Wang, Bo Zhang, Ting Zhang, Shuyang Gu, Jianmin Bao, Tadas Baltrusaitis, Jingjing Shen, Dong Chen, Fang Wen, Qifeng Chen, et al. Rodin: A generative model for sculpting 3d digital avatars using diffusion. In CVPR, 2023 b

work page 2023
[56]

Novel view synthesis with diffusion models

Daniel Watson, William Chan, Ricardo Martin - Brualla, Jonathan Ho, Andrea Tagliasacchi, and Mohammad Norouzi. Novel view synthesis with diffusion models. In ICLR, 2023

work page 2023
[57]

Multiview compressive coding for 3d reconstruction

Chao-Yuan Wu, Justin Johnson, Jitendra Malik, Christoph Feichtenhofer, and Georgia Gkioxari. Multiview compressive coding for 3d reconstruction. In CVPR, 2023

work page 2023
[59]

Sparsefusion: Distilling view-conditioned diffusion for 3d reconstruction

Zhizhuo Zhou and Shubham Tulsiani. Sparsefusion: Distilling view-conditioned diffusion for 3d reconstruction. In CVPR, 2023

work page 2023
[60]

MetaHuman , howpublished =

work page
[61]

Proceedings of the 6th IEEE International Conference on Advanced Video and Signal based Surveillance (AVSS) for Security, Safety and Monitoring in Smart Environments , year=

A 3D Face Model for Pose and Illumination Invariant Face Recognition , author=. Proceedings of the 6th IEEE International Conference on Advanced Video and Signal based Surveillance (AVSS) for Security, Safety and Monitoring in Smart Environments , year=

work page
[62]

International Journal of Computer Vision , year=

Learning single-image 3d reconstruction by generative modelling of shape, pose and shading , author=. International Journal of Computer Vision , year=

work page
[63]

CVPR , year=

Leveraging 2d data to learn textured 3d mesh generation , author=. CVPR , year=

work page
[64]

CVPR , year=

Efficient geometry-aware 3D generative adversarial networks , author=. CVPR , year=

[65]

Get3d: A generative model of high quality 3d textured shapes learned from images. In NeurIPS.

[66]

Rodin: A generative model for sculpting 3d digital avatars using diffusion. In CVPR.

[67]

Holodiffusion: Training a 3D diffusion model using 2D images. In CVPR.

[68]

3d neural field generation using triplane diffusion. In CVPR.

[69]

Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. In ICLR.

[70]

Magic3d: High-resolution text-to-3d content creation. In CVPR.

[71]

Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV.

[72]

Auto-encoding variational bayes. In ICLR.

[73]

Hologan: Unsupervised learning of 3d representations from natural images. In Proceedings of the IEEE/CVF International Conference on Computer Vision.

[74]

Blockgan: Learning 3d object-aware scene representations from unlabelled images. In NeurIPS.

[75]

Giraffe: Representing scenes as compositional generative neural feature fields. In CVPR.

[76]

Lifting 2d stylegan for 3d-aware face generation. In CVPR.

[77]

Gram: Generative radiance manifolds for 3d-aware image generation. In CVPR.

[78]

Multiview compressive coding for 3D reconstruction. In CVPR.

[79]

Point-E: A System for Generating 3D Point Clouds from Complex Prompts. arXiv:2212.08751.

[80]

Shap-e: Generating conditional 3d implicit functions. arXiv:2305.02463.

[81]

Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In CVPR.

[82]

Stable Diffusion 2.1 base.

[83]

DreamTime: An Improved Optimization Strategy for Text-to-3D Content Creation. arXiv:2306.12422.

[84]

ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation. arXiv:2305.16213.

[85]

Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation. arXiv:2303.13873.

[86]

TextMesh: Generation of Realistic 3D Meshes From Text Prompts. arXiv:2304.12439.

[87]

Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In CVPR.

[88]

Dreambooth3d: Subject-driven text-to-3d generation. arXiv:2303.13508.

[89]

Zero-1-to-3: Zero-shot one image to 3d object. arXiv:2303.11328.

[90]

Mehdi S. M. Sajjadi, Henning Meyer, Etienne Pot, Urs Bergmann, Klaus Greff, Noha Radwan, Suhani Vora, Mario Lucic, Daniel Duckworth, Alexey Dosovitskiy, Jakob Uszkoreit, Thomas A. Funkhouser, and Andrea Tagliasacchi. In CVPR.

[91]

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising Diffusion Probabilistic Models. 2020.

[92]

Jascha Sohl. Deep Unsupervised Learning using Nonequilibrium Thermodynamics. 2015.

[93]

Yang Song and Stefano Ermon. Generative Modeling by Estimating Gradients of the Data Distribution. 2019.

[94]

Prafulla Dhariwal and Alexander Quinn Nichol. Diffusion Models Beat GANs on Image Synthesis. 2021.

[95]

Jonathan Ho, Chitwan Saharia, William Chan, David J. Fleet, Mohammad Norouzi, and Tim Salimans. J. Mach. Learn. Res., 2022.

[96]

Andreas Lugmayr and Martin Danelljan and Andr. RePaint: Inpainting using Denoising Diffusion Probabilistic Models.

[97]

Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J. Fleet, and Mohammad Norouzi.
