pith. machine review for the scientific record.

arxiv: 2404.07191 · v2 · submitted 2024-04-10 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 21:10 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D mesh generation · single image · feed-forward · multiview diffusion · large reconstruction models · iso-surface extraction · sparse-view reconstruction

The pith

InstantMesh generates high-quality 3D meshes from a single image in under 10 seconds by combining multiview diffusion with sparse-view large reconstruction models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents InstantMesh as a feed-forward framework that turns a single image into a 3D mesh in seconds. It first creates multiple views of the object with a multiview diffusion model, then reconstructs the 3D mesh with a sparse-view model based on the LRM architecture. A differentiable iso-surface extraction module lets the system optimize the mesh directly with supervision from depths and normals, improving training efficiency. The result is higher-quality 3D assets generated in about 10 seconds, compared with other recent methods.
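To make the flow concrete, here is a minimal sketch of the two-stage pipeline as described above. All class and function names are illustrative placeholders rather than the released InstantMesh API; only the stage ordering follows the paper.

```python
def generate_mesh(image, diffusion_model, reconstructor):
    """Single image -> 3D mesh, feed-forward (no per-shape optimization)."""
    # Stage 1: hallucinate a fixed set of consistent novel views with an
    # off-the-shelf multiview diffusion model.
    views = diffusion_model.sample(image)      # e.g. a handful of posed RGB views

    # Stage 2: a sparse-view LRM-style reconstructor predicts an implicit
    # field from those views in one forward pass, and a differentiable
    # iso-surface module turns the field into an explicit mesh.
    field = reconstructor.predict_field(views)
    return reconstructor.extract_mesh(field)   # differentiable iso-surface extraction
```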

Core claim

The central claim is that InstantMesh can generate diverse and high-quality 3D meshes from a single image in a feed-forward manner within 10 seconds. It does so by synergizing an off-the-shelf multiview diffusion model with a sparse-view reconstruction model based on the LRM architecture and integrating a differentiable iso-surface extraction module to optimize directly on the mesh representation using additional geometric supervisions.

What carries the argument

The InstantMesh framework that synergizes a multiview diffusion model for generating consistent views, an LRM-based sparse-view reconstructor, and a differentiable iso-surface extraction module for direct mesh optimization.
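Because the iso-surface extraction is differentiable, losses on rendered depth and normal maps can flow back into the reconstructor. A minimal sketch of such a mesh-level objective follows; the loss weights and dict layout are placeholders, not the paper's values.

```python
import torch.nn.functional as F

def mesh_supervision_loss(pred, gt, w_rgb=1.0, w_depth=0.5, w_normal=0.5):
    """Hypothetical training objective combining image, depth, and normal terms.

    `pred` and `gt` are dicts of tensors rendered from the predicted mesh and
    the ground-truth views; the weights are illustrative only.
    """
    loss_rgb = F.l1_loss(pred["rgb"], gt["rgb"])
    loss_depth = F.l1_loss(pred["depth"], gt["depth"])
    # Normals are unit vectors, so penalize angular deviation.
    loss_normal = (1.0 - F.cosine_similarity(pred["normal"], gt["normal"], dim=-1)).mean()
    return w_rgb * loss_rgb + w_depth * loss_depth + w_normal * loss_normal
```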

If this is right

  • 3D assets can be created from single images much faster than with previous optimization-based methods.
  • Training can leverage more geometric information such as depths and normals for better results.
  • The generated meshes show improved consistency and quality over other image-to-3D approaches.
  • Open-sourcing the code, weights, and demo enables broader use in 3D generative AI applications.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • This combination might allow extension to text-to-3D by using text-to-image models as the first step.
  • The speed opens possibilities for interactive 3D content creation in design tools or games.
  • Future work could explore end-to-end training of the entire pipeline instead of using off-the-shelf components.
  • The approach may generalize to other modalities like generating 3D from sketches or partial images.

Load-bearing premise

The assumption that an off-the-shelf multiview diffusion model and an LRM-based sparse-view reconstructor can be combined with a differentiable iso-surface module to produce consistently high-quality meshes without major artifacts or view inconsistencies.

What would settle it

Observing whether the generated meshes from single images exhibit visible artifacts, view inconsistencies, or take longer than 10 seconds to produce on standard hardware, when tested against ground-truth 3D data.
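One way to operationalize that test, sketched under the assumption that meshes are compared as sampled point clouds (a common convention; some evaluations use squared distances instead):

```python
import time
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(pred_pts: np.ndarray, gt_pts: np.ndarray) -> float:
    """Symmetric Chamfer distance between (N, 3) point sets sampled from the
    predicted and ground-truth meshes; lower is better."""
    d_pg = cKDTree(gt_pts).query(pred_pts)[0]   # nearest-neighbor distances pred -> gt
    d_gp = cKDTree(pred_pts).query(gt_pts)[0]   # and gt -> pred
    return float(d_pg.mean() + d_gp.mean())

def timed_generation(pipeline, image):
    """Wall-clock the end-to-end speed claim; `pipeline` is any image -> mesh callable."""
    start = time.perf_counter()
    mesh = pipeline(image)
    return mesh, time.perf_counter() - start
```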

read the original abstract

We present InstantMesh, a feed-forward framework for instant 3D mesh generation from a single image, featuring state-of-the-art generation quality and significant training scalability. By synergizing the strengths of an off-the-shelf multiview diffusion model and a sparse-view reconstruction model based on the LRM architecture, InstantMesh is able to create diverse 3D assets within 10 seconds. To enhance the training efficiency and exploit more geometric supervisions, e.g., depths and normals, we integrate a differentiable iso-surface extraction module into our framework and directly optimize on the mesh representation. Experimental results on public datasets demonstrate that InstantMesh significantly outperforms other latest image-to-3D baselines, both qualitatively and quantitatively. We release all the code, weights, and demo of InstantMesh, with the intention that it can make substantial contributions to the community of 3D generative AI and empower both researchers and content creators.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper presents InstantMesh, a feed-forward pipeline for single-image 3D mesh generation that combines an off-the-shelf multiview diffusion model with a sparse-view LRM-based reconstructor and a differentiable iso-surface extraction module. It claims to produce diverse, high-quality meshes in under 10 seconds while outperforming recent image-to-3D baselines both qualitatively and quantitatively on public datasets, with all code, weights, and a demo released.

Significance. If the quantitative tables and ablation results hold, the work offers a practical advance in efficient single-image 3D asset creation by leveraging pretrained components plus direct mesh optimization, achieving usable speed and quality for content-creation applications. The open release of code and weights is a clear strength that supports reproducibility and community follow-up.

major comments (2)
  1. [§4.2, Table 2] The reported Chamfer distance and normal consistency gains over Zero123++ and LRM are substantial, but the table does not report variance across multiple random seeds or test splits; without this, it is difficult to assess whether the claimed outperformance is statistically reliable for the central speed-quality claim.
  2. [§3.3] The differentiable iso-surface extraction is presented as key to exploiting depth/normal supervision, yet no ablation isolates its contribution versus simply using the LRM output directly; this leaves open whether the module is load-bearing for the reported mesh quality or mainly an implementation detail.
minor comments (3)
  1. [Figure 3] The caption does not state the view angles used for the qualitative comparison, making the exact visual results hard to reproduce.
  2. [§2] The related-work discussion of LRM variants is concise but omits the precise architectural differences (e.g., triplane resolution, attention layers) between the adopted LRM and the original LRM paper; a short table would improve clarity.
  3. [Eq. (5)] The weighting between the diffusion loss and the mesh supervision terms is given numerically without justification or sensitivity analysis; a brief note on how these weights were chosen would help (a sweep like the sketch after this list would suffice).
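The sensitivity analysis requested in minor comment 3 could be as simple as a grid sweep over the loss weights. In the sketch below, the grid values and the `train_and_eval` callable are placeholders, not anything specified in the paper:

```python
import itertools

def weight_sensitivity_sweep(train_and_eval,
                             w_depth_grid=(0.1, 0.5, 1.0),
                             w_normal_grid=(0.1, 0.5, 1.0)):
    """Retrain (or fine-tune) under each weighting and record the resulting
    metric; reporting this table would answer the referee's request."""
    results = {}
    for w_d, w_n in itertools.product(w_depth_grid, w_normal_grid):
        results[(w_d, w_n)] = train_and_eval(w_depth=w_d, w_normal=w_n)
    return results
```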

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of InstantMesh and the constructive comments. We address each major point below and will incorporate revisions to strengthen the statistical reporting and ablation analysis.

read point-by-point responses
  1. Referee: [§4.2, Table 2] The reported Chamfer distance and normal consistency gains over Zero123++ and LRM are substantial, but the table does not report variance across multiple random seeds or test splits; without this, it is difficult to assess whether the claimed outperformance is statistically reliable for the central speed-quality claim.

    Authors: We agree that reporting variance would strengthen the reliability assessment. In the revised manuscript, we will augment Table 2 with standard deviations computed over multiple random seeds (e.g., 3–5 runs) for the key metrics on the same test splits. We will also explicitly state the fixed evaluation protocol and data splits used, ensuring the outperformance claims are presented with appropriate statistical context; the first sketch after these responses illustrates the protocol. revision: yes

  2. Referee: [§3.3] The differentiable iso-surface extraction is presented as key to exploiting depth/normal supervision, yet no ablation isolates its contribution versus simply using the LRM output directly; this leaves open whether the module is load-bearing for the reported mesh quality or mainly an implementation detail.

    Authors: We thank the referee for this suggestion. While the module enables direct mesh optimization with depth/normal losses (as described in §3.3), we acknowledge that an explicit ablation would better isolate its contribution. In the revision, we will add a new ablation study comparing the full InstantMesh pipeline against a baseline that uses the LRM output directly (without differentiable iso-surface extraction and mesh optimization); the second sketch after these responses illustrates the comparison. This will quantify the module’s impact on final mesh quality metrics. revision: yes
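A minimal sketch of the multi-seed protocol promised in response 1; `evaluate` stands in for the paper's metric computation on the fixed test split:

```python
import numpy as np

def seeded_metrics(evaluate, seeds=(0, 1, 2, 3, 4)):
    """Run the fixed evaluation under several random seeds and report
    mean and sample standard deviation, as promised for Table 2."""
    runs = np.array([evaluate(seed=s) for s in seeds])   # shape: (num_seeds, num_metrics)
    return runs.mean(axis=0), runs.std(axis=0, ddof=1)
```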
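And a sketch of the ablation promised in response 2, holding data and metrics fixed while toggling only the differentiable iso-surface stage; both pipeline callables are placeholders, not the released code's API:

```python
def isosurface_ablation(test_set, metric, full_pipeline, lrm_direct):
    """Compare the full pipeline (with differentiable iso-surface mesh
    optimization) against a baseline that meshes the raw LRM field, e.g.
    via plain marching cubes, on identical inputs and metrics."""
    full_scores = [metric(full_pipeline(x)) for x in test_set]
    base_scores = [metric(lrm_direct(x)) for x in test_set]
    return (sum(full_scores) / len(full_scores),
            sum(base_scores) / len(base_scores))
```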

Circularity Check

0 steps flagged

No significant circularity; pipeline combines external pretrained models with independent module

full rationale

The manuscript describes an engineering pipeline that fuses an off-the-shelf multiview diffusion model with an LRM-based sparse-view reconstructor and a new differentiable iso-surface extraction module. All quantitative claims are supported by training details, loss formulations, and evaluations on public datasets that do not reduce to quantities fitted on the same test data. No equations or central claims are shown to be equivalent to their inputs by construction, and self-citations (if present) are not load-bearing for the performance assertions.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The framework depends on the quality of two external pretrained models and on the assumption that depth and normal supervision during mesh optimization will improve geometry without introducing new artifacts; no new physical entities are postulated.

free parameters (1)
  • training hyperparameters for mesh optimization
    Typical loss weights and learning rates for the differentiable iso-surface module are not specified in the abstract.
axioms (2)
  • domain assumption Off-the-shelf multiview diffusion model produces sufficiently consistent and accurate novel views for downstream reconstruction
    The method treats the diffusion model as a black-box input generator whose outputs are reliable enough for the LRM stage.
  • domain assumption Sparse-view LRM architecture can be fine-tuned with mesh-level losses to produce watertight meshes
    The paper assumes the LRM backbone can be adapted to output mesh-compatible representations via the added iso-surface module.

pith-pipeline@v0.9.0 · 5471 in / 1531 out tokens · 49920 ms · 2026-05-13T21:10:38.284278+00:00 · methodology


Forward citations

Cited by 29 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow

    cs.CV 2026-05 unverdicted novelty 7.0

    R-DMesh generates high-fidelity 4D meshes aligned to video by disentangling base mesh, motion, and a learned rectification jump offset inside a VAE, then using Triflow Attention and rectified-flow diffusion.

  2. MiXR: Harvesting and Recomposing Geometry from Real-World Objects for In-Situ 3D Design

    cs.HC 2026-05 unverdicted novelty 7.0

    MiXR enables in-situ 3D design by harvesting real-world geometry for user-defined compositions that generative AI then refines, outperforming text-only generative methods in control and fidelity per a 12-person study.

  3. MeshFIM: Local Low-Poly Mesh Editing via Fill-in-the-Middle Autoregressive Generation

    cs.GR 2026-05 unverdicted novelty 7.0

    MeshFIM enables local low-poly mesh editing by autoregressively filling target regions conditioned on context, using boundary markers, positional embeddings, and a gated geometry encoder to enforce attachment, topolog...

  4. Mix3R: Mixing Feed-forward Reconstruction and Generative 3D Priors for Joint Multi-view Aligned 3D Reconstruction and Pose Estimation

    cs.CV 2026-05 unverdicted novelty 7.0

    Mix3R mixes feed-forward reconstruction and generative 3D priors via Mixture-of-Transformers and overlap-based attention bias to achieve better-aligned 3D shapes and more accurate poses than either approach alone.

  5. AmaraSpatial-10K: A Spatially and Semantically Aligned 3D Dataset for Spatial Computing and Embodied AI

    cs.CV 2026-04 unverdicted novelty 7.0

    AmaraSpatial-10K is a new dataset of over 10,000 metric-scaled and semantically anchored 3D assets that achieves 3.4 times higher text retrieval precision than Objaverse for embodied AI and spatial computing.

  6. DeVI: Physics-based Dexterous Human-Object Interaction via Synthetic Video Imitation

    cs.CV 2026-04 unverdicted novelty 7.0

    DeVI enables zero-shot physically plausible dexterous control by imitating synthetic videos via a hybrid 3D-human plus 2D-object tracking reward.

  7. Free-Range Gaussians: Non-Grid-Aligned Generative 3D Gaussian Reconstruction

    cs.CV 2026-04 unverdicted novelty 7.0

    Free-Range Gaussians uses flow matching over Gaussian parameters to predict non-grid-aligned 3D Gaussians from multi-view images, enabling synthesis of plausible content in unobserved regions with fewer primitives tha...

  8. THOM: Generating Physically Plausible Hand-Object Meshes From Text

    cs.CV 2026-04 unverdicted novelty 7.0

    THOM is a training-free two-stage framework that generates physically plausible hand-object 3D meshes directly from text by combining text-guided Gaussians with contact-aware physics optimization and VLM refinement.

  9. Sat3DGen: Comprehensive Street-Level 3D Scene Generation from Single Satellite Image

    cs.CV 2026-05 unverdicted novelty 6.0

    Sat3DGen improves geometric RMSE from 6.76m to 5.20m and FID from ~40 to 19 for street-level 3D generation from satellite images via geometry-centric constraints and perspective training.

  10. Real2Sim in HOI: Toward Physically Plausible HOI Reconstruction from Monocular Videos

    cs.CV 2026-05 unverdicted novelty 6.0

    HA-HOI produces physically plausible 4D HOI animations from monocular videos by anchoring object reconstruction to human motion and refining the result in a physics-based humanoid-object simulator.

  11. OneViewAll: Semantic Prior Guided One-View 6D Pose Estimation for Novel Objects

    cs.CV 2026-05 unverdicted novelty 6.0

    OneViewAll achieves 92.5% ADD-0.1 accuracy on LINEMOD for novel object 6D pose estimation using only one real reference view by integrating category, symmetry, and patch-level semantic priors in a projection-equivaria...

  12. PhysForge: Generating Physics-Grounded 3D Assets for Interactive Virtual World

    cs.CV 2026-05 unverdicted novelty 6.0

    PhysForge generates physics-grounded 3D assets via a VLM-planned Hierarchical Physical Blueprint and a KineVoxel Injection diffusion model, backed by the new PhysDB dataset of 150,000 annotated assets.

  13. 3D-ReGen: A Unified 3D Geometry Regeneration Framework

    cs.CV 2026-04 unverdicted novelty 6.0

    3D-ReGen is a conditioned 3D regenerator using VecSet that learns a regeneration prior from unlabeled 3D datasets via self-supervised tasks and achieves state-of-the-art results on controllable 3D geometry tasks.

  14. REVIVE 3D: Refinement via Encoded Voluminous Inflated prior for Volume Enhancement

    cs.CV 2026-04 unverdicted novelty 6.0

    REVIVE 3D generates voluminous 3D assets from flat 2D images via an inflated prior construction followed by latent-space refinement, plus new metrics for volume and flatness validated by user study.

  15. Reconstruction by Generation: 3D Multi-Object Scene Reconstruction from Sparse Observations

    cs.CV 2026-04 unverdicted novelty 6.0

    RecGen achieves state-of-the-art 3D multi-object scene reconstruction from sparse RGB-D views by combining compositional synthetic scene generation with strong 3D shape priors, outperforming SAM3D by 30%+ in shape qua...

  16. Repurposing 3D Generative Model for Autoregressive Layout Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    LaviGen turns 3D generative models into an autoregressive layout generator that models geometric and physical constraints, delivering 19% higher physical plausibility and 65% faster inference on the LayoutVLM benchmark.

  17. ReplicateAnyScene: Zero-Shot Video-to-3D Composition via Textual-Visual-Spatial Alignment

    cs.CV 2026-04 unverdicted novelty 6.0

    ReplicateAnyScene performs fully automated zero-shot video-to-compositional-3D reconstruction by cascading alignments of generic priors from vision foundation models across textual, visual, and spatial dimensions.

  18. Grasp as You Dream: Imitating Functional Grasping from Generated Human Demonstrations

    cs.RO 2026-04 unverdicted novelty 6.0

    GraspDreamer synthesizes human functional grasping demonstrations with visual generative models to enable zero-shot robot grasping with improved data efficiency and generalization.

  19. UniRecGen: Unifying Multi-View 3D Reconstruction and Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    UniRecGen unifies reconstruction and generation via shared canonical space and disentangled cooperative learning to produce complete, consistent 3D models from sparse views.

  20. SegviGen: Repurposing 3D Generative Model for Part Segmentation

    cs.CV 2026-03 unverdicted novelty 6.0

    SegviGen shows pretrained 3D generative models can be repurposed for part segmentation via voxel colorization, beating prior methods by 40% interactively and 15% on full segmentation using only 0.32% of labeled data.

  21. R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow

    cs.CV 2026-05 unverdicted novelty 5.0

    R-DMesh uses a VAE with a learned rectification jump offset and Triflow Attention inside a rectified-flow diffusion transformer to produce video-aligned 4D meshes despite initial pose misalignment.

  22. Pose-Aware Diffusion for 3D Generation

    cs.CV 2026-05 unverdicted novelty 5.0

    PAD synthesizes 3D geometry in observation space via depth unprojection as anchor to eliminate pose ambiguity in image-to-3D generation.

  23. From Visual Synthesis to Interactive Worlds: Toward Production-Ready 3D Asset Generation

    cs.GR 2026-04 unverdicted novelty 5.0

    The paper surveys 3D asset generation methods and organizes them around the full production pipeline to assess which outputs meet engine-level requirements for interactive applications.

  24. AmaraSpatial-10K: A Spatially and Semantically Aligned 3D Dataset for Spatial Computing and Embodied AI

    cs.CV 2026-04 unverdicted novelty 5.0

    AmaraSpatial-10K supplies 10K deployment-ready 3D assets with metric scaling and metadata, delivering 3.4x higher CLIP Recall@5 than Objaverse and 99.1% physics stability in Habitat-Sim.

  25. Unposed-to-3D: Learning Simulation-Ready Vehicles from Real-World Images

    cs.CV 2026-04 unverdicted novelty 5.0

    Unposed-to-3D learns simulation-ready 3D vehicle models from unposed real images by predicting camera parameters for photometric self-supervision, then adding scale prediction and harmonization.

  26. UniMesh: Unifying 3D Mesh Understanding and Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    UniMesh unifies 3D mesh generation and understanding in one model via a Mesh Head interface, Chain of Mesh iterative editing, and an Actor-Evaluator self-reflection loop.

  27. MeshOn: Intersection-Free Mesh-to-Mesh Composition

    cs.GR 2026-04 unverdicted novelty 5.0

    MeshOn composes two input meshes realistically without intersections by using VLM-based rigid initialization, attractive geometric losses, a barrier loss, and a diffusion prior for final deformation.

  28. AnimateAnyMesh++: A Flexible 4D Foundation Model for High-Fidelity Text-Driven Mesh Animation

    cs.CV 2026-04 unverdicted novelty 4.0

    AnimateAnyMesh++ animates arbitrary 3D meshes from text using an expanded 300K-identity DyMesh-XL dataset, a power-law topology-aware DyMeshVAE-Flex, and a variable-length rectified-flow generator to produce semantica...

  29. From Visual Synthesis to Interactive Worlds: Toward Production-Ready 3D Asset Generation

    cs.GR 2026-04 unverdicted novelty 4.0

    The paper surveys 3D content generation literature using a taxonomy of asset types and production stages to evaluate progress toward engine-ready assets.

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · cited by 26 Pith papers · 3 internal anchors

  1. [1] Antonio Alliegro, Yawar Siddiqui, Tatiana Tommasi, and Matthias Nießner. Polydiff: Generating 3D polygonal meshes with diffusion models. arXiv preprint arXiv:2312.11417, 2023.

  2. [2] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable Video Diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.

  3. [3] Rui Chen, Yongwei Chen, Ningxin Jiao, and Kui Jia. Fantasia3D: Disentangling geometry and appearance for high-quality text-to-3D content creation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22246–22256, 2023.

  4. [4] Zhiqin Chen, Andrea Tagliasacchi, and Hao Zhang. BSP-Net: Generating compact meshes via binary space partitioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 45–54, 2020.

  5. [5] Zilong Chen, Yikai Wang, Feng Wang, Zhengyi Wang, and Huaping Liu. V3D: Video diffusion models are effective 3D generators. arXiv preprint arXiv:2403.06738, 2024.

  6. [6] Yen-Chi Cheng, Hsin-Ying Lee, Sergey Tulyakov, Alexander G. Schwing, and Liang-Yan Gui. SDFusion: Multimodal 3D shape completion, reconstruction, and generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4456–4465, 2023.

  7. [7] Gene Chou, Yuval Bahat, and Felix Heide. Diffusion-SDF: Conditional generative modeling of signed distance functions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2262–2272, 2023.

  8. [8] Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3D objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13142–13153, 2023.

  9. [9] Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, et al. Objaverse-XL: A universe of 10M+ 3D objects. Advances in Neural Information Processing Systems, 36, 2024.

  10. [10] Laura Downs, Anthony Francis, Nate Koenig, Brandon Kinman, Ryan Hickman, Krista Reymann, Thomas B. McHugh, and Vincent Vanhoucke. Google Scanned Objects: A high-quality dataset of 3D scanned household items. In 2022 International Conference on Robotics and Automation (ICRA), pages 2553–2560. IEEE, 2022.

  11. [11] Anchit Gupta, Wenhan Xiong, Yixin Nie, Ian Jones, and Barlas Oğuz. 3DGen: Triplane latent diffusion for textured mesh generation. arXiv preprint arXiv:2303.05371, 2023.

  12. [12] Junlin Han, Filippos Kokkinos, and Philip Torr. VFusion3D: Learning scalable 3D generative models from video diffusion models. arXiv preprint arXiv:2403.12034, 2024.

  13. [13] Zexin He and Tengfei Wang. OpenLRM: Open-source large reconstruction models. https://github.com/3DTopia/OpenLRM, 2023.

  14. [14] Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. LRM: Large reconstruction model for single image to 3D. In The Twelfth International Conference on Learning Representations, 2024.

  15. [15] Ajay Jain, Ben Mildenhall, Jonathan T. Barron, Pieter Abbeel, and Ben Poole. Zero-shot text-guided object generation with Dream Fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 867–876, 2022.

  16. [16] Heewoo Jun and Alex Nichol. Shap-E: Generating conditional 3D implicit functions. arXiv preprint arXiv:2305.02463, 2023.

  17. [17] Yash Kant, Ziyi Wu, Michael Vasilkovsky, Guocheng Qian, Jian Ren, Riza Alp Guler, Bernard Ghanem, Sergey Tulyakov, Igor Gilitschenski, and Aliaksandr Siarohin. SPAD: Spatially aware multiview diffusers. arXiv preprint arXiv:2402.05235, 2024.

  18. [18] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D Gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4):1–14, 2023.

  19. [19] Jiahao Li, Hao Tan, Kai Zhang, Zexiang Xu, Fujun Luan, Yinghao Xu, Yicong Hong, Kalyan Sunkavalli, Greg Shakhnarovich, and Sai Bi. Instant3D: Fast text-to-3D with sparse-view generation and large reconstruction model. In The Twelfth International Conference on Learning Representations, 2024.

  20. [20] Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3D: High-resolution text-to-3D content creation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 300–309, 2023.

  21. [21] Minghua Liu, Ruoxi Shi, Linghao Chen, Zhuoyang Zhang, Chao Xu, Xinyue Wei, Hansheng Chen, Chong Zeng, Jiayuan Gu, and Hao Su. One-2-3-45++: Fast single image to 3D objects with consistent multi-view generation and 3D diffusion. arXiv preprint arXiv:2311.07885, 2023.

  22. [22] Minghua Liu, Chao Xu, Haian Jin, Linghao Chen, Mukund Varma T, Zexiang Xu, and Hao Su. One-2-3-45: Any single image to 3D mesh in 45 seconds without per-shape optimization. Advances in Neural Information Processing Systems, 36, 2024.

  23. [23] Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3D object. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9298–9309, 2023.

  24. [24] Yuan Liu, Cheng Lin, Zijiao Zeng, Xiaoxiao Long, Lingjie Liu, Taku Komura, and Wenping Wang. SyncDreamer: Generating multiview-consistent images from a single-view image. arXiv preprint arXiv:2309.03453, 2023.

  25. [25] Zhen Liu, Yao Feng, Michael J. Black, Derek Nowrouzezahrai, Liam Paull, and Weiyang Liu. MeshDiffusion: Score-based generative 3D mesh modeling. In The Eleventh International Conference on Learning Representations, 2023.

  26. [26] Xiaoxiao Long, Yuan-Chen Guo, Cheng Lin, Yuan Liu, Zhiyang Dou, Lingjie Liu, Yuexin Ma, Song-Hai Zhang, Marc Habermann, Christian Theobalt, et al. Wonder3D: Single image to 3D using cross-domain diffusion. arXiv preprint arXiv:2310.15008, 2023.

  27. [27] Luke Melas-Kyriazi, Christian Rupprecht, and Andrea Vedaldi. PC2: Projection-conditioned point cloud diffusion for single-image 3D reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12923–12932, 2023.

  28. [28] Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3D reconstruction in function space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4460–4470, 2019.

  29. [29] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.

  30. [30] Norman Müller, Yawar Siddiqui, Lorenzo Porzi, Samuel Rota Bulò, Peter Kontschieder, and Matthias Nießner. DiffRF: Rendering-guided 3D radiance field diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4328–4338, 2023.

  31. [31] Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Mishkin, and Mark Chen. Point-E: A system for generating 3D point clouds from complex prompts. arXiv preprint arXiv:2212.08751, 2022.

  32. [32] Michael Niemeyer, Lars Mescheder, Michael Oechsle, and Andreas Geiger. Differentiable volumetric rendering: Learning implicit 3D representations without 3D supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3504–3515, 2020.

  33. [33] Junyi Pan, Xiaoguang Han, Weikai Chen, Jiapeng Tang, and Kui Jia. Deep mesh reconstruction from single RGB images via topology modification networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9964–9973, 2019.

  34. [34] Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. DreamFusion: Text-to-3D using 2D diffusion. In The Eleventh International Conference on Learning Representations, 2023.

  35. [35] Guocheng Qian, Jinjie Mai, Abdullah Hamdi, Jian Ren, Aliaksandr Siarohin, Bing Li, Hsin-Ying Lee, Ivan Skorokhodov, Peter Wonka, Sergey Tulyakov, et al. Magic123: One image to high-quality 3D object generation using both 2D and 3D diffusion priors. arXiv preprint arXiv:2306.17843, 2023.

  36. [36] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

  37. [37] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.

  38. [38] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L. Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.

  39. [39] Tianchang Shen, Jun Gao, Kangxue Yin, Ming-Yu Liu, and Sanja Fidler. Deep marching tetrahedra: A hybrid representation for high-resolution 3D shape synthesis. Advances in Neural Information Processing Systems, 34:6087–6101, 2021.

  40. [40] Tianchang Shen, Jacob Munkberg, Jon Hasselgren, Kangxue Yin, Zian Wang, Wenzheng Chen, Zan Gojcic, Sanja Fidler, Nicholas Sharp, and Jun Gao. Flexible isosurface extraction for gradient-based mesh optimization. ACM Transactions on Graphics (TOG), 42(4):1–16, 2023.

  41. [41] Ruoxi Shi, Hansheng Chen, Zhuoyang Zhang, Minghua Liu, Chao Xu, Xinyue Wei, Linghao Chen, Chong Zeng, and Hao Su. Zero123++: A single image to consistent multi-view diffusion base model. arXiv preprint arXiv:2310.15110, 2023.

  42. [42] Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, and Xiao Yang. MVDream: Multi-view diffusion for 3D generation. arXiv preprint arXiv:2308.16512, 2023.

  43. [43] Jaehyeok Shim, Changwoo Kang, and Kyungdon Joo. Diffusion-based signed distance fields for 3D shape generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20887–20897, 2023.

  44. [44] Jiaxiang Tang, Zhaoxi Chen, Xiaokang Chen, Tengfei Wang, Gang Zeng, and Ziwei Liu. LGM: Large multi-view Gaussian model for high-resolution 3D content creation. arXiv preprint arXiv:2402.05054, 2024.

  45. [45] Dmitry Tochilkin, David Pankratz, Zexiang Liu, Zixuan Huang, Adam Letts, Yangguang Li, Ding Liang, Christian Laforte, Varun Jampani, and Yan-Pei Cao. TripoSR: Fast 3D object reconstruction from a single image. arXiv preprint arXiv:2403.02151, 2024.

  46. [46] Michał J. Tyszkiewicz, Pascal Fua, and Eduard Trulls. GECCO: Geometrically-conditioned point diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2128–2138, 2023.

  47. [47] Vikram Voleti, Chun-Han Yao, Mark Boss, Adam Letts, David Pankratz, Dmitry Tochilkin, Christian Laforte, Robin Rombach, and Varun Jampani. SV3D: Novel multi-view synthesis and 3D generation from a single image using latent video diffusion. arXiv preprint arXiv:2403.12008, 2024.

  48. [48] Haochen Wang, Xiaodan Du, Jiahao Li, Raymond A. Yeh, and Greg Shakhnarovich. Score Jacobian chaining: Lifting pretrained 2D diffusion models for 3D generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12619–12629, 2023.

  49. [49] Nanyang Wang, Yinda Zhang, Zhuwen Li, Yanwei Fu, Wei Liu, and Yu-Gang Jiang. Pixel2Mesh: Generating 3D mesh models from single RGB images. In Proceedings of the European Conference on Computer Vision (ECCV), pages 52–67, 2018.

  50. [50] Peng Wang and Yichun Shi. ImageDream: Image-prompt multi-view diffusion for 3D generation. arXiv preprint arXiv:2312.02201, 2023.

  51. [51] Peng Wang, Hao Tan, Sai Bi, Yinghao Xu, Fujun Luan, Kalyan Sunkavalli, Wenping Wang, Zexiang Xu, and Kai Zhang. PF-LRM: Pose-free large reconstruction model for joint pose and shape prediction. In The Twelfth International Conference on Learning Representations, 2024.

  52. [52] Tengfei Wang, Bo Zhang, Ting Zhang, Shuyang Gu, Jianmin Bao, Tadas Baltrusaitis, Jingjing Shen, Dong Chen, Fang Wen, Qifeng Chen, et al. Rodin: A generative model for sculpting 3D digital avatars using diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4563–4573, 2023.

  53. [53] Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. ProlificDreamer: High-fidelity and diverse text-to-3D generation with variational score distillation. Advances in Neural Information Processing Systems, 36, 2024.

  54. [54] Zhengyi Wang, Yikai Wang, Yifei Chen, Chendong Xiang, Shuo Chen, Dajiang Yu, Chongxuan Li, Hang Su, and Jun Zhu. CRM: Single image to 3D textured mesh with convolutional reconstruction model. arXiv preprint arXiv:2403.05034, 2024.

  55. [55] Tong Wu, Jiarui Zhang, Xiao Fu, Yuxin Wang, Jiawei Ren, Liang Pan, Wayne Wu, Lei Yang, Jiaqi Wang, Chen Qian, et al. OmniObject3D: Large-vocabulary 3D object dataset for realistic perception, reconstruction and generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 803–814, 2023.

  56. [56] Zijie Wu, Yaonan Wang, Mingtao Feng, He Xie, and Ajmal Mian. Sketch and text guided diffusion model for colored point cloud generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8929–8939, 2023.

  57. [57] Dejia Xu, Yifan Jiang, Peihao Wang, Zhiwen Fan, Yi Wang, and Zhangyang Wang. NeuralLift-360: Lifting an in-the-wild 2D photo to a 3D object with 360° views. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4479–4489, 2023.

  58. [58] Jiale Xu, Xintao Wang, Weihao Cheng, Yan-Pei Cao, Ying Shan, Xiaohu Qie, and Shenghua Gao. Dream3D: Zero-shot text-to-3D synthesis using 3D shape prior and text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20908–20918, 2023.

  59. [59] Yinghao Xu, Zifan Shi, Wang Yifan, Hansheng Chen, Ceyuan Yang, Sida Peng, Yujun Shen, and Gordon Wetzstein. GRM: Large Gaussian reconstruction model for efficient 3D reconstruction and generation. arXiv preprint arXiv:2403.14621, 2024.

  60. [60] Yinghao Xu, Hao Tan, Fujun Luan, Sai Bi, Peng Wang, Jiahao Li, Zifan Shi, Kalyan Sunkavalli, Gordon Wetzstein, Zexiang Xu, and Kai Zhang. DMV3D: Denoising multi-view diffusion using 3D large reconstruction model. In The Twelfth International Conference on Learning Representations, 2024.

  61. [61] Biao Zhang, Jiapeng Tang, Matthias Nießner, and Peter Wonka. 3DShape2VecSet: A 3D shape representation for neural fields and generative diffusion models. ACM Transactions on Graphics (TOG), 42(4):1–16, 2023.

  62. [62] Xin-Yang Zheng, Hao Pan, Peng-Shuai Wang, Xin Tong, Yang Liu, and Heung-Yeung Shum. Locally attentional SDF diffusion for controllable 3D shape generation. ACM Transactions on Graphics (TOG), 42(4):1–13, 2023.

  63. [63] Xin-Yang Zheng, Hao Pan, Yu-Xiao Guo, Xin Tong, and Yang Liu. MVD2: Efficient multiview 3D reconstruction for multiview diffusion. arXiv preprint arXiv:2402.14253, 2024.

  64. [64] Linqi Zhou, Yilun Du, and Jiajun Wu. 3D shape generation and completion through point-voxel diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5826–5835, 2021.

  65. [65] Zi-Xin Zou, Zhipeng Yu, Yuan-Chen Guo, Yangguang Li, Ding Liang, Yan-Pei Cao, and Song-Hai Zhang. Triplane meets Gaussian splatting: Fast and generalizable single-view 3D reconstruction with transformers. arXiv preprint arXiv:2312.09147, 2023.

  66. [66] Qi Zuo, Xiaodong Gu, Lingteng Qiu, Yuan Dong, Zhengyi Zhao, Weihao Yuan, Rui Peng, Siyu Zhu, Zilong Dong, Liefeng Bo, et al. VideoMV: Consistent multi-view generation based on large video generative model. arXiv preprint arXiv:2403.12010, 2024.