pith. machine review for the scientific record.

arxiv: 2404.07191 · v2 · submitted 2024-04-10 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 21:10 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D mesh generation · single image · feed-forward · multiview diffusion · large reconstruction models · iso-surface extraction · sparse-view reconstruction

The pith

InstantMesh generates high-quality 3D meshes from a single image in under 10 seconds by combining multiview diffusion with sparse-view large reconstruction models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents InstantMesh as a feed-forward framework that turns a single image into a 3D mesh in seconds. It first creates multiple views of the object with a multiview diffusion model, then reconstructs the 3D mesh with a sparse-view model based on the LRM architecture. A differentiable iso-surface extraction module lets the system optimize the mesh directly with supervision from depths and normals, improving training efficiency. The result is higher-quality 3D assets generated in about 10 seconds, compared with other recent methods.
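To make the flow concrete, here is a minimal sketch of the two-stage pipeline as described above. All class and function names are illustrative placeholders rather than the released InstantMesh API; only the stage ordering follows the paper.

```python
def generate_mesh(image, diffusion_model, reconstructor):
    """Single image -> 3D mesh, feed-forward (no per-shape optimization)."""
    # Stage 1: hallucinate a fixed set of consistent novel views with an
    # off-the-shelf multiview diffusion model.
    views = diffusion_model.sample(image)      # e.g. a handful of posed RGB views

    # Stage 2: a sparse-view LRM-style reconstructor predicts an implicit
    # field from those views in one forward pass, and a differentiable
    # iso-surface module turns the field into an explicit mesh.
    field = reconstructor.predict_field(views)
    return reconstructor.extract_mesh(field)   # differentiable iso-surface extraction
```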

Core claim

The central claim is that InstantMesh can generate diverse and high-quality 3D meshes from a single image in a feed-forward manner within 10 seconds. It does so by synergizing an off-the-shelf multiview diffusion model with a sparse-view reconstruction model based on the LRM architecture and integrating a differentiable iso-surface extraction module to optimize directly on the mesh representation using additional geometric supervisions.

What carries the argument

The InstantMesh framework that synergizes a multiview diffusion model for generating consistent views, an LRM-based sparse-view reconstructor, and a differentiable iso-surface extraction module for direct mesh optimization.
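Because the iso-surface extraction is differentiable, losses on rendered depth and normal maps can flow back into the reconstructor. A minimal sketch of such a mesh-level objective follows; the loss weights and dict layout are placeholders, not the paper's values.

```python
import torch.nn.functional as F

def mesh_supervision_loss(pred, gt, w_rgb=1.0, w_depth=0.5, w_normal=0.5):
    """Hypothetical training objective combining image, depth, and normal terms.

    `pred` and `gt` are dicts of tensors rendered from the predicted mesh and
    the ground-truth views; the weights are illustrative only.
    """
    loss_rgb = F.l1_loss(pred["rgb"], gt["rgb"])
    loss_depth = F.l1_loss(pred["depth"], gt["depth"])
    # Normals are unit vectors, so penalize angular deviation.
    loss_normal = (1.0 - F.cosine_similarity(pred["normal"], gt["normal"], dim=-1)).mean()
    return w_rgb * loss_rgb + w_depth * loss_depth + w_normal * loss_normal
```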

If this is right

  • 3D assets can be created from single images much faster than with previous optimization-based methods.
  • Training can leverage more geometric information such as depths and normals for better results.
  • The generated meshes show improved consistency and quality over other image-to-3D approaches.
  • Open-sourcing the code, weights, and demo enables broader use in 3D generative AI applications.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • This combination might allow extension to text-to-3D by using text-to-image models as the first step.
  • The speed opens possibilities for interactive 3D content creation in design tools or games.
  • Future work could explore end-to-end training of the entire pipeline instead of using off-the-shelf components.
  • The approach may generalize to other modalities like generating 3D from sketches or partial images.

Load-bearing premise

The assumption that an off-the-shelf multiview diffusion model and an LRM-based sparse-view reconstructor can be combined with a differentiable iso-surface module to produce consistently high-quality meshes without major artifacts or view inconsistencies.

What would settle it

Observing whether the generated meshes from single images exhibit visible artifacts, view inconsistencies, or take longer than 10 seconds to produce on standard hardware, when tested against ground-truth 3D data.
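One way to operationalize that test, sketched under the assumption that meshes are compared as sampled point clouds (a common convention; some evaluations use squared distances instead):

```python
import time
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(pred_pts: np.ndarray, gt_pts: np.ndarray) -> float:
    """Symmetric Chamfer distance between (N, 3) point sets sampled from the
    predicted and ground-truth meshes; lower is better."""
    d_pg = cKDTree(gt_pts).query(pred_pts)[0]   # nearest-neighbor distances pred -> gt
    d_gp = cKDTree(pred_pts).query(gt_pts)[0]   # and gt -> pred
    return float(d_pg.mean() + d_gp.mean())

def timed_generation(pipeline, image):
    """Wall-clock the end-to-end speed claim; `pipeline` is any image -> mesh callable."""
    start = time.perf_counter()
    mesh = pipeline(image)
    return mesh, time.perf_counter() - start
```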

read the original abstract

We present InstantMesh, a feed-forward framework for instant 3D mesh generation from a single image, featuring state-of-the-art generation quality and significant training scalability. By synergizing the strengths of an off-the-shelf multiview diffusion model and a sparse-view reconstruction model based on the LRM architecture, InstantMesh is able to create diverse 3D assets within 10 seconds. To enhance the training efficiency and exploit more geometric supervisions, e.g., depths and normals, we integrate a differentiable iso-surface extraction module into our framework and directly optimize on the mesh representation. Experimental results on public datasets demonstrate that InstantMesh significantly outperforms other latest image-to-3D baselines, both qualitatively and quantitatively. We release all the code, weights, and demo of InstantMesh, with the intention that it can make substantial contributions to the community of 3D generative AI and empower both researchers and content creators.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper presents InstantMesh, a feed-forward pipeline for single-image 3D mesh generation that combines an off-the-shelf multiview diffusion model with a sparse-view LRM-based reconstructor and a differentiable iso-surface extraction module. It claims to produce diverse, high-quality meshes in under 10 seconds while outperforming recent image-to-3D baselines both qualitatively and quantitatively on public datasets, with all code, weights, and a demo released.

Significance. If the quantitative tables and ablation results hold, the work offers a practical advance in efficient single-image 3D asset creation by leveraging pretrained components plus direct mesh optimization, achieving usable speed and quality for content-creation applications. The open release of code and weights is a clear strength that supports reproducibility and community follow-up.

major comments (2)
  1. [§4.2, Table 2] The reported Chamfer distance and normal consistency gains over Zero123++ and LRM are substantial, but the table does not report variance across multiple random seeds or test splits; without this, it is difficult to assess whether the claimed outperformance is statistically reliable for the central speed-quality claim.
  2. [§3.3] The differentiable iso-surface extraction is presented as key to exploiting depth/normal supervision, yet no ablation isolates its contribution versus simply using the LRM output directly; this leaves open whether the module is load-bearing for the reported mesh quality or mainly an implementation detail.
minor comments (3)
  1. [Figure 3] The caption does not state the view angles used for the qualitative comparison, making the exact visual results hard to reproduce.
  2. [§2] The related-work discussion of LRM variants is concise but omits the precise architectural differences (e.g., triplane resolution, attention layers) between the adopted LRM and the original LRM paper; a short table would improve clarity.
  3. [Eq. (5)] The weighting between the diffusion loss and the mesh supervision terms is given numerically without justification or sensitivity analysis; a brief note on how these weights were chosen would help (a sweep like the sketch after this list would suffice).
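The sensitivity analysis requested in minor comment 3 could be as simple as a grid sweep over the loss weights. In the sketch below, the grid values and the `train_and_eval` callable are placeholders, not anything specified in the paper:

```python
import itertools

def weight_sensitivity_sweep(train_and_eval,
                             w_depth_grid=(0.1, 0.5, 1.0),
                             w_normal_grid=(0.1, 0.5, 1.0)):
    """Retrain (or fine-tune) under each weighting and record the resulting
    metric; reporting this table would answer the referee's request."""
    results = {}
    for w_d, w_n in itertools.product(w_depth_grid, w_normal_grid):
        results[(w_d, w_n)] = train_and_eval(w_depth=w_d, w_normal=w_n)
    return results
```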

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of InstantMesh and the constructive comments. We address each major point below and will incorporate revisions to strengthen the statistical reporting and ablation analysis.

read point-by-point responses
  1. Referee: [§4.2, Table 2] The reported Chamfer distance and normal consistency gains over Zero123++ and LRM are substantial, but the table does not report variance across multiple random seeds or test splits; without this, it is difficult to assess whether the claimed outperformance is statistically reliable for the central speed-quality claim.

    Authors: We agree that reporting variance would strengthen the reliability assessment. In the revised manuscript, we will augment Table 2 with standard deviations computed over multiple random seeds (e.g., 3–5 runs) for the key metrics on the same test splits. We will also explicitly state the fixed evaluation protocol and data splits used, ensuring the outperformance claims are presented with appropriate statistical context; the first sketch after these responses illustrates the protocol. revision: yes

  2. Referee: [§3.3] The differentiable iso-surface extraction is presented as key to exploiting depth/normal supervision, yet no ablation isolates its contribution versus simply using the LRM output directly; this leaves open whether the module is load-bearing for the reported mesh quality or mainly an implementation detail.

    Authors: We thank the referee for this suggestion. While the module enables direct mesh optimization with depth/normal losses (as described in §3.3), we acknowledge that an explicit ablation would better isolate its contribution. In the revision, we will add a new ablation study comparing the full InstantMesh pipeline against a baseline that uses the LRM output directly (without differentiable iso-surface extraction and mesh optimization); the second sketch after these responses illustrates the comparison. This will quantify the module’s impact on final mesh quality metrics. revision: yes
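A minimal sketch of the multi-seed protocol promised in response 1; `evaluate` stands in for the paper's metric computation on the fixed test split:

```python
import numpy as np

def seeded_metrics(evaluate, seeds=(0, 1, 2, 3, 4)):
    """Run the fixed evaluation under several random seeds and report
    mean and sample standard deviation, as promised for Table 2."""
    runs = np.array([evaluate(seed=s) for s in seeds])   # shape: (num_seeds, num_metrics)
    return runs.mean(axis=0), runs.std(axis=0, ddof=1)
```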
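And a sketch of the ablation promised in response 2, holding data and metrics fixed while toggling only the differentiable iso-surface stage; both pipeline callables are placeholders, not the released code's API:

```python
def isosurface_ablation(test_set, metric, full_pipeline, lrm_direct):
    """Compare the full pipeline (with differentiable iso-surface mesh
    optimization) against a baseline that meshes the raw LRM field, e.g.
    via plain marching cubes, on identical inputs and metrics."""
    full_scores = [metric(full_pipeline(x)) for x in test_set]
    base_scores = [metric(lrm_direct(x)) for x in test_set]
    return (sum(full_scores) / len(full_scores),
            sum(base_scores) / len(base_scores))
```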

Circularity Check

0 steps flagged

No significant circularity; pipeline combines external pretrained models with independent module

full rationale

The manuscript describes an engineering pipeline that fuses an off-the-shelf multiview diffusion model with an LRM-based sparse-view reconstructor and a new differentiable iso-surface extraction module. All quantitative claims are supported by training details, loss formulations, and evaluations on public datasets that do not reduce to quantities fitted on the same test data. No equations or central claims are shown to be equivalent to their inputs by construction, and self-citations (if present) are not load-bearing for the performance assertions.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The framework depends on the quality of two external pretrained models and on the assumption that depth and normal supervision during mesh optimization will improve geometry without introducing new artifacts; no new physical entities are postulated.

free parameters (1)
  • training hyperparameters for mesh optimization
    Typical loss weights and learning rates for the differentiable iso-surface module are not specified in the abstract.
axioms (2)
  • domain assumption Off-the-shelf multiview diffusion model produces sufficiently consistent and accurate novel views for downstream reconstruction
    The method treats the diffusion model as a black-box input generator whose outputs are reliable enough for the LRM stage.
  • domain assumption Sparse-view LRM architecture can be fine-tuned with mesh-level losses to produce watertight meshes
    The paper assumes the LRM backbone can be adapted to output mesh-compatible representations via the added iso-surface module.

pith-pipeline@v0.9.0 · 5471 in / 1531 out tokens · 49920 ms · 2026-05-13T21:10:38.284278+00:00 · methodology


Forward citations

Cited by 29 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow

    cs.CV 2026-05 unverdicted novelty 7.0

    R-DMesh generates high-fidelity 4D meshes aligned to video by disentangling base mesh, motion, and a learned rectification jump offset inside a VAE, then using Triflow Attention and rectified-flow diffusion.

  2. MiXR: Harvesting and Recomposing Geometry from Real-World Objects for In-Situ 3D Design

    cs.HC 2026-05 unverdicted novelty 7.0

    MiXR enables in-situ 3D design by harvesting real-world geometry for user-defined compositions that generative AI then refines, outperforming text-only generative methods in control and fidelity per a 12-person study.

  3. MeshFIM: Local Low-Poly Mesh Editing via Fill-in-the-Middle Autoregressive Generation

    cs.GR 2026-05 unverdicted novelty 7.0

    MeshFIM enables local low-poly mesh editing by autoregressively filling target regions conditioned on context, using boundary markers, positional embeddings, and a gated geometry encoder to enforce attachment, topolog...

  4. Mix3R: Mixing Feed-forward Reconstruction and Generative 3D Priors for Joint Multi-view Aligned 3D Reconstruction and Pose Estimation

    cs.CV 2026-05 unverdicted novelty 7.0

    Mix3R mixes feed-forward reconstruction and generative 3D priors via Mixture-of-Transformers and overlap-based attention bias to achieve better-aligned 3D shapes and more accurate poses than either approach alone.

  5. AmaraSpatial-10K: A Spatially and Semantically Aligned 3D Dataset for Spatial Computing and Embodied AI

    cs.CV 2026-04 unverdicted novelty 7.0

    AmaraSpatial-10K is a new dataset of over 10,000 metric-scaled and semantically anchored 3D assets that achieves 3.4 times higher text retrieval precision than Objaverse for embodied AI and spatial computing.

  6. DeVI: Physics-based Dexterous Human-Object Interaction via Synthetic Video Imitation

    cs.CV 2026-04 unverdicted novelty 7.0

    DeVI enables zero-shot physically plausible dexterous control by imitating synthetic videos via a hybrid 3D-human plus 2D-object tracking reward.

  7. Free-Range Gaussians: Non-Grid-Aligned Generative 3D Gaussian Reconstruction

    cs.CV 2026-04 unverdicted novelty 7.0

    Free-Range Gaussians uses flow matching over Gaussian parameters to predict non-grid-aligned 3D Gaussians from multi-view images, enabling synthesis of plausible content in unobserved regions with fewer primitives tha...

  8. THOM: Generating Physically Plausible Hand-Object Meshes From Text

    cs.CV 2026-04 unverdicted novelty 7.0

    THOM is a training-free two-stage framework that generates physically plausible hand-object 3D meshes directly from text by combining text-guided Gaussians with contact-aware physics optimization and VLM refinement.

  9. Sat3DGen: Comprehensive Street-Level 3D Scene Generation from Single Satellite Image

    cs.CV 2026-05 unverdicted novelty 6.0

    Sat3DGen improves geometric RMSE from 6.76m to 5.20m and FID from ~40 to 19 for street-level 3D generation from satellite images via geometry-centric constraints and perspective training.

  10. Real2Sim in HOI: Toward Physically Plausible HOI Reconstruction from Monocular Videos

    cs.CV 2026-05 unverdicted novelty 6.0

    HA-HOI produces physically plausible 4D HOI animations from monocular videos by anchoring object reconstruction to human motion and refining the result in a physics-based humanoid-object simulator.

  11. OneViewAll: Semantic Prior Guided One-View 6D Pose Estimation for Novel Objects

    cs.CV 2026-05 unverdicted novelty 6.0

    OneViewAll achieves 92.5% ADD-0.1 accuracy on LINEMOD for novel object 6D pose estimation using only one real reference view by integrating category, symmetry, and patch-level semantic priors in a projection-equivaria...

  12. PhysForge: Generating Physics-Grounded 3D Assets for Interactive Virtual World

    cs.CV 2026-05 unverdicted novelty 6.0

    PhysForge generates physics-grounded 3D assets via a VLM-planned Hierarchical Physical Blueprint and a KineVoxel Injection diffusion model, backed by the new PhysDB dataset of 150,000 annotated assets.

  13. 3D-ReGen: A Unified 3D Geometry Regeneration Framework

    cs.CV 2026-04 unverdicted novelty 6.0

    3D-ReGen is a conditioned 3D regenerator using VecSet that learns a regeneration prior from unlabeled 3D datasets via self-supervised tasks and achieves state-of-the-art results on controllable 3D geometry tasks.

  14. REVIVE 3D: Refinement via Encoded Voluminous Inflated prior for Volume Enhancement

    cs.CV 2026-04 unverdicted novelty 6.0

    REVIVE 3D generates voluminous 3D assets from flat 2D images via an inflated prior construction followed by latent-space refinement, plus new metrics for volume and flatness validated by user study.

  15. Reconstruction by Generation: 3D Multi-Object Scene Reconstruction from Sparse Observations

    cs.CV 2026-04 unverdicted novelty 6.0

    RecGen achieves state-of-the-art 3D multi-object scene reconstruction from sparse RGB-D views by combining compositional synthetic scene generation with strong 3D shape priors, outperforming SAM3D by 30%+ in shape qua...

  16. Repurposing 3D Generative Model for Autoregressive Layout Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    LaviGen turns 3D generative models into an autoregressive layout generator that models geometric and physical constraints, delivering 19% higher physical plausibility and 65% faster inference on the LayoutVLM benchmark.

  17. ReplicateAnyScene: Zero-Shot Video-to-3D Composition via Textual-Visual-Spatial Alignment

    cs.CV 2026-04 unverdicted novelty 6.0

    ReplicateAnyScene performs fully automated zero-shot video-to-compositional-3D reconstruction by cascading alignments of generic priors from vision foundation models across textual, visual, and spatial dimensions.

  18. Grasp as You Dream: Imitating Functional Grasping from Generated Human Demonstrations

    cs.RO 2026-04 unverdicted novelty 6.0

    GraspDreamer synthesizes human functional grasping demonstrations with visual generative models to enable zero-shot robot grasping with improved data efficiency and generalization.

  19. UniRecGen: Unifying Multi-View 3D Reconstruction and Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    UniRecGen unifies reconstruction and generation via shared canonical space and disentangled cooperative learning to produce complete, consistent 3D models from sparse views.

  20. SegviGen: Repurposing 3D Generative Model for Part Segmentation

    cs.CV 2026-03 unverdicted novelty 6.0

    SegviGen shows pretrained 3D generative models can be repurposed for part segmentation via voxel colorization, beating prior methods by 40% interactively and 15% on full segmentation using only 0.32% of labeled data.

  21. R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow

    cs.CV 2026-05 unverdicted novelty 5.0

    R-DMesh uses a VAE with a learned rectification jump offset and Triflow Attention inside a rectified-flow diffusion transformer to produce video-aligned 4D meshes despite initial pose misalignment.

  22. Pose-Aware Diffusion for 3D Generation

    cs.CV 2026-05 unverdicted novelty 5.0

    PAD synthesizes 3D geometry in observation space via depth unprojection as anchor to eliminate pose ambiguity in image-to-3D generation.

  23. From Visual Synthesis to Interactive Worlds: Toward Production-Ready 3D Asset Generation

    cs.GR 2026-04 unverdicted novelty 5.0

    The paper surveys 3D asset generation methods and organizes them around the full production pipeline to assess which outputs meet engine-level requirements for interactive applications.

  24. AmaraSpatial-10K: A Spatially and Semantically Aligned 3D Dataset for Spatial Computing and Embodied AI

    cs.CV 2026-04 unverdicted novelty 5.0

    AmaraSpatial-10K supplies 10K deployment-ready 3D assets with metric scaling and metadata, delivering 3.4x higher CLIP Recall@5 than Objaverse and 99.1% physics stability in Habitat-Sim.

  25. Unposed-to-3D: Learning Simulation-Ready Vehicles from Real-World Images

    cs.CV 2026-04 unverdicted novelty 5.0

    Unposed-to-3D learns simulation-ready 3D vehicle models from unposed real images by predicting camera parameters for photometric self-supervision, then adding scale prediction and harmonization.

  26. UniMesh: Unifying 3D Mesh Understanding and Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    UniMesh unifies 3D mesh generation and understanding in one model via a Mesh Head interface, Chain of Mesh iterative editing, and an Actor-Evaluator self-reflection loop.

  27. MeshOn: Intersection-Free Mesh-to-Mesh Composition

    cs.GR 2026-04 unverdicted novelty 5.0

    MeshOn composes two input meshes realistically without intersections by using VLM-based rigid initialization, attractive geometric losses, a barrier loss, and a diffusion prior for final deformation.

  28. AnimateAnyMesh++: A Flexible 4D Foundation Model for High-Fidelity Text-Driven Mesh Animation

    cs.CV 2026-04 unverdicted novelty 4.0

    AnimateAnyMesh++ animates arbitrary 3D meshes from text using an expanded 300K-identity DyMesh-XL dataset, a power-law topology-aware DyMeshVAE-Flex, and a variable-length rectified-flow generator to produce semantica...

  29. From Visual Synthesis to Interactive Worlds: Toward Production-Ready 3D Asset Generation

    cs.GR 2026-04 unverdicted novelty 4.0

    The paper surveys 3D content generation literature using a taxonomy of asset types and production stages to evaluate progress toward engine-ready assets.

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · cited by 26 Pith papers · 3 internal anchors

  1. [1] Antonio Alliegro, Yawar Siddiqui, Tatiana Tommasi, and Matthias Nießner. Polydiff: Generating 3D polygonal meshes with diffusion models. arXiv preprint arXiv:2312.11417, 2023.

  2. [2] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable Video Diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.

  3. [3] Rui Chen, Yongwei Chen, Ningxin Jiao, and Kui Jia. Fantasia3D: Disentangling geometry and appearance for high-quality text-to-3D content creation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22246–22256, 2023.

  4. [4] Zhiqin Chen, Andrea Tagliasacchi, and Hao Zhang. BSP-Net: Generating compact meshes via binary space partitioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 45–54, 2020.

  5. [5] Zilong Chen, Yikai Wang, Feng Wang, Zhengyi Wang, and Huaping Liu. V3D: Video diffusion models are effective 3D generators. arXiv preprint arXiv:2403.06738, 2024.

  6. [6] Yen-Chi Cheng, Hsin-Ying Lee, Sergey Tulyakov, Alexander G. Schwing, and Liang-Yan Gui. SDFusion: Multimodal 3D shape completion, reconstruction, and generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4456–4465, 2023.

  7. [7] Gene Chou, Yuval Bahat, and Felix Heide. Diffusion-SDF: Conditional generative modeling of signed distance functions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2262–2272, 2023.

  8. [8] Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3D objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13142–13153, 2023.

  9. [9] Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, et al. Objaverse-XL: A universe of 10M+ 3D objects. Advances in Neural Information Processing Systems, 36, 2024.

  10. [10] Laura Downs, Anthony Francis, Nate Koenig, Brandon Kinman, Ryan Hickman, Krista Reymann, Thomas B. McHugh, and Vincent Vanhoucke. Google Scanned Objects: A high-quality dataset of 3D scanned household items. In 2022 International Conference on Robotics and Automation (ICRA), pages 2553–2560. IEEE, 2022.

  11. [11] Anchit Gupta, Wenhan Xiong, Yixin Nie, Ian Jones, and Barlas Oğuz. 3DGen: Triplane latent diffusion for textured mesh generation. arXiv preprint arXiv:2303.05371, 2023.

  12. [12] Junlin Han, Filippos Kokkinos, and Philip Torr. VFusion3D: Learning scalable 3D generative models from video diffusion models. arXiv preprint arXiv:2403.12034, 2024.

  13. [13] Zexin He and Tengfei Wang. OpenLRM: Open-source large reconstruction models. https://github.com/3DTopia/OpenLRM, 2023.

  14. [14] Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. LRM: Large reconstruction model for single image to 3D. In The Twelfth International Conference on Learning Representations, 2024.

  15. [15] Ajay Jain, Ben Mildenhall, Jonathan T. Barron, Pieter Abbeel, and Ben Poole. Zero-shot text-guided object generation with Dream Fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 867–876, 2022.

  16. [16] Heewoo Jun and Alex Nichol. Shap-E: Generating conditional 3D implicit functions. arXiv preprint arXiv:2305.02463, 2023.

  17. [17] Yash Kant, Ziyi Wu, Michael Vasilkovsky, Guocheng Qian, Jian Ren, Riza Alp Guler, Bernard Ghanem, Sergey Tulyakov, Igor Gilitschenski, and Aliaksandr Siarohin. SPAD: Spatially aware multiview diffusers. arXiv preprint arXiv:2402.05235, 2024.

  18. [18] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D Gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4):1–14, 2023.

  19. [19] Jiahao Li, Hao Tan, Kai Zhang, Zexiang Xu, Fujun Luan, Yinghao Xu, Yicong Hong, Kalyan Sunkavalli, Greg Shakhnarovich, and Sai Bi. Instant3D: Fast text-to-3D with sparse-view generation and large reconstruction model. In The Twelfth International Conference on Learning Representations, 2024.

  20. [20] Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3D: High-resolution text-to-3D content creation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 300–309, 2023.

  21. [21] Minghua Liu, Ruoxi Shi, Linghao Chen, Zhuoyang Zhang, Chao Xu, Xinyue Wei, Hansheng Chen, Chong Zeng, Jiayuan Gu, and Hao Su. One-2-3-45++: Fast single image to 3D objects with consistent multi-view generation and 3D diffusion. arXiv preprint arXiv:2311.07885, 2023.

  22. [22] Minghua Liu, Chao Xu, Haian Jin, Linghao Chen, Mukund Varma T, Zexiang Xu, and Hao Su. One-2-3-45: Any single image to 3D mesh in 45 seconds without per-shape optimization. Advances in Neural Information Processing Systems, 36, 2024.

  23. [23] Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3D object. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9298–9309, 2023.

  24. [24] Yuan Liu, Cheng Lin, Zijiao Zeng, Xiaoxiao Long, Lingjie Liu, Taku Komura, and Wenping Wang. SyncDreamer: Generating multiview-consistent images from a single-view image. arXiv preprint arXiv:2309.03453, 2023.

  25. [25] Zhen Liu, Yao Feng, Michael J. Black, Derek Nowrouzezahrai, Liam Paull, and Weiyang Liu. MeshDiffusion: Score-based generative 3D mesh modeling. In The Eleventh International Conference on Learning Representations, 2023.

  26. [26] Xiaoxiao Long, Yuan-Chen Guo, Cheng Lin, Yuan Liu, Zhiyang Dou, Lingjie Liu, Yuexin Ma, Song-Hai Zhang, Marc Habermann, Christian Theobalt, et al. Wonder3D: Single image to 3D using cross-domain diffusion. arXiv preprint arXiv:2310.15008, 2023.

  27. [27] Luke Melas-Kyriazi, Christian Rupprecht, and Andrea Vedaldi. PC2: Projection-conditioned point cloud diffusion for single-image 3D reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12923–12932, 2023.

  28. [28] Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3D reconstruction in function space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4460–4470, 2019.

  29. [29] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.

  30. [30] Norman Müller, Yawar Siddiqui, Lorenzo Porzi, Samuel Rota Bulò, Peter Kontschieder, and Matthias Nießner. DiffRF: Rendering-guided 3D radiance field diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4328–4338, 2023.

  31. [31] Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Mishkin, and Mark Chen. Point-E: A system for generating 3D point clouds from complex prompts. arXiv preprint arXiv:2212.08751, 2022.

  32. [32] Michael Niemeyer, Lars Mescheder, Michael Oechsle, and Andreas Geiger. Differentiable volumetric rendering: Learning implicit 3D representations without 3D supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3504–3515, 2020.

  33. [33] Junyi Pan, Xiaoguang Han, Weikai Chen, Jiapeng Tang, and Kui Jia. Deep mesh reconstruction from single RGB images via topology modification networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9964–9973, 2019.

  34. [34] Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. DreamFusion: Text-to-3D using 2D diffusion. In The Eleventh International Conference on Learning Representations, 2023.

  35. [35] Guocheng Qian, Jinjie Mai, Abdullah Hamdi, Jian Ren, Aliaksandr Siarohin, Bing Li, Hsin-Ying Lee, Ivan Skorokhodov, Peter Wonka, Sergey Tulyakov, et al. Magic123: One image to high-quality 3D object generation using both 2D and 3D diffusion priors. arXiv preprint arXiv:2306.17843, 2023.

  36. [36] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

  37. [37] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.

  38. [38] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L. Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.

  39. [39] Tianchang Shen, Jun Gao, Kangxue Yin, Ming-Yu Liu, and Sanja Fidler. Deep marching tetrahedra: A hybrid representation for high-resolution 3D shape synthesis. Advances in Neural Information Processing Systems, 34:6087–6101, 2021.

  40. [40] Tianchang Shen, Jacob Munkberg, Jon Hasselgren, Kangxue Yin, Zian Wang, Wenzheng Chen, Zan Gojcic, Sanja Fidler, Nicholas Sharp, and Jun Gao. Flexible isosurface extraction for gradient-based mesh optimization. ACM Transactions on Graphics (TOG), 42(4):1–16, 2023.

  41. [41] Ruoxi Shi, Hansheng Chen, Zhuoyang Zhang, Minghua Liu, Chao Xu, Xinyue Wei, Linghao Chen, Chong Zeng, and Hao Su. Zero123++: A single image to consistent multi-view diffusion base model. arXiv preprint arXiv:2310.15110, 2023.

  42. [42] Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, and Xiao Yang. MVDream: Multi-view diffusion for 3D generation. arXiv preprint arXiv:2308.16512, 2023.

  43. [43] Jaehyeok Shim, Changwoo Kang, and Kyungdon Joo. Diffusion-based signed distance fields for 3D shape generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20887–20897, 2023.

  44. [44] Jiaxiang Tang, Zhaoxi Chen, Xiaokang Chen, Tengfei Wang, Gang Zeng, and Ziwei Liu. LGM: Large multi-view Gaussian model for high-resolution 3D content creation. arXiv preprint arXiv:2402.05054, 2024.

  45. [45] Dmitry Tochilkin, David Pankratz, Zexiang Liu, Zixuan Huang, Adam Letts, Yangguang Li, Ding Liang, Christian Laforte, Varun Jampani, and Yan-Pei Cao. TripoSR: Fast 3D object reconstruction from a single image. arXiv preprint arXiv:2403.02151, 2024.

  46. [46] Michał J. Tyszkiewicz, Pascal Fua, and Eduard Trulls. GECCO: Geometrically-conditioned point diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2128–2138, 2023.

  47. [47] Vikram Voleti, Chun-Han Yao, Mark Boss, Adam Letts, David Pankratz, Dmitry Tochilkin, Christian Laforte, Robin Rombach, and Varun Jampani. SV3D: Novel multi-view synthesis and 3D generation from a single image using latent video diffusion. arXiv preprint arXiv:2403.12008, 2024.

  48. [48] Haochen Wang, Xiaodan Du, Jiahao Li, Raymond A. Yeh, and Greg Shakhnarovich. Score Jacobian chaining: Lifting pretrained 2D diffusion models for 3D generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12619–12629, 2023.

  49. [49] Nanyang Wang, Yinda Zhang, Zhuwen Li, Yanwei Fu, Wei Liu, and Yu-Gang Jiang. Pixel2Mesh: Generating 3D mesh models from single RGB images. In Proceedings of the European Conference on Computer Vision (ECCV), pages 52–67, 2018.

  50. [50] Peng Wang and Yichun Shi. ImageDream: Image-prompt multi-view diffusion for 3D generation. arXiv preprint arXiv:2312.02201, 2023.

  51. [51] Peng Wang, Hao Tan, Sai Bi, Yinghao Xu, Fujun Luan, Kalyan Sunkavalli, Wenping Wang, Zexiang Xu, and Kai Zhang. PF-LRM: Pose-free large reconstruction model for joint pose and shape prediction. In The Twelfth International Conference on Learning Representations, 2024.

  52. [52] Tengfei Wang, Bo Zhang, Ting Zhang, Shuyang Gu, Jianmin Bao, Tadas Baltrusaitis, Jingjing Shen, Dong Chen, Fang Wen, Qifeng Chen, et al. Rodin: A generative model for sculpting 3D digital avatars using diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4563–4573, 2023.

  53. [53] Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. ProlificDreamer: High-fidelity and diverse text-to-3D generation with variational score distillation. Advances in Neural Information Processing Systems, 36, 2024.

  54. [54] Zhengyi Wang, Yikai Wang, Yifei Chen, Chendong Xiang, Shuo Chen, Dajiang Yu, Chongxuan Li, Hang Su, and Jun Zhu. CRM: Single image to 3D textured mesh with convolutional reconstruction model. arXiv preprint arXiv:2403.05034, 2024.

  55. [55] Tong Wu, Jiarui Zhang, Xiao Fu, Yuxin Wang, Jiawei Ren, Liang Pan, Wayne Wu, Lei Yang, Jiaqi Wang, Chen Qian, et al. OmniObject3D: Large-vocabulary 3D object dataset for realistic perception, reconstruction and generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 803–814, 2023.

  56. [56] Zijie Wu, Yaonan Wang, Mingtao Feng, He Xie, and Ajmal Mian. Sketch and text guided diffusion model for colored point cloud generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8929–8939, 2023.

  57. [57] Dejia Xu, Yifan Jiang, Peihao Wang, Zhiwen Fan, Yi Wang, and Zhangyang Wang. NeuralLift-360: Lifting an in-the-wild 2D photo to a 3D object with 360° views. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4479–4489, 2023.

  58. [58] Jiale Xu, Xintao Wang, Weihao Cheng, Yan-Pei Cao, Ying Shan, Xiaohu Qie, and Shenghua Gao. Dream3D: Zero-shot text-to-3D synthesis using 3D shape prior and text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20908–20918, 2023.

  59. [59] Yinghao Xu, Zifan Shi, Wang Yifan, Hansheng Chen, Ceyuan Yang, Sida Peng, Yujun Shen, and Gordon Wetzstein. GRM: Large Gaussian reconstruction model for efficient 3D reconstruction and generation. arXiv preprint arXiv:2403.14621, 2024.

  60. [60] Yinghao Xu, Hao Tan, Fujun Luan, Sai Bi, Peng Wang, Jiahao Li, Zifan Shi, Kalyan Sunkavalli, Gordon Wetzstein, Zexiang Xu, and Kai Zhang. DMV3D: Denoising multi-view diffusion using 3D large reconstruction model. In The Twelfth International Conference on Learning Representations, 2024.

  61. [61] Biao Zhang, Jiapeng Tang, Matthias Nießner, and Peter Wonka. 3DShape2VecSet: A 3D shape representation for neural fields and generative diffusion models. ACM Transactions on Graphics (TOG), 42(4):1–16, 2023.

  62. [62] Xin-Yang Zheng, Hao Pan, Peng-Shuai Wang, Xin Tong, Yang Liu, and Heung-Yeung Shum. Locally attentional SDF diffusion for controllable 3D shape generation. ACM Transactions on Graphics (TOG), 42(4):1–13, 2023.

  63. [63] Xin-Yang Zheng, Hao Pan, Yu-Xiao Guo, Xin Tong, and Yang Liu. MVD2: Efficient multiview 3D reconstruction for multiview diffusion. arXiv preprint arXiv:2402.14253, 2024.

  64. [64] Linqi Zhou, Yilun Du, and Jiajun Wu. 3D shape generation and completion through point-voxel diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5826–5835, 2021.

  65. [65] Zi-Xin Zou, Zhipeng Yu, Yuan-Chen Guo, Yangguang Li, Ding Liang, Yan-Pei Cao, and Song-Hai Zhang. Triplane meets Gaussian splatting: Fast and generalizable single-view 3D reconstruction with transformers. arXiv preprint arXiv:2312.09147, 2023.

  66. [66] Qi Zuo, Xiaodong Gu, Lingteng Qiu, Yuan Dong, Zhengyi Zhao, Weihao Yuan, Rui Peng, Siyu Zhu, Zilong Dong, Liefeng Bo, et al. VideoMV: Consistent multi-view generation based on large video generative model. arXiv preprint arXiv:2403.12010, 2024.