
arxiv: 2509.07435 · v2 · submitted 2025-09-09 · 💻 cs.CV

DreamLifting: A Plug-in Module Lifting MV Diffusion Models for 3D Asset Generation

Pith reviewed 2026-05-18 18:26 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D asset generation · multi-view diffusion · Gaussian splatting · PBR materials · plug-in adapter · data-efficient finetuning · relightable meshes · diffusion priors

The pith

LGAA reuses layers from multi-view diffusion models to generate PBR-ready 3D assets after finetuning on only 69k multi-view instances.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Lightweight Gaussian Asset Adapter (LGAA), a modular plug-in that attaches to existing multi-view diffusion models to create 3D assets with both geometry and physically based rendering materials. It reuses and adapts network layers trained on billions of images so that fine-tuning on a small set of 69,000 multi-view examples still converges well and preserves useful 2D knowledge. The design includes a wrapper for layer reuse, a switcher to combine multiple priors, and a decoder that outputs 2D Gaussian splats with PBR channels, followed by post-processing to extract relightable meshes. A sympathetic reader would care because most 3D generation methods either ignore materials or need far larger datasets, and this method offers an end-to-end, data-efficient route to production-ready assets.

Core claim

LGAA unifies geometry and PBR material modeling by exploiting multi-view diffusion priors through a modular design: the LGAA Wrapper reuses and adapts network layers from MV diffusion models to preserve 2D priors for better convergence, the LGAA Switcher aligns multiple wrapper layers that encapsulate different knowledge, and the LGAA Decoder, a tamed variational autoencoder, predicts 2D Gaussian Splatting with PBR channels. A dedicated post-processing procedure then extracts high-quality, relightable mesh assets from the resulting 2DGS. Experiments demonstrate superior performance with both text- and image-conditioned MV diffusion models and data-efficient finetuning with merely 69k multi-view instances.

What carries the argument

Lightweight Gaussian Asset Adapter (LGAA), a plug-in module whose Wrapper reuses MV diffusion layers, Switcher aligns multiple priors, and Decoder predicts 2DGS with PBR channels, thereby lifting pre-trained models for unified 3D asset generation.
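To make the division of labor concrete, here is a minimal sketch of how the three parts could compose. It is a toy under stated assumptions, not the authors' implementation: `frozen_layer`, the zero-initialized `residual_scale`, the weighted-average `Switcher`, and the half-and-half split in `decode` are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Vector = List[float]


def frozen_layer(x: Vector) -> Vector:
    """Stand-in for a pre-trained MV diffusion layer whose weights stay fixed."""
    return [2.0 * v for v in x]


@dataclass
class Wrapper:
    """Hypothetical wrapper: frozen layer output plus a trainable residual term."""
    base: Callable[[Vector], Vector]
    residual_scale: float = 0.0  # assumption: zero init, so training starts at the prior

    def __call__(self, x: Vector) -> Vector:
        y = self.base(x)
        return [yi + self.residual_scale * xi for yi, xi in zip(y, x)]


@dataclass
class Switcher:
    """Hypothetical switcher: normalized weighted sum of several wrapper outputs."""
    weights: List[float]

    def __call__(self, features: List[Vector]) -> Vector:
        total = sum(self.weights)
        dim = len(features[0])
        return [
            sum(w * f[i] for w, f in zip(self.weights, features)) / total
            for i in range(dim)
        ]


def decode(feature: Vector) -> Dict[str, Vector]:
    """Hypothetical decoder: split a fused feature into geometry and PBR parts."""
    half = len(feature) // 2
    return {"geometry": feature[:half], "pbr": feature[half:]}


# Two wrappers standing in for a geometry prior and a material prior.
geometry_wrapper = Wrapper(frozen_layer)
material_wrapper = Wrapper(frozen_layer, residual_scale=0.5)
switcher = Switcher(weights=[1.0, 1.0])

x = [1.0, 2.0, 3.0, 4.0]
fused = switcher([geometry_wrapper(x), material_wrapper(x)])
out = decode(fused)
```

With `residual_scale` left at zero, the wrapper reproduces the frozen layer exactly, which is one simple way to start finetuning from the pre-trained behavior.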

Load-bearing premise

Reusing and adapting network layers from pre-trained MV diffusion models preserves 2D priors sufficiently to enable better convergence and superior 3D PBR performance when finetuned on only 69k multi-view instances.

What would settle it

A model trained from scratch on the same 69k multi-view instances that matches or exceeds LGAA in convergence speed and final 3D PBR asset quality would show that layer reuse is not necessary for the claimed data efficiency.
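The test described above can be phrased as a small decision rule over two training curves. The sketch below is illustrative only: the function names, the loss threshold, and the loss values are made-up placeholders, not numbers from the paper, and a real comparison would use the paper's own 3D PBR quality metrics rather than a single loss.

```python
from typing import List, Optional


def steps_to_threshold(losses: List[float], threshold: float) -> Optional[int]:
    """Return the first step whose loss is at or below the threshold, else None."""
    for step, loss in enumerate(losses):
        if loss <= threshold:
            return step
    return None


def reuse_is_unnecessary(
    adapter_losses: List[float],
    scratch_losses: List[float],
    threshold: float,
) -> bool:
    """True if the from-scratch run converges at least as fast and ends at least as low."""
    adapter_step = steps_to_threshold(adapter_losses, threshold)
    scratch_step = steps_to_threshold(scratch_losses, threshold)
    if scratch_step is None:
        return False  # the from-scratch run never reached the target quality
    converges_as_fast = adapter_step is None or scratch_step <= adapter_step
    ends_as_low = scratch_losses[-1] <= adapter_losses[-1]
    return converges_as_fast and ends_as_low


# Placeholder curves for illustration only (not measured results).
adapter_curve = [1.0, 0.6, 0.3, 0.2]
scratch_curve = [1.0, 0.9, 0.7, 0.5]
verdict = reuse_is_unnecessary(adapter_curve, scratch_curve, threshold=0.3)
```

Both runs would need the same 69k instances, the same compute budget, and the same evaluation set for the verdict to mean anything.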

Figures

Figures reproduced from arXiv: 2509.07435 by Jian Yang, Jiaxiong Qiu, Jin Xie, Liu Liu, Wei Sui, Xinjie Wang, Ze-Xin Yin, Zhizhong Su.

Figure 1: Our pipeline possesses the capability of generating diverse, PBR-ready 3D assets from either text prompts or image …
Figure 2: Overall of the 3D asset generation pipeline. We propose the …
Figure 3: Visual comparisons of text-conditioned 3D asset generation methods. For LGM and LaRa, we use MVDream 2.1 to …
Figure 4: Visual comparisons of image-conditioned 3D asset generation methods. For LGM and LaRa, we use ImageDream to …
Figure 5: Relighting results under different HDRI maps. The synthesized diffuse materials exhibit base color changes under …
Figure 6: We provide detailed visualizations of geometry and PBR materials from the generated 3D assets, along with the input …

Original abstract

The labor- and experience-intensive creation of 3D assets with physically based rendering (PBR) materials demands an autonomous 3D asset creation pipeline. However, most existing 3D generation methods focus on geometry modeling, either baking textures into simple vertex colors or leaving texture synthesis to post-processing with image diffusion models. To achieve end-to-end PBR-ready 3D asset generation, we present Lightweight Gaussian Asset Adapter (LGAA), a novel framework that unifies the modeling of geometry and PBR materials by exploiting multi-view (MV) diffusion priors from a novel perspective. The LGAA features a modular design with three components. Specifically, the LGAA Wrapper reuses and adapts network layers from MV diffusion models, which encapsulate knowledge acquired from billions of images, enabling better convergence in a data-efficient manner. To incorporate multiple diffusion priors for geometry and PBR synthesis, the LGAA Switcher aligns multiple LGAA Wrapper layers encapsulating different knowledge. Then, a tamed variational autoencoder (VAE), termed LGAA Decoder, is designed to predict 2D Gaussian Splatting (2DGS) with PBR channels. Finally, we introduce a dedicated post-processing procedure to effectively extract high-quality, relightable mesh assets from the resulting 2DGS. Extensive quantitative and qualitative experiments demonstrate the superior performance of LGAA with both text- and image-conditioned MV diffusion models. Additionally, the modular design enables flexible incorporation of multiple diffusion priors, and the knowledge-preserving scheme effectively preseves the 2D priors learned on massive image dataset, which leads to data efficient finetuning to lift the MV diffuison models for 3D generation with merely 69k multi-view instances.
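As a reading aid, the sketch below shows one plausible record layout for a 2D Gaussian splat that carries PBR channels. The abstract does not list the exact channels or parameterization, so the field names, the albedo/roughness/metallic choice, and the value ranges checked in `is_valid` are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


def _in_unit_range(value: float) -> bool:
    """Check that a scalar channel lies in [0, 1]."""
    return 0.0 <= value <= 1.0


@dataclass(frozen=True)
class PBRSplat:
    """One oriented 2D Gaussian disk with material channels (hypothetical layout)."""
    center: Tuple[float, float, float]
    scale: Tuple[float, float]  # a 2D Gaussian has two in-plane scales
    rotation: Tuple[float, float, float, float]  # unit quaternion (w, x, y, z)
    opacity: float
    albedo: Tuple[float, float, float]  # assumed base-color channel
    roughness: float  # assumed PBR channel
    metallic: float  # assumed PBR channel

    def is_valid(self, tol: float = 1e-6) -> bool:
        """Return True if scales are positive, rotation is unit, channels are in range."""
        quat_norm_sq = sum(q * q for q in self.rotation)
        return (
            all(s > 0.0 for s in self.scale)
            and abs(quat_norm_sq - 1.0) <= tol
            and _in_unit_range(self.opacity)
            and all(_in_unit_range(c) for c in self.albedo)
            and _in_unit_range(self.roughness)
            and _in_unit_range(self.metallic)
        )


example = PBRSplat(
    center=(0.0, 0.0, 0.0),
    scale=(0.1, 0.2),
    rotation=(1.0, 0.0, 0.0, 0.0),
    opacity=0.9,
    albedo=(0.8, 0.2, 0.1),
    roughness=0.5,
    metallic=0.0,
)
```

Keeping material values per splat is what would let a later mesh-extraction step bake them into textures, although the paper's actual post-processing is not described in the text reproduced here.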

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Lightweight Gaussian Asset Adapter (LGAA), a modular plug-in framework to lift pre-trained multi-view (MV) diffusion models for end-to-end 3D asset generation with geometry and PBR materials. LGAA Wrapper reuses and adapts layers from MV diffusion models to preserve 2D priors for data-efficient finetuning; LGAA Switcher aligns multiple such wrappers for geometry and PBR priors; LGAA Decoder (a tamed VAE) predicts 2D Gaussian Splatting with PBR channels; a post-processing step extracts relightable meshes. The authors claim superior performance over existing methods via extensive quantitative and qualitative experiments on both text- and image-conditioned MV models, achieved through data-efficient finetuning on only 69k multi-view instances.

Significance. If the empirical claims hold, the modular reuse of billion-image 2D priors for 3D PBR generation would represent a practical advance in data-efficient 3D asset pipelines, reducing reliance on large-scale 3D datasets while enabling flexible combination of geometry and material priors. The explicit post-processing to relightable meshes and the 2DGS output format are also potentially useful for downstream applications.

major comments (2)
  1. Abstract: the central claim that LGAA 'effectively preserves the 2D priors learned on massive image dataset' and thereby enables 'data efficient finetuning ... with merely 69k multi-view instances' is unsupported by any reported quantitative check (feature similarity, retained 2D generation quality after adaptation, or from-scratch ablation). Without such evidence the attribution of gains to prior preservation rather than to the new architecture or training protocol remains unverified.
  2. Abstract and Experiments section: no numerical metrics, ablation tables, error bars, dataset construction details, or evaluation protocol (e.g., metrics for geometry, PBR, or relighting quality) are supplied to substantiate the repeated assertion of 'superior performance.' This absence prevents assessment of whether the reported gains are statistically meaningful or merely qualitative.
minor comments (2)
  1. Abstract: typographical errors ('preseves', 'diffuison') should be corrected.
  2. Abstract: the sentence 'the modular design enables flexible incorporation of multiple diffusion priors, and the knowledge-preserving scheme effectively preseves...' is run-on and should be split for clarity.
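One of the checks named in major comment 1, feature similarity between original and adapted layers, could be computed roughly as follows. This is a minimal sketch assuming activations have already been extracted as plain lists of floats on a shared set of inputs; the function names are hypothetical, and cosine similarity is only one possible choice of measure.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def mean_feature_similarity(
    original: List[Sequence[float]],
    adapted: List[Sequence[float]],
) -> float:
    """Average cosine similarity over paired activations from the same inputs."""
    if len(original) != len(adapted) or not original:
        raise ValueError("need equally sized, non-empty activation lists")
    sims = [cosine_similarity(o, a) for o, a in zip(original, adapted)]
    return sum(sims) / len(sims)
```

A value near 1 after finetuning would be consistent with the layers staying close to their pre-trained behavior, though it would not by itself show that the retained features cause the reported gains.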

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. The two major comments identify important gaps in evidentiary support for our central claims. We agree that additional quantitative checks and detailed reporting are needed to strengthen the manuscript and will revise accordingly.

Point-by-point responses
  1. Referee: Abstract: the central claim that LGAA 'effectively preserves the 2D priors learned on massive image dataset' and thereby enables 'data efficient finetuning ... with merely 69k multi-view instances' is unsupported by any reported quantitative check (feature similarity, retained 2D generation quality after adaptation, or from-scratch ablation). Without such evidence the attribution of gains to prior preservation rather than to the new architecture or training protocol remains unverified.

    Authors: We agree that the current version does not contain direct quantitative verification of prior preservation, such as feature-space similarity between adapted and original MV diffusion layers or an explicit from-scratch training ablation. The manuscript relies on indirect evidence through overall performance gains and the modular reuse design. We will add a dedicated ablation subsection that reports (i) performance when the LGAA Wrapper is trained from random initialization versus initialized from MV diffusion weights and (ii) any feasible retained 2D generation quality metrics after adaptation. This revision will allow readers to assess the contribution of the preserved priors more rigorously. revision: yes

  2. Referee: Abstract and Experiments section: no numerical metrics, ablation tables, error bars, dataset construction details, or evaluation protocol (e.g., metrics for geometry, PBR, or relighting quality) are supplied to substantiate the repeated assertion of 'superior performance.' This absence prevents assessment of whether the reported gains are statistically meaningful or merely qualitative.

    Authors: The experiments section does contain quantitative comparisons, yet we acknowledge that error bars, complete dataset construction details for the 69k instances, and explicit per-category metrics for geometry, PBR material quality, and relighting fidelity are insufficiently reported. We will expand the experiments section with (i) full ablation tables including numerical values and standard deviations, (ii) a detailed description of the multi-view dataset curation and splits, and (iii) additional evaluation protocols and metrics specifically for geometry accuracy, PBR channel fidelity, and relighting quality under novel lighting. These additions will make the superiority claims quantitatively verifiable. revision: yes
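The promised tables with standard deviations amount to summarizing each metric over repeated runs. A minimal sketch of that bookkeeping, using only the Python standard library, is shown below; the helper names are hypothetical and the choice of sample standard deviation over repeated seeds is an assumption about how the error bars would be defined.

```python
import statistics
from typing import Sequence, Tuple


def summarize(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of one metric over repeated runs."""
    mean = statistics.mean(scores)
    # The sample standard deviation needs at least two runs; report 0.0 for one run.
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std


def format_entry(scores: Sequence[float], digits: int = 2) -> str:
    """Format a table cell as 'mean ± std' with a fixed number of decimals."""
    mean, std = summarize(scores)
    return f"{mean:.{digits}f} ± {std:.{digits}f}"
```

For example, `format_entry([1.0, 2.0, 3.0])` yields a cell reading `2.00 ± 1.00`.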

Circularity Check

0 steps flagged

No circularity: modular reuse of external pre-trained priors with independent experimental validation

Full rationale

The paper describes an engineering framework (LGAA Wrapper reusing MV diffusion layers, LGAA Switcher for alignment, LGAA Decoder for 2DGS+PBR prediction) that builds on external pre-trained models trained on billions of images. The central data-efficiency claim (finetuning on 69k instances) is presented as an empirical outcome of prior preservation rather than a quantity defined by or fitted to the target 3D results themselves. No equations, self-definitional loops, fitted-input predictions, or load-bearing self-citations appear in the provided text; the derivation chain consists of architectural choices justified by external diffusion priors and validated through quantitative/qualitative experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The central claim rests on the effectiveness of the modular adapter design in preserving pre-trained 2D diffusion knowledge while adapting to 3D PBR output; this depends on domain assumptions about diffusion model capabilities and introduces several new invented components without independent falsifiable evidence beyond the claimed experiments.

axioms (1)
  • domain assumption: Multi-view diffusion models encapsulate knowledge acquired from billions of images that can be reused for better convergence in 3D tasks.
    Invoked in the abstract to justify the Wrapper component and data-efficient finetuning.
invented entities (3)
  • LGAA Wrapper (no independent evidence)
    purpose: Reuses and adapts network layers from MV diffusion models.
    New modular component introduced to encapsulate and adapt 2D priors.
  • LGAA Switcher (no independent evidence)
    purpose: Aligns multiple LGAA Wrapper layers encapsulating different knowledge for geometry and PBR.
    New component to incorporate multiple diffusion priors.
  • LGAA Decoder (no independent evidence)
    purpose: Tamed variational autoencoder to predict 2D Gaussian Splatting with PBR channels.
    New decoder design for the 3D output representation.




Reference graph

Works this paper leans on

96 extracted references · 96 canonical work pages · 11 internal anchors

  1. [1]

    Hunyuan3d 1.0: A unified frame- work for text-to-3d and image-to-3d generation,

    X. Yang, H. Shi, B. Zhang, F. Yang, J. Wang, H. Zhao, X. Liu, X. Wang, Q. Lin, J. Yu et al. , “Hunyuan3d 1.0: A unified frame- work for text-to-3d and image-to-3d generation,” arXiv preprint arXiv:2411.02293, 2024. 1

  2. [2]

    Clay: A controllable large-scale generative model for creating high-quality 3d assets,

    L. Zhang, Z. Wang, Q. Zhang, Q. Qiu, A. Pang, H. Jiang, W. Yang, L. Xu, and J. Yu, “Clay: A controllable large-scale generative model for creating high-quality 3d assets,” ACM T ransactions on Graphics (TOG), vol. 43, no. 4, pp. 1–20, 2024. 1

  3. [3]

    Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation

    Z. Zhao, Z. Lai, Q. Lin, Y. Zhao, H. Liu, S. Yang, Y. Feng, M. Yang, S. Zhang, X. Yang et al., “Hunyuan3d 2.0: Scaling diffusion models for high resolution textured 3d assets generation,” arXiv preprint arXiv:2501.12202, 2025. 1, 3

  4. [4]

    arXiv2504.07943(2025) 8

    Y. Yang, Y.-C. Guo, Y. Huang, Z.-X. Zou, Z. Yu, Y. Li, Y.-P . Cao, and X. Liu, “Holopart: Generative 3d part amodal segmentation,” arXiv preprint arXiv:2504.07943 , 2025. 1

  5. [5]

    Sparseflex: High-resolution and arbitrary-topology 3d shape modeling,

    X. He, Z.-X. Zou, C.-H. Chen, Y.-C. Guo, D. Liang, C. Yuan, W. Ouyang, Y.-P . Cao, and Y. Li, “Sparseflex: High-resolution and arbitrary-topology 3d shape modeling,” arXiv preprint arXiv:2503.21732, 2025. 1

  6. [6]

    TripoSG: High-Fidelity 3D Shape Synthesis using Large-Scale Rectified Flow Models

    Y. Li, Z.-X. Zou, Z. Liu, D. Wang, Y. Liang, Z. Yu, X. Liu, Y.-C. Guo, D. Liang, W. Ouyang et al., “Triposg: High-fidelity 3d shape synthesis using large-scale rectified flow models,” arXiv preprint arXiv:2502.06608, 2025. 1

  7. [7]

    Direct3d-s2: Gigascale 3d generation made easy with spatial sparse attention.arXiv preprint arXiv:2505.17412, 2025

    S. Wu, Y. Lin, F. Zhang, Y. Zeng, Y. Yang, Y. Bao, J. Qian, S. Zhu, P . Torr, X. Cao, and Y. Yao, “Direct3d-s2: Gigascale 3d generation made easy with spatial sparse attention,” arXiv preprint arXiv:2505.17412, 2025. 1, 3

  8. [8]

    arXiv2505.14521(2025) 6, 8, 10, 11

    Z. Li, Y. Wang, H. Zheng, Y. Luo, and B. Wen, “Sparc3d: Sparse representation and construction for high-resolution 3d shapes modeling,” arXiv preprint arXiv:2505.14521 , 2025. 1, 3

  9. [9]

    DreamFusion: Text-to-3D using 2D Diffusion

    B. Poole, A. Jain, J. T. Barron, and B. Mildenhall, “Dreamfusion: Text-to-3d using 2d diffusion,” arXiv preprint arXiv:2209.14988 ,

  10. [10]

    Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation,

    H. Wang, X. Du, J. Li, R. A. Yeh, and G. Shakhnarovich, “Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2023, pp. 12 619–12 629. 1

  11. [11]

    Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content cre- ation,

    R. Chen, Y. Chen, N. Jiao, and K. Jia, “Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content cre- ation,” in Proceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 22 246–22 256. 1, 3

  12. [12]

    Magic3d: High- resolution text-to-3d content creation,

    C.-H. Lin, J. Gao, L. Tang, T. Takikawa, X. Zeng, X. Huang, 12 K. Kreis, S. Fidler, M.-Y. Liu, and T.-Y. Lin, “Magic3d: High- resolution text-to-3d content creation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2023, pp. 300–309. 1, 3

  13. [13]

    Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation,

    Z. Wang, C. Lu, Y. Wang, F. Bao, C. Li, H. Su, and J. Zhu, “Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation,” Advances in Neural Information Processing Systems, vol. 36, 2024. 1, 3

  14. [14]

    Text-to-3d using gaussian splatting,

    Z. Chen, F. Wang, Y. Wang, and H. Liu, “Text-to-3d using gaussian splatting,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2024, pp. 21 401–21 412. 1, 3

  15. [15]

    DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation

    J. Tang, J. Ren, H. Zhou, Z. Liu, and G. Zeng, “Dreamgaussian: Generative gaussian splatting for efficient 3d content creation,” arXiv preprint arXiv:2309.16653 , 2023. 1, 3

  16. [16]

    Zero123++: a Single Image to Consistent Multi-view Diffusion Base Model

    R. Shi, H. Chen, Z. Zhang, M. Liu, C. Xu, X. Wei, L. Chen, C. Zeng, and H. Su, “Zero123++: a single image to consistent multi-view diffusion base model,” arXiv preprint arXiv:2310.15110 , 2023. 1, 3

  17. [17]

    Wonder3d: Single image to 3d using cross-domain diffusion,

    X. Long, Y.-C. Guo, C. Lin, Y. Liu, Z. Dou, L. Liu, Y. Ma, S.- H. Zhang, M. Habermann, C. Theobalt et al. , “Wonder3d: Single image to 3d using cross-domain diffusion,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2024, pp. 9970–9980. 1, 3

  18. [18]

    SyncDreamer: Generating Multiview-consistent Images from a Single-view Image

    Y. Liu, C. Lin, Z. Zeng, X. Long, L. Liu, T. Komura, and W. Wang, “Syncdreamer: Generating multiview-consistent images from a single-view image,” arXiv preprint arXiv:2309.03453 , 2023. 1, 3

  19. [19]

    MVDream: Multi-view Diffusion for 3D Generation

    Y. Shi, P . Wang, J. Ye, M. Long, K. Li, and X. Yang, “Mv- dream: Multi-view diffusion for 3d generation,” arXiv preprint arXiv:2308.16512, 2023. 1, 3, 4, 5, 8

  20. [20]

    Instant3d: Fast text-to-3d with sparse-view generation and large reconstruction model.arXiv preprint arXiv:2311.06214, 2023

    J. Li, H. Tan, K. Zhang, Z. Xu, F. Luan, Y. Xu, Y. Hong, K. Sunkavalli, G. Shakhnarovich, and S. Bi, “Instant3d: Fast text-to-3d with sparse-view generation and large reconstruction model,” arXiv preprint arXiv:2311.06214 , 2023. 1, 3

  21. [21]

    LRM: Large Reconstruction Model for Single Image to 3D

    Y. Hong, K. Zhang, J. Gu, S. Bi, Y. Zhou, D. Liu, F. Liu, K. Sunkavalli, T. Bui, and H. Tan, “Lrm: Large reconstruction model for single image to 3d,” arXiv preprint arXiv:2311.04400 ,

  22. [22]

    Dmv3d: Denoising multi-view diffu- sion using 3d large reconstruction model.arXiv preprint arXiv:2311.09217, 2023

    Y. Xu, H. Tan, F. Luan, S. Bi, P . Wang, J. Li, Z. Shi, K. Sunkavalli, G. Wetzstein, Z. Xu et al. , “Dmv3d: Denoising multi-view dif- fusion using 3d large reconstruction model,” arXiv preprint arXiv:2311.09217, 2023. 1

  23. [23]

    InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models

    J. Xu, W. Cheng, Y. Gao, X. Wang, S. Gao, and Y. Shan, “In- stantmesh: Efficient 3d mesh generation from a single image with sparse-view large reconstruction models,” arXiv preprint arXiv:2404.07191, 2024. 1, 3

  24. [24]

    arXiv2402.05054(2024) 11

    J. Tang, Z. Chen, X. Chen, T. Wang, G. Zeng, and Z. Liu, “Lgm: Large multi-view gaussian model for high-resolution 3d content creation,” arXiv preprint arXiv:2402.05054 , 2024. 1, 3, 6, 8

  25. [25]

    Grm: Large gaussian reconstruction model for efficient 3d reconstruction and generation

    Y. Xu, Z. Shi, W. Yifan, H. Chen, C. Yang, S. Peng, Y. Shen, and G. Wetzstein, “Grm: Large gaussian reconstruction model for efficient 3d reconstruction and generation,” arXiv preprint arXiv:2403.14621, 2024. 1, 3

  26. [26]

    3dtopia-xl: Scaling high-quality 3d asset gen- eration via primitive diffusion,

    Z. Chen, J. Tang, Y. Dong, Z. Cao, F. Hong, Y. Lan, T. Wang, H. Xie, T. Wu, S. Saitoet al., “3dtopia-xl: Scaling high-quality 3d asset gen- eration via primitive diffusion,” arXiv preprint arXiv:2409.12957 ,

  27. [27]

    Huang, Y

    Z. Huang, Y.-C. Guo, H. Wang, R. Yi, L. Ma, Y.-P . Cao, and L. Sheng, “Mv-adapter: Multi-view consistent image generation made easy,” arXiv preprint arXiv:2412.03632 , 2024. 1

  28. [28]

    NeuS: Learning Neural Implicit Surfaces by Volume Rendering for Multi-view Reconstruction

    P . Wang, L. Liu, Y. Liu, C. Theobalt, T. Komura, and W. Wang, “Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction,” arXiv preprint arXiv:2106.10689 , 2021. 1, 3

  29. [29]

    Nerf: Representing scenes as neural radiance fields for view synthesis,

    B. Mildenhall, P . P . Srinivasan, M. Tancik, J. T. Barron, R. Ra- mamoorthi, and R. Ng, “Nerf: Representing scenes as neural radiance fields for view synthesis,” Communications of the ACM , vol. 65, no. 1, pp. 99–106, 2021. 1, 3

  30. [30]

    Richdreamer: A generalizable normal-depth diffusion model for detail richness in text-to-3d,

    L. Qiu, G. Chen, X. Gu, Q. Zuo, M. Xu, Y. Wu, W. Yuan, Z. Dong, L. Bo, and X. Han, “Richdreamer: A generalizable normal-depth diffusion model for detail richness in text-to-3d,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2024, pp. 9914–9925. 1, 2, 3, 7

  31. [31]

    arXiv preprint arXiv:2412.12083 (2024)

    Z. Li, T. Wu, J. Tan, M. Zhang, J. Wang, and D. Lin, “Idarb: Intrinsic decomposition for arbitrary number of input views and illuminations,” arXiv preprint arXiv:2412.12083 , 2024. 1, 3, 4, 5, 8

  32. [32]

    arXiv preprint arXiv:2501.18590 (2025)

    R. Liang, Z. Gojcic, H. Ling, J. Munkberg, J. Hasselgren, Z.-H. Lin, J. Gao, A. Keller, N. Vijaykumar, S. Fidler et al. , “Diffusionren- derer: Neural inverse and forward rendering with video diffusion models,” arXiv preprint arXiv:2501.18590 , 2025. 1

  33. [33]

    Depth anything at any condition.arXiv preprint arXiv:2507.01634, 2025

    B. Sun, M. Jin, B. Yin, and Q. Hou, “Depth anything at any condition,” arXiv preprint arXiv:2507.01634 , 2025. 1

  34. [34]

    3d gaussian splatting for real-time radiance field rendering,

    B. Kerbl, G. Kopanas, T. Leimk ¨uhler, and G. Drettakis, “3d gaussian splatting for real-time radiance field rendering,” ACM T ransactions on Graphics, vol. 42, no. 4, July 2023. [Online]. Avail- able: https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/ 2, 3, 4, 7

  35. [35]

    2d gaussian splatting for geometrically accurate radiance fields,

    B. Huang, Z. Yu, A. Chen, A. Geiger, and S. Gao, “2d gaussian splatting for geometrically accurate radiance fields,” in ACM SIGGRAPH 2024 Conference Papers , 2024, pp. 1–11. 2, 4, 7

  36. [36]

    Splatter image: Ultra-fast single-view 3d reconstruction,

    S. Szymanowicz, C. Rupprecht, and A. Vedaldi, “Splatter image: Ultra-fast single-view 3d reconstruction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2024, pp. 10 208–10 217. 2, 4

  37. [37]

    Objaverse: A universe of annotated 3d objects,

    M. Deitke, D. Schwenk, J. Salvador, L. Weihs, O. Michel, E. Van- derBilt, L. Schmidt, K. Ehsani, A. Kembhavi, and A. Farhadi, “Objaverse: A universe of annotated 3d objects,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2023, pp. 13 142–13 153. 2, 3, 7

  38. [38]

    Denoising diffusion probabilistic models,

    J. Ho, A. Jain, and P . Abbeel, “Denoising diffusion probabilistic models,” Advances in neural information processing systems , vol. 33, pp. 6840–6851, 2020. 3

  39. [39]

    Flexible isosurface extraction for gradient-based mesh optimization,

    T. Shen, J. Munkberg, J. Hasselgren, K. Yin, Z. Wang, W. Chen, Z. Gojcic, S. Fidler, N. Sharp, and J. Gao, “Flexible isosurface extraction for gradient-based mesh optimization,” ACM T rans. Graph. , vol. 42, no. 4, jul 2023. [Online]. Available: https://doi.org/10.1145/3592430 3

  40. [40]

    Deep march- ing tetrahedra: a hybrid representation for high-resolution 3d shape synthesis,

    T. Shen, J. Gao, K. Yin, M.-Y. Liu, and S. Fidler, “Deep march- ing tetrahedra: a hybrid representation for high-resolution 3d shape synthesis,” Advances in Neural Information Processing Systems, vol. 34, pp. 6087–6101, 2021. 3

  41. [41]

    Occupancy networks: Learning 3d reconstruction in function space,

    L. Mescheder, M. Oechsle, M. Niemeyer, S. Nowozin, and A. Geiger, “Occupancy networks: Learning 3d reconstruction in function space,” in Proceedings of the IEEE/CVF conference on com- puter vision and pattern recognition , 2019, pp. 4460–4470. 3

  42. [42]

    arXiv preprint arXiv:2310.19415 , year=

    X. Yu, Y.-C. Guo, Y. Li, D. Liang, S.-H. Zhang, and X. Qi, “Text-to-3d with classifier score distillation,” arXiv preprint arXiv:2310.19415, 2023. 3

  43. [43]

    Lucid- dreamer: Towards high-fidelity text-to-3d generation via interval score matching,

    Y. Liang, X. Yang, J. Lin, H. Li, X. Xu, and Y. Chen, “Lucid- dreamer: Towards high-fidelity text-to-3d generation via interval score matching,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2024, pp. 6517–6526. 3

  44. [44]

    Gaussiandreamer: Fast generation from text to 3d gaussian splatting with point cloud priors

    T. Yi, J. Fang, G. Wu, L. Xie, X. Zhang, W. Liu, Q. Tian, and X. Wang, “Gaussiandreamer: Fast generation from text to 3d gaussian splatting with point cloud priors,” arXiv preprint arXiv:2310.08529, 2023. 3

  45. [45]

    arXiv2404.19702(2024) 11

    K. Zhang, S. Bi, H. Tan, Y. Xiangli, N. Zhao, K. Sunkavalli, and Z. Xu, “Gs-lrm: Large reconstruction model for 3d gaussian splatting,” arXiv preprint arXiv:2404.19702 , 2024. 3

  46. [46]

    Ctrl-room: Controllable text- to-3d room meshes generation with layout constraints,

    C. Fang, X. Hu, K. Luo, and P . Tan, “Ctrl-room: Controllable text-to-3d room meshes generation with layout constraints,” arXiv preprint arXiv:2310.03602, 2023. 3

  47. [47]

    arXiv preprint arXiv:2312.17142 , year=

    J. Ren, L. Pan, J. Tang, C. Zhang, A. Cao, G. Zeng, and Z. Liu, “Dreamgaussian4d: Generative 4d gaussian splatting,” arXiv preprint arXiv:2312.17142 , 2023. 3

  48. [48]

    arXiv preprint arXiv:2301.11280 , year=

    U. Singer, S. Sheynin, A. Polyak, O. Ashual, I. Makarov, F. Kokki- nos, N. Goyal, A. Vedaldi, D. Parikh, J. Johnson et al., “Text-to-4d dynamic scene generation,” arXiv preprint arXiv:2301.11280 , 2023. 3

  49. [49]

    Align your gaussians: Text-to-4d with dynamic 3d gaussians and composed diffusion models,

    H. Ling, S. W. Kim, A. Torralba, S. Fidler, and K. Kreis, “Align your gaussians: Text-to-4d with dynamic 3d gaussians and composed diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2024, pp. 8576–8588. 3

  50. [50]

    Objaverse-xl: A universe of 10m+ 3d objects,

    M. Deitke, R. Liu, M. Wallingford, H. Ngo, O. Michel, A. Kusupati, A. Fan, C. Laforte, V . Voleti, S. Y. Gadre et al. , “Objaverse-xl: A universe of 10m+ 3d objects,” Advances in Neural Information Processing Systems, vol. 36, 2024. 3, 7

  51. [51]

    Ar-1-to-3: Single image to consistent 3d object generation via next-view prediction,

    X. Zhang, Y. Zhou, K. Wang, Y. Wang, Z. Li, S. Jiao, D. Zhou, Q. Hou, and M.-M. Cheng, “Ar-1-to-3: Single image to consistent 3d object generation via next-view prediction,” arXiv preprint arXiv:2503.12929, 2025. 3

  52. [52]

    Sweetdreamer: Aligning geometric priors in 2d diffusion for consistent text-to-3d,

    W. Li, R. Chen, X. Chen, and P . Tan, “Sweetdreamer: Aligning geometric priors in 2d diffusion for consistent text-to-3d,” arXiv preprint arXiv:2310.02596, 2023. 3 13

  53. [53]

    Dreamview: Injecting view-specific text guidance into text-to-3d generation,

    J. Yan, Y. Gao, Q. Yang, X. Wei, X. Xie, A. Wu, and W.-S. Zheng, “Dreamview: Injecting view-specific text guidance into text-to-3d generation,” in European Conference on Computer Vision . Springer, 2024, pp. 358–374. 3, 4, 5, 8

  54. [54]

    Crm: Single image to 3d textured mesh with convolutional reconstruction model,

    Z. Wang, Y. Wang, Y. Chen, C. Xiang, S. Chen, D. Yu, C. Li, H. Su, and J. Zhu, “Crm: Single image to 3d textured mesh with con- volutional reconstruction model,” arXiv preprint arXiv:2403.05034 ,

  55. [55]

    Lara: Efficient large-baseline radiance fields,

    A. Chen, H. Xu, S. Esposito, S. Tang, and A. Geiger, “Lara: Efficient large-baseline radiance fields,” arXiv preprint arXiv:2407.04699 ,

  56. [56]

    Turbo3d: Ultra-fast text-to-3d generation,

    H. Hu, T. Yin, F. Luan, Y. Hu, H. Tan, Z. Xu, S. Bi, S. Tulsiani, and K. Zhang, “Turbo3d: Ultra-fast text-to-3d generation,” arXiv preprint arXiv:2412.04470, 2024. 3

  57. [57]

    Diffsplat: Repurposing image diffusion models for scalable gaussian splat generation,

    C. Lin, P . Pan, B. Yang, Z. Li, and Y. Mu, “Diffsplat: Repurposing image diffusion models for scalable gaussian splat generation,” arXiv preprint arXiv:2501.16764 , 2025. 3, 8

  58. [58]

    Structured 3d latents for scalable and versatile 3d generation,

    J. Xiang, Z. Lv, S. Xu, Y. Deng, R. Wang, B. Zhang, D. Chen, X. Tong, and J. Yang, “Structured 3d latents for scalable and versatile 3d generation,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 21469–21480. 3, 8

  59. [59]

    3dshape2vecset: A 3d shape representation for neural fields and generative diffusion models,

    B. Zhang, J. Tang, M. Nießner, and P. Wonka, “3dshape2vecset: A 3d shape representation for neural fields and generative diffusion models,” ACM Trans. Graph., vol. 42, no. 4, Jul. 2023. [Online]. Available: https://doi.org/10.1145/3592442 3

  60. [60]

    Craftsman3d: High-fidelity mesh generation with 3d native generation and interactive geometry refiner,

    W. Li, J. Liu, H. Yan, R. Chen, Y. Liang, X. Chen, P. Tan, and X. Long, “Craftsman3d: High-fidelity mesh generation with 3d native generation and interactive geometry refiner,” arXiv preprint arXiv:2405.14979, 2024. 3

  61. [61]

    Dora: Sampling and benchmarking for 3d shape variational auto-encoders,

    R. Chen, J. Zhang, Y. Liang, G. Luo, W. Li, J. Liu, X. Li, X. Long, J. Feng, and P. Tan, “Dora: Sampling and benchmarking for 3d shape variational auto-encoders,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 16251–16261. 3

  62. [62]

    Nero: Neural geometry and brdf reconstruction of reflective objects from multiview images,

    Y. Liu, P. Wang, C. Lin, X. Long, J. Wang, L. Liu, T. Komura, and W. Wang, “Nero: Neural geometry and brdf reconstruction of reflective objects from multiview images,” ACM Transactions on Graphics (TOG), vol. 42, no. 4, pp. 1–22, 2023. 3

  63. [63]

    Gs-ror: 3d gaussian splatting for reflective object relighting via sdf priors,

    Z.-L. Zhu, B. Wang, and J. Yang, “Gs-ror: 3d gaussian splatting for reflective object relighting via sdf priors,” arXiv preprint arXiv:2406.18544, 2024. 3

  64. [64]

    Nerfactor: Neural factorization of shape and reflectance under an unknown illumination,

    X. Zhang, P. P. Srinivasan, B. Deng, P. Debevec, W. T. Freeman, and J. T. Barron, “Nerfactor: Neural factorization of shape and reflectance under an unknown illumination,” ACM Transactions on Graphics (TOG), vol. 40, no. 6, pp. 1–18, 2021. 3

  65. [65]

    Tensosdf: Roughness-aware tensorial representation for robust geometry and material reconstruction,

    J. Li, L. Wang, L. Zhang, and B. Wang, “Tensosdf: Roughness-aware tensorial representation for robust geometry and material reconstruction,” ACM Transactions on Graphics (TOG), vol. 43, no. 4, pp. 1–13, 2024. 3

  66. [66]

    Gaussian splatting with discretized sdf for relightable assets,

    Z.-L. Zhu, J. Yang, and B. Wang, “Gaussian splatting with discretized sdf for relightable assets,” in Proceedings of IEEE International Conference on Computer Vision (ICCV), 2025. 3

  67. [67]

    Unidream: Unifying diffusion priors for relightable text-to-3d generation,

    Z. Liu, Y. Li, Y. Lin, X. Yu, S. Peng, Y.-P. Cao, X. Qi, X. Huang, D. Liang, and W. Ouyang, “Unidream: Unifying diffusion priors for relightable text-to-3d generation,” arXiv preprint arXiv:2312.08754, 2023. 3

  68. [68]

    Matlaber: Material-aware text-to-3d via latent brdf auto-encoder,

    X. Xu, Z. Lyu, X. Pan, and B. Dai, “Matlaber: Material-aware text-to-3d via latent brdf auto-encoder,” arXiv preprint arXiv:2308.09278, 2023. 3

  69. [69]

    Meta 3d assetgen: Text-to-mesh generation with high-quality geometry, texture, and pbr materials,

    Y. Siddiqui, T. Monnier, F. Kokkinos, M. Kariya, Y. Kleiman, E. Garreau, O. Gafni, N. Neverova, A. Vedaldi, R. Shapovalov et al., “Meta 3d assetgen: Text-to-mesh generation with high-quality geometry, texture, and pbr materials,” arXiv preprint arXiv:2407.02445, 2024. 3

  70. [70]

    Arm: Appearance reconstruction model for relightable 3d generation,

    X. Feng, C. Yu, Z. Bi, Y. Shang, F. Gao, H. Wu, K. Zhou, C. Jiang, and Y. Yang, “Arm: Appearance reconstruction model for relightable 3d generation,” arXiv preprint arXiv:2411.10825, 2024. 3

  71. [71]

    Texgaussian: Generating high-quality pbr material via octree-based 3d gaussian splatting,

    B. Xiong, J. Liu, J. Hu, C. Wu, J. Wu, X. Liu, C. Zhao, E. Ding, and Z. Lian, “Texgaussian: Generating high-quality pbr material via octree-based 3d gaussian splatting,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 551–561. 3

  72. [72]

    Texgen: a generative diffusion model for mesh textures,

    X. Yu, Z. Yuan, Y.-C. Guo, Y.-T. Liu, J. Liu, Y. Li, Y.-P. Cao, D. Liang, and X. Qi, “Texgen: a generative diffusion model for mesh textures,” ACM Transactions on Graphics (TOG), vol. 43, no. 6, pp. 1–14, 2024. 3

  73. [73]

    Objects with lighting: A real-world dataset for evaluating reconstruction and rendering for object relighting,

    B. Ummenhofer, S. Agrawal, R. Sepúlveda, Y. Lao, K. Zhang, T. Cheng, S. R. Richter, S. Wang, and G. Ros, “Objects with lighting: A real-world dataset for evaluating reconstruction and rendering for object relighting,” in 3DV. IEEE, 2024. 3, 6

  74. [74]

    Digital twin catalog: A large-scale photorealistic 3d object digital twin dataset,

    Z. Dong, K. Chen, Z. Lv, H.-X. Yu, Y. Zhang, C. Zhang, Y. Zhu, S. Tian, Z. Li, G. Moffatt et al., “Digital twin catalog: A large-scale photorealistic 3d object digital twin dataset,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 753–

  75. [75]

    Mage: Single image to material-aware 3d via the multi-view g-buffer estimation model,

    H. Wang, Z. Wang, X. Long, C. Lin, G. Hancke, and R. W. Lau, “Mage: Single image to material-aware 3d via the multi-view g-buffer estimation model,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 10985–10995. 3

  76. [76]

    High-resolution image synthesis with latent diffusion models,

    R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10684–10695. 4

  77. [77]

    Imagedream: Image-prompt multi-view diffusion for 3d generation,

    P. Wang and Y. Shi, “Imagedream: Image-prompt multi-view diffusion for 3d generation,” arXiv preprint arXiv:2312.02201, 2023. 4, 5, 7, 8

  78. [78]

    Adding conditional control to text-to-image diffusion models,

    L. Zhang, A. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 3836–3847. 5

  79. [79]

    Extracting triangular 3d models, materials, and lighting from images,

    J. Munkberg, J. Hasselgren, T. Shen, J. Gao, W. Chen, A. Evans, T. Müller, and S. Fidler, “Extracting triangular 3d models, materials, and lighting from images,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 8280–8290. 5

  80. [80]

    Gs-ir: 3d gaussian splatting for inverse rendering,

    Z. Liang, Q. Zhang, Y. Feng, Y. Shan, and K. Jia, “Gs-ir: 3d gaussian splatting for inverse rendering,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 21644–21653. 5
