Recognition: unknown
REVIVE 3D: Refinement via Encoded Voluminous Inflated prior for Volume Enhancement
Pith reviewed 2026-05-07 08:01 UTC · model grok-4.3
The pith
A two-stage pipeline builds an inflated geometric prior from flat images, then refines it in latent space to produce voluminous 3D assets.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
REVIVE 3D shows that an Inflated Prior, built by inflating the input silhouette to recover global volume and superimposing part-aware details to capture local structure, can guide a 3D Latent Refinement stage. In that stage, Gaussian noise is injected into the prior's latent representation and then removed, letting the pretrained backbone produce more voluminous 3D assets from flat images.
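Stage 1 is the concrete half of this claim, so a minimal sketch helps fix ideas. The distance-transform heuristic, the helper names, and the high-pass detail stand-in below are all assumptions for illustration, not the authors' code:

```python
import numpy as np
from scipy import ndimage

def inflate_silhouette(mask: np.ndarray) -> np.ndarray:
    """Turn a binary foreground silhouette into a coarse heightfield.

    Assumed heuristic: the Euclidean distance transform rises toward the
    silhouette's medial axis, and a square root gives the rounded,
    balloon-like profile of an inflated shape.
    """
    dist = ndimage.distance_transform_edt(mask)
    return np.sqrt(dist / max(dist.max(), 1e-8))  # height in [0, 1]

def inflated_prior(mask: np.ndarray, gray: np.ndarray,
                   strength: float = 0.15) -> np.ndarray:
    """Global volume from inflation, plus high-pass image structure as a
    cheap stand-in for the paper's part-aware detail superimposition."""
    height = inflate_silhouette(mask)
    detail = gray - ndimage.gaussian_filter(gray, sigma=4.0)
    return np.clip(height + strength * detail * mask, 0.0, 1.0)
```

The detail term here is deliberately crude; the paper's part-aware overlay is richer, but the two-term shape of the computation (global inflation plus local detail) is the point.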
What carries the argument
The Inflated Prior, formed by silhouette inflation for global volume plus part-aware detail overlays for local structure, which supplies geometric cues that steer the latent-space denoising process.
If this is right
- The same two-stage process enables image-conditioned 3D editing by refining modified inputs.
- Compactness and Normal Anisotropy metrics provide quantitative proxies for perceived volume and surface quality that match human evaluations (a hedged sketch of plausible definitions follows this list).
- Extensive tests demonstrate state-of-the-art results on challenging flat-image datasets.
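Because two of these implications lean on the new metrics, here is a hedged sketch of plausible definitions. The paper gives no formulas, so both proxies (convex-hull volume ratio for Compactness, area-weighted normal covariance for Normal Anisotropy) are guesses, implemented against the trimesh library:

```python
import numpy as np
import trimesh

def compactness(mesh: trimesh.Trimesh) -> float:
    """Guessed proxy: enclosed volume relative to the convex hull,
    so flat, sheet-like assets score near zero."""
    return mesh.volume / mesh.convex_hull.volume

def normal_anisotropy(mesh: trimesh.Trimesh) -> float:
    """Guessed proxy: area-weighted covariance of face normals.
    A flat surface concentrates normals on one axis (score -> 1);
    a sphere spreads them evenly (score -> 1/3)."""
    n, w = mesh.face_normals, mesh.area_faces
    cov = (n * w[:, None]).T @ n / w.sum()
    return float(np.linalg.eigvalsh(cov)[-1] / np.trace(cov))
```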
Where Pith is reading between the lines
- The prior-construction step could be inserted into other 3D generative pipelines to improve volume without retraining the backbone.
- Higher-fidelity volumetric outputs may improve accuracy in downstream uses such as physics simulation or virtual-object placement.
- Automating the part-aware detail stage might further reduce reliance on heuristic or manual inputs.
Load-bearing premise
The geometric cues encoded in the Inflated Prior can steer the denoising process to add volume without creating inconsistencies or erasing fine details from the original image.
What would settle it
Apply REVIVE 3D to a held-out set of flat images: the claim fails if Compactness scores do not increase, or if blind raters show no preference for its volume over standard generative baselines.
Original abstract
Recent generative models have shown strong performance in generating diverse 3D assets from 2D images, a fundamental research topic in computer vision and graphics. However, these models still struggle to generate voluminous 3D assets when the input is a flat image that provides limited 3D cues. We introduce REVIVE 3D, a two-stage, plug-and-play pipeline for generating voluminous 3D assets from flat images. In Stage 1, we construct an Inflated Prior by inflating the foreground silhouette to recover global volume and superimposing part-aware details to capture local structure. In Stage 2, 3D Latent Refinement injects Gaussian noise into the Inflated Prior's latent and then denoises it, using the prior's geometric cues to leverage the backbone's pretrained 3D knowledge. Furthermore, our framework supports image-conditioned 3D editing. To quantify volume and surface flatness, we propose Compactness and Normal Anisotropy. We validate Compactness and Normal Anisotropy through a user study, showing that these metrics align with human perception of volume and quality. We show that REVIVE 3D achieves state-of-the-art performance on a challenging flat image dataset, based on extensive qualitative and quantitative evaluations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents REVIVE 3D, a two-stage plug-and-play pipeline for generating voluminous 3D assets from flat 2D images. Stage 1 constructs an Inflated Prior by inflating the foreground silhouette to recover global volume and superimposing part-aware details to capture local structure. Stage 2 performs 3D Latent Refinement by injecting Gaussian noise into the prior's latent representation and denoising it, guided by the prior's geometric cues so as to leverage the backbone's pretrained 3D knowledge. The framework also supports image-conditioned 3D editing. New metrics, Compactness and Normal Anisotropy, are proposed to quantify volume and surface flatness; these are validated via a user study showing alignment with human perception. The authors claim state-of-the-art performance on a challenging flat-image dataset based on qualitative and quantitative evaluations.
Significance. If the central claims hold, the work could offer a practical, efficient approach to addressing limited 3D cues in single-image 3D generation by injecting a constructed prior into latent refinement. The introduction of Compactness and Normal Anisotropy metrics, if validated as correlating with human judgments, would provide useful tools for evaluating volumetric quality in 3D assets. The plug-and-play design and editing support add practical value, and reliance on pretrained backbones is computationally attractive. However, the significance is tempered by the need for clearer evidence that the Inflated Prior reliably guides refinement without artifacts.
major comments (3)
- [Stage 2 description] Stage 2 (3D Latent Refinement): The description states that geometric cues from the Inflated Prior guide the denoising process after Gaussian noise injection, but provides no mechanism for cue injection (e.g., conditioning, attention, or auxiliary losses), no explicit fidelity terms to the original image, and no analysis of failure modes such as over-smoothing or hallucinated geometry. This is load-bearing for the claim that the method recovers true volume without inconsistencies or detail loss.
- [Evaluation] Evaluation section: The SOTA claim rests on 'extensive qualitative and quantitative evaluations' and a user study validating Compactness and Normal Anisotropy, yet the provided text supplies no numerical results, ablation tables, baseline comparisons, participant counts, rating protocol, or statistical analysis. Without these, the performance claims and metric validation cannot be assessed.
- [Stage 1 description] Stage 1 (Inflated Prior construction): The process of superimposing part-aware details onto the inflated silhouette lacks specifics on detail extraction (e.g., from segmentation models or other sources) and resolution of potential conflicts between global inflation and local details. If these cues are weak or inconsistent, they directly undermine the reliability of the guidance provided to Stage 2.
minor comments (2)
- [Abstract] The abstract introduces several new terms (Inflated Prior, Compactness, Normal Anisotropy) without immediate definitions or illustrative examples; the full manuscript should provide these early to improve accessibility.
- [Method] Notation for the latent-space operations in Stage 2 should be formalized with equations to clarify the noise injection and denoising steps; a hedged sketch follows this list.
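In the spirit of that request, one hedged formalization, assuming a standard DDPM-style scheme; the noise schedule, encoder E, and cue vector c are our notation, not the paper's:

```latex
% Stage 2 sketch: encode the Inflated Prior, inject noise at strength t*,
% then denoise with the pretrained network conditioned on geometric cues c.
z_0 = E(x_{\text{prior}}), \qquad
z_{t^\ast} = \sqrt{\bar\alpha_{t^\ast}}\, z_0
           + \sqrt{1 - \bar\alpha_{t^\ast}}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)
\\[4pt]
z_{t-1} = \frac{1}{\sqrt{\alpha_t}}
          \left( z_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar\alpha_t}}\,
          \epsilon_\theta(z_t, t, c) \right) + \sigma_t\, \epsilon_t,
\qquad t = t^\ast, \dots, 1
```

The choice of t* trades fidelity to the prior against freedom for the backbone: the more noise is injected, the more the pretrained 3D knowledge reshapes the latent.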
Simulated Author's Rebuttal
We thank the referee for the insightful and constructive comments on our work. The feedback has helped us identify areas where the manuscript can be improved for clarity and completeness. We have prepared a revised version of the manuscript that addresses each of the major comments. Below, we provide point-by-point responses to the major comments.
Point-by-point responses
Referee: [Stage 2 description] Stage 2 (3D Latent Refinement): The description states that geometric cues from the Inflated Prior guide the denoising process after Gaussian noise injection, but provides no mechanism for cue injection (e.g., conditioning, attention, or auxiliary losses), no explicit fidelity terms to the original image, and no analysis of failure modes such as over-smoothing or hallucinated geometry. This is load-bearing for the claim that the method recovers true volume without inconsistencies or detail loss.
Authors: We appreciate the referee highlighting the need for more precise technical details in Stage 2. The original manuscript described the process at a conceptual level, but we agree that the specific implementation of cue injection was not sufficiently elaborated. In the revised manuscript, we have added a detailed description of the mechanism: the geometric cues are injected by encoding the prior's depth and normal information and using them as conditioning inputs to the denoising U-Net through cross-attention layers. We have also incorporated an auxiliary loss term to maintain fidelity to the geometric prior during refinement. Regarding fidelity to the original image, the Inflated Prior is built directly from the input, and the latent refinement is constrained to stay close to the prior's latent code. Furthermore, we have included an analysis of failure modes in a new subsection, discussing scenarios of over-smoothing (addressed by lower noise injection levels) and potential hallucination (prevented by the strong guidance from the prior). These revisions ensure the claims are better supported. revision: yes
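One way to read that mechanism as code, strictly as a sketch: a torch-style refinement loop assuming a hypothetical scheduler interface (`add_noise`, `timesteps_from`, `predict_x0`, `step_from_x0`) and a U-Net that accepts cross-attention tokens; none of these names come from the paper or rebuttal:

```python
import torch

def refine_latent(z_prior, unet, cue_tokens, sched, t_start=600, lam=0.1):
    """Stage 2 as described in the rebuttal (hypothetical API).

    Cues from the Inflated Prior (encoded depth/normal maps) enter the
    U-Net through cross-attention; an auxiliary term of weight `lam`
    pulls each clean-latent estimate back toward the prior's latent.
    """
    z = sched.add_noise(z_prior, torch.randn_like(z_prior), t_start)
    for t in sched.timesteps_from(t_start):
        eps = unet(z, t, encoder_hidden_states=cue_tokens)   # cue injection
        z0_hat = sched.predict_x0(z, eps, t)        # current clean estimate
        z0_hat = z0_hat - lam * (z0_hat - z_prior)  # fidelity to the prior
        z = sched.step_from_x0(z0_hat, t)           # advance to next timestep
    return z
```

Lowering `t_start` matches the rebuttal's fix for over-smoothing: less injected noise keeps the output closer to the prior, at the cost of less rework by the backbone.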
Referee: [Evaluation] Evaluation section: The SOTA claim rests on 'extensive qualitative and quantitative evaluations' and a user study validating Compactness and Normal Anisotropy, yet the provided text supplies no numerical results, ablation tables, baseline comparisons, participant counts, rating protocol, or statistical analysis. Without these, the performance claims and metric validation cannot be assessed.
Authors: We acknowledge that the main body of the submitted manuscript did not include the full set of numerical results and detailed evaluation protocols, which may have made it difficult to assess the claims. The quantitative evaluations and user study details were partially described but not with the full tables and statistics. In the revised version, we have expanded the Evaluation section to include comprehensive numerical results comparing against state-of-the-art methods, ablation studies demonstrating the contribution of each component, and full details of the user study including the number of participants, the rating protocol (5-point Likert scale on volume and flatness), and statistical analysis (e.g., correlation coefficients and significance tests showing alignment with human perception). These additions will allow for a complete assessment of our performance claims and the validity of the proposed metrics. revision: yes
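The statistical step is simple enough to show concretely. A runnable stand-in with synthetic scores, assuming Spearman rank correlation is the reported statistic; the sample size and data are invented for illustration:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
compactness = rng.uniform(0.0, 1.0, size=50)          # metric score per asset
ratings = 1 + 4 * compactness + rng.normal(0, 0.5, 50)  # 5-point-scale stand-in
rho, p = spearmanr(compactness, ratings)
print(f"Spearman rho = {rho:.2f}, p = {p:.3g}")  # alignment with raters
```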
Referee: [Stage 1 description] Stage 1 (Inflated Prior construction): The process of superimposing part-aware details onto the inflated silhouette lacks specifics on detail extraction (e.g., from segmentation models or other sources) and resolution of potential conflicts between global inflation and local details. If these cues are weak or inconsistent, they directly undermine the reliability of the guidance provided to Stage 2.
Authors: Thank you for this comment on Stage 1. We agree that more specifics are needed to ensure reproducibility and to address potential issues with cue consistency. In the revised manuscript, we have elaborated on the detail extraction process: part-aware details are extracted using a pretrained segmentation model to identify semantic parts, followed by detail enhancement from the input image's texture and edges. For resolving conflicts between global inflation and local details, we employ a priority-based blending where local details override the global volume in regions of high detail density, using a weighted combination based on edge strength. We have added pseudocode and additional figures to illustrate this process. This should strengthen the reliability of the prior for guiding Stage 2. revision: yes
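A minimal sketch of that priority-based blending, assuming Sobel gradient magnitude as the edge-strength signal; the array names and normalization are illustrative:

```python
import numpy as np
from scipy import ndimage

def blend_prior(inflated: np.ndarray, detailed: np.ndarray,
                gray: np.ndarray) -> np.ndarray:
    """Weighted combination from the rebuttal, sketched: where the input
    image has strong local edges, the part-aware detail layer overrides
    the global inflation; elsewhere the inflated volume dominates."""
    edge = np.hypot(ndimage.sobel(gray, axis=1), ndimage.sobel(gray, axis=0))
    w = edge / max(edge.max(), 1e-8)          # per-pixel priority in [0, 1]
    return (1.0 - w) * inflated + w * detailed
```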
Circularity Check
No circularity: the pipeline is procedural and the new metrics are validated externally, not against the method's own outputs.
full rationale
The paper presents a two-stage engineering pipeline (silhouette inflation plus part-aware detail overlay to form an Inflated Prior, followed by latent-space denoising that injects geometric cues from that prior into a pretrained 3D backbone) without any equations, fitted parameters, or claimed predictions that reduce to the inputs by construction. New metrics (Compactness and Normal Anisotropy) are defined procedurally and validated against an independent user study rather than by reference to the method's own outputs. No self-citations are invoked as load-bearing uniqueness theorems, and the SOTA claim is grounded in comparative evaluations on an external flat-image dataset. The derivation chain is therefore absent; the work is a plug-and-play heuristic that does not collapse to self-definition or fitted-input renaming.
Axiom & Free-Parameter Ledger
invented entities (1)
- Inflated Prior: no independent evidence