pith. machine review for the scientific record.

arxiv: 2604.09231 · v1 · submitted 2026-04-10 · 💻 cs.CV

Recognition: 2 theorem links

Hitem3D 2.0: Multi-View Guided Native 3D Texture Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:17 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D texture generation · multi-view synthesis · native 3D modeling · texture consistency · geometric alignment · image editing · surface texturing · 3D asset creation

The pith

Hitem3D 2.0 generates more complete and consistent 3D textures by guiding native 3D modeling with multi-view image priors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Hitem3D 2.0 as a framework that tackles incomplete coverage, view-to-view mismatches, and geometry-texture misalignment in 3D texture generation. It first builds a multi-view synthesis stage on top of a pre-trained image editing model by adding modules that enforce alignment, consistency, and uniform lighting across generated images. These consistent views then condition a native 3D texture model that projects details onto surfaces and fills unseen areas plausibly. The result is textures that cover objects more fully, match when the object rotates, and sit correctly on the underlying shape. If the improvements hold, generated 3D assets would require less manual cleanup for use in graphics and simulation.
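
To make the division of labor concrete, here is a structural sketch of that two-stage flow in Python. Every identifier (synthesize_views, backbone.edit, texture_model, and so on) is a hypothetical stand-in for illustration; the paper does not publish this API, and the real components are diffusion models rather than plain function calls.

```python
# Hypothetical sketch of the two-stage pipeline described above.
# All identifiers are illustrative stand-ins, not the authors' code.

def synthesize_views(backbone, reference_image, geometry, n_views=6):
    """Stage 1: generate consistent multi-view images on a frozen backbone."""
    views = []
    for camera in geometry.sample_cameras(n_views):
        cues = geometry.render_maps(camera)  # e.g. normal / depth renderings
        view = backbone.edit(
            reference_image,
            condition=cues,       # geometric-alignment module
            context=views,        # cross-view-consistency module
            lighting="uniform",   # illumination-uniformity module
        )
        views.append(view)
    return views


def generate_texture(texture_model, geometry, views):
    """Stage 2: project the observed multi-view texture onto the surface
    and let the native 3D model complete regions no view observed."""
    return texture_model(geometry, views)
```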

Core claim

Hitem3D 2.0 consists of a multi-view synthesis framework built on a pre-trained image editing backbone with added plug-and-play modules that promote geometric alignment, cross-view consistency, and illumination uniformity, plus a native 3D texture generation model that projects the resulting multi-view textures onto 3D surfaces while completing regions without direct observation. Integration of multi-view consistency constraints with native 3D texture modeling produces textures with greater completeness, cross-view coherence, and geometric alignment than prior approaches.

What carries the argument

Multi-view synthesis framework with plug-and-play modules for alignment and consistency, feeding into a native 3D texture projection and completion model conditioned on generated views and input geometry.

If this is right

  • Textures cover a larger fraction of each 3D surface without gaps.
  • Appearance remains coherent when the object is viewed from new angles.
  • Texture details align more closely with the input mesh geometry.
  • Generated 3D models show higher visual fidelity and require less post-processing.
  • Quantitative scores on detail, consistency, and alignment metrics exceed those of existing methods (one toy way to operationalize a consistency score is sketched after this list).
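
The abstract names no concrete metrics, so as an editorial placeholder, here is one crude way a cross-view consistency score could be operationalized: compare global color statistics across the generated views, on the assumption that a texture consistent in 3D should look statistically similar from every angle. This is an illustration, not the paper's evaluation protocol.

```python
# Toy cross-view consistency proxy: pairwise total-variation distance
# between RGB histograms of generated views. Illustrative only; not the
# metric used in the paper.
from itertools import combinations

import numpy as np


def rgb_histogram(view, bins=16):
    """view: H x W x 3 float array in [0, 1]; returns a normalized histogram."""
    hist, _ = np.histogramdd(view.reshape(-1, 3), bins=bins, range=[(0, 1)] * 3)
    return hist / hist.sum()


def consistency_score(views):
    """1.0 means identical color statistics across every pair of views."""
    dists = [0.5 * np.abs(rgb_histogram(a) - rgb_histogram(b)).sum()
             for a, b in combinations(views, 2)]
    return 1.0 - float(np.mean(dists))
```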

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same consistency modules could be tested on video input to enforce temporal stability in animated textures.
  • This pipeline might integrate with existing 3D reconstruction tools to texture scanned meshes automatically.
  • Applications in virtual reality could benefit if the method scales to real-time multi-view capture streams.
  • Limitations may appear on highly detailed or transparent surfaces where projection alone cannot recover hidden details.

Load-bearing premise

The added plug-and-play modules on the pre-trained image editing backbone will enforce geometric alignment, cross-view consistency, and uniform illumination without creating new artifacts or needing heavy retraining.
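
The standard mechanism for this kind of extension is a low-rank adapter in the LoRA family: the pretrained weights stay frozen while a small, zero-initialized branch adds trainable capacity. The sketch below shows the generic technique; it is an assumption that the authors' modules work along these lines, and the rank and scaling values are arbitrary.

```python
# Generic LoRA-style adapter on a frozen linear layer. Illustrates the
# premise (plug-in capacity without retraining the backbone); it is not
# the authors' actual module design.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # the backbone stays frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)       # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen path plus a small trainable low-rank correction.
        return self.base(x) + self.scale * self.up(self.down(x))


# Wrap one projection of a frozen backbone; only the adapter trains.
adapted = LoRALinear(nn.Linear(1024, 1024))
out = adapted(torch.randn(2, 1024))          # shape preserved: (2, 1024)
```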

What would settle it

Running the method on a held-out set of complex 3D objects and finding visible texture seams or lighting shifts when rotating the model to unobserved angles would indicate the consistency claims do not hold.
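
A cheap instrumented version of that test is sketched below: sweep a camera around the textured asset and flag abrupt jumps in global luminance between consecutive rendered frames, a coarse proxy for lighting shifts (seams would need a spatial comparison). The frame format and threshold are illustrative assumptions.

```python
# Coarse lighting-shift detector for a camera sweep around a textured
# asset. Assumes frames are H x W x 3 RGB arrays in [0, 1]; the threshold
# is an illustrative choice, not a calibrated value.
import numpy as np

REC709 = np.array([0.2126, 0.7152, 0.0722])  # Rec. 709 luma weights


def mean_luminance(frame):
    """Average luma over one rendered frame."""
    return float((frame @ REC709).mean())


def lighting_shift_indices(frames, threshold=0.05):
    """Indices i where mean luminance jumps between frames i-1 and i."""
    lum = [mean_luminance(f) for f in frames]
    return [i for i in range(1, len(lum))
            if abs(lum[i] - lum[i - 1]) > threshold]
```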

Figures

Figures reproduced from arXiv: 2604.09231 by Heliang Zheng, Huiang He, Hu Zhang, Jianwen Huang, Jiaqi Wu, Jie Li, Pei Tang, Rongfei Jia, Shengchu Zhao, Yukun Li.

Figure 1: High-fidelity 3D textured assets are generated by our framework, Hitem3D 2.0, which leverages multi-view priors and native 3D …
Figure 2: Comparison of multi-view texturing, native texturing …
Figure 3: Overview of the multiview guided 3D native texture …
Figure 4: The framework of multiview generation. In the training stage 1, we fine-tune the image editing model to adapt the distribution …
Figure 5: The framework of 3D native texture generation. We design a dual-branch VAE to reconstruct geometry and texture features …
Figure 6: A gallery of 3D texture assets generated by Hitem3D 2.0.
Figure 7: Ablation study on our proposed components in multi-view generation model.
Figure 8: 3D texture reconstruction results.
Figure 9: Comparison results of 3D texture generation with other commercial models.
Original abstract

Although recent advances have improved the quality of 3D texture generation, existing methods still struggle with incomplete texture coverage, cross-view inconsistency, and misalignment between geometry and texture. To address these limitations, we propose Hitem3D 2.0, a multi-view guided native 3D texture generation framework that enhances texture quality through the integration of 2D multi-view generation priors and native 3D texture representations. Hitem3D 2.0 comprises two key components: a multi-view synthesis framework and a native 3D texture generation model. The multi-view generation is built upon a pre-trained image editing backbone and incorporates plug-and-play modules that explicitly promote geometric alignment, cross-view consistency, and illumination uniformity, thereby enabling the synthesis of high-fidelity multi-view images. Conditioned on the generated views and 3D geometry, the native 3D texture generation model projects multi-view textures onto 3D surfaces while plausibly completing textures in unseen regions. Through the integration of multi-view consistency constraints with native 3D texture modeling, Hitem3D 2.0 significantly improves texture completeness, cross-view coherence, and geometric alignment. Experimental results demonstrate that Hitem3D 2.0 outperforms existing methods in terms of texture detail, fidelity, consistency, coherence, and alignment.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated authors' rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Hitem3D 2.0, a multi-view guided native 3D texture generation framework comprising a multi-view synthesis component built on a pre-trained image editing backbone augmented with plug-and-play modules to enforce geometric alignment, cross-view consistency, and illumination uniformity, plus a native 3D texture model that projects multi-view textures onto 3D geometry while completing unseen regions. It claims this integration yields significant gains in texture completeness, cross-view coherence, and geometric alignment, with experiments showing outperformance over prior methods on texture detail, fidelity, consistency, coherence, and alignment.

Significance. If the central claims are substantiated, the work would be significant for 3D content generation by demonstrating a practical, modular way to inject 2D multi-view priors into native 3D texture modeling without full backbone retraining. This could improve efficiency and quality in graphics pipelines for VR, gaming, and digital content creation. The explicit separation of 2D consistency enforcement from 3D projection is a potentially reusable design pattern.

major comments (2)
  1. [Abstract / multi-view synthesis framework description] The assertion that the plug-and-play modules 'explicitly promote geometric alignment, cross-view consistency, and illumination uniformity' (Abstract) is load-bearing for the claim of reliable enforcement on a frozen pre-trained backbone, yet no architecture diagrams, loss terms, or conditioning mechanisms are supplied to show how these properties are achieved or isolated from the backbone.
  2. [Experimental results] The statement that 'Experimental results demonstrate that Hitem3D 2.0 outperforms existing methods' (Abstract) in multiple dimensions lacks any quantitative metrics, ablation studies isolating the plug-and-play modules, error bars, or dataset specifications; without these, the central outperformance claim cannot be verified and the propagation of 2D-stage shortfalls to the 3D texture model remains untested.
minor comments (1)
  1. [Abstract] The abstract would be strengthened by a single sentence naming the evaluation datasets or benchmarks used to support the outperformance claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments below and will revise the manuscript accordingly to improve clarity and substantiation of our claims.

point-by-point responses
  1. Referee: [Abstract / multi-view synthesis framework description] The assertion that the plug-and-play modules 'explicitly promote geometric alignment, cross-view consistency, and illumination uniformity' (Abstract) is load-bearing for the claim of reliable enforcement on a frozen pre-trained backbone, yet no architecture diagrams, loss terms, or conditioning mechanisms are supplied to show how these properties are achieved or isolated from the backbone.

    Authors: We agree that the abstract alone does not convey the implementation details. The methods section of the manuscript describes the multi-view synthesis framework built on the pre-trained backbone, but to make the enforcement mechanisms explicit, we will add an architecture diagram and the specific loss terms (including geometric consistency and illumination uniformity regularizers) plus conditioning details in the revised version. These plug-and-play modules operate by injecting adapters into the frozen backbone without retraining it. revision: yes

  2. Referee: [Experimental results] The statement that 'Experimental results demonstrate that Hitem3D 2.0 outperforms existing methods' (Abstract) in multiple dimensions lacks any quantitative metrics, ablation studies isolating the plug-and-play modules, error bars, or dataset specifications; without these, the central outperformance claim cannot be verified and the propagation of 2D-stage shortfalls to the 3D texture model remains untested.

    Authors: We acknowledge that the abstract claim requires supporting evidence. In the revised manuscript we will expand the experiments section to report quantitative metrics (e.g., FID, PSNR, consistency scores), ablation studies isolating each plug-and-play module, error bars across multiple runs, and explicit dataset details. We will also add analysis of how improvements in the 2D multi-view stage propagate to the final 3D texture quality. revision: yes
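
Of the metrics the rebuttal commits to, PSNR is simple enough to state exactly; a minimal implementation is below, assuming both images share the same shape and a [0, peak] value range.

```python
# PSNR between a reference image and a rendered one, one of the metrics
# named in the rebuttal. Assumes matching shapes and value ranges.
import numpy as np


def psnr(reference, rendered, peak=1.0):
    """Peak signal-to-noise ratio in dB; higher is better."""
    ref = np.asarray(reference, dtype=np.float64)
    ren = np.asarray(rendered, dtype=np.float64)
    mse = np.mean((ref - ren) ** 2)
    if mse == 0:
        return float("inf")                  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```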

Circularity Check

0 steps flagged

No circularity: compositional framework with no self-referential reductions or fitted predictions

full rationale

The paper presents Hitem3D 2.0 as a composition of a pre-trained image editing backbone augmented with plug-and-play modules for multi-view synthesis, followed by a native 3D texture model conditioned on the outputs. The abstract and description assert improvements in completeness, consistency, and alignment as empirical outcomes of this integration, without any equations, parameter fits, uniqueness theorems, or self-citations that reduce the claimed results to the inputs by construction. No load-bearing step equates a 'prediction' or 'first-principles result' to a fitted quantity or renamed input. The derivation chain remains self-contained as a high-level architectural proposal validated by experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Only the abstract is available, so the ledger is necessarily incomplete. The method relies on the existence of a capable pre-trained 2D image-editing model and on the assumption that small plug-in modules can enforce the desired geometric and consistency properties.

axioms (1)
  • domain assumption: A pre-trained 2D image editing backbone can be extended with plug-and-play modules to enforce geometric alignment, cross-view consistency, and illumination uniformity.
    Invoked in the description of the multi-view synthesis framework.

pith-pipeline@v0.9.0 · 5566 in / 1393 out tokens · 42494 ms · 2026-05-10T18:17:56.123862+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

