Hitem3D 2.0: Multi-View Guided Native 3D Texture Generation
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-10 18:17 UTC · model grok-4.3
The pith
Hitem3D 2.0 generates more complete and consistent 3D textures by guiding native 3D modeling with multi-view image priors.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Hitem3D 2.0 consists of a multi-view synthesis framework built on a pre-trained image editing backbone with added plug-and-play modules that promote geometric alignment, cross-view consistency, and illumination uniformity, plus a native 3D texture generation model that projects the resulting multi-view textures onto 3D surfaces while completing regions without direct observation. Integration of multi-view consistency constraints with native 3D texture modeling produces textures with greater completeness, cross-view coherence, and geometric alignment than prior approaches.
What carries the argument
Multi-view synthesis framework with plug-and-play modules for alignment and consistency, feeding into a native 3D texture projection and completion model conditioned on generated views and input geometry.
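The projection-and-completion step can be sketched in a few lines. This is an illustrative stand-in, not the paper's model: the visibility test, blending weights, and the mean-color "completion" of unseen vertices are all simplifications of what the paper describes as a learned native 3D texture model.

```python
import numpy as np

def backproject_texture(normals, view_dirs, view_colors):
    """Blend per-view colors onto mesh vertices, weighting each view by how
    directly it faces the surface. Vertices seen by no view are filled from
    the mean observed color, a crude placeholder for the paper's learned
    completion of unobserved regions. All names here are illustrative.
    normals: (N, 3) unit vertex normals
    view_dirs: (V, 3) unit camera viewing directions
    view_colors: (V, N, 3) color each view assigns to each vertex"""
    colors = np.zeros((normals.shape[0], 3))
    weights = np.zeros(normals.shape[0])
    for v in range(view_dirs.shape[0]):
        # Visibility proxy: a view contributes only where the surface
        # normal points back toward the camera.
        w = np.clip(normals @ -view_dirs[v], 0.0, None)
        colors += w[:, None] * view_colors[v]
        weights += w
    seen = weights > 1e-6
    colors[seen] /= weights[seen, None]
    if seen.any() and (~seen).any():
        # "Completion" placeholder for regions with no direct observation.
        colors[~seen] = colors[seen].mean(axis=0)
    return colors, seen
```

A vertex facing the camera inherits that view's color; a back-facing vertex gets the completion fallback.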
If this is right
- Textures cover a larger fraction of each 3D surface without gaps.
- Appearance remains coherent when the object is viewed from new angles.
- Texture details align more closely with the input mesh geometry.
- Generated 3D models show higher visual fidelity and require less post-processing.
- Quantitative scores on detail, consistency, and alignment metrics exceed those of existing methods.
Where Pith is reading between the lines
- The same consistency modules could be tested on video input to enforce temporal stability in animated textures.
- This pipeline might integrate with existing 3D reconstruction tools to texture scanned meshes automatically.
- Applications in virtual reality could benefit if the method scales to real-time multi-view capture streams.
- Limitations may appear on highly detailed or transparent surfaces where projection alone cannot recover hidden details.
Load-bearing premise
The added plug-and-play modules on the pre-trained image editing backbone will enforce geometric alignment, cross-view consistency, and uniform illumination without creating new artifacts or needing heavy retraining.
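The abstract does not say how the plug-and-play modules attach to the frozen backbone; a LoRA-style low-rank update is one common mechanism and illustrates why no backbone retraining is needed. The function and shapes below are assumptions for the sketch, not the paper's architecture.

```python
import numpy as np

def adapter_forward(x, W_frozen, A, B, scale=1.0):
    """Low-rank 'plug-and-play' update on a frozen linear weight: only A
    and B are trained, W_frozen never changes.
    Shapes: W_frozen (d_out, d_in), A (r, d_in), B (d_out, r), r small."""
    return x @ W_frozen.T + scale * (x @ A.T @ B.T)

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 4, 2
W = rng.standard_normal((d_out, d_in))
A = rng.standard_normal((r, d_in)) * 0.01
B = np.zeros((d_out, r))  # zero-init: the adapter starts as a no-op
x = rng.standard_normal((3, d_in))
# With B = 0 the adapted layer reproduces the frozen backbone exactly,
# so adding the module cannot create new artifacts before training.
assert np.allclose(adapter_forward(x, W, A, B), x @ W.T)
```

The zero-initialized branch is what makes such modules safe to bolt onto a pre-trained model: training moves the output away from the backbone only as far as the data demands.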
What would settle it
Running the method on a held-out set of complex 3D objects and finding visible texture seams or lighting shifts when rotating the model to unobserved angles would indicate the consistency claims do not hold.
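A seam or lighting-shift test of the kind described above can be made quantitative: sample the same surface points from several rendered views and measure the worst per-point color spread. The function below is a minimal sketch of such a check, not a metric from the paper.

```python
import numpy as np

def max_cross_view_deviation(point_colors):
    """point_colors: (V, N, 3) array giving, for each of V views, the color
    observed at N shared surface points. Returns the worst per-point channel
    spread across views; large values flag seams or illumination shifts."""
    spread = point_colors.max(axis=0) - point_colors.min(axis=0)  # (N, 3)
    return spread.max(axis=1)  # (N,) worst channel deviation per point
```

A perfectly consistent texture yields zeros; thresholding the output localizes the points where the consistency claim fails.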
Original abstract
Although recent advances have improved the quality of 3D texture generation, existing methods still struggle with incomplete texture coverage, cross-view inconsistency, and misalignment between geometry and texture. To address these limitations, we propose Hitem3D 2.0, a multi-view guided native 3D texture generation framework that enhances texture quality through the integration of 2D multi-view generation priors and native 3D texture representations. Hitem3D 2.0 comprises two key components: a multi-view synthesis framework and a native 3D texture generation model. The multi-view generation is built upon a pre-trained image editing backbone and incorporates plug-and-play modules that explicitly promote geometric alignment, cross-view consistency, and illumination uniformity, thereby enabling the synthesis of high-fidelity multi-view images. Conditioned on the generated views and 3D geometry, the native 3D texture generation model projects multi-view textures onto 3D surfaces while plausibly completing textures in unseen regions. Through the integration of multi-view consistency constraints with native 3D texture modeling, Hitem3D 2.0 significantly improves texture completeness, cross-view coherence, and geometric alignment. Experimental results demonstrate that Hitem3D 2.0 outperforms existing methods in terms of texture detail, fidelity, consistency, coherence, and alignment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Hitem3D 2.0, a multi-view guided native 3D texture generation framework comprising a multi-view synthesis component built on a pre-trained image editing backbone augmented with plug-and-play modules to enforce geometric alignment, cross-view consistency, and illumination uniformity, plus a native 3D texture model that projects multi-view textures onto 3D geometry while completing unseen regions. It claims this integration yields significant gains in texture completeness, cross-view coherence, and geometric alignment, with experiments showing outperformance over prior methods on texture detail, fidelity, consistency, coherence, and alignment.
Significance. If the central claims are substantiated, the work would be significant for 3D content generation by demonstrating a practical, modular way to inject 2D multi-view priors into native 3D texture modeling without full backbone retraining. This could improve efficiency and quality in graphics pipelines for VR, gaming, and digital content creation. The explicit separation of 2D consistency enforcement from 3D projection is a potentially reusable design pattern.
Major comments (2)
- [Abstract / multi-view synthesis framework description] The assertion that the plug-and-play modules 'explicitly promote geometric alignment, cross-view consistency, and illumination uniformity' (Abstract) is load-bearing for the claim of reliable enforcement on a frozen pre-trained backbone, yet no architecture diagrams, loss terms, or conditioning mechanisms are supplied to show how these properties are achieved or isolated from the backbone.
- [Experimental results] The statement that 'Experimental results demonstrate that Hitem3D 2.0 outperforms existing methods' (Abstract) in multiple dimensions lacks any quantitative metrics, ablation studies isolating the plug-and-play modules, error bars, or dataset specifications; without these, the central outperformance claim cannot be verified and the propagation of 2D-stage shortfalls to the 3D texture model remains untested.
Minor comments (1)
- [Abstract] The abstract would be strengthened by a single sentence naming the evaluation datasets or benchmarks used to support the outperformance claim.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the major comments below and will revise the manuscript accordingly to improve clarity and substantiation of our claims.
Point-by-point responses
-
Referee: [Abstract / multi-view synthesis framework description] The assertion that the plug-and-play modules 'explicitly promote geometric alignment, cross-view consistency, and illumination uniformity' (Abstract) is load-bearing for the claim of reliable enforcement on a frozen pre-trained backbone, yet no architecture diagrams, loss terms, or conditioning mechanisms are supplied to show how these properties are achieved or isolated from the backbone.
Authors: We agree that the abstract alone does not convey the implementation details. The methods section of the manuscript describes the multi-view synthesis framework built on the pre-trained backbone, but to make the enforcement mechanisms explicit, we will add an architecture diagram and the specific loss terms (including geometric consistency and illumination uniformity regularizers) plus conditioning details in the revised version. These plug-and-play modules operate by injecting adapters into the frozen backbone without retraining it.
Revision: yes
-
Referee: [Experimental results] The statement that 'Experimental results demonstrate that Hitem3D 2.0 outperforms existing methods' (Abstract) in multiple dimensions lacks any quantitative metrics, ablation studies isolating the plug-and-play modules, error bars, or dataset specifications; without these, the central outperformance claim cannot be verified and the propagation of 2D-stage shortfalls to the 3D texture model remains untested.
Authors: We acknowledge that the abstract claim requires supporting evidence. In the revised manuscript we will expand the experiments section to report quantitative metrics (e.g., FID, PSNR, consistency scores), ablation studies isolating each plug-and-play module, error bars across multiple runs, and explicit dataset details. We will also add analysis of how improvements in the 2D multi-view stage propagate to the final 3D texture quality.
Revision: yes
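Of the metrics named in the response, PSNR is the simplest to pin down. A minimal reference implementation (the paper does not specify its exact metric code, so treat this as a standard definition rather than the authors' implementation):

```python
import numpy as np

def psnr(ref, test, max_val=1.0):
    """Peak signal-to-noise ratio in dB between a reference rendering and a
    test rendering; higher means closer to the reference."""
    mse = np.mean((ref - test) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```

For images in [0, 1], a uniform error of 0.1 gives an MSE of 0.01 and hence a PSNR of 20 dB.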
Circularity Check
No circularity: compositional framework with no self-referential reductions or fitted predictions
Full rationale
The paper presents Hitem3D 2.0 as a composition of a pre-trained image editing backbone augmented with plug-and-play modules for multi-view synthesis, followed by a native 3D texture model conditioned on the outputs. The abstract and description assert improvements in completeness, consistency, and alignment as empirical outcomes of this integration, without any equations, parameter fits, uniqueness theorems, or self-citations that reduce the claimed results to the inputs by construction. No load-bearing step equates a 'prediction' or 'first-principles result' to a fitted quantity or renamed input. The derivation chain remains self-contained as a high-level architectural proposal validated by experiments.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: A pre-trained 2D image editing backbone can be extended with plug-and-play modules to enforce geometric alignment, cross-view consistency, and illumination uniformity.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking (unclear)
Relation between the paper passage and the cited Recognition theorem is unclear.
Linked passage: "plug-and-play modules that explicitly promote geometric alignment, cross-view consistency, and illumination uniformity... 3D RoPE... dual-branch VAE... DiT conditioned on geometric and multi-view features"
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (unclear)
Relation between the paper passage and the cited Recognition theorem is unclear.
Linked passage: "native 3D texture generation model... sparse voxel representations"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.