pith. machine review for the scientific record.

arxiv: 2604.09231 · v1 · submitted 2026-04-10 · 💻 cs.CV

Recognition: 2 theorem links

Hitem3D 2.0: Multi-View Guided Native 3D Texture Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:17 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D texture generation · multi-view synthesis · native 3D modeling · texture consistency · geometric alignment · image editing · surface texturing · 3D asset creation

The pith

Hitem3D 2.0 generates more complete and consistent 3D textures by guiding native 3D modeling with multi-view image priors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Hitem3D 2.0 as a framework that tackles incomplete coverage, view-to-view mismatches, and geometry-texture misalignment in 3D texture generation. It first builds a multi-view synthesis stage on top of a pre-trained image editing model by adding modules that enforce alignment, consistency, and uniform lighting across generated images. These consistent views then condition a native 3D texture model that projects details onto surfaces and fills unseen areas plausibly. The result is textures that cover objects more fully, match when the object rotates, and sit correctly on the underlying shape. If the improvements hold, generated 3D assets would require less manual cleanup for use in graphics and simulation.
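
To make the division of labor concrete, here is a structural sketch of that two-stage flow in Python. Every identifier (synthesize_views, backbone.edit, texture_model, and so on) is a hypothetical stand-in for illustration; the paper does not publish this API, and the real components are diffusion models rather than plain function calls.

```python
# Hypothetical sketch of the two-stage pipeline described above.
# All identifiers are illustrative stand-ins, not the authors' code.

def synthesize_views(backbone, reference_image, geometry, n_views=6):
    """Stage 1: generate consistent multi-view images on a frozen backbone."""
    views = []
    for camera in geometry.sample_cameras(n_views):
        cues = geometry.render_maps(camera)  # e.g. normal / depth renderings
        view = backbone.edit(
            reference_image,
            condition=cues,       # geometric-alignment module
            context=views,        # cross-view-consistency module
            lighting="uniform",   # illumination-uniformity module
        )
        views.append(view)
    return views


def generate_texture(texture_model, geometry, views):
    """Stage 2: project the observed multi-view texture onto the surface
    and let the native 3D model complete regions no view observed."""
    return texture_model(geometry, views)
```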

Core claim

Hitem3D 2.0 consists of a multi-view synthesis framework built on a pre-trained image editing backbone with added plug-and-play modules that promote geometric alignment, cross-view consistency, and illumination uniformity, plus a native 3D texture generation model that projects the resulting multi-view textures onto 3D surfaces while completing regions without direct observation. Integration of multi-view consistency constraints with native 3D texture modeling produces textures with greater completeness, cross-view coherence, and geometric alignment than prior approaches.

What carries the argument

Multi-view synthesis framework with plug-and-play modules for alignment and consistency, feeding into a native 3D texture projection and completion model conditioned on generated views and input geometry.

If this is right

  • Textures cover a larger fraction of each 3D surface without gaps.
  • Appearance remains coherent when the object is viewed from new angles.
  • Texture details align more closely with the input mesh geometry.
  • Generated 3D models show higher visual fidelity and require less post-processing.
  • Quantitative scores on detail, consistency, and alignment metrics exceed those of existing methods (one toy way to operationalize a consistency score is sketched after this list).
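
The abstract names no concrete metrics, so as an editorial placeholder, here is one crude way a cross-view consistency score could be operationalized: compare global color statistics across the generated views, on the assumption that a texture consistent in 3D should look statistically similar from every angle. This is an illustration, not the paper's evaluation protocol.

```python
# Toy cross-view consistency proxy: pairwise total-variation distance
# between RGB histograms of generated views. Illustrative only; not the
# metric used in the paper.
from itertools import combinations

import numpy as np


def rgb_histogram(view, bins=16):
    """view: H x W x 3 float array in [0, 1]; returns a normalized histogram."""
    hist, _ = np.histogramdd(view.reshape(-1, 3), bins=bins, range=[(0, 1)] * 3)
    return hist / hist.sum()


def consistency_score(views):
    """1.0 means identical color statistics across every pair of views."""
    dists = [0.5 * np.abs(rgb_histogram(a) - rgb_histogram(b)).sum()
             for a, b in combinations(views, 2)]
    return 1.0 - float(np.mean(dists))
```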

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same consistency modules could be tested on video input to enforce temporal stability in animated textures.
  • This pipeline might integrate with existing 3D reconstruction tools to texture scanned meshes automatically.
  • Applications in virtual reality could benefit if the method scales to real-time multi-view capture streams.
  • Limitations may appear on highly detailed or transparent surfaces where projection alone cannot recover hidden details.

Load-bearing premise

The added plug-and-play modules on the pre-trained image editing backbone will enforce geometric alignment, cross-view consistency, and uniform illumination without creating new artifacts or needing heavy retraining.
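
The standard mechanism for this kind of extension is a low-rank adapter in the LoRA family: the pretrained weights stay frozen while a small, zero-initialized branch adds trainable capacity. The sketch below shows the generic technique; it is an assumption that the authors' modules work along these lines, and the rank and scaling values are arbitrary.

```python
# Generic LoRA-style adapter on a frozen linear layer. Illustrates the
# premise (plug-in capacity without retraining the backbone); it is not
# the authors' actual module design.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # the backbone stays frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)       # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen path plus a small trainable low-rank correction.
        return self.base(x) + self.scale * self.up(self.down(x))


# Wrap one projection of a frozen backbone; only the adapter trains.
adapted = LoRALinear(nn.Linear(1024, 1024))
out = adapted(torch.randn(2, 1024))          # shape preserved: (2, 1024)
```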

What would settle it

Running the method on a held-out set of complex 3D objects and finding visible texture seams or lighting shifts when rotating the model to unobserved angles would indicate the consistency claims do not hold.
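
A cheap instrumented version of that test is sketched below: sweep a camera around the textured asset and flag abrupt jumps in global luminance between consecutive rendered frames, a coarse proxy for lighting shifts (seams would need a spatial comparison). The frame format and threshold are illustrative assumptions.

```python
# Coarse lighting-shift detector for a camera sweep around a textured
# asset. Assumes frames are H x W x 3 RGB arrays in [0, 1]; the threshold
# is an illustrative choice, not a calibrated value.
import numpy as np

REC709 = np.array([0.2126, 0.7152, 0.0722])  # Rec. 709 luma weights


def mean_luminance(frame):
    """Average luma over one rendered frame."""
    return float((frame @ REC709).mean())


def lighting_shift_indices(frames, threshold=0.05):
    """Indices i where mean luminance jumps between frames i-1 and i."""
    lum = [mean_luminance(f) for f in frames]
    return [i for i in range(1, len(lum))
            if abs(lum[i] - lum[i - 1]) > threshold]
```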

Figures

Figures reproduced from arXiv: 2604.09231 by Heliang Zheng, Huiang He, Hu Zhang, Jianwen Huang, Jiaqi Wu, Jie Li, Pei Tang, Rongfei Jia, Shengchu Zhao, Yukun Li.

Figure 1: High-fidelity 3D textured assets are generated by our framework, Hitem3D 2.0, which leverages multi-view priors and native 3D …
Figure 2: Comparison of multi-view texturing, native texturing …
Figure 3: Overview of the multiview guided 3D native texture …
Figure 4: The framework of multiview generation. In the training stage 1, we fine-tune the image editing model to adapt the distribution …
Figure 5: The framework of 3D native texture generation. We design a dual-branch VAE to reconstruct geometry and texture features …
Figure 6: A gallery of 3D texture assets generated by Hitem3D 2.0.
Figure 7: Ablation study on our proposed components in multi-view generation model.
Figure 8: 3D texture reconstruction results.
Figure 9: Comparison results of 3D texture generation with other commercial models.
Original abstract

Although recent advances have improved the quality of 3D texture generation, existing methods still struggle with incomplete texture coverage, cross-view inconsistency, and misalignment between geometry and texture. To address these limitations, we propose Hitem3D 2.0, a multi-view guided native 3D texture generation framework that enhances texture quality through the integration of 2D multi-view generation priors and native 3D texture representations. Hitem3D 2.0 comprises two key components: a multi-view synthesis framework and a native 3D texture generation model. The multi-view generation is built upon a pre-trained image editing backbone and incorporates plug-and-play modules that explicitly promote geometric alignment, cross-view consistency, and illumination uniformity, thereby enabling the synthesis of high-fidelity multi-view images. Conditioned on the generated views and 3D geometry, the native 3D texture generation model projects multi-view textures onto 3D surfaces while plausibly completing textures in unseen regions. Through the integration of multi-view consistency constraints with native 3D texture modeling, Hitem3D 2.0 significantly improves texture completeness, cross-view coherence, and geometric alignment. Experimental results demonstrate that Hitem3D 2.0 outperforms existing methods in terms of texture detail, fidelity, consistency, coherence, and alignment.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated authors' rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Hitem3D 2.0, a multi-view guided native 3D texture generation framework comprising a multi-view synthesis component built on a pre-trained image editing backbone augmented with plug-and-play modules to enforce geometric alignment, cross-view consistency, and illumination uniformity, plus a native 3D texture model that projects multi-view textures onto 3D geometry while completing unseen regions. It claims this integration yields significant gains in texture completeness, cross-view coherence, and geometric alignment, with experiments showing outperformance over prior methods on texture detail, fidelity, consistency, coherence, and alignment.

Significance. If the central claims are substantiated, the work would be significant for 3D content generation by demonstrating a practical, modular way to inject 2D multi-view priors into native 3D texture modeling without full backbone retraining. This could improve efficiency and quality in graphics pipelines for VR, gaming, and digital content creation. The explicit separation of 2D consistency enforcement from 3D projection is a potentially reusable design pattern.

major comments (2)
  1. [Abstract / multi-view synthesis framework description] The assertion that the plug-and-play modules 'explicitly promote geometric alignment, cross-view consistency, and illumination uniformity' (Abstract) is load-bearing for the claim of reliable enforcement on a frozen pre-trained backbone, yet no architecture diagrams, loss terms, or conditioning mechanisms are supplied to show how these properties are achieved or isolated from the backbone.
  2. [Experimental results] The statement that 'Experimental results demonstrate that Hitem3D 2.0 outperforms existing methods' (Abstract) in multiple dimensions lacks any quantitative metrics, ablation studies isolating the plug-and-play modules, error bars, or dataset specifications; without these, the central outperformance claim cannot be verified and the propagation of 2D-stage shortfalls to the 3D texture model remains untested.
minor comments (1)
  1. [Abstract] The abstract would be strengthened by a single sentence naming the evaluation datasets or benchmarks used to support the outperformance claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments below and will revise the manuscript accordingly to improve clarity and substantiation of our claims.

point-by-point responses
  1. Referee: [Abstract / multi-view synthesis framework description] The assertion that the plug-and-play modules 'explicitly promote geometric alignment, cross-view consistency, and illumination uniformity' (Abstract) is load-bearing for the claim of reliable enforcement on a frozen pre-trained backbone, yet no architecture diagrams, loss terms, or conditioning mechanisms are supplied to show how these properties are achieved or isolated from the backbone.

    Authors: We agree that the abstract alone does not convey the implementation details. The methods section of the manuscript describes the multi-view synthesis framework built on the pre-trained backbone, but to make the enforcement mechanisms explicit, we will add an architecture diagram and the specific loss terms (including geometric consistency and illumination uniformity regularizers) plus conditioning details in the revised version. These plug-and-play modules operate by injecting adapters into the frozen backbone without retraining it. revision: yes

  2. Referee: [Experimental results] The statement that 'Experimental results demonstrate that Hitem3D 2.0 outperforms existing methods' (Abstract) in multiple dimensions lacks any quantitative metrics, ablation studies isolating the plug-and-play modules, error bars, or dataset specifications; without these, the central outperformance claim cannot be verified and the propagation of 2D-stage shortfalls to the 3D texture model remains untested.

    Authors: We acknowledge that the abstract claim requires supporting evidence. In the revised manuscript we will expand the experiments section to report quantitative metrics (e.g., FID, PSNR, consistency scores), ablation studies isolating each plug-and-play module, error bars across multiple runs, and explicit dataset details. We will also add analysis of how improvements in the 2D multi-view stage propagate to the final 3D texture quality. revision: yes
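
Of the metrics the rebuttal commits to, PSNR is simple enough to state exactly; a minimal implementation is below, assuming both images share the same shape and a [0, peak] value range.

```python
# PSNR between a reference image and a rendered one, one of the metrics
# named in the rebuttal. Assumes matching shapes and value ranges.
import numpy as np


def psnr(reference, rendered, peak=1.0):
    """Peak signal-to-noise ratio in dB; higher is better."""
    ref = np.asarray(reference, dtype=np.float64)
    ren = np.asarray(rendered, dtype=np.float64)
    mse = np.mean((ref - ren) ** 2)
    if mse == 0:
        return float("inf")                  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```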

Circularity Check

0 steps flagged

No circularity: compositional framework with no self-referential reductions or fitted predictions

full rationale

The paper presents Hitem3D 2.0 as a composition of a pre-trained image editing backbone augmented with plug-and-play modules for multi-view synthesis, followed by a native 3D texture model conditioned on the outputs. The abstract and description assert improvements in completeness, consistency, and alignment as empirical outcomes of this integration, without any equations, parameter fits, uniqueness theorems, or self-citations that reduce the claimed results to the inputs by construction. No load-bearing step equates a 'prediction' or 'first-principles result' to a fitted quantity or renamed input. The derivation chain remains self-contained as a high-level architectural proposal validated by experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Only the abstract is available, so the ledger is necessarily incomplete. The method relies on the existence of a capable pre-trained 2D image-editing model and on the assumption that small plug-in modules can enforce the desired geometric and consistency properties.

axioms (1)
  • domain assumption: A pre-trained 2D image editing backbone can be extended with plug-and-play modules to enforce geometric alignment, cross-view consistency, and illumination uniformity.
    Invoked in the description of the multi-view synthesis framework.

pith-pipeline@v0.9.0 · 5566 in / 1393 out tokens · 42494 ms · 2026-05-10T18:17:56.123862+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

