arxiv: 2511.18801 · v3 · pith:KN2B6XG2new · submitted 2025-11-24 · 💻 cs.CV

PartDiffuser: Part-wise 3D Mesh Generation via Discrete Diffusion

Pith reviewed 2026-05-21 18:32 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D mesh generation · discrete diffusion · part-wise generation · semi-autoregressive · point cloud to mesh · DiT architecture · semantic segmentation · cross-attention

The pith

PartDiffuser generates 3D meshes from point clouds by autoregressing across semantic parts for global structure while diffusing in parallel inside each part for local details.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes PartDiffuser, a semi-autoregressive diffusion method meant to fix a trade-off in existing autoregressive 3D mesh generators, where global consistency often comes at the cost of lost fine detail and accumulated errors. It first segments the input into semantic parts, then generates the parts in sequence (autoregressively) to lock in overall topology, and denoises all tokens within each part in parallel via discrete diffusion to recover high-frequency geometry. A DiT backbone with part-aware cross-attention takes hierarchical point-cloud conditioning to guide both levels. If the separation works, the resulting meshes should show richer surface detail than prior models while keeping a coherent large-scale shape. This matters for applications that need production-ready 3D assets from raw point-cloud scans.

Core claim

PartDiffuser performs semantic segmentation on the input mesh or point cloud, then uses autoregression between parts to maintain global topology while running a parallel discrete diffusion process inside each semantic part to reconstruct high-frequency geometric features, all inside a DiT architecture equipped with a part-aware cross-attention layer that conditions on hierarchical point-cloud geometry to decouple the global and local tasks.
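
The schedule in this claim can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_denoiser` is a random stand-in for the DiT, `MASK`, `sample_part`, and `generate` are hypothetical names, and the confidence-based unmasking rule is a common choice in masked discrete diffusion that the paper may or may not use. Point-cloud conditioning is omitted.

```python
import random

MASK = -1  # hypothetical id for the absorbing "masked" state


def toy_denoiser(context, tokens, vocab_size, rng):
    # Stand-in for the DiT (assumption): returns {position: (token, confidence)}
    # for every still-masked position. A real model would use `context`
    # (the already generated parts) and the point-cloud conditioning.
    return {i: (rng.randrange(vocab_size), rng.random())
            for i, t in enumerate(tokens) if t == MASK}


def sample_part(context, length, steps, vocab_size, rng):
    # Intra-part discrete diffusion: start fully masked, unmask in parallel.
    tokens = [MASK] * length
    for step in range(steps):
        proposals = toy_denoiser(context, tokens, vocab_size, rng)
        if not proposals:
            break
        # Commit the most confident share of the remaining slots;
        # the last step commits everything that is left.
        k = -(-len(proposals) // (steps - step))  # ceiling division
        keep = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)[:k]
        for i in keep:
            tokens[i] = proposals[i][0]
    return tokens


def generate(part_lengths, steps=4, vocab_size=128, seed=0):
    # Inter-part autoregression: each part is sampled given all earlier parts.
    rng = random.Random(seed)
    context, parts = [], []
    for length in part_lengths:
        part = sample_part(context, length, steps, vocab_size, rng)
        parts.append(part)
        context = context + part
    return parts


parts = generate([6, 3, 5])
```

The outer loop runs once per part, while the inner loop takes a fixed number of denoising steps regardless of part length, which is where the parallelism inside a part comes from.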

What carries the argument

The part-aware cross-attention mechanism inside the DiT backbone that uses hierarchical point-cloud conditioning to dynamically steer generation and separate global topology control from local detail reconstruction.
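
One plausible reading of this mechanism is that a mesh token belonging to part i attends to a memory made of global point-cloud features plus the features of part i only. The sketch below illustrates that reading for a single query vector. It is an assumption, not the paper's layer: learned query/key/value projections and multiple heads are omitted, and `part_aware_cross_attention`, `c_global`, and `c_parts` are hypothetical names.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def part_aware_cross_attention(query, part_id, c_global, c_parts):
    # Memory = global features + features of the query token's own part.
    # Features of other parts are not visible to this token.
    memory = c_global + c_parts[part_id]
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, key) * scale for key in memory])
    dim = len(memory[0])
    # Keys double as values here because projections are omitted (assumption).
    out = [sum(w * v[d] for w, v in zip(weights, memory)) for d in range(dim)]
    return out, weights


c_global = [[1.0, 0.0]]                      # one global feature vector
c_parts = {0: [[0.0, 1.0]],                  # features of part 0
           1: [[0.0, -1.0], [0.5, 0.5]]}     # features of part 1
out, weights = part_aware_cross_attention([0.0, 2.0], 0, c_global, c_parts)
```

Restricting the memory per part is what would let the same layer supply coarse shape cues (global features) and fine cues (part features) without mixing details across parts.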

If this is right

  • Global structural consistency is achieved through autoregressive ordering of semantic parts rather than full-sequence autoregression.
  • High-frequency local details are recovered by parallel discrete diffusion performed independently inside each part.
  • Error accumulation across the entire object is limited because diffusion steps remain local to each semantic region.
  • Meshes exhibit richer surface detail than current state-of-the-art point-cloud-to-mesh generators while preserving overall topology.
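
The first three points imply an attention pattern in which a token sees every token of its own part (parallel diffusion) and every token of earlier parts (autoregression), but nothing from later parts. The sketch below builds such a block-causal mask. It is an assumption modeled on block diffusion [1]; the paper's own composite mask (Figure 3) is not reproduced on this page and may differ, for example during training. `composite_mask` is a hypothetical name.

```python
def composite_mask(part_lengths):
    # part_of[t] is the index of the part that token t belongs to.
    part_of = [p for p, n in enumerate(part_lengths) for _ in range(n)]
    n = len(part_of)
    # mask[q][k] is True when query token q may attend to key token k:
    # same part (bidirectional) or any earlier part (causal across parts).
    return [[part_of[k] <= part_of[q] for k in range(n)] for q in range(n)]


mask = composite_mask([2, 3])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# xx...
# xx...
# xxxxx
# xxxxx
# xxxxx
```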

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The part-wise split could be tested on other conditional generation tasks such as texture synthesis or scene layout where global coherence and local fidelity must both be maintained.
  • If segmentation can be made lightweight and online, the framework might support interactive 3D modeling tools that accept partial point clouds.
  • The same conditioning hierarchy might improve consistency when extending the model to generate textured meshes or animated sequences.

Load-bearing premise

The method assumes that accurate semantic segmentation of the input can be obtained in advance and that the part-aware cross-attention will successfully prevent boundary artifacts when merging the autoregressive inter-part sequence with the parallel intra-part diffusion.

What would settle it

Quantitative results on a held-out test set of complex objects where the method shows no improvement over prior models on detail-sensitive metrics such as normal consistency or edge sharpness, or visual inspection revealing visible seams or loss of geometry at part boundaries in the generated meshes.
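
One of the detail-sensitive metrics named above can be made concrete. This is a minimal sketch, assuming predicted and reference normals have already been paired (for example by nearest-neighbor matching of surface samples, which is omitted here). The function names are hypothetical, and the paper's exact metric definition is not given on this page.

```python
import math


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def normal_consistency(pred_normals, ref_normals):
    # Mean absolute cosine similarity over paired normals.
    # 1.0 means perfectly aligned (up to sign); 0.0 means orthogonal.
    sims = []
    for p, r in zip(pred_normals, ref_normals):
        p_hat, r_hat = normalize(p), normalize(r)
        sims.append(abs(sum(a * b for a, b in zip(p_hat, r_hat))))
    return sum(sims) / len(sims)


score = normal_consistency([[0, 0, 1], [1, 0, 0]], [[0, 0, 2], [-3, 0, 0]])
```

Restricting the paired samples to points near part boundaries would give a boundary-specific variant that speaks directly to the seam question.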

Figures

Figures reproduced from arXiv: 2511.18801 by Baochang Zhang, Guojun Lei, Haodong Zhu, Hong Li, Linin Yang, Sheng Xu, Yichen Yang.

Figure 1. Gallery of our mesh generation results.
Figure 2. An overview of our PartDiffuser framework. The process begins with semantic segmentation of the input point-cloud using …
Figure 3. Visualization of the composite attention mask during the …
Figure 4. Visual comparison of PartDiffuser with Baselines.
Figure 5. An example of the ablation study.
… text, shows a degradation in performance. This indicates that while the global feature is crucial for capturing the holistic shape, it lacks the fine-grained guidance necessary for high-fidelity local detail. Without the specific {C_part_i} features, the model struggles to precisely reconstruct the geometry of individual parts, resulting in lower overall accuracy. Convers…
Figure 6. Mesh with different faces in the dataset.
Figure 7. Visualization of the part-wise sampling process. The process evolves from left to right: (1) The first part being denoised, (2) …
Figure 8. Visual comparison of varying k.
Original abstract

Existing autoregressive (AR) methods for generating artist-designed meshes struggle to balance global structural consistency with high-fidelity local details, and are susceptible to error accumulation. To address this, we propose PartDiffuser, a novel semi-autoregressive diffusion framework for point-cloud-to-mesh generation. The method first performs semantic segmentation on the mesh and then operates in a "part-wise" manner: it employs autoregression between parts to ensure global topology, while utilizing a parallel discrete diffusion process within each semantic part to precisely reconstruct high-frequency geometric features. PartDiffuser is based on the DiT architecture and introduces a part-aware cross-attention mechanism, using point clouds as hierarchical geometric conditioning to dynamically control the generation process, thereby effectively decoupling the global and local generation tasks. Experiments demonstrate that this method significantly outperforms state-of-the-art (SOTA) models in generating 3D meshes with rich detail, exhibiting exceptional detail representation suitable for real-world applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes PartDiffuser, a semi-autoregressive discrete diffusion framework for point-cloud-to-mesh generation. It first performs semantic segmentation on the input, then applies autoregressive generation across parts to maintain global topology while running parallel discrete diffusion within each part to recover high-frequency details. The DiT architecture is extended with a part-aware cross-attention mechanism that conditions generation on hierarchical point clouds, thereby decoupling global and local tasks. The abstract states that experiments show significant outperformance over SOTA models in detail fidelity for real-world applications.

Significance. If the empirical claims are substantiated, the part-wise decomposition could offer a practical way to reconcile global consistency with local geometric fidelity in conditional mesh generation. The combination of autoregressive inter-part modeling and intra-part discrete diffusion, together with cross-attention conditioning, represents an architectural pattern that may influence subsequent work on structured 3D synthesis.

major comments (2)
  1. [Abstract] Abstract: the central claim that the method 'significantly outperforms state-of-the-art (SOTA) models in generating 3D meshes with rich detail' is unsupported by any quantitative metrics, baseline comparisons, error measures, or experimental protocol; without these the outperformance assertion cannot be evaluated and is load-bearing for the paper's contribution.
  2. [Method] Method description (abstract and implied §3): the framework presupposes accurate upfront semantic segmentation and artifact-free boundary handling via part-aware cross-attention, yet no robustness analysis, ablation on segmentation noise, or boundary-specific metrics (e.g., normal consistency or edge error across part interfaces) are reported; these assumptions directly determine whether the global-local decoupling succeeds.
minor comments (2)
  1. [Abstract] Abstract: the term 'semi-autoregressive' is introduced without a concise definition of how the autoregressive inter-part schedule interacts with the parallel intra-part diffusion steps.
  2. [Abstract] Abstract: 'hierarchical point clouds' are mentioned as conditioning input but the construction of the hierarchy (number of levels, sampling strategy) is not specified.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review of our manuscript. We address each of the major comments below and have made revisions to the manuscript where appropriate to strengthen the presentation of our results and analysis.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that the method 'significantly outperforms state-of-the-art (SOTA) models in generating 3D meshes with rich detail' is unsupported by any quantitative metrics, baseline comparisons, error measures, or experimental protocol; without these the outperformance assertion cannot be evaluated and is load-bearing for the paper's contribution.

    Authors: The abstract is intended as a concise overview, and the supporting quantitative evidence—including specific metrics, baseline comparisons, and the experimental setup—is provided in detail in Section 4 of the full manuscript. Nevertheless, we agree that incorporating key quantitative results into the abstract would make the claim more immediately verifiable. We have revised the abstract to include brief references to the performance gains observed in our experiments. revision: yes

  2. Referee: [Method] Method description (abstract and implied §3): the framework presupposes accurate upfront semantic segmentation and artifact-free boundary handling via part-aware cross-attention, yet no robustness analysis, ablation on segmentation noise, or boundary-specific metrics (e.g., normal consistency or edge error across part interfaces) are reported; these assumptions directly determine whether the global-local decoupling succeeds.

    Authors: We recognize the importance of validating the robustness of the part-wise approach to segmentation inaccuracies. The current work assumes high-quality semantic segmentation as input, consistent with many part-based 3D generation methods. To directly address this concern, we have conducted additional experiments and included an ablation study on segmentation noise levels along with boundary-specific metrics in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity: architectural combination with independent experimental claims

Full rationale

The paper introduces PartDiffuser as a new semi-autoregressive framework that combines upfront semantic segmentation, autoregressive inter-part generation for global topology, parallel discrete diffusion within parts for local details, and part-aware cross-attention on hierarchical point clouds. No equations, fitted parameters, or derivation steps appear that reduce any claimed prediction or result to the inputs by construction. The abstract and description frame the approach as an original architectural synthesis rather than a self-referential fit or renamed prior result. Central performance claims rest on experimental outperformance rather than load-bearing self-citations or uniqueness theorems imported from the authors' prior work. This is the common case of a self-contained engineering contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are specified in the abstract; the approach builds on standard DiT and discrete diffusion components from prior literature without introducing new postulated entities.

pith-pipeline@v0.9.0 · 5709 in / 1174 out tokens · 45793 ms · 2026-05-21T18:32:37.904977+00:00 · methodology

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · 7 internal anchors

  1. [1]

    Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models

    Marianne Arriola, Aaron Gokaslan, Justin T Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sekhar Sahoo, and Volodymyr Kuleshov. Block diffusion: Interpolating between autoregressive and diffusion language models. arXiv preprint arXiv:2503.09573, 2025.

  2. [2]

    Structured denoising diffusion models in discrete state-spaces

    Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg. Structured denoising diffusion models in discrete state-spaces. Advances in neural information processing systems, 34:17981–17993, 2021.

  3. [3]

    Partgen: Part-level 3d generation and reconstruction with multi-view diffusion models

    Minghao Chen, Roman Shapovalov, Iro Laina, Tom Monnier, Jianyuan Wang, David Novotny, and Andrea Vedaldi. Partgen: Part-level 3d generation and reconstruction with multi-view diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5881–5892, 2025.

  4. [4]

    Autopartgen: Autoregressive 3d part generation and discovery

    Minghao Chen, Jianyuan Wang, Roman Shapovalov, Tom Monnier, Hyunyoung Jung, Dilin Wang, Rakesh Ranjan, Iro Laina, and Andrea Vedaldi. Autopartgen: Autoregressive 3d part generation and discovery. arXiv preprint arXiv:2507.13346, 2025.

  5. [5]

    Meshxl: Neural coordinate field for generative 3d foundation models

    Sijin Chen, Xin Chen, Anqi Pang, Xianfang Zeng, Wei Cheng, Yijun Fu, Fukun Yin, Billzb Wang, Jingyi Yu, Gang Yu, et al. Meshxl: Neural coordinate field for generative 3d foundation models. Advances in Neural Information Processing Systems, 37:97141–97166, 2024.

  6. [6]

    Meshanything: Artist-created mesh generation with autoregressive transformers

    Yiwen Chen, Tong He, Di Huang, Weicai Ye, Sijin Chen, Jiaxiang Tang, Xin Chen, Zhongang Cai, Lei Yang, Gang Yu, et al. Meshanything: Artist-created mesh generation with autoregressive transformers. arXiv preprint arXiv:2406.10163, 2024.

  7. [7]

    Meshanything v2: Artist-created mesh generation with adjacent mesh tokenization

    Yiwen Chen, Yikai Wang, Yihao Luo, Zhengyi Wang, Zilong Chen, Jun Zhu, Chi Zhang, and Guosheng Lin. Meshanything v2: Artist-created mesh generation with adjacent mesh tokenization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13922–13931, 2025.

  8. [8]

    Objaverse: A universe of annotated 3d objects

    Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13142–13153, 2023.

  9. [9]

    Diffusion models beat gans on image synthesis

    Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.

  10. [10]

    3d-front: 3d furnished rooms with layouts and semantics

    Huan Fu, Bowen Cai, Lin Gao, Ling-Xiao Zhang, Jiaming Wang, Cao Li, Qixun Zeng, Chengyue Sun, Rongfei Jia, Binqiang Zhao, et al. 3d-front: 3d furnished rooms with layouts and semantics. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10933–10942,

  11. [11]

    Memdlm: De novo membrane protein design with masked discrete diffusion protein language models

    Shrey Goel, Vishrut Thoutam, Edgar Mariano Marroquin, Aaron Gokaslan, Arash Firouzbakht, Sophia Vincoff, Volodymyr Kuleshov, Huong T Kratochvil, and Pranam Chatterjee. Memdlm: De novo membrane protein design with masked discrete diffusion protein language models. arXiv preprint arXiv:2410.16735, 2024.

  12. [12]

    Scaling Diffusion Language Models via Adaptation from Autoregressive Models

    Shansan Gong, Shivam Agarwal, Yizhe Zhang, Jiacheng Ye, Lin Zheng, Mukai Li, Chenxin An, Peilin Zhao, Wei Bi, Jiawei Han, et al. Scaling diffusion language models via adaptation from autoregressive models. arXiv preprint arXiv:2410.17891, 2024.

  13. [13]

    Diffucoder: Understanding and improving masked diffusion models for code generation

    Shansan Gong, Ruixiang Zhang, Huangjie Zheng, Jiatao Gu, Navdeep Jaitly, Lingpeng Kong, and Yizhe Zhang. Diffucoder: Understanding and improving masked diffusion models for code generation. arXiv preprint arXiv:2506.20639,

  14. [14]

    Meshtron: High-fidelity, artist-like 3d mesh generation at scale

    Zekun Hao, David W Romero, Tsung-Yi Lin, and Ming-Yu Liu. Meshtron: High-fidelity, artist-like 3d mesh generation at scale. arXiv preprint arXiv:2412.09548, 2024.

  15. [15]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.

  16. [16]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.

  17. [17]

    Habitat synthetic scenes dataset (hssd-200): An analysis of 3d scene scale and realism tradeoffs for objectgoal navigation

    Mukul Khanna, Yongsen Mao, Hanxiao Jiang, Sanjay Haresh, Brennan Shacklett, Dhruv Batra, Alexander Clegg, Eric Undersander, Angel X Chang, and Manolis Savva. Habitat synthetic scenes dataset (hssd-200): An analysis of 3d scene scale and realism tradeoffs for objectgoal navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern...

  18. [18]

    Mercury: Ultra-Fast Language Models Based on Diffusion

    Samar Khanna, Siddhant Kharbanda, Shufan Li, Harshit Varma, Eric Wang, Sawyer Birnbaum, Ziyang Luo, Yanis Miraoui, Akash Palrecha, Stefano Ermon, et al. Mercury: Ultra-fast language models based on diffusion. arXiv preprint arXiv:2506.17298, 2025.

  19. [19]

    Diffusion-lm improves controllable text generation

    Xiang Li, John Thickstun, Ishaan Gulrajani, Percy S Liang, and Tatsunori B Hashimoto. Diffusion-lm improves controllable text generation. Advances in neural information processing systems, 35:4328–4343, 2022.

  20. [20]

    Partcrafter: Structured 3d mesh generation via compositional latent diffusion transformers

    Yuchen Lin, Chenguo Lin, Panwang Pan, Honglei Yan, Yiqiang Feng, Yadong Mu, and Katerina Fragkiadaki. Partcrafter: Structured 3d mesh generation via compositional latent diffusion transformers. arXiv preprint arXiv:2506.05573, 2025.

  21. [21]

    Treemeshgpt: Artistic mesh generation with autoregressive tree sequencing

    Stefan Lionar, Jiabin Liang, and Gim Hee Lee. Treemeshgpt: Artistic mesh generation with autoregressive tree sequencing. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26608–26617, 2025.

  22. [22]

    Part123: part-aware 3d reconstruction from a single-view image

    Anran Liu, Cheng Lin, Yuan Liu, Xiaoxiao Long, Zhiyang Dou, Hao-Xiang Guo, Ping Luo, and Wenping Wang. Part123: part-aware 3d reconstruction from a single-view image. In ACM SIGGRAPH 2024 Conference Papers, pages 1–12, 2024.

  23. [23]

    Partfield: Learning 3d feature fields for part segmentation and beyond

    Minghua Liu, Mikaela Angelina Uy, Donglai Xiang, Hao Su, Sanja Fidler, Nicholas Sharp, and Jun Gao. Partfield: Learning 3d feature fields for part segmentation and beyond. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9704–9715, 2025.

  24. [24]

    Wonder3d: Single image to 3d using cross-domain diffusion

    Xiaoxiao Long, Yuan-Chen Guo, Cheng Lin, Yuan Liu, Zhiyang Dou, Lingjie Liu, Yuexin Ma, Song-Hai Zhang, Marc Habermann, Christian Theobalt, et al. Wonder3d: Single image to 3d using cross-domain diffusion. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9970–9980, 2024.

  25. [25]

    Marching cubes: A high resolution 3d surface construction algorithm

    William E Lorensen and Harvey E Cline. Marching cubes: A high resolution 3d surface construction algorithm. In Seminal graphics: pioneering efforts that shaped the field, pages 347–353. 1998.

  26. [26]

    Nerf: Representing scenes as neural radiance fields for view synthesis

    Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.

  27. [27]

    Large Language Diffusion Models

    Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models. arXiv preprint arXiv:2502.09992, 2025.

  28. [28]

    Deepsdf: Learning continuous signed distance functions for shape representation

    Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. Deepsdf: Learning continuous signed distance functions for shape representation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 165–174, 2019.

  29. [29]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205,

  30. [30]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.

  31. [31]

    Meshgpt: Generating triangle meshes with decoder-only transformers

    Yawar Siddiqui, Antonio Alliegro, Alexey Artemov, Tatiana Tommasi, Daniele Sirigatti, Vladislav Rosov, Angela Dai, and Matthias Nießner. Meshgpt: Generating triangle meshes with decoder-only transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19615–19625, 2024.

  32. [32]

    Topology sculptor, shape refiner: Discrete diffusion model for high-fidelity 3d meshes generation

    Kaiyu Song, Hanjiang Lai, Yaqing Zhang, Chuangjian Cai, Yan Pan, Kun Yue, and Jian Yin. Topology sculptor, shape refiner: Discrete diffusion model for high-fidelity 3d meshes generation. arXiv preprint arXiv:2510.21264, 2025.

  33. [33]

    Efficient part-level 3d object generation via dual volume packing

    Jiaxiang Tang, Ruijie Lu, Zhaoshuo Li, Zekun Hao, Xuan Li, Fangyin Wei, Shuran Song, Gang Zeng, Ming-Yu Liu, and Tsung-Yi Lin. Efficient part-level 3d object generation via dual volume packing. arXiv preprint arXiv:2506.09980,

  34. [34]

    Dplm-2: A multimodal diffusion protein language model

    Xinyou Wang, Zaixiang Zheng, Fei Ye, Dongyu Xue, Shujian Huang, and Quanquan Gu. Dplm-2: A multimodal diffusion protein language model. arXiv preprint arXiv:2410.13782, 2024.

  35. [35]

    LLaMA-Mesh: Unifying 3d mesh generation with language models

    Zhengyi Wang, Jonathan Lorraine, Yikai Wang, Hang Su, Jun Zhu, Sanja Fidler, and Xiaohui Zeng. Llama-mesh: Unifying 3d mesh generation with language models. arXiv preprint arXiv:2411.09595, 2024.

  36. [36]

    Scaling mesh generation via compressive tokenization

    Haohan Weng, Zibo Zhao, Biwen Lei, Xianghui Yang, Jian Liu, Zeqiang Lai, Zhuo Chen, Yuhong Liu, Jie Jiang, Chunchao Guo, et al. Scaling mesh generation via compressive tokenization. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 11093–11103, 2025.

  37. [37]

    Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding

    Chengyue Wu, Hao Zhang, Shuchen Xue, Zhijian Liu, Shizhe Diao, Ligeng Zhu, Ping Luo, Song Han, and Enze Xie. Fast-dllm: Training-free acceleration of diffusion llm by enabling kv cache and parallel decoding. arXiv preprint arXiv:2505.22618, 2025.

  38. [38]

    Structured 3d latents for scalable and versatile 3d generation

    Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, and Jiaolong Yang. Structured 3d latents for scalable and versatile 3d generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21469–21480, 2025.

  39. [39]

    Frankenstein: Generating semantic-compositional 3d scenes in one tri-plane

    Han Yan, Yang Li, Zhennan Wu, Shenzhou Chen, Weixuan Sun, Taizhang Shang, Weizhe Liu, Tian Chen, Xiaqiang Dai, Chao Ma, et al. Frankenstein: Generating semantic-compositional 3d scenes in one tri-plane. In SIGGRAPH Asia 2024 Conference Papers, pages 1–11, 2024.

  40. [40]

    Phycage: Physically plausible compositional 3d asset generation from a single image

    Han Yan, Mingrui Zhang, Yang Li, Chao Ma, and Pan Ji. Phycage: Physically plausible compositional 3d asset generation from a single image. arXiv preprint arXiv:2411.18548,

  41. [41]

    Holopart: Generative 3d part amodal segmentation

    Yunhan Yang, Yuan-Chen Guo, Yukun Huang, Zi-Xin Zou, Zhipeng Yu, Yangguang Li, Yan-Pei Cao, and Xihui Liu. Holopart: Generative 3d part amodal segmentation. arXiv preprint arXiv:2504.07943, 2025.

  42. [42]

    Dream 7B: Diffusion Large Language Models

    Jiacheng Ye, Zhihui Xie, Lin Zheng, Jiahui Gao, Zirui Wu, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Dream 7b: Diffusion large language models. arXiv preprint arXiv:2508.15487, 2025.

  43. [43]

    Deepmesh: Auto-regressive artist-mesh creation with reinforcement learning

    Ruowen Zhao, Junliang Ye, Zhengyi Wang, Guangce Liu, Yiwen Chen, Yikai Wang, and Jun Zhu. Deepmesh: Auto-regressive artist-mesh creation with reinforcement learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10612–10623, 2025.

  44. [44]

    Michelangelo: Conditional 3d shape generation based on shape-image-text aligned latent representation

    Zibo Zhao, Wen Liu, Xin Chen, Xianfang Zeng, Rui Wang, Pei Cheng, Bin Fu, Tao Chen, Gang Yu, and Shenghua Gao. Michelangelo: Conditional 3d shape generation based on shape-image-text aligned latent representation. Advances in neural information processing systems, 36:73969–73982,

  45. [45]

    Supplementary Material, A. Dataset Construction

    As a supplement to the dataset introduction in the main text, we provide a detailed description of the dataset construction process. We utilize Objaverse [8] and 3D-Front [10] as our primary data sources. The data preprocessing pipeline co...