
arxiv: 2606.23514 · v1 · pith:W3CL47ACnew · submitted 2026-06-22 · 💻 cs.CV · cs.GR

Arbor: Explicit Geometric Conditioning for Controllable 3D Asset Generation

Pith reviewed 2026-06-26 08:46 UTC · model grok-4.3

classification 💻 cs.CV cs.GR
keywords 3D asset generation · geometric conditioning · constraint meshes · latent diffusion · controllable generation · text-to-3D · hull avoidance touch

The pith

Arbor adds explicit geometric control to text-to-3D models by routing constraint mesh tokens into a frozen denoiser.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Arbor to give text-conditioned 3D generators a direct spatial control interface using constraint meshes. These meshes define hull regions where geometry must exist, avoidance regions that must stay empty, and touch regions that must make contact. The meshes are turned into tokens and attached through routing inside the frozen denoiser so each part of the latent space receives only the relevant local constraint. This produces higher obedience to the constraints while keeping object quality and output variation, all without adding compliance losses or retraining the base model.

Core claim

Arbor is a trainable attachment for text-conditioned latent 3D generation that treats constraint meshes as a native 3D control interface. The interface supplies hull regions where geometry should exist, avoidance regions that should remain empty, and touch regions the object should contact. Constraint meshes are converted into tokens and integrated via a routed attachment inside a frozen denoiser so each latent region receives only the constraint portion that applies to its spatial location. Even without dedicated compliance losses, this improves constraint obedience while preserving object quality and variation under fixed constraints.
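To make the three region types concrete, the following is a minimal sketch of how obedience to hull, avoidance, and touch regions could be scored on a voxel grid. The voxel-set representation, the function name `compliance`, and all three metric definitions are illustrative assumptions; the text above does not give the paper's benchmark definitions.

```python
from typing import Dict, Set, Tuple

Voxel = Tuple[int, int, int]

# A voxel itself plus its six face neighbours (assumed contact neighbourhood).
NEIGHBOURS = [(0, 0, 0), (1, 0, 0), (-1, 0, 0),
              (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def _touched(voxel: Voxel, generated: Set[Voxel]) -> bool:
    """True if the generated shape occupies the voxel or a face neighbour."""
    x, y, z = voxel
    return any((x + dx, y + dy, z + dz) in generated for dx, dy, dz in NEIGHBOURS)


def compliance(generated: Set[Voxel], hull: Set[Voxel],
               avoid: Set[Voxel], touch: Set[Voxel]) -> Dict[str, float]:
    """Score an occupancy set against typed constraint regions.

    Hypothetical definitions, not the paper's:
      hull_hit   = fraction of hull voxels that are occupied
      avoid_viol = fraction of avoidance voxels that are occupied
      touch_hit  = fraction of touch voxels that are contacted
    An empty region counts as trivially satisfied.
    """
    hull_hit = len(hull & generated) / len(hull) if hull else 1.0
    avoid_viol = len(avoid & generated) / len(avoid) if avoid else 0.0
    touch_hit = (sum(_touched(v, generated) for v in touch) / len(touch)
                 if touch else 1.0)
    return {"hull_hit": hull_hit, "avoid_viol": avoid_viol, "touch_hit": touch_hit}


# Usage with a toy three-voxel shape.
shape = {(0, 0, 0), (1, 0, 0), (2, 0, 0)}
scores = compliance(shape,
                    hull={(0, 0, 0), (1, 0, 0), (3, 0, 0)},
                    avoid={(2, 0, 0), (5, 5, 5)},
                    touch={(3, 0, 0)})
```

Note that the avoidance term can only be scored because the region is typed: an occupied avoidance voxel is a violation, whereas an occupied hull voxel is a hit. This is the sense in which the meshes are requirements rather than target geometry.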

What carries the argument

The routed attachment that converts constraint meshes into tokens and injects them locally into the frozen denoiser.
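One way to picture "injecting locally" is a spatial routing table that hands each group of latent queries only the constraint tokens near it. The sketch below assumes a nearest-neighbour rule with a distance cutoff; the function name `route_tokens`, the `radius` and `k` parameters, and the rule itself are hypothetical, since the summary does not describe the paper's routing architecture.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]


def route_tokens(group_centres: List[Point], token_positions: List[Point],
                 radius: float, k: int) -> List[List[int]]:
    """For each latent query group, list the indices of at most k constraint
    tokens within `radius` of the group centre, nearest first.

    Assumed routing rule for illustration only.
    """
    routes = []
    for centre in group_centres:
        dists = [(math.dist(centre, pos), idx)
                 for idx, pos in enumerate(token_positions)]
        nearby = sorted(d for d in dists if d[0] <= radius)[:k]
        routes.append([idx for _, idx in nearby])
    return routes


# Usage: two query groups, four constraint tokens.
centres = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
tokens = [(1.0, 0.0, 0.0), (9.0, 0.0, 0.0), (0.0, 2.0, 0.0), (50.0, 0.0, 0.0)]
routes = route_tokens(centres, tokens, radius=3.0, k=2)
# routes == [[0, 2], [1]]; the far-away token 3 reaches no group.
```

A purely local rule like this leaves distant tokens unseen by every group. The Figure 2 caption mentions learned global summaries alongside local routing, which this sketch omits.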

If this is right

  • Text-to-3D models can respect explicit spatial requirements such as fitting inside envelopes or leaving clearance for motion.
  • Output variation remains high even when the same constraint meshes are applied.
  • Constraint obedience rises on both automatic and artist-curated benchmarks for hull, avoidance, and touch constraints.
  • Metric gains align with human preference ratings for the controlled outputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same token-routing attachment could be tested on other latent generative backbones such as video or image models.
  • Artists could iterate on 3D assets by successively adding or editing constraint meshes without restarting generation.
  • The approach might reduce reliance on post-processing or manual cleanup in production asset pipelines.

Load-bearing premise

Constraint meshes can be converted into tokens and integrated via a routed attachment inside a frozen denoiser to locally influence generation without degrading base model performance or requiring additional compliance signals.

What would settle it

On the automatic and artist-curated control benchmarks, constraint obedience metrics show no improvement when the routed attachment is added compared with the unmodified base denoiser.
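The falsification test above amounts to a paired comparison: score the same prompts and constraints with and without the attachment, then check whether the gain is distinguishable from zero. Below is a minimal sketch using a percentile bootstrap; the function name `paired_bootstrap`, the 95% level, and the numbers in the usage lines are placeholders, not the paper's protocol or results.

```python
import random
from statistics import mean
from typing import List, Tuple


def paired_bootstrap(base: List[float], attached: List[float],
                     n_resamples: int = 2000, seed: int = 0
                     ) -> Tuple[float, float, float]:
    """Return (mean gain, lower, upper) for attached - base on paired samples.

    lower/upper are a 95% percentile bootstrap interval over resampled pairs.
    """
    if len(base) != len(attached) or not base:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [a - b for a, b in zip(attached, base)]
    rng = random.Random(seed)
    resampled = sorted(mean(rng.choices(diffs, k=len(diffs)))
                       for _ in range(n_resamples))
    lower = resampled[int(0.025 * n_resamples)]
    upper = resampled[int(0.975 * n_resamples) - 1]
    return mean(diffs), lower, upper


# Placeholder obedience scores for four benchmark cases (not real data).
base_scores = [0.20, 0.30, 0.25, 0.40]
attached_scores = [0.50, 0.60, 0.55, 0.70]
gain, lower, upper = paired_bootstrap(base_scores, attached_scores)
```

Under this reading, the claim would be in trouble if the interval for the gain included zero or sat below it on both the automatic and the artist-curated benchmarks.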

Figures

Figures reproduced from arXiv: 2606.23514 by Andreas Engelhardt, Hendrik P.A. Lensch, Jan-Niklas Dihlmann, Mark Boss, Simon Donne.

Figure 1
Arbor overview. Arbor turns simple 3D control objects into an explicit constraint signal for text-conditioned 3D generation. Hull regions mark where generated geometry should exist, touch regions mark contact patches, and avoidance regions mark free space that should remain empty. This enables artists to co-author the generation process, making asset generation more reliable and therefore more likely to be …
Figure 2
Constraint conditioning pipeline. Arbor converts a typed constraint object into TRELLIS.2 OVoxels, encodes geometry and signal attributes with frozen encoders (Sec. 3.2), aligns the resulting latents into geometry tokens, and routes those tokens into the TRELLIS sparse structure denoiser (Sec. 3.3). Local routing gives each query group the nearby constraint evidence it needs, while learned global summaries…
Figure 3
Controlled generation comparison. Each column shows one prompt and constraint object. The constraint is rendered as normal shaded geometry with signal regions colored hull, touch, and avoidance. Rows compare predictions and their constraint following. Here, green indicates a hull match and blue indicates missing hull. Arbor keeps readable objects while following local roles.
Models. Arbor is the model from…
Figure 4
Constraint sweeps. The prompt is fixed and the constraint region is continuously moved, scaled, or rotated. Arbor follows the deformation without snapping to a small set of canonical layouts.
Figure 5
Variation under a fixed constraint. Each block keeps the hull fixed and varies the seed; image conditioned baselines also fix the input image. Arbor changes details and proportions across seeds while still satisfying the constraint, where the image anchored methods stay close to their input.
Method | Var.↑ | Ctrl. Scr.↑ | Hull Hit↑ | Avoid Viol.↓ | Vol. Match↑ | MV-CLIP↑
Arbor | 0.740 | 0.361 ± 0.010 | 0.707 ± 0.044 | 0.016 ±…
Figure 6
Automatic constraint families used by Arbor. The figure shows the concrete generators that make up Arbor’s typed constraint program. Green columns are positive hull families, yellow columns are touch/contact families, and red columns are avoidance families. Each column lists the family intent at the top and example outputs on several objects below. These families are sampled online during training, while b…
Figure 7
Additional Arbor results on selected Toys4K constraints. Showing manual and automatic benchmark cases.
In practice, this variant was not the best solution. It does improve direct constraint pressure, but it also moves the model toward a failure mode where following the constraint becomes easier than generating a plausible object from the prompt. This is close to the behavior seen in SpaceControl, where the…
Figure 8
Extended constraint sweeps. Additional sweep selections using the same rendering language as …
Original abstract

Text and image conditioned 3D models now generate convincing assets, but they still offer little direct control over the space an object should occupy or avoid. In authoring, this spatial intent is often known before generation starts. A chair should fit a seating envelope, a prop should leave clearance for motion, or a part should expose a contact surface. Prompts and image views are poor carriers for such constraints, requiring the need for an explicit control interface. We present Arbor, a trainable attachment for text conditioned latent 3D generation. Arbor introduces constraint meshes as a native 3D control interface. The interface uses hull regions where geometry should exist, avoidance regions that should remain empty, and touch regions the object should contact. Unlike completion or whole object scaffold control, these meshes are not target evidence. They are local typed requirements and can include regions where no surface should appear. Arbor keeps this signal as geometry by converting constraint meshes into tokens and learning a routed attachment inside a frozen denoiser. Each latent region can therefore receive the part of the constraint that matters for its spatial location. We evaluate Arbor on automatic and artist curated control benchmarks with hull, avoidance, and touch constraints, and compare the metric trends to a user preference study. Even without dedicated compliance losses, Arbor improves constraint obedience while preserving object quality and variation under fixed constraints.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper presents Arbor, a trainable attachment for text-conditioned latent 3D generation. Arbor converts constraint meshes (hull regions where geometry should exist, avoidance regions that should remain empty, and touch regions the object should contact) into tokens and integrates them via a routed attachment inside a frozen denoiser, allowing each latent region to receive the relevant part of the constraint. The central claim is that, even without dedicated compliance losses, this yields improved constraint obedience on hull/avoidance/touch benchmarks while preserving object quality and variation, with supporting evidence from automatic/artist-curated benchmarks and a user preference study.

Significance. If the experimental claims hold, the work would supply a practical, geometry-native control interface that addresses a clear limitation in current 3D generative models, where prompts and images are poor carriers of spatial intent. The additive, frozen-denoiser design is a notable strength if it demonstrably avoids the need for compliance losses while maintaining base-model performance.

major comments (1)
  1. [Abstract] Abstract and provided manuscript text: the central claim of improved constraint obedience rests on metric trends from automatic and artist-curated benchmarks plus a user study, yet the text supplies no method equations, tokenization procedure, routing architecture, benchmark definitions, quantitative tables, error bars, or ablation results. Without these, the support for the claim that the routed attachment produces measurable gains cannot be verified.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review. The concern raised is that the abstract and provided text lack supporting technical details and results. The full manuscript contains dedicated sections addressing these elements; we address the point below and note that no changes to the core claims or experiments are required.

Point-by-point responses
  1. Referee: [Abstract] Abstract and provided manuscript text: the central claim of improved constraint obedience rests on metric trends from automatic and artist-curated benchmarks plus a user study, yet the text supplies no method equations, tokenization procedure, routing architecture, benchmark definitions, quantitative tables, error bars, or ablation results. Without these, the support for the claim that the routed attachment produces measurable gains cannot be verified.

    Authors: The full manuscript (beyond the abstract) includes: (1) Section 3 with the tokenization procedure for constraint meshes, the routed attachment architecture, and all relevant equations for the conditioning mechanism inside the frozen denoiser; (2) Section 4 with explicit definitions of the hull, avoidance, and touch benchmarks (both automatic and artist-curated); (3) Section 5 with quantitative tables reporting metric trends, error bars from multiple runs, and the user preference study results; and (4) Section 5.3 with ablation studies isolating the contribution of the routed attachment. These sections directly support the central claim. The abstract is intentionally concise and summarizes rather than reproduces the full technical content. We can add explicit section references to the abstract in a revision if the referee finds that helpful for navigation.

    Revision: partial

Circularity Check

0 steps flagged

No circularity: method is an additive trainable attachment evaluated on external benchmarks

full rationale

The paper presents Arbor as a new trainable routed attachment that converts constraint meshes to tokens and integrates them locally inside a frozen denoiser. The central claim of improved constraint obedience is supported by evaluation on automatic and artist-curated benchmarks with hull/avoidance/touch constraints, plus a user preference study. No equations, predictions, or first-principles results are shown that reduce by construction to fitted parameters or self-citations; the approach is additive and externally benchmarked rather than self-referential.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Based solely on the abstract, the central claim rests on the assumption that a frozen denoiser can incorporate the attachment and that constraint meshes function as local typed requirements rather than targets. No free parameters are explicitly named. One invented entity is the constraint mesh interface.

axioms (1)
  • domain assumption: A text-conditioned latent 3D denoiser can remain frozen while a trainable routed attachment processes constraint tokens.
    Stated in the description of Arbor as a trainable attachment inside a frozen denoiser.
invented entities (1)
  • constraint meshes (hull, avoidance, touch regions) · no independent evidence
    purpose: Provide explicit local 3D control signals that are not target geometry.
    Introduced as the core native interface; no independent evidence outside the paper is mentioned.
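
The domain assumption in the ledger can be illustrated with a toy parameter split in which the base weight never changes and only the attachment learns. The gated residual with a gate starting at zero is borrowed from ControlNet-style adapters as an assumption; the abstract does not say how Arbor initialises or merges its attachment, and all parameter names here are hypothetical.

```python
from typing import Dict, List

FROZEN = {"w_base"}  # hypothetical name for the frozen denoiser weight


def forward(latent: List[float], constraint: List[float],
            params: Dict[str, float]) -> List[float]:
    """Frozen base path plus a gated residual from the attachment."""
    base = [params["w_base"] * v for v in latent]
    delta = [params["gate"] * params["w_attach"] * c for c in constraint]
    return [b + d for b, d in zip(base, delta)]


def sgd_step(params: Dict[str, float], grads: Dict[str, float],
             lr: float = 0.1) -> Dict[str, float]:
    """Apply a gradient step to every parameter except those in FROZEN."""
    return {name: value if name in FROZEN else value - lr * grads.get(name, 0.0)
            for name, value in params.items()}


params = {"w_base": 2.0, "w_attach": 0.5, "gate": 0.0}
# With the gate at zero the output equals the base model's output.
out = forward([1.0, 2.0], [3.0, 4.0], params)
# Gradients for the frozen weight are ignored by the update.
params = sgd_step(params, {"w_base": 1.0, "w_attach": 1.0, "gate": -1.0})
```

In this toy setting the base behaviour is preserved exactly at initialisation, which is one way an attachment could avoid degrading the base model before it has learned anything.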


