
arXiv: 2605.21572 · v1 · pith:43UC2BMQ · submitted 2026-05-20 · 💻 cs.CV · cs.RO

PhysX-Omni: Unified Simulation-Ready Physical 3D Generation for Rigid, Deformable, and Articulated Objects

Pith reviewed 2026-05-22 09:35 UTC · model grok-4.3

classification 💻 cs.CV cs.RO
keywords 3D generation, simulation-ready assets, vision-language models, rigid objects, deformable objects, articulated objects, physical simulation, embodied AI

The pith

PhysX-Omni generates simulation-ready 3D models for rigid, deformable, and articulated objects with one unified framework.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PhysX-Omni to create 3D assets that function directly in physical simulations across rigid, deformable, and articulated object types. It develops a geometry representation for vision-language models that, per the abstract, encodes high-resolution 3D structures directly without a compression step, which the authors report improves generation performance. The authors assembled PhysXVerse, which they describe as the first general simulation-ready 3D dataset, covering diverse indoor and outdoor categories. They also propose PhysX-Bench, which evaluates six attributes: geometry, absolute scale, material, affordance, kinematics, and function description. Experiments are reported to show strong performance in both generation and understanding, with further studies pointing to uses in simulation-ready scene generation and robotic policy learning.

Core claim

PhysX-Omni is a unified framework for simulation-ready physical 3D generation across rigid, deformable, and articulated objects. It rests on a novel geometry representation tailored for vision-language models that directly encodes high-resolution 3D structures without compression, thereby improving generation performance. The framework is trained on the new PhysXVerse dataset and evaluated with PhysX-Bench on six attributes that test both generative and understanding capabilities.

What carries the argument

The novel and efficient geometry representation tailored for Vision-Language Models, which directly encodes high-resolution 3D structures without compression and improves generation performance across asset categories.

If this is right

  • Generation quality rises across rigid, deformable, and articulated asset categories.
  • The framework supports creation of complete simulation-ready indoor and outdoor scenes.
  • Trained policies for robotic tasks benefit from the availability of accurate physical 3D assets.
  • PhysX-Bench supplies a consistent way to compare both generative and understanding performance on physical properties.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same encoding approach could be tested on larger outdoor scenes with multiple interacting objects to check scalability.
  • Combining the generated assets with existing physics engines may speed up training loops for embodied agents.
  • The six-attribute benchmark could become a standard test set for other 3D generation methods that target simulation use.
  • Future extensions might add real-time material response or contact-rich interaction data to the dataset.

Load-bearing premise

The novel geometry representation tailored for Vision-Language Models directly encodes high-resolution 3D structures without compression and thereby significantly improves generation performance across asset categories.

What would settle it

Retraining the vision-language model on PhysXVerse but replacing the direct high-resolution encoding with a compressed alternative and measuring whether scores on PhysX-Bench geometry and kinematics attributes fall sharply.
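The comparison described above can be sketched as a small helper. This is a minimal sketch under stated assumptions, not the paper's evaluation code: the function names, the dictionary layout, the "higher is better" score convention, and the `threshold` that defines a "sharp" fall are all hypothetical and would need to be set by whoever runs the ablation.

```python
from typing import Dict, Iterable


def score_drops(direct: Dict[str, float], compressed: Dict[str, float]) -> Dict[str, float]:
    """Per-attribute drop (direct minus compressed) for attributes scored in both runs.

    Assumes higher scores are better, so a positive drop means the
    compressed-encoding variant did worse.
    """
    return {name: direct[name] - compressed[name] for name in direct if name in compressed}


def falls_sharply(
    direct: Dict[str, float],
    compressed: Dict[str, float],
    threshold: float,
    watched: Iterable[str] = ("geometry", "kinematics"),
) -> bool:
    """True if every watched attribute drops by at least `threshold` points.

    `threshold` is a hypothetical cut-off chosen by the experimenter; the
    source text does not define what counts as a sharp fall.
    Raises KeyError if a watched attribute is missing from either run.
    """
    drops = score_drops(direct, compressed)
    return all(drops[name] >= threshold for name in watched)
```

The two dictionaries would hold PhysX-Bench scores from the direct-encoding run and the compressed-encoding run, keyed by attribute name. Fixing the threshold before running the ablation avoids choosing it after seeing the results.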

Figures

Figures reproduced from arXiv: 2605.21572 by Fangzhou Hong, Haitian Li, Liang Pan, Runmao Yao, Yinghao Liu, Zhaoxi Chen, Ziang Cao, Ziwei Liu.

Figure 1: By exploiting the high diversity of PhysXVerse, PhysX-Omni is capable of generating detailed and general 3D assets covering rigid, deformable, and articulated objects, producing simulation-ready physical assets suitable for downstream applications.
Figure 2: Given a single complete or partially occluded image, PhysX-Omni first infers high-level overall information. It then employs a multi-turn generation process to produce detailed part-level geometry. Owing to the inherent alignment between global and local representations, these outputs can be directly integrated into simulation-ready physical 3D assets.
Figure 3: (a). Comparison of different geometry representations for 3D modeling.
Figure 4: Statistics and distribution of PhysXVerse.
Figure 5: Overview of PhysX-Bench. It consists of six key dimensions for comprehensively evaluating 3D structure, appearance, fundamental physical attributes, and understanding.
… over 2.9K categories, covering a wide range of object types, such as indoor furniture, unmanned aerial vehicles, robots, vehicles, and large-scale scene components. …
Figure 6: Qualitative results. Compared with existing generative methods, our PhysX-Omni demonstrates impressive performance in generating complex geometries and rich physical attributes.
… simulation-ready physical 3D assets spanning diverse indoor and outdoor categories. The dataset covers rigid, articulated, and deformable objects with rich geometric structures and physical attributes. …
Figure 7: Left: Comparison of our PhysX-Omni with other methods.
Figure 8: More qualitative results of PhysX-Omni. Additional results further demonstrate the robust generative performance of our method in complex scenarios.
… PhysX-Omni achieves a kinematic score of 80.72, significantly outperforming PhysX-Anything (65.99), PhysXGen (69.17), MonoArt (68.32), and Articulate-Anything (71.25). …
Figure 9: Visualization of the generated deformable objects.
Figure 10: Visualization of model using different geometry representations.
Figure 11: Robot Manipulation on our Generated sim-ready 3D assets.
Figure 12: Applications of our PhysX-Omni. We explore the potential applications of PhysX-Omni in sim-ready scene generation.
4.6 Validating human alignment of PhysX-Bench. To validate that PhysX-Bench can effectively reflect human perception and evaluation preferences, we further study the correlation between the automatic evaluation results produced by PhysX-Bench and human annotations. …
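
The kinematic scores quoted in the Figure 8 excerpt can be turned into margins with a few lines of arithmetic. The sketch below uses only the five numbers that appear in that excerpt. The helper name `margins` and the choice to report both absolute and relative gaps are illustrative assumptions, and the score scale is taken at face value from the excerpt.

```python
from typing import Dict, Tuple

# Kinematic scores exactly as quoted in the Figure 8 excerpt.
KINEMATIC_SCORES: Dict[str, float] = {
    "PhysX-Omni": 80.72,
    "PhysX-Anything": 65.99,
    "PhysXGen": 69.17,
    "MonoArt": 68.32,
    "Articulate-Anything": 71.25,
}


def margins(scores: Dict[str, float], ours: str = "PhysX-Omni") -> Dict[str, Tuple[float, float]]:
    """Map each other method to (absolute gap in points, relative gap in percent)."""
    ref = scores[ours]
    return {
        name: (round(ref - score, 2), round(100.0 * (ref - score) / score, 1))
        for name, score in scores.items()
        if name != ours
    }


if __name__ == "__main__":
    for name, (points, percent) in margins(KINEMATIC_SCORES).items():
        print(f"{name}: +{points} points ({percent}% relative)")
```

On these numbers the gap is narrowest against Articulate-Anything (9.47 points) and widest against PhysX-Anything (14.73 points). The excerpt gives no variance or number of runs, so these are point differences only.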
Original abstract

Simulation-ready physical 3D assets have emerged as a promising direction owing to their broad applicability in downstream tasks. However, most existing 3D generation methods either neglect physical properties or are limited to a single asset category, e.g., rigid, deformable, or articulated objects. To address these limitations, we introduce PhysX-Omni, a unified framework for simulation-ready physical 3D generation across diverse asset types. Specifically, we develop a novel and efficient geometry representation tailored for Vision-Language Models, which directly encodes high-resolution 3D structures without compression, significantly improving generation performance. In addition, we construct the first general simulation-ready 3D dataset, PhysXVerse, covering diverse indoor and outdoor categories. Furthermore, to comprehensively and flexibly evaluate both generative and understanding capabilities in the wild, we propose PhysX-Bench, which encompasses six key attributes: geometry, absolute scale, material, affordance, kinematics, and function description. Extensive experiments with conventional metrics and PhysX-Bench show that PhysX-Omni performs strongly in both generation and understanding. Moreover, additional studies further validate the potential of PhysX-Omni for applications in simulation-ready scene generation and robotic policy learning. We believe PhysX-Omni can significantly advance a wide range of downstream applications, particularly in embodied AI and physics-based simulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces PhysX-Omni, a unified framework for generating simulation-ready physical 3D assets across rigid, deformable, and articulated object categories. It proposes a novel geometry representation tailored for Vision-Language Models that directly encodes high-resolution 3D structures without compression to improve generation performance, constructs the PhysXVerse dataset covering diverse indoor and outdoor scenes, and introduces PhysX-Bench to evaluate six attributes (geometry, absolute scale, material, affordance, kinematics, and function description). Experiments using conventional metrics and PhysX-Bench are reported to show strong results in both generation and understanding, with additional validation for simulation-ready scene generation and robotic policy learning.

Significance. If the geometry representation can be shown to achieve the claimed high-resolution encoding without implicit compression or downsampling and if the associated datasets and benchmarks are released, the work would offer a meaningful step toward unified physical 3D generation. This could support broader use in embodied AI and physics simulation by addressing the current fragmentation across object categories.

major comments (1)
  1. Abstract: The performance improvements across asset categories are attributed to the 'novel and efficient geometry representation tailored for Vision-Language Models, which directly encodes high-resolution 3D structures without compression'. This is load-bearing for the central claim of a unified framework. Standard VLMs operate under token or patch limits (typically 4k–32k tokens or fixed-resolution inputs). An uncompressed high-resolution 3D structure (dense voxels, full meshes, or dense point clouds) would exceed these limits unless an implicit downsampling, sparse coding, learned latent projection, or 2D rendering step is used. The methods section must explicitly describe the input encoding, sequence length, and any projection mechanism; absent this detail, the contribution of the representation cannot be isolated from dataset or training effects.
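
The token-budget concern in the major comment can be checked with back-of-envelope arithmetic. The sketch below is illustrative only: the grid resolutions are hypothetical, the budgets 4,096 and 32,768 are one reading of the "4k–32k tokens" range in the comment, and the paper's actual representation is not described in the available text, so nothing here models it.

```python
def dense_voxel_cells(resolution: int) -> int:
    """Number of cells in a dense cubic grid with `resolution` cells per axis."""
    return resolution ** 3


def min_cells_per_token(num_cells: int, context_tokens: int) -> int:
    """Smallest whole number of cells each token must carry for the grid to fit.

    A result of 1 means one cell per token already fits; anything larger means
    some packing, sparsity, or projection is needed to stay within the budget.
    """
    return -(-num_cells // context_tokens)  # ceiling division on integers


if __name__ == "__main__":
    # Hypothetical resolutions and budgets, chosen only to illustrate the scaling.
    for resolution in (16, 32, 64, 128):
        cells = dense_voxel_cells(resolution)
        for budget in (4_096, 32_768):
            packing = min_cells_per_token(cells, budget)
            print(f"{resolution}^3 = {cells:,} cells, budget {budget:,}: "
                  f">= {packing} cells per token")
```

Under these assumptions a dense 64^3 grid has 262,144 cells, so even a 32,768-token budget would need at least 8 cells per token. This is why the comment asks the authors to state the sequence length and any projection step explicitly.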
minor comments (2)
  1. The abstract states that 'extensive experiments with conventional metrics and PhysX-Bench show that PhysX-Omni performs strongly' but supplies no numerical values, baseline comparisons, or ablation results. Adding these in the main text or a results table would allow readers to assess the magnitude of the reported gains.
  2. It is unclear whether PhysXVerse and PhysX-Bench will be released publicly. Stating the release plan and any licensing details would strengthen the resource contribution.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive review. We address the single major comment below and will incorporate the requested clarifications into the revised manuscript.

Point-by-point responses
  1. Referee: Abstract: The performance improvements across asset categories are attributed to the 'novel and efficient geometry representation tailored for Vision-Language Models, which directly encodes high-resolution 3D structures without compression'. This is load-bearing for the central claim of a unified framework. Standard VLMs operate under token or patch limits (typically 4k–32k tokens or fixed-resolution inputs). An uncompressed high-resolution 3D structure (dense voxels, full meshes, or dense point clouds) would exceed these limits unless an implicit downsampling, sparse coding, learned latent projection, or 2D rendering step is used. The methods section must explicitly describe the input encoding, sequence length, and any projection mechanism; absent this detail, the contribution of the representation cannot be isolated from dataset or training effects.

    Authors: We agree that the abstract claim requires supporting technical detail in the Methods section to substantiate how high-resolution 3D structures are encoded for VLM compatibility. The current manuscript describes the representation at a high level but does not provide the explicit encoding mechanics, sequence lengths, or projection steps. In the revised manuscript we will add a dedicated subsection (with pseudocode and an accompanying figure) that specifies the input encoding pipeline, the exact token/sequence budget used, and the mechanism that preserves resolution without conventional lossy compression. This revision will allow readers to isolate the representation's contribution from dataset and training effects.
    Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper presents a new framework, dataset (PhysXVerse), and benchmark (PhysX-Bench) for unified 3D asset generation. Its core claims rest on empirical results from a claimed novel geometry representation for VLMs and downstream validation studies, without any equations, fitted parameters, or self-referential reductions that equate outputs to inputs by construction. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked in a way that collapses the derivation; the work is self-contained against external benchmarks and standard VLM practices.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities are specified in the provided text.

pith-pipeline@v0.9.0 · 5801 in / 1105 out tokens · 62052 ms · 2026-05-22T09:35:22.442980+00:00 · methodology
