PhysX-Omni: Unified Simulation-Ready Physical 3D Generation for Rigid, Deformable, and Articulated Objects

Fangzhou Hong; Haitian Li; Liang Pan; Runmao Yao; Yinghao Liu; Zhaoxi Chen; Ziang Cao; Ziwei Liu

arxiv: 2605.21572 · v1 · pith:43UC2BMQ · submitted 2026-05-20 · 💻 cs.CV · cs.RO

Ziang Cao, Yinghao Liu, Haitian Li, Runmao Yao, Fangzhou Hong, Zhaoxi Chen, Liang Pan, Ziwei Liu

Pith reviewed 2026-05-22 09:35 UTC · model grok-4.3

classification 💻 cs.CV cs.RO

keywords 3D generation · simulation-ready assets · vision-language models · rigid objects · deformable objects · articulated objects · physical simulation · embodied AI

The pith

PhysX-Omni generates simulation-ready 3D models for rigid, deformable, and articulated objects with one unified framework.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PhysX-Omni to create 3D assets that function directly in physical simulations across rigid, soft, and jointed object types. It develops a geometry representation for vision-language models that feeds full high-resolution 3D data without any compression step, which raises output quality for all categories. The authors assembled PhysXVerse as the first broad dataset of simulation-ready objects drawn from indoor and outdoor environments. They also built PhysX-Bench to measure six real-world attributes including geometry, absolute scale, material, affordance, kinematics, and function description. Experiments show the system succeeds at both asset creation and asset understanding, opening uses in full scene simulation and robot behavior training.

Core claim

PhysX-Omni is a unified framework for simulation-ready physical 3D generation across rigid, deformable, and articulated objects. It rests on a novel geometry representation tailored for vision-language models that directly encodes high-resolution 3D structures without compression, thereby improving generation performance. The framework is trained on the new PhysXVerse dataset and evaluated with PhysX-Bench on six attributes that test both generative and understanding capabilities.

What carries the argument

A novel and efficient geometry representation tailored for Vision-Language Models. It directly encodes high-resolution 3D structures without compression, and the paper credits it with improving generation performance across asset categories.

If this is right

Generation quality rises across rigid, deformable, and articulated asset categories.
The framework supports creation of complete simulation-ready indoor and outdoor scenes.
Trained policies for robotic tasks benefit from the availability of accurate physical 3D assets.
PhysX-Bench supplies a consistent way to compare both generative and understanding performance on physical properties.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same encoding approach could be tested on larger outdoor scenes with multiple interacting objects to check scalability.
Combining the generated assets with existing physics engines may speed up training loops for embodied agents.
The six-attribute benchmark could become a standard test set for other 3D generation methods that target simulation use.
Future extensions might add real-time material response or contact-rich interaction data to the dataset.

Load-bearing premise

The novel geometry representation tailored for Vision-Language Models directly encodes high-resolution 3D structures without compression and thereby significantly improves generation performance across asset categories.

What would settle it

Retrain the vision-language model on PhysXVerse with the direct high-resolution encoding replaced by a compressed alternative, then measure whether scores on the PhysX-Bench geometry and kinematics attributes fall sharply.
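The comparison described above can be sketched as a per-attribute difference between two evaluation runs. This is only an illustration: the attribute keys, the `falls_sharply` helper, and the 5-point threshold are hypothetical choices, and the scores in the usage example are placeholders, not results from the paper.

```python
from typing import Dict, Sequence

# The six PhysX-Bench attributes named in the abstract (key spelling is our own).
ATTRIBUTES = (
    "geometry",
    "absolute_scale",
    "material",
    "affordance",
    "kinematics",
    "function_description",
)


def attribute_drops(direct: Dict[str, float], compressed: Dict[str, float]) -> Dict[str, float]:
    """Score lost on each attribute when the direct encoding is swapped for a compressed one."""
    return {name: direct[name] - compressed[name] for name in ATTRIBUTES}


def falls_sharply(
    drops: Dict[str, float],
    watched: Sequence[str] = ("geometry", "kinematics"),
    threshold: float = 5.0,  # hypothetical cut-off for a "sharp" fall
) -> bool:
    """True if every watched attribute lost at least `threshold` points."""
    return all(drops[name] >= threshold for name in watched)


# Placeholder scores, for illustration only.
direct_run = {name: 50.0 for name in ATTRIBUTES}
compressed_run = dict(direct_run, geometry=40.0, kinematics=42.0)

drops = attribute_drops(direct_run, compressed_run)
print(drops["geometry"], drops["kinematics"], falls_sharply(drops))
```

A real ablation would also need matched training budgets and several seeds before a difference of this kind could be attributed to the encoding.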

Figures

Figures reproduced from arXiv: 2605.21572 by Fangzhou Hong, Haitian Li, Liang Pan, Runmao Yao, Yinghao Liu, Zhaoxi Chen, Ziang Cao, Ziwei Liu.

**Figure 1.** By exploiting the high diversity of PhysXVerse, PhysX-Omni is capable of generating detailed and general 3D assets covering rigid, deformable, and articulated objects, producing simulation-ready physical assets suitable for downstream applications.

**Figure 2.** Given a single complete or partially occluded image, PhysX-Omni first infers high-level overall information. It then employs a multi-turn generation process to produce detailed part-level geometry. Owing to the inherent alignment between global and local representations, these outputs can be directly integrated into simulation-ready physical 3D assets.

**Figure 3.** (a) Comparison of different geometry representations for 3D modeling.

**Figure 4.** Statistics and distribution of PhysXVerse.

**Figure 5.** Overview of PhysX-Bench. It consists of six key dimensions for comprehensively evaluating 3D structure, appearance, fundamental physical attributes, and understanding.

… over 2.9K categories, covering a wide range of object types, such as indoor furniture, unmanned aerial vehicles, robots, vehicles, and large-scale scene components. Compared with existing simulation-ready datasets, PhysXVerse exhibits substa…

**Figure 6.** Qualitative results. Compared with existing generative methods, our PhysX-Omni demonstrates impressive performance in generating complex geometries and rich physical attributes.

… simulation-ready physical 3D assets spanning diverse indoor and outdoor categories. The dataset covers rigid, articulated, and deformable objects with rich geometric structures and physical attributes. To improve view consistency…

**Figure 7.** Left: Comparison of our PhysX-Omni with other methods.

**Figure 8.** More qualitative results of PhysX-Omni. Additional results further demonstrate the robust generative performance of our method in complex scenarios.

… PhysX-Omni achieves a kinematic score of 80.72, significantly outperforming PhysX-Anything (65.99), PhysXGen (69.17), MonoArt (68.32), and Articulate-Anything (71.25). Similar improvements can also be observed for affordance understanding and descripti…

**Figure 9.** Visualization of the generated deformable objects.

**Figure 10.** Visualization of model using different geometry representations.

**Figure 11.** Robot manipulation on our generated sim-ready 3D assets.

**Figure 12.** Applications of our PhysX-Omni. We explore the potential applications of PhysX-Omni in sim-ready scene generation.

4.6 Validating human alignment of PhysX-Bench. To validate that PhysX-Bench can effectively reflect human perception and evaluation preferences, we further study the correlation between the automatic evaluation results produced by PhysX-Bench and human annotations. Specifically, following pri…
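The Section 4.6 excerpt above mentions correlating automatic PhysX-Bench results with human annotations but is cut off before naming the statistic. As a hedged illustration of one common choice, the sketch below computes a Spearman rank correlation in plain Python. Using Spearman here is our assumption, not something the excerpt confirms, and the function names are our own.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of the ranks they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Calling `spearman(benchmark_scores, human_scores)` on two equally long lists returns a value in [-1, 1]; values near 1 mean the benchmark orders assets much as the annotators do.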

Original abstract

Simulation-ready physical 3D assets have emerged as a promising direction owing to their broad applicability in downstream tasks. However, most existing 3D generation methods either neglect physical properties or are limited to a single asset category, e.g., rigid, deformable, or articulated objects. To address these limitations, we introduce PhysX-Omni, a unified framework for simulation-ready physical 3D generation across diverse asset types. Specifically, we develop a novel and efficient geometry representation tailored for Vision-Language Models, which directly encodes high-resolution 3D structures without compression, significantly improving generation performance. In addition, we construct the first general simulation-ready 3D dataset, PhysXVerse, covering diverse indoor and outdoor categories. Furthermore, to comprehensively and flexibly evaluate both generative and understanding capabilities in the wild, we propose PhysX-Bench, which encompasses six key attributes: geometry, absolute scale, material, affordance, kinematics, and function description. Extensive experiments with conventional metrics and PhysX-Bench show that PhysX-Omni performs strongly in both generation and understanding. Moreover, additional studies further validate the potential of PhysX-Omni for applications in simulation-ready scene generation and robotic policy learning. We believe PhysX-Omni can significantly advance a wide range of downstream applications, particularly in embodied AI and physics-based simulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PhysX-Omni unifies generation across rigid, deformable, and articulated objects with a new dataset and benchmark, but the compression-free high-res geometry claim for VLMs rests on an assumption that still needs direct evidence from the methods.

PhysX-Omni tries to build one generator that produces simulation-ready 3D assets for rigid, deformable, and articulated objects instead of treating each category separately. It also releases PhysXVerse as a broad dataset and PhysX-Bench as an evaluation set that checks geometry, absolute scale, material, affordance, kinematics, and function description. Those pieces address a clear practical gap for people who need consistent physical assets for robotics and embodied AI training environments.

The dataset and benchmark look like the most immediately usable parts of the work. A reader who needs assets that survive simulation without extra cleanup would find them worth looking at even if the generator itself is not adopted.

The central technical move is a geometry representation described as tailored for vision-language models that encodes high-resolution 3D structures directly without compression. The paper ties performance gains across the three object types to this choice. The stress-test note is on target here: current VLMs run under tight token or patch budgets, so an uncompressed high-resolution 3D input would normally force either extreme sequence lengths or some form of projection, rendering, or latent encoding. If the methods section turns out to include any of those steps, the attributed improvement becomes harder to isolate from dataset size or training details.

The abstract states that experiments with standard metrics and the new benchmark show strong results, plus follow-up tests on scene generation and robotic policy learning. Without the actual numbers, ablations, or comparison tables it is difficult to judge how large the lift is over existing single-category methods.

This paper is aimed at researchers who build large-scale physical simulation environments or who need assets that carry material and kinematic properties out of the box. It deserves a serious referee because the problem is relevant and the dataset plus benchmark are concrete contributions, even though the key encoding claim will need tighter documentation and controls to hold up under review.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces PhysX-Omni, a unified framework for generating simulation-ready physical 3D assets across rigid, deformable, and articulated object categories. It proposes a novel geometry representation tailored for Vision-Language Models that directly encodes high-resolution 3D structures without compression to improve generation performance, constructs the PhysXVerse dataset covering diverse indoor and outdoor scenes, and introduces PhysX-Bench to evaluate six attributes (geometry, absolute scale, material, affordance, kinematics, and function description). Experiments using conventional metrics and PhysX-Bench are reported to show strong results in both generation and understanding, with additional validation for simulation-ready scene generation and robotic policy learning.

Significance. If the geometry representation can be shown to achieve the claimed high-resolution encoding without implicit compression or downsampling and if the associated datasets and benchmarks are released, the work would offer a meaningful step toward unified physical 3D generation. This could support broader use in embodied AI and physics simulation by addressing the current fragmentation across object categories.

major comments (1)

Abstract: The performance improvements across asset categories are attributed to the 'novel and efficient geometry representation tailored for Vision-Language Models, which directly encodes high-resolution 3D structures without compression'. This is load-bearing for the central claim of a unified framework. Standard VLMs operate under token or patch limits (typically 4k–32k tokens or fixed-resolution inputs). An uncompressed high-resolution 3D structure (dense voxels, full meshes, or dense point clouds) would exceed these limits unless an implicit downsampling, sparse coding, learned latent projection, or 2D rendering step is used. The methods section must explicitly describe the input encoding, sequence length, and any projection mechanism; absent this detail, the contribution of the representation cannot be isolated from dataset or training effects.
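The size mismatch behind this comment is easy to check with arithmetic. The sketch below counts the cells in a dense cubic occupancy grid and compares that count with the 4k and 32k token budgets quoted above. It assumes one token per cell and a few illustrative resolutions; neither assumption comes from the paper.

```python
def dense_voxel_count(resolution: int) -> int:
    """Number of cells in a dense cubic occupancy grid with the given side length."""
    return resolution ** 3


def overshoot(resolution: int, token_budget: int) -> float:
    """How many times the dense cell count exceeds the budget, at one token per cell."""
    return dense_voxel_count(resolution) / token_budget


# Illustrative resolutions; the budgets are the 4k and 32k figures from the comment.
for res in (32, 64, 128, 256):
    cells = dense_voxel_count(res)
    print(f"{res}^3 = {cells:>10,} cells | "
          f"x{overshoot(res, 4_000):,.1f} of 4k | "
          f"x{overshoot(res, 32_000):,.1f} of 32k")
```

Even a 32-cubed grid slightly exceeds a 32k budget under this naive mapping, which is why the referee asks how the representation keeps sequences short.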

minor comments (2)

The abstract states that 'extensive experiments with conventional metrics and PhysX-Bench show that PhysX-Omni performs strongly' but supplies no numerical values, baseline comparisons, or ablation results. Adding these in the main text or a results table would allow readers to assess the magnitude of the reported gains.
It is unclear whether PhysXVerse and PhysX-Bench will be released publicly. Stating the release plan and any licensing details would strengthen the resource contribution.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive review. We address the single major comment below and will incorporate the requested clarifications into the revised manuscript.

Referee: Abstract: The performance improvements across asset categories are attributed to the 'novel and efficient geometry representation tailored for Vision-Language Models, which directly encodes high-resolution 3D structures without compression'. This is load-bearing for the central claim of a unified framework. Standard VLMs operate under token or patch limits (typically 4k–32k tokens or fixed-resolution inputs). An uncompressed high-resolution 3D structure (dense voxels, full meshes, or dense point clouds) would exceed these limits unless an implicit downsampling, sparse coding, learned latent projection, or 2D rendering step is used. The methods section must explicitly describe the input encoding, sequence length, and any projection mechanism; absent this detail, the contribution of the representation cannot be isolated from dataset or training effects.

Authors: We agree that the abstract claim requires supporting technical detail in the Methods section to substantiate how high-resolution 3D structures are encoded for VLM compatibility. The current manuscript describes the representation at a high level but does not provide the explicit encoding mechanics, sequence lengths, or projection steps. In the revised manuscript we will add a dedicated subsection (with pseudocode and an accompanying figure) that specifies the input encoding pipeline, the exact token/sequence budget used, and the mechanism that preserves resolution without conventional lossy compression. This revision will allow readers to isolate the representation's contribution from dataset and training effects.

revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper presents a new framework, dataset (PhysXVerse), and benchmark (PhysX-Bench) for unified 3D asset generation. Its core claims rest on empirical results from a claimed novel geometry representation for VLMs and downstream validation studies, without any equations, fitted parameters, or self-referential reductions that equate outputs to inputs by construction. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked in a way that collapses the derivation; the work is self-contained against external benchmarks and standard VLM practices.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities are specified in the provided text.

pith-pipeline@v0.9.0 · 5801 in / 1105 out tokens · 62052 ms · 2026-05-22T09:35:22.442980+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: we introduce a novel and efficient geometry representation tailored for Vision-Language Models, which directly encodes high-resolution 3D structures without compression... template-based RLE representation to explicitly and directly model high-resolution 3D geometry... sliced along the z-axis into a sequence of 2D binary masks... template layers

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: PhysX-Bench... six key attributes: geometry, absolute scale, material, affordance, kinematics, and function description
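The first passage quoted above mentions a template-based RLE representation in which geometry is sliced along the z-axis into 2D binary masks. The sketch below shows what run-length encoding of such slices can look like, under stated assumptions. The quoted fragment does not define "template layers"; here they are assumed to mean that a slice identical to an earlier one is stored as a reference to it. The layer tags (`"rle"`, `"ref"`) and all function names are hypothetical, and this is not the authors' format.

```python
from typing import List, Tuple

Mask = List[List[int]]        # one z-slice: rows of 0/1 occupancy values
Runs = List[Tuple[int, int]]  # (value, run_length) pairs in row-major order


def rle_encode(mask: Mask) -> Runs:
    """Run-length encode a 2D binary mask, scanning rows left to right."""
    runs: Runs = []
    for row in mask:
        for value in row:
            if runs and runs[-1][0] == value:
                runs[-1] = (value, runs[-1][1] + 1)
            else:
                runs.append((value, 1))
    return runs


def rle_decode(runs: Runs, height: int, width: int) -> Mask:
    """Invert rle_encode for a mask of known height and width."""
    flat = [value for value, length in runs for _ in range(length)]
    return [flat[r * width:(r + 1) * width] for r in range(height)]


def encode_volume(volume: List[Mask]) -> list:
    """Encode each slice; a repeat of an earlier slice becomes ("ref", index)."""
    layers = []
    first_seen = {}
    for z, mask in enumerate(volume):
        key = tuple(tuple(row) for row in mask)
        if key in first_seen:
            layers.append(("ref", first_seen[key]))  # assumed "template" reuse
        else:
            first_seen[key] = z
            layers.append(("rle", rle_encode(mask)))
    return layers


def decode_volume(layers: list, height: int, width: int) -> List[Mask]:
    """Rebuild the full stack of slices from encode_volume output."""
    volume: List[Mask] = []
    for kind, payload in layers:
        if kind == "ref":
            volume.append([row[:] for row in volume[payload]])
        else:
            volume.append(rle_decode(payload, height, width))
    return volume
```

The round trip through `encode_volume` and `decode_volume` is exact, so an encoding of this kind shortens the sequence without discarding any occupancy information. Whether that matches the paper's sense of "without compression" is not stated in the quoted fragment.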

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 6 internal anchors

[1] Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, and Jiaolong Yang. Structured 3d latents for scalable and versatile 3d generation. arXiv preprint arXiv:2412.01506, 2024.
[2] Fangzhou Hong, Jiaxiang Tang, Ziang Cao, Min Shi, Tong Wu, Zhaoxi Chen, Shuai Yang, Tengfei Wang, Liang Pan, Dahua Lin, et al. 3dtopia: Large text-to-3d generation model with hybrid diffusion priors. arXiv preprint arXiv:2403.02234, 2024.
[3] Zhaoxi Chen, Jiaxiang Tang, Yuhao Dong, Ziang Cao, Fangzhou Hong, Yushi Lan, Tengfei Wang, Haozhe Xie, Tong Wu, Shunsuke Saito, et al. 3dtopia-xl: Scaling high-quality 3d asset generation via primitive diffusion. arXiv preprint arXiv:2409.12957, 2024.
[4] Shuangkang Fang, I Shen, Yufeng Wang, Yi-Hsuan Tsai, Yi Yang, Shuchang Zhou, Wenrui Ding, Takeo Igarashi, Ming-Hsuan Yang, et al. Meshllm: Empowering large language models to progressively understand and generate 3d mesh. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14061–14072, 2025.
[5] Junliang Ye, Zhengyi Wang, Ruowen Zhao, Shenghao Xie, and Jun Zhu. Shapellm-omni: A native multimodal llm for 3d generation and understanding. arXiv preprint arXiv:2506.01853, 2025.
[6] Jianfeng Xiang, Xiaoxue Chen, Sicheng Xu, Ruicheng Wang, Zelong Lv, Yu Deng, Hongyuan Zhu, Yue Dong, Hao Zhao, Nicholas Jing Yuan, et al. Native and compact structured latents for 3d generation. arXiv preprint arXiv:2512.14692, 2025.
[7] Longwen Zhang, Qixuan Zhang, Haoran Jiang, Yinuo Bai, Wei Yang, Lan Xu, and Jingyi Yu. Bang: Dividing 3d assets via generative exploded dynamics. ACM Transactions on Graphics (TOG), 44(4):1–21, 2025.
[8] Yunhan Yang, Yufan Zhou, Yuan-Chen Guo, Zi-Xin Zou, Yukun Huang, Ying-Tian Liu, Hao Xu, Ding Liang, Yan-Pei Cao, and Xihui Liu. Omnipart: Part-aware 3d generation with semantic decoupling and structural cohesion. arXiv preprint arXiv:2507.06165, 2025.
[9] Zoey Chen, Aaron Walsman, Marius Memmel, Kaichun Mo, Alex Fang, Karthikeya Vemuri, Alan Wu, Dieter Fox, and Abhishek Gupta. Urdformer: A pipeline for constructing articulated simulation environments from real-world images. arXiv preprint arXiv:2405.11656, 2024.
[10] Jiayi Liu, Denys Iliash, Angel X Chang, Manolis Savva, and Ali Mahdavi-Amiri. Singapo: Single image controlled generation of articulated parts in objects. arXiv preprint arXiv:2410.16499, 2024.
[11] Long Le, Jason Xie, William Liang, Hung-Ju Wang, Yue Yang, Yecheng Jason Ma, Kyle Vedder, Arjun Krishna, Dinesh Jayaraman, and Eric Eaton. Articulate-anything: Automatic modeling of articulated objects via a vision-language foundation model. arXiv preprint arXiv:2410.13882, 2024.
[12] Ruijie Lu, Yu Liu, Jiaxiang Tang, Junfeng Ni, Yuxiang Wang, Diwen Wan, Gang Zeng, Yixin Chen, and Siyuan Huang. Dreamart: Generating interactable articulated objects from a single image. arXiv preprint arXiv:2507.05763, 2025.
[13] Haitian Li, Haozhe Xie, Junxiang Xu, Beichen Wen, Fangzhou Hong, and Ziwei Liu. Monoart: Progressive structural reasoning for monocular articulated 3d reconstruction. arXiv preprint arXiv:2603.19231, 2026.
[14] Tianyuan Zhang, Hong-Xing Yu, Rundi Wu, Brandon Y Feng, Changxi Zheng, Noah Snavely, Jiajun Wu, and William T Freeman. Physdreamer: Physics-based interaction with 3d objects via video generation. In European Conference on Computer Vision, pages 388–406, 2024.
[15] Minghao Guo, Bohan Wang, Pingchuan Ma, Tianyuan Zhang, Crystal Owens, Chuang Gan, Josh Tenenbaum, Kaiming He, and Wojciech Matusik. Physically compatible 3d object modeling from a single image. Advances in Neural Information Processing Systems, 37:119260–119282, 2024.
[16] Boyuan Chen, Hanxiao Jiang, Shaowei Liu, Saurabh Gupta, Yunzhu Li, Hao Zhao, and Shenlong Wang. Physgen3d: Crafting a miniature interactive world from a single image. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 6178–6189, 2025.
[17] Long Le, Ryan Lucas, Chen Wang, Chuhao Chen, Dinesh Jayaraman, Eric Eaton, and Lingjie Liu. Pixie: Fast and generalizable supervised learning of 3d physics from pixels. arXiv preprint arXiv:2508.17437, 2025.
[18] Chuhao Chen, Zhiyang Dou, Chen Wang, Yiming Huang, Anjun Chen, Qiao Feng, Jiatao Gu, and Lingjie Liu. Vid2sim: Generalizable, video-based reconstruction of appearance, geometry and physics for mesh-free simulation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26545–26555, 2025.
[19] Hanxiao Jiang, Hao-Yu Hsu, Kaifeng Zhang, Hsin-Ni Yu, Shenlong Wang, and Yunzhu Li. Phystwin: Physics-informed reconstruction and simulation of deformable objects from videos. arXiv preprint arXiv:2503.17973, 2025.
[20] Ziang Cao, Zhaoxi Chen, Liang Pan, and Ziwei Liu. Physx-3d: Physical-grounded 3d asset generation. arXiv preprint arXiv:2507.12465, 2025.
[21] Ziang Cao, Fangzhou Hong, Zhaoxi Chen, Liang Pan, and Ziwei Liu. Physx-anything: Simulation-ready physical 3d assets from single image. arXiv preprint arXiv:2511.13648, 2025.
[22] Xuanchi Ren, Jiahui Huang, Xiaohui Zeng, Ken Museth, Sanja Fidler, and Francis Williams. Xcube: Large-scale 3d generative modeling using sparse voxel hierarchies. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4209–4219, 2024.
[23] Shaocong Dong, Lihe Ding, Xiao Chen, Yaokun Li, Yuxin Wang, Yucheng Wang, Qi Wang, Jaehyeok Kim, Chenjian Gao, Zhanpeng Huang, et al. From one to more: Contextual part latents for 3d generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8230–8240, 2025.
[24] Eric R Chan, Connor Z Lin, Matthew A Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas J Guibas, Jonathan Tremblay, Sameh Khamis, et al. Efficient geometry-aware 3d generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16123–16133, 2022.
[25] Jun Gao, Tianchang Shen, Zian Wang, Wenzheng Chen, Kangxue Yin, Daiqing Li, Or Litany, Zan Gojcic, and Sanja Fidler. Get3d: A generative model of high quality 3d textured shapes learned from images. Advances in neural information processing systems, 35:31841–31854, 2022.
[26] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022.
[27] Jiaxiang Tang, Zhaoxi Chen, Xiaokang Chen, Tengfei Wang, Gang Zeng, and Ziwei Liu. Lgm: Large multi-view gaussian model for high-resolution 3d content creation. In European Conference on Computer Vision, pages 1–18. Springer, 2024.
[28] Jiale Xu, Weihao Cheng, Yiming Gao, Xintao Wang, Shenghua Gao, and Ying Shan. Instantmesh: Efficient 3d mesh generation from a single image with sparse-view large reconstruction models. arXiv preprint arXiv:2404.07191, 2024.
[29] Ziang Cao, Fangzhou Hong, Tong Wu, Liang Pan, and Ziwei Liu. Large-vocabulary 3d diffusion model with transformer. arXiv preprint arXiv:2309.07920, 2023.
[30] Ziang Cao, Fangzhou Hong, Tong Wu, Liang Pan, and Ziwei Liu. Difftf++: 3d-aware diffusion transformer for large-vocabulary 3d generation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.
[31]

Collaborative multi-modal coding for high- quality 3d generation.arXiv preprint arXiv:2508.15228, 2025

Ziang Cao, Zhaoxi Chen, Liang Pan, and Ziwei Liu. Collaborative multi-modal coding for high- quality 3d generation.arXiv preprint arXiv:2508.15228, 2025

work page arXiv 2025
[32]

Holopart: Generative 3d part amodal segmentation.arXiv preprint arXiv:2504.07943, 2025

Yunhan Yang, Yuan-Chen Guo, Yukun Huang, Zi-Xin Zou, Zhipeng Yu, Yangguang Li, Yan- Pei Cao, and Xihui Liu. Holopart: Generative 3d part amodal segmentation.arXiv preprint arXiv:2504.07943, 2025

work page arXiv 2025
[33]

Anchoreddream: Zero-shot 360 {\deg}indoor scene generation from a single view via geometric grounding.arXiv preprint arXiv:2601.16532, 2026

Runmao Yao, Junsheng Zhou, Zhen Dong, and Yu-Shen Liu. Anchoreddream: Zero-shot 360 {\deg}indoor scene generation from a single view via geometric grounding.arXiv preprint arXiv:2601.16532, 2026

work page arXiv 2026
[34]

Partcrafter: Structured 3d mesh generation via compositional latent diffusion trans- formers.arXiv preprint arXiv:2506.05573, 2025

Yuchen Lin, Chenguo Lin, Panwang Pan, Honglei Yan, Yiqiang Feng, Yadong Mu, and Katerina Fragkiadaki. Partcrafter: Structured 3d mesh generation via compositional latent diffusion trans- formers.arXiv preprint arXiv:2506.05573, 2025

work page arXiv 2025
[35]

Seed3d 1.0: From images to high-fidelity simulation-ready 3d assets

ByteDance Seed. Seed3d 1.0: From images to high-fidelity simulation-ready 3d assets. 2025

work page 2025
[36]

Meshanything: Artist-created mesh generation with autoregressive transformers.arXiv preprint arXiv:2406.10163, 2024

Yiwen Chen, Tong He, Di Huang, Weicai Ye, Sijin Chen, Jiaxiang Tang, Xin Chen, Zhongang Cai, Lei Yang, Gang Yu, et al. Meshanything: Artist-created mesh generation with autoregressive transformers.arXiv preprint arXiv:2406.10163, 2024

[37]

Yawar Siddiqui, Antonio Alliegro, Alexey Artemov, Tatiana Tommasi, Daniele Sirigatti, Vladislav Rosov, Angela Dai, and Matthias Nießner. Meshgpt: Generating triangle meshes with decoder-only transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19615–19625, 2024.

[38]

Zhengyi Wang, Jonathan Lorraine, Yikai Wang, Hang Su, Jun Zhu, Sanja Fidler, and Xiaohui Zeng. Llama-mesh: Unifying 3d mesh generation with language models. arXiv preprint arXiv:2411.09595, 2024.

[39]

Xiaowen Qiu, Jincheng Yang, Yian Wang, Zhehuan Chen, Yufei Wang, Tsun-Hsuan Wang, Zhou Xian, and Chuang Gan. Articulate anymesh: Open-vocabulary 3d articulated objects modeling. arXiv preprint arXiv:2502.02590, 2025.

[40]

Chaoyue Song, Jianfeng Zhang, Xiu Li, Fan Yang, Yiwen Chen, Zhongcong Xu, Jun Hao Liew, Xiaoyang Guo, Fayao Liu, Jiashi Feng, et al. Magicarticulate: Make your 3d models articulation-ready. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15998–16007, 2025.

[41]

Jiayi Su, Youhe Feng, Zheng Li, Jinhua Song, Yangfan He, Botao Ren, and Botian Xu. Artformer: Controllable generation of diverse 3d articulated objects. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1894–1904, 2025.

[42]

Honghua Chen, Yushi Lan, Yongwei Chen, and Xingang Pan. Artilatent: Realistic articulated 3d object generation via structured latents. In Proceedings of the SIGGRAPH Asia 2025 Conference Papers, pages 1–11, 2025.

[43]

Chuhao Chen, Isabella Liu, Xinyue Wei, Hao Su, and Minghua Liu. Freeart3d: Training-free articulated object generation using 3d diffusion. In Proceedings of the SIGGRAPH Asia 2025 Conference Papers, pages 1–13, 2025.

[44]

Abhishek Joshi, Beining Han, Jack Nugent, Max Gonzalez Saez-Diez, Yiming Zuo, Jonathan Liu, Hongyu Wen, Stamatis Alexandropoulos, Karhan Kayan, Anna Calveri, et al. Procedural generation of articulated simulation-ready assets, 2025. URL https://arxiv.org/abs/2505.10755.

[45]

Jiahui Lei, Congyue Deng, Bokui Shen, Leonidas Guibas, and Kostas Daniilidis. Nap: Neural 3d articulation prior. arXiv preprint arXiv:2305.16315, 2023.

[46]

Zhe Li, Xiang Bai, Jieyu Zhang, Zhuangzhe Wu, Che Xu, Ying Li, Chengkai Hou, and Shanghang Zhang. Urdf-anything: Constructing articulated objects with 3d multimodal language model. arXiv preprint arXiv:2511.00940, 2025.

[47]

Zhuangzhe Wu, Yue Xin, Chengkai Hou, Minghao Chen, Yaoxu Lyu, Jieyu Zhang, and Shanghang Zhang. Urdf-anything+: Autoregressive articulated 3d models generation for physical simulation. arXiv preprint arXiv:2603.14010, 2026.

[48]

Junyi Cao and Evangelos Kalogerakis. Sophy: Generating simulation-ready objects with physical materials. arXiv preprint arXiv:2504.12684, 2025.

[49]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.

[50]

Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything v2. Advances in Neural Information Processing Systems, 37:21875–21911, 2024.

[51]

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. SAM 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024.

