PhysX-Omni: Unified Simulation-Ready Physical 3D Generation for Rigid, Deformable, and Articulated Objects
Pith reviewed 2026-05-22 09:35 UTC · model grok-4.3
The pith
PhysX-Omni generates simulation-ready 3D models for rigid, deformable, and articulated objects with one unified framework.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PhysX-Omni is a unified framework for simulation-ready physical 3D generation across rigid, deformable, and articulated objects. It rests on a novel geometry representation tailored for vision-language models that directly encodes high-resolution 3D structures without compression, thereby improving generation performance. The framework is trained on the new PhysXVerse dataset and evaluated with PhysX-Bench on six attributes that test both generative and understanding capabilities.
What carries the argument
A novel, efficient geometry representation tailored for Vision-Language Models (VLMs). It directly encodes high-resolution 3D structures without compression, and the paper credits it with improved generation performance across asset categories.
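The paper passage quoted further down this page, under "Lean theorems connected to this paper", describes the representation as a template-based RLE in which the volume is sliced along the z-axis into 2D binary masks, with template layers. A minimal sketch of that idea follows. It is not the authors' implementation: the (start, length) run format, the row-major flattening, the reading of a "template layer" as a back-reference to the first identical slice, and all function names are assumptions made for illustration.

```python
from typing import Dict, List, Tuple, Union

Mask = List[List[int]]        # one z-slice: rows of 0/1 values
Runs = List[Tuple[int, int]]  # (start_index, run_length) over the flattened mask
Layer = Union[Runs, int]      # explicit runs, or index of an earlier identical layer


def rle_encode_mask(mask: Mask) -> Runs:
    """Run-length encode the occupied cells of a row-major flattened mask."""
    flat = [v for row in mask for v in row]
    runs: Runs = []
    start = None
    for i, v in enumerate(flat):
        if v and start is None:
            start = i                       # a run of 1s begins
        elif not v and start is not None:
            runs.append((start, i - start))  # the run ended just before i
            start = None
    if start is not None:
        runs.append((start, len(flat) - start))
    return runs


def encode_volume(volume: List[Mask]) -> List[Layer]:
    """Encode each z-slice; a repeated slice becomes a reference (assumed template)."""
    layers: List[Layer] = []
    first_seen: Dict[Tuple[Tuple[int, int], ...], int] = {}
    for z, mask in enumerate(volume):
        runs = rle_encode_mask(mask)
        key = tuple(runs)
        if key in first_seen:
            layers.append(first_seen[key])  # point back at the template layer
        else:
            first_seen[key] = z
            layers.append(runs)
    return layers


def decode_volume(layers: List[Layer], height: int, width: int) -> List[Mask]:
    """Invert encode_volume exactly; no occupancy information is lost."""
    volume: List[Mask] = []
    for layer in layers:
        runs = layers[layer] if isinstance(layer, int) else layer
        flat = [0] * (height * width)
        for start, length in runs:
            for i in range(start, start + length):
                flat[i] = 1
        volume.append([flat[r * width:(r + 1) * width] for r in range(height)])
    return volume
```

In this sketch the sequence length grows with the number of runs and distinct slices rather than with the number of voxels, and decoding reproduces the input exactly. Whether the paper's format behaves the same way depends on details this page does not give.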
If this is right
- Generation quality rises across rigid, deformable, and articulated asset categories.
- The framework supports creation of complete simulation-ready indoor and outdoor scenes.
- Trained policies for robotic tasks benefit from the availability of accurate physical 3D assets.
- PhysX-Bench supplies a consistent way to compare both generative and understanding performance on physical properties.
Where Pith is reading between the lines
- The same encoding approach could be tested on larger outdoor scenes with multiple interacting objects to check scalability.
- Combining the generated assets with existing physics engines may speed up training loops for embodied agents.
- The six-attribute benchmark could become a standard test set for other 3D generation methods that target simulation use.
- Future extensions might add real-time material response or contact-rich interaction data to the dataset.
Load-bearing premise
The novel geometry representation tailored for Vision-Language Models directly encodes high-resolution 3D structures without compression and thereby significantly improves generation performance across asset categories.
What would settle it
Retraining the vision-language model on PhysXVerse but replacing the direct high-resolution encoding with a compressed alternative and measuring whether scores on PhysX-Bench geometry and kinematics attributes fall sharply.
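To illustrate why such an ablation could be informative, the toy sketch below shows how one lossy stand-in (2x downsampling followed by upsampling) alters a thin structure before any model sees it. This is an assumption-laden illustration only: the occupancy grid, the "any child occupied" pooling rule, the IoU measure, and the function names are hypothetical, and nothing here reproduces the paper's protocol or the PhysX-Bench metrics.

```python
from typing import List

Grid = List[List[List[int]]]  # indexed [z][y][x], values 0/1, cubic with even side


def downsample2(grid: Grid) -> Grid:
    """Halve each axis; a coarse cell is occupied if any of its 8 children is."""
    half = len(grid) // 2
    return [[[int(any(grid[2 * z + dz][2 * y + dy][2 * x + dx]
                      for dz in (0, 1) for dy in (0, 1) for dx in (0, 1)))
              for x in range(half)]
             for y in range(half)]
            for z in range(half)]


def upsample2(grid: Grid) -> Grid:
    """Nearest-neighbour upsampling back to twice the resolution."""
    n = len(grid) * 2
    return [[[grid[z // 2][y // 2][x // 2] for x in range(n)]
             for y in range(n)]
            for z in range(n)]


def iou(a: Grid, b: Grid) -> float:
    """Intersection over union of two occupancy grids of equal shape."""
    inter = union = 0
    for plane_a, plane_b in zip(a, b):
        for row_a, row_b in zip(plane_a, plane_b):
            for va, vb in zip(row_a, row_b):
                inter += va & vb
                union += va | vb
    return inter / union if union else 1.0


def thin_plate(n: int, z_index: int) -> Grid:
    """An n^3 grid containing a one-voxel-thick plate at the given z."""
    return [[[1 if z == z_index else 0 for _ in range(n)]
             for _ in range(n)]
            for z in range(n)]


if __name__ == "__main__":
    plate = thin_plate(8, 3)
    lossy = upsample2(downsample2(plate))
    print(iou(plate, lossy))  # the plate doubles in thickness after the round trip
```

A lossless encoding would score an IoU of 1.0 on the same round trip, so the gap between the two is the kind of effect the proposed ablation would try to measure on real benchmark scores.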
Original abstract
Simulation-ready physical 3D assets have emerged as a promising direction owing to their broad applicability in downstream tasks. However, most existing 3D generation methods either neglect physical properties or are limited to a single asset category, e.g., rigid, deformable, or articulated objects. To address these limitations, we introduce PhysX-Omni, a unified framework for simulation-ready physical 3D generation across diverse asset types. Specifically, we develop a novel and efficient geometry representation tailored for Vision-Language Models, which directly encodes high-resolution 3D structures without compression, significantly improving generation performance. In addition, we construct the first general simulation-ready 3D dataset, PhysXVerse, covering diverse indoor and outdoor categories. Furthermore, to comprehensively and flexibly evaluate both generative and understanding capabilities in the wild, we propose PhysX-Bench, which encompasses six key attributes: geometry, absolute scale, material, affordance, kinematics, and function description. Extensive experiments with conventional metrics and PhysX-Bench show that PhysX-Omni performs strongly in both generation and understanding. Moreover, additional studies further validate the potential of PhysX-Omni for applications in simulation-ready scene generation and robotic policy learning. We believe PhysX-Omni can significantly advance a wide range of downstream applications, particularly in embodied AI and physics-based simulation.
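The six PhysX-Bench attributes named in the abstract can be pictured as one record per evaluated asset. The sketch below is a hypothetical data structure, not the benchmark's actual schema: the class name, the [0, 1] score range, and the unweighted mean are assumptions, since the abstract does not say how attributes are scored or aggregated.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class PhysXBenchRecord:
    """One score per attribute listed in the abstract (range [0, 1] is assumed)."""
    geometry: float
    absolute_scale: float
    material: float
    affordance: float
    kinematics: float
    function_description: float

    def __post_init__(self) -> None:
        # Reject scores outside the assumed range so bad inputs fail early.
        for f in fields(self):
            value = getattr(self, f.name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{f.name} must be in [0, 1], got {value}")

    def mean(self) -> float:
        """Unweighted mean over the six attributes (an assumed aggregation)."""
        names = [f.name for f in fields(self)]
        return sum(getattr(self, n) for n in names) / len(names)
```

Keeping the attributes as separate fields, rather than a single aggregate, makes it possible to compare methods per attribute, which is what a comparison on geometry and kinematics alone would need.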
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces PhysX-Omni, a unified framework for generating simulation-ready physical 3D assets across rigid, deformable, and articulated object categories. It proposes a novel geometry representation tailored for Vision-Language Models that directly encodes high-resolution 3D structures without compression to improve generation performance, constructs the PhysXVerse dataset covering diverse indoor and outdoor scenes, and introduces PhysX-Bench to evaluate six attributes (geometry, absolute scale, material, affordance, kinematics, and function description). Experiments using conventional metrics and PhysX-Bench are reported to show strong results in both generation and understanding, with additional validation for simulation-ready scene generation and robotic policy learning.
Significance. If the geometry representation can be shown to achieve the claimed high-resolution encoding without implicit compression or downsampling and if the associated datasets and benchmarks are released, the work would offer a meaningful step toward unified physical 3D generation. This could support broader use in embodied AI and physics simulation by addressing the current fragmentation across object categories.
major comments (1)
- Abstract: The performance improvements across asset categories are attributed to the 'novel and efficient geometry representation tailored for Vision-Language Models, which directly encodes high-resolution 3D structures without compression'. This is load-bearing for the central claim of a unified framework. Standard VLMs operate under token or patch limits (typically 4k–32k tokens or fixed-resolution inputs). An uncompressed high-resolution 3D structure (dense voxels, full meshes, or dense point clouds) would exceed these limits unless an implicit downsampling, sparse coding, learned latent projection, or 2D rendering step is used. The methods section must explicitly describe the input encoding, sequence length, and any projection mechanism; absent this detail, the contribution of the representation cannot be isolated from dataset or training effects.
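The size argument in this comment can be checked with back-of-the-envelope arithmetic. The sketch below assumes the naive case of one token per cell of a dense cubic voxel grid, which is an illustrative assumption and not the paper's encoding. The budgets of 4,000 and 32,000 stand in for the referee's "4k–32k" range, and the function names are hypothetical.

```python
def dense_voxel_count(resolution: int) -> int:
    """Number of cells in a dense cubic grid with `resolution` cells per axis."""
    return resolution ** 3


def fits_budget(resolution: int, token_budget: int) -> bool:
    """True if one token per voxel (an assumption) stays within the budget."""
    return dense_voxel_count(resolution) <= token_budget


if __name__ == "__main__":
    for res in (16, 32, 64, 128):
        cells = dense_voxel_count(res)
        print(res, cells, fits_budget(res, 4_000), fits_budget(res, 32_000))
```

Under this assumption a grid of 32 cells per axis already has 32,768 cells and a grid of 64 cells per axis has 262,144, so any dense encoding at those resolutions needs some mechanism to shorten the sequence.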
minor comments (2)
- The abstract states that 'extensive experiments with conventional metrics and PhysX-Bench show that PhysX-Omni performs strongly' but supplies no numerical values, baseline comparisons, or ablation results. Adding these in the main text or a results table would allow readers to assess the magnitude of the reported gains.
- It is unclear whether PhysXVerse and PhysX-Bench will be released publicly. Stating the release plan and any licensing details would strengthen the resource contribution.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. We address the single major comment below and will incorporate the requested clarifications into the revised manuscript.
- Referee: Abstract: The performance improvements across asset categories are attributed to the 'novel and efficient geometry representation tailored for Vision-Language Models, which directly encodes high-resolution 3D structures without compression'. This is load-bearing for the central claim of a unified framework. Standard VLMs operate under token or patch limits (typically 4k–32k tokens or fixed-resolution inputs). An uncompressed high-resolution 3D structure (dense voxels, full meshes, or dense point clouds) would exceed these limits unless an implicit downsampling, sparse coding, learned latent projection, or 2D rendering step is used. The methods section must explicitly describe the input encoding, sequence length, and any projection mechanism; absent this detail, the contribution of the representation cannot be isolated from dataset or training effects.
  Authors: We agree that the abstract claim requires supporting technical detail in the Methods section to substantiate how high-resolution 3D structures are encoded for VLM compatibility. The current manuscript describes the representation at a high level but does not provide the explicit encoding mechanics, sequence lengths, or projection steps. In the revised manuscript we will add a dedicated subsection (with pseudocode and an accompanying figure) that specifies the input encoding pipeline, the exact token/sequence budget used, and the mechanism that preserves resolution without conventional lossy compression. This revision will allow readers to isolate the representation's contribution from dataset and training effects.
  Revision: yes
Circularity Check
No significant circularity detected
The paper presents a new framework, dataset (PhysXVerse), and benchmark (PhysX-Bench) for unified 3D asset generation. Its core claims rest on empirical results from a claimed novel geometry representation for VLMs and downstream validation studies, without any equations, fitted parameters, or self-referential reductions that equate outputs to inputs by construction. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked in a way that collapses the derivation; the work is self-contained against external benchmarks and standard VLM practices.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: we introduce a novel and efficient geometry representation tailored for Vision-Language Models, which directly encodes high-resolution 3D structures without compression... template-based RLE representation to explicitly and directly model high-resolution 3D geometry... sliced along the z-axis into a sequence of 2D binary masks... template layers
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: PhysX-Bench... six key attributes: geometry, absolute scale, material, affordance, kinematics, and function description
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.