
arxiv: 2512.14671 · v3 · pith:THH2Z66Anew · submitted 2025-12-16 · 💻 cs.CV

ART: Articulated Reconstruction Transformer

Pith reviewed 2026-05-21 17:14 UTC · model grok-4.3

classification 💻 cs.CV
keywords articulated object reconstruction · transformer architecture · part-based prediction · 3D geometry · articulation parameters · sparse image inputs · feed-forward model · category-agnostic reconstruction

The pith

ART reconstructs complete 3D articulated objects from sparse multi-state RGB images by mapping inputs to learnable part slots that decode per-part geometry, texture, and articulation parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a feed-forward transformer called ART that reconstructs full 3D models of articulated objects without being limited to one object category. It breaks objects into rigid parts and uses a set of learnable slots to jointly predict each part's shape, appearance, and how the parts move relative to one another. This matters because earlier approaches either needed slow per-object optimization with fragile correspondences or worked only on narrow classes of objects. If the method holds, it would turn a handful of images into simulation-ready 3D models in one pass.

Core claim

ART is a category-agnostic feed-forward model that reconstructs complete 3D articulated objects from only sparse, multi-state RGB images by treating objects as assemblies of rigid parts. The transformer maps the image inputs to a set of learnable part slots and jointly decodes unified representations for each part that include its 3D geometry, texture, and explicit articulation parameters. The resulting models are physically interpretable and directly exportable for simulation.
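To make "assemblies of rigid parts" with "explicit articulation parameters" concrete, here is a minimal Python sketch that evaluates a two-part object at different articulation states. It is not the paper's implementation: the box-shaped part SDFs, the single revolute joint, the min-based union, and every name (`rotate_about_axis`, `box_sdf`, `articulated_sdf`, the `state` scalar in [0, 1]) are assumptions made for illustration. The Figure 2 caption mentions SDF volume rendering; the step from SDF values to rendered pixels is omitted here.

```python
import math

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def rotate_about_axis(p, axis, pivot, angle):
    """Rotate point p by `angle` about the line through `pivot` along `axis` (Rodrigues' formula)."""
    norm = math.sqrt(sum(a * a for a in axis))
    k = [a / norm for a in axis]
    v = [p[i] - pivot[i] for i in range(3)]
    c, s = math.cos(angle), math.sin(angle)
    kxv = cross(k, v)
    kdv = sum(k[i] * v[i] for i in range(3))
    return [v[i] * c + kxv[i] * s + k[i] * kdv * (1.0 - c) + pivot[i] for i in range(3)]

def box_sdf(p, center, half):
    """Signed distance from p to an axis-aligned box (negative inside)."""
    q = [abs(p[i] - center[i]) - half[i] for i in range(3)]
    outside = math.sqrt(sum(max(x, 0.0) ** 2 for x in q))
    inside = min(max(q), 0.0)
    return outside + inside

def articulated_sdf(p, parts, state):
    """Union (min) of per-part SDFs, with each moving part posed by its joint."""
    values = []
    for part in parts:
        q = p
        joint = part.get("joint")
        if joint is not None:
            # Undo the joint motion so the query lands in the part's rest frame.
            q = rotate_about_axis(p, joint["axis"], joint["pivot"], -state * joint["max_angle"])
        values.append(box_sdf(q, part["center"], part["half"]))
    return min(values)

# Hypothetical object: a static base and a door hinged on the z-axis at the origin.
base = {"center": (0.0, 0.0, -1.0), "half": (0.5, 0.5, 0.5)}
door = {"center": (0.5, 0.0, 0.0), "half": (0.5, 0.05, 0.5),
        "joint": {"axis": (0.0, 0.0, 1.0), "pivot": (0.0, 0.0, 0.0), "max_angle": math.pi / 2}}
parts = [base, door]

closed = articulated_sdf((0.5, 0.0, 0.0), parts, state=0.0)  # inside the closed door
opened = articulated_sdf((0.0, 0.5, 0.0), parts, state=1.0)  # inside the opened door
```

A negative value means the query point lies inside some part at that state. Keeping each part's shape in a fixed rest frame and moving only the query point is one simple way to pose the same geometry at several states.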

What carries the argument

Learnable part slots inside the transformer that receive sparse image inputs and decode unified per-part representations for 3D geometry, texture, and articulation parameters.
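The idea of slots that "receive" image information can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not ART's architecture: it uses a single cross-attention step with no learned projections, random vectors in place of trained slots and image tokens, and hypothetical sizes (`NUM_SLOTS`, `NUM_TOKENS`, `DIM`). The Figure 2 caption says slots are processed by a transformer alongside image tokens, which may well mean joint self-attention rather than the cross-attention shown here.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def slots_attend(slots, image_tokens):
    """One attention step: each slot (query) pools the image tokens (keys and values)."""
    dim = len(slots[0])
    scale = 1.0 / math.sqrt(dim)
    updated = []
    for slot in slots:
        weights = softmax([dot(slot, tok) * scale for tok in image_tokens])
        pooled = [sum(w * tok[i] for w, tok in zip(weights, image_tokens)) for i in range(dim)]
        # Residual update: the slot keeps its identity and adds what it gathered.
        updated.append([s + p for s, p in zip(slot, pooled)])
    return updated

rng = random.Random(0)
NUM_SLOTS, NUM_TOKENS, DIM = 4, 16, 8  # hypothetical sizes, not from the paper
slots = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_SLOTS)]
image_tokens = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_TOKENS)]

part_features = slots_attend(slots, image_tokens)  # one feature vector per slot
```

In a full model, each resulting slot feature would feed decoders for that part's geometry, texture, and articulation; those decoders are outside the scope of this sketch.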

If this is right

  • Reconstructions are physically interpretable and can be exported directly for use in simulation.
  • The model produces complete 3D outputs from sparse multi-state image inputs without per-object optimization.
  • Performance improves over prior baselines and reaches state-of-the-art results across diverse benchmarks.
  • Joint decoding supplies geometry, texture, and explicit articulation parameters for every part in a single forward pass.
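As an illustration of what "exported directly for use in simulation" could look like, the sketch below writes per-part articulation parameters into URDF, a common robot description format. The page does not state which export format ART uses, so URDF is an assumption here, as are the `parts_to_urdf` helper, the dictionary layout of each part, and the example cabinet. Geometry and texture (mesh references, inertial data) are left out.

```python
import xml.etree.ElementTree as ET

def fmt(values):
    return " ".join(f"{v:g}" for v in values)

def parts_to_urdf(name, parts):
    """Build a URDF string with one link per part and one joint per child part."""
    robot = ET.Element("robot", name=name)
    for part in parts:
        ET.SubElement(robot, "link", name=part["name"])
    for part in parts:
        if part.get("parent") is None:
            continue  # the root part has no joint
        joint = ET.SubElement(robot, "joint",
                              name=f"{part['parent']}_to_{part['name']}",
                              type=part["joint_type"])
        ET.SubElement(joint, "parent", link=part["parent"])
        ET.SubElement(joint, "child", link=part["name"])
        ET.SubElement(joint, "origin", xyz=fmt(part["origin"]), rpy="0 0 0")
        ET.SubElement(joint, "axis", xyz=fmt(part["axis"]))
        lower, upper = part["limits"]
        # effort and velocity are placeholder values; URDF requires them for these joint types.
        ET.SubElement(joint, "limit", lower=f"{lower:g}", upper=f"{upper:g}",
                      effort="1", velocity="1")
    return ET.tostring(robot, encoding="unicode")

# Hypothetical predicted parts for a cabinet with one door and one drawer.
cabinet = [
    {"name": "base", "parent": None},
    {"name": "door", "parent": "base", "joint_type": "revolute",
     "origin": (0.4, 0.0, 0.0), "axis": (0, 0, 1), "limits": (0.0, 1.57)},
    {"name": "drawer", "parent": "base", "joint_type": "prismatic",
     "origin": (0.0, 0.0, 0.2), "axis": (1, 0, 0), "limits": (0.0, 0.3)},
]
urdf_text = parts_to_urdf("cabinet", cabinet)
```

The point of the sketch is that explicit joint type, axis, pivot, and range map one-to-one onto fields a simulator already understands, which is what makes such outputs "physically interpretable".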

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The part-based formulation could support downstream tasks such as physics-based planning once the 3D models are obtained.
  • Because the method is feed-forward, it might enable reconstruction pipelines that run at interactive speeds once trained.
  • Scaling the training data further could extend the approach to objects with more complex part hierarchies.

Load-bearing premise

A large-scale diverse dataset with per-part supervision exists and is sufficient to train a category-agnostic model that generalizes to unseen articulated objects.

What would settle it

A controlled test in which the trained model produces inaccurate geometry or articulation on a collection of articulated objects drawn from categories absent from the training set would falsify the generalization claim.
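The test described above depends on a split in which evaluation categories never appear in training. A minimal sketch of such a split follows; the record layout, the category names, and the `split_by_category` helper are hypothetical and do not describe the paper's actual evaluation protocol.

```python
import random

def split_by_category(objects, num_heldout, seed=0):
    """Hold out whole categories so test objects come from unseen categories."""
    categories = sorted({obj["category"] for obj in objects})
    rng = random.Random(seed)
    heldout = set(rng.sample(categories, num_heldout))
    train = [obj for obj in objects if obj["category"] not in heldout]
    test = [obj for obj in objects if obj["category"] in heldout]
    return train, test, heldout

# Toy records for illustration only.
objects = [
    {"id": 0, "category": "cabinet"}, {"id": 1, "category": "cabinet"},
    {"id": 2, "category": "laptop"}, {"id": 3, "category": "scissors"},
    {"id": 4, "category": "microwave"}, {"id": 5, "category": "laptop"},
]
train, test, heldout = split_by_category(objects, num_heldout=2)
```

Splitting by object instance instead of by category would only test generalization to new instances of familiar categories, which is a weaker claim than the category-agnostic one.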

Figures

Figures reproduced from arXiv: 2512.14671 by Chen Geng, Cheng Zhang, Henry Howard-Jenkins, Jakob Engel, Jiajun Wu, Richard Newcombe, Zhao Dong, Zhaoyang Lv, Zhengqin Li, Zizhang Li.

Figure 1: We propose ART, a feed-forward model that reconstructs articulated objects from images by decomposing them into rigid parts.
Figure 2: Model architecture of ART. Multi-view, multi-state image inputs with known camera poses are tokenized and processed by a transformer alongside learnable part slot tokens. Two separate decoders then predict each part’s geometry/texture and the articulation structure, and SDF volume rendering composes these components to render and reconstruct the articulated object at different states.
Figure 3: Articulated object data samples from two of our data collections: Procedural (left two columns) and StorageFurniture (right two columns). For each object, we show two states from one dynamic sequence under the same viewpoint.
Viewpoint information. For each input image we compute the Plücker ray representation [19, 55] using known camera intrinsics and extrinsics, denoted as (v, v × o), where v is the …
Figure 4: Qualitative results. “Input Images” shows the multi-state inputs provided to the model (visualized by one image). “Predicted Parts” visualizes the predicted per-part bounding boxes in the start and end states; “Reconstructed Textured Articulated Meshes” displays the textured 3D reconstructions for the same two articulated states.
Panel labels (comparison figure): Input image, Predicted parts, GT, Ours, SINGAPO.
Figure 6: Qualitative comparison with optimization baseline. With the same sparse inputs, our method reconstructs complete, high-fidelity textured meshes in both states. ArtGS [42] lacks reliable correspondences under sparsity, leading to fragmented and inaccurate results.
… nonetheless, its lower PSNR and higher LPIPS scores indicate poor appearance recovery. This disparity is rooted in sparsity: optimization …
Figure 7: Real-world images results.
… view as input. This change produces a clear drop in quality across all metrics, as expected, confirming that multiple views are crucial for resolving ambiguities and achieving high-fidelity, geometrically accurate articulated reconstruction. Rest-state formulation. Removing the rest-state formulation and predicting parts relative to the first observed frame leads to a severe q…
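The Plücker ray representation quoted alongside Figure 3 can be illustrated with a short sketch. The excerpt is truncated, so reading v as the unit ray direction and o as the ray origin (the camera center) is an assumption, as are the function names. The sketch follows the excerpt's ordering (v, v × o); many references write the moment as o × v instead, which differs only in sign. Deriving per-pixel directions from camera intrinsics and extrinsics is not shown.

```python
import math

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def plucker_ray(origin, direction):
    """Return the 6-vector (v, v x o) for a ray with unit direction v through point o."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    v = tuple(d / norm for d in direction)
    return v + cross(v, origin)

# A ray along +x passing through (0, 0, 1).
ray = plucker_ray(origin=(0.0, 0.0, 1.0), direction=(2.0, 0.0, 0.0))
# The same line described from a different point on it.
shifted = plucker_ray(origin=(3.0, 0.0, 1.0), direction=(2.0, 0.0, 0.0))
```

Both calls describe the same line and yield the same six numbers, because the cross product discards the component of o along v. That independence from the chosen point on the ray is what makes this encoding convenient as a per-pixel camera embedding.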
Original abstract

We introduce ART, Articulated Reconstruction Transformer -- a category-agnostic, feed-forward model that reconstructs complete 3D articulated objects from only sparse, multi-state RGB images. Previous methods for articulated object reconstruction either rely on slow optimization with fragile cross-state correspondences or use feed-forward models limited to specific object categories. In contrast, ART treats articulated objects as assemblies of rigid parts, formulating reconstruction as part-based prediction. Our newly designed transformer architecture maps sparse image inputs to a set of learnable part slots, from which ART jointly decodes unified representations for individual parts, including their 3D geometry, texture, and explicit articulation parameters. The resulting reconstructions are physically interpretable and readily exportable for simulation. Trained on a large-scale, diverse dataset with per-part supervision, and evaluated across diverse benchmarks, ART achieves significant improvements over existing baselines and establishes a new state of the art for articulated object reconstruction from image inputs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces ART, a category-agnostic feed-forward transformer for reconstructing complete 3D articulated objects from sparse multi-state RGB images. It models objects as rigid part assemblies, using a transformer with learnable part slots to jointly predict per-part 3D geometry, texture, and explicit articulation parameters. The model is trained with per-part supervision on a large-scale diverse dataset and claims new state-of-the-art results on diverse benchmarks, with outputs that are physically interpretable and exportable for simulation.

Significance. If the benchmark gains and generalization claims hold after verification of the training data and ablations, this would be a meaningful contribution to articulated object reconstruction. It shifts from slow optimization or category-specific feed-forward models to a general, fast, part-based transformer approach with explicit articulation outputs suitable for downstream simulation and robotics tasks.

major comments (1)
  1. [Abstract] The central claim of category-agnostic generalization rests on training with a 'large-scale, diverse dataset with per-part supervision' (Abstract). No statistics are supplied on category count, object instances, part-count distribution, image density per object, or label provenance (synthetic vs. real). This information is load-bearing for assessing whether the training distribution covers the variability needed for true unseen-object generalization.
minor comments (1)
  1. [Abstract] The abstract uses the phrase 'multi-state RGB images' without an early definition of what constitutes a state (different articulation poses, camera views, or both); a brief parenthetical clarification would improve accessibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our submission. The concern about dataset transparency is well-taken and directly relevant to the strength of our category-agnostic generalization claims. We address the point below and will revise the manuscript to include the requested details.

Point-by-point responses
  1. Referee: [Abstract] The central claim of category-agnostic generalization rests on training with a 'large-scale, diverse dataset with per-part supervision' (Abstract). No statistics are supplied on category count, object instances, part-count distribution, image density per object, or label provenance (synthetic vs. real). This information is load-bearing for assessing whether the training distribution covers the variability needed for true unseen-object generalization.

    Authors: We agree that quantitative dataset statistics are necessary to evaluate the scope of generalization and that their absence from the abstract weakens the central claim. While the full manuscript describes the data collection pipeline and per-part supervision in Section 4, we acknowledge that a concise summary of key statistics is missing from the abstract and early sections. In the revised manuscript we will add a short 'Training Data' paragraph (or table) immediately after the abstract or in Section 3 that reports: total categories covered, number of distinct object instances, part-count distribution (mean, range, and histogram summary), average number of multi-state views per object, and explicit confirmation that all labels are synthetic with ground-truth per-part annotations obtained from a physics-based simulator. This addition will allow readers to assess coverage of variability for unseen objects without altering any experimental results or claims.

    revision: yes
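The statistics listed in the response above are simple to compute once per-object metadata is available. The sketch below shows one way to do so; the record fields (`category`, `num_parts`, `num_views`), the `summarize` helper, and the toy records are all hypothetical and carry no information about the paper's actual dataset.

```python
from collections import Counter
from statistics import mean

def summarize(records):
    """Aggregate the dataset statistics named in the rebuttal from per-object records."""
    part_counts = [r["num_parts"] for r in records]
    return {
        "num_categories": len({r["category"] for r in records}),
        "num_objects": len(records),
        "parts_mean": mean(part_counts),
        "parts_range": (min(part_counts), max(part_counts)),
        "parts_histogram": dict(sorted(Counter(part_counts).items())),
        "views_mean": mean(r["num_views"] for r in records),
    }

# Toy records for illustration only; these are not figures from the paper.
records = [
    {"category": "cabinet", "num_parts": 3, "num_views": 8},
    {"category": "cabinet", "num_parts": 5, "num_views": 8},
    {"category": "laptop", "num_parts": 2, "num_views": 4},
    {"category": "microwave", "num_parts": 2, "num_views": 4},
]
stats = summarize(records)
```

Reporting the histogram alongside the mean matters because a dataset dominated by two-part objects can have a reasonable mean part count while offering little coverage of complex part hierarchies.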

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper describes a feed-forward transformer architecture that maps image inputs to part slots and decodes geometry, texture, and articulation parameters. No equations, derivations, or self-referential definitions appear that reduce any claimed output to fitted inputs or prior self-citations by construction. The model is trained on an external dataset and evaluated on benchmarks, rendering the central claims empirically grounded rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only view yields no explicit free parameters, axioms, or invented physical entities; the model introduces learnable part slots as an architectural choice whose count and initialization are unspecified.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · 7 internal anchors

  1. [1]

    Learning to generalize kinematic models to novel objects

    Ben Abbatematteo, Stefanie Tellex, and George Konidaris. Learning to generalize kinematic models to novel objects. In Proceedings of the 3rd Conference on Robot Learning, 2019.

  2. [2]

    Hexplane: A fast representation for dynamic scenes

    Ang Cao and Justin Johnson. Hexplane: A fast representation for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 130–141, 2023.

  3. [3]

    Autopartgen: Autoregressive 3d part generation and discovery

    Minghao Chen, Jianyuan Wang, Roman Shapovalov, Tom Monnier, Hyunyoung Jung, Dilin Wang, Rakesh Ranjan, Iro Laina, and Andrea Vedaldi. Autopartgen: Autoregressive 3d part generation and discovery. arXiv preprint arXiv:2507.13346, 2025.

  4. [4]

    Urdformer: A pipeline for constructing articulated simulation environments from real-world images

    Zoey Chen, Aaron Walsman, Marius Memmel, Kaichun Mo, Alex Fang, Karthikeya Vemuri, Alan Wu, Dieter Fox, and Abhishek Gupta. Urdformer: A pipeline for constructing articulated simulation environments from real-world images. arXiv preprint arXiv:2405.11656, 2024.

  5. [5]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.

  6. [6]

    Automated creation of digital cousins for robust policy learning

    Tianyuan Dai, Josiah Wong, Yunfan Jiang, Chen Wang, Cem Gokmen, Ruohan Zhang, Jiajun Wu, and Li Fei-Fei. Automated creation of digital cousins for robust policy learning. arXiv preprint arXiv:2410.07408, 2024.

  7. [7]

    Objaverse-xl: A universe of 10m+ 3d objects

    Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, et al. Objaverse-xl: A universe of 10m+ 3d objects. Advances in Neural Information Processing Systems, 36:35799–35813, 2023.

  8. [8]

    Objaverse: A universe of annotated 3d objects

    Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13142–13153, 2023.

  9. [9]

    Articulate your nerf: Unsupervised articulated object modeling via conditional view synthesis

    Jianning Deng, Kartic Subr, and Hakan Bilen. Articulate your nerf: Unsupervised articulated object modeling via conditional view synthesis. Advances in Neural Information Processing Systems, 37:119717–119741, 2024.

  10. [10]

    Anymate: A dataset and baselines for learning 3d object rigging

    Yufan Deng, Yuhao Zhang, Chen Geng, Shangzhe Wu, and Jiajun Wu. Anymate: A dataset and baselines for learning 3d object rigging. In Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers, pages 1–10, 2025.

  11. [11]

    Probing the 3d awareness of visual foundation models

    Mohamed El Banani, Amit Raj, Kevis-Kokitsi Maninis, Abhishek Kar, Yuanzhen Li, Michael Rubinstein, Deqing Sun, Leonidas Guibas, Justin Johnson, and Varun Jampani. Probing the 3d awareness of visual foundation models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21795–21806, 2024.

  12. [12]

    Capt: Category-level articulation estimation from a single point cloud using transformer

    Lian Fu, Ryoichi Ishikawa, Yoshihiro Sato, and Takeshi Oishi. Capt: Category-level articulation estimation from a single point cloud using transformer. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 751–757. IEEE, 2024.

  13. [13]

    Meshart: Generating articulated meshes with structure-guided transformers

    Daoyi Gao, Yawar Siddiqui, Lei Li, and Angela Dai. Meshart: Generating articulated meshes with structure-guided transformers. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 618–627, 2025.

  14. [14]

    Partrm: Modeling part-level dynamics with large cross-state reconstruction model

    Mingju Gao, Yike Pan, Huan-ang Gao, Zongzheng Zhang, Wenyi Li, Hao Dong, Hao Tang, Li Yi, and Hao Zhao. Partrm: Modeling part-level dynamics with large cross-state reconstruction model. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 7004–7014,

  15. [15]

    Carto: Category and joint agnostic reconstruction of articulated objects

    Nick Heppert, Muhammad Zubair Irshad, Sergey Zakharov, Katherine Liu, Rares Andrei Ambrus, Jeannette Bohg, Abhinav Valada, and Thomas Kollar. Carto: Category and joint agnostic reconstruction of articulated objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21201–21210, 2023.

  16. [16]

    LRM: Large Reconstruction Model for Single Image to 3D

    Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. Lrm: Large reconstruction model for single image to 3d. arXiv preprint arXiv:2311.04400, 2023.

  17. [17]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024.

  18. [18]

    Distributional depth-based estimation of object articulation models

    Ajinkya Jain, Stephen Giguere, Rudolf Lioutikov, and Scott Niekum. Distributional depth-based estimation of object articulation models. In Conference on Robot Learning, pages 1611–1621. PMLR, 2022.

  19. [19]

    Plücker coordinates for lines in the space

    Yan-Bin Jia. Plücker coordinates for lines in the space. Problem Solver Techniques for Applied Computer Science, Com-S-477/577 Course Handout, 3, 2020.

  20. [20]

    Opd: Single-view 3d openable part detection

    Hanxiao Jiang, Yongsen Mao, Manolis Savva, and Angel X Chang. Opd: Single-view 3d openable part detection. In European Conference on Computer Vision, pages 410–426. Springer, 2022.

  21. [21]

    Ditto: Building digital twins of articulated objects from interaction

    Zhenyu Jiang, Cheng-Chun Hsu, and Yuke Zhu. Ditto: Building digital twins of articulated objects from interaction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5616–5626, 2022.

  22. [22]

    Lvsm: A large view synthesis model with minimal 3d inductive bias

    Haian Jin, Hanwen Jiang, Hao Tan, Kai Zhang, Sai Bi, Tianyuan Zhang, Fujun Luan, Noah Snavely, and Zexiang Xu. Lvsm: A large view synthesis model with minimal 3d inductive bias. arXiv preprint arXiv:2410.17242, 2024.

  23. [23]

    Procedural generation of articulated simulation-ready assets

    Abhishek Joshi, Beining Han, Jack Nugent, Max Gonzalez Saez-Diez, Yiming Zuo, Jonathan Liu, Hongyu Wen, Stamatis Alexandropoulos, Karhan Kayan, Anna Calveri, Tao Sun, Gaowen Liu, Yi Shao, Alexander Raistrick, and Jia Deng. Procedural generation of articulated simulation-ready assets,

  24. [24]

    3d gaussian splatting for real-time radiance field rendering

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1,

  25. [25]

    Parahome: Parameterizing everyday home activities towards 3d generative modeling of human-object interactions

    Jeonghwan Kim, Jisoo Kim, Jeonghyeon Na, and Hanbyul Joo. Parahome: Parameterizing everyday home activities towards 3d generative modeling of human-object interactions. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1816–1828, 2025.

  26. [26]

    Nap: Neural 3d articulated object prior

    Jiahui Lei, Congyue Deng, William B Shen, Leonidas J Guibas, and Kostas Daniilidis. Nap: Neural 3d articulated object prior. Advances in Neural Information Processing Systems, 36:31878–31894, 2023.

  27. [27]

    igibson 2.0: Object-centric simulation for robot learning of everyday household tasks

    Chengshu Li, Fei Xia, Roberto Martín-Martín, Michael Lingelbach, Sanjana Srivastava, Bokui Shen, Kent Vainio, Cem Gokmen, Gokul Dharan, Tanish Jain, et al. igibson 2.0: Object-centric simulation for robot learning of everyday household tasks. arXiv preprint arXiv:2108.03272, 2021.

  28. [28]

    BEHAVIOR-1K: A Human-Centered, Embodied AI Benchmark with 1,000 Everyday Activities and Realistic Simulation

    Chengshu Li, Ruohan Zhang, Josiah Wong, Cem Gokmen, Sanjana Srivastava, Roberto Martín-Martín, Chen Wang, Gabrael Levine, Wensi Ai, Benjamin Martinez, et al. Behavior-1k: A human-centered, embodied ai benchmark with 1,000 everyday activities and realistic simulation. arXiv preprint arXiv:2403.09227, 2024.

  29. [29]

    Instant3d: Fast text-to-3d with sparse-view generation and large reconstruction model

    Jiahao Li, Hao Tan, Kai Zhang, Zexiang Xu, Fujun Luan, Yinghao Xu, Yicong Hong, Kalyan Sunkavalli, Greg Shakhnarovich, and Sai Bi. Instant3d: Fast text-to-3d with sparse-view generation and large reconstruction model. arXiv preprint arXiv:2311.06214, 2023.

  30. [30]

    Nerfacc: A general nerf acceleration toolbox

    Ruilong Li, Matthew Tancik, and Angjoo Kanazawa. Nerfacc: A general nerf acceleration toolbox. arXiv preprint arXiv:2210.04847, 2022.

  31. [31]

    Category-level articulated object pose estimation

    Xiaolong Li, He Wang, Li Yi, Leonidas J Guibas, A Lynn Abbott, and Shuran Song. Category-level articulated object pose estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3706–3715, 2020.

  32. [32]

    Neuralangelo: High-fidelity neural surface reconstruction

    Zhaoshuo Li, Thomas Müller, Alex Evans, Russell H Taylor, Mathias Unberath, Ming-Yu Liu, and Chen-Hsuan Lin. Neuralangelo: High-fidelity neural surface reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8456–8465, 2023.

  33. [33]

    Learning the 3d fauna of the web

    Zizhang Li, Dor Litvak, Ruining Li, Yunzhi Zhang, Tomas Jakab, Christian Rupprecht, Shangzhe Wu, Andrea Vedaldi, and Jiajun Wu. Learning the 3d fauna of the web. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9752–9762, 2024.

  34. [34]

    Lirm: Large inverse rendering model for progressive reconstruction of shape, materials and view-dependent radiance fields

    Zhengqin Li, Dilin Wang, Ka Chen, Zhaoyang Lv, Thu Nguyen-Phuoc, Milim Lee, Jia-Bin Huang, Lei Xiao, Yufeng Zhu, Carl S Marshall, et al. Lirm: Large inverse rendering model for progressive reconstruction of shape, materials and view-dependent radiance fields. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 505–517, 2025.

  35. [35]

    Partcrafter: Structured 3d mesh generation via compositional latent diffusion transformers

    Yuchen Lin, Chenguo Lin, Panwang Pan, Honglei Yan, Yiqiang Feng, Yadong Mu, and Katerina Fragkiadaki. Partcrafter: Structured 3d mesh generation via compositional latent diffusion transformers. arXiv preprint arXiv:2506.05573, 2025.

  36. [36]

    Paris: Part-level reconstruction and motion analysis for articulated objects

    Jiayi Liu, Ali Mahdavi-Amiri, and Manolis Savva. Paris: Part-level reconstruction and motion analysis for articulated objects. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 352–363, 2023.

  37. [37]

    Singapo: Single image controlled generation of articulated parts in objects

    Jiayi Liu, Denys Iliash, Angel X Chang, Manolis Savva, and Ali Mahdavi-Amiri. Singapo: Single image controlled generation of articulated parts in objects. arXiv preprint arXiv:2410.16499, 2024.

  38. [38]

    Survey on modeling of human-made articulated objects

    Jiayi Liu, Manolis Savva, and Ali Mahdavi-Amiri. Survey on modeling of human-made articulated objects. In Computer Graphics Forum, page e70092. Wiley Online Library, 2025.

  39. [39]

    Toward real-world category-level articulation pose estimation

    Liu Liu, Han Xue, Wenqiang Xu, Haoyuan Fu, and Cewu Lu. Toward real-world category-level articulation pose estimation. IEEE Transactions on Image Processing, 31:1072–1083, 2022.

  40. [40]

    Category-level articulated object 9d pose estimation via reinforcement learning

    Liu Liu, Jianming Du, Hao Wu, Xun Yang, Zhenguang Liu, Richang Hong, and Meng Wang. Category-level articulated object 9d pose estimation via reinforcement learning. In Proceedings of the 31st ACM International Conference on Multimedia, pages 728–736, 2023.

  41. [41]

    Nothing but geometric constraints: A model-free method for articulated object pose estimation

    Qihao Liu, Weichao Qiu, Weiyao Wang, Gregory D Hager, and Alan L Yuille. Nothing but geometric constraints: A model-free method for articulated object pose estimation. arXiv preprint arXiv:2012.00088, 2020.

  42. [42]

    Building interactable replicas of complex articulated objects via gaussian splatting

    Yu Liu, Baoxiong Jia, Ruijie Lu, Junfeng Ni, Song-Chun Zhu, and Siyuan Huang. Building interactable replicas of complex articulated objects via gaussian splatting. In The Thirteenth International Conference on Learning Representations, 2025.

  43. [43]

    Threestudio: A modular framework for diffusion-guided 3d generation

    Ying-Tian Liu, Yuan-Chen Guo, Vikram Voleti, Ruizhi Shao, Chia-Hao Chen, Guan Luo, Zixin Zou, Chen Wang, Christian Laforte, Yan-Pei Cao, et al. Threestudio: A modular framework for diffusion-guided 3d generation. cg.cs.tsinghua.edu.cn, 2023.

  44. [44]

    Real2code: Reconstruct articulated objects via code generation

    Zhao Mandi, Yijia Weng, Dominik Bauer, and Shuran Song. Real2code: Reconstruct articulated objects via code generation. arXiv preprint arXiv:2406.08474, 2024.

  45. [45]

    Nerf: Representing scenes as neural radiance fields for view synthesis

    Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.

  46. [46]

    A-sdf: Learning disentangled signed distance functions for articulated shape representation

    Jiteng Mu, Weichao Qiu, Adam Kortylewski, Alan Yuille, Nuno Vasconcelos, and Xiaolong Wang. A-sdf: Learning disentangled signed distance functions for articulated shape representation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13001–13011,

  47. [47]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.

  48. [48]

    DreamFusion: Text-to-3D using 2D Diffusion

    Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022.

  49. [49]

    Multilayer perceptron and neural networks

    Marius-Constantin Popescu, Valentina E Balas, Liliana Perescu-Popescu, and Nikos Mastorakis. Multilayer perceptron and neural networks. WSEAS Transactions on Circuits and Systems, 8(7):579–588, 2009.

  50. [50]

    Understanding 3d object articulation in internet videos

    Shengyi Qian, Linyi Jin, Chris Rockwell, Siyi Chen, and David F Fouhey. Understanding 3d object articulation in internet videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1599–1609, 2022.

  51. [51]

    Articulate anymesh: Open-vocabulary 3d articulated objects modeling

    Xiaowen Qiu, Jincheng Yang, Yian Wang, Zhehuan Chen, Yufei Wang, Tsun-Hsuan Wang, Zhou Xian, and Chuang Gan. Articulate anymesh: Open-vocabulary 3d articulated objects modeling. arXiv preprint arXiv:2502.02590, 2025.

  52. [52]

    Generalized intersection over union: A metric and a loss for bounding box regression

    Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 658–666,

  53. [53]

    igibson 1.0: A simulation environment for interactive tasks in large realistic scenes

    Bokui Shen, Fei Xia, Chengshu Li, Roberto Martín-Martín, Linxi Fan, Guanzhi Wang, Claudia Pérez-D’Arpino, Shyamal Buch, Sanjana Srivastava, Lyne Tchapmi, et al. igibson 1.0: A simulation environment for interactive tasks in large realistic scenes. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7520–7527. IE...

  54. [54]

    Meta 3d assetgen: Text-to-mesh generation with high-quality geometry, texture, and pbr materials

    Yawar Siddiqui, Tom Monnier, Filippos Kokkinos, Mahendra Kariya, Yanir Kleiman, Emilien Garreau, Oran Gafni, Natalia Neverova, Andrea Vedaldi, Roman Shapovalov, et al. Meta 3d assetgen: Text-to-mesh generation with high-quality geometry, texture, and pbr materials. Advances in Neural Information Processing Systems, 37:9532–9564,

  55. [55]

    Light field networks: Neural scene representations with single-evaluation rendering

    Vincent Sitzmann, Semon Rezchikov, Bill Freeman, Josh Tenenbaum, and Fredo Durand. Light field networks: Neural scene representations with single-evaluation rendering. Advances in Neural Information Processing Systems, 34:19313–19325, 2021. 4

  56. [56]

    Opdmulti: Openable part detection for multiple objects

    Xiaohao Sun, Hanxiao Jiang, Manolis Savva, and Angel Chang. Opdmulti: Openable part detection for multiple objects. In 2024 International Conference on 3D Vision (3DV), pages 169–178. IEEE, 2024. 2

  57. [57]

    Leia: Latent view-invariant embeddings for implicit 3d articulation

    Archana Swaminathan, Anubhav Gupta, Kamal Gupta, Shishira R Maiya, Vatsal Agarwal, and Abhinav Shrivastava. Leia: Latent view-invariant embeddings for implicit 3d articulation. In European Conference on Computer Vision, pages 210–227. Springer, 2024. 2

  58. [58]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017. 4

  59. [59]

    Active coarse-to-fine segmentation of moveable parts from real images

    Ruiqi Wang, Akshay Gadi Patil, Fenggen Yu, and Hao Zhang. Active coarse-to-fine segmentation of moveable parts from real images. In European Conference on Computer Vision, pages 111–127. Springer, 2024. 2

  60. [60]

    Shape2motion: Joint analysis of motion parts and attributes from 3d shapes

    Xiaogang Wang, Bin Zhou, Yahao Shi, Xiaowu Chen, Qinping Zhao, and Kai Xu. Shape2motion: Joint analysis of motion parts and attributes from 3d shapes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8876–8884, 2019. 2

  61. [61]

    Meshlrm: Large reconstruction model for high-quality meshes

    Xinyue Wei, Kai Zhang, Sai Bi, Hao Tan, Fujun Luan, Valentin Deschaintre, Kalyan Sunkavalli, Hao Su, and Zexiang Xu. Meshlrm: Large reconstruction model for high-quality meshes. arXiv preprint arXiv:2404.12385, 2024. 1, 4, 5, S1

  62. [62]

    Neural implicit representation for building digital twins of unknown articulated objects

    Yijia Weng, Bowen Wen, Jonathan Tremblay, Valts Blukis, Dieter Fox, Leonidas Guibas, and Stan Birchfield. Neural implicit representation for building digital twins of unknown articulated objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3141–3150, 2024. 2, 4, 6

  63. [63]

    Sapien: A simulated part-based interactive environment

    Fanbo Xiang, Yuzhe Qin, Kaichun Mo, Yikuan Xia, Hao Zhu, Fangchen Liu, Minghua Liu, Hanxiao Jiang, Yifu Yuan, He Wang, et al. Sapien: A simulated part-based interactive environment. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11097–11107, 2020. 2, 6, 8, S2

  64. [64]

    Unsupervised kinematic motion detection for part-segmented 3d shape collections

    Xianghao Xu, Yifan Ruan, Srinath Sridhar, and Daniel Ritchie. Unsupervised kinematic motion detection for part-segmented 3d shape collections. In ACM SIGGRAPH 2022 Conference Proceedings, pages 1–9, 2022. 2

  65. [65]

    Grm: Large gaussian reconstruction model for efficient 3d reconstruction and generation

    Yinghao Xu, Zifan Shi, Wang Yifan, Hansheng Chen, Ceyuan Yang, Sida Peng, Yujun Shen, and Gordon Wetzstein. Grm: Large gaussian reconstruction model for efficient 3d reconstruction and generation. In European Conference on Computer Vision, pages 1–20. Springer, 2024. 1, 3, 5

  66. [66]

    4DGT: Learning a 4D Gaussian Transformer Using Real-World Monocular Videos

    Zhen Xu, Zhengqin Li, Zhao Dong, Xiaowei Zhou, Richard Newcombe, and Zhaoyang Lv. 4dgt: Learning a 4d gaussian transformer using real-world monocular videos. arXiv preprint arXiv:2506.08015, 2025. 4

  67. [67]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025. 4

  68. [68]

    Reconstructing animatable categories from videos

    Gengshan Yang, Chaoyang Wang, N Dinesh Reddy, and Deva Ramanan. Reconstructing animatable categories from videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16995–17005, 2023. 1

  69. [69]

    Volume rendering of neural implicit surfaces

    Lior Yariv, Jiatao Gu, Yoni Kasten, and Yaron Lipman. Volume rendering of neural implicit surfaces. Advances in neural information processing systems, 34:4805–4815, 2021. 5, S1

  70. [70]

    A tale of two features: Stable diffusion complements dino for zero-shot semantic correspondence

    Junyi Zhang, Charles Herrmann, Junhwa Hur, Luisa Polania Cabrera, Varun Jampani, Deqing Sun, and Ming-Hsuan Yang. A tale of two features: Stable diffusion complements dino for zero-shot semantic correspondence. Advances in Neural Information Processing Systems, 36:45533–45547,

  71. [71]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018. 5, 6

ART: Articulated Reconstruction Transformer
Supplementary Material

A. More results

We provide...