ART: Articulated Reconstruction Transformer

Chen Geng; Cheng Zhang; Henry Howard-Jenkins; Jakob Engel; Jiajun Wu; Richard Newcombe; Zhao Dong; Zhaoyang Lv; Zhengqin Li; Zizhang Li

arxiv: 2512.14671 · v3 · pith:THH2Z66Anew · submitted 2025-12-16 · 💻 cs.CV


Pith reviewed 2026-05-21 17:14 UTC · model grok-4.3

classification 💻 cs.CV

keywords articulated object reconstruction · transformer architecture · part-based prediction · 3D geometry · articulation parameters · sparse image inputs · feed-forward model · category-agnostic reconstruction

The pith

ART reconstructs complete 3D articulated objects from sparse multi-state RGB images by mapping inputs to learnable part slots that decode per-part geometry, texture, and articulation parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a feed-forward transformer called ART that reconstructs full 3D models of articulated objects without being limited to one object category. It breaks objects into rigid parts and uses a set of learnable slots to jointly predict each part's shape, appearance, and how the parts move relative to one another. This matters because earlier approaches either needed slow per-object optimization with fragile correspondences or worked only on narrow classes of objects. If the method holds, it would turn a handful of images into simulation-ready 3D models in one pass.

Core claim

ART is a category-agnostic feed-forward model that reconstructs complete 3D articulated objects from only sparse, multi-state RGB images by treating objects as assemblies of rigid parts. The transformer maps the image inputs to a set of learnable part slots and jointly decodes unified representations for each part that include its 3D geometry, texture, and explicit articulation parameters. The resulting models are physically interpretable and directly exportable for simulation.

What carries the argument

Learnable part slots inside the transformer that receive sparse image inputs and decode unified per-part representations for 3D geometry, texture, and articulation parameters.
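
To make the slot idea concrete, here is a minimal sketch, assuming the slots are simply extra tokens appended to the image tokens and mixed with them by self-attention. Everything here is illustrative: the function names, the single attention layer, and the absence of learned projections are assumptions, not details taken from the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head dot-product self-attention with no learned projections
    (a real transformer would add query/key/value weights and many layers)."""
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        weights = softmax(scores)
        outputs.append([sum(w * tok[j] for w, tok in zip(weights, tokens))
                        for j in range(dim)])
    return outputs

def update_part_slots(image_tokens, slot_tokens):
    """Process image tokens and part slots as one sequence, then return
    only the updated slots, which a decoder would turn into parts."""
    sequence = image_tokens + slot_tokens
    mixed = self_attention(sequence)
    return mixed[len(image_tokens):]

# Hypothetical toy inputs: 3 image tokens and 2 part slots of width 4.
image_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
slot_tokens = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
slots_out = update_part_slots(image_tokens, slot_tokens)
```

The number of slots in equals the number of slots out, which is why a fixed slot budget can later be pruned to the parts that are actually active.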

If this is right

Reconstructions are physically interpretable and can be exported directly for use in simulation.
The model produces complete 3D outputs from sparse multi-state image inputs without per-object optimization.
Performance improves over prior baselines and reaches state-of-the-art results across diverse benchmarks.
Joint decoding supplies geometry, texture, and explicit articulation parameters for every part in a single forward pass.
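
As an illustration of what "exportable for simulation" could look like, the sketch below writes per-part articulation parameters to a URDF string. URDF as the target format, the star-shaped kinematic tree (every part attached to one base link), the dictionary keys, and the placeholder effort and velocity limits are all assumptions; the source does not say which export format or tree structure ART uses. Geometry and texture are omitted.

```python
import xml.etree.ElementTree as ET

# Assumed mapping from the motion types named on this page to URDF joint types.
URDF_JOINT_TYPE = {"static": "fixed", "prismatic": "prismatic", "revolute": "revolute"}

def fmt(vec):
    """Format a 3-vector as a space-separated URDF attribute value."""
    return " ".join(f"{x:g}" for x in vec)

def parts_to_urdf(object_name, parts):
    """Build a URDF string with one link and one joint per predicted part."""
    robot = ET.Element("robot", name=object_name)
    ET.SubElement(robot, "link", name="base")
    for part in parts:
        ET.SubElement(robot, "link", name=part["name"])
        joint = ET.SubElement(robot, "joint", name=part["name"] + "_joint",
                              type=URDF_JOINT_TYPE[part["motion"]])
        ET.SubElement(joint, "parent", link="base")
        ET.SubElement(joint, "child", link=part["name"])
        # The pivot is used as the joint frame origin in the parent frame.
        ET.SubElement(joint, "origin", xyz=fmt(part["pivot"]), rpy="0 0 0")
        if part["motion"] != "static":
            ET.SubElement(joint, "axis", xyz=fmt(part["axis"]))
            lower, upper = part["limits"]
            # effort/velocity are placeholders; URDF requires them on these joints.
            ET.SubElement(joint, "limit", lower=f"{lower:g}", upper=f"{upper:g}",
                          effort="1", velocity="1")
    return ET.tostring(robot, encoding="unicode")

# Hypothetical prediction for a cabinet with a door and a drawer.
example_parts = [
    {"name": "body", "motion": "static", "pivot": (0, 0, 0)},
    {"name": "door", "motion": "revolute", "pivot": (0.3, 0.2, 0),
     "axis": (0, 0, 1), "limits": (0.0, 1.57)},
    {"name": "drawer", "motion": "prismatic", "pivot": (0, 0, 0.1),
     "axis": (1, 0, 0), "limits": (0.0, 0.25)},
]
urdf_text = parts_to_urdf("cabinet", example_parts)
```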

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The part-based formulation could support downstream tasks such as physics-based planning once the 3D models are obtained.
Because the method is feed-forward, it might enable reconstruction pipelines that run at interactive speeds once trained.
Scaling the training data further could extend the approach to objects with more complex part hierarchies.

Load-bearing premise

A large-scale diverse dataset with per-part supervision exists and is sufficient to train a category-agnostic model that generalizes to unseen articulated objects.

What would settle it

A controlled test in which the trained model produces inaccurate geometry or articulation on a collection of articulated objects drawn from categories absent from the training set would falsify the generalization claim.
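
A held-out-category test of this kind starts with a split in which no test category appears in training. The sketch below shows one way to build it; the record layout and the category names are hypothetical and do not come from the paper's dataset.

```python
def split_by_category(objects, held_out_categories):
    """Split object records so held-out categories never appear in training."""
    held_out = set(held_out_categories)
    train = [obj for obj in objects if obj["category"] not in held_out]
    test = [obj for obj in objects if obj["category"] in held_out]
    return train, test

# Hypothetical records; real data would carry images and per-part labels.
objects = [
    {"id": "a", "category": "cabinet"},
    {"id": "b", "category": "laptop"},
    {"id": "c", "category": "cabinet"},
    {"id": "d", "category": "scissors"},
]
train_set, test_set = split_by_category(objects, ["scissors", "laptop"])
```

Evaluating geometry and articulation error on `test_set` only is what would separate category-agnostic generalization from memorized category priors.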

Figures

Figures reproduced from arXiv: 2512.14671 by Chen Geng, Cheng Zhang, Henry Howard-Jenkins, Jakob Engel, Jiajun Wu, Richard Newcombe, Zhao Dong, Zhaoyang Lv, Zhengqin Li, Zizhang Li.

**Figure 1.** We propose ART, a feed-forward model that reconstructs articulated objects from images by decomposing them into rigid parts.

**Figure 2.** Model Architecture of ART. Multi-view, multi-state image inputs with known camera poses are tokenized and processed by a transformer alongside learnable part slot tokens. Two separate decoders then predict each part’s geometry/texture and the articulation structure, and SDF volume rendering composes these components to render and reconstruct the articulated object at different states.
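
One common way to compose per-part signed distance functions into a single object is a union, taken as the pointwise minimum over parts. The sketch below shows that operation on toy spheres. Using a minimum is an assumption for illustration; the caption only says that SDF volume rendering composes the components and does not specify the rule.

```python
import math

def sphere_sdf(center, radius):
    """Return an SDF for a sphere: negative inside, positive outside."""
    def sdf(point):
        return math.dist(point, center) - radius
    return sdf

def composed_sdf(part_sdfs, point):
    """Union of parts: the object surface is the closest part surface."""
    return min(sdf(point) for sdf in part_sdfs)

# Two hypothetical parts standing in for decoded per-part geometry.
parts = [sphere_sdf((0.0, 0.0, 0.0), 1.0), sphere_sdf((3.0, 0.0, 0.0), 0.5)]
inside_value = composed_sdf(parts, (0.0, 0.0, 0.0))
outside_value = composed_sdf(parts, (5.0, 0.0, 0.0))
```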

**Figure 3.** Articulated object data samples from two of our data collections: Procedural (left two columns) and StorageFurniture (right two columns). For each object, we show two states from one dynamic sequence under the same viewpoint.

Adjacent body text (truncated in extraction): Viewpoint information. For each input image we compute the Plücker ray representation [19, 55] using known camera intrinsics and extrinsics, denoted as (v, v × o), where v is the …
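
The Plücker ray encoding mentioned above can be sketched for a single pixel as follows. The pinhole camera model, the camera-to-world rotation convention, and all parameter names are assumptions; the moment is written as v × o to follow the notation quoted here, although o × v is the more common convention elsewhere.

```python
import math

def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def plucker_ray(u, v_px, fx, fy, cx, cy, cam_to_world_rot, cam_origin):
    """Return the 6-vector (v, v x o) for the ray through pixel (u, v_px).

    v is the unit ray direction in world space and o is the camera origin.
    """
    # Back-project the pixel to a direction in the camera frame.
    d_cam = ((u - cx) / fx, (v_px - cy) / fy, 1.0)
    # Rotate into the world frame with the (assumed) camera-to-world rotation.
    d_world = tuple(sum(cam_to_world_rot[i][j] * d_cam[j] for j in range(3))
                    for i in range(3))
    norm = math.sqrt(sum(c * c for c in d_world))
    v = tuple(c / norm for c in d_world)
    moment = cross(v, cam_origin)
    return v + moment

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Principal-point pixel of a hypothetical 640x480 camera placed at (1, 2, 3).
ray = plucker_ray(320, 240, 500.0, 500.0, 320.0, 240.0, identity, (1.0, 2.0, 3.0))
```

Because the moment is perpendicular to the direction by construction, the six numbers describe the ray itself rather than any particular point on it.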

**Figure 4.** Qualitative results. “Input Images” shows the multi-state inputs provided to the model (visualized by one image). “Predicted Parts” visualizes the predicted per-part bounding boxes in the start and end states; “Reconstructed Textured Articulated Meshes” displays the textured 3D reconstructions for the same two articulated states.

Adjacent panel labels (from extraction): Input image, Predicted parts, GT, Ours, SINGAPO.

**Figure 6.** Qualitative comparison with optimization baseline. With the same sparse inputs, our method reconstructs complete, high-fidelity textured meshes in both states. ArtGS [42] lacks reliable correspondences under sparsity, leading to fragmented and inaccurate results.

Adjacent body text (truncated in extraction): … nonetheless, its lower PSNR and higher LPIPS scores indicate poor appearance recovery. This disparity is rooted in sparsity: optimization …
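
For readers unfamiliar with the appearance metric named above, PSNR is 10 · log10(MAX² / MSE), so a higher value means a rendering closer to the reference image. The sketch below computes it on flat lists of pixel intensities; the [0, 1] intensity range and the function name are assumptions, and the paper's exact evaluation protocol is not given on this page.

```python
import math

def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# A uniform error of 0.1 on a [0, 1] scale gives an MSE of 0.01.
example_db = psnr([0.0, 0.0, 0.0], [0.1, 0.1, 0.1])
```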

**Figure 7.** Real-world images results.

Adjacent body text (truncated in extraction): … view as input. This change produces a clear drop in quality across all metrics, as expected, confirming that multiple views are crucial for resolving ambiguities and achieving high-fidelity, geometrically accurate articulated reconstruction. Rest-state formulation. Removing the rest-state formulation and predicting parts relative to the first observed frame leads to a severe q…

Original abstract

We introduce ART, Articulated Reconstruction Transformer -- a category-agnostic, feed-forward model that reconstructs complete 3D articulated objects from only sparse, multi-state RGB images. Previous methods for articulated object reconstruction either rely on slow optimization with fragile cross-state correspondences or use feed-forward models limited to specific object categories. In contrast, ART treats articulated objects as assemblies of rigid parts, formulating reconstruction as part-based prediction. Our newly designed transformer architecture maps sparse image inputs to a set of learnable part slots, from which ART jointly decodes unified representations for individual parts, including their 3D geometry, texture, and explicit articulation parameters. The resulting reconstructions are physically interpretable and readily exportable for simulation. Trained on a large-scale, diverse dataset with per-part supervision, and evaluated across diverse benchmarks, ART achieves significant improvements over existing baselines and establishes a new state of the art for articulated object reconstruction from image inputs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces ART, a category-agnostic feed-forward transformer for reconstructing complete 3D articulated objects from sparse multi-state RGB images. It models objects as rigid part assemblies, using a transformer with learnable part slots to jointly predict per-part 3D geometry, texture, and explicit articulation parameters. The model is trained with per-part supervision on a large-scale diverse dataset and claims new state-of-the-art results on diverse benchmarks, with outputs that are physically interpretable and exportable for simulation.

Significance. If the benchmark gains and generalization claims hold after verification of the training data and ablations, this would be a meaningful contribution to articulated object reconstruction. It shifts from slow optimization or category-specific feed-forward models to a general, fast, part-based transformer approach with explicit articulation outputs suitable for downstream simulation and robotics tasks.

major comments (1)

[Abstract] The central claim of category-agnostic generalization rests on training with a 'large-scale, diverse dataset with per-part supervision' (Abstract). No statistics are supplied on category count, object instances, part-count distribution, image density per object, or label provenance (synthetic vs. real). This information is load-bearing for assessing whether the training distribution covers the variability needed for true unseen-object generalization.

minor comments (1)

[Abstract] The abstract uses the phrase 'multi-state RGB images' without an early definition of what constitutes a state (different articulation poses, camera views, or both); a brief parenthetical clarification would improve accessibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our submission. The concern about dataset transparency is well-taken and directly relevant to the strength of our category-agnostic generalization claims. We address the point below and will revise the manuscript to include the requested details.

Point-by-point responses

Referee: [Abstract] The central claim of category-agnostic generalization rests on training with a 'large-scale, diverse dataset with per-part supervision' (Abstract). No statistics are supplied on category count, object instances, part-count distribution, image density per object, or label provenance (synthetic vs. real). This information is load-bearing for assessing whether the training distribution covers the variability needed for true unseen-object generalization.

Authors: We agree that quantitative dataset statistics are necessary to evaluate the scope of generalization and that their absence from the abstract weakens the central claim. While the full manuscript describes the data collection pipeline and per-part supervision in Section 4, we acknowledge that a concise summary of key statistics is missing from the abstract and early sections. In the revised manuscript we will add a short 'Training Data' paragraph (or table) immediately after the abstract or in Section 3 that reports: total categories covered, number of distinct object instances, part-count distribution (mean, range, and histogram summary), average number of multi-state views per object, and explicit confirmation that all labels are synthetic with ground-truth per-part annotations obtained from a physics-based simulator. This addition will allow readers to assess coverage of variability for unseen objects without altering any experimental results or claims.

Revision: yes
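
The statistics listed in this response are cheap to compute once per-object records exist. The following sketch shows one possible summary function; the record fields (`category`, `num_parts`, `num_views`) and the sample values are hypothetical and are not figures from the paper.

```python
from collections import Counter
from statistics import mean

def summarize_dataset(records):
    """Summarize category coverage, part counts, and views per object."""
    part_counts = [r["num_parts"] for r in records]
    view_counts = [r["num_views"] for r in records]
    return {
        "num_categories": len({r["category"] for r in records}),
        "num_objects": len(records),
        "part_count_mean": mean(part_counts),
        "part_count_range": (min(part_counts), max(part_counts)),
        "part_count_histogram": dict(sorted(Counter(part_counts).items())),
        "mean_views_per_object": mean(view_counts),
    }

# Hypothetical records, for illustration only.
records = [
    {"category": "cabinet", "num_parts": 2, "num_views": 8},
    {"category": "cabinet", "num_parts": 4, "num_views": 8},
    {"category": "laptop", "num_parts": 2, "num_views": 4},
    {"category": "oven", "num_parts": 4, "num_views": 4},
]
summary = summarize_dataset(records)
```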

Circularity Check

0 steps flagged

No circularity in derivation chain

Full rationale

The paper describes a feed-forward transformer architecture that maps image inputs to part slots and decodes geometry, texture, and articulation parameters. No equations, derivations, or self-referential definitions appear that reduce any claimed output to fitted inputs or prior self-citations by construction. The model is trained on an external dataset and evaluated on benchmarks, rendering the central claims empirically grounded rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only view yields no explicit free parameters, axioms, or invented physical entities; the model introduces learnable part slots as an architectural choice whose count and initialization are unspecified.

pith-pipeline@v0.9.0 · 5706 in / 1226 out tokens · 85844 ms · 2026-05-21T17:14:55.655889+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

Relation between the paper passage and the cited Recognition theorem.

ART treats articulated objects as assemblies of rigid parts, formulating reconstruction as part-based prediction. Our newly designed transformer architecture maps sparse image inputs to a set of learnable part slots, from which ART jointly decodes unified representations for individual parts, including their 3D geometry, texture, and explicit articulation parameters.
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

Relation between the paper passage and the cited Recognition theorem.

We allocate P0 learnable part slots... predict P ≤ P0 active parts... motion type C ∈ {static, prismatic, revolute}, joint axis D, pivot O, dynamics S.
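
To show how a motion type, joint axis, and pivot act on part geometry, the sketch below moves one point of a part to a given joint state using the standard Rodrigues rotation formula. The scalar `state` (an angle in radians or a displacement) is a hypothetical input; how the paper's dynamics term S maps to such a value is not stated here, so no attempt is made to model it.

```python
import math

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def articulate_point(point, motion_type, axis, pivot, state):
    """Move a rest-state point according to one part's joint.

    static: unchanged; prismatic: slide `state` units along the axis;
    revolute: rotate `state` radians about the axis through the pivot.
    """
    if motion_type == "static":
        return tuple(point)
    length = math.sqrt(_dot(axis, axis))
    k = tuple(a / length for a in axis)  # unit joint axis
    if motion_type == "prismatic":
        return tuple(p + state * ki for p, ki in zip(point, k))
    if motion_type == "revolute":
        r = tuple(p - o for p, o in zip(point, pivot))
        c, s = math.cos(state), math.sin(state)
        k_cross_r = _cross(k, r)
        k_dot_r = _dot(k, r)
        # Rodrigues formula: r*cos + (k x r)*sin + k*(k.r)*(1 - cos)
        rotated = tuple(r[i] * c + k_cross_r[i] * s + k[i] * k_dot_r * (1.0 - c)
                        for i in range(3))
        return tuple(o + x for o, x in zip(pivot, rotated))
    raise ValueError(f"unknown motion type: {motion_type}")

# A door corner swinging a quarter turn about a vertical hinge at the origin.
swung = articulate_point((1.0, 0.0, 0.0), "revolute", (0.0, 0.0, 2.0),
                         (0.0, 0.0, 0.0), math.pi / 2)
```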

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · 7 internal anchors

[1]

Learning to generalize kinematic models to novel objects

Ben Abbatematteo, Stefanie Tellex, and George Konidaris. Learning to generalize kinematic models to novel objects. In Proceedings of the 3rd Conference on Robot Learning, 2019. 2

work page 2019
[2]

Hexplane: A fast representation for dynamic scenes

Ang Cao and Justin Johnson. Hexplane: A fast representation for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 130–141, 2023. 4

work page 2023
[3]

Autopartgen: Autoregressive 3d part generation and discovery

Minghao Chen, Jianyuan Wang, Roman Shapovalov, Tom Monnier, Hyunyoung Jung, Dilin Wang, Rakesh Ranjan, Iro Laina, and Andrea Vedaldi. Autopartgen: Autoregressive 3d part generation and discovery. arXiv preprint arXiv:2507.13346, 2025. 3

work page arXiv 2025
[4]

Urdformer: A pipeline for constructing articulated simulation environments from real-world images

Zoey Chen, Aaron Walsman, Marius Memmel, Kaichun Mo, Alex Fang, Karthikeya Vemuri, Alan Wu, Dieter Fox, and Abhishek Gupta. Urdformer: A pipeline for constructing articulated simulation environments from real-world images. arXiv preprint arXiv:2405.11656, 2024. 2, 3, 6, S2

work page arXiv 2024
[5]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025. 4

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

Automated creation of digital cousins for robust policy learning

Tianyuan Dai, Josiah Wong, Yunfan Jiang, Chen Wang, Cem Gokmen, Ruohan Zhang, Jiajun Wu, and Li Fei-Fei. Automated creation of digital cousins for robust policy learning. arXiv preprint arXiv:2410.07408, 2024. 3

work page arXiv 2024
[7]

Objaverse-xl: A universe of 10m+ 3d objects

Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, et al. Objaverse-xl: A universe of 10m+ 3d objects. Advances in Neural Information Processing Systems, 36:35799–35813, 2023. 3, 5

work page 2023
[8]

Objaverse: A universe of annotated 3d objects

Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13142–13153, 2023. 3, 5

work page 2023
[9]

Articulate your nerf: Unsupervised articulated object modeling via conditional view synthesis

Jianning Deng, Kartic Subr, and Hakan Bilen. Articulate your nerf: Unsupervised articulated object modeling via conditional view synthesis. Advances in Neural Information Processing Systems, 37:119717–119741, 2024. 1, 2

work page 2024
[10]

Anymate: A dataset and baselines for learning 3d object rigging

Yufan Deng, Yuhao Zhang, Chen Geng, Shangzhe Wu, and Jiajun Wu. Anymate: A dataset and baselines for learning 3d object rigging. In Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers, pages 1–10, 2025. 1

work page 2025
[11]

Probing the 3d awareness of visual foundation models

Mohamed El Banani, Amit Raj, Kevis-Kokitsi Maninis, Abhishek Kar, Yuanzhen Li, Michael Rubinstein, Deqing Sun, Leonidas Guibas, Justin Johnson, and Varun Jampani. Probing the 3d awareness of visual foundation models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21795–21806, 2024. 4

work page 2024
[12]

Capt: Category-level articulation estimation from a single point cloud using transformer

Lian Fu, Ryoichi Ishikawa, Yoshihiro Sato, and Takeshi Oishi. Capt: Category-level articulation estimation from a single point cloud using transformer. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 751–757. IEEE, 2024. 2

work page 2024
[13]

Meshart: Generating articulated meshes with structure-guided transformers

Daoyi Gao, Yawar Siddiqui, Lei Li, and Angela Dai. Meshart: Generating articulated meshes with structure-guided transformers. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 618–627, 2025. 2

work page 2025
[14]

Partrm: Modeling part-level dynamics with large cross-state reconstruction model

Mingju Gao, Yike Pan, Huan-ang Gao, Zongzheng Zhang, Wenyi Li, Hao Dong, Hao Tang, Li Yi, and Hao Zhao. Partrm: Modeling part-level dynamics with large cross-state reconstruction model. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 7004–7014,

work page
[15]

Carto: Category and joint agnostic reconstruction of articulated objects

Nick Heppert, Muhammad Zubair Irshad, Sergey Zakharov, Katherine Liu, Rares Andrei Ambrus, Jeannette Bohg, Abhinav Valada, and Thomas Kollar. Carto: Category and joint agnostic reconstruction of articulated objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21201–21210, 2023. 2, 3

work page 2023
[16]

LRM: Large Reconstruction Model for Single Image to 3D

Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. Lrm: Large reconstruction model for single image to 3d. arXiv preprint arXiv:2311.04400, 2023. 1, 2, 3, 4, 5

work page internal anchor Pith review Pith/arXiv arXiv 2023
[17]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024. 4

work page internal anchor Pith review Pith/arXiv arXiv 2024
[18]

Distributional depth-based estimation of object articulation models

Ajinkya Jain, Stephen Giguere, Rudolf Lioutikov, and Scott Niekum. Distributional depth-based estimation of object articulation models. In Conference on Robot Learning, pages 1611–1621. PMLR, 2022. 2

work page 2022
[19]

Plücker coordinates for lines in the space

Yan-Bin Jia. Plücker coordinates for lines in the space. Problem Solver Techniques for Applied Computer Science, Com-S-477/577 Course Handout, 3, 2020. 4

work page 2020
[20]

Opd: Single-view 3d openable part detection

Hanxiao Jiang, Yongsen Mao, Manolis Savva, and Angel X Chang. Opd: Single-view 3d openable part detection. In European Conference on Computer Vision, pages 410–426. Springer, 2022. 2, 6, S2

work page 2022
[21]

Ditto: Building digital twins of articulated objects from interaction

Zhenyu Jiang, Cheng-Chun Hsu, and Yuke Zhu. Ditto: Building digital twins of articulated objects from interaction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5616–5626, 2022. 2

work page 2022
[22]

Lvsm: A large view synthesis model with minimal 3d inductive bias

Haian Jin, Hanwen Jiang, Hao Tan, Kai Zhang, Sai Bi, Tianyuan Zhang, Fujun Luan, Noah Snavely, and Zexiang Xu. Lvsm: A large view synthesis model with minimal 3d inductive bias. arXiv preprint arXiv:2410.17242, 2024. 3

work page arXiv 2024
[23]

Procedural generation of articulated simulation-ready assets

Abhishek Joshi, Beining Han, Jack Nugent, Max Gonzalez Saez-Diez, Yiming Zuo, Jonathan Liu, Hongyu Wen, Stamatis Alexandropoulos, Karhan Kayan, Anna Calveri, Tao Sun, Gaowen Liu, Yi Shao, Alexander Raistrick, and Jia Deng. Procedural generation of articulated simulation-ready assets,

work page
[24]

3d gaussian splatting for real-time radiance field rendering

Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1,

work page
[25]

Parahome: Parameterizing everyday home activities towards 3d generative modeling of human-object interactions

Jeonghwan Kim, Jisoo Kim, Jeonghyeon Na, and Hanbyul Joo. Parahome: Parameterizing everyday home activities towards 3d generative modeling of human-object interactions. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1816–1828, 2025. 1

work page 2025
[26]

Nap: Neural 3d articulated object prior

Jiahui Lei, Congyue Deng, William B Shen, Leonidas J Guibas, and Kostas Daniilidis. Nap: Neural 3d articulated object prior. Advances in Neural Information Processing Systems, 36:31878–31894, 2023. 2

work page 2023
[27]

igibson 2.0: Object-centric simulation for robot learning of everyday household tasks

Chengshu Li, Fei Xia, Roberto Martín-Martín, Michael Lingelbach, Sanjana Srivastava, Bokui Shen, Kent Vainio, Cem Gokmen, Gokul Dharan, Tanish Jain, et al. igibson 2.0: Object-centric simulation for robot learning of everyday household tasks. arXiv preprint arXiv:2108.03272, 2021. 1

work page arXiv 2021
[28]

BEHAVIOR-1K: A Human-Centered, Embodied AI Benchmark with 1,000 Everyday Activities and Realistic Simulation

Chengshu Li, Ruohan Zhang, Josiah Wong, Cem Gokmen, Sanjana Srivastava, Roberto Martín-Martín, Chen Wang, Gabrael Levine, Wensi Ai, Benjamin Martinez, et al. Behavior-1k: A human-centered, embodied ai benchmark with 1,000 everyday activities and realistic simulation. arXiv preprint arXiv:2403.09227, 2024. 1

work page internal anchor Pith review Pith/arXiv arXiv 2024
[29]

Instant3d: Fast text-to-3d with sparse-view generation and large reconstruction model

Jiahao Li, Hao Tan, Kai Zhang, Zexiang Xu, Fujun Luan, Yinghao Xu, Yicong Hong, Kalyan Sunkavalli, Greg Shakhnarovich, and Sai Bi. Instant3d: Fast text-to-3d with sparse-view generation and large reconstruction model. arXiv preprint arXiv:2311.06214, 2023. 3

work page arXiv 2023
[30]

Nerfacc: A general nerf acceleration toolbox

Ruilong Li, Matthew Tancik, and Angjoo Kanazawa. Nerfacc: A general nerf acceleration toolbox. arXiv preprint arXiv:2210.04847, 2022. S1

work page arXiv 2022
[31]

Category-level articulated object pose estimation

Xiaolong Li, He Wang, Li Yi, Leonidas J Guibas, A Lynn Abbott, and Shuran Song. Category-level articulated object pose estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3706–3715, 2020. 2

work page 2020
[32]

Neuralangelo: High-fidelity neural surface reconstruction

Zhaoshuo Li, Thomas Müller, Alex Evans, Russell H Taylor, Mathias Unberath, Ming-Yu Liu, and Chen-Hsuan Lin. Neuralangelo: High-fidelity neural surface reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8456–8465, 2023. S1

work page 2023
[33]

Learning the 3d fauna of the web

Zizhang Li, Dor Litvak, Ruining Li, Yunzhi Zhang, Tomas Jakab, Christian Rupprecht, Shangzhe Wu, Andrea Vedaldi, and Jiajun Wu. Learning the 3d fauna of the web. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9752–9762, 2024. 4

work page 2024
[34]

Lirm: Large inverse rendering model for progressive reconstruction of shape, materials and view-dependent radiance fields

Zhengqin Li, Dilin Wang, Ka Chen, Zhaoyang Lv, Thu Nguyen-Phuoc, Milim Lee, Jia-Bin Huang, Lei Xiao, Yufeng Zhu, Carl S Marshall, et al. Lirm: Large inverse rendering model for progressive reconstruction of shape, materials and view-dependent radiance fields. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 505–517, 2025. 1...

work page 2025
[35]

Partcrafter: Structured 3d mesh generation via compositional latent diffusion transformers

Yuchen Lin, Chenguo Lin, Panwang Pan, Honglei Yan, Yiqiang Feng, Yadong Mu, and Katerina Fragkiadaki. Partcrafter: Structured 3d mesh generation via compositional latent diffusion transformers. arXiv preprint arXiv:2506.05573, 2025. 3

work page arXiv 2025
[36]

Paris: Part-level reconstruction and motion analysis for articulated objects

Jiayi Liu, Ali Mahdavi-Amiri, and Manolis Savva. Paris: Part-level reconstruction and motion analysis for articulated objects. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 352–363, 2023. 2, 4, 6

work page 2023
[37]

Singapo: Single image controlled generation of articulated parts in objects

Jiayi Liu, Denys Iliash, Angel X Chang, Manolis Savva, and Ali Mahdavi-Amiri. Singapo: Single image controlled generation of articulated parts in objects. arXiv preprint arXiv:2410.16499, 2024. 2, 3, 6, S2

work page arXiv 2024
[38]

Survey on modeling of human-made articulated objects

Jiayi Liu, Manolis Savva, and Ali Mahdavi-Amiri. Survey on modeling of human-made articulated objects. In Computer Graphics Forum, page e70092. Wiley Online Library, 2025. 1

work page 2025
[39]

Toward real-world category-level articulation pose estimation

Liu Liu, Han Xue, Wenqiang Xu, Haoyuan Fu, and Cewu Lu. Toward real-world category-level articulation pose estimation. IEEE Transactions on Image Processing, 31:1072–1083, 2022. 2

work page 2022
[40]

Category-level articulated object 9d pose estimation via reinforcement learning

Liu Liu, Jianming Du, Hao Wu, Xun Yang, Zhenguang Liu, Richang Hong, and Meng Wang. Category-level articulated object 9d pose estimation via reinforcement learning. In Proceedings of the 31st ACM International Conference on Multimedia, pages 728–736, 2023. 2

work page 2023
[41]

Nothing but geometric constraints: A model-free method for articulated object pose estimation

Qihao Liu, Weichao Qiu, Weiyao Wang, Gregory D Hager, and Alan L Yuille. Nothing but geometric constraints: A model-free method for articulated object pose estimation. arXiv preprint arXiv:2012.00088, 2020. 2

work page arXiv 2020
[42]

Building interactable replicas of complex articulated objects via gaussian splatting

Yu Liu, Baoxiong Jia, Ruijie Lu, Junfeng Ni, Song-Chun Zhu, and Siyuan Huang. Building interactable replicas of complex articulated objects via gaussian splatting. In The Thirteenth International Conference on Learning Representations, 2025. 2, 4, 6, 7

work page 2025
[43]

Threestudio: A modular framework for diffusion-guided 3d generation

Ying-Tian Liu, Yuan-Chen Guo, Vikram Voleti, Ruizhi Shao, Chia-Hao Chen, Guan Luo, Zixin Zou, Chen Wang, Christian Laforte, Yan-Pei Cao, et al. Threestudio: A modular framework for diffusion-guided 3d generation. cg.cs.tsinghua.edu.cn, 2023. 1

work page 2023
[44]

Real2code: Reconstruct articulated objects via code generation

Zhao Mandi, Yijia Weng, Dominik Bauer, and Shuran Song. Real2code: Reconstruct articulated objects via code generation. arXiv preprint arXiv:2406.08474, 2024. 3

work page arXiv 2024
[45]

Nerf: Representing scenes as neural radiance fields for view synthesis

Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021. 2, S1

work page 2021
[46]

A-sdf: Learning disentangled signed distance functions for articulated shape representation

Jiteng Mu, Weichao Qiu, Adam Kortylewski, Alan Yuille, Nuno Vasconcelos, and Xiaolong Wang. A-sdf: Learning disentangled signed distance functions for articulated shape representation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13001–13011,

work page
[47]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 4

work page internal anchor Pith review Pith/arXiv arXiv 2023
[48]

DreamFusion: Text-to-3D using 2D Diffusion

Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022. 1

work page internal anchor Pith review Pith/arXiv arXiv 2022
[49]

Multilayer perceptron and neural networks

Marius-Constantin Popescu, Valentina E Balas, Liliana Perescu-Popescu, and Nikos Mastorakis. Multilayer perceptron and neural networks. WSEAS Transactions on Circuits and Systems, 8(7):579–588, 2009. 4

work page 2009
[50]

Understanding 3d object articulation in internet videos

Shengyi Qian, Linyi Jin, Chris Rockwell, Siyi Chen, and David F Fouhey. Understanding 3d object articulation in internet videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1599–1609, 2022. 2

work page 2022
[51]

Articulate anymesh: Open-vocabulary 3d articulated objects modeling

Xiaowen Qiu, Jincheng Yang, Yian Wang, Zhehuan Chen, Yufei Wang, Tsun-Hsuan Wang, Zhou Xian, and Chuang Gan. Articulate anymesh: Open-vocabulary 3d articulated objects modeling. arXiv preprint arXiv:2502.02590, 2025. 2

work page arXiv 2025
[52]

Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 658–666,
[53]

Bokui Shen, Fei Xia, Chengshu Li, Roberto Martín-Martín, Linxi Fan, Guanzhi Wang, Claudia Pérez-D’Arpino, Shyamal Buch, Sanjana Srivastava, Lyne Tchapmi, et al. igibson 1.0: A simulation environment for interactive tasks in large realistic scenes. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7520–7527. IE...
[54]

Yawar Siddiqui, Tom Monnier, Filippos Kokkinos, Mahendra Kariya, Yanir Kleiman, Emilien Garreau, Oran Gafni, Natalia Neverova, Andrea Vedaldi, Roman Shapovalov, et al. Meta 3d assetgen: Text-to-mesh generation with high-quality geometry, texture, and pbr materials. Advances in Neural Information Processing Systems, 37:9532–9564,
[55]

Vincent Sitzmann, Semon Rezchikov, Bill Freeman, Josh Tenenbaum, and Fredo Durand. Light field networks: Neural scene representations with single-evaluation rendering. Advances in Neural Information Processing Systems, 34:19313–19325, 2021. 4
[56]

Xiaohao Sun, Hanxiao Jiang, Manolis Savva, and Angel Chang. Opdmulti: Openable part detection for multiple objects. In 2024 International Conference on 3D Vision (3DV), pages 169–178. IEEE, 2024. 2
[57]

Archana Swaminathan, Anubhav Gupta, Kamal Gupta, Shishira R Maiya, Vatsal Agarwal, and Abhinav Shrivastava. Leia: Latent view-invariant embeddings for implicit 3d articulation. In European Conference on Computer Vision, pages 210–227. Springer, 2024. 2
[58]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017. 4
[59]

Ruiqi Wang, Akshay Gadi Patil, Fenggen Yu, and Hao Zhang. Active coarse-to-fine segmentation of moveable parts from real images. In European Conference on Computer Vision, pages 111–127. Springer, 2024. 2
[60]

Xiaogang Wang, Bin Zhou, Yahao Shi, Xiaowu Chen, Qinping Zhao, and Kai Xu. Shape2motion: Joint analysis of motion parts and attributes from 3d shapes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8876–8884, 2019. 2
[61]

Xinyue Wei, Kai Zhang, Sai Bi, Hao Tan, Fujun Luan, Valentin Deschaintre, Kalyan Sunkavalli, Hao Su, and Zexiang Xu. Meshlrm: Large reconstruction model for high-quality meshes. arXiv preprint arXiv:2404.12385, 2024. 1, 4, 5, S1
[62]

Yijia Weng, Bowen Wen, Jonathan Tremblay, Valts Blukis, Dieter Fox, Leonidas Guibas, and Stan Birchfield. Neural implicit representation for building digital twins of unknown articulated objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3141–3150, 2024. 2, 4, 6
[63]

Fanbo Xiang, Yuzhe Qin, Kaichun Mo, Yikuan Xia, Hao Zhu, Fangchen Liu, Minghua Liu, Hanxiao Jiang, Yifu Yuan, He Wang, et al. Sapien: A simulated part-based interactive environment. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11097–11107, 2020. 2, 6, 8, S2
[64]

Xianghao Xu, Yifan Ruan, Srinath Sridhar, and Daniel Ritchie. Unsupervised kinematic motion detection for part-segmented 3d shape collections. In ACM SIGGRAPH 2022 Conference Proceedings, pages 1–9, 2022. 2
[65]

Yinghao Xu, Zifan Shi, Wang Yifan, Hansheng Chen, Ceyuan Yang, Sida Peng, Yujun Shen, and Gordon Wetzstein. Grm: Large gaussian reconstruction model for efficient 3d reconstruction and generation. In European Conference on Computer Vision, pages 1–20. Springer, 2024. 1, 3, 5
[66]

Zhen Xu, Zhengqin Li, Zhao Dong, Xiaowei Zhou, Richard Newcombe, and Zhaoyang Lv. 4dgt: Learning a 4d gaussian transformer using real-world monocular videos. arXiv preprint arXiv:2506.08015, 2025. 4
[67]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025. 4
[68]

Gengshan Yang, Chaoyang Wang, N Dinesh Reddy, and Deva Ramanan. Reconstructing animatable categories from videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16995–17005, 2023. 1
[69]

Lior Yariv, Jiatao Gu, Yoni Kasten, and Yaron Lipman. Volume rendering of neural implicit surfaces. Advances in neural information processing systems, 34:4805–4815, 2021. 5, S1
[70]

Junyi Zhang, Charles Herrmann, Junhwa Hur, Luisa Polania Cabrera, Varun Jampani, Deqing Sun, and Ming-Hsuan Yang. A tale of two features: Stable diffusion complements dino for zero-shot semantic correspondence. Advances in Neural Information Processing Systems, 36:45533–45547,
[71]

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018. 5, 6

ART: Articulated Reconstruction Transformer
Supplementary Material

A. More results

We provide...
