pith. the verified trust layer for science. sign in

arxiv: 2509.04276 · v2 · submitted 2025-09-04 · 💻 cs.CV

PAOLI: Pose-free Articulated Object Learning from Sparse-view Images

Pith reviewed 2026-05-18 18:41 UTC · model grok-4.3

classification 💻 cs.CV
keywords articulated objectssparse-view reconstructionpose-free learningdeformation field3D object modelingself-supervised consistencypart disentanglement
0
0 comments X p. Extension

The pith

A method learns accurate 3D models of articulated objects from just four sparse unposed images per articulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that articulated objects can be modeled in detail without the dense multi-view images or known camera poses required by prior work. It first reconstructs each articulation separately using sparse-view techniques, then learns a deformation field to align these reconstructions and establish dense correspondences across poses. A progressive disentanglement step separates static parts from moving ones, after which geometry, appearance, and kinematics are optimized together through self-supervised consistency losses. This matters because it removes the need for controlled capture setups or pose annotations, making 3D modeling of real moving objects more practical with casual image collections.

Core claim

We present a methodology to model articulated objects using a sparse set of images with unknown poses. Our central insight is to first solve a robust correspondence and alignment problem between unaligned reconstructions, before part motions can be analyzed. We first reconstruct each articulation independently using recent advances in sparse-view 3D reconstruction, then learn a deformation field that establishes dense correspondences across poses. A progressive disentanglement strategy further separates static from moving parts, enabling robust separation of camera and object motion. Finally, we optimize geometry, appearance, and kinematics jointly with a self-supervised loss that enforces跨跨

What carries the argument

A learned deformation field that aligns independent sparse-view reconstructions across articulations and supports progressive disentanglement of static and moving parts.

If this is right

  • Articulated objects can be represented accurately with as few as four views per articulation and no camera supervision.
  • Independent per-pose reconstructions can be aligned without external pose information to separate static and moving components.
  • Joint optimization of geometry, appearance, and kinematics succeeds when driven only by cross-view and cross-pose consistency losses.
  • The resulting models remain detailed on both standard benchmarks and real-world captured objects under the weaker input conditions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Casual multi-view photography of moving objects could replace controlled studio capture for many 3D modeling tasks.
  • The alignment-first strategy may extend to other problems involving unposed image sets, such as scene reconstruction with moving elements.
  • If the deformation field proves stable on even sparser inputs, the approach could scale to video sequences with unknown camera motion.

Load-bearing premise

Independent sparse-view reconstructions of each articulation can be robustly aligned and disentangled into static and moving parts via a learned deformation field without any ground-truth poses or dense observations.

What would settle it

Apply the method to a benchmark with known ground-truth poses and part motions; if the output 3D models show reconstruction or motion errors comparable to or higher than pose-supervised baselines, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2509.04276 by Hakan Bilen, Jianning Deng, Kartic Subr.

Figure 1
Figure 1. Figure 1: Given few unposed views of an articulated object over [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of our method pipeline. Given two sets of K-view images (K=4) of an articulated object in different states, our approach [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Illustration of the extension to multiple parts. We pro [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative evaluation for novel view synthesis in tar￾get state. We can see from the results that both AGS-GT and AGS-VGGT fail to reconstruct the object in the setting of 4-view images. In the meanwhile, our method demonstrates similar ren￾dering quality compared to AGS-Full, which is trained with 100 images per articulation state with ground truth camera poses. baseline methods heavily rely on precise 3… view at source ↗
Figure 5
Figure 5. Figure 5: Visualization of 3D Gaussian points with part-level segmentation and articulation axes. Ground truth axises are shown in green, predictions in red. Better view in color and zoom in. Number of views PSNR 0 10 20 30 40 4 8 16 32 64 100 AGS+GT AGS+VGGT AYN+GT Ours [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: PSNR w.r.t. number of views per articulation. 4.3. Ablation Studies Effect of the number of input images. With just 4-view images, baseline methods struggle to reconstruct objects even when in static state. We investigate how the number of input views affects performance by varying the views per articulation state [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 9
Figure 9. Figure 9: Real-world objects. We show the input images (left), novel articulation synthesis results (middle), and part-level segmentation with articulation axes (right). The first three objects are single-part articulated objects, while the last one is a multi-part articulated object. For the multi-part object, we collect three sets of input images per articulation state. The segmentation results are shown in differ… view at source ↗
read the original abstract

We present a methodology to model articulated objects using a sparse set of images with unknown poses. Current methods require dense multi-view observations and ground-truth camera poses. Our approach operates with as few as four views per articulation and no camera supervision. Our central insight is to first solve a robust correspondence and alignment problem between unaligned reconstructions, before part motions can be analyzed. We first reconstruct each articulation independently using recent advances in sparse-view 3D reconstruction, then learn a deformation field that establishes dense correspondences across poses. A progressive disentanglement strategy further separates static from moving parts, enabling robust separation of camera and object motion. Finally, we optimize geometry, appearance, and kinematics jointly with a self-supervised loss that enforces cross-view and cross-pose consistency. Experiments on the standard benchmark and real-world examples demonstrate that our method produces accurate and detailed articulated object representations under significantly weaker input assumptions than existing approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents PAOLI, a method to model articulated objects from sparse-view images with unknown camera poses. It reconstructs each articulation independently via recent sparse-view 3D techniques, learns a deformation field to establish dense correspondences across poses, applies a progressive disentanglement strategy to separate static and moving parts, and jointly optimizes geometry, appearance, and kinematics under a self-supervised cross-view/cross-pose consistency loss. The central claim is that the pipeline produces accurate articulated representations using as few as four views per articulation and no camera supervision, outperforming prior work that requires denser observations and ground-truth poses.

Significance. If the disentanglement and alignment steps prove robust, the work would be a meaningful advance in articulated object reconstruction by relaxing the strong input assumptions of dense multi-view capture and known poses. Building directly on recent sparse-view reconstruction advances and self-supervised losses is a practical strength; successful validation would enable more accessible data collection for robotics and AR applications. The explicit separation of alignment from motion analysis is a clear conceptual contribution.

major comments (2)
  1. [§3] §3 (Method), progressive disentanglement paragraph and associated loss formulation: the manuscript does not specify an explicit mechanism (e.g., regularization term, initialization strategy, or architectural bias) that prevents the learned deformation field from absorbing camera motion into object motion when each 4-view reconstruction carries large depth/pose ambiguities. The self-supervised consistency loss can be satisfied by kinematically incorrect solutions that conflate the two, directly undermining the central claim that camera and object motion can be robustly separated without ground-truth poses or dense observations.
  2. [§5] §5 (Experiments), quantitative tables and ablation studies: no ablation is reported that isolates the contribution of the progressive disentanglement module versus a baseline that simply aligns independent reconstructions; without this, it is impossible to verify that the method overcomes the geometric degeneracy highlighted in the central insight rather than relying on favorable initialization or dataset biases.
minor comments (2)
  1. [§3.2] Notation for the deformation field and the static/moving partition mask should be introduced with explicit equations rather than prose descriptions to improve reproducibility.
  2. [Figure 3] Figure 3 (qualitative results) would benefit from side-by-side comparison with a naive alignment baseline to visually demonstrate the effect of the disentanglement step.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. The comments have helped us clarify key aspects of the method and strengthen the experimental validation. We address each major comment below and have made revisions to the manuscript as indicated.

read point-by-point responses
  1. Referee: [§3] §3 (Method), progressive disentanglement paragraph and associated loss formulation: the manuscript does not specify an explicit mechanism (e.g., regularization term, initialization strategy, or architectural bias) that prevents the learned deformation field from absorbing camera motion into object motion when each 4-view reconstruction carries large depth/pose ambiguities. The self-supervised consistency loss can be satisfied by kinematically incorrect solutions that conflate the two, directly undermining the central claim that camera and object motion can be robustly separated without ground-truth poses or dense observations.

    Authors: We appreciate the referee for highlighting this potential issue in the method description. The progressive disentanglement is intended to address the separation of camera and object motion by starting with static part alignment and progressively identifying moving parts. The deformation field is constrained by the initial independent reconstructions which provide a starting point less prone to absorbing global motion. Nevertheless, we acknowledge that the manuscript would benefit from a more explicit description of the mechanism. In the revised version, we have expanded the progressive disentanglement paragraph to include details on the initialization strategy and an added regularization term that limits the deformation field's ability to model large global transformations in initial stages. This should make the robustness clearer. revision: yes

  2. Referee: [§5] §5 (Experiments), quantitative tables and ablation studies: no ablation is reported that isolates the contribution of the progressive disentanglement module versus a baseline that simply aligns independent reconstructions; without this, it is impossible to verify that the method overcomes the geometric degeneracy highlighted in the central insight rather than relying on favorable initialization or dataset biases.

    Authors: We agree with the referee that an ablation isolating the progressive disentanglement is necessary to fully validate the contribution. We have added this ablation to the experiments section in the revised manuscript. The new results demonstrate that removing the progressive disentanglement leads to degraded performance in motion estimation, confirming that it plays a key role in overcoming the geometric ambiguities rather than depending on dataset biases or initialization alone. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external sparse-view methods and self-supervised losses without self-referential reduction

full rationale

The paper's pipeline begins with independent per-articulation reconstructions drawn from cited external advances in sparse-view 3D reconstruction, followed by a learned deformation field for correspondences and a progressive disentanglement step optimized via cross-view/cross-pose consistency losses. None of these steps reduce a claimed prediction or uniqueness result to a fitted parameter or self-citation by construction; the alignment and static/moving separation are presented as emergent from the joint optimization rather than presupposed in the inputs. The central claims therefore remain independent of the paper's own fitted values or prior self-references, qualifying as a self-contained derivation against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Abstract provides insufficient detail for exhaustive ledger; key unstated premises include reliability of off-the-shelf sparse-view recon and feasibility of unsupervised disentanglement.

axioms (2)
  • domain assumption Recent sparse-view 3D reconstruction advances can produce usable independent per-articulation models even with unknown poses.
    Method explicitly starts by reconstructing each articulation independently using those advances.
  • domain assumption A learned deformation field can establish reliable dense correspondences across different articulations without supervision.
    Central step after independent reconstruction.

pith-pipeline@v0.9.0 · 5684 in / 1253 out tokens · 64212 ms · 2026-05-18T18:41:48.903701+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · 2 internal anchors

  1. [1]

    Building rome in a day

    Sameer Agarwal, Yasutaka Furukawa, Noah Snavely, Ian Si- mon, Brian Curless, Steven M Seitz, and Richard Szeliski. Building rome in a day. Communications of the ACM , 54 (10):105–112, 2011. 2

  2. [2]

    pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction

    David Charatan, Sizhe Lester Li, Andrea Tagliasacchi, and Vincent Sitzmann. pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction. In Pro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19457–19467, 2024. 2

  3. [3]

    Op-align: Object-level and part-level alignment for self-supervised category-level articulated object pose estimation

    Yuchen Che, Ryo Furukawa, and Asako Kanezaki. Op-align: Object-level and part-level alignment for self-supervised category-level articulated object pose estimation. In Eu- ropean Conference on Computer Vision , pages 72–88. Springer, 2024. 2

  4. [4]

    Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images

    Yuedong Chen, Haofei Xu, Chuanxia Zheng, Bohan Zhuang, Marc Pollefeys, Andreas Geiger, Tat-Jen Cham, and Jianfei Cai. Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images. In European Conference on Computer Vision, pages 370–386. Springer, 2024. 2

  5. [5]

    Gaussianpro: 3d gaussian splatting with progressive propagation

    Kai Cheng, Xiaoxiao Long, Kaizhi Yang, Yao Yao, Wei Yin, Yuexin Ma, Wenping Wang, and Xuejin Chen. Gaussianpro: 3d gaussian splatting with progressive propagation. InForty- first International Conference on Machine Learning, 2024. 2

  6. [6]

    Depth-regularized optimization for 3d gaussian splatting in few-shot images

    Jaeyoung Chung, Jeongtaek Oh, and Kyoung Mu Lee. Depth-regularized optimization for 3d gaussian splatting in few-shot images. In Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 811–820, 2024. 2

  7. [7]

    Articulate your nerf: Unsupervised articulated object modeling via con- ditional view synthesis

    Jianning Deng, Kartic Subr, and Hakan Bilen. Articulate your nerf: Unsupervised articulated object modeling via con- ditional view synthesis. Advances in Neural Information Processing Systems, 37:119717–119741, 2024. 1, 2, 4, 5, 6, 12

  8. [8]

    Lightgaussian: Unbounded 3d gaussian compression with 15x reduction and 200+ fps

    Zhiwen Fan, Kevin Wang, Kairun Wen, Zehao Zhu, Dejia Xu, Zhangyang Wang, et al. Lightgaussian: Unbounded 3d gaussian compression with 15x reduction and 200+ fps. Advances in neural information processing systems , 37: 140138–140158, 2024. 2

  9. [9]

    Capt: Category-level articulation estimation from a single point cloud using transformer

    Lian Fu, Ryoichi Ishikawa, Yoshihiro Sato, and Takeshi Oishi. Capt: Category-level articulation estimation from a single point cloud using transformer. In 2024 IEEE Inter- national Conference on Robotics and Automation (ICRA) , pages 751–757. IEEE, 2024. 2

  10. [10]

    Articulatedgs: Self-supervised digital twin modeling of articulated objects using 3d gaussian splatting

    Junfu Guo, Yu Xin, Gaoyi Liu, Kai Xu, Ligang Liu, and Ruizhen Hu. Articulatedgs: Self-supervised digital twin modeling of articulated objects using 3d gaussian splatting. In Proceedings of the Computer Vision and Pattern Recogni- tion Conference, pages 27144–27153, 2025. 1, 2, 5, 6, 12

  11. [11]

    Carto: Category and joint agnostic reconstruction of articulated objects

    Nick Heppert, Muhammad Zubair Irshad, Sergey Zakharov, Katherine Liu, Rares Andrei Ambrus, Jeannette Bohg, Ab- hinav Valada, and Thomas Kollar. Carto: Category and joint agnostic reconstruction of articulated objects. In Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21201–21210, 2023. 2

  12. [12]

    2d gaussian splatting for geometrically ac- curate radiance fields

    Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, and Shenghua Gao. 2d gaussian splatting for geometrically ac- curate radiance fields. In ACM SIGGRAPH 2024 conference papers, pages 1–11, 2024. 2

  13. [13]

    Opd: Single-view 3d openable part detection

    Hanxiao Jiang, Yongsen Mao, Manolis Savva, and Angel X Chang. Opd: Single-view 3d openable part detection. In European Conference on Computer Vision, pages 410–426. Springer, 2022. 1

  14. [14]

    Detection based part- level articulated object reconstruction from single rgbd im- age

    Yuki Kawana and Tatsuya Harada. Detection based part- level articulated object reconstruction from single rgbd im- age. Advances in Neural Information Processing Systems , 36:18444–18473, 2023. 2

  15. [15]

    3d gaussian splatting for real-time radiance field rendering

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1,

  16. [16]

    Gener- ative sparse-view gaussian splatting

    Hanyang Kong, Xingyi Yang, and Xinchao Wang. Gener- ative sparse-view gaussian splatting. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26745–26755, 2025. 2

  17. [17]

    Articulate-anything: Auto- matic modeling of articulated objects via a vision-language foundation model

    Long Le, Jason Xie, William Liang, Hung-Ju Wang, Yue Yang, Yecheng Jason Ma, Kyle Vedder, Arjun Krishna, Di- nesh Jayaraman, and Eric Eaton. Articulate-anything: Auto- matic modeling of articulated objects via a vision-language foundation model. arXiv preprint arXiv:2410.13882, 2024. 2

  18. [18]

    Compact 3d gaussian representation for radiance field

    Joo Chan Lee, Daniel Rho, Xiangyu Sun, Jong Hwan Ko, and Eunbyung Park. Compact 3d gaussian representation for radiance field. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21719– 21728, 2024. 2

  19. [19]

    Nap: Neural 3d articulation prior

    Jiahui Lei, Congyue Deng, Bokui Shen, Leonidas Guibas, and Kostas Daniilidis. Nap: Neural 3d articulation prior. arXiv preprint arXiv:2305.16315, 2023. 2

  20. [20]

    Ground- ing image matching in 3d with mast3r

    Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Ground- ing image matching in 3d with mast3r. In European Confer- ence on Computer Vision, pages 71–91. Springer, 2024. 2

  21. [21]

    Dngaussian: Optimizing sparse-view 3d gaussian radiance fields with global-local depth normaliza- tion

    Jiahe Li, Jiawei Zhang, Xiao Bai, Jin Zheng, Xin Ning, Jun Zhou, and Lin Gu. Dngaussian: Optimizing sparse-view 3d gaussian radiance fields with global-local depth normaliza- tion. In Proceedings of the IEEE/CVF conference on com- puter vision and pattern recognition , pages 20775–20785,

  22. [22]

    Paris: Part-level reconstruction and motion analysis for articulated objects

    Jiayi Liu, Ali Mahdavi-Amiri, and Manolis Savva. Paris: Part-level reconstruction and motion analysis for articulated objects. In Proceedings of the IEEE/CVF International Con- ference on Computer Vision, pages 352–363, 2023. 1, 2, 5, 12

  23. [23]

    Cage: Controllable articulation generation

    Jiayi Liu, Hou In Ivan Tam, Ali Mahdavi-Amiri, and Manolis Savva. Cage: Controllable articulation generation. In Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17880–17889, 2024. 2

  24. [24]

    Zero-1-to- 3: Zero-shot one image to 3d object

    Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tok- makov, Sergey Zakharov, and Carl V ondrick. Zero-1-to- 3: Zero-shot one image to 3d object. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9298–9309, 2023. 2

  25. [25]

    Build- ing rearticulable models for arbitrary 3d objects from 4d 9 point clouds

    Shaowei Liu, Saurabh Gupta, and Shenlong Wang. Build- ing rearticulable models for arbitrary 3d objects from 4d 9 point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21138– 21147, 2023. 2

  26. [26]

    Robust incremental structure-from-motion with hybrid fea- tures

    Shaohui Liu, Yidan Gao, Tianyi Zhang, Rémi Pautrat, Jo- hannes L Schönberger, Viktor Larsson, and Marc Pollefeys. Robust incremental structure-from-motion with hybrid fea- tures. In European Conference on Computer Vision , pages 249–269. Springer, 2024. 2

  27. [27]

    Building interactable replicas of complex articulated objects via gaussian splatting

    Yu Liu, Baoxiong Jia, Ruijie Lu, Junfeng Ni, Song-Chun Zhu, and Siyuan Huang. Building interactable replicas of complex articulated objects via gaussian splatting. In The Thirteenth International Conference on Learning Represen- tations, 2025. 2

  28. [28]

    Wonder3d: Sin- gle image to 3d using cross-domain diffusion

    Xiaoxiao Long, Yuan-Chen Guo, Cheng Lin, Yuan Liu, Zhiyang Dou, Lingjie Liu, Yuexin Ma, Song-Hai Zhang, Marc Habermann, Christian Theobalt, et al. Wonder3d: Sin- gle image to 3d using cross-domain diffusion. In Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9970–9980, 2024. 2

  29. [29]

    Scaffold-gs: Structured 3d gaussians for view-adaptive rendering

    Tao Lu, Mulin Yu, Linning Xu, Yuanbo Xiangli, Limin Wang, Dahua Lin, and Bo Dai. Scaffold-gs: Structured 3d gaussians for view-adaptive rendering. 2024 ieee. In CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20654–20664, 2023. 2

  30. [30]

    Nerf: Representing scenes as neural radiance fields for view syn- thesis

    Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view syn- thesis. Communications of the ACM, 65(1):99–106, 2021. 1, 2

  31. [31]

    Chang, Li Yi, Subarna Tripathi, Leonidas J

    Kaichun Mo, Shilin Zhu, Angel X. Chang, Li Yi, Subarna Tripathi, Leonidas J. Guibas, and Hao Su. PartNet: A large- scale benchmark for fine-grained and hierarchical part-level 3D object understanding. In The IEEE Conference on Com- puter Vision and Pattern Recognition (CVPR), 2019. 5

  32. [32]

    A-sdf: Learning disentangled signed distance functions for articulated shape representation

    Jiteng Mu, Weichao Qiu, Adam Kortylewski, Alan Yuille, Nuno Vasconcelos, and Xiaolong Wang. A-sdf: Learning disentangled signed distance functions for articulated shape representation. In Proceedings of the IEEE/CVF Interna- tional Conference on Computer Vision, pages 13001–13011,

  33. [33]

    Instant neural graphics primitives with a mul- tiresolution hash encoding

    Thomas Müller, Alex Evans, Christoph Schied, and Alexan- der Keller. Instant neural graphics primitives with a mul- tiresolution hash encoding. ACM transactions on graphics (TOG), 41(4):1–15, 2022. 2

  34. [34]

    Structure from action: Learning interactions for articulated object 3d structure discovery

    Neil Nie, Samir Yitzhak Gadre, Kiana Ehsani, and Shu- ran Song. Structure from action: Learning interactions for articulated object 3d structure discovery. arXiv preprint arXiv:2207.08997, 2022. 8

  35. [35]

    Understanding 3d object articulation in in- ternet videos

    Shengyi Qian, Linyi Jin, Chris Rockwell, Siyi Chen, and David F Fouhey. Understanding 3d object articulation in in- ternet videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 1599– 1609, 2022. 1

  36. [36]

    SAM 2: Segment Anything in Images and Videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junt- ing Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao- Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feicht- enhofer. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:24...

  37. [37]

    Structure-from-motion revisited

    Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In Conference on Com- puter Vision and Pattern Recognition (CVPR), 2016. 2

  38. [38]

    Pixelwise view selection for un- structured multi-view stereo

    Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for un- structured multi-view stereo. In European Conference on Computer Vision (ECCV), 2016. 2

  39. [39]

    Opdmulti: Openable part detection for multiple objects

    Xiaohao Sun, Hanxiao Jiang, Manolis Savva, and An- gel Xuan Chang. Opdmulti: Openable part detection for multiple objects. arXiv preprint arXiv:2303.14087 , 2023. 1

  40. [40]

    Splatter image: Ultra-fast single-view 3d recon- struction

    Stanislaw Szymanowicz, Chrisitian Rupprecht, and Andrea Vedaldi. Splatter image: Ultra-fast single-view 3d recon- struction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages 10208– 10217, 2024. 2

  41. [41]

    Lgm: Large multi-view gaussian model for high-resolution 3d content creation

    Jiaxiang Tang, Zhaoxi Chen, Xiaokang Chen, Tengfei Wang, Gang Zeng, and Ziwei Liu. Lgm: Large multi-view gaussian model for high-resolution 3d content creation. In European Conference on Computer Vision, pages 1–18. Springer, 2024. 2

  42. [42]

    Cla-nerf: Category-level articulated neural radiance field

    Wei-Cheng Tseng, Hung-Ju Liao, Lin Yen-Chen, and Min Sun. Cla-nerf: Category-level articulated neural radiance field. In 2022 International Conference on Robotics and Au- tomation (ICRA), pages 8454–8460. IEEE, 2022. 2

  43. [43]

    Least-squares estimation of transforma- tion parameters between two point patterns

    Shinji Umeyama. Least-squares estimation of transforma- tion parameters between two point patterns. IEEE Transac- tions on Pattern Analysis & Machine Intelligence , 13(04): 376–380, 1991. 13

  44. [44]

    Vggt: Vi- sual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Vi- sual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025. 2, 5

  45. [45]

    NeuS: Learning Neural Implicit Surfaces by Volume Rendering for Multi-view Reconstruction

    Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. arXiv preprint arXiv:2106.10689, 2021. 2

  46. [46]

    Dust3r: Geometric 3d vi- sion made easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vi- sion made easy. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20697– 20709, 2024. 2

  47. [47]

    Self-supervised neural articulated shape and appearance models

    Fangyin Wei, Rohan Chabra, Lingni Ma, Christoph Lassner, Michael Zollhöfer, Szymon Rusinkiewicz, Chris Sweeney, Richard Newcombe, and Mira Slavcheva. Self-supervised neural articulated shape and appearance models. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15816–15826, 2022. 2

  48. [48]

    Neural implicit representation for building digital twins of unknown articulated objects

    Yijia Weng, Bowen Wen, Jonathan Tremblay, Valts Blukis, Dieter Fox, Leonidas Guibas, and Stan Birchfield. Neural implicit representation for building digital twins of unknown articulated objects. In Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 3141–3150, 2024. 2 10

  49. [49]

    Multi-scale 3d gaussian splatting for anti-aliased rendering

    Zhiwen Yan, Weng Fei Low, Yu Chen, and Gim Hee Lee. Multi-scale 3d gaussian splatting for anti-aliased rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20923–20931, 2024. 2

  50. [50]

    Teaser: Fast and certifiable point cloud registration

    Heng Yang, Jingnan Shi, and Luca Carlone. Teaser: Fast and certifiable point cloud registration. IEEE Transactions on Robotics, 37(2):314–333, 2020. 4

  51. [51]

    Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass

    Jianing Yang, Alexander Sax, Kevin J Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass. In Proceedings of the Computer Vision and Pattern Recognition Conference , pages 21924–21935,

  52. [52]

    No pose, no problem: Surprisingly simple 3d gaussian splats from sparse unposed images

    Botao Ye, Sifei Liu, Haofei Xu, Xueting Li, Marc Pollefeys, Ming-Hsuan Yang, and Songyou Peng. No pose, no problem: Surprisingly simple 3d gaussian splats from sparse unposed images. arXiv preprint arXiv:2410.24207, 2024. 2

  53. [53]

    Freesplat- ter: Free-viewpoint 3d gaussian splatting from a single im- age

    Zehao Zhang, Anand Goel, Zhan Wang, Vladlen Koltun, Ji- tendra Malik, Chenxu Ma, and Leonidas Guibas. Freesplat- ter: Free-viewpoint 3d gaussian splatting from a single im- age. arXiv preprint arXiv:2401.04644, 2024. 2, 3

  54. [54]

    Fsgs: Real-time few-shot view synthesis using gaussian splatting

    Zehao Zhu, Zhiwen Fan, Yifan Jiang, and Zhangyang Wang. Fsgs: Real-time few-shot view synthesis using gaussian splatting. In European conference on computer vision, pages 145–163. Springer, 2024. 2 11 PAOLI: Pose-free Articulated Object Learning from Sparse-view Images Supplementary Material

  55. [55]

    Detailed Discussion of Related Work Here we provide more detailed discussion of the most related articulated object learning work including, PARIS [22], AYN [7] and AGS [10]

    Supplementary Material 6.1. Detailed Discussion of Related Work Here we provide more detailed discussion of the most related articulated object learning work including, PARIS [22], AYN [7] and AGS [10]. As explained in the submission, these techniques assume dense views of the ob- ject across two articulation state along with the camera in- formation, unl...