arxiv: 2509.04276 · v2 · submitted 2025-09-04 · 💻 cs.CV

PAOLI: Pose-free Articulated Object Learning from Sparse-view Images

Jianning Deng, Kartic Subr, Hakan Bilen

Pith reviewed 2026-05-18 18:41 UTC · model grok-4.3

classification 💻 cs.CV

keywords articulated objects · sparse-view reconstruction · pose-free learning · deformation field · 3D object modeling · self-supervised consistency · part disentanglement


The pith

A method learns accurate 3D models of articulated objects from just four sparse unposed images per articulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that articulated objects can be modeled in detail without the dense multi-view images or known camera poses required by prior work. It first reconstructs each articulation separately using sparse-view techniques, then learns a deformation field to align these reconstructions and establish dense correspondences across poses. A progressive disentanglement step separates static parts from moving ones, after which geometry, appearance, and kinematics are optimized together through self-supervised consistency losses. This matters because it removes the need for controlled capture setups or pose annotations, making 3D modeling of real moving objects more practical with casual image collections.

Core claim

We present a methodology to model articulated objects using a sparse set of images with unknown poses. Our central insight is to first solve a robust correspondence and alignment problem between unaligned reconstructions, before part motions can be analyzed. We first reconstruct each articulation independently using recent advances in sparse-view 3D reconstruction, then learn a deformation field that establishes dense correspondences across poses. A progressive disentanglement strategy further separates static from moving parts, enabling robust separation of camera and object motion. Finally, we optimize geometry, appearance, and kinematics jointly with a self-supervised loss that enforces cross-view and cross-pose consistency.

What carries the argument

A learned deformation field that aligns independent sparse-view reconstructions across articulations and supports progressive disentanglement of static and moving parts.

If this is right

Articulated objects can be represented accurately with as few as four views per articulation and no camera supervision.
Independent per-pose reconstructions can be aligned without external pose information to separate static and moving components.
Joint optimization of geometry, appearance, and kinematics succeeds when driven only by cross-view and cross-pose consistency losses.
The resulting models remain detailed on both standard benchmarks and real-world captured objects under the weaker input conditions.
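The kinematics mentioned in the points above can be made concrete with a small sketch. It assumes a single revolute joint described by a pivot point, an axis direction, and an angle, and applies Rodrigues' rotation formula only to points flagged as moving. The function names and the joint parameterization are illustrative assumptions; the page does not state how the paper represents its joints.

```python
import math

def rotate_about_axis(point, pivot, axis, angle):
    """Rotate a 3D point by `angle` radians about the line through `pivot` along `axis`."""
    norm = math.sqrt(sum(a * a for a in axis))
    k = [a / norm for a in axis]                  # unit axis direction
    v = [p - c for p, c in zip(point, pivot)]     # point expressed relative to the pivot
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    k_dot_v = sum(ki * vi for ki, vi in zip(k, v))
    k_cross_v = [
        k[1] * v[2] - k[2] * v[1],
        k[2] * v[0] - k[0] * v[2],
        k[0] * v[1] - k[1] * v[0],
    ]
    # Rodrigues' formula: v cos(a) + (k x v) sin(a) + k (k . v) (1 - cos(a))
    rotated = [
        vi * cos_a + ci * sin_a + ki * k_dot_v * (1.0 - cos_a)
        for vi, ci, ki in zip(v, k_cross_v, k)
    ]
    return [r + c for r, c in zip(rotated, pivot)]

def articulate(points, moving_mask, pivot, axis, angle):
    """Hypothetical helper: move only the points flagged as belonging to the moving part."""
    return [
        rotate_about_axis(p, pivot, axis, angle) if is_moving else list(p)
        for p, is_moving in zip(points, moving_mask)
    ]
```

A prismatic joint would replace the rotation with a translation along the axis; the static/moving mask plays the same role in both cases.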

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Casual multi-view photography of moving objects could replace controlled studio capture for many 3D modeling tasks.
The alignment-first strategy may extend to other problems involving unposed image sets, such as scene reconstruction with moving elements.
If the deformation field proves stable on even sparser inputs, the approach could scale to video sequences with unknown camera motion.

Load-bearing premise

Independent sparse-view reconstructions of each articulation can be robustly aligned and disentangled into static and moving parts via a learned deformation field without any ground-truth poses or dense observations.

What would settle it

Apply the method to a benchmark with known ground-truth poses and part motions; if the output 3D models show reconstruction or motion errors comparable to or higher than pose-supervised baselines, the central claim would be falsified.
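One way to quantify the motion error in such a test is the angle between a predicted and a ground-truth articulation axis. The sketch below is an assumption about how that comparison could be scored, not the metric the paper reports; it treats an axis and its negation as the same direction.

```python
import math

def axis_angle_error_deg(pred_axis, gt_axis):
    """Angle in degrees between two axis directions, ignoring their sign."""
    dot = sum(p * g for p, g in zip(pred_axis, gt_axis))
    norm = math.sqrt(sum(p * p for p in pred_axis)) * math.sqrt(sum(g * g for g in gt_axis))
    # abs() makes the error sign-invariant; min() guards acos against rounding above 1.
    cosine = min(1.0, abs(dot) / norm)
    return math.degrees(math.acos(cosine))
```

A full evaluation would also compare the axis position and the estimated joint angle or displacement against ground truth.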

Figures

Figures reproduced from arXiv: 2509.04276 by Hakan Bilen, Jianning Deng, Kartic Subr.

**Figure 2.** Overview of our method pipeline. Given two sets of K-view images (K=4) of an articulated object in different states, our approach …

**Figure 3.** Illustration of the extension to multiple parts. We pro…

**Figure 4.** Qualitative evaluation for novel view synthesis in the target state. Both AGS-GT and AGS-VGGT fail to reconstruct the object in the 4-view setting. Meanwhile, our method achieves rendering quality similar to AGS-Full, which is trained with 100 images per articulation state and ground-truth camera poses. … baseline methods heavily rely on precise 3…

**Figure 5.** Visualization of 3D Gaussian points with part-level segmentation and articulation axes. Ground-truth axes are shown in green, predictions in red. Best viewed in color and zoomed in.

**Figure 6.** PSNR w.r.t. number of views per articulation. Plot axes: number of views (4, 8, 16, 32, 64, 100) against PSNR (0 to 40); series: AGS+GT, AGS+VGGT, AYN+GT, Ours. Adjacent text from Section 4.3 (Ablation Studies, effect of the number of input images): With just 4-view images, baseline methods struggle to reconstruct objects even in a static state. We investigate how the number of input views affects performance by varying the views per articulation state …

**Figure 9.** Real-world objects. We show the input images (left), novel articulation synthesis results (middle), and part-level segmentation with articulation axes (right). The first three objects are single-part articulated objects, while the last one is a multi-part articulated object. For the multi-part object, we collect three sets of input images per articulation state. The segmentation results are shown in differ…
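The figure on PSNR versus number of views relies on the standard peak signal-to-noise ratio. As a reminder of what that number measures, here is a minimal sketch over flat lists of pixel intensities, assuming values in [0, 1]; the function name and the flat-list input format are illustrative choices, not taken from the paper.

```python
import math

def psnr(rendered, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(rendered) != len(reference) or not rendered:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(rendered, reference)) / len(rendered)
    if mse == 0.0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher is better; halving the root-mean-square error raises PSNR by about 6 dB.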

Original abstract

We present a methodology to model articulated objects using a sparse set of images with unknown poses. Current methods require dense multi-view observations and ground-truth camera poses. Our approach operates with as few as four views per articulation and no camera supervision. Our central insight is to first solve a robust correspondence and alignment problem between unaligned reconstructions, before part motions can be analyzed. We first reconstruct each articulation independently using recent advances in sparse-view 3D reconstruction, then learn a deformation field that establishes dense correspondences across poses. A progressive disentanglement strategy further separates static from moving parts, enabling robust separation of camera and object motion. Finally, we optimize geometry, appearance, and kinematics jointly with a self-supervised loss that enforces cross-view and cross-pose consistency. Experiments on the standard benchmark and real-world examples demonstrate that our method produces accurate and detailed articulated object representations under significantly weaker input assumptions than existing approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PAOLI tries to cut the supervision needed for articulated-object modeling down to four unposed views per articulation, using independent sparse reconstructions plus a deformation field for alignment and disentanglement. The separation of camera and object motion, however, looks underconstrained.


The main point is a pipeline that reconstructs each articulation independently from four views using recent sparse-view methods, then learns a deformation field to establish correspondences and progressively disentangles static from moving parts before joint optimization with a self-supervised consistency loss. No camera poses or dense observations are needed at any stage.

Referee Report

2 major / 2 minor

Summary. The manuscript presents PAOLI, a method to model articulated objects from sparse-view images with unknown camera poses. It reconstructs each articulation independently via recent sparse-view 3D techniques, learns a deformation field to establish dense correspondences across poses, applies a progressive disentanglement strategy to separate static and moving parts, and jointly optimizes geometry, appearance, and kinematics under a self-supervised cross-view/cross-pose consistency loss. The central claim is that the pipeline produces accurate articulated representations using as few as four views per articulation and no camera supervision, outperforming prior work that requires denser observations and ground-truth poses.

Significance. If the disentanglement and alignment steps prove robust, the work would be a meaningful advance in articulated object reconstruction by relaxing the strong input assumptions of dense multi-view capture and known poses. Building directly on recent sparse-view reconstruction advances and self-supervised losses is a practical strength; successful validation would enable more accessible data collection for robotics and AR applications. The explicit separation of alignment from motion analysis is a clear conceptual contribution.

major comments (2)

[§3] §3 (Method), progressive disentanglement paragraph and associated loss formulation: the manuscript does not specify an explicit mechanism (e.g., regularization term, initialization strategy, or architectural bias) that prevents the learned deformation field from absorbing camera motion into object motion when each 4-view reconstruction carries large depth/pose ambiguities. The self-supervised consistency loss can be satisfied by kinematically incorrect solutions that conflate the two, directly undermining the central claim that camera and object motion can be robustly separated without ground-truth poses or dense observations.
[§5] §5 (Experiments), quantitative tables and ablation studies: no ablation is reported that isolates the contribution of the progressive disentanglement module versus a baseline that simply aligns independent reconstructions; without this, it is impossible to verify that the method overcomes the geometric degeneracy highlighted in the central insight rather than relying on favorable initialization or dataset biases.
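The ambiguity raised in the first major comment can be illustrated with a toy example. The sketch below assumes point correspondences between two reconstructions are already known (the role the deformation field would play) and uses a trimmed Kabsch fit, a standard robust-alignment technique swapped in here for illustration and not the authors' progressive disentanglement. The rigid transform shared by all points stands in for the frame change between independent reconstructions, and the points that the fit cannot explain are flagged as moving. The helper name `split_static_moving` and its thresholds are hypothetical.

```python
import numpy as np

def kabsch(source, target):
    """Least-squares rigid transform (rot, trans) with target ~ source @ rot.T + trans."""
    src_mean, tgt_mean = source.mean(axis=0), target.mean(axis=0)
    cov = (source - src_mean).T @ (target - tgt_mean)
    u, _, vt = np.linalg.svd(cov)
    # Flip the last singular direction if needed so that det(rot) = +1 (no reflection).
    sign = np.sign(np.linalg.det(vt.T @ u.T))
    rot = vt.T @ np.diag([1.0, 1.0, sign]) @ u.T
    return rot, tgt_mean - rot @ src_mean

def split_static_moving(source, target, keep_ratio=0.6, iters=5, tol=0.05):
    """Hypothetical helper: refit on the best-explained points, flag the rest as moving."""
    inliers = np.arange(len(source))
    for _ in range(iters):
        rot, trans = kabsch(source[inliers], target[inliers])
        residual = np.linalg.norm(source @ rot.T + trans - target, axis=1)
        inliers = np.argsort(residual)[: int(keep_ratio * len(source))]
    return rot, trans, residual > tol

# Toy data: 200 corresponding points, the first 50 belong to a sliding part.
rng = np.random.default_rng(0)
points = rng.uniform(-1.0, 1.0, size=(200, 3))
angle = 0.7
rot_true = np.array([
    [np.cos(angle), -np.sin(angle), 0.0],
    [np.sin(angle), np.cos(angle), 0.0],
    [0.0, 0.0, 1.0],
])
trans_true = np.array([0.3, -0.2, 0.5])
target = points @ rot_true.T + trans_true   # frame change shared by every point
target[:50] += np.array([0.5, 0.0, 0.0])    # extra shift applied to the moving part only
rot_est, trans_est, moving = split_static_moving(points, target)
```

The sketch works because most points are static and the correspondences are exact. With noisy four-view reconstructions and a large moving part, the first fit can absorb part of the object motion, which is the failure mode the comment asks the authors to rule out.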

minor comments (2)

[§3.2] Notation for the deformation field and the static/moving partition mask should be introduced with explicit equations rather than prose descriptions to improve reproducibility.
[Figure 4] Figure 4 (qualitative results) would benefit from side-by-side comparison with a naive alignment baseline to visually demonstrate the effect of the disentanglement step.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. The comments have helped us clarify key aspects of the method and strengthen the experimental validation. We address each major comment below and have made revisions to the manuscript as indicated.


Referee: [§3] §3 (Method), progressive disentanglement paragraph and associated loss formulation: the manuscript does not specify an explicit mechanism (e.g., regularization term, initialization strategy, or architectural bias) that prevents the learned deformation field from absorbing camera motion into object motion when each 4-view reconstruction carries large depth/pose ambiguities. The self-supervised consistency loss can be satisfied by kinematically incorrect solutions that conflate the two, directly undermining the central claim that camera and object motion can be robustly separated without ground-truth poses or dense observations.

Authors: We thank the referee for highlighting this potential issue in the method description. The progressive disentanglement is intended to separate camera and object motion by starting with static-part alignment and progressively identifying moving parts. The deformation field is constrained by the initial independent reconstructions, which provide a starting point less prone to absorbing global motion. Nevertheless, we acknowledge that the manuscript would benefit from a more explicit description of the mechanism. In the revised version, we have expanded the progressive disentanglement paragraph to include details on the initialization strategy and an added regularization term that limits the deformation field's ability to model large global transformations in the initial stages. This should make the robustness clearer. revision: yes
Referee: [§5] §5 (Experiments), quantitative tables and ablation studies: no ablation is reported that isolates the contribution of the progressive disentanglement module versus a baseline that simply aligns independent reconstructions; without this, it is impossible to verify that the method overcomes the geometric degeneracy highlighted in the central insight rather than relying on favorable initialization or dataset biases.

Authors: We agree with the referee that an ablation isolating the progressive disentanglement is necessary to fully validate the contribution. We have added this ablation to the experiments section in the revised manuscript. The new results demonstrate that removing the progressive disentanglement leads to degraded performance in motion estimation, confirming that it plays a key role in overcoming the geometric ambiguities rather than depending on dataset biases or initialization alone. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external sparse-view methods and self-supervised losses without self-referential reduction

full rationale

The paper's pipeline begins with independent per-articulation reconstructions drawn from cited external advances in sparse-view 3D reconstruction, followed by a learned deformation field for correspondences and a progressive disentanglement step optimized via cross-view/cross-pose consistency losses. None of these steps reduce a claimed prediction or uniqueness result to a fitted parameter or self-citation by construction; the alignment and static/moving separation are presented as emergent from the joint optimization rather than presupposed in the inputs. The central claims therefore remain independent of the paper's own fitted values or prior self-references, qualifying as a self-contained derivation against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The abstract provides insufficient detail for an exhaustive ledger; key unstated premises include the reliability of off-the-shelf sparse-view reconstruction and the feasibility of unsupervised disentanglement.

axioms (2)

Domain assumption: Recent sparse-view 3D reconstruction advances can produce usable independent per-articulation models even with unknown poses.
Basis: the method explicitly starts by reconstructing each articulation independently using those advances.
Domain assumption: A learned deformation field can establish reliable dense correspondences across different articulations without supervision.
Basis: this is the central step after independent reconstruction.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

we first reconstruct each articulation independently using recent advances in sparse-view 3D reconstruction, then learn a deformation field that establishes dense correspondences across poses. A progressive disentanglement strategy further separates static from moving parts
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean absolute_floor_iff_bare_distinguishability unclear

Relation between the paper passage and the cited Recognition theorem.

$\min_{F_{\text{deform}}} \; \mathcal{L}_{\text{CD}}(\hat{G}_t, G_t) + \mathcal{L}_{\text{photo}}(R(\hat{G}_t), R(G_t))$
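The quoted objective appears to pair a geometric term with a photometric term. Reading the first term as a Chamfer distance between the deformed and target point sets is an assumption based on the "CD" subscript. Under that reading, a minimal sketch of a symmetric Chamfer distance using squared Euclidean distances (one of several common conventions) looks like this:

```python
def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance: mean squared nearest-neighbour distance, both directions."""
    def sq_dist(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

    def one_sided(src, dst):
        # For every source point, take the squared distance to its closest destination point.
        return sum(min(sq_dist(p, q) for q in dst) for p in src) / len(src)

    return one_sided(points_a, points_b) + one_sided(points_b, points_a)
```

This brute-force version is quadratic in the number of points; practical implementations use spatial indexing or batched GPU kernels, and the photometric term requires a differentiable renderer that is outside the scope of this sketch.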

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
