pith. machine review for the scientific record.

arxiv: 2604.05621 · v2 · submitted 2026-04-07 · 💻 cs.CV


FunRec: Reconstructing Functional 3D Scenes from Egocentric Interaction Videos


Pith reviewed 2026-05-10 18:40 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D scene reconstruction · articulated object modeling · egocentric video · functional digital twins · kinematic parameter estimation · simulation-compatible meshes · human-object interaction

The pith

FunRec reconstructs interactable 3D digital twins of indoor scenes from ordinary egocentric RGB-D videos of human interactions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents FunRec as a way to build functional 3D scenes straight from everyday egocentric videos in which people interact with objects. The approach identifies movable parts, figures out how they move, follows their positions over time, and assembles both fixed and moving geometry into a shared coordinate frame that produces meshes ready for simulators. This matters because it removes the need for special capture rigs or pre-existing models, turning casual recordings into usable digital environments. If the method works as described, it opens the possibility of creating simulation-ready twins from ordinary home videos without extra hardware or labels.
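
To make the pipeline's first step concrete, here is a toy, runnable stand-in (an editorial sketch, not FunRec's algorithm) for "identifies movable parts": fit a single rigid motion to all tracked points between two frames, then flag the points the single fit cannot explain as a candidate moving part. All names, numbers, and the threshold are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def kabsch(P, Q):
    """Best-fit rotation R and translation t with Q ≈ P @ R.T + t."""
    cP, cQ = P.mean(0), Q.mean(0)
    U, _, Vt = np.linalg.svd((P - cP).T @ (Q - cQ))
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, cQ - R @ cP

# 80 static scene points and 20 points on a drawer that slides 50 cm.
static0 = rng.uniform(-1, 1, size=(80, 3))
drawer0 = rng.uniform(-0.2, 0.2, size=(20, 3))
P0 = np.vstack([static0, drawer0])
P1 = np.vstack([static0, drawer0 + [0.0, 0.5, 0.0]])

# One rigid fit to everything; points on the moving part become outliers.
R, t = kabsch(P0, P1)
residuals = np.linalg.norm(P1 - (P0 @ R.T + t), axis=1)
moving = residuals > 2 * np.median(residuals)
print("flagged as moving:", int(moving.sum()), "of", len(moving))  # expect ~20
```

FunRec's actual fragment splitting and articulation-aware clustering (see Figure 2 below) are more elaborate, but this is the shape of the signal the paper exploits: interaction makes part points systematic outliers to the scene's dominant rigid motion.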

Core claim

FunRec recovers interactable 3D scenes by processing in-the-wild human interaction sequences from egocentric RGB-D videos. It automatically discovers articulated parts, estimates their kinematic parameters, tracks their 3D motion, and reconstructs static and moving geometry in canonical space, yielding simulation-compatible meshes. The method operates without controlled multi-state captures or CAD priors and is evaluated on new real and simulated benchmarks where it reports large gains in segmentation, articulation accuracy, and reconstruction quality.

What carries the argument

Automatic discovery of articulated parts combined with kinematic parameter estimation from observed human-object interactions in egocentric video.
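
One standard way to estimate a revolute joint from tracked 3D points, shown here as a hedged sketch (the review does not detail the paper's actual estimator): recover the rigid motion between two part states with the Kabsch algorithm, then read the hinge axis and a pivot point off the recovered rotation.

```python
import numpy as np

rng = np.random.default_rng(0)

def rotation_about_axis(a, theta):
    """Rodrigues' formula: rotation by `theta` about unit axis `a`."""
    a = a / np.linalg.norm(a)
    K = np.array([[0, -a[2], a[1]], [a[2], 0, -a[0]], [-a[1], a[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

# Synthetic tracked points on a door rotating 40° about a vertical hinge.
a_true, pivot_true = np.array([0.0, 0.0, 1.0]), np.array([1.0, 0.5, 0.0])
X0 = pivot_true + rng.normal(size=(50, 3))
R_true = rotation_about_axis(a_true, np.deg2rad(40))
X1 = (X0 - pivot_true) @ R_true.T + pivot_true

# Kabsch: best rigid (R, t) mapping X0 onto X1.
c0, c1 = X0.mean(0), X1.mean(0)
U, _, Vt = np.linalg.svd((X0 - c0).T @ (X1 - c1))
d = np.sign(np.linalg.det(Vt.T @ U.T))
R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
t = c1 - R @ c0

# The hinge axis is R's eigenvector with eigenvalue 1 (sign is ambiguous);
# any point on the hinge line solves (I - R) p = t.
w, V = np.linalg.eig(R)
axis = np.real(V[:, np.argmin(np.abs(w - 1.0))])
pivot = np.linalg.lstsq(np.eye(3) - R, t, rcond=None)[0]
angle = np.degrees(np.arccos(np.clip((np.trace(R) - 1) / 2, -1, 1)))
print("axis:", axis.round(3), "pivot:", pivot.round(3), "angle:", angle.round(1))
```

On clean synthetic data this recovers the axis, a point on the hinge, and the opening angle exactly; what the paper has to establish empirically is how stable such estimates remain under depth noise, occlusion, and partial motion.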

If this is right

  • Part segmentation improves by up to 50 mIoU compared with prior methods.
  • Articulation and pose errors drop by a factor of 5 to 10.
  • Overall 3D reconstruction accuracy increases substantially.
  • Reconstructed meshes can be exported directly as URDF or USD files for simulation (a minimal URDF sketch follows this list).
  • The output supports hand-guided affordance mapping and robot-scene interaction tasks.
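
On the export claim above: a minimal URDF of the kind such an exporter would emit, with one fused static link and one revolute part. The link and joint names, pivot, axis, and limits are invented placeholders, not FunRec's output format.

```python
# Build and write a skeletal URDF with a single revolute joint.
from xml.etree import ElementTree as ET

robot = ET.Element("robot", name="reconstructed_scene")
ET.SubElement(robot, "link", name="static_scene")   # fused static geometry
ET.SubElement(robot, "link", name="cabinet_door")   # one reconstructed part

joint = ET.SubElement(robot, "joint", name="door_hinge", type="revolute")
ET.SubElement(joint, "parent", link="static_scene")
ET.SubElement(joint, "child", link="cabinet_door")
ET.SubElement(joint, "origin", xyz="1.0 0.5 0.0", rpy="0 0 0")  # estimated pivot
ET.SubElement(joint, "axis", xyz="0 0 1")                        # estimated hinge axis
ET.SubElement(joint, "limit", lower="0", upper="1.57", effort="10", velocity="1")

ET.ElementTree(robot).write("scene.urdf")
```

A simulator such as Isaac Sim (used in the paper's Figure 5) consumes exactly this kind of file; the real exporter would additionally attach the reconstructed meshes as visual and collision geometry.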

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Reconstruction from single-view interaction footage could reduce reliance on multi-camera rigs for indoor mapping.
  • The same interaction signal might help infer object affordances that are invisible in static views.
  • If the pipeline scales, it could let robots learn scene geometry by watching people use the environment rather than by direct scanning.

Load-bearing premise

Ordinary egocentric RGB-D interaction videos contain enough information to reliably discover and parameterize articulated object parts without controlled multi-state captures, CAD priors, or additional supervision.
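
To make the premise precise under a standard articulation model (a common parameterization, not necessarily the paper's exact one): a point x_0 on a part moves to x_t under one of two joint types, and the premise asserts that the observed tracks pin down the joint parameters and per-frame states.

```latex
% Rigid-part motion under the two common joint types.
x_t =
\begin{cases}
  R(a,\theta_t)\,(x_0 - p) + p, & \text{revolute: axis } a,\ \text{pivot } p,\ \text{angle } \theta_t,\\[2pt]
  x_0 + d_t\,a, & \text{prismatic: axis } a,\ \text{displacement } d_t.
\end{cases}
```

One concrete failure mode: for small angles, rotation about a distant pivot is numerically close to translation along the tangent direction, so short or gentle interactions can leave the joint type and pivot weakly constrained without further assumptions.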

What would settle it

A test sequence in which the reconstructed parts and kinematic parameters produce simulated motions that visibly mismatch the actual object movements recorded in the input video.
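
A hedged sketch of how such a test could be scored (names hypothetical, not the paper's evaluation code): replay the estimated revolute joint, per the model above, and measure how far the predicted part points drift from the tracked ones at each timestep. Large drift falsifies the reconstruction.

```python
import numpy as np

def rotation_about_axis(a, theta):
    """Rodrigues' formula: rotation by `theta` about unit axis `a`."""
    a = a / np.linalg.norm(a)
    K = np.array([[0, -a[2], a[1]], [a[2], 0, -a[0]], [-a[1], a[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def replay_error(X0, observed, axis, pivot, angles):
    """Mean predicted-vs-observed distance per timestep under the joint model."""
    errs = []
    for Xt, th in zip(observed, angles):
        pred = (X0 - pivot) @ rotation_about_axis(axis, th).T + pivot
        errs.append(float(np.linalg.norm(pred - Xt, axis=1).mean()))
    return errs

# Toy check: the true axis replays exactly; a wrong axis visibly mismatches.
rng = np.random.default_rng(2)
X0 = rng.normal(size=(30, 3)) + [1.0, 0.0, 0.0]
angles = np.deg2rad([10.0, 20.0, 30.0])
ez = np.array([0.0, 0.0, 1.0])
observed = [X0 @ rotation_about_axis(ez, th).T for th in angles]  # pivot at origin

print("true axis :", replay_error(X0, observed, ez, np.zeros(3), angles))
print("wrong axis:", replay_error(X0, observed, np.array([0.0, 1.0, 0.0]), np.zeros(3), angles))
```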

Figures

Figures reproduced from arXiv: 2604.05621 by Alexandros Delitzas, Alexey Gavryushin, Boyang Sun, Chenyangguang Zhang, Christian Theobalt, Daniel Barath, Francis Engelmann, Leonidas Guibas, Marc Pollefeys, Rishabh Dabral, Tommaso Di Mario.

Figure 1
Figure 1: Real-world functional digital twins. FunRec takes a single egocentric RGB-D interaction video (top) and reconstructs a functional 3D digital twin of the environment (middle). The system automatically identifies articulated scene components, estimates their kinematic parameters along with per-timestep poses, and jointly reconstructs the static scene and each movable part, including interiors (see left an…
Figure 2
Figure 2: Method overview. Given an egocentric RGB-D interaction video, FunRec first divides it into static and dynamic fragments. For each dynamic fragment, it estimates camera poses, computes sparse 3D point trajectories, and clusters them into articulated components via articulation-aware motion modeling. The interacting part is then segmented to obtain dense masks and reconstructed together with the static scene…
Figure 3
Figure 3: Proposed Datasets. We introduce two datasets for functional 3D scene reconstruction and evaluation: RealFun4D (left), capturing egocentric interactions in real scenes, and OmniFun4D (right), providing photorealistic simulated interactions in synthetic scenes. (Panel labels: input video; MonST3R with GT depth + CoTracker3; SpatialTrackerV2 with GT depth + SAM2; FunRec, ours.)
Figure 4
Figure 4: Qualitative comparisons. We show qualitative comparisons between baselines and our method. For each method, we accumulate the reconstructed point clouds of both the articulated part and the static scene across all timesteps, and visualize them under two selected scene states. Green indicates the articulated part, and red lines denote the estimated articulation axes. Real-world scenes: RealFun4D. Our real d…
Figure 6
Figure 6: Hand-scene interaction. Estimated 3D hand mesh (left) and inferred affordance map (right). Integrating the hand pose into our functional reconstruction enables finding contact regions and consistent reasoning over the associated scene-part motion.
Figure 7
Figure 7: Robot-scene interaction. Left: Human demonstration of opening a cabinet. Center: The articulation trajectory derived from the functional scene model. Right: The robot leverages the functional information to reliably reproduce the same interaction.
Figure 5
Figure 5: Isaac Sim deployment. A mobile manipulator interacts with a drawer reconstructed from a real-world scan.
original abstract

We present FunRec, a method for reconstructing functional 3D digital twins of indoor scenes directly from egocentric RGB-D interaction videos. Unlike existing methods on articulated reconstruction, which rely on controlled setups, multi-state captures, or CAD priors, FunRec operates directly on in-the-wild human interaction sequences to recover interactable 3D scenes. It automatically discovers articulated parts, estimates their kinematic parameters, tracks their 3D motion, and reconstructs static and moving geometry in canonical space, yielding simulation-compatible meshes. Across new real and simulated benchmarks, FunRec surpasses prior work by a large margin, achieving up to +50 mIoU improvement in part segmentation, 5-10 times lower articulation and pose errors, and significantly higher reconstruction accuracy. We further demonstrate applications on URDF/USD export for simulation, hand-guided affordance mapping and robot-scene interaction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces FunRec, a method for reconstructing functional 3D digital twins of indoor scenes from egocentric RGB-D interaction videos. It automatically discovers articulated parts, estimates kinematic parameters, tracks 3D motion, and reconstructs static and moving geometry in canonical space to produce simulation-compatible meshes. The method is claimed to work on in-the-wild sequences without controlled setups, multi-state captures, or CAD priors, and demonstrates large quantitative improvements on new benchmarks along with applications in simulation and robotics.

Significance. If the results hold, this work would be significant for enabling scalable, automatic reconstruction of interactable 3D scenes from casual egocentric videos. The reported gains of up to +50 mIoU in part segmentation and 5-10x lower errors in articulation and pose suggest a substantial advance over prior methods that require more controlled conditions. The provision of URDF/USD export and affordance mapping further enhances its practical utility for simulation and robot interaction.

major comments (2)
  1. [Abstract and Evaluation] The abstract reports large quantitative gains (+50 mIoU, 5-10 times lower errors) but provides no method details, error analysis, or ablation studies. This makes it difficult to assess the validity of the claims without the full details in the evaluation section.
  2. [Method (likely §3 or §4)] The central assumption that egocentric RGB-D interaction videos contain sufficient information to uniquely determine articulated kinematics without additional priors is load-bearing. The inverse problem is underconstrained due to limited configuration space coverage in human interactions, egocentric viewpoint correlations, depth ambiguities, and occlusions. The paper should provide concrete evidence or tests showing how ambiguities are resolved without implicit priors.
minor comments (2)
  1. [Abstract] Clarify the exact nature of the new real and simulated benchmarks used for evaluation.
  2. [Evaluation] Ensure all quantitative results include standard deviations or statistical significance to support the 'large margin' claims.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

Thank you for reviewing our manuscript and for the encouraging comments on its potential impact. We provide point-by-point responses to the major comments below.

point-by-point responses
  1. Referee: [Abstract and Evaluation] The abstract reports large quantitative gains (+50 mIoU, 5-10 times lower errors) but provides no method details, error analysis, or ablation studies. This makes it difficult to assess the validity of the claims without the full details in the evaluation section.

    Authors: The abstract serves as a high-level overview of the paper's contributions and key results, following standard academic conventions for brevity. Comprehensive method details are presented in Section 3, while Section 5 (Experiments) contains the full evaluation, including quantitative benchmarks on new real and simulated datasets, error breakdowns for articulation and pose, ablation studies on key components such as part discovery and motion tracking, and supporting analysis. These sections substantiate the claims made in the abstract. revision: no

  2. Referee: [Method (likely §3 or §4)] The central assumption that egocentric RGB-D interaction videos contain sufficient information to uniquely determine articulated kinematics without additional priors is load-bearing. The inverse problem is underconstrained due to limited configuration space coverage in human interactions, egocentric viewpoint correlations, depth ambiguities, and occlusions. The paper should provide concrete evidence or tests showing how ambiguities are resolved without implicit priors.

    Authors: We acknowledge that the inverse problem is challenging and potentially underconstrained. Our method addresses this through a joint optimization that leverages temporal motion consistency from interaction sequences, RGB-D observations, and interaction-induced constraints to discover parts and estimate kinematics without relying on CAD models or multi-state captures. Concrete evidence is provided via extensive quantitative results and ablations on diverse in-the-wild sequences demonstrating accurate recovery despite occlusions and limited motions, as well as comparisons showing large gains over prior methods. We have added further discussion in the revised manuscript on ambiguity handling and included additional failure case analysis. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation introduces independent processing steps

full rationale

The abstract and claims describe a method that processes in-the-wild egocentric RGB-D videos to discover articulated parts, estimate kinematic parameters, track motion, and reconstruct canonical geometry without relying on controlled captures or CAD priors. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text that would reduce any central claim to its own inputs by construction. The approach asserts new capabilities for simulation-compatible outputs and reports quantitative improvements over prior work, indicating the derivation chain is validated against external benchmarks rather than being tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.0 · 5491 in / 1086 out tokens · 37999 ms · 2026-05-10T18:40:20.514680+00:00 · methodology

discussion (0)

