pith. machine review for the scientific record.

arxiv: 2604.05621 · v2 · submitted 2026-04-07 · 💻 cs.CV


FunRec: Reconstructing Functional 3D Scenes from Egocentric Interaction Videos


Pith reviewed 2026-05-10 18:40 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D scene reconstruction · articulated object modeling · egocentric video · functional digital twins · kinematic parameter estimation · simulation-compatible meshes · human-object interaction

The pith

FunRec reconstructs interactable 3D digital twins of indoor scenes from ordinary egocentric RGB-D videos of human interactions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents FunRec as a way to build functional 3D scenes straight from everyday egocentric videos in which people interact with objects. The approach identifies movable parts, figures out how they move, follows their positions over time, and assembles both fixed and moving geometry into a shared coordinate frame that produces meshes ready for simulators. This matters because it removes the need for special capture rigs or pre-existing models, turning casual recordings into usable digital environments. If the method works as described, it opens the possibility of creating simulation-ready twins from ordinary home videos without extra hardware or labels.
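
To make the pipeline's first step concrete, here is a toy, runnable stand-in (an editorial sketch, not FunRec's algorithm) for "identifies movable parts": fit a single rigid motion to all tracked points between two frames, then flag the points the single fit cannot explain as a candidate moving part. All names, numbers, and the threshold are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def kabsch(P, Q):
    """Best-fit rotation R and translation t with Q ≈ P @ R.T + t."""
    cP, cQ = P.mean(0), Q.mean(0)
    U, _, Vt = np.linalg.svd((P - cP).T @ (Q - cQ))
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, cQ - R @ cP

# 80 static scene points and 20 points on a drawer that slides 50 cm.
static0 = rng.uniform(-1, 1, size=(80, 3))
drawer0 = rng.uniform(-0.2, 0.2, size=(20, 3))
P0 = np.vstack([static0, drawer0])
P1 = np.vstack([static0, drawer0 + [0.0, 0.5, 0.0]])

# One rigid fit to everything; points on the moving part become outliers.
R, t = kabsch(P0, P1)
residuals = np.linalg.norm(P1 - (P0 @ R.T + t), axis=1)
moving = residuals > 2 * np.median(residuals)
print("flagged as moving:", int(moving.sum()), "of", len(moving))  # expect ~20
```

FunRec's actual fragment splitting and articulation-aware clustering (see Figure 2 below) are more elaborate, but this is the shape of the signal the paper exploits: interaction makes part points systematic outliers to the scene's dominant rigid motion.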

Core claim

FunRec recovers interactable 3D scenes by processing in-the-wild human interaction sequences from egocentric RGB-D videos. It automatically discovers articulated parts, estimates their kinematic parameters, tracks their 3D motion, and reconstructs static and moving geometry in canonical space, yielding simulation-compatible meshes. The method operates without controlled multi-state captures or CAD priors and is evaluated on new real and simulated benchmarks where it reports large gains in segmentation, articulation accuracy, and reconstruction quality.

What carries the argument

Automatic discovery of articulated parts combined with kinematic parameter estimation from observed human-object interactions in egocentric video.
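
One standard way to estimate a revolute joint from tracked 3D points, shown here as a hedged sketch (the review does not detail the paper's actual estimator): recover the rigid motion between two part states with the Kabsch algorithm, then read the hinge axis and a pivot point off the recovered rotation.

```python
import numpy as np

rng = np.random.default_rng(0)

def rotation_about_axis(a, theta):
    """Rodrigues' formula: rotation by `theta` about unit axis `a`."""
    a = a / np.linalg.norm(a)
    K = np.array([[0, -a[2], a[1]], [a[2], 0, -a[0]], [-a[1], a[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

# Synthetic tracked points on a door rotating 40° about a vertical hinge.
a_true, pivot_true = np.array([0.0, 0.0, 1.0]), np.array([1.0, 0.5, 0.0])
X0 = pivot_true + rng.normal(size=(50, 3))
R_true = rotation_about_axis(a_true, np.deg2rad(40))
X1 = (X0 - pivot_true) @ R_true.T + pivot_true

# Kabsch: best rigid (R, t) mapping X0 onto X1.
c0, c1 = X0.mean(0), X1.mean(0)
U, _, Vt = np.linalg.svd((X0 - c0).T @ (X1 - c1))
d = np.sign(np.linalg.det(Vt.T @ U.T))
R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
t = c1 - R @ c0

# The hinge axis is R's eigenvector with eigenvalue 1 (sign is ambiguous);
# any point on the hinge line solves (I - R) p = t.
w, V = np.linalg.eig(R)
axis = np.real(V[:, np.argmin(np.abs(w - 1.0))])
pivot = np.linalg.lstsq(np.eye(3) - R, t, rcond=None)[0]
angle = np.degrees(np.arccos(np.clip((np.trace(R) - 1) / 2, -1, 1)))
print("axis:", axis.round(3), "pivot:", pivot.round(3), "angle:", angle.round(1))
```

On clean synthetic data this recovers the axis, a point on the hinge, and the opening angle exactly; what the paper has to establish empirically is how stable such estimates remain under depth noise, occlusion, and partial motion.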

If this is right

  • Part segmentation improves by up to 50 mIoU compared with prior methods.
  • Articulation and pose errors drop by a factor of 5 to 10.
  • Overall 3D reconstruction accuracy increases substantially.
  • Reconstructed meshes can be exported directly as URDF or USD files for simulation (a minimal URDF sketch follows this list).
  • The output supports hand-guided affordance mapping and robot-scene interaction tasks.
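
On the export claim above: a minimal URDF of the kind such an exporter would emit, with one fused static link and one revolute part. The link and joint names, pivot, axis, and limits are invented placeholders, not FunRec's output format.

```python
# Build and write a skeletal URDF with a single revolute joint.
from xml.etree import ElementTree as ET

robot = ET.Element("robot", name="reconstructed_scene")
ET.SubElement(robot, "link", name="static_scene")   # fused static geometry
ET.SubElement(robot, "link", name="cabinet_door")   # one reconstructed part

joint = ET.SubElement(robot, "joint", name="door_hinge", type="revolute")
ET.SubElement(joint, "parent", link="static_scene")
ET.SubElement(joint, "child", link="cabinet_door")
ET.SubElement(joint, "origin", xyz="1.0 0.5 0.0", rpy="0 0 0")  # estimated pivot
ET.SubElement(joint, "axis", xyz="0 0 1")                        # estimated hinge axis
ET.SubElement(joint, "limit", lower="0", upper="1.57", effort="10", velocity="1")

ET.ElementTree(robot).write("scene.urdf")
```

A simulator such as Isaac Sim (used in the paper's Figure 5) consumes exactly this kind of file; the real exporter would additionally attach the reconstructed meshes as visual and collision geometry.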

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Reconstruction from single-view interaction footage could reduce reliance on multi-camera rigs for indoor mapping.
  • The same interaction signal might help infer object affordances that are invisible in static views.
  • If the pipeline scales, it could let robots learn scene geometry by watching people use the environment rather than by direct scanning.

Load-bearing premise

Ordinary egocentric RGB-D interaction videos contain enough information to reliably discover and parameterize articulated object parts without controlled multi-state captures, CAD priors, or additional supervision.
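
To make the premise precise under a standard articulation model (a common parameterization, not necessarily the paper's exact one): a point x_0 on a part moves to x_t under one of two joint types, and the premise asserts that the observed tracks pin down the joint parameters and per-frame states.

```latex
% Rigid-part motion under the two common joint types.
x_t =
\begin{cases}
  R(a,\theta_t)\,(x_0 - p) + p, & \text{revolute: axis } a,\ \text{pivot } p,\ \text{angle } \theta_t,\\[2pt]
  x_0 + d_t\,a, & \text{prismatic: axis } a,\ \text{displacement } d_t.
\end{cases}
```

One concrete failure mode: for small angles, rotation about a distant pivot is numerically close to translation along the tangent direction, so short or gentle interactions can leave the joint type and pivot weakly constrained without further assumptions.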

What would settle it

A test sequence in which the reconstructed parts and kinematic parameters produce simulated motions that visibly mismatch the actual object movements recorded in the input video.
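
A hedged sketch of how such a test could be scored (names hypothetical, not the paper's evaluation code): replay the estimated revolute joint, per the model above, and measure how far the predicted part points drift from the tracked ones at each timestep. Large drift falsifies the reconstruction.

```python
import numpy as np

def rotation_about_axis(a, theta):
    """Rodrigues' formula: rotation by `theta` about unit axis `a`."""
    a = a / np.linalg.norm(a)
    K = np.array([[0, -a[2], a[1]], [a[2], 0, -a[0]], [-a[1], a[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def replay_error(X0, observed, axis, pivot, angles):
    """Mean predicted-vs-observed distance per timestep under the joint model."""
    errs = []
    for Xt, th in zip(observed, angles):
        pred = (X0 - pivot) @ rotation_about_axis(axis, th).T + pivot
        errs.append(float(np.linalg.norm(pred - Xt, axis=1).mean()))
    return errs

# Toy check: the true axis replays exactly; a wrong axis visibly mismatches.
rng = np.random.default_rng(2)
X0 = rng.normal(size=(30, 3)) + [1.0, 0.0, 0.0]
angles = np.deg2rad([10.0, 20.0, 30.0])
ez = np.array([0.0, 0.0, 1.0])
observed = [X0 @ rotation_about_axis(ez, th).T for th in angles]  # pivot at origin

print("true axis :", replay_error(X0, observed, ez, np.zeros(3), angles))
print("wrong axis:", replay_error(X0, observed, np.array([0.0, 1.0, 0.0]), np.zeros(3), angles))
```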

Figures

Figures reproduced from arXiv: 2604.05621 by Alexandros Delitzas, Alexey Gavryushin, Boyang Sun, Chenyangguang Zhang, Christian Theobalt, Daniel Barath, Francis Engelmann, Leonidas Guibas, Marc Pollefeys, Rishabh Dabral, Tommaso Di Mario.

Figure 1
Figure 1: Real-world functional digital twins. FunRec takes a single egocentric RGB-D interaction video (top) and reconstructs a functional 3D digital twin of the environment (middle). The system automatically identifies articulated scene components, estimates their kinematic parameters along with per-timestep poses, and jointly reconstructs the static scene and each movable part, including interiors (see left an…
Figure 2
Figure 2: Method overview. Given an egocentric RGB-D interaction video, FunRec first divides it into static and dynamic fragments. For each dynamic fragment, it estimates camera poses, computes sparse 3D point trajectories, and clusters them into articulated components via articulation-aware motion modeling. The interacting part is then segmented to obtain dense masks and reconstructed together with the static scene…
Figure 3
Figure 3: Proposed Datasets. We introduce two datasets for functional 3D scene reconstruction and evaluation: RealFun4D (left), capturing egocentric interactions in real scenes, and OmniFun4D (right), providing photorealistic simulated interactions in synthetic scenes. (Panel labels: input video; MonST3R with GT depth + CoTracker3; SpatialTrackerV2 with GT depth + SAM2; FunRec, ours.)
Figure 4
Figure 4: Qualitative comparisons. We show qualitative comparisons between baselines and our method. For each method, we accumulate the reconstructed point clouds of both the articulated part and the static scene across all timesteps, and visualize them under two selected scene states. Green indicates the articulated part, and red lines denote the estimated articulation axes. Real-world scenes: RealFun4D. Our real d…
Figure 6
Figure 6: Hand-scene interaction. Estimated 3D hand mesh (left) and inferred affordance map (right). Integrating the hand pose into our functional reconstruction enables finding contact regions and consistent reasoning over the associated scene-part motion.
Figure 7
Figure 7: Robot-scene interaction. Left: Human demonstration of opening a cabinet. Center: The articulation trajectory derived from the functional scene model. Right: The robot leverages the functional information to reliably reproduce the same interaction.
Figure 5
Figure 5: Isaac Sim deployment. A mobile manipulator interacts with a drawer reconstructed from a real-world scan.
original abstract

We present FunRec, a method for reconstructing functional 3D digital twins of indoor scenes directly from egocentric RGB-D interaction videos. Unlike existing methods on articulated reconstruction, which rely on controlled setups, multi-state captures, or CAD priors, FunRec operates directly on in-the-wild human interaction sequences to recover interactable 3D scenes. It automatically discovers articulated parts, estimates their kinematic parameters, tracks their 3D motion, and reconstructs static and moving geometry in canonical space, yielding simulation-compatible meshes. Across new real and simulated benchmarks, FunRec surpasses prior work by a large margin, achieving up to +50 mIoU improvement in part segmentation, 5-10 times lower articulation and pose errors, and significantly higher reconstruction accuracy. We further demonstrate applications on URDF/USD export for simulation, hand-guided affordance mapping and robot-scene interaction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces FunRec, a method for reconstructing functional 3D digital twins of indoor scenes from egocentric RGB-D interaction videos. It automatically discovers articulated parts, estimates kinematic parameters, tracks 3D motion, and reconstructs static and moving geometry in canonical space to produce simulation-compatible meshes. The method is claimed to work on in-the-wild sequences without controlled setups, multi-state captures, or CAD priors, and demonstrates large quantitative improvements on new benchmarks along with applications in simulation and robotics.

Significance. If the results hold, this work would be significant for enabling scalable, automatic reconstruction of interactable 3D scenes from casual egocentric videos. The reported gains of up to +50 mIoU in part segmentation and 5-10x lower errors in articulation and pose suggest a substantial advance over prior methods that require more controlled conditions. The provision of URDF/USD export and affordance mapping further enhances its practical utility for simulation and robot interaction.

major comments (2)
  1. [Abstract and Evaluation] The abstract reports large quantitative gains (+50 mIoU, 5-10 times lower errors) but provides no method details, error analysis, or ablation studies. This makes it difficult to assess the validity of the claims without the full details in the evaluation section.
  2. [Method (likely §3 or §4)] The central assumption that egocentric RGB-D interaction videos contain sufficient information to uniquely determine articulated kinematics without additional priors is load-bearing. The inverse problem is underconstrained due to limited configuration space coverage in human interactions, egocentric viewpoint correlations, depth ambiguities, and occlusions. The paper should provide concrete evidence or tests showing how ambiguities are resolved without implicit priors.
minor comments (2)
  1. [Abstract] Clarify the exact nature of the new real and simulated benchmarks used for evaluation.
  2. [Evaluation] Ensure all quantitative results include standard deviations or statistical significance to support the 'large margin' claims.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

Thank you for reviewing our manuscript and for the encouraging comments on its potential impact. We provide point-by-point responses to the major comments below.

point-by-point responses
  1. Referee: [Abstract and Evaluation] The abstract reports large quantitative gains (+50 mIoU, 5-10 times lower errors) but provides no method details, error analysis, or ablation studies. This makes it difficult to assess the validity of the claims without the full details in the evaluation section.

    Authors: The abstract serves as a high-level overview of the paper's contributions and key results, following standard academic conventions for brevity. Comprehensive method details are presented in Section 3, while Section 5 (Experiments) contains the full evaluation, including quantitative benchmarks on new real and simulated datasets, error breakdowns for articulation and pose, ablation studies on key components such as part discovery and motion tracking, and supporting analysis. These sections substantiate the claims made in the abstract. revision: no

  2. Referee: [Method (likely §3 or §4)] The central assumption that egocentric RGB-D interaction videos contain sufficient information to uniquely determine articulated kinematics without additional priors is load-bearing. The inverse problem is underconstrained due to limited configuration space coverage in human interactions, egocentric viewpoint correlations, depth ambiguities, and occlusions. The paper should provide concrete evidence or tests showing how ambiguities are resolved without implicit priors.

    Authors: We acknowledge that the inverse problem is challenging and potentially underconstrained. Our method addresses this through a joint optimization that leverages temporal motion consistency from interaction sequences, RGB-D observations, and interaction-induced constraints to discover parts and estimate kinematics without relying on CAD models or multi-state captures. Concrete evidence is provided via extensive quantitative results and ablations on diverse in-the-wild sequences demonstrating accurate recovery despite occlusions and limited motions, as well as comparisons showing large gains over prior methods. We have added further discussion in the revised manuscript on ambiguity handling and included additional failure case analysis. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation introduces independent processing steps

full rationale

The abstract and claims describe a method that processes in-the-wild egocentric RGB-D videos to discover articulated parts, estimate kinematic parameters, track motion, and reconstruct canonical geometry without relying on controlled captures or CAD priors. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text that would reduce any central claim to its own inputs by construction. The approach asserts new capabilities for simulation-compatible outputs and reports quantitative improvements over prior work, indicating the derivation chain is validated against external benchmarks rather than being tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.0 · 5491 in / 1086 out tokens · 37999 ms · 2026-05-10T18:40:20.514680+00:00 · methodology

discussion (0)

