pith. machine review for the scientific record.

arxiv: 2604.11038 · v1 · submitted 2026-04-13 · 💻 cs.CV

Recognition: unknown

EgoFun3D: Modeling Interactive Objects from Egocentric Videos using Function Templates

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 16:29 UTC · model grok-4.3

classification 💻 cs.CV
keywords egocentric videos · interactive 3D objects · function templates · 3D reconstruction · articulation estimation · embodied AI · simulation · part segmentation

The pith

Function templates capture cross-part interactions in 3D objects from egocentric videos and compile directly into simulation code.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to obtain simulation-ready 3D models of interactive objects directly from egocentric video by representing how one part affects another through function templates. These templates record general mappings, such as knob rotation controlling burner temperature, and turn them into executable code for any simulator. A new dataset supplies 271 real-world videos with paired 3D geometry, part labels, articulations, and template annotations. A four-stage pipeline performs part segmentation, reconstruction, articulation estimation, and template inference. The benchmark reveals that existing methods struggle with the full task, pointing to needed improvements in functional reasoning from video.
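
Read concretely, a function template is a small record: which part is actuated (the receptor), which part responds (the effector), which physical effect changes, and what numerical mapping links the two part states. The sketch below is a minimal reading of that idea; the class and field names are hypothetical rather than the authors' released code, though the effect and mapping vocabularies (geometry, illumination, temperature, and fluid change; binary, step, linear, and cumulative) follow the choices named in the paper's annotation prompts.

```python
# Minimal sketch of a function template as a data record; names are hypothetical,
# not the paper's API. The effect and mapping vocabularies mirror the four physical
# effects and four mapping types named in the paper's annotations.
from dataclasses import dataclass
from typing import Callable

EFFECTS = ("geometry", "illumination", "temperature", "fluid")   # what the effector changes
MAPPINGS = ("binary", "step", "linear", "cumulative")            # receptor state -> effector state

@dataclass
class FunctionTemplate:
    receptor: str          # part the human actuates, e.g. "stove_knob"
    effector: str          # part whose state changes, e.g. "burner"
    effect: str            # one of EFFECTS
    mapping: str           # one of MAPPINGS
    scale: float = 1.0     # gain for the numerical mapping

    def to_callable(self) -> Callable[[float], float]:
        """Compile the template into a plain function from receptor state to effector state."""
        if self.mapping == "binary":
            return lambda x: self.scale * (1.0 if x > 0.0 else 0.0)
        if self.mapping == "step":
            return lambda x: self.scale * float(int(x))
        if self.mapping == "linear":
            return lambda x: self.scale * x
        # "cumulative": the effector state integrates receptor input over successive calls
        total = 0.0
        def accumulate(delta: float) -> float:
            nonlocal total
            total += self.scale * delta
            return total
        return accumulate

# Example: knob rotation (radians) drives burner temperature linearly.
knob_to_burner = FunctionTemplate("stove_knob", "burner", "temperature", "linear", scale=50.0)
warm = knob_to_burner.to_callable()
print(warm(1.2))   # 60.0 units of burner state for 1.2 rad of knob rotation
```

Keeping the mapping separate from the geometry is what lets the same record be re-emitted as code for different simulators, which is the portability claim the paper leans on.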

Core claim

We present EgoFun3D, a coordinated task formulation, dataset, and benchmark for modeling interactive 3D objects from egocentric videos. While prior work largely focuses on articulations, we capture general cross-part functional mappings through function templates, a structured computational representation that enables precise evaluation and direct compilation into executable code across simulation platforms. To enable comprehensive benchmarking, we introduce a dataset of 271 egocentric videos featuring challenging real-world interactions with paired 3D geometry, segmentation over 2D and 3D, articulation and function template annotations. To tackle the task, we propose a 4-stage pipeline consisting of 2D part segmentation, reconstruction, articulation estimation, and function template inference.

What carries the argument

Function templates, a structured computational representation that records general cross-part functional mappings and compiles them into executable simulation code.

If this is right

  • Interactive 3D models obtained from video can be executed as code in multiple simulation environments without manual reprogramming.
  • Functional accuracy of object models can be measured quantitatively by comparing template outputs to observed part behaviors.
  • A shared benchmark of video, geometry, and templates allows direct comparison of methods for segmentation, reconstruction, and functional inference.
  • Separation of geometry from function allows the same template to apply across objects that share interaction patterns.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Robotic systems could acquire manipulation skills by converting everyday human videos into simulated practice environments.
  • The templates could serve as an intermediate layer for language models to describe or predict object functions from visual input.
  • Extending the approach to longer video sequences might capture time-dependent functions such as heating rates or locking mechanisms.
  • Synthetic data generated from the compiled templates could augment training sets for vision models that must recognize functional relationships.

Load-bearing premise

Egocentric videos contain enough visual information for the four-stage pipeline to recover accurate 3D geometry, articulations, and function templates despite real-world variations in lighting and viewpoint.

What would settle it

Run the full pipeline on a held-out set of videos and check whether the resulting function templates, when executed in a simulator, reproduce the exact interaction outcomes observed in the original footage.
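
As a procedure, that test could look like the sketch below. Every function and attribute name here is a placeholder rather than the paper's code; `run_pipeline` and `simulate` stand in for the four-stage pipeline and a simulator rollout, and the tolerance is arbitrary.

```python
# Placeholder shape of the settling experiment; run_pipeline and simulate are
# supplied by the caller and stand in for the pipeline and a simulator rollout.
def functional_agreement(videos, run_pipeline, simulate, tolerance=0.05):
    """Fraction of held-out videos whose inferred template reproduces the observed outcome."""
    hits, total = 0, 0
    for video in videos:
        template = run_pipeline(video)                              # segmentation -> reconstruction -> articulation -> template
        predicted = simulate(template, video.receptor_trajectory)   # roll the compiled template forward
        observed = video.effector_trajectory                        # effector states annotated from the footage
        if max(abs(p - o) for p, o in zip(predicted, observed)) <= tolerance:
            hits += 1
        total += 1
    return hits / max(total, 1)
```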

Figures

Figures reproduced from arXiv: 2604.11038 by Denys Iliash, Manolis Savva, Weikun Peng.

Figure 1
Figure 1: We present EgoFun3D: a coordinated task, dataset and benchmark for modeling interactive 3D objects from egocentric videos. Given an egocentric video as an input, the output is a simulation-ready interactive object (e.g., faucet handle starts water flow from faucet spout). We break down the task into 4 steps to propose a baseline approach using off-the-shelf components. Our function template representation,… view at source ↗
Figure 2
Figure 2: Illustration of a typical form of human-object interaction. An agent interacts with a receptor, changing its state. Part functionality defines how the state change of the receptor maps to the state change of the effector. On the right, we provide an example of a human interacting with a knob of the stove. The part function triggers the temperature change of the burner after knob actuation. view at source ↗
Figure 3
Figure 3: Our baseline framework. We break down the task into 4 steps that are individually targeted with off-the-shelf components. First, a VLM generates part descriptions which are used to segment the parts in the video. Then, the geometry of the receptor and the effector are reconstructed, articulation parameters are estimated, and the function template is inferred. These outputs are combined to build the inter… view at source ↗
Figure 4
Figure 4: Examples of annotations in our dataset. We provide 2D segmentation masks for hands, receptor (in teal), effector (in orange), and the whole object. We annotate part segmentation for receptor and effector on reconstructed 3D meshes. For articulation, we annotate revolute and prismatic joints, shown as red and green arrows respectively. For the function template, we pick one of four physical effects and one … view at source ↗
Figure 5
Figure 5: Our egocentric video dataset distributions across object categories. There are prominent long tail distributions across categories, physical effects, and function mappings, primarily due to inherited biases from source datasets such as Ego-Exo4D. view at source ↗
Figure 6
Figure 6: Example 2D segmentation results. We find that SAM3 with Qwen3-VL provides the best segmentation. The main challenges in this subtask are segmentation of incorrect parts (left) and confusion between part instances across frames (middle). Performance on videos featuring more static viewpoints and no part instance ambiguity is better, though such videos are rare (right). view at source ↗
Figure 7
Figure 7: Example results for reconstruction. MapAnything exhibits severe drifting issues as predicted camera poses for different video frames are inaccurate. Other approaches also exhibit significant artifacts. Overall, reconstruction from our egocentric video data is highly challenging for all methods. view at source ↗
Figure 8
Figure 8: Example results for articulation estimation. Red arrows refer to revolute joints and green arrows refer to prismatic joints. In the left example, iTACO predicts incorrect joint types, whereas Artipoint is correct. However, both methods struggle with small parts such as the stove knob shown here. view at source ↗
Figure 9
Figure 9: Qualitative results of the final outputs of our system. The first two rows illustrate two interactive faucets in Genesis [2]. The last row illustrates an interactive stove in BEHAVIOR-1K [28]. We use teal to indicate receptors and orange to indicate effectors. Red and green arrows represent revolute and prismatic joints respectively. view at source ↗
read the original abstract

We present EgoFun3D, a coordinated task formulation, dataset, and benchmark for modeling interactive 3D objects from egocentric videos. Interactive objects are of high interest for embodied AI but scarce, making modeling from readily available real-world videos valuable. Our task focuses on obtaining simulation-ready interactive 3D objects from egocentric video input. While prior work largely focuses on articulations, we capture general cross-part functional mappings (e.g., rotation of stove knob controls stove burner temperature) through function templates, a structured computational representation. Function templates enable precise evaluation and direct compilation into executable code across simulation platforms. To enable comprehensive benchmarking, we introduce a dataset of 271 egocentric videos featuring challenging real-world interactions with paired 3D geometry, segmentation over 2D and 3D, articulation and function template annotations. To tackle the task, we propose a 4-stage pipeline consisting of: 2D part segmentation, reconstruction, articulation estimation, and function template inference. Comprehensive benchmarking shows that the task is challenging for off-the-shelf methods, highlighting avenues for future work.
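
A rough sketch of how those four stages compose is given below; each stage is left as a stand-in callable for the off-the-shelf component the paper plugs in, and the container and argument names are hypothetical, not the authors' interface.

```python
# Hedged sketch of the 4-stage composition; every stage here is a caller-supplied
# callable standing in for an off-the-shelf component, and the names are illustrative.
from typing import Any, Callable, NamedTuple

class InteractiveObject(NamedTuple):
    parts: Any       # reconstructed receptor/effector geometry
    joint: Any       # articulation parameters (revolute or prismatic)
    template: Any    # inferred cross-part function template

def egofun3d_baseline(video,
                      segment_parts: Callable,          # stage 1: 2D receptor/effector masks per frame
                      reconstruct: Callable,            # stage 2: part geometry from the masked frames
                      estimate_articulation: Callable,  # stage 3: joint type, axis, and origin
                      infer_template: Callable,         # stage 4: functional mapping between parts
                      ) -> InteractiveObject:
    masks = segment_parts(video)
    parts = reconstruct(video, masks)
    joint = estimate_articulation(parts, masks)
    template = infer_template(video, parts, joint)
    return InteractiveObject(parts, joint, template)
```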

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents EgoFun3D as a new task formulation, dataset (271 egocentric videos with paired 3D geometry, 2D/3D segmentation, articulation, and function template annotations), and benchmark for obtaining simulation-ready interactive 3D objects from egocentric video. It introduces function templates as a structured computational representation capturing general cross-part functional mappings (e.g., knob rotation controlling burner temperature) that support precise evaluation and direct compilation to executable code in simulators. A 4-stage pipeline (2D part segmentation, reconstruction, articulation estimation, function template inference) is proposed to tackle the task, with benchmarking showing that off-the-shelf methods struggle.

Significance. If the pipeline and function templates can be shown to work reliably, the work would be significant for embodied AI by addressing the scarcity of interactive 3D models through modeling from common egocentric videos. The dataset and annotations provide a valuable new benchmark, while function templates offer a novel, structured alternative to articulation-only representations that could enable more general and portable simulation. The emphasis on direct compilability to code is a strength if demonstrated.

major comments (2)
  1. [Abstract and Evaluation/Results] The abstract and evaluation sections state that 'comprehensive benchmarking shows that the task is challenging for off-the-shelf methods' but report no quantitative metrics (accuracy, error rates, simulation fidelity, or end-to-end success rates) for the authors' own 4-stage pipeline on the 271-video dataset. This absence directly undermines the central claim that the pipeline produces usable simulation-ready objects via function templates.
  2. [Abstract and §4 (Pipeline description)] The claim that function templates 'enable direct compilation into executable code across simulation platforms' (abstract) is presented without any demonstrated examples, compilation procedure, or fidelity metrics in the pipeline description or results; this is load-bearing for the novelty of the representation.
minor comments (2)
  1. [Dataset section] Clarify the exact criteria for video selection and annotation protocol for the 271-video dataset to support reproducibility.
  2. [Figures] Ensure figures showing pipeline outputs include clear comparisons to ground-truth annotations and simulation results.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive review of our manuscript. We address each major comment point by point below, providing clarifications and indicating planned revisions where appropriate.

read point-by-point responses
  1. Referee: [Abstract and Evaluation/Results] The abstract and evaluation sections state that 'comprehensive benchmarking shows that the task is challenging for off-the-shelf methods' but report no quantitative metrics (accuracy, error rates, simulation fidelity, or end-to-end success rates) for the authors' own 4-stage pipeline on the 271-video dataset. This absence directly undermines the central claim that the pipeline produces usable simulation-ready objects via function templates.

    Authors: We appreciate the referee for highlighting this issue. The benchmarking in the manuscript applies off-the-shelf methods to the individual stages of the proposed 4-stage pipeline and reports their performance on the EgoFun3D dataset to demonstrate the inherent challenges of the task. The pipeline itself is presented as an initial baseline approach rather than a fully optimized solution with claimed high-fidelity outputs. The abstract and results sections emphasize the new task, dataset, and function template representation, along with the difficulties faced by existing methods, without asserting end-to-end quantitative superiority for the pipeline. We agree that adding quantitative metrics would strengthen the presentation of the pipeline's utility for producing simulation-ready objects. In the revised manuscript, we will add a dedicated evaluation subsection reporting metrics such as part segmentation IoU, articulation parameter errors, and function template matching accuracy on the 271-video dataset. revision: yes

  2. Referee: [Abstract and §4 (Pipeline description)] The claim that function templates 'enable direct compilation into executable code across simulation platforms' (abstract) is presented without any demonstrated examples, compilation procedure, or fidelity metrics in the pipeline description or results; this is load-bearing for the novelty of the representation.

    Authors: We thank the referee for this observation. Function templates are introduced in the manuscript as a structured representation capturing cross-part functional mappings, with the abstract noting their support for precise evaluation and direct compilation to executable code. The current version defines the template structure and its role in the pipeline but does not include concrete compilation examples or fidelity metrics. This capability is positioned as a core advantage for portability and simulator integration. We acknowledge that explicit demonstration is needed to fully substantiate the claim. In the revision, we will expand Section 4 to include an illustrative example of compiling a function template to code for a standard simulator (e.g., with pseudocode or a specific platform), along with a general description of the compilation procedure. revision: yes
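
To make the promised illustration concrete, one hedged reading of "compiling a function template into executable code" is emitting a per-step update function from the template's mapping type. The emitted snippet and the dict-based state below are illustrative assumptions, not the authors' compiler or any specific simulator's API.

```python
# Illustrative compiler from a template's mapping type to a per-step Python update
# function; the generated code and the simulator hook are hypothetical.
def compile_template_to_python(receptor_joint: str, effector_prop: str,
                               mapping: str, scale: float = 1.0) -> str:
    """Emit an update(state, dq) function as Python source for a generic simulation loop."""
    body = {
        "binary":     f"{scale} * (1.0 if q > 0.0 else 0.0)",
        "step":       f"{scale} * float(int(q))",
        "linear":     f"{scale} * q",
        "cumulative": f"state.get('{effector_prop}', 0.0) + {scale} * dq",
    }[mapping]
    return (
        "def update(state, dq):\n"
        f"    # read receptor joint '{receptor_joint}', write effector property '{effector_prop}'\n"
        f"    q = state['{receptor_joint}']\n"
        f"    state['{effector_prop}'] = {body}\n"
        "    return state\n"
    )

# Example: a linear knob-to-burner template compiled, then executed against a dict state.
src = compile_template_to_python("stove_knob", "burner_temperature", "linear", scale=50.0)
namespace = {}
exec(src, namespace)                        # materialize the generated update function
state = {"stove_knob": 0.5, "burner_temperature": 0.0}
print(namespace["update"](state, dq=0.5))   # burner_temperature becomes 25.0
```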

Circularity Check

0 steps flagged

No significant circularity; new task, dataset, and pipeline are independent contributions

full rationale

The paper introduces a new coordinated task, 271-video dataset with independent annotations (2D/3D segmentation, articulations, function templates), and a 4-stage pipeline (2D segmentation, reconstruction, articulation estimation, function template inference). Function templates are defined as a novel structured representation for cross-part mappings without being derived from or equivalent to the pipeline outputs by construction. Benchmarking evaluates off-the-shelf methods rather than claiming predictions from fitted parameters on the same data. No self-citations, uniqueness theorems, or ansatzes are invoked in a load-bearing way that reduces the central claims to prior author work or self-referential definitions. The derivation chain consists of empirical task formulation and data collection that remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Central claim rests on domain assumptions about video sufficiency for inference and introduces function templates as a new representation; no free parameters or other invented entities detailed in abstract.

axioms (1)
  • domain assumption Egocentric videos contain sufficient information to infer 3D geometry, articulations, and functional mappings via the proposed pipeline.
    Invoked in the task definition and 4-stage pipeline description.
invented entities (1)
  • Function templates · no independent evidence
    purpose: Structured computational representation for cross-part functional mappings enabling precise evaluation and code compilation.
    New concept introduced to go beyond articulations; no independent evidence provided in abstract.

pith-pipeline@v0.9.0 · 5493 in / 1421 out tokens · 66237 ms · 2026-05-10T16:29:36.108340+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

65 extracted references · 11 canonical work pages · 4 internal anchors

  1. [1] Abou-Chakra, J., Rana, K., Dayoub, F., Suenderhauf, N.: Physically Embodied Gaussian Splatting: A Realtime Correctable World Model for Robotics. In: 8th Annual Conference on Robot Learning (2024)
  2. [2] Authors, G.: Genesis: A Generative and Universal Physics Engine for Robotics and Beyond (December 2024), https://github.com/Genesis-Embodied-AI/Genesis
  3. [3] Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., Ge, W., Guo, Z., Huang, Q., Huang, J., Huang, F., Hui, B., Jiang, S., Li, Z., Li, M., Li, M., Li, K., Lin, Z., Lin, J., Liu, X., Liu, J., Liu, C., Liu, Y., Liu, D., Liu, S., Lu, D., Luo, R., Lv, C., Men, R., Meng, L., Ren, X., Ren, X., Song, S., Sun, Y., Tang, …: Qwen3-VL Technical Report
  4. [4] Buechner, M., Roefer, A., Engelbracht, T., Welschehold, T., Bauer, Z., Blum, H., Pollefeys, M., Valada, A.: Articulated 3D Scene Graphs for Open-World Mobile Manipulation. arXiv preprint arXiv:2602.16356 (2026)
  5. [5] Cao, Z., Chen, Z., Pan, L., Liu, Z.: PhysX-3D: Physical-Grounded 3D Asset Generation. In: NeurIPS (2025)
  6. [6] Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala, K.V., Khedr, H., Huang, A., Lei, J., Ma, T., Guo, B., Kalla, A., Marks, M., Greer, J., Wang, M., Sun, P., Rädle, R., Afouras, T., Mavroudi, E., Xu, K., Wu, T.H., Zhou, Y., Momeni, L., Hazra, R., Ding, S., Vaze, S., Porcher, F., Li, F., Li, S., Kamath, A., Cheng, H.K., …
  7. [7] Chen, C., Liu, I., Wei, X., Su, H., Liu, M.: FreeArt3D: Training-Free Articulated Object Generation using 3D Diffusion. In: Proceedings of the SIGGRAPH Asia 2025 Conference Papers (2025)
  8. [8] Chen, H., Lan, Y., Chen, Y., Pan, X.: ArtiLatent: Realistic Articulated 3D Object Generation via Structured Latents. In: Proceedings of the SIGGRAPH Asia 2025 Conference Papers (2025)
  9. [9] Chen, Z., Walsman, A., Memmel, M., Mo, K., Fang, A., Vemuri, K., Wu, A., Fox, D., Gupta, A.: URDFormer: A Pipeline for Constructing Articulated Simulation Environments from Real-World Images. In: Robotics: Science and Systems (RSS) (2024)
  10. [10] Clark, C., Zhang, J., Ma, Z., Park, J.S., Salehi, M., Tripathi, R., Lee, S., Ren, Z., Kim, C.D., Yang, Y., Shao, V., Yang, Y., Huang, W., Gao, Z., Anderson, T., Zhang, J., Jain, J., Stoica, G., Han, W., Farhadi, A., Krishna, R.: Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding. arXiv (2026)
  11. [11] Corsetti, J., Giuliari, F., Fasoli, A., Boscaini, D., Poiesi, F.: Functionality understanding and segmentation in 3D scenes. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 24550–24559 (2025)
  12. [12] Dan, P., Kedia, K., Chao, A., Duan, E.W., Pace, M.A., Ma, W.C., Choudhury, S.: X-Sim: Cross-Embodiment Learning via Real-to-Sim-to-Real. CoRL (2025)
  13. [13] Delitzas, A., Takmaz, A., Tombari, F., Sumner, R., Pollefeys, M., Engelmann, F.: SceneFun3D: Fine-Grained Functionality and Affordance Understanding in 3D Scenes. In: CVPR (2024)
  14. [14] Edstedt, J., Sun, Q., Bökman, G., Wadenbäck, M., Felsberg, M.: RoMa: Robust Dense Feature Matching. In: IEEE Conference on Computer Vision and Pattern Recognition (2024)
  15. [15] Engelbracht, T., Zurbrügg, R., Wohlrapp, M., Büchner, M., Valada, A., Pollefeys, M., Blum, H., Bauer, Z.: Hoi! A Multimodal Dataset for Force-Grounded, Cross-View Articulated Manipulation. In: CVPR (2026)
  16. [16] Escontrela, A., Kerr, J., Allshire, A., Frey, J., Duan, R., Sferrazza, C., Abbeel, P.: GaussGym: An Open-Source Real-To-Sim Framework for Learning Locomotion from Pixels. CoRR (2025)
  17. [17] Gemini Team: A new era of intelligence with Gemini 3 (2025), https://blog.google/products-and-platforms/products/gemini/gemini-3/
  18. [18] Grauman, K., Westbury, A., Torresani, L., Kitani, K., Malik, J., Afouras, T., Ashutosh, K., Baiyya, V., Bansal, S., Boote, B., et al.: Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives. In: CVPR (2024)
  19. [19] Gu, Q., Sheng, Y., Yu, J., Tang, J., Shan, X., Shen, Z., Yi, T., Liang, X., Chen, X., Wang, Y.: ArtiSG: Functional 3D Scene Graph Construction via Human-demonstrated Articulated Objects Manipulation. arXiv preprint arXiv:2512.24845 (2025)
  20. [20] Halacheva, A.M., Miao, Y., Zaech, J.N., Wang, X., Van Gool, L., Paudel, D.P.: Articulate3D: Holistic Understanding of 3D Scenes as Universal Scene Description. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2025)
  21. [21] Huang, J., Zhou, Q., Rabeti, H., Korovko, A., Ling, H., Ren, X., Shen, T., Gao, J., Slepichev, D., Lin, C.H., Ren, J., Xie, K., Biswas, J., Leal-Taixe, L., Fidler, S.: ViPE: Video Pose Engine for 3D Geometric Perception. arXiv preprint arXiv:2508.10934 (2025)
  22. [22] Jiang, H., Hsu, H.Y., Zhang, K., Yu, H.N., Wang, S., Li, Y.: PhysTwin: Physics-informed Reconstruction and Simulation of Deformable Objects from Videos. ICCV (2025)
  23. [23] Jiang, H., Mao, Y., Savva, M., Chang, A.X.: OPD: Single-view 3D Openable Part Detection. In: European Conference on Computer Vision. pp. 410–426. Springer (2022)
  24. [24] Jin, Z., Che, Z., Zhao, Z., Wu, K., Zhang, Y., Zhao, Y., Liu, Z., Zhang, Q., Ju, X., Tian, J., et al.: ArtVIP: Articulated Digital Assets of Visual Realism, Modular Interaction, and Physical Fidelity for Robot Learning. In: ICLR (2026)
  25. [25] Keetha, N., Müller, N., Schönberger, J., Porzi, L., Zhang, Y., Fischer, T., Knapitsch, A., Zauss, D., Weber, E., Antunes, N., Luiten, J., Lopez-Antequera, M., Bulò, S.R., Richardt, C., Ramanan, D., Scherer, S., Kontschieder, P.: MapAnything: Universal Feed-Forward Metric 3D Reconstruction. In: International Conference on 3D Vision (2026)
  26. [26] Kolve, E., Mottaghi, R., Han, W., VanderBilt, E., Weihs, L., Herrasti, A., Deitke, M., Ehsani, K., Gordon, D., Zhu, Y., et al.: AI2-THOR: An Interactive 3D Environment for Visual AI. arXiv preprint arXiv:1712.05474 (2017)
  27. [27] Le, L., Xie, J., Liang, W., Wang, H.J., Yang, Y., Ma, Y.J., Vedder, K., Krishna, A., Jayaraman, D., Eaton, E.: Articulate-Anything: Automatic Modeling of Articulated Objects via a Vision-Language Foundation Model. In: International Conference on Learning Representations (ICLR) (2025)
  28. [28] Li, C., Zhang, R., Wong, J., Gokmen, C., Srivastava, S., Martín-Martín, R., Wang, C., Levine, G., Lingelbach, M., Sun, J., et al.: BEHAVIOR-1K: A Human-Centered, Embodied AI Benchmark with 1,000 Everyday Activities and Realistic Simulation. In: Conference on Robot Learning. pp. 80–93. PMLR (2023)
  29. [29] Li, Z., Zhang, C., Li, Z., Howard-Jenkins, H., Lv, Z., Geng, C., Wu, J., Newcombe, R., Engel, J., Dong, Z.: ART: Articulated Reconstruction Transformer. In: CVPR (2025)
  30. [30] Lin, H., Chen, S., Liew, J.H., Chen, D.Y., Li, Z., Shi, G., Feng, J., Kang, B.: Depth Anything 3: Recovering the Visual Space from Any Views. arXiv preprint arXiv:2511.10647 (2025)
  31. [31] Liu, C., Zhai, W., Yang, Y., Luo, H., Liang, S., Cao, Y., Zha, Z.J.: Grounding 3D Scene Affordance From Egocentric Interactions. arXiv preprint arXiv:2409.19650 (2024)
  32. [32] Liu, J., Iliash, D., Chang, A.X., Savva, M., Mahdavi-Amiri, A.: SINGAPO: Single Image Controlled Generation of Articulated Parts in Object. In: International Conference on Learning Representations (ICLR) (2025)
  33. [33] Liu, J., Mahdavi-Amiri, A., Savva, M.: PARIS: Part-level Reconstruction and Motion Analysis for Articulated Objects. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2023)
  34. [34] Liu, J., Savva, M., Mahdavi-Amiri, A.: Survey on Modeling of Human-made Articulated Objects. In: Computer Graphics Forum. vol. 44, p. e70092. Wiley Online Library (2025)
  35. [35] Liu, Y., Jia, B., Lu, R., Gan, C., Chen, H., Ni, J., Zhu, S.C., Huang, S.: VideoArtGS: Building Digital Twins of Articulated Objects from Monocular Video. arXiv preprint arXiv:2509.17647 (2025)
  36. [36] Liu, Y., Jia, B., Lu, R., Ni, J., Zhu, S.C., Huang, S.: Building Interactable Replicas of Complex Articulated Objects via Gaussian Splatting. In: International Conference on Learning Representations (ICLR) (2025)
  37. [37] Liu, Y., Qu, T., Zhong, Z., Peng, B., Liu, S., Yu, B., Jia, J.: VisionReasoner: Unified Reasoning-Integrated Visual Perception via Reinforcement Learning. In: ICLR (2026)
  38. [38] Mandi, Z., Weng, Y., Bauer, D., Song, S.: Real2Code: Reconstruct Articulated Objects via Code Generation. In: International Conference on Learning Representations (ICLR) (2025)
  39. [39] Ning, C., Fang, K., Ma, W.C.: Prompting with the Future: Open-World Model Predictive Control with Interactive Digital Twins. In: RSS (2025)
  40. [40] NVIDIA: Isaac Sim, https://github.com/isaac-sim/IsaacSim
  41. [41] OpenAI: Introducing GPT-5. https://openai.com/index/introducing-gpt-5/ (2025)
  42. [42] Pataki, Z., Sarlin, P.E., Schönberger, J.L., Pollefeys, M.: MP-SfM: Monocular Surface Priors for Robust Structure-from-Motion. In: CVPR (2025)
  43. [43] Peng, W., Lv, J., Lu, C., Savva, M.: iTACO: Interactable Digital Twins of Articulated Objects from Casually Captured RGBD Videos. In: International Conference on 3D Vision (2026)
  44. [44] Peng, W., Lv, J., Zeng, Y., Chen, H., Zhao, S., Sun, J., Lu, C., Shao, L.: TieBot: Learning to Knot a Tie from Visual Demonstration through a Real-to-Sim-to-Real Approach. In: 8th Annual Conference on Robot Learning (2024)
  45. [45] Perrett, T., Darkhalil, A., Sinha, S., Emara, O., Pollard, S., Parida, K., Liu, K., Gatti, P., Bansal, S., Flanagan, K., Chalk, J., Zhu, Z., Guerrier, R., Abdelazim, F., Zhu, B., Moltisanti, D., Wray, M., Doughty, H., Damen, D.: HD-EPIC: A Highly-Detailed Egocentric Video Dataset. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2025)
  46. [46] Polycam: Polycam (2025), https://poly.cam/
  47. [47] Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., Mintun, E., Pan, J., Alwala, K.V., Carion, N., Wu, C.Y., Girshick, R., Dollár, P., Feichtenhofer, C.: SAM 2: Segment Anything in Images and Videos. arXiv preprint arXiv:2408.00714 (2024)
  48. [48] Record3D: Record3D (2025), https://record3d.app/
  49. [49] Rotondi, D., Scaparro, F., Blum, H., Arras, K.O.: FunGraph: Functionality Aware 3D Scene Graphs for Language-Prompted Scene Interaction. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (2025)
  50. [50] Sherrington, C.S.: The Integrative Action of the Nervous System. In: Scientific and Medical Knowledge Production, 1796-1918, pp. 217–253. Routledge (2023)
  51. [51] Siddiqui, Y., Frost, D., Aroudj, S., Avetisyan, A., Howard-Jenkins, H., DeTone, D., Moulon, P., Wu, Q., Li, Z., Straub, J., Newcombe, R., Engel, J.: ShapeR: Robust Conditional 3D Shape Generation from Casual Captures. In: CVPR (2026)
  52. [52] Todorov, E., Erez, T., Tassa, Y.: MuJoCo: A physics engine for model-based control. In: IEEE/RSJ International Conference on Intelligent Robots and Systems. pp. 5026–5033. IEEE (2012). https://doi.org/10.1109/IROS.2012.6386109
  53. [53] Torne, M., Simeonov, A., Li, Z., Chan, A., Chen, T., Gupta, A., Agrawal, P.: Reconciling Reality Through Simulation: A Real-to-Sim-to-Real Approach for Robust Manipulation. In: RSS (2024)
  54. [54] Wang, H., Qiao, L., Jie, Z., Huang, Z., Feng, C., Zheng, Q., Ma, L., Lan, X., Liang, X.: X-SAM: From Segment Anything to Any Segmentation. In: AAAI (2026)
  55. [55] Wang, X., Zhou, B., Shi, Y., Chen, X., Zhao, Q., Xu, K.: Shape2Motion: Joint Analysis of Motion Parts and Attributes from 3D Shapes. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8876–8884 (2019)
  56. [56] Weng, Y., Wen, B., Tremblay, J., Blukis, V., Fox, D., Guibas, L., Birchfield, S.: Neural Implicit Representation for Building Digital Twins of Unknown Articulated Objects. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)
  57. [57] Werby, A., Buechner, M., Roefer, A., Huang, C., Burgard, W., Valada, A.: Articulated Object Estimation in the Wild. Conference on Robot Learning (CoRL) (2025)
  58. [58] Yuan, H., Li, X., Zhang, T., Sun, Y., Huang, Z., Xu, S., Ji, S., Tong, Y., Qi, L., Feng, J., Yang, M.H.: Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos. arXiv preprint (2025)
  59. [59] Yuan, S., Shi, R., Wei, X., Zhang, X., Su, H., Liu, M.: LARM: A Large Articulated Object Reconstruction Model. In: Proceedings of the SIGGRAPH Asia 2025 Conference Papers (2025)
  60. [60] Zhang, C., Delitzas, A., Wang, F., Zhang, R., Ji, X., Pollefeys, M., Engelmann, F.: Open-Vocabulary Functional 3D Scene Graphs for Real-World Indoor Spaces. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 19401–19413 (2025)
  61. [61] Zhang, K., Sha, S., Jiang, H., Loper, M., Song, H., Cai, G., Xu, Z., Hu, X., Zheng, C., Li, Y.: Real-to-Sim Robot Policy Evaluation with Gaussian Splatting Simulation of Soft-Body Interactions. In: ICRA (2026)
