arxiv: 2512.04884 · v3 · submitted 2025-12-04 · 💻 cs.RO

Hoi! - A Multimodal Dataset for Force-Grounded, Cross-View Articulated Manipulation

Pith reviewed 2026-05-17 01:38 UTC · model grok-4.3

classification 💻 cs.RO
keywords multimodal dataset, articulated manipulation, force sensing, tactile sensing, cross-view transfer, robotic manipulation, human-robot interaction

The pith

A new dataset records 3048 real interactions with 381 articulated objects across four embodiments, one of which adds end-effector force and tactile sensing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a multimodal dataset that captures what is seen, done, and felt during real human manipulation of articulated objects. Recordings come from four embodiments operating the same objects across 38 environments: a human hand, a human hand with a wrist-mounted camera, a handheld UMI gripper, and a custom Hoi! gripper equipped with end-effector force and tactile sensors. This setup lets researchers measure how well visual methods transfer between human and robotic viewpoints while also exploring the role of physical forces that video alone cannot provide. The collection covers 3048 sequences across 381 objects and is positioned to support more complete models of interaction understanding.

Core claim

The central contribution is the Hoi! dataset, which couples synchronized video from multiple viewpoints with end-effector forces and tactile signals collected from four distinct embodiments on 381 articulated objects, thereby enabling direct study of cross-view transfer and force-grounded manipulation.

What carries the argument

The four-embodiment recording setup that provides aligned visual, force, and tactile data for the same physical interactions performed by a human hand, wrist-camera hand, UMI gripper, and custom Hoi! gripper.

If this is right

  • Video-based methods can now be directly compared against those that also use force and tactile channels for the same interactions.
  • Transfer learning experiments become possible between human demonstrations and robotic execution on identical objects and actions.
  • Force prediction tasks can be added to standard manipulation benchmarks using the provided sensor streams.
  • Policies for articulated objects can be trained and evaluated with explicit physical grounding rather than vision alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The dataset structure could support new benchmarks that quantify how much force information reduces the sim-to-real gap in policy learning.
  • Extending similar recordings to additional object categories or longer interaction sequences would test whether the observed transfer patterns hold more broadly.
  • The aligned multi-embodiment recordings offer a natural testbed for studying embodiment-invariant features in manipulation.

Load-bearing premise

That the force and tactile signals recorded from these specific embodiments accurately represent physical interactions that can transfer to train general robotic policies on articulated objects.

What would settle it

A model trained only on the human-hand embodiment data shows no improvement in force prediction or manipulation success when tested on the custom gripper embodiment with held-out objects.
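The test described above depends on a split in which held-out objects never appear in training and the two sides use different embodiments. A minimal sketch of such a split follows. The record fields `object_id` and `embodiment`, the labels `human_hand` and `hoi_gripper`, and the function name are hypothetical and are not taken from the dataset's actual file layout.

```python
import random


def cross_embodiment_split(records, train_embodiment, test_embodiment,
                           held_out_fraction=0.2, seed=0):
    """Split records so that test objects never appear in training.

    `records` is a list of dicts with hypothetical keys
    "object_id" and "embodiment".
    """
    object_ids = sorted({r["object_id"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(object_ids)

    # Reserve a fraction of the objects for testing only.
    n_held_out = max(1, int(len(object_ids) * held_out_fraction))
    held_out = set(object_ids[:n_held_out])

    train = [r for r in records
             if r["embodiment"] == train_embodiment
             and r["object_id"] not in held_out]
    test = [r for r in records
            if r["embodiment"] == test_embodiment
            and r["object_id"] in held_out]
    return train, test


# Toy data: 10 objects, each recorded in two embodiments.
records = [{"object_id": i, "embodiment": e}
           for i in range(10)
           for e in ("human_hand", "hoi_gripper")]

train, test = cross_embodiment_split(records, "human_hand", "hoi_gripper")
print(len(train), len(test))
```

Splitting by object rather than by sequence keeps every recording of a held-out object out of training, which is what the proposed test requires.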

Figures

Figures reproduced from arXiv: 2512.04884 by Abhinav Valada, Hermann Blum, Marc Pollefeys, Martin Büchner, Matteo Wohlrapp, René Zurbrügg, Tim Engelbracht, Zuria Bauer.

Figure 1. Overview of the Hoi! Dataset: A multimodal dataset for force-grounded, cross-view articulated manipulation in wild indoor environments. The dataset captures human interactions with common articulated objects (drawers, doors, fridges, dishwashers) with synchronized RGB, depth, force, tactile sensing, and multi-view videos from egocentric and exocentric perspectives. Each interaction is annotated with articu…
Figure 2. Locations of the Hoi! dataset. A diverse collection of real-world indoor environments featuring kitchens, bathrooms, offices, and living spaces, where each has RGB-D sequences, GT, panoramic images, and various articulated objects that have interactions with multiple grippers and users.
Figure 3. Hoi! Gripper. The 2-finger parallel gripper is operated through the load cell, where the measured load is translated into gripping force. Interaction force and tactile contact pressure are measured through the Force-Torque and Digit sensors, respectively. Aria Glasses and a stereo camera provide pose estimation and wrist-view observations. We will release the design as open source.
Figure 4. Example of the measured interaction forces for several articulated elements. Each curve corresponds to a different component (highlighted in matching colors below), illustrating how force magnitudes vary across types of articulated parts.
Figure 5. Distribution of environments and articulated interaction categories in the Hoi! dataset. The bar chart depicts the relative frequency of human interactions across articulated categories, while the inset pie chart summarizes the proportion of environments involved in the interactions.
Figure 6. Overview of the dataset collection setup. The dataset consists of 3048 multimodal sequences capturing human interactions with 381 articulated objects across 38 locations using multiple viewpoints (egocentric and third-person cameras) and manipulation conditions (hand, gripper-based). Ground truth data includes trajectories, contacts, haptic feedback, force measurements, and high-resolution 3D point clouds …
Figure 7. Viewpoint recordings. Recorded viewpoints for articulated part interactions. Each row corresponds to a different setup, showing synchronized exocentric (left), egocentric (center), and wrist-mounted (right) perspectives for both human and robot executions.
Abstract

We present a dataset for force-grounded, cross-view articulated manipulation that couples what is seen with what is done and what is felt during real human interaction. The dataset contains 3048 sequences across 381 articulated objects in 38 environments. Each object is operated in four embodiments - (i) human hand, (ii) human hand with a wrist-mounted camera, (iii) handheld UMI gripper, and (iv) a custom Hoi! gripper, where the tool embodiment provides end-effector forces and tactile sensing. Our dataset offers a holistic view of interaction understanding from video, enabling researchers to evaluate how well methods transfer between human and robotic viewpoints, but also investigate underexplored modalities such as interaction forces. The Project Website can be found at https://timengelbracht.github.io/Hoi-Dataset-Website/.
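As a quick consistency check on the counts in the abstract, the sketch below divides sequences by objects. The even split across the four embodiments is an assumption made only for illustration; the abstract does not state a per-embodiment breakdown.

```python
# Counts as stated in the abstract.
sequences = 3048
objects = 381
embodiments = 4

# Average number of sequences recorded per object.
per_object = sequences / objects

# ASSUMPTION: sequences are spread evenly over the four embodiments.
per_object_embodiment = per_object / embodiments

print(per_object)             # 8.0
print(per_object_embodiment)  # 2.0
```

The counts divide exactly (381 × 8 = 3048), which is consistent with a fixed recording protocol per object, although the source does not confirm this.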

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper presents the Hoi! dataset for force-grounded, cross-view articulated manipulation. It comprises 3048 sequences across 381 articulated objects in 38 environments, captured in four embodiments: human hand, human hand with wrist-mounted camera, handheld UMI gripper, and custom Hoi! gripper. Per the abstract, the tool embodiment (the custom Hoi! gripper) provides end-effector force and tactile sensing. The central contribution is the release of this multimodal resource to support evaluation of method transfer between human and robotic viewpoints and investigation of interaction forces.

Significance. If the data collection protocols, sensor calibration, and quality controls are rigorously documented and the dataset is released with reproducible access, this resource could meaningfully advance robotics research on articulated manipulation by filling a gap in paired visual-force-tactile data across embodiments. The cross-view and force-grounded aspects address underexplored areas in current manipulation datasets.

major comments (1)
  1. [§3] §3 (Data Collection): The manuscript provides insufficient detail on force/tactile sensor calibration, force range matching between the UMI and Hoi! grippers, and temporal synchronization with video streams. Without these, it is difficult to assess whether the recorded signals capture embodiment-independent interaction physics as needed to support the claimed utility for training general robotic policies.
minor comments (2)
  1. The project website URL is given but the manuscript should include a permanent DOI or direct download link for the dataset and code to ensure long-term accessibility.
  2. [§4] Figure captions and axis labels in the data statistics section could be clarified to explicitly state the number of sequences per embodiment.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback, which highlights important aspects of our data collection that require clarification. We address the major comment point-by-point below and will revise the manuscript to incorporate additional technical details.

Point-by-point responses
  1. Referee: [§3] §3 (Data Collection): The manuscript provides insufficient detail on force/tactile sensor calibration, force range matching between the UMI and Hoi! grippers, and temporal synchronization with video streams. Without these, it is difficult to assess whether the recorded signals capture embodiment-independent interaction physics as needed to support the claimed utility for training general robotic policies.

    Authors: We agree that the current version of §3 provides insufficient detail on these points. In the revised manuscript we will expand the Data Collection section with a dedicated subsection on sensor calibration. This will describe the procedures for both the UMI and Hoi! grippers, including reference load-cell validation, zero-offset correction, and temperature compensation. We will also add explicit force-range information and matching: the UMI gripper uses a sensor with a 0–50 N range while the Hoi! gripper uses a 0–200 N range; we apply per-embodiment min-max normalization followed by a shared scaling factor derived from overlapping calibration trials to enable direct comparison of interaction forces. For temporal synchronization we will document the hardware trigger protocol (shared clock source with <5 ms measured jitter) and the software alignment routine based on event timestamps and cross-correlation of high-frequency force spikes with video frame changes. These additions, together with released calibration scripts, will allow readers to verify that the recorded signals reflect embodiment-independent physics suitable for cross-embodiment policy training. revision: yes
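The normalization and cross-correlation alignment steps named in the simulated rebuttal can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, both signals are assumed to be already resampled to a common rate, and the "video motion" signal stands in for whatever per-frame change measure is actually used.

```python
import numpy as np


def min_max_normalize(x):
    """Scale a signal to [0, 1]; a constant signal maps to all zeros."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    if span == 0:
        return np.zeros_like(x)
    return (x - x.min()) / span


def estimate_lag(reference, delayed):
    """Return how many samples `delayed` lags behind `reference`.

    Uses the peak of the full cross-correlation of the
    mean-subtracted signals.
    """
    a = np.asarray(reference, dtype=float)
    b = np.asarray(delayed, dtype=float)
    a = a - a.mean()
    b = b - b.mean()
    corr = np.correlate(b, a, mode="full")
    # In "full" mode, index len(a) - 1 corresponds to zero lag.
    return int(np.argmax(corr)) - (len(a) - 1)


# Toy example: a force spike, and a motion signal delayed by 7 samples.
force = np.zeros(100)
force[20] = 1.0
video_motion = np.roll(force, 7)

lag = estimate_lag(force, video_motion)
print(lag)
print(min_max_normalize([0.0, 25.0, 50.0]))
```

Dividing the estimated lag by the common sampling rate gives the offset in seconds. Real recordings would also need handling for noise and for multiple competing peaks, which this sketch omits.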

Circularity Check

0 steps flagged

Dataset release paper with no derivations or predictions

full rationale

The paper presents a new multimodal dataset for force-grounded articulated manipulation, describing data collection across 3048 sequences on 381 objects in four embodiments without any claimed mathematical derivations, model predictions, fitted parameters, or equations. The central contribution is the dataset itself and its release for enabling future research on cross-view and force modalities. No load-bearing steps reduce by construction to self-definitions, fitted inputs renamed as predictions, or self-citation chains; the work is self-contained as an empirical data resource independent of prior results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical dataset paper with no mathematical model, derivations, or postulated physical entities; therefore no free parameters, axioms, or invented entities are required or introduced.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Tactile-based Multimodal Fusion in Embodied Intelligence: A Survey of Vision, Language, and Contact-Driven Paradigms

    cs.RO 2026-05 unverdicted novelty 4.0

    A survey proposing a hierarchical taxonomy for multimodal tactile fusion datasets and methods across perception, generation, and interaction in embodied intelligence.

  2. World Action Models: The Next Frontier in Embodied AI

    cs.RO 2026-05 unverdicted novelty 4.0

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

  3. World Model for Robot Learning: A Comprehensive Survey

    cs.RO 2026-04 unverdicted novelty 3.0

    A comprehensive survey that organizes the literature on world models in robot learning, their roles in policy learning, planning, simulation, and video-based generation, with connections to navigation, driving, datase...

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · cited by 3 Pith papers · 1 internal anchor

  1. [1]

    Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xu Huang, Shu Jiang, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems. CoRR, 2025.

  2. [2]

    Carlos Campos, Richard Elvira, Juan J. Gomez Rodriguez, Jose M. M. Montiel, and Juan D. Tardos. Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam. IEEE Transactions on Robotics, 37(6):1874–1890, 2021.

  3. [3]

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. CoRR, abs/2104.14294, 2021.

  4. [4]

    Tianyi Cheng, Dandan Shan, Ayda Sultan, Richard E. L. Higgins, and David F. Fouhey. Towards a richer 2d understanding of hands at scale. In Proceedings of the 37th International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 2023. Curran Associates Inc.

  5. [5]

    Cheng Chi, Zhenjia Xu, Chuer Pan, Eric Cousineau, Benjamin Burchfiel, Siyuan Feng, Russ Tedrake, and Shuran Song. Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots, 2024.

  6. [6]

    Jeremy A. Collins, Cody Houff, You Liang Tan, and Charles C. Kemp. Forcesight: Text-guided mobile manipulation with visual-force goals, 2023.

  7. [7]

    Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Jian Ma, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100. International Journal of Computer Vision (IJCV), 130:33–55, 2022.

  8. [8]

    Ahmad Darkhalil, Dandan Shan, Bin Zhu, Jian Ma, Amlan Kar, Richard Higgins, Sanja Fidler, David Fouhey, and Dima Damen. Epic-kitchens visor benchmark: Video segmentations and object relations, 2022.

  9. [9]

    Alexandros Delitzas, Ayca Takmaz, Federico Tombari, Robert Sumner, Marc Pollefeys, and Francis Engelmann. SceneFun3D: Fine-Grained Functionality and Affordance Understanding in 3D Scenes. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

  10. [10]

    Frank Dellaert and GTSAM Contributors. borglab/gtsam,

  11. [11]

    Ben Eisner and Harry Zhang. Flowbot3d: Learning 3d articulation flow to manipulate articulated objects. Robotics Science and Systems 2022, 2022.

  12. [12]

    Jakob Engel, Kiran Somasundaram, Michael Goesele, Albert Sun, Alexander Gamino, Andrew Turner, Arjang Talattof, Arnie Yuan, Bilal Souti, Brighid Meredith, Cheng Peng, Chris Sweeney, Cole Wilson, Dan Barnes, Daniel DeTone, David Caruso, Derek Valleroy, Dinesh Ginjupalli, Duncan Frost, Edward Miller, Elias Mueggler, Evgeniy Oleinik, Fan Zhang, Guruprasa... Project aria: A new tool for egocentric multi-modal ai research,

  13. [13]

    Hao-Shu Fang, Hongjie Fang, Zhenyu Tang, Jirong Liu, Chenxi Wang, Junbo Wang, Haoyi Zhu, and Cewu Lu. Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot, 2023.

  14. [14]

    Martin A. Fischler and Robert C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 24(6):381–395, 1981.

  15. [15]

    Fadri Furrer, Marius Fehr, Tonci Novkovic, Hannes Sommer, Igor Gilitschenski, and Roland Siegwart. Evaluation of Combined Time-Offset Estimation and Hand-Eye Calibration on Robotic Datasets. Springer International Publishing, Cham,

  16. [16]

    T. Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38:293–306,

  17. [17]

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18995–19012, 2022.

  18. [18]

    Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Moh...

  19. [19]

    Arjun Gupta, Michelle Zhang, Rishik Sathua, and Saurabh Gupta. Opening articulated structures in the real world,

  20. [20]

    Anna-Maria Halacheva, Yang Miao, Jan-Nico Zaech, Xi Wang, Luc Van Gool, and Danda Pani Paudel. Articulate3d: Holistic understanding of 3d scenes as universal scene description. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025.

  21. [21]

    Nick Heppert, Muhammad Zubair Irshad, Sergey Zakharov, Katherine Liu, Rares Andrei Ambrus, Jeannette Bohg, Abhinav Valada, and Thomas Kollar. Carto: Category and joint agnostic reconstruction of articulated objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21201–21210, 2023.

  22. [22]

    Carolina Higuera, Akash Sharma, Chaithanya Krishna Bodduluri, Taosha Fan, Patrick Lancaster, Mrinal Kalakrishnan, Michael Kaess, Byron Boots, Mike Lambeta, Tingfan Wu, and Mustafa Mukadam. Sparsh: Self-supervised touch representations for vision-based tactile sensing. In 8th Annual Conference on Robot Learning, 2024.

  23. [23]

    Advait Jain and Charles C. Kemp. Improving robot manipulation with data-driven object-centric models of everyday forces. Autonomous Robots, 35(2):143–159, 2013.

  24. [24]

    Zhenyu Jiang, Cheng-Chun Hsu, and Yuke Zhu. Ditto: Building digital twins of articulated objects from interaction,

  25. [25]

    Simar Kareer, Dhruv Patel, Ryan Punamiya, Pranay Mathur, Shuo Cheng, Chen Wang, Judy Hoffman, and Danfei Xu. Egomimic: Scaling imitation learning via egocentric video. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 13226–13233. IEEE, 2025.

  26. [26]

    Nikhil Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, Jonathon Luiten, Manuel Lopez-Antequera, Samuel Rota Bulò, Christian Richardt, Deva Ramanan, Sebastian Scherer, and Peter Kontschieder. Mapanything: Universal feed-forward metric 3d reconstruction, 2025.

  27. [27]

    Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. In RSS 2024 Workshop: Data Generation for Robotics.

  28. [28]

    Marion Lepert, Jiaying Fang, and Jeannette Bohg. Phantom: Training robots without robots using only human videos,

  29. [29]

    Liu Liu, Wenqiang Xu, Haoyuan Fu, Sucheng Qian, Qiaojun Yu, Yang Han, and Cewu Lu. Akb-48: A real-world articulated object knowledge base. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14809–14818, 2022.

  30. [30]

    Wenhai Liu, Junbo Wang, Yiming Wang, Weiming Wang, and Cewu Lu. Forcemimic: Force-centric imitation learning with force-motion capture system for contact-rich manipulation, 2025.

  31. [31]

    Yu Liu, Baoxiong Jia, Ruijie Lu, Junfeng Ni, Song-Chun Zhu, and Siyuan Huang. Artgs: Building interactable replicas of complex articulated objects via gaussian splatting,

  32. [32]

    Roberto Martín-Martín, Clemens Eppner, and Oliver Brock. The rbo dataset of articulated objects and interactions, 2018.

  33. [33]

    Kaichun Mo, Shilin Zhu, Angel X Chang, Li Yi, Subarna Tripathi, Leonidas J Guibas, and Hao Su. Partnet: A large-scale benchmark for fine-grained and hierarchical part-level 3d object understanding. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 909–918, 2019.

  34. [34]

    Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3m: A universal visual representation for robot manipulation. In 6th Annual Conference on Robot Learning.

  35. [35]

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, ... Dinov2: Learning robust visual features without supervision, 2024.

  36. [36]

    Georgios Pavlakos, Dandan Shan, Ilija Radosavovic, Angjoo Kanazawa, David Fouhey, and Jitendra Malik. Reconstructing hands in 3D with transformers. In CVPR, 2024.

  37. [37]

    Toby Perrett, Ahmad Darkhalil, Saptarshi Sinha, Omar Emara, Sam Pollard, Kranti Kumar Parida, Kaiting Liu, Prajwal Gatti, Siddhant Bansal, Kevin Flanagan, et al. Hd-epic: A highly-detailed egocentric video dataset. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 23901–23913, 2025.

  38. [38]

    Charles R. Qi, Li Yi, Hao Su, and Leonidas J. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space, 2017.

  39. [39]

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. Sam 2: Segment anything in images and videos,

  40. [40]

    Joern Rehder, Janosch Nikolic, Thomas Schneider, Timo Hinzmann, and Roland Siegwart. Extending kalibr: Calibrating the extrinsics of multiple imus and of individual axes. pages 4304–4311, 2016.

  41. [41]

    Ruochen Ren, Zhipeng Wang, Chaoyun Yang, Jiahang Liu, Rong Jiang, Yanmin Zhou, Shuo Jiang, and Bin He. Enhancing robotic skill acquisition with multimodal sensory data: A novel dataset for kitchen tasks. Scientific Data, 12(1):476,

  42. [42]

    Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. Orb: An efficient alternative to sift or surf. In 2011 International Conference on Computer Vision, pages 2564–2571, 2011.

  43. [43]

    Paul-Edouard Sarlin, Cesar Cadena, Roland Siegwart, and Marcin Dymczyk. From coarse to fine: Robust hierarchical localization at large scale. In CVPR, 2019.

  44. [44]

    Chaoyue Song, Jiacheng Wei, Chuan-Sheng Foo, Guosheng Lin, and Fayao Liu. Reacto: Reconstructing articulated objects from a single video, 2024.

  45. [45]

    Jürgen Sturm. Learning kinematic models of articulated objects. Springer Tracts in Advanced Robotics, pages 65–111,

  46. [46]

    Jürgen Sturm, Vijay Pradeep, Cyrill Stachniss, Christian Plagemann, Kurt Konolige, and Wolfram Burgard. Learning kinematic models for articulated objects.

  47. [47]

    Yufei Wang, Ziyu Wang, Mino Nakura, Pratik Bhowal, Chia-Liang Kuo, Yi-Ting Chen, Zackory Erickson, and David Held. Articubot: Learning universal articulated object manipulation policy via large scale simulation, 2025.

  48. [48]

    Abdelrhman Werby, Martin Büchner, Adrian Röfer, Chenguang Huang, Wolfram Burgard, and Abhinav Valada. Articulated object estimation in the wild. In Conference on Robot Learning (CoRL), 2025.

  49. [49]

    Abdelrhman Werby, Martin Büchner, Adrian Röfer, Chenguang Huang, Wolfram Burgard, and Abhinav Valada. Articulated object estimation in the wild, 2025.

  50. [50]

    Fanbo Xiang, Yuzhe Qin, Kaichun Mo, Yikuan Xia, Hao Zhu, Fangchen Liu, Minghua Liu, Hanxiao Jiang, Yifu Yuan, He Wang, et al. Sapien: A simulated part-based interactive environment. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11097–11107, 2020.

  51. [51]

    Haoyu Xiong, Haoyuan Fu, Jieyi Zhang, Chen Bao, Qiang Zhang, Yongxi Huang, Wenqiang Xu, Animesh Garg, and Cewu Lu. Robotube: Learning household manipulation from human videos with simulated twin environments. In 6th Annual Conference on Robot Learning, 2022.

  52. [52]

    Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything v2, 2024.

  53. [53]

    Yi Yang and Deva Ramanan. Articulated human detection with flexible mixtures of parts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2878–2890,

  54. [54]

    Chenyangguang Zhang, Alexandros Delitzas, Fangjinhua Wang, Ruida Zhang, Xiangyang Ji, Marc Pollefeys, and Francis Engelmann. Open-vocabulary functional 3d scene graphs for real-world indoor spaces. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19401–19413, 2025.

  55. [55]

    Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware, 2023.
