Hoi! - A Multimodal Dataset for Force-Grounded, Cross-View Articulated Manipulation

Abhinav Valada; Hermann Blum; Marc Pollefeys; Martin B\"uchner; Matteo Wohlrapp; Ren\'e Zurbr\"ugg; Tim Engelbracht; Zuria Bauer

arxiv: 2512.04884 · v3 · submitted 2025-12-04 · 💻 cs.RO

Hoi! - A Multimodal Dataset for Force-Grounded, Cross-View Articulated Manipulation

Tim Engelbracht , Ren\'e Zurbr\"ugg , Matteo Wohlrapp , Martin B\"uchner , Abhinav Valada , Marc Pollefeys , Hermann Blum , Zuria Bauer This is my paper

Pith reviewed 2026-05-17 01:38 UTC · model grok-4.3

classification 💻 cs.RO

keywords multimodal datasetarticulated manipulationforce sensingtactile sensingcross-view transferrobotic manipulationhuman-robot interaction

0 comments

The pith

A new dataset records 3048 real interactions with 381 articulated objects using four embodiments that include force and tactile sensing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a multimodal dataset that captures what is seen, done, and felt during manipulation of articulated objects in real human interactions. Recordings come from human hands, a wrist-mounted camera, a standard handheld gripper, and a custom gripper equipped with end-effector force and tactile sensors, all operating the same objects across 38 environments. This setup lets researchers measure how well visual methods transfer between human and robotic viewpoints while also exploring the role of physical forces that video alone cannot provide. The collection covers 3048 sequences and is positioned to support more complete models of interaction understanding.

Core claim

The central contribution is the Hoi! dataset, which couples synchronized video from multiple viewpoints with end-effector forces and tactile signals collected from four distinct embodiments on 381 articulated objects, thereby enabling direct study of cross-view transfer and force-grounded manipulation.

What carries the argument

The four-embodiment recording setup that provides aligned visual, force, and tactile data for the same physical interactions performed by a human hand, wrist-camera hand, UMI gripper, and custom Hoi! gripper.

If this is right

Video-based methods can now be directly compared against those that also use force and tactile channels for the same interactions.
Transfer learning experiments become possible between human demonstrations and robotic execution on identical objects and actions.
Force prediction tasks can be added to standard manipulation benchmarks using the provided sensor streams.
Policies for articulated objects can be trained and evaluated with explicit physical grounding rather than vision alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The dataset structure could support new benchmarks that quantify how much force information reduces the sim-to-real gap in policy learning.
Extending similar recordings to additional object categories or longer interaction sequences would test whether the observed transfer patterns hold more broadly.
The aligned multi-embodiment recordings offer a natural testbed for studying embodiment-invariant features in manipulation.

Load-bearing premise

That the force and tactile signals recorded from these specific embodiments accurately represent physical interactions that can transfer to train general robotic policies on articulated objects.

What would settle it

A model trained only on the human-hand embodiment data shows no improvement in force prediction or manipulation success when tested on the custom gripper embodiment with held-out objects.

Figures

Figures reproduced from arXiv: 2512.04884 by Abhinav Valada, Hermann Blum, Marc Pollefeys, Martin B\"uchner, Matteo Wohlrapp, Ren\'e Zurbr\"ugg, Tim Engelbracht, Zuria Bauer.

**Figure 1.** Figure 1: Overview of the Hoi! Dataset: A multimodal dataset for force-grounded, cross-view articulated manipulation in wild indoor environments. The dataset captures human interactions with common articulated objects (drawers, doors, fridges, dishwashers) with synchronized RGB, depth, force, tactile sensing, and multi-view videos from egocentric and exocentric perspectives. Each interaction is annotated with articu… view at source ↗

**Figure 2.** Figure 2: Locations of the Hoi! dataset. A diverse collection of real-world indoor environments featuring kitchens, bathrooms, offices, and living spaces, were each has RGB-D sequences, GT, panoramic images, and various articulated objects that have interactions with multiple grippers and users. Video Understanding. Egocentric benchmarks such as [7, 17, 18, 37] have enabled progress on action recognition, anticipat… view at source ↗

**Figure 3.** Figure 3: Hoi! Gripper. The 2-finger parallel gripper is operated through the load cell, where the measured load is translated into gripping force. Interaction force and tactile contact pressure are measured through the Digit and Force-Torque sensors respectively. Aria Glasses and a stereo camera provide pose estimation and wrist-view observations. We will release the design as open source. data during interaction.… view at source ↗

**Figure 4.** Figure 4: Example of the measured interaction forces for several articulated elements. Each curve corresponds to a different component (highlighted in matching colors below), illustrating how force magnitudes vary across types of articulated parts [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 5.** Figure 5: Distribution of environments and articulated interaction categories in the Hoi! dataset. The bar chart depicts the relative frequency of human interactions across articulated categories, while the inset pie chart summarizes the proportion of environments involved in the interactions. ronment, we also capture high-resolution 3D point clouds with a Leica RTC360 laser scanner. We first scan the unarticulate… view at source ↗

**Figure 6.** Figure 6: Overview of the dataset collection setup. The dataset consists of 3048 multimodal sequences capturing human interactions with 381 articulated objects across 38 locations using multiple viewpoints (egocentric and third-person cameras) and manipulation conditions (hand, gripper-based). Ground truth data includes trajectories, contacts, haptic feedback, force measurements, and high-resolution 3D point clouds … view at source ↗

**Figure 7.** Figure 7: Viewpoint recordings. Recorded viewpoints for articulated part interactions. Each row corresponds to a different setup, showing synchronized exocentric (left), egocentric (center), and wrist-mounted (right) perspectives for both human and robot executions. language description of the part. To generate the mask, we prompt SAMv2 [39] on the panoramic images and lift the predicted mask to 3D using the corre… view at source ↗

read the original abstract

We present a dataset for force-grounded, cross-view articulated manipulation that couples what is seen with what is done and what is felt during real human interaction. The dataset contains 3048 sequences across 381 articulated objects in 38 environments. Each object is operated in four embodiments - (i) human hand, (ii) human hand with a wrist-mounted camera, (iii) handheld UMI gripper, and (iv) a custom Hoi! gripper, where the tool embodiment provides end-effector forces and tactile sensing. Our dataset offers a holistic view of interaction understanding from video, enabling researchers to evaluate how well methods transfer between human and robotic viewpoints, but also investigate underexplored modalities such as interaction forces. The Project Website can be found at https://timengelbracht.github.io/Hoi-Dataset-Website/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This is a straightforward dataset release with solid scale on force and tactile data for articulated objects across embodiments, but the human-operated collection leaves transfer to real robot policies unproven.

read the letter

The main thing to know is that this paper releases a dataset of 3048 sequences on 381 articulated objects captured in four embodiments, including a custom gripper that records end-effector forces and tactile signals alongside multi-view video. That specific mix of cross-view footage and physical sensing on real articulated items is new at this scale and should give robotics people a practical resource they can actually use.

Referee Report

1 major / 2 minor

Summary. The paper presents the Hoi! dataset for force-grounded, cross-view articulated manipulation. It comprises 3048 sequences across 381 articulated objects in 38 environments, captured in four embodiments: human hand, human hand with wrist-mounted camera, handheld UMI gripper, and custom Hoi! gripper. The latter two provide end-effector force and tactile sensing. The central contribution is the release of this multimodal resource to support evaluation of method transfer between human and robotic viewpoints and investigation of interaction forces.

Significance. If the data collection protocols, sensor calibration, and quality controls are rigorously documented and the dataset is released with reproducible access, this resource could meaningfully advance robotics research on articulated manipulation by filling a gap in paired visual-force-tactile data across embodiments. The cross-view and force-grounded aspects address underexplored areas in current manipulation datasets.

major comments (1)

[§3] §3 (Data Collection): The manuscript provides insufficient detail on force/tactile sensor calibration, force range matching between the UMI and Hoi! grippers, and temporal synchronization with video streams. Without these, it is difficult to assess whether the recorded signals capture embodiment-independent interaction physics as needed to support the claimed utility for training general robotic policies.

minor comments (2)

The project website URL is given but the manuscript should include a permanent DOI or direct download link for the dataset and code to ensure long-term accessibility.
[§4] Figure captions and axis labels in the data statistics section could be clarified to explicitly state the number of sequences per embodiment.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important aspects of our data collection that require clarification. We address the major comment point-by-point below and will revise the manuscript to incorporate additional technical details.

read point-by-point responses

Referee: [§3] §3 (Data Collection): The manuscript provides insufficient detail on force/tactile sensor calibration, force range matching between the UMI and Hoi! grippers, and temporal synchronization with video streams. Without these, it is difficult to assess whether the recorded signals capture embodiment-independent interaction physics as needed to support the claimed utility for training general robotic policies.

Authors: We agree that the current version of §3 provides insufficient detail on these points. In the revised manuscript we will expand the Data Collection section with a dedicated subsection on sensor calibration. This will describe the procedures for both the UMI and Hoi! grippers, including reference load-cell validation, zero-offset correction, and temperature compensation. We will also add explicit force-range information and matching: the UMI gripper uses a sensor with a 0–50 N range while the Hoi! gripper uses a 0–200 N range; we apply per-embodiment min-max normalization followed by a shared scaling factor derived from overlapping calibration trials to enable direct comparison of interaction forces. For temporal synchronization we will document the hardware trigger protocol (shared clock source with <5 ms measured jitter) and the software alignment routine based on event timestamps and cross-correlation of high-frequency force spikes with video frame changes. These additions, together with released calibration scripts, will allow readers to verify that the recorded signals reflect embodiment-independent physics suitable for cross-embodiment policy training. revision: yes

Circularity Check

0 steps flagged

Dataset release paper with no derivations or predictions

full rationale

The paper presents a new multimodal dataset for force-grounded articulated manipulation, describing data collection across 3048 sequences on 381 objects in four embodiments without any claimed mathematical derivations, model predictions, fitted parameters, or equations. The central contribution is the dataset itself and its release for enabling future research on cross-view and force modalities. No load-bearing steps reduce by construction to self-definitions, fitted inputs renamed as predictions, or self-citation chains; the work is self-contained as an empirical data resource independent of prior results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical dataset paper with no mathematical model, derivations, or postulated physical entities; therefore no free parameters, axioms, or invented entities are required or introduced.

pith-pipeline@v0.9.0 · 5475 in / 1075 out tokens · 66754 ms · 2026-05-17T01:38:28.869566+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We present a dataset for force-grounded, cross-view articulated manipulation that couples what is seen with what is done and what is felt during real human interaction.
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

The dataset contains 3048 sequences across 381 articulated objects... end-effector forces and tactile sensing.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Tactile-based Multimodal Fusion in Embodied Intelligence: A Survey of Vision, Language, and Contact-Driven Paradigms
cs.RO 2026-05 unverdicted novelty 4.0

A survey proposing a hierarchical taxonomy for multimodal tactile fusion datasets and methods across perception, generation, and interaction in embodied intelligence.
World Action Models: The Next Frontier in Embodied AI
cs.RO 2026-05 unverdicted novelty 4.0

The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
World Model for Robot Learning: A Comprehensive Survey
cs.RO 2026-04 unverdicted novelty 3.0

A comprehensive survey that organizes the literature on world models in robot learning, their roles in policy learning, planning, simulation, and video-based generation, with connections to navigation, driving, datase...

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · cited by 3 Pith papers · 1 internal anchor

[1]

Agibot world colosseo: A large-scale manipula- tion platform for scalable and intelligent embodied systems

Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xu Huang, Shu Jiang, et al. Agibot world colosseo: A large-scale manipula- tion platform for scalable and intelligent embodied systems. CoRR, 2025. 3

work page 2025
[2]

Gomez Rodriguez, Jose M

Carlos Campos, Richard Elvira, Juan J. Gomez Rodriguez, Jose M. M. Montiel, and Juan D. Tardos. Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam.IEEE Transactions on Robotics, 37(6): 1874–1890, 2021. 5

work page 2021
[3]

Emerging Properties in Self-Supervised Vision Transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Herv ´e J´egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerg- ing properties in self-supervised vision transformers.CoRR, abs/2104.14294, 2021. 7, 5

work page internal anchor Pith review Pith/arXiv arXiv 2021
[4]

Tianyi Cheng, Dandan Shan, Ayda Sultan, Richard E. L. Higgins, and David F. Fouhey. Towards a richer 2d under- standing of hands at scale. InProceedings of the 37th Inter- national Conference on Neural Information Processing Sys- tems, Red Hook, NY , USA, 2023. Curran Associates Inc. 7

work page 2023
[5]

Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots, 2024

Cheng Chi, Zhenjia Xu, Chuer Pan, Eric Cousineau, Ben- jamin Burchfiel, Siyuan Feng, Russ Tedrake, and Shuran Song. Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots, 2024. 4

work page 2024
[6]

Collins, Cody Houff, You Liang Tan, and Charles C

Jeremy A. Collins, Cody Houff, You Liang Tan, and Charles C. Kemp. Forcesight: Text-guided mobile manip- ulation with visual-force goals, 2023. 2, 7, 8, 6

work page 2023
[7]

Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100.International Journal of Computer Vision (IJCV), 130:33–55, 2022

Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Jian Ma, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100.International Journal of Computer Vision (IJCV), 130:33–55, 2022. 2, 3

work page 2022
[8]

Epic-kitchens visor benchmark: Video segmenta- tions and object relations, 2022

Ahmad Darkhalil, Dandan Shan, Bin Zhu, Jian Ma, Amlan Kar, Richard Higgins, Sanja Fidler, David Fouhey, and Dima Damen. Epic-kitchens visor benchmark: Video segmenta- tions and object relations, 2022. 7

work page 2022
[9]

SceneFun3D: Fine-Grained Functionality and Affordance Understanding in 3D Scenes

Alexandros Delitzas, Ayca Takmaz, Federico Tombari, Robert Sumner, Marc Pollefeys, and Francis Engelmann. SceneFun3D: Fine-Grained Functionality and Affordance Understanding in 3D Scenes. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 2

work page 2024
[10]

borglab/gtsam,

Frank Dellaert and GTSAM Contributors. borglab/gtsam,

work page
[11]

Flowbot3d: Learning 3d ar- ticulation flow to manipulate articulated objects.Robotics Science and Systems 2022, 2022

Ben Eisner and Harry Zhang. Flowbot3d: Learning 3d ar- ticulation flow to manipulate articulated objects.Robotics Science and Systems 2022, 2022. 2

work page 2022
[12]

Project aria: A new tool for egocentric multi-modal ai research,

Jakob Engel, Kiran Somasundaram, Michael Goesele, Al- bert Sun, Alexander Gamino, Andrew Turner, Arjang Talat- tof, Arnie Yuan, Bilal Souti, Brighid Meredith, Cheng Peng, Chris Sweeney, Cole Wilson, Dan Barnes, Daniel DeTone, David Caruso, Derek Valleroy, Dinesh Ginjupalli, Duncan Frost, Edward Miller, Elias Mueggler, Evgeniy Oleinik, Fan Zhang, Guruprasa...

work page
[13]

Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot, 2023

Hao-Shu Fang, Hongjie Fang, Zhenyu Tang, Jirong Liu, Chenxi Wang, Junbo Wang, Haoyi Zhu, and Cewu Lu. Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot, 2023. 3

work page 2023
[14]

Fischler and Robert C

Martin A. Fischler and Robert C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography.Commun. ACM, 24(6):381–395, 1981. 4

work page 1981
[15]

Springer International Publishing, Cham,

Fadri Furrer, Marius Fehr, Tonci Novkovic, Hannes Sommer, Igor Gilitschenski, and Roland Siegwart.Evaluation of Com- bined Time-Offset Estimation and Hand-Eye Calibration on Robotic Datasets. Springer International Publishing, Cham,

work page
[16]

Gonzalez

T. Gonzalez. Clustering to minimize the maximum inter- cluster distance.Theoretical Computer Science, 38:293–306,

work page
[17]

Ego4d: Around the world in 3,000 hours of egocentric video

Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 18995–19012, 2022. 2, 3, 7

work page 2022
[18]

Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu- Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Moh...

work page 2024
[19]

Opening articulated structures in the real world,

Arjun Gupta, Michelle Zhang, Rishik Sathua, and Saurabh Gupta. Opening articulated structures in the real world,

work page
[20]

Articulate3d: Holistic understanding of 3d scenes as universal scene de- scription

Anna-Maria Halacheva, Yang Miao, Jan-Nico Zaech, Xi Wang, Luc Van Gool, and Danda Pani Paudel. Articulate3d: Holistic understanding of 3d scenes as universal scene de- scription. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025. 2

work page 2025
[21]

Carto: Category and joint agnostic reconstruction of articulated objects

Nick Heppert, Muhammad Zubair Irshad, Sergey Zakharov, Katherine Liu, Rares Andrei Ambrus, Jeannette Bohg, Ab- hinav Valada, and Thomas Kollar. Carto: Category and joint agnostic reconstruction of articulated objects. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21201–21210, 2023. 2

work page 2023
[22]

Sparsh: Self-supervised touch rep- resentations for vision-based tactile sensing

Carolina Higuera, Akash Sharma, Chaithanya Krishna Bod- duluri, Taosha Fan, Patrick Lancaster, Mrinal Kalakrishnan, Michael Kaess, Byron Boots, Mike Lambeta, Tingfan Wu, and Mustafa Mukadam. Sparsh: Self-supervised touch rep- resentations for vision-based tactile sensing. In8th Annual Conference on Robot Learning, 2024. 7, 5

work page 2024
[23]

Advait Jain and Charles C. Kemp. Improving robot manip- ulation with data-driven object-centric models of everyday forces.Autonomous Robots, 35(2):143–159, 2013. 2

work page 2013
[24]

Ditto: Building digital twins of articulated objects from interaction,

Zhenyu Jiang, Cheng-Chun Hsu, and Yuke Zhu. Ditto: Building digital twins of articulated objects from interaction,

work page
[25]

Egomimic: Scaling imitation learning via egocentric video

Simar Kareer, Dhruv Patel, Ryan Punamiya, Pranay Mathur, Shuo Cheng, Chen Wang, Judy Hoffman, and Danfei Xu. Egomimic: Scaling imitation learning via egocentric video. In2025 IEEE International Conference on Robotics and Au- tomation (ICRA), pages 13226–13233. IEEE, 2025. 3

work page 2025
[26]

Mapanything: Universal feed- forward metric 3d reconstruction, 2025

Nikhil Keetha, Norman M ¨uller, Johannes Sch ¨onberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, Jonathon Luiten, Manuel Lopez-Antequera, Samuel Rota Bul`o, Christian Richardt, Deva Ramanan, Sebastian Scherer, and Peter Kontschieder. Mapanything: Universal feed- forward metric 3d reconstructio...

work page 2025
[27]

Droid: A large-scale in-the-wild robot manipulation dataset

Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Bal- akrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. InRSS 2024 Workshop: Data Generation for Robotics. 3

work page 2024
[28]

Phantom: Training robots without robots using only human videos,

Marion Lepert, Jiaying Fang, and Jeannette Bohg. Phantom: Training robots without robots using only human videos,

work page
[29]

Akb-48: A real-world articulated object knowledge base

Liu Liu, Wenqiang Xu, Haoyuan Fu, Sucheng Qian, Qiao- jun Yu, Yang Han, and Cewu Lu. Akb-48: A real-world articulated object knowledge base. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14809–14818, 2022. 2

work page 2022
[30]

Forcemimic: Force-centric imitation learning with force-motion capture system for contact-rich manipula- tion, 2025

Wenhai Liu, Junbo Wang, Yiming Wang, Weiming Wang, and Cewu Lu. Forcemimic: Force-centric imitation learning with force-motion capture system for contact-rich manipula- tion, 2025. 3

work page 2025
[31]

Artgs: Building interactable repli- cas of complex articulated objects via gaussian splatting,

Yu Liu, Baoxiong Jia, Ruijie Lu, Junfeng Ni, Song-Chun Zhu, and Siyuan Huang. Artgs: Building interactable repli- cas of complex articulated objects via gaussian splatting,

work page
[32]

The rbo dataset of articulated objects and interactions, 2018

Roberto Mart ´ın-Mart´ın, Clemens Eppner, and Oliver Brock. The rbo dataset of articulated objects and interactions, 2018. 2, 3

work page 2018
[33]

Partnet: A large- scale benchmark for fine-grained and hierarchical part-level 3d object understanding

Kaichun Mo, Shilin Zhu, Angel X Chang, Li Yi, Subarna Tripathi, Leonidas J Guibas, and Hao Su. Partnet: A large- scale benchmark for fine-grained and hierarchical part-level 3d object understanding. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 909–918, 2019. 2

work page 2019
[34]

R3m: A universal visual repre- sentation for robot manipulation

Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3m: A universal visual repre- sentation for robot manipulation. In6th Annual Conference on Robot Learning. 3

work page
[35]

Dinov2: Learning robust visual features with- out supervision, 2024

Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mah- moud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herv ´e Je- gou, Julien Mairal, ...

work page 2024
[36]

Reconstruct- ing hands in 3D with transformers

Georgios Pavlakos, Dandan Shan, Ilija Radosavovic, Angjoo Kanazawa, David Fouhey, and Jitendra Malik. Reconstruct- ing hands in 3D with transformers. InCVPR, 2024. 6

work page 2024
[37]

Hd-epic: A highly-detailed egocentric video dataset

Toby Perrett, Ahmad Darkhalil, Saptarshi Sinha, Omar Emara, Sam Pollard, Kranti Kumar Parida, Kaiting Liu, Pra- jwal Gatti, Siddhant Bansal, Kevin Flanagan, et al. Hd-epic: A highly-detailed egocentric video dataset. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 23901–23913, 2025. 2, 3

work page 2025
[38]

Qi, Li Yi, Hao Su, and Leonidas J

Charles R. Qi, Li Yi, Hao Su, and Leonidas J. Guibas. Point- net++: Deep hierarchical feature learning on point sets in a metric space, 2017. 4

work page 2017
[39]

Sam 2: Segment anything in images and videos,

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junt- ing Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao- Yuan Wu, Ross Girshick, Piotr Doll´ar, and Christoph Feicht- enhofer. Sam 2: Segment anything in images and videos,

work page
[40]

Extending kalibr: Cali- brating the extrinsics of multiple imus and of individual axes

Joern Rehder, Janosch Nikolic, Thomas Schneider, Timo Hinzmann, and Roland Siegwart. Extending kalibr: Cali- brating the extrinsics of multiple imus and of individual axes. pages 4304–4311, 2016. 1

work page 2016
[41]

Enhanc- ing robotic skill acquisition with multimodal sensory data: A novel dataset for kitchen tasks.Scientific Data, 12(1):476,

Ruochen Ren, Zhipeng Wang, Chaoyun Yang, Jiahang Liu, Rong Jiang, Yanmin Zhou, Shuo Jiang, and Bin He. Enhanc- ing robotic skill acquisition with multimodal sensory data: A novel dataset for kitchen tasks.Scientific Data, 12(1):476,

work page
[42]

Orb: An efficient alternative to sift or surf

Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. Orb: An efficient alternative to sift or surf. In2011 International Conference on Computer Vision, pages 2564– 2571, 2011. 4

work page 2011
[43]

From coarse to fine: Robust hierarchical localization at large scale

Paul-Edouard Sarlin, Cesar Cadena, Roland Siegwart, and Marcin Dymczyk. From coarse to fine: Robust hierarchical localization at large scale. InCVPR, 2019. 5, 4

work page 2019
[44]

Reacto: Reconstructing articulated ob- jects from a single video, 2024

Chaoyue Song, Jiacheng Wei, Chuan-Sheng Foo, Guosheng Lin, and Fayao Liu. Reacto: Reconstructing articulated ob- jects from a single video, 2024. 2

work page 2024
[45]

Learning kinematic models of articulated ob- jects.Springer Tracts in Advanced Robotics, pages 65–111,

J ¨urgen Sturm. Learning kinematic models of articulated ob- jects.Springer Tracts in Advanced Robotics, pages 65–111,

work page
[46]

Learn- ing kinematic models for articulated objects

J ¨urgen Sturm, Vijay Pradeep, Cyrill Stachniss, Christian Plagemann, Kurt Konolige, and Wolfram Burgard. Learn- ing kinematic models for articulated objects. 2

work page
[47]

Articubot: Learning universal articulated object ma- nipulation policy via large scale simulation, 2025

Yufei Wang, Ziyu Wang, Mino Nakura, Pratik Bhowal, Chia- Liang Kuo, Yi-Ting Chen, Zackory Erickson, and David Held. Articubot: Learning universal articulated object ma- nipulation policy via large scale simulation, 2025. 3

work page 2025
[48]

Ar- ticulated object estimation in the wild

Abdelrhman Werby, Martin B ¨uchner, Adrian R ¨ofer, Chen- guang Huang, Wolfram Burgard, and Abhinav Valada. Ar- ticulated object estimation in the wild. InConference on Robot Learning (CoRL), 2025. 3, 6, 7

work page 2025
[49]

Ar- ticulated object estimation in the wild, 2025

Abdelrhman Werby, Martin B ¨uchner, Adrian R ¨ofer, Chen- guang Huang, Wolfram Burgard, and Abhinav Valada. Ar- ticulated object estimation in the wild, 2025. 2, 5, 6

work page 2025
[50]

Sapien: A simulated part-based interactive environment

Fanbo Xiang, Yuzhe Qin, Kaichun Mo, Yikuan Xia, Hao Zhu, Fangchen Liu, Minghua Liu, Hanxiao Jiang, Yifu Yuan, He Wang, et al. Sapien: A simulated part-based interactive environment. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11097– 11107, 2020. 2

work page 2020
[51]

Robotube: Learning household manipulation from human videos with simulated twin environments

Haoyu Xiong, Haoyuan Fu, Jieyi Zhang, Chen Bao, Qiang Zhang, Yongxi Huang, Wenqiang Xu, Animesh Garg, and Cewu Lu. Robotube: Learning household manipulation from human videos with simulated twin environments. In6th An- nual Conference on Robot Learning, 2022. 3

work page 2022
[52]

Depth any- thing v2, 2024

Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiao- gang Xu, Jiashi Feng, and Hengshuang Zhao. Depth any- thing v2, 2024. 4

work page 2024
[53]

Articulated human detection with flexible mixtures of parts.IEEE Transactions on Pat- tern Analysis and Machine Intelligence, 35(12):2878–2890,

Yi Yang and Deva Ramanan. Articulated human detection with flexible mixtures of parts.IEEE Transactions on Pat- tern Analysis and Machine Intelligence, 35(12):2878–2890,

work page
[54]

Open-vocabulary functional 3d scene graphs for real-world indoor spaces

Chenyangguang Zhang, Alexandros Delitzas, Fangjinhua Wang, Ruida Zhang, Xiangyang Ji, Marc Pollefeys, and Francis Engelmann. Open-vocabulary functional 3d scene graphs for real-world indoor spaces. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 19401–19413, 2025. 3

work page 2025
[55]

Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn

Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware, 2023. 4 Hoi! - A Multimodal Dataset for Force-Grounded, Cross-View Articulated Manipulation Supplementary Material Contents

work page 2023
[56]

In-the-Wild Articulated Object Estimation

Evaluations 6 4.1. In-the-Wild Articulated Object Estimation . . . . . 6 4.2. Tactile Force Estimation . . . . . . . . . . . . . . 7 4.3. Visual Force Estimation . . . . . . . . . . . . . . . 7

work page
[57]

Limitations & Future Work 8

work page
[58]

Hoi! Gripper Calibration Details 1 A.1

Conclusions 8 A . Hoi! Gripper Calibration Details 1 A.1 . Motor Calibration . . . . . . . . . . . . . . . . . . 1 A.2 . Inter-Sensor Calibration . . . . . . . . . . . . . . . 1 A.3 . Gripper Gravity Compensation . . . . . . . . . . . 2 B . Alignment of Sensors in the Hoi! Dataset Recordings 2 B.1. Time Alignment . . . . . . . . . . . . . . . . . . . 2 B....

work page

[1] [1]

Agibot world colosseo: A large-scale manipula- tion platform for scalable and intelligent embodied systems

Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xu Huang, Shu Jiang, et al. Agibot world colosseo: A large-scale manipula- tion platform for scalable and intelligent embodied systems. CoRR, 2025. 3

work page 2025

[2] [2]

Gomez Rodriguez, Jose M

Carlos Campos, Richard Elvira, Juan J. Gomez Rodriguez, Jose M. M. Montiel, and Juan D. Tardos. Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam.IEEE Transactions on Robotics, 37(6): 1874–1890, 2021. 5

work page 2021

[3] [3]

Emerging Properties in Self-Supervised Vision Transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Herv ´e J´egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerg- ing properties in self-supervised vision transformers.CoRR, abs/2104.14294, 2021. 7, 5

work page internal anchor Pith review Pith/arXiv arXiv 2021

[4] [4]

Tianyi Cheng, Dandan Shan, Ayda Sultan, Richard E. L. Higgins, and David F. Fouhey. Towards a richer 2d under- standing of hands at scale. InProceedings of the 37th Inter- national Conference on Neural Information Processing Sys- tems, Red Hook, NY , USA, 2023. Curran Associates Inc. 7

work page 2023

[5] [5]

Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots, 2024

Cheng Chi, Zhenjia Xu, Chuer Pan, Eric Cousineau, Ben- jamin Burchfiel, Siyuan Feng, Russ Tedrake, and Shuran Song. Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots, 2024. 4

work page 2024

[6] [6]

Collins, Cody Houff, You Liang Tan, and Charles C

Jeremy A. Collins, Cody Houff, You Liang Tan, and Charles C. Kemp. Forcesight: Text-guided mobile manip- ulation with visual-force goals, 2023. 2, 7, 8, 6

work page 2023

[7] [7]

Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100.International Journal of Computer Vision (IJCV), 130:33–55, 2022

Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Jian Ma, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100.International Journal of Computer Vision (IJCV), 130:33–55, 2022. 2, 3

work page 2022

[8] [8]

Epic-kitchens visor benchmark: Video segmenta- tions and object relations, 2022

Ahmad Darkhalil, Dandan Shan, Bin Zhu, Jian Ma, Amlan Kar, Richard Higgins, Sanja Fidler, David Fouhey, and Dima Damen. Epic-kitchens visor benchmark: Video segmenta- tions and object relations, 2022. 7

work page 2022

[9] [9]

SceneFun3D: Fine-Grained Functionality and Affordance Understanding in 3D Scenes

Alexandros Delitzas, Ayca Takmaz, Federico Tombari, Robert Sumner, Marc Pollefeys, and Francis Engelmann. SceneFun3D: Fine-Grained Functionality and Affordance Understanding in 3D Scenes. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 2

work page 2024

[10] [10]

borglab/gtsam,

Frank Dellaert and GTSAM Contributors. borglab/gtsam,

work page

[11] [11]

Flowbot3d: Learning 3d ar- ticulation flow to manipulate articulated objects.Robotics Science and Systems 2022, 2022

Ben Eisner and Harry Zhang. Flowbot3d: Learning 3d ar- ticulation flow to manipulate articulated objects.Robotics Science and Systems 2022, 2022. 2

work page 2022

[12] [12]

Project aria: A new tool for egocentric multi-modal ai research,

Jakob Engel, Kiran Somasundaram, Michael Goesele, Al- bert Sun, Alexander Gamino, Andrew Turner, Arjang Talat- tof, Arnie Yuan, Bilal Souti, Brighid Meredith, Cheng Peng, Chris Sweeney, Cole Wilson, Dan Barnes, Daniel DeTone, David Caruso, Derek Valleroy, Dinesh Ginjupalli, Duncan Frost, Edward Miller, Elias Mueggler, Evgeniy Oleinik, Fan Zhang, Guruprasa...

work page

[13] [13]

Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot, 2023

Hao-Shu Fang, Hongjie Fang, Zhenyu Tang, Jirong Liu, Chenxi Wang, Junbo Wang, Haoyi Zhu, and Cewu Lu. Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot, 2023. 3

work page 2023

[14] [14]

Fischler and Robert C

Martin A. Fischler and Robert C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography.Commun. ACM, 24(6):381–395, 1981. 4

work page 1981

[15] [15]

Springer International Publishing, Cham,

Fadri Furrer, Marius Fehr, Tonci Novkovic, Hannes Sommer, Igor Gilitschenski, and Roland Siegwart.Evaluation of Com- bined Time-Offset Estimation and Hand-Eye Calibration on Robotic Datasets. Springer International Publishing, Cham,

work page

[16] [16]

Gonzalez

T. Gonzalez. Clustering to minimize the maximum inter- cluster distance.Theoretical Computer Science, 38:293–306,

work page

[17] [17]

Ego4d: Around the world in 3,000 hours of egocentric video

Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 18995–19012, 2022. 2, 3, 7

work page 2022

[18] [18]

Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu- Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Moh...

work page 2024

[19] [19]

Opening articulated structures in the real world,

Arjun Gupta, Michelle Zhang, Rishik Sathua, and Saurabh Gupta. Opening articulated structures in the real world,

work page

[20] [20]

Articulate3d: Holistic understanding of 3d scenes as universal scene de- scription

Anna-Maria Halacheva, Yang Miao, Jan-Nico Zaech, Xi Wang, Luc Van Gool, and Danda Pani Paudel. Articulate3d: Holistic understanding of 3d scenes as universal scene de- scription. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025. 2

work page 2025

[21] [21]

Carto: Category and joint agnostic reconstruction of articulated objects

Nick Heppert, Muhammad Zubair Irshad, Sergey Zakharov, Katherine Liu, Rares Andrei Ambrus, Jeannette Bohg, Ab- hinav Valada, and Thomas Kollar. Carto: Category and joint agnostic reconstruction of articulated objects. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21201–21210, 2023. 2

work page 2023

[22] [22]

Sparsh: Self-supervised touch rep- resentations for vision-based tactile sensing

Carolina Higuera, Akash Sharma, Chaithanya Krishna Bod- duluri, Taosha Fan, Patrick Lancaster, Mrinal Kalakrishnan, Michael Kaess, Byron Boots, Mike Lambeta, Tingfan Wu, and Mustafa Mukadam. Sparsh: Self-supervised touch rep- resentations for vision-based tactile sensing. In8th Annual Conference on Robot Learning, 2024. 7, 5

work page 2024

[23] [23]

Advait Jain and Charles C. Kemp. Improving robot manip- ulation with data-driven object-centric models of everyday forces.Autonomous Robots, 35(2):143–159, 2013. 2

work page 2013

[24] [24]

Ditto: Building digital twins of articulated objects from interaction,

Zhenyu Jiang, Cheng-Chun Hsu, and Yuke Zhu. Ditto: Building digital twins of articulated objects from interaction,

work page

[25] [25]

Egomimic: Scaling imitation learning via egocentric video

Simar Kareer, Dhruv Patel, Ryan Punamiya, Pranay Mathur, Shuo Cheng, Chen Wang, Judy Hoffman, and Danfei Xu. Egomimic: Scaling imitation learning via egocentric video. In2025 IEEE International Conference on Robotics and Au- tomation (ICRA), pages 13226–13233. IEEE, 2025. 3

work page 2025

[26] [26]

Mapanything: Universal feed- forward metric 3d reconstruction, 2025

Nikhil Keetha, Norman M ¨uller, Johannes Sch ¨onberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, Jonathon Luiten, Manuel Lopez-Antequera, Samuel Rota Bul`o, Christian Richardt, Deva Ramanan, Sebastian Scherer, and Peter Kontschieder. Mapanything: Universal feed- forward metric 3d reconstructio...

work page 2025

[27] [27]

Droid: A large-scale in-the-wild robot manipulation dataset

Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Bal- akrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. InRSS 2024 Workshop: Data Generation for Robotics. 3

work page 2024

[28] [28]

Phantom: Training robots without robots using only human videos,

Marion Lepert, Jiaying Fang, and Jeannette Bohg. Phantom: Training robots without robots using only human videos,

work page

[29] [29]

Akb-48: A real-world articulated object knowledge base

Liu Liu, Wenqiang Xu, Haoyuan Fu, Sucheng Qian, Qiao- jun Yu, Yang Han, and Cewu Lu. Akb-48: A real-world articulated object knowledge base. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14809–14818, 2022. 2

work page 2022

[30] [30]

Forcemimic: Force-centric imitation learning with force-motion capture system for contact-rich manipula- tion, 2025

Wenhai Liu, Junbo Wang, Yiming Wang, Weiming Wang, and Cewu Lu. Forcemimic: Force-centric imitation learning with force-motion capture system for contact-rich manipula- tion, 2025. 3

work page 2025

[31] [31]

Artgs: Building interactable repli- cas of complex articulated objects via gaussian splatting,

Yu Liu, Baoxiong Jia, Ruijie Lu, Junfeng Ni, Song-Chun Zhu, and Siyuan Huang. Artgs: Building interactable repli- cas of complex articulated objects via gaussian splatting,

work page

[32] [32]

The rbo dataset of articulated objects and interactions, 2018

Roberto Mart ´ın-Mart´ın, Clemens Eppner, and Oliver Brock. The rbo dataset of articulated objects and interactions, 2018. 2, 3

work page 2018

[33] [33]

Partnet: A large- scale benchmark for fine-grained and hierarchical part-level 3d object understanding

Kaichun Mo, Shilin Zhu, Angel X Chang, Li Yi, Subarna Tripathi, Leonidas J Guibas, and Hao Su. Partnet: A large- scale benchmark for fine-grained and hierarchical part-level 3d object understanding. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 909–918, 2019. 2

work page 2019

[34] [34]

R3m: A universal visual repre- sentation for robot manipulation

Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3m: A universal visual repre- sentation for robot manipulation. In6th Annual Conference on Robot Learning. 3

work page

[35] [35]

Dinov2: Learning robust visual features with- out supervision, 2024

Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mah- moud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herv ´e Je- gou, Julien Mairal, ...

work page 2024

[36] [36]

Reconstruct- ing hands in 3D with transformers

Georgios Pavlakos, Dandan Shan, Ilija Radosavovic, Angjoo Kanazawa, David Fouhey, and Jitendra Malik. Reconstruct- ing hands in 3D with transformers. InCVPR, 2024. 6

work page 2024

[37] [37]

Hd-epic: A highly-detailed egocentric video dataset

Toby Perrett, Ahmad Darkhalil, Saptarshi Sinha, Omar Emara, Sam Pollard, Kranti Kumar Parida, Kaiting Liu, Pra- jwal Gatti, Siddhant Bansal, Kevin Flanagan, et al. Hd-epic: A highly-detailed egocentric video dataset. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 23901–23913, 2025. 2, 3

work page 2025

[38] [38]

Qi, Li Yi, Hao Su, and Leonidas J

Charles R. Qi, Li Yi, Hao Su, and Leonidas J. Guibas. Point- net++: Deep hierarchical feature learning on point sets in a metric space, 2017. 4

work page 2017

[39] [39]

Sam 2: Segment anything in images and videos,

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junt- ing Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao- Yuan Wu, Ross Girshick, Piotr Doll´ar, and Christoph Feicht- enhofer. Sam 2: Segment anything in images and videos,

work page

[40] [40]

Extending kalibr: Cali- brating the extrinsics of multiple imus and of individual axes

Joern Rehder, Janosch Nikolic, Thomas Schneider, Timo Hinzmann, and Roland Siegwart. Extending kalibr: Cali- brating the extrinsics of multiple imus and of individual axes. pages 4304–4311, 2016. 1

work page 2016

[41] [41]

Enhanc- ing robotic skill acquisition with multimodal sensory data: A novel dataset for kitchen tasks.Scientific Data, 12(1):476,

Ruochen Ren, Zhipeng Wang, Chaoyun Yang, Jiahang Liu, Rong Jiang, Yanmin Zhou, Shuo Jiang, and Bin He. Enhanc- ing robotic skill acquisition with multimodal sensory data: A novel dataset for kitchen tasks.Scientific Data, 12(1):476,

work page

[42] [42]

Orb: An efficient alternative to sift or surf

Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. Orb: An efficient alternative to sift or surf. In2011 International Conference on Computer Vision, pages 2564– 2571, 2011. 4

work page 2011

[43] [43]

From coarse to fine: Robust hierarchical localization at large scale

Paul-Edouard Sarlin, Cesar Cadena, Roland Siegwart, and Marcin Dymczyk. From coarse to fine: Robust hierarchical localization at large scale. InCVPR, 2019. 5, 4

work page 2019

[44] [44]

Reacto: Reconstructing articulated ob- jects from a single video, 2024

Chaoyue Song, Jiacheng Wei, Chuan-Sheng Foo, Guosheng Lin, and Fayao Liu. Reacto: Reconstructing articulated ob- jects from a single video, 2024. 2

work page 2024

[45] [45]

Learning kinematic models of articulated ob- jects.Springer Tracts in Advanced Robotics, pages 65–111,

J ¨urgen Sturm. Learning kinematic models of articulated ob- jects.Springer Tracts in Advanced Robotics, pages 65–111,

work page

[46] [46]

Learn- ing kinematic models for articulated objects

J ¨urgen Sturm, Vijay Pradeep, Cyrill Stachniss, Christian Plagemann, Kurt Konolige, and Wolfram Burgard. Learn- ing kinematic models for articulated objects. 2

work page

[47] [47]

Articubot: Learning universal articulated object ma- nipulation policy via large scale simulation, 2025

Yufei Wang, Ziyu Wang, Mino Nakura, Pratik Bhowal, Chia- Liang Kuo, Yi-Ting Chen, Zackory Erickson, and David Held. Articubot: Learning universal articulated object ma- nipulation policy via large scale simulation, 2025. 3

work page 2025

[48] [48]

Ar- ticulated object estimation in the wild

Abdelrhman Werby, Martin B ¨uchner, Adrian R ¨ofer, Chen- guang Huang, Wolfram Burgard, and Abhinav Valada. Ar- ticulated object estimation in the wild. InConference on Robot Learning (CoRL), 2025. 3, 6, 7

work page 2025

[49] [49]

Ar- ticulated object estimation in the wild, 2025

Abdelrhman Werby, Martin B ¨uchner, Adrian R ¨ofer, Chen- guang Huang, Wolfram Burgard, and Abhinav Valada. Ar- ticulated object estimation in the wild, 2025. 2, 5, 6

work page 2025

[50] [50]

Sapien: A simulated part-based interactive environment

Fanbo Xiang, Yuzhe Qin, Kaichun Mo, Yikuan Xia, Hao Zhu, Fangchen Liu, Minghua Liu, Hanxiao Jiang, Yifu Yuan, He Wang, et al. Sapien: A simulated part-based interactive environment. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11097– 11107, 2020. 2

work page 2020

[51] [51]

Robotube: Learning household manipulation from human videos with simulated twin environments

Haoyu Xiong, Haoyuan Fu, Jieyi Zhang, Chen Bao, Qiang Zhang, Yongxi Huang, Wenqiang Xu, Animesh Garg, and Cewu Lu. Robotube: Learning household manipulation from human videos with simulated twin environments. In6th An- nual Conference on Robot Learning, 2022. 3

work page 2022

[52] [52]

Depth any- thing v2, 2024

Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiao- gang Xu, Jiashi Feng, and Hengshuang Zhao. Depth any- thing v2, 2024. 4

work page 2024

[53] [53]

Articulated human detection with flexible mixtures of parts.IEEE Transactions on Pat- tern Analysis and Machine Intelligence, 35(12):2878–2890,

Yi Yang and Deva Ramanan. Articulated human detection with flexible mixtures of parts.IEEE Transactions on Pat- tern Analysis and Machine Intelligence, 35(12):2878–2890,

work page

[54] [54]

Open-vocabulary functional 3d scene graphs for real-world indoor spaces

Chenyangguang Zhang, Alexandros Delitzas, Fangjinhua Wang, Ruida Zhang, Xiangyang Ji, Marc Pollefeys, and Francis Engelmann. Open-vocabulary functional 3d scene graphs for real-world indoor spaces. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 19401–19413, 2025. 3

work page 2025

[55] [55]

Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn

Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware, 2023. 4 Hoi! - A Multimodal Dataset for Force-Grounded, Cross-View Articulated Manipulation Supplementary Material Contents

work page 2023

[56] [56]

In-the-Wild Articulated Object Estimation

Evaluations 6 4.1. In-the-Wild Articulated Object Estimation . . . . . 6 4.2. Tactile Force Estimation . . . . . . . . . . . . . . 7 4.3. Visual Force Estimation . . . . . . . . . . . . . . . 7

work page

[57] [57]

Limitations & Future Work 8

work page

[58] [58]

Hoi! Gripper Calibration Details 1 A.1

Conclusions 8 A . Hoi! Gripper Calibration Details 1 A.1 . Motor Calibration . . . . . . . . . . . . . . . . . . 1 A.2 . Inter-Sensor Calibration . . . . . . . . . . . . . . . 1 A.3 . Gripper Gravity Compensation . . . . . . . . . . . 2 B . Alignment of Sensors in the Hoi! Dataset Recordings 2 B.1. Time Alignment . . . . . . . . . . . . . . . . . . . 2 B....

work page