Kaiwu: A Multimodal Manipulation Dataset and Framework for Robot Learning and Human-Robot Interaction

Bin He; Haonan Li; Ruochen Ren; Shuo Jiang; Yanmin Zhou; Zhipeng Wang

arXiv: 2503.05231 · v2 · submitted 2025-03-07 · cs.RO · cs.AI

Shuo Jiang, Haonan Li, Ruochen Ren, Yanmin Zhou, Zhipeng Wang, Bin He

Pith reviewed 2026-05-23 01:24 UTC · model grok-4.3

keywords multimodal dataset · robot learning · human-robot interaction · dexterous manipulation · assembly tasks · imitation learning · motion capture · EMG signals

The pith

The Kaiwu dataset records 11,664 synchronized multimodal assembly actions performed by 20 human subjects on 30 objects to support robot learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents the Kaiwu multimodal dataset to supply the synchronized, real-world data that current robot learning methods require but often lack in assembly tasks. It collects integrated recordings from 20 subjects performing actions with 30 objects, yielding 11,664 instances that capture hand motions, operation pressures, sounds, multi-view videos, high-precision motion capture, eye gaze with first-person videos, and electromyography signals. Fine-grained annotations tied to absolute timestamps and semantic segmentation labels are added to each demonstration. The framework combines human, environment, and robot signals to enable progress in imitation learning, dexterous manipulation, and human-robot collaboration.

Core claim

The paper claims that the Kaiwu dataset supplies an integrated human-environment-robot data collection framework that records synchronized multimodal signals across 11,664 integrated actions from 20 subjects and 30 interaction objects, including hand motions, operation pressures, sounds of the assembling process, multi-view videos, high-precision motion capture information, eye gaze with first-person videos, and electromyography signals, together with fine-grained multi-level annotation based on absolute timestamp and semantic segmentation labelling.

What carries the argument

The multimodal data collection framework that synchronizes hand motions, operation pressures, sounds, multi-view videos, high-precision motion capture, eye gaze with first-person videos, and electromyography signals, paired with timestamp-based multi-level annotations and semantic segmentation.

If this is right

Robot learning algorithms can train on synchronized real-world pressure, audio, and motion data from human demonstrations.
Human intention investigation gains access to combined eye gaze, first-person video, and EMG signals during assembly.
Dexterous manipulation research can use the pressure and hand motion recordings to model contact-rich tasks.
Human-robot collaboration studies can draw on the multi-view videos and motion capture for natural interaction patterns.
Semantic segmentation labels enable scene understanding models that segment objects and actions in assembly videos.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Widespread use of the dataset could create a shared benchmark for testing whether multimodal fusion improves robot performance on contact-rich tasks.
The recorded sound and pressure channels open the possibility of training models that predict assembly failures from non-visual cues alone.
Researchers might test whether the eye-gaze and EMG data improve prediction of human intent in collaborative settings compared with video alone.

Load-bearing premise

The recorded signals from different sensors remain accurately synchronized and the fine-grained annotations reliably capture task semantics and dynamics in a form that transfers usefully to robot learning.

What would settle it

An experiment in which imitation learning models trained on Kaiwu data show no improvement in assembly success rates or generalization over models trained on existing single-modality or unsynchronized datasets would falsify the central claim.

Figures

Figures reproduced from arXiv: 2503.05231 by Bin He, Haonan Li, Ruochen Ren, Shuo Jiang, Yanmin Zhou, Zhipeng Wang.

**Figure 1.** A framework of wearable and environment-mounted sensors records rich activity information in the assembly environment.

**Figure 2.** Data glove with angle sensors and force sensors.

**Figure 3.** Overview of assembling process. C8 and C14 have different focuses. The former collected experimental data on columnar parts that participants tightened by hand in a situation where there is ample mounting space. The latter captures experimental data by tools in a confined space with limited hand movement. 19 sensors on the gloves. Each group of experiments can only be performed after completing the calibra…

**Figure 4.** Overview of annotation. … end timestamps of the multimodal data collected from each assembly task action unit. The annotation includes: • Action-level Annotation: Coarse segmentation of left and right hand actions. • Fine-grain Gesture Annotation: Within the coarsely segmented time interval, a fine-grain segmentation of the left or right hand states is made. For action segmentation, the left and right-handed…

**Figure 5.** KAIWU Dataset Structure Catalog. … comes from wearable sensors worn by 20 participants. Each participant is asked to complete 15 typical actions specified in the assembly process. Each participant spends approximately 19 minutes, resulting in the dataset representing about 6.3 hours of the assembling process. A summary of the related data is presented in Table V. B. Detailed information This dataset contains e…
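The two-level scheme described in the Figure 4 caption (coarse action-level segments per hand, refined by fine-grain gesture segments inside each coarse interval) can be pictured as nested intervals on a shared absolute clock. The sketch below is illustrative only: the class names, field names, and label strings are hypothetical and are not the dataset's actual file format, which the excerpts above do not specify.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Segment:
    """A labelled interval on the shared absolute clock (seconds)."""
    start: float
    end: float
    label: str

    def contains(self, other: "Segment") -> bool:
        return self.start <= other.start and other.end <= self.end


@dataclass
class ActionAnnotation:
    """Coarse action segment for one hand plus its fine-grain gestures."""
    hand: str  # "left" or "right" (hypothetical encoding)
    action: Segment
    gestures: List[Segment] = field(default_factory=list)

    def add_gesture(self, gesture: Segment) -> None:
        # A fine-grain segment must lie inside the coarse action interval.
        if not self.action.contains(gesture):
            raise ValueError("gesture falls outside its action interval")
        self.gestures.append(gesture)

    def gesture_at(self, t: float) -> Optional[str]:
        # Return the gesture label active at absolute time t, if any.
        for g in self.gestures:
            if g.start <= t < g.end:
                return g.label
        return None
```

Because every segment carries absolute timestamps, the same lookup can in principle be applied to any sensor stream that shares the clock, which is what makes the synchronization question raised later in the review so important.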

Original abstract

Cutting-edge robot learning techniques including foundation models and imitation learning from humans all pose huge demands on large-scale and high-quality datasets which constitute one of the bottleneck in the general intelligent robot fields. This paper presents the Kaiwu multimodal dataset to address the missing real-world synchronized multimodal data problems in the sophisticated assembling scenario,especially with dynamics information and its fine-grained labelling. The dataset first provides an integration of human,environment and robot data collection framework with 20 subjects and 30 interaction objects resulting in totally 11,664 instances of integrated actions. For each of the demonstration,hand motions,operation pressures,sounds of the assembling process,multi-view videos, high-precision motion capture information,eye gaze with first-person videos,electromyography signals are all recorded. Fine-grained multi-level annotation based on absolute timestamp,and semantic segmentation labelling are performed. Kaiwu dataset aims to facilitate robot learning,dexterous manipulation,human intention investigation and human-robot collaboration research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Kaiwu is a dataset paper that collects a large set of multimodal demonstrations for assembly tasks, but it does not include the quantitative checks on data alignment and label quality that would make the contribution more solid.

The work brings together hand motion, pressure, audio, multi-view video, motion capture, gaze, and EMG in one synchronized collection for 20 subjects doing tasks with 30 objects. That gives 11,664 instances with timestamp-based annotations and semantic segmentation. The focus on dynamic assembly with these channels is the main novelty, since most existing datasets miss some of these signals or the fine-grained labeling. The paper does well in laying out the collection framework and the motivation from the needs of imitation learning and human-robot interaction. Describing the integration of human, environment, and robot data in a real scenario is useful context.

The main soft spot is the absence of any numbers on synchronization accuracy or inter-annotator agreement. The stress-test concern holds: without bounds on timing offsets or reliability of the labels, it's hard to be sure the dataset solves the missing synchronized multimodal data issue it sets out to address. If the full text has experiments showing downstream performance or error measurements, that would change the picture, but based on what's described, the claim stays unverified.

This is for robotics researchers who need multimodal data for training models on dexterous manipulation or intention recognition. Someone building on imitation learning in assembly could get value from the scale and variety if the data quality is confirmed. I would send this to peer review. The topic is relevant and the scale is reasonable, so referees can assess the actual data pipeline and suggest the needed validation additions.

Referee Report

2 major / 2 minor

Summary. The manuscript presents the Kaiwu multimodal dataset for robot learning in assembly scenarios. It integrates data from 20 subjects and 30 objects to produce 11,664 instances, recording hand motions, operation pressures, sounds, multi-view videos, high-precision motion capture, eye gaze with first-person videos, and electromyography signals per demonstration. Fine-grained multi-level annotations based on absolute timestamps and semantic segmentation labelling are performed to support robot learning, dexterous manipulation, human intention investigation, and human-robot collaboration.

Significance. If the multimodal streams prove accurately synchronized and the annotations reliably capture task semantics and dynamics, the dataset could address a gap in synchronized real-world multimodal data for imitation learning and HRI. The scale (11,664 instances) and breadth of modalities are strengths, but the absence of any quantitative validation leaves the utility claim unverified.

major comments (2)

[Data Collection Framework] The central claim that the dataset provides 'synchronized multimodal data' to address the stated problem is load-bearing but unsupported. The data collection framework description (abstract and §Data Collection) invokes accurate time-alignment across hand motion, pressure, audio, video, MoCap, gaze, and EMG without reporting alignment error bounds, clock-drift measurements, or maximum observed offsets.
[Annotation Process] The annotation pipeline (abstract and §Annotation) claims fine-grained timestamp-based multi-level labels and semantic segmentation that transfer to robot learning, yet reports no inter-annotator agreement scores, label consistency metrics, or validation against task dynamics (e.g., contact events). This directly affects the weakest assumption identified in the review.
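One way to run the validation against contact events that the second major comment asks for is to compare each annotated segment boundary with the nearest contact onset detected in a pressure channel. The sketch below is an illustration under stated assumptions: the simple threshold rule for detecting contact and the function names are hypothetical, and the paper does not describe such a procedure.

```python
import numpy as np


def contact_onsets(t, pressure, threshold):
    """Return the timestamps at which pressure rises above the threshold."""
    t = np.asarray(t, dtype=float)
    above = np.asarray(pressure, dtype=float) > threshold
    # Rising edge: this sample is above threshold and the previous is not.
    rising = np.flatnonzero(above[1:] & ~above[:-1]) + 1
    return t[rising]


def boundary_errors(boundaries, events):
    """Absolute time gap from each annotated boundary to the nearest event."""
    boundaries = np.asarray(boundaries, dtype=float)
    events = np.asarray(events, dtype=float)
    if events.size == 0:
        raise ValueError("no contact events detected")
    gaps = np.abs(boundaries[:, None] - events[None, :])
    return gaps.min(axis=1)
```

Summary statistics of these gaps (median, maximum) would give a direct, if rough, measure of how closely the labels track the physical dynamics. A real analysis would likely need debouncing and per-sensor thresholds, which this sketch omits.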

minor comments (2)

The abstract lists modalities but does not specify sampling rates, resolution, or sensor models; adding a summary table would improve clarity.
[Abstract] Minor typographical issues: 'electromyography signals are all recorded' should read 'electromyography signals are recorded'; 'human,environment' lacks spacing.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the validation of our claims.

Referee: [Data Collection Framework] The central claim that the dataset provides 'synchronized multimodal data' to address the stated problem is load-bearing but unsupported. The data collection framework description (abstract and §Data Collection) invokes accurate time-alignment across hand motion, pressure, audio, video, MoCap, gaze, and EMG without reporting alignment error bounds, clock-drift measurements, or maximum observed offsets.

Authors: We agree that the manuscript lacks quantitative synchronization metrics. In the revised version we will add a dedicated subsection to §Data Collection that reports the synchronization protocol, measured clock-drift values, alignment error bounds obtained via cross-correlation on known events, and maximum observed offsets across all modality pairs. revision: yes
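To make the cross-correlation idea in this response concrete, the following minimal sketch estimates the time offset between two streams that both register the same known event. It assumes the two signals have already been resampled to a common rate; the function name and sign convention are illustrative choices, not the authors' protocol.

```python
import numpy as np


def estimate_offset(ref, other, rate_hz):
    """Estimate how many seconds `other` lags `ref` via cross-correlation.

    Both signals must share the same sampling rate. A positive result
    means events appear later in `other` than in `ref`.
    """
    ref = np.asarray(ref, dtype=float)
    other = np.asarray(other, dtype=float)
    ref = ref - ref.mean()
    other = other - other.mean()
    # Entry i of the full correlation corresponds to lag i - (len(ref) - 1).
    corr = np.correlate(other, ref, mode="full")
    lag = int(np.argmax(corr)) - (len(ref) - 1)
    return lag / rate_hz
```

Repeating the estimate at several points in a recording session would also expose clock drift, since a drifting clock shows up as an offset that changes over time. The resolution of this estimate is limited to one sample period unless the correlation peak is interpolated.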
Referee: [Annotation Process] The annotation pipeline (abstract and §Annotation) claims fine-grained timestamp-based multi-level labels and semantic segmentation that transfer to robot learning, yet reports no inter-annotator agreement scores, label consistency metrics, or validation against task dynamics (e.g., contact events). This directly affects the weakest assumption identified in the review.

Authors: We concur that annotation quality metrics are necessary. The revised §Annotation will include inter-annotator agreement scores (Fleiss' kappa) for the semantic segmentation labels, label consistency statistics, and a validation analysis comparing semantic boundaries against contact events derived from pressure and motion-capture signals. revision: yes
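For readers unfamiliar with the agreement statistic named in this response, the sketch below computes Fleiss' kappa from a table of rating counts. It is a generic textbook implementation, not the authors' evaluation code; how segmentation labels would be discretized into items and categories is left open here because the response does not say.

```python
import numpy as np


def fleiss_kappa(counts):
    """Fleiss' kappa for an (items x categories) matrix of rating counts.

    counts[i, j] is the number of annotators who assigned item i to
    category j. Every item must be rated by the same number (>= 2) of
    annotators.
    """
    counts = np.asarray(counts, dtype=float)
    n_items = counts.shape[0]
    n_raters = counts[0].sum()
    if n_raters < 2 or not np.all(counts.sum(axis=1) == n_raters):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    # Share of all ratings that fall in each category.
    p_cat = counts.sum(axis=0) / (n_items * n_raters)
    # Per-item agreement: fraction of annotator pairs that agree.
    p_item = (np.sum(counts ** 2, axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_item.mean()        # mean observed agreement
    p_exp = np.sum(p_cat ** 2)   # agreement expected by chance
    # Undefined (division by zero) if every rating is in a single category.
    return (p_bar - p_exp) / (1.0 - p_exp)
```

A value of 1 indicates perfect agreement and values near 0 indicate agreement no better than chance, so reporting kappa alongside the raw agreement rate would address the referee's concern more fully than either number alone.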

Circularity Check

0 steps flagged

No circularity: dataset description only

The paper introduces a multimodal dataset (11,664 instances, multiple sensor streams, timestamped annotations) without any derivations, equations, predictions, or fitted parameters. The contribution is the data collection framework itself; no step reduces a claimed result to its own inputs by construction, self-citation, or renaming. The work is self-contained as a descriptive resource for downstream robot learning research.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Dataset contribution with no mathematical derivations; no free parameters, axioms, or invented entities are introduced beyond standard data-collection practices.

pith-pipeline@v0.9.0 · 5710 in / 1202 out tokens · 36702 ms · 2026-05-23T01:24:26.526424+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

World Action Models: The Next Frontier in Embodied AI
cs.RO 2026-05 unverdicted novelty 4.0

The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
