pith. machine review for the scientific record

arxiv: 2604.27621 · v1 · submitted 2026-04-30 · 💻 cs.RO · cs.CV

Recognition: unknown

Robot Learning from Human Videos: A Survey

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 04:52 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords robot learning · human videos · skill transfer · manipulation · embodied AI · imitation learning · survey · data scaling

The pith

Human videos can supply the data needed to train robots in manipulation skills without large amounts of robot-specific interaction data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reviews methods that let robots acquire manipulation skills by observing ordinary human activity videos. It first lays out the basics of robot policy learning and the technical interfaces that link video data to robot control. It then organizes existing work into a hierarchy of transfer routes focused on tasks, observations, or actions, and pairs each route with the data setups and learning styles that support it. The survey also tracks the growth of human video datasets and generation tools, ending with open problems. A reader would care because human videos exist in enormous quantities already, offering a route to general-purpose robots trained passively rather than through expensive, slow robot trials.

Core claim

Learning robot manipulation from human videos can be organized through a hierarchical taxonomy of transfer pathways (task-oriented, observation-oriented, and action-oriented), each coupled with particular data configurations and learning paradigms. The taxonomy rests on expanding human video datasets and generation methods, while transfer across the human-robot gap remains a persistent challenge.

What carries the argument

The hierarchical taxonomy of human-to-robot skill transfer that divides approaches into task-oriented, observation-oriented, and action-oriented pathways and analyzes how each couples with different data setups and learning methods.

If this is right

  • Robots gain access to skill examples without collecting their own interaction data at scale.
  • Different transfer pathways can be matched to imitation learning, reinforcement learning, or hybrid methods depending on available data.
  • Large existing collections of human activity videos become practical training resources for robot policies.
  • Video generation techniques can expand limited demonstration sets while preserving task structure.
  • Persistent gaps in transfer success point to needed advances in handling domain differences between humans and robots.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same video sources that train vision models could feed robot policies, creating a shared data foundation across perception and control.
  • Unlabeled internet videos of daily activities could become a default training corpus once transfer methods mature.
  • Benchmark suites that vary only viewpoint and body type while keeping the task fixed would directly measure how well current pathways close the embodiment gap.

Load-bearing premise

Computer vision tools can reliably pull out skill information from human videos that still works on robots even when bodies, camera angles, and physical behaviors differ.

What would settle it

A controlled test in which a robot policy trained only on human videos fails to complete the same manipulation task that a human performs in the video, with the failure traced specifically to embodiment or viewpoint mismatch despite using current vision models.

Figures

Figures reproduced from arXiv: 2604.27621 by Chenyang Xu, Ditao Li, Erhang Zhang, Guangming Wang, Haoran Yang, Hesheng Wang, Junyi Ma.

Figure 1. Illustration of bridging human videos and robot execution reviewed in this survey. We organize related works into three categories: task-oriented, observation-oriented, and action-oriented transfer pathways. We conduct intra-family analysis and cross-family comparison to highlight their design principles and trade-offs, and further provide practical guidelines for selecting suitable LfHV routes.
Figure 2. Taxonomy of robot learning from human videos.
Figure 3. Illustration of bridging human videos and robot execution in task, observation, and action levels.
Figure 4. High-level diagram of task structures as a bridge.
Figure 5. Chronological overview of methods under task structures as a bridge in Sec. 3.1.1.
Figure 6. High-level diagram of task intents as a bridge.
Figure 7. Chronological overview of methods under task intents as a bridge in Sec. 3.1.2.
Figure 8. Chronological overview of methods under transformed videos as a bridge in Sec. 3.2.1.
Figure 10. Chronological overview of methods under visual embeddings as a bridge in Sec. 3.2.2.
Figure 11. High-level diagram of visual embeddings as a bridge. Some elements of the illustrations are adapted from Nair et al. (2022); Xiao et al. (2022); Zhu et al. (2025); Punamiya et al. (2025); Chen et al. (2021).
Figure 12. Illustration of HOI analysis techniques and extracted affordances from human videos. Part of the images are taken from Zhang et al. (2020); Labbé et al. (2022); Sivakumar et al. (2022); Qin et al. (2022b); Cheng et al. (2023); Wen et al. (2023a); Wang et al. (2023a); Kumar et al. (2023); Bahl et al. (2023); Wen et al. (2023b); Karaev et al. (2024); Bahety et al. (2024); Kerr et al. (2024); Zhang et al. (202…
Figure 13. High-level diagram of affordances as a bridge.
Figure 14. Chronological overview of methods under affordances as a bridge in Sec. 3.3.1.
Figure 15. Chronological overview of methods under latent actions as a bridge in Sec. 3.3.2.
Figure 16. High-level diagram of latent actions as a bridge. Some elements of the illustrations are adapted from Ye et al. (2024).
Figure 17. Popularity trends of egocentric, exocentric, and egocentric+exocentric video configurations.
Figure 18. Data pyramids defined by Bjorck et al. (2025); Bi et al. (2025a); Wen et al. (2025), respectively.
Figure 19. Illustration of human video generation for robot execution, which is originally shown in Li et al. (2025b).
Original abstract

A critical bottleneck hindering further advancement in embodied AI and robotics is the challenge of scaling robot data. To address this, the field of learning robot manipulation skills from human video data has attracted rapidly growing attention in recent years, driven by the abundance of human activity videos and advances in computer vision. This line of research promises to enable robots to acquire skills passively from the vast and readily available resource of human demonstrations, substantially favoring scalable learning for generalist robotic systems. Therefore, we present this survey to provide a comprehensive and up-to-date review of human-video-based learning techniques in robotics, focusing on both human-robot skill transfer and data foundations. We first review the policy learning foundations in robotics, and then describe the fundamental interfaces to incorporate human videos. Subsequently, we introduce a hierarchical taxonomy of transferring human videos to robot skills, covering task-, observation-, and action-oriented pathways, along with a cross-family analysis of their couplings with different data configurations and learning paradigms. In addition, we investigate the data foundations including widely-used human video datasets and video generation schemes, and provide large-scale statistical trends in dataset development and utilization. Ultimately, we emphasize the challenges and limitations intrinsic to this field, and delineate potential avenues for future research. The paper list of our survey is available at https://github.com/IRMVLab/awesome-robot-learning-from-human-videos.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance, and this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper is a survey of techniques for learning robot manipulation skills from human video data. It begins with policy learning foundations in robotics and the interfaces for incorporating human videos. It then introduces a hierarchical taxonomy of human-to-robot skill transfer organized around task-, observation-, and action-oriented pathways, together with a cross-family analysis of their couplings to data configurations and learning paradigms. Finally, it surveys widely used human video datasets and video generation methods, reports large-scale statistical trends in dataset development and utilization, and concludes with challenges, limitations, and future research directions. An accompanying GitHub repository lists the surveyed papers.

Significance. If the coverage proves comprehensive and the summarizations accurate, the survey would be a timely contribution to embodied AI and robotics. It synthesizes a rapidly expanding literature on passive skill acquisition from abundant human videos, supplies a structured taxonomy and data analysis that can orient new researchers, and includes an open paper list that supports community reproducibility. These elements directly address the data-scaling bottleneck highlighted in the abstract and could accelerate work on generalist robotic systems.

major comments (2)
  1. [Taxonomy section] In the taxonomy section (following the policy foundations review), the three pathways are labeled 'hierarchical' yet presented as largely parallel families. The manuscript should explicitly define the hierarchy levels (e.g., which pathway subsumes or refines another) and provide a decision tree or table showing how a given method is classified; otherwise the taxonomy risks being descriptive rather than prescriptive.
  2. [Data foundations section] The data foundations and statistical trends section reports dataset growth curves and utilization statistics, but the survey does not tabulate the fraction of reviewed papers that employ each dataset or generation scheme. Without this breakdown, the claimed 'cross-family analysis of couplings' between pathways and data configurations cannot be quantitatively verified by readers.
minor comments (3)
  1. [Abstract and Taxonomy] The abstract states that the taxonomy covers 'task-, observation-, and action-oriented pathways' but the main text should add a short paragraph or figure caption that maps each pathway to the embodiment, viewpoint, and dynamics gaps mentioned in the challenges section.
  2. [Figures] Figure captions for the statistical trend plots should include the exact time window, number of papers sampled, and inclusion criteria so that the trends can be reproduced from the GitHub list.
  3. [Interfaces subsection] A few citations to foundational video-understanding works (e.g., recent action recognition or video prediction benchmarks) appear to be missing from the computer-vision interfaces subsection; adding them would strengthen the claim that CV advances enable the transfer.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and the recommendation for minor revision. The comments identify valuable opportunities to strengthen the clarity of our taxonomy and the quantitative support for our data analysis. We address each major comment below and will incorporate the necessary revisions.

Point-by-point responses
  1. Referee: [Taxonomy section] In the taxonomy section (following the policy foundations review), the three pathways are labeled 'hierarchical' yet presented as largely parallel families. The manuscript should explicitly define the hierarchy levels (e.g., which pathway subsumes or refines another) and provide a decision tree or table showing how a given method is classified; otherwise the taxonomy risks being descriptive rather than prescriptive.

    Authors: We appreciate the referee's careful reading. The taxonomy is organized around three pathways that reflect increasing specificity in the transfer process: task-oriented pathways address high-level goal and reward specification, observation-oriented pathways refine visual and state alignment, and action-oriented pathways map to low-level control. While this structure implies a natural hierarchy of abstraction levels, the manuscript does not explicitly articulate the subsumption relations or supply a classification procedure. We will revise the Taxonomy section to define the hierarchy levels clearly, state that task-oriented methods typically provide the contextual foundation that observation- and action-oriented methods refine, and add both a decision tree and a summary table that lists classification criteria for each surveyed method. These additions will make the taxonomy prescriptive and directly address the concern. revision: yes

  2. Referee: [Data foundations section] The data foundations and statistical trends section reports dataset growth curves and utilization statistics, but the survey does not tabulate the fraction of reviewed papers that employ each dataset or generation scheme. Without this breakdown, the claimed 'cross-family analysis of couplings' between pathways and data configurations cannot be quantitatively verified by readers.

    Authors: We agree that a quantitative breakdown would strengthen the verifiability of the cross-family analysis. The current section reports aggregate growth trends and qualitative observations on pathway–data couplings. To enable readers to confirm the claimed couplings, we will add a table (or set of tables) in the Data foundations section that reports, for each major dataset and generation scheme, the fraction of papers from each taxonomy pathway that utilize it. We are currently extracting these counts from the surveyed literature and will include the completed table in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

Survey paper contains no derivations, equations, or fitted predictions

Full rationale

This document is a literature survey that reviews existing policy learning foundations, defines a taxonomy of human-to-robot transfer pathways (task-, observation-, and action-oriented), analyzes couplings to data and learning paradigms, reports statistical trends on datasets, and outlines open challenges. It presents no original equations, no parameter fitting, no 'predictions' derived from inputs, and no self-referential modeling steps. The central aspirational claim about scalable learning from human videos is supported by external literature review rather than any internal chain that reduces to the paper's own assumptions or citations. No load-bearing step matches any of the enumerated circularity patterns; the paper is self-contained as a review and carries no circularity burden.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a survey paper the work introduces no new free parameters, axioms, or invented entities. It relies on standard background assumptions from robotics and computer vision that are already established in the cited literature.

pith-pipeline@v0.9.0 · 5552 in / 1175 out tokens · 39030 ms · 2026-05-07T04:52:12.060949+00:00 · methodology


Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 5 internal anchors

  1. [1] GPT-4 Technical Report
     Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, Aleman FL, Almeida D, Altenschmidt J, Altman S, Anadkat S et al. (2023) GPT-4 technical report. arXiv preprint arXiv:2303.08774.

  2. [2] GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
     Bjorck J, Castañeda F, Cherniadev N, Da X, Ding R, Fan L, Fang Y, Fox D, Hu F, Huang S et al. (2025) GR00T N1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734.

  3. [3] AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems
     Bu Q, Cai J, Chen L, Cui X, Ding Y, Feng S, Gao S, He X, Hu X, Huang X et al. (2025a) AgiBot World Colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems. arXiv preprint arXiv:2503.06669.

  4. [4] Vid2Robot: End-to-End Video-Conditioned Policy Learning with Cross-Attention Transformers
     Jain V, Attarian M, Joshi NJ, Wahid A, Driess D, Vuong Q, Sanketi PR, Sermanet P, Welker S, Chan C et al. (2024) Vid2Robot: End-to-end video-conditioned policy learning with cross-attention transformers.

  5. [5] DEFT: Dexterous Fine-Tuning for Real-World Hand Policies
     Kannan A, Shaw K, Bahl S, Mannam P and Pathak D (2023) DEFT: Dexterous fine-tuning for real-world hand policies. In: Conference on Robot Learning.

  6. [6] Robot Learning from a Physical World Model
     Mao J, He S, Wu HN, You Y, Sun S, Wang Z, Bao Y, Chen H, Guibas L, Guizilini V et al. (2025) Robot learning from a physical world model. arXiv preprint arXiv:2511.07416.

  7. [7] HD-EPIC: A Highly-Detailed Egocentric Video Dataset
     Perrett T, Darkhalil A, Sinha S, Emara O, Pollard S, Parida KK, Liu K, Gatti P, Bansal S, Flanagan K et al. (2025) HD-EPIC: A highly-detailed egocentric video dataset. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 23901–23913.

  8. [8] SAM 2: Segment Anything in Images and Videos
     Ravi N, Gabeur V, Hu YT, Hu R, Ryali C, Ma T, Khedr H, Rädle R, Rolland C, Gustafson L et al. (2024) SAM 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714.

  9. [9] Mitty: Diffusion-Based Human-to-Robot Video Generation
     Song Y, Liu C, Mao W and Shou MZ (2025) Mitty: Diffusion-based human-to-robot video generation. arXiv preprint arXiv:2512.17253.

  10. [10] FUNCTO: Function-Centric One-Shot Imitation Learning for Tool Manipulation
      Tang C, Xiao A, Deng Y, Hu T, Dong W, Zhang H, Hsu D and Zhang H (2025a) FUNCTO: Function-centric one-shot imitation learning for tool manipulation. arXiv preprint arXiv:2502.11744.

  11. [11] Attention Is All You Need
      Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł and Polosukhin I (2017) Attention is all you need. Advances in Neural Information Processing Systems 30.

  12. [12] EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Video Generation
      Wang X et al. (2024) EgoVid-5M: A large-scale video-action dataset for egocentric video generation.

  13. [13] Learning Continuous Grasping Function with a Dexterous Hand from Human Demonstrations
      Ye J, Wang J, Huang B, Qin Y and Wang X (2023) Learning continuous grasping function with a dexterous hand from human demonstrations. IEEE Robotics and Automation Letters 8(5): 2882–2889.

  14. [14] 3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations
      Ze Y, Zhang G, Zhang K, Hu C, Wang M and Xu H (2024) 3D Diffusion Policy: Generalizable visuomotor policy learning via simple 3D representations. arXiv preprint arXiv:2403.03954.