X-Diffusion: Training Diffusion Policies on Cross-Embodiment Human Demonstrations

Atiksh Bhardwaj; Audrey Du; Chuanruo Ning; Edward W. Duan; Kushal Kedia; Maximus A. Pace; Prithwish Dan; Wei-Chiu Ma

arXiv: 2511.04671 · v2 · submitted 2025-11-06 · 💻 cs.RO · cs.AI · cs.CV

Maximus A. Pace, Prithwish Dan, Chuanruo Ning, Atiksh Bhardwaj, Audrey Du, Edward W. Duan, Wei-Chiu Ma, Kushal Kedia

Pith reviewed 2026-05-18 00:33 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.CV

keywords diffusion policies · cross-embodiment learning · human demonstrations · robot manipulation · Ambient Diffusion · real-world tasks

The pith

X-Diffusion trains diffusion policies by treating human actions as noisy robot counterparts at high noise levels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to extract useful task guidance from human videos even when the exact movements cannot be performed by a robot due to body differences. It adapts an existing diffusion training trick to add human data only during the noisiest stages of the process, where embodiment details disappear but object interaction intent remains. Experiments on five real robot manipulation tasks report a 16 percent average success improvement over simply mixing all data or hand-filtering it. A reader would care because robot data collection is costly while human videos are plentiful and already available at scale.

Core claim

X-Diffusion is a cross-embodiment learning framework based on Ambient Diffusion that selectively trains diffusion policies on noised human actions. Human actions are viewed as noisy counterparts of robot actions: as noise increases along the forward diffusion process, embodiment-specific differences fade away while task-relevant guidance is preserved. This enables effective use of easy-to-collect human videos without sacrificing robot feasibility.
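The gating idea in this claim can be illustrated with a minimal sketch. It assumes a standard DDPM-style forward process with a linear beta schedule and a single fixed cutoff `t_min_human`; both are assumptions made here for illustration, and the figure captions suggest the paper instead uses a classifier to decide where human data is admitted. All function and variable names are hypothetical.

```python
import numpy as np

def make_alpha_bar(num_steps: int, beta_start: float = 1e-4, beta_end: float = 2e-2) -> np.ndarray:
    """Cumulative signal retention alpha_bar_t for a linear beta schedule (assumed, DDPM-style)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def noise_actions(actions, t, alpha_bar, rng):
    """Forward diffusion: a_t = sqrt(alpha_bar_t) * a_0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.standard_normal(actions.shape)
    scale = np.sqrt(alpha_bar[t])[:, None]
    sigma = np.sqrt(1.0 - alpha_bar[t])[:, None]
    return scale * actions + sigma * eps, eps

def loss_mask(is_human, t, t_min_human):
    """Robot samples count at every timestep; human samples only when t >= t_min_human.

    t_min_human is a hypothetical fixed cutoff used only for this sketch.
    """
    return (~is_human) | (t >= t_min_human)

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar(num_steps=100)
actions = rng.standard_normal((6, 7))           # 6 toy action vectors of dimension 7
is_human = np.array([False, False, False, True, True, True])
t = np.array([5, 50, 95, 5, 50, 95])            # sampled diffusion timesteps
noisy, eps = noise_actions(actions, t, alpha_bar, rng)
mask = loss_mask(is_human, t, t_min_human=60)   # multiply the per-sample denoising loss by this mask
print(mask)
```

In this toy batch only the human sample drawn at the highest timestep contributes to the loss, while every robot sample does.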

What carries the argument

The X-Diffusion framework, which incorporates human demonstrations only at high-noise timesteps of the forward diffusion process.

If this is right

Average success rates improve by 16% over naive co-training and manual data filtering across five real-world manipulation tasks.
Robots acquire task intent from coarse human guidance without adopting infeasible execution details.
Human videos become a usable, scalable data source for diffusion policies.
Selective noise-based inclusion of cross-embodiment data outperforms both full mixing and filtering approaches.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same noise-level selection could be applied to other mismatched data sources such as internet videos or simulation rollouts.
Testing on longer-horizon or contact-rich tasks would show how much noise is required to bridge larger embodiment gaps.
The approach might reduce the need for manual data curation when scaling to thousands of unfiltered human clips.

Load-bearing premise

Human actions can be viewed as noisy counterparts of robot actions such that as noise increases along the forward diffusion process, embodiment-specific differences fade away while task-relevant guidance is preserved.

What would settle it

Running the same five tasks with human data added only at low noise levels instead of high noise levels and finding no improvement over robot-only training.

Figures

Figures reproduced from arXiv: 2511.04671 by Atiksh Bhardwaj, Audrey Du, Chuanruo Ning, Edward W. Duan, Kushal Kedia, Maximus A. Pace, Prithwish Dan, Wei-Chiu Ma.

**Figure 1.** Overview of X-DIFFUSION: We introduce X-DIFFUSION, a framework to train diffusion policies on cross-embodiment human data containing a variety of execution styles. Naively co-training diffusion policies on human and robot datasets with mismatched dynamics can lead the denoising process to output dynamically infeasible actions for the robot, degrading performance below standard robot-only diffusion policy t…

**Figure 2.** Pipeline: X-DIFFUSION first unifies the state and action representation. State is represented by a colored segmentation mask of relevant objects using Grounded-SAM2 [37]. Action is represented via end-effector/human hand pose utilizing HaMeR [6] for retargeting. During the policy’s forward diffusion process, Gaussian noise is sampled and added to the clean actions. To determine if the policy should learn t…

**Figure 3.** Visualizing Actions under Noise and Classifier Predictions at various Diffusion Steps. Humans execute tasks in various ways. For example, when picking and placing a pan, a human can either execute a top-down grasp or a side grasp. Human actions that are feasible for robots (e.g. top-down grasp) overlap with robot action distribution under low noise timesteps. This data fools the classifier into believing i…

**Figure 4.** Performance vs. Baselines: We report task success rate on 5 different manipulation tasks and compare X-DIFFUSION against a robot-only baseline (Diffusion Policy) and various co-training baselines (Point-Policy, MotionTracks). DemoDiffusion is another diffusion-based method, but it doesn’t train the robot policy on human demonstrations. We find that X-DIFFUSION is the highest performing model on all tasks, …

**Figure 5.** Naive co-training learns infeasible robot actions: Including all human data in policy training can incentivize policies to learn strategies demonstrated by humans but infeasible for robots. On multiple tasks, a human may manipulate objects in ways that are not realizable for a robot. The policy input is the masked image with overlaid keypoints, concatenated with proprioceptive information. More details ar…

**Figure 6.** Classifier Robot Probability across forward diffusion process: As the noise levels increase, the human action distribution becomes more similar to the robot action distribution. The similarity of human actions with robot actions varies across tasks: as shown on the graphs, the distance between the human and robot action distributions at every noise level is smaller for Push Plate data compared to Bottle Up…
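The trend described in the Figure 6 caption can be reproduced in a toy setting. The sketch below is not the paper's learned classifier: it assumes one-dimensional robot and human actions drawn from Gaussians with equal spread and a fixed mean offset (all numbers hypothetical), and computes the accuracy of the best possible classifier after both are pushed through the same forward diffusion step.

```python
import math

def bayes_accuracy(alpha_bar_t: float, offset: float, clean_std: float) -> float:
    """Best achievable accuracy for telling noised human from noised robot actions.

    Toy model (an assumption): robot ~ N(0, clean_std^2), human ~ N(offset, clean_std^2),
    equal priors. After noising, the means are scaled by sqrt(alpha_bar_t) and the
    variance becomes alpha_bar_t * clean_std^2 + (1 - alpha_bar_t).
    """
    gap = math.sqrt(alpha_bar_t) * abs(offset)
    std = math.sqrt(alpha_bar_t * clean_std ** 2 + (1.0 - alpha_bar_t))
    z = gap / (2.0 * std)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF at z

# alpha_bar = 1.0 means no noise; alpha_bar = 0.0 means pure noise.
levels = [1.0, 0.5, 0.1, 0.0]
accuracies = [bayes_accuracy(a, offset=1.0, clean_std=0.2) for a in levels]
for a, acc in zip(levels, accuracies):
    print(f"alpha_bar={a:.1f}  accuracy={acc:.3f}")
```

In this model the accuracy falls monotonically toward chance (0.5) as noise increases, which matches the qualitative behaviour the caption describes. A larger `offset` would need more noise before the two distributions become indistinguishable.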

Original abstract

Human videos are a scalable source of training data for robot learning. However, humans and robots significantly differ in embodiment, making many human actions infeasible for direct execution on a robot. Still, these demonstrations convey rich object-interaction cues and task intent. Our goal is to learn from this coarse guidance without transferring embodiment-specific, infeasible execution strategies. Recent advances in generative modeling tackle a related problem of learning from low-quality data. In particular, Ambient Diffusion is a recent method for diffusion modeling that incorporates low-quality data only at high-noise timesteps of the forward diffusion process. Our key insight is to view human actions as noisy counterparts of robot actions. As noise increases along the forward diffusion process, embodiment-specific differences fade away while task-relevant guidance is preserved. Based on these observations, we present X-Diffusion, a cross-embodiment learning framework based on Ambient Diffusion that selectively trains diffusion policies on noised human actions. This enables effective use of easy-to-collect human videos without sacrificing robot feasibility. Across five real-world manipulation tasks, we show that X-Diffusion improves average success rates by 16% over naive co-training and manual data filtering. The project website is available at https://portal-cornell.github.io/X-Diffusion/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

X-Diffusion adapts Ambient Diffusion to cross-embodiment robot policies by training on human actions only at high noise levels, delivering a 16% average gain over simple baselines on five real tasks.

Hey, the main new piece is the direct mapping of Ambient Diffusion onto diffusion policies for robots. They treat human actions as noisy robot counterparts and restrict the human data to high-timestep training so that embodiment-specific kinematics drop out while object-interaction signals stay. This produces a clean framework that avoids the usual filtering headaches or direct co-training failures.

The real-robot results on five manipulation tasks are the strongest part: a 16% average success-rate lift over naive co-training and manual filtering, with hardware experiments rather than just simulation. That gives the work a practical edge for anyone trying to scale imitation learning with cheap human video.

The soft spot is the missing isolation of the mechanism. The central assumption, that Gaussian noise washes away embodiment differences while keeping task guidance, makes sense in principle, but the paper does not appear to run a clean ablation that fixes data volume and varies only the noise schedule or the selective high-timestep rule. Without that, the reported gain could partly reflect extra data diversity rather than the specific Ambient Diffusion trick. The baselines are reasonable and the math follows the standard diffusion setup with a straightforward modification, so nothing looks broken there.

This is aimed at the robot learning crowd working on diffusion policies and human-to-robot transfer. Readers who need concrete ways to use human videos will find the experiments and the selective-training idea useful. It has enough new application and grounded results to go to peer review rather than a desk reject, though the reviewers will likely ask for the extra ablation.

Referee Report

2 major / 2 minor

Summary. The paper proposes X-Diffusion, a framework adapting Ambient Diffusion to train diffusion policies on cross-embodiment human videos. Human actions are treated as noisy robot-action counterparts; training occurs selectively on noised human data only at high forward-diffusion timesteps so that embodiment-specific kinematics are suppressed while task-relevant object-interaction cues remain. On five real-world manipulation tasks the method reports a 16% average success-rate gain over naive co-training and manual filtering baselines.

Significance. If the empirical gains prove robust, the work offers a principled route to leverage abundant human video data for robot policy learning without transferring infeasible strategies. The explicit link between Ambient Diffusion’s high-noise regime and embodiment mismatch is a clean conceptual contribution that could generalize beyond the reported tasks.

major comments (2)

[Experiments] Experiments / Results: The abstract and main results claim a 16% average improvement, yet provide no per-task success rates with standard deviations, number of evaluation trials, statistical significance tests, or explicit baseline hyper-parameter settings. Without these, it is impossible to determine whether the reported delta is reliable or driven by a few outlier runs.
[Method] Method / §3.2 (Core Assumption): The central modeling choice—that Gaussian noise addition causes embodiment-specific kinematic differences to become indistinguishable from task structure—receives no isolating ablation. A control that injects human data at high noise levels but disables the Ambient Diffusion weighting (or uses uniform co-training at those timesteps) is required to show that the gain is not simply an artifact of increased data volume or diversity.

minor comments (2)

[Method] Notation: The forward-process noise schedule and the precise timestep threshold used for human-data inclusion should be stated explicitly (e.g., as a single equation or table entry) rather than left to the supplementary material.
[Experiments] Figures: Qualitative rollout visualizations would benefit from side-by-side comparison of failure modes under X-Diffusion versus the naive co-training baseline to illustrate the claimed reduction in infeasible actions.
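As an illustration of the explicit statement the notation comment asks for, the forward process and a gated loss could be written as below. This is a sketch under assumptions: it uses the standard DDPM parameterization, and the single threshold symbol \tau and the embodiment label e are hypothetical notation introduced here, not taken from the paper.

```latex
% Forward process (standard DDPM form; \bar{\alpha}_t is the cumulative noise schedule)
q(a_t \mid a_0) = \mathcal{N}\!\left(a_t;\ \sqrt{\bar{\alpha}_t}\, a_0,\ (1 - \bar{\alpha}_t)\, I\right),
\qquad \bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s)

% Gated denoising loss: robot samples are used at every t, human samples only for t >= \tau
\mathcal{L}(\theta) = \mathbb{E}_{(o,\, a_0,\, e),\ t,\ \epsilon}
\Big[ \mathbb{1}\big[\, e = \text{robot} \ \lor\ t \ge \tau \,\big]\,
\big\lVert \epsilon - \epsilon_\theta(a_t, t, o) \big\rVert^2 \Big]
```

Here o is the observation, a_0 the clean action, and a_t its noised version at timestep t.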

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to incorporate additional experimental details and ablations for improved rigor and reproducibility.

Referee: The abstract and main results claim a 16% average improvement, yet provide no per-task success rates with standard deviations, number of evaluation trials, statistical significance tests, or explicit baseline hyper-parameter settings. Without these, it is impossible to determine whether the reported delta is reliable or driven by a few outlier runs.

Authors: We agree that these details are essential for assessing the reliability of the results. In the revised manuscript, we will expand the results section with a table reporting per-task success rates (including means and standard deviations) across 10 independent evaluation trials per task and method. We will also report the exact hyperparameter settings for all baselines (naive co-training and manual filtering) and include statistical significance tests such as paired t-tests with p-values to support the 16% average improvement.

revision: yes

Referee: The central modeling choice—that Gaussian noise addition causes embodiment-specific kinematic differences to become indistinguishable from task structure—receives no isolating ablation. A control that injects human data at high noise levels but disables the Ambient Diffusion weighting (or uses uniform co-training at those timesteps) is required to show that the gain is not simply an artifact of increased data volume or diversity.

Authors: We appreciate this suggestion to better isolate the contribution of our core modeling assumption. In the revised version, we will add a new ablation study comparing X-Diffusion to a control variant that performs uniform co-training with human data at high noise timesteps but without the Ambient Diffusion selective weighting. This will help demonstrate that the observed gains arise from the principled high-noise selective training rather than from increased data volume alone.

revision: yes
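To make the reliability concern concrete, the sketch below computes a Wilson score interval for a success rate measured over 10 trials, the trial count mentioned in the simulated rebuttal. The success counts are hypothetical and are not results from the paper, and the Wilson interval is a choice made here for illustration rather than the paired t-test the rebuttal names.

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial success rate (z = 1.96 gives roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / trials + z * z / (4.0 * trials * trials)) / denom
    return center - half, center + half

# Hypothetical counts, for illustration only.
low_7, high_7 = wilson_interval(successes=7, trials=10)
low_5, high_5 = wilson_interval(successes=5, trials=10)
print(f"7/10 successes: [{low_7:.2f}, {high_7:.2f}]")
print(f"5/10 successes: [{low_5:.2f}, {high_5:.2f}]")
```

With only 10 trials the two intervals overlap heavily, which is why per-task trial counts and variability matter when judging a gain of this size.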

Circularity Check

0 steps flagged

No significant circularity; empirical gains measured against explicit baselines

The paper's central contribution is an empirical framework that applies the external Ambient Diffusion technique to cross-embodiment data by treating human actions as noisy robot counterparts at high timesteps. The reported 16% average success-rate improvement is measured directly against two explicit baselines (naive co-training and manual data filtering) across five real-world tasks. No equations, fitted parameters, or self-citations are shown to reduce the performance delta or the core assumption to quantities defined by the method itself; the derivation chain remains self-contained and externally falsifiable via the controlled experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on the domain assumption that human actions function as noisy robot actions at high diffusion noise levels; no free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption: Human actions can be viewed as noisy counterparts of robot actions such that embodiment-specific differences fade at high noise while task-relevant guidance is preserved.
This premise is stated as the key insight enabling selective training on human data.

pith-pipeline@v0.9.0 · 5791 in / 1206 out tokens · 37564 ms · 2026-05-18T00:33:25.478340+00:00 · methodology

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 3 internal anchors

[1]

Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

C. Chi, S. Feng, Y. Du, Z. Xu, E. Cousineau, B. Burchfiel, and S. Song, “Diffusion policy: Visuomotor policy learning via action diffusion,” arXiv preprint arXiv:2303.04137, 2023

[2]

Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

T. Zhao, V. Kumar, S. Levine, and C. Finn, “Learning fine-grained bimanual manipulation with low-cost hardware,” ArXiv, vol. abs/2304.13705, 2023

[3]

Motion tracks: A unified representation for human-robot transfer in few-shot imitation learning

J. Ren, P. Sundaresan, D. Sadigh, S. Choudhury, and J. Bohg, “Motion tracks: A unified representation for human-robot transfer in few-shot imitation learning,” ArXiv, vol. abs/2501.06994, 2025

[4]

Point policy: Unifying observations and actions with key points for robot manipulation

S. Haldar and L. Pinto, “Point policy: Unifying observations and actions with key points for robot manipulation,” ArXiv, vol. abs/2502.20391, 2025

[5]

Phantom: Training robots without robots using only human videos

M. Lepert, J. Fang, and J. Bohg, “Phantom: Training robots without robots using only human videos,” ArXiv, vol. abs/2503.00779, 2025

[6]

Reconstructing hands in 3d with transformers

G. Pavlakos, D. Shan, I. Radosavovic, A. Kanazawa, D. F. Fouhey, and J. Malik, “Reconstructing hands in 3d with transformers,” 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9826–9836, 2023

[7]

Dexwild: Dexterous human interactions for in-the-wild robot policies

T. Tao, M. K. Srirama, J. J. Liu, K. Shaw, and D. Pathak, “Dexwild: Dexterous human interactions for in-the-wild robot policies,” ArXiv, vol. abs/2505.07813, 2025

[8]

Egozero: Robot learning from smart glasses

V. Liu, A. Adeniji, H. Zhan, R. M. Bhirangi, P. Abbeel, and L. Pinto, “Egozero: Robot learning from smart glasses,” ArXiv, vol. abs/2505.20290, 2025

[9]

Masquerade: Learning from in-the-wild human videos using data-editing

M. Lepert, J. Fang, and J. Bohg, “Masquerade: Learning from in-the-wild human videos using data-editing,” ArXiv, vol. abs/2508.09976, 2025

[10]

Zeromimic: Distilling robotic manipulation skills from web videos

J. Shi, Z. Zhao, T. Wang, I. Pedroza, A. Luo, J. Wang, J. Ma, and D. Jayaraman, “Zeromimic: Distilling robotic manipulation skills from web videos,” 2025 IEEE International Conference on Robotics and Automation (ICRA), pp. 16 939–16 947, 2025

[11]

Zero-shot robot manipulation from passive human videos

H. Bharadhwaj, A. Gupta, S. Tulsiani, and V. Kumar, “Zero-shot robot manipulation from passive human videos,” vol. abs/2302.02011, 2023

[12]

Track2act: Predicting point tracks from internet videos enables generalizable robot manipulation

H. Bharadhwaj, R. Mottaghi, A. Gupta, and S. Tulsiani, “Track2act: Predicting point tracks from internet videos enables diverse zero-shot robot manipulation,” arXiv preprint arXiv:2405.01527, 2024

[13]

Mimicplay: Long-horizon imitation learning by watching human play

C. Wang, L. Fan, J. Sun, R. Zhang, L. Fei-Fei, D. Xu, Y. Zhu, and A. Anandkumar, “Mimicplay: Long-horizon imitation learning by watching human play,” 2023

[14]

Robotic telekinesis: Learning a robotic hand imitator by watching humans on youtube

A. Sivakumar, K. Shaw, and D. Pathak, “Robotic telekinesis: Learning a robotic hand imitator by watching humans on youtube,” 2022

[15]

Videodex: Learning dexterity from internet videos

K. Shaw, S. Bahl, and D. Pathak, “Videodex: Learning dexterity from internet videos,” in Conference on Robot Learning, 2022

[16]

Dexterous imitation made easy: A learning-based framework for efficient dexterous manipulation

S. P. Arunachalam, S. Silwal, B. Evans, and L. Pinto, “Dexterous imitation made easy: A learning-based framework for efficient dexterous manipulation,” 2022

[17]

Learning continuous grasping function with a dexterous hand from human demonstrations

J. Ye, J. Wang, B. Huang, Y. Qin, and X. Wang, “Learning continuous grasping function with a dexterous hand from human demonstrations,” vol. 8, 2022, pp. 2882–2889

[18]

Shadow: Leveraging segmentation masks for cross-embodiment policy transfer

M. Lepert, R. Doshi, and J. Bohg, “Shadow: Leveraging segmentation masks for cross-embodiment policy transfer,” ArXiv, vol. abs/2503.00774, 2025

[19]

Human-to-robot imitation in the wild

S. Bahl, A. Gupta, and D. Pathak, “Human-to-robot imitation in the wild,” 2022

[20]

Vision-based manipulation from single human video with open-world object graphs

Y. Zhu, A. Lim, P. Stone, and Y. Zhu, “Vision-based manipulation from single human video with open-world object graphs,” arXiv preprint arXiv:2405.20321, 2024

[21]

One-shot imitation learning: A pose estimation perspective

P. Vitiello, K. Dreczkowski, and E. Johns, “One-shot imitation learning: A pose estimation perspective,” in Conference on Robot Learning, 2023

[22]

Okami: Teaching humanoid robots manipulation skills through single video imitation

J. Li, Y. Zhu, Y. Xie, Z. Jiang, M. Seo, G. Pavlakos, and Y. Zhu, “Okami: Teaching humanoid robots manipulation skills through single video imitation,” ArXiv, vol. abs/2410.11792, 2024

[23]

Deepmimic: Example-guided deep reinforcement learning of physics-based character skills

X. B. Peng, P. Abbeel, S. Levine, and M. van de Panne, “Deepmimic: Example-guided deep reinforcement learning of physics-based character skills,” ACM Trans. Graph., vol. 37, no. 4, pp. 143:1–143:14, Jul. 2018

[24]

Hermes: Human-to-robot embodied learning from multi-source motion data for mobile dexterous manipulation

Z. Yuan, T. Wei, L. Gu, P. Hua, T. Liang, Y. Chen, and H. Xu, “Hermes: Human-to-robot embodied learning from multi-source motion data for mobile dexterous manipulation,” 2025

[25]

Xirl: Cross-embodiment inverse reinforcement learning

K. Zakka, A. Zeng, P. R. Florence, J. Tompson, J. Bohg, and D. Dwibedi, “Xirl: Cross-embodiment inverse reinforcement learning,” in Conference on Robot Learning, 2021

[26]

Rank2reward: Learning shaped reward functions from passive video

D. Yang, D. Tjia, J. Berg, D. Damen, P. Agrawal, and A. Gupta, “Rank2reward: Learning shaped reward functions from passive video,” in 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 2806–2813

[27]

Imitation learning from a single temporally misaligned video

W. Huey, H. Wang, A. Wu, Y. Artzi, and S. Choudhury, “Imitation learning from a single temporally misaligned video,” ArXiv, vol. abs/2502.05397, 2025

[28]

Concept2robot: Learning manipulation concepts from instructions and human demonstrations

L. Shao, T. Migimatsu, Q. Zhang, K. Yang, and J. Bohg, “Concept2robot: Learning manipulation concepts from instructions and human demonstrations,” vol. 40, 2020, pp. 1419–1434

[29]

Learning generalizable robotic reward functions from "in-the-wild" human videos

A. S. Chen, S. Nair, and C. Finn, “Learning generalizable robotic reward functions from "in-the-wild" human videos,” vol. abs/2103.16817, 2021

[30]

X-sim: Cross-embodiment learning via real-to-sim-to-real

P. Dan, K. Kedia, A. Chao, E. W. Duan, M. A. Pace, W.-C. Ma, and S. Choudhury, “X-sim: Cross-embodiment learning via real-to-sim-to-real,” 2025

[31]

Crossing the human-robot embodiment gap with sim-to-real rl using one human demonstration

T. Ga, W. Lum, O. Y. Lee, C. K. Liu, J. Bohg, and P.-M. H. Pose, “Crossing the human-robot embodiment gap with sim-to-real rl using one human demonstration,” 2025

[32]

Flow as the cross-domain manipulation interface

M. Xu, Z. Xu, Y. Xu, C. Chi, G. Wetzstein, M. Veloso, and S. Song, “Flow as the cross-domain manipulation interface,” in Conference on Robot Learning, 2024

[33]

Combining self-supervised learning and imitation for vision-based rope manipulation

A. Nair, D. Chen, P. Agrawal, P. Isola, P. Abbeel, J. Malik, and S. Levine, “Combining self-supervised learning and imitation for vision-based rope manipulation,” 2017, pp. 2146–2153

[34]

Graph-structured visual imitation

M. Sieb, X. Zhou, A. Huang, O. Kroemer, and K. Fragkiadaki, “Graph-structured visual imitation,” in Conference on Robot Learning, 2019

[35]

Learning predictive models from observation and interaction

K. Schmeckpeper, A. Xie, O. Rybkin, S. Tian, K. Daniilidis, S. Levine, and C. Finn, “Learning predictive models from observation and interaction,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XX. Springer, 2020, pp. 708–725

[36]

Graph inverse reinforcement learning from diverse videos

S. Kumar, J. Zamora, N. Hansen, R. Jangir, and X. Wang, “Graph inverse reinforcement learning from diverse videos,” 2022

[37]

Sam 2: Segment anything in images and videos

N. Ravi, V. Gabeur, Y.-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. Rädle, C. Rolland, L. Gustafson, E. Mintun, J. Pan, K. V. Alwala, N. Carion, C.-Y. Wu, R. Girshick, P. Dollár, and C. Feichtenhofer, “Sam 2: Segment anything in images and videos,” 2024

[38]

Bc-z: Zero-shot task generalization with robotic imitation learning

E. Jang, A. Irpan, M. Khansari, D. Kappler, F. Ebert, C. Lynch, S. Levine, and C. Finn, “Bc-z: Zero-shot task generalization with robotic imitation learning,” vol. abs/2202.02005, 2022

[39]

Vid2robot: End-to-end video-conditioned policy learning with cross-attention transformers

V. Jain, M. Attarian, N. Joshi, A. Wahid, D. Driess, Q. Vuong, P. R. Sanketi, P. Sermanet, S. Welker, C. Chan, I. Gilitschenski, Y. Bisk, and D. Dwibedi, “Vid2robot: End-to-end video-conditioned policy learning with cross-attention transformers,” vol. abs/2403.12943, 2024

[40]

One-shot imitation under mismatched execution

K. Kedia, P. Dan, A. Chao, M. A. Pace, and S. Choudhury, “One-shot imitation under mismatched execution,” arXiv preprint arXiv:2409.06615, 2024

[41]

XSkill: Cross embodiment skill discovery

M. Xu, Z. Xu, C. Chi, M. Veloso, and S. Song, “XSkill: Cross embodiment skill discovery,” in 7th Annual Conference on Robot Learning, 2023

[42]

Mimicdroid: In-context learning for humanoid manipulation from human play videos

R. Shah, S. Liu, Q. Wang, Z. Jiang, S. Kumar, M. Seo, R. Martín-Martín, and Y. Zhu, “Mimicdroid: In-context learning for humanoid manipulation from human play videos,” arXiv preprint arXiv:2509.09769, 2025

[43]

Instant policy: In-context imitation learning via graph diffusion

V. Vosylius and E. Johns, “Instant policy: In-context imitation learning via graph diffusion,” 2025

[44]

Demodiffusion: One-shot human imitation using pre-trained diffusion policy

S. Park, H. Bharadhwaj, and S. Tulsiani, “Demodiffusion: One-shot human imitation using pre-trained diffusion policy,” 2025

[45]

Curating demonstrations using online experience

A. S. Chen, A. M. Lessing, Y. Liu, and C. Finn, “Curating demonstrations using online experience,” 2025

[46]

Cupid: Curating data your robot loves with influence functions

C. Agia, R. Sinha, J. Yang, R. Antonova, M. Pavone, H. Nishimura, M. Itkina, and J. Bohg, “Cupid: Curating data your robot loves with influence functions,” 2025

[47]

Re-mix: Optimizing data mixtures for large scale imitation learning

J. Hejna, C. Bhateja, Y. Jiang, K. Pertsch, and D. Sadigh, “Re-mix: Optimizing data mixtures for large scale imitation learning,” 2024

[48]

Emu: Enhancing image generation models using photogenic needles in a haystack

X. Dai, J. Hou, C.-Y. Ma, S. Tsai, J. Wang, R. Wang, P. Zhang, S. Vandenhende, X. Wang, A. Dubey, M. Yu, A. Kadian, F. Radenovic, D. Mahajan, K. Li, Y. Zhao, V. Petrovic, M. K. Singh, S. Motwani, Y. Wen, Y. Song, R. Sumbaly, V. Ramanathan, Z. He, P. Vajda, and D. Parikh, “Emu: Enhancing image generation models using photogenic needles in a haystack,” 2023

[49]

Laion-5b: An open large-scale dataset for training next generation image-text models

C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman, P. Schramowski, S. Kundurthy, K. Crowson, L. Schmidt, R. Kaczmarczyk, and J. Jitsev, “Laion-5b: An open large-scale dataset for training next generation image-text models,” 2022

[50]

Ambient diffusion omni: Training good models with bad data

G. Daras, A. Rodriguez-Munoz, A. Klivans, A. Torralba, and C. Daskalakis, “Ambient diffusion omni: Training good models with bad data,” 2025

[51]

Grounded sam: Assembling open-world models for diverse visual tasks

T. Ren, S. Liu, A. Zeng, J. Lin, K. Li, H. Cao, J. Chen, X. Huang, Y. Chen, F. Yan et al., “Grounded sam: Assembling open-world models for diverse visual tasks,” arXiv, 2024

[52]

Emergent correspondence from image diffusion

L. Tang, M. Jia, Q. Wang, C. P. Phoo, and B. Hariharan, “Emergent correspondence from image diffusion,” Advances in Neural Information Processing Systems, vol. 36, pp. 1363–1389, 2023

[53]

Cotracker: It is better to track together

N. Karaev, I. Rocco, B. Graham, N. Neverova, A. Vedaldi, and C. Rupprecht, “Cotracker: It is better to track together,” in European conference on computer vision. Springer, 2024, pp. 18–35.

APPENDIX A. Contributions

• Maximus A. Pace: Investigated different algorithms for using human data in policy learning, set up the data collection pipeline using teleoper...

[54]

Robot Demonstrations: The robot’s proprioception q_t is computed using forward kinematics given its joint angles and gripper status (e.g. open or closed) at timestep t. Visual observations o_t are obtained by applying Grounded-SAM 2 [51] with language prompts on a single-view RGB capture of the scene and overlaying end-effector keypoint renderings

[55]

Human Demonstrations: We use HaMeR [6] to detect a set of 21 keypoints in 2D pixel space for each camera. We select 5 of these keypoints along the index finger and thumb to be retargeted into a parallel jaw. Using two cameras with known parameters, we triangulate these keypoints into the same 3D coordinate frame as the robot to obtain p_t and apply the Kabs...

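The two-camera triangulation step described for human demonstrations can be sketched with standard linear (DLT) triangulation. This is a generic textbook method shown under assumptions: the source does not say which triangulation routine the authors use, and the camera matrices and the test point below are made up for illustration.

```python
import numpy as np

def triangulate(P1: np.ndarray, P2: np.ndarray, uv1: np.ndarray, uv2: np.ndarray) -> np.ndarray:
    """Linear (DLT) triangulation of one 3D point from two 3x4 projection matrices."""
    A = np.stack([
        uv1[0] * P1[2] - P1[0],
        uv1[1] * P1[2] - P1[1],
        uv2[0] * P2[2] - P2[0],
        uv2[1] * P2[2] - P2[1],
    ])
    # The homogeneous solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

def project(P: np.ndarray, X: np.ndarray) -> np.ndarray:
    """Project a 3D point to pixel coordinates with a 3x4 projection matrix."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Hypothetical calibration: shared intrinsics, second camera offset 0.2 units along +x.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.2], [0.0], [0.0]])])

point = np.array([0.1, -0.05, 1.5])  # made-up keypoint in the first camera's frame
recovered = triangulate(P1, P2, project(P1, point), project(P2, point))
print(recovered)
```

With noiseless pixel coordinates the recovered point matches the original up to numerical precision. Real keypoint detections are noisy, so a practical pipeline would typically triangulate every keypoint per frame and may add filtering.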
[56]

Diffusion Policy: This baseline uses the vanilla Diffusion Policy architecture trained only on a small set of robot demonstrations

[57]

Point Policy: Instead of using segmented images in its visual observation o_t, this baseline represents state via 3D keypoints of relevant objects at each timestep t. The keypoints are annotated in the first frame of one training demonstration, and correspondences are automatically detected at the start of all other demonstrations and at inference time usi...

[58]

Motion Tracks: This baseline consumes the raw RGB image (without segmentations) and end-effector proprioception as input. The original paper for MOTIONTRACKS uses a keypoint retargeting network to minimize any gap between hand and end-effector keypoints, which we alleviate in our implementation by unifying the proprioception directly into end-effector pos...

[59]

DemoDiffusion: This baseline leverages two Diffusion Policies: human policy π_H is trained on the full human dataset D_H, and robot policy π_R is trained on the full robot dataset D_R. The reverse diffusion process is completed by using the human policy π_H for the initial denoising steps, followed by the robot policy π_R for the remainder of the denoising steps...


[1] C. Chi, S. Feng, Y. Du, Z. Xu, E. Cousineau, B. Burchfiel, and S. Song, “Diffusion policy: Visuomotor policy learning via action diffusion,” arXiv preprint arXiv:2303.04137, 2023.

[2] T. Zhao, V. Kumar, S. Levine, and C. Finn, “Learning fine-grained bimanual manipulation with low-cost hardware,” ArXiv, vol. abs/2304.13705, 2023.

[3] J. Ren, P. Sundaresan, D. Sadigh, S. Choudhury, and J. Bohg, “Motion tracks: A unified representation for human-robot transfer in few-shot imitation learning,” ArXiv, vol. abs/2501.06994, 2025.

[4] S. Haldar and L. Pinto, “Point policy: Unifying observations and actions with key points for robot manipulation,” ArXiv, vol. abs/2502.20391, 2025.

[5] M. Lepert, J. Fang, and J. Bohg, “Phantom: Training robots without robots using only human videos,” ArXiv, vol. abs/2503.00779, 2025.

[6] G. Pavlakos, D. Shan, I. Radosavovic, A. Kanazawa, D. F. Fouhey, and J. Malik, “Reconstructing hands in 3d with transformers,” 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9826–9836, 2023.

[7] T. Tao, M. K. Srirama, J. J. Liu, K. Shaw, and D. Pathak, “Dexwild: Dexterous human interactions for in-the-wild robot policies,” ArXiv, vol. abs/2505.07813, 2025.

[8] V. Liu, A. Adeniji, H. Zhan, R. M. Bhirangi, P. Abbeel, and L. Pinto, “Egozero: Robot learning from smart glasses,” ArXiv, vol. abs/2505.20290, 2025.

[9] M. Lepert, J. Fang, and J. Bohg, “Masquerade: Learning from in-the-wild human videos using data-editing,” ArXiv, vol. abs/2508.09976, 2025.

[10] J. Shi, Z. Zhao, T. Wang, I. Pedroza, A. Luo, J. Wang, J. Ma, and D. Jayaraman, “Zeromimic: Distilling robotic manipulation skills from web videos,” 2025 IEEE International Conference on Robotics and Automation (ICRA), pp. 16 939–16 947, 2025.

[11] H. Bharadhwaj, A. Gupta, S. Tulsiani, and V. Kumar, “Zero-shot robot manipulation from passive human videos,” ArXiv, vol. abs/2302.02011, 2023.

[12] H. Bharadhwaj, R. Mottaghi, A. Gupta, and S. Tulsiani, “Track2act: Predicting point tracks from internet videos enables diverse zero-shot robot manipulation,” arXiv preprint arXiv:2405.01527, 2024.

[13] C. Wang, L. Fan, J. Sun, R. Zhang, L. Fei-Fei, D. Xu, Y. Zhu, and A. Anandkumar, “Mimicplay: Long-horizon imitation learning by watching human play,” 2023.

[14] A. Sivakumar, K. Shaw, and D. Pathak, “Robotic telekinesis: Learning a robotic hand imitator by watching humans on youtube,” 2022.

[15] K. Shaw, S. Bahl, and D. Pathak, “Videodex: Learning dexterity from internet videos,” in Conference on Robot Learning, 2022.

[16] S. P. Arunachalam, S. Silwal, B. Evans, and L. Pinto, “Dexterous imitation made easy: A learning-based framework for efficient dexterous manipulation,” 2022.

[17] J. Ye, J. Wang, B. Huang, Y. Qin, and X. Wang, “Learning continuous grasping function with a dexterous hand from human demonstrations,” vol. 8, 2022, pp. 2882–2889.

[18] M. Lepert, R. Doshi, and J. Bohg, “Shadow: Leveraging segmentation masks for cross-embodiment policy transfer,” ArXiv, vol. abs/2503.00774, 2025.

[19] S. Bahl, A. Gupta, and D. Pathak, “Human-to-robot imitation in the wild,” 2022.

[20] Y. Zhu, A. Lim, P. Stone, and Y. Zhu, “Vision-based manipulation from single human video with open-world object graphs,” arXiv preprint arXiv:2405.20321, 2024.

[21] P. Vitiello, K. Dreczkowski, and E. Johns, “One-shot imitation learning: A pose estimation perspective,” in Conference on Robot Learning, 2023.

[22] J. Li, Y. Zhu, Y. Xie, Z. Jiang, M. Seo, G. Pavlakos, and Y. Zhu, “Okami: Teaching humanoid robots manipulation skills through single video imitation,” ArXiv, vol. abs/2410.11792, 2024.

[23] X. B. Peng, P. Abbeel, S. Levine, and M. van de Panne, “Deepmimic: Example-guided deep reinforcement learning of physics-based character skills,” ACM Trans. Graph., vol. 37, no. 4, pp. 143:1–143:14, Jul. 2018.

[24] Z. Yuan, T. Wei, L. Gu, P. Hua, T. Liang, Y. Chen, and H. Xu, “Hermes: Human-to-robot embodied learning from multi-source motion data for mobile dexterous manipulation,” 2025.

[25] K. Zakka, A. Zeng, P. R. Florence, J. Tompson, J. Bohg, and D. Dwibedi, “Xirl: Cross-embodiment inverse reinforcement learning,” in Conference on Robot Learning, 2021.

[26] D. Yang, D. Tjia, J. Berg, D. Damen, P. Agrawal, and A. Gupta, “Rank2reward: Learning shaped reward functions from passive video,” in 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 2806–2813.

[27] W. Huey, H. Wang, A. Wu, Y. Artzi, and S. Choudhury, “Imitation learning from a single temporally misaligned video,” ArXiv, vol. abs/2502.05397, 2025.

[28] L. Shao, T. Migimatsu, Q. Zhang, K. Yang, and J. Bohg, “Concept2robot: Learning manipulation concepts from instructions and human demonstrations,” vol. 40, 2020, pp. 1419–1434.

[29] A. S. Chen, S. Nair, and C. Finn, “Learning generalizable robotic reward functions from ‘in-the-wild’ human videos,” ArXiv, vol. abs/2103.16817, 2021.

[30] P. Dan, K. Kedia, A. Chao, E. W. Duan, M. A. Pace, W.-C. Ma, and S. Choudhury, “X-sim: Cross-embodiment learning via real-to-sim-to-real,” 2025.

[31] T. Ga, W. Lum, O. Y. Lee, C. K. Liu, J. Bohg, and P.-M. H. Pose, “Crossing the human-robot embodiment gap with sim-to-real rl using one human demonstration,” 2025.

[32] M. Xu, Z. Xu, Y. Xu, C. Chi, G. Wetzstein, M. Veloso, and S. Song, “Flow as the cross-domain manipulation interface,” in Conference on Robot Learning, 2024.

[33] A. Nair, D. Chen, P. Agrawal, P. Isola, P. Abbeel, J. Malik, and S. Levine, “Combining self-supervised learning and imitation for vision-based rope manipulation,” 2017, pp. 2146–2153.

[34] M. Sieb, X. Zhou, A. Huang, O. Kroemer, and K. Fragkiadaki, “Graph-structured visual imitation,” in Conference on Robot Learning, 2019.

[35] K. Schmeckpeper, A. Xie, O. Rybkin, S. Tian, K. Daniilidis, S. Levine, and C. Finn, “Learning predictive models from observation and interaction,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XX. Springer, 2020, pp. 708–725.

[36] S. Kumar, J. Zamora, N. Hansen, R. Jangir, and X. Wang, “Graph inverse reinforcement learning from diverse videos,” 2022.

[37] N. Ravi, V. Gabeur, Y.-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. Rädle, C. Rolland, L. Gustafson, E. Mintun, J. Pan, K. V. Alwala, N. Carion, C.-Y. Wu, R. Girshick, P. Dollár, and C. Feichtenhofer, “Sam 2: Segment anything in images and videos,” 2024.

[38] E. Jang, A. Irpan, M. Khansari, D. Kappler, F. Ebert, C. Lynch, S. Levine, and C. Finn, “Bc-z: Zero-shot task generalization with robotic imitation learning,” ArXiv, vol. abs/2202.02005, 2022.

[39] V. Jain, M. Attarian, N. Joshi, A. Wahid, D. Driess, Q. Vuong, P. R. Sanketi, P. Sermanet, S. Welker, C. Chan, I. Gilitschenski, Y. Bisk, and D. Dwibedi, “Vid2robot: End-to-end video-conditioned policy learning with cross-attention transformers,” ArXiv, vol. abs/2403.12943, 2024.

[40] K. Kedia, P. Dan, A. Chao, M. A. Pace, and S. Choudhury, “One-shot imitation under mismatched execution,” arXiv preprint arXiv:2409.06615, 2024.

[41] M. Xu, Z. Xu, C. Chi, M. Veloso, and S. Song, “XSkill: Cross embodiment skill discovery,” in 7th Annual Conference on Robot Learning, 2023.

[42] R. Shah, S. Liu, Q. Wang, Z. Jiang, S. Kumar, M. Seo, R. Martín-Martín, and Y. Zhu, “Mimicdroid: In-context learning for humanoid manipulation from human play videos,” arXiv preprint arXiv:2509.09769, 2025.

[43] V. Vosylius and E. Johns, “Instant policy: In-context imitation learning via graph diffusion,” 2025.

[44] S. Park, H. Bharadhwaj, and S. Tulsiani, “Demodiffusion: One-shot human imitation using pre-trained diffusion policy,” 2025.

[45] A. S. Chen, A. M. Lessing, Y. Liu, and C. Finn, “Curating demonstrations using online experience,” 2025.

[46] C. Agia, R. Sinha, J. Yang, R. Antonova, M. Pavone, H. Nishimura, M. Itkina, and J. Bohg, “Cupid: Curating data your robot loves with influence functions,” 2025.

[47] J. Hejna, C. Bhateja, Y. Jiang, K. Pertsch, and D. Sadigh, “Re-mix: Optimizing data mixtures for large scale imitation learning,” 2024.

[48] X. Dai, J. Hou, C.-Y. Ma, S. Tsai, J. Wang, R. Wang, P. Zhang, S. Vandenhende, X. Wang, A. Dubey, M. Yu, A. Kadian, F. Radenovic, D. Mahajan, K. Li, Y. Zhao, V. Petrovic, M. K. Singh, S. Motwani, Y. Wen, Y. Song, R. Sumbaly, V. Ramanathan, Z. He, P. Vajda, and D. Parikh, “Emu: Enhancing image generation models using photogenic needles in a haystack,” 2023.

[49] C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman, P. Schramowski, S. Kundurthy, K. Crowson, L. Schmidt, R. Kaczmarczyk, and J. Jitsev, “Laion-5b: An open large-scale dataset for training next generation image-text models,” 2022.

[50] G. Daras, A. Rodriguez-Munoz, A. Klivans, A. Torralba, and C. Daskalakis, “Ambient diffusion omni: Training good models with bad data,” 2025.

[51] T. Ren, S. Liu, A. Zeng, J. Lin, K. Li, H. Cao, J. Chen, X. Huang, Y. Chen, F. Yan et al., “Grounded sam: Assembling open-world models for diverse visual tasks,” arXiv, 2024.

[52] L. Tang, M. Jia, Q. Wang, C. P. Phoo, and B. Hariharan, “Emergent correspondence from image diffusion,” Advances in Neural Information Processing Systems, vol. 36, pp. 1363–1389, 2023.

[53] N. Karaev, I. Rocco, B. Graham, N. Neverova, A. Vedaldi, and C. Rupprecht, “Cotracker: It is better to track together,” in European Conference on Computer Vision. Springer, 2024, pp. 18–35.

APPENDIX

A. Contributions

• Maximus A. Pace: Investigated different algorithms for using human data in policy learning, set up the data collection pipeline using teleoper...

Robot Demonstrations: The robot’s proprioception q_t is computed using forward kinematics given its joint angles and gripper status (e.g., open or closed) at timestep t. Visual observations o_t are obtained by applying Grounded-SAM 2 [51] with language prompts on a single-view RGB capture of the scene and overlaying end-effector keypoint renderings.

Human Demonstrations: We use HaMeR [6] to detect a set of 21 keypoints in 2D pixel space for each camera. We select 5 of these keypoints along the index finger and thumb to be retargeted into a parallel jaw. Using two cameras with known parameters, we triangulate these keypoints into the same 3D coordinate frame as the robot to obtain p_t and apply the Kabs...
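The two-camera triangulation step described above can be illustrated with a minimal sketch. It assumes each camera is summarized by a known 3x4 projection matrix expressed in the robot's coordinate frame and solves the standard linear two-view system by least squares. The function names (`triangulate`, `project`, `solve_linear`) and the example camera matrices are hypothetical and are not taken from the paper's code; the alignment step applied after triangulation is not covered here.

```python
def solve_linear(M, y):
    """Solve the square system M x = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    aug = [list(row) + [y[i]] for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def project(P, X):
    """Project a 3D point X to pixel coordinates with a 3x4 projection matrix P."""
    Xh = list(X) + [1.0]
    h = [sum(P[i][j] * Xh[j] for j in range(4)) for i in range(3)]
    return (h[0] / h[2], h[1] / h[2])


def triangulate(P1, P2, uv1, uv2):
    """Recover a 3D point from its pixel coordinates in two calibrated views."""
    rows = []
    for P, (u, v) in ((P1, uv1), (P2, uv2)):
        # Each view contributes two linear constraints on the homogeneous point.
        rows.append([u * P[2][j] - P[0][j] for j in range(4)])
        rows.append([v * P[2][j] - P[1][j] for j in range(4)])
    # Rewrite rows . [x, y, z, 1] = 0 as A x = b and solve the normal equations.
    A = [r[:3] for r in rows]
    b = [-r[3] for r in rows]
    AtA = [[sum(A[k][i] * A[k][j] for k in range(4)) for j in range(3)] for i in range(3)]
    Atb = [sum(A[k][i] * b[k] for k in range(4)) for i in range(3)]
    return solve_linear(AtA, Atb)


# Hypothetical stereo pair: identical intrinsics, second camera shifted 0.2 m along x.
P1 = [[500.0, 0.0, 320.0, 0.0], [0.0, 500.0, 240.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
P2 = [[500.0, 0.0, 320.0, -100.0], [0.0, 500.0, 240.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
keypoint = [0.1, -0.05, 1.0]
estimate = triangulate(P1, P2, project(P1, keypoint), project(P2, keypoint))
```

In this noise-free example the estimate matches the original keypoint up to floating-point error. With real detections the two views disagree slightly, and the least-squares solution acts as a compromise between them.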

Diffusion Policy: This baseline uses the vanilla Diffusion Policy architecture trained only on a small set of robot demonstrations.

Point Policy: Instead of using segmented images in its visual observation o_t, this baseline represents state via 3D keypoints of relevant objects at each timestep t. The keypoints are annotated in the first frame of one training demonstration, and correspondences are automatically detected at the start of all other demonstrations and at inference time usi...

Motion Tracks: This baseline consumes the raw RGB image (without segmentations) and end-effector proprioception as input. The original paper for MOTIONTRACKS uses a keypoint retargeting network to minimize any gap between hand and end-effector keypoints, which we alleviate in our implementation by unifying the proprioception directly into end-effector pos...

DemoDiffusion: This baseline leverages two Diffusion Policies: human policy π_H is trained on the full human dataset D_H, and robot policy π_R is trained on the full robot dataset D_R. The reverse diffusion process is completed by using the human policy π_H for the initial denoising steps, followed by the robot policy π_R for the remainder of the denoising steps...
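The two-policy sampling schedule of the DemoDiffusion baseline can be pictured with a toy sketch. It is only an illustration of the switching logic: the step functions below are hypothetical stand-ins that pull a sample toward a fixed target, not trained Diffusion Policies, and the names `two_stage_denoise`, `make_toy_step`, and `switch_step` are assumptions introduced here.

```python
import random


def make_toy_step(target, rate=0.5):
    """Return a stand-in denoising step that moves the sample toward `target`."""
    def step(action, k):
        return [a + rate * (t - a) for a, t in zip(action, target)]
    return step


def two_stage_denoise(human_step, robot_step, num_steps, switch_step, action_dim, seed=0):
    """Run reverse diffusion with the human policy first, then the robot policy.

    Steps 0 .. switch_step - 1 (the noisiest ones) use `human_step`;
    the remaining steps use `robot_step`.
    """
    rng = random.Random(seed)
    action = [rng.gauss(0.0, 1.0) for _ in range(action_dim)]  # start from pure noise
    schedule = []
    for k in range(num_steps):
        use_human = k < switch_step
        schedule.append("human" if use_human else "robot")
        action = (human_step if use_human else robot_step)(action, k)
    return action, schedule


human_step = make_toy_step([1.0, 0.0])   # hypothetical human-policy mode
robot_step = make_toy_step([0.0, 0.5])   # hypothetical robot-policy mode
action, schedule = two_stage_denoise(
    human_step, robot_step, num_steps=10, switch_step=3, action_dim=2
)
```

In this toy setting the early human steps only set a coarse starting point, and the final action ends up near the robot policy's mode because the robot steps run last. The real baseline would replace both step functions with learned denoisers conditioned on observations.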