arxiv: 2511.04671 · v2 · submitted 2025-11-06 · 💻 cs.RO · cs.AI · cs.CV

X-Diffusion: Training Diffusion Policies on Cross-Embodiment Human Demonstrations

Pith reviewed 2026-05-18 00:33 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.CV
keywords diffusion policies · cross-embodiment learning · human demonstrations · robot manipulation · Ambient Diffusion · real-world tasks

The pith

X-Diffusion trains diffusion policies by treating human actions as noisy robot counterparts at high noise levels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to extract useful task guidance from human videos even when the exact movements cannot be performed by a robot due to body differences. It adapts an existing diffusion training trick to add human data only during the noisiest stages of the process, where embodiment details disappear but object interaction intent remains. Experiments on five real robot manipulation tasks report a 16 percent average success improvement over simply mixing all data or hand-filtering it. A reader would care because robot data collection is costly while human videos are plentiful and already available at scale.

Core claim

X-Diffusion is a cross-embodiment learning framework based on Ambient Diffusion that selectively trains diffusion policies on noised human actions. It views human actions as noisy counterparts of robot actions: as noise increases along the forward diffusion process, embodiment-specific differences fade away while task-relevant guidance is preserved. This enables effective use of easy-to-collect human videos without sacrificing robot feasibility.
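
One way to make the "differences fade" part concrete is the standard DDPM forward process, which is assumed here because this page does not state the paper's noise schedule. The symbols $a^{H}$ and $a^{R}$ (a human action and a robot action for the same task step) and $\bar\alpha_t$ (the cumulative signal level at timestep $t$) are illustrative notation, not taken from the paper.

```latex
q(x_t \mid a) = \mathcal{N}\!\left(\sqrt{\bar\alpha_t}\, a,\; (1-\bar\alpha_t)\, I\right),
\qquad
D_{\mathrm{KL}}\!\left(q(x_t \mid a^{H}) \,\|\, q(x_t \mid a^{R})\right)
= \frac{\bar\alpha_t \,\lVert a^{H} - a^{R} \rVert^{2}}{2\,(1-\bar\alpha_t)}
```

Because $\bar\alpha_t$ decreases toward 0 as $t$ grows, the divergence shrinks and the two noised actions become hard to tell apart. This covers only half of the claim; whether task-relevant guidance survives at those noise levels is an empirical question.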

What carries the argument

X-Diffusion framework that incorporates human demonstrations only at high-noise timesteps of the forward diffusion process.
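
A minimal sketch of timestep-gated co-training follows, assuming a DDPM-style noise-prediction loss and a single fixed threshold. The names `uses_sample`, `gated_denoising_loss`, and `t_min_human` are hypothetical, and the fixed threshold is a simplification: the figure captions indicate that a classifier is involved in deciding where human data is admitted.

```python
import random


def uses_sample(is_human, t, t_min_human):
    """Robot samples train at every timestep; human samples only at t >= t_min_human."""
    return (not is_human) or t >= t_min_human


def gated_denoising_loss(batch, predict_noise, alpha_bar, t_min_human):
    """Mean squared noise-prediction error over the samples that pass the gate.

    batch:         list of (action, is_human); action is a list of floats
    predict_noise: callable (noisy_action, t) -> list of floats (the policy)
    alpha_bar:     cumulative signal levels per timestep, decreasing in t
    t_min_human:   hypothetical threshold below which human data is skipped
    """
    total, count = 0.0, 0
    for action, is_human in batch:
        t = random.randrange(len(alpha_bar))
        if not uses_sample(is_human, t, t_min_human):
            continue  # low-noise step: embodiment details still visible
        eps = [random.gauss(0.0, 1.0) for _ in action]
        signal = alpha_bar[t] ** 0.5
        noise = (1.0 - alpha_bar[t]) ** 0.5
        noisy = [signal * a + noise * e for a, e in zip(action, eps)]
        pred = predict_noise(noisy, t)
        total += sum((p - e) ** 2 for p, e in zip(pred, eps)) / len(action)
        count += 1
    return total / count if count else 0.0
```

Setting `t_min_human` to 0 recovers naive co-training, and setting it to the number of timesteps recovers robot-only training, so the sketch spans both of the extremes the method is positioned between.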

If this is right

  • Average success rates improve by 16% over naive co-training and manual data filtering across five real-world manipulation tasks.
  • Robots acquire task intent from coarse human guidance without adopting infeasible execution details.
  • Human videos become a usable, scalable data source for diffusion policies.
  • Selective noise-based inclusion of cross-embodiment data outperforms both full mixing and filtering approaches.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same noise-level selection could be applied to other mismatched data sources such as internet videos or simulation rollouts.
  • Testing on longer-horizon or contact-rich tasks would show how much noise is required to bridge larger embodiment gaps.
  • The approach might reduce the need for manual data curation when scaling to thousands of unfiltered human clips.

Load-bearing premise

Human actions can be viewed as noisy counterparts of robot actions such that as noise increases along the forward diffusion process, embodiment-specific differences fade away while task-relevant guidance is preserved.
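
The premise can be probed with a toy calculation. The sketch below assumes a one-dimensional setting in which robot and human actions are Gaussians with equal variance and means separated by `gap`, and it evaluates the Bayes-optimal "robot" probability at the noised human mean. The function name, the Gaussian model, and the equal priors are all illustrative assumptions, not the paper's classifier.

```python
import math


def robot_probability_at_human_mean(gap, alpha_bar_t, data_var):
    """P(robot | x_t) at the noised human mean, for equal class priors.

    gap:         distance between clean human and robot action means
    alpha_bar_t: cumulative signal level in [0, 1]; smaller means more noise
    data_var:    variance of the clean action distributions (must be > 0)
    """
    # Variance of x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps
    var_t = alpha_bar_t * data_var + (1.0 - alpha_bar_t)
    # Squared distance between the two noised means, in units of var_t
    d2 = alpha_bar_t * gap ** 2 / var_t
    likelihood_ratio = math.exp(-d2 / 2.0)  # p_robot(x) / p_human(x)
    return likelihood_ratio / (1.0 + likelihood_ratio)
```

In this toy model the probability rises toward 0.5 as `alpha_bar_t` falls, meaning an ideal classifier can no longer separate the two sources. A larger `gap` needs more noise to reach the same level of confusion.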

What would settle it

Running the same five tasks with human data added only at low noise levels instead of high noise levels and finding no improvement over robot-only training.

Figures

Figures reproduced from arXiv: 2511.04671 by Atiksh Bhardwaj, Audrey Du, Chuanruo Ning, Edward W. Duan, Kushal Kedia, Maximus A. Pace, Prithwish Dan, Wei-Chiu Ma.

Figure 1: Overview of X-DIFFUSION: We introduce X-DIFFUSION, a framework to train diffusion policies on cross-embodiment human data containing a variety of execution styles. Naively co-training diffusion policies on human and robot datasets with mismatched dynamics can lead the denoising process to output dynamically infeasible actions for the robot, degrading performance below standard robot-only diffusion policy t…
Figure 2: Pipeline: X-DIFFUSION first unifies the state and action representation. State is represented by a colored segmentation mask of relevant objects using Grounded-SAM2 [37]. Action is represented via end-effector/human hand pose utilizing HaMeR [6] for retargeting. During the policy’s forward diffusion process, Gaussian noise is sampled and added to the clean actions. To determine if the policy should learn t…
Figure 3: Visualizing Actions under Noise and Classifier Predictions at various Diffusion Steps. Humans execute tasks in various ways. For example, when picking and placing a pan, a human can either execute a top-down grasp or a side grasp. Human actions that are feasible for robots (e.g. top-down grasp) overlap with robot action distribution under low noise timesteps. This data fools the classifier into believing i…
Figure 4: Performance vs. Baselines: We report task success rate on 5 different manipulation tasks and compare X-DIFFUSION against a robot-only baseline (Diffusion Policy) and various co-training baselines (Point-Policy, MotionTracks). DemoDiffusion is another diffusion-based method, but it doesn’t train the robot policy on human demonstrations. We find that X-DIFFUSION is the highest performing model on all tasks, …
Figure 5: Naive co-training learns infeasible robot actions: Including all human data in policy training can incentivize policies to learn strategies demonstrated by humans but infeasible for robots. On multiple tasks, a human may manipulate objects in ways that are not realizable for a robot. The policy input is the masked image with overlaid keypoints, concatenated with proprioceptive information. More details ar…
Figure 6: Classifier Robot Probability across forward diffusion process: As the noise levels increase, the human action distribution becomes more similar to the robot action distribution. The similarity of human actions with robot actions varies across tasks: as shown on the graphs, the distance between the human and robot action distributions at every noise level is smaller for Push Plate data compared to Bottle Up…
Abstract

Human videos are a scalable source of training data for robot learning. However, humans and robots significantly differ in embodiment, making many human actions infeasible for direct execution on a robot. Still, these demonstrations convey rich object-interaction cues and task intent. Our goal is to learn from this coarse guidance without transferring embodiment-specific, infeasible execution strategies. Recent advances in generative modeling tackle a related problem of learning from low-quality data. In particular, Ambient Diffusion is a recent method for diffusion modeling that incorporates low-quality data only at high-noise timesteps of the forward diffusion process. Our key insight is to view human actions as noisy counterparts of robot actions. As noise increases along the forward diffusion process, embodiment-specific differences fade away while task-relevant guidance is preserved. Based on these observations, we present X-Diffusion, a cross-embodiment learning framework based on Ambient Diffusion that selectively trains diffusion policies on noised human actions. This enables effective use of easy-to-collect human videos without sacrificing robot feasibility. Across five real-world manipulation tasks, we show that X-Diffusion improves average success rates by 16% over naive co-training and manual data filtering. The project website is available at https://portal-cornell.github.io/X-Diffusion/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes X-Diffusion, a framework adapting Ambient Diffusion to train diffusion policies on cross-embodiment human videos. Human actions are treated as noisy robot-action counterparts; training occurs selectively on noised human data only at high forward-diffusion timesteps so that embodiment-specific kinematics are suppressed while task-relevant object-interaction cues remain. On five real-world manipulation tasks the method reports a 16% average success-rate gain over naive co-training and manual filtering baselines.

Significance. If the empirical gains prove robust, the work offers a principled route to leverage abundant human video data for robot policy learning without transferring infeasible strategies. The explicit link between Ambient Diffusion’s high-noise regime and embodiment mismatch is a clean conceptual contribution that could generalize beyond the reported tasks.

major comments (2)
  1. [Experiments] Experiments / Results: The abstract and main results claim a 16% average improvement, yet provide no per-task success rates with standard deviations, number of evaluation trials, statistical significance tests, or explicit baseline hyper-parameter settings. Without these, it is impossible to determine whether the reported delta is reliable or driven by a few outlier runs.
  2. [Method] Method / §3.2 (Core Assumption): The central modeling choice—that Gaussian noise addition causes embodiment-specific kinematic differences to become indistinguishable from task structure—receives no isolating ablation. A control that injects human data at high noise levels but disables the Ambient Diffusion weighting (or uses uniform co-training at those timesteps) is required to show that the gain is not simply an artifact of increased data volume or diversity.
minor comments (2)
  1. [Method] Notation: The forward-process noise schedule and the precise timestep threshold used for human-data inclusion should be stated explicitly (e.g., as a single equation or table entry) rather than left to the supplementary material.
  2. [Experiments] Figures: Qualitative rollout visualizations would benefit from side-by-side comparison of failure modes under X-Diffusion versus the naive co-training baseline to illustrate the claimed reduction in infeasible actions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to incorporate additional experimental details and ablations for improved rigor and reproducibility.

  1. Referee: The abstract and main results claim a 16% average improvement, yet provide no per-task success rates with standard deviations, number of evaluation trials, statistical significance tests, or explicit baseline hyper-parameter settings. Without these, it is impossible to determine whether the reported delta is reliable or driven by a few outlier runs.

    Authors: We agree that these details are essential for assessing the reliability of the results. In the revised manuscript, we will expand the results section with a table reporting per-task success rates (including means and standard deviations) across 10 independent evaluation trials per task and method. We will also report the exact hyperparameter settings for all baselines (naive co-training and manual filtering) and include statistical significance tests such as paired t-tests with p-values to support the 16% average improvement. revision: yes

  2. Referee: The central modeling choice—that Gaussian noise addition causes embodiment-specific kinematic differences to become indistinguishable from task structure—receives no isolating ablation. A control that injects human data at high noise levels but disables the Ambient Diffusion weighting (or uses uniform co-training at those timesteps) is required to show that the gain is not simply an artifact of increased data volume or diversity.

    Authors: We appreciate this suggestion to better isolate the contribution of our core modeling assumption. In the revised version, we will add a new ablation study comparing X-Diffusion to a control variant that performs uniform co-training with human data at high noise timesteps but without the Ambient Diffusion selective weighting. This will help demonstrate that the observed gains arise from the principled high-noise selective training rather than from increased data volume alone. revision: yes
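
The paired t-test named in the first response could be computed along these lines. This is a sketch only: it assumes the pairing is across tasks (one success rate per task for each method), which the rebuttal does not specify, and the function name is hypothetical.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(method_rates, baseline_rates):
    """Return (t, degrees_of_freedom) for paired per-task success rates."""
    if len(method_rates) != len(baseline_rates):
        raise ValueError("both methods need one rate per task")
    diffs = [m - b for m, b in zip(method_rates, baseline_rates)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired observations")
    spread = stdev(diffs)  # sample standard deviation of the differences
    if spread == 0:
        raise ValueError("differences have zero variance; t is undefined")
    return mean(diffs) / (spread / math.sqrt(n)), n - 1
```

Turning the statistic into a p-value requires the t distribution with the returned degrees of freedom; a library routine such as SciPy's `scipy.stats.ttest_rel` returns both. With only five tasks the test has four degrees of freedom, so a 16% average gain may or may not reach significance depending on the per-task spread.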

Circularity Check

0 steps flagged

No significant circularity; empirical gains measured against explicit baselines

The paper's central contribution is an empirical framework that applies the external Ambient Diffusion technique to cross-embodiment data by treating human actions as noisy robot counterparts at high timesteps. The reported 16% average success-rate improvement is measured directly against two explicit baselines (naive co-training and manual data filtering) across five real-world tasks. No equations, fitted parameters, or self-citations are shown to reduce the performance delta or the core assumption to quantities defined by the method itself; the derivation chain remains self-contained and externally falsifiable via the controlled experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the domain assumption that human actions function as noisy robot actions at high diffusion noise levels; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: Human actions can be viewed as noisy counterparts of robot actions such that embodiment-specific differences fade at high noise while task-relevant guidance is preserved.
    This premise is stated as the key insight enabling selective training on human data.

pith-pipeline@v0.9.0 · 5791 in / 1206 out tokens · 37564 ms · 2026-05-18T00:33:25.478340+00:00 · methodology

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 3 internal anchors

  1. [1]

    Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

    C. Chi, S. Feng, Y. Du, Z. Xu, E. Cousineau, B. Burchfiel, and S. Song, “Diffusion policy: Visuomotor policy learning via action diffusion,” arXiv preprint arXiv:2303.04137, 2023

  2. [2]

    Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

    T. Zhao, V. Kumar, S. Levine, and C. Finn, “Learning fine-grained bimanual manipulation with low-cost hardware,” ArXiv, vol. abs/2304.13705, 2023

  3. [3]

    Motion tracks: A unified representation for human-robot transfer in few-shot imitation learning

    J. Ren, P. Sundaresan, D. Sadigh, S. Choudhury, and J. Bohg, “Motion tracks: A unified representation for human-robot transfer in few-shot imitation learning,” ArXiv, vol. abs/2501.06994, 2025

  4. [4]

    Point policy: Unifying observations and actions with key points for robot manipulation. arXiv preprint arXiv:2502.20391, 2025

    S. Haldar and L. Pinto, “Point policy: Unifying observations and actions with key points for robot manipulation,” ArXiv, vol. abs/2502.20391, 2025

  5. [5]

    Phantom: Training robots without robots using only human videos, 2025

    M. Lepert, J. Fang, and J. Bohg, “Phantom: Training robots without robots using only human videos,” ArXiv, vol. abs/2503.00779, 2025

  6. [6]

    Reconstructing hands in 3d with transformers

    G. Pavlakos, D. Shan, I. Radosavovic, A. Kanazawa, D. F. Fouhey, and J. Malik, “Reconstructing hands in 3d with transformers,” 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9826–9836, 2023

  7. [7]

    Dexwild: Dexterous human interactions for in-the-wild robot policies

    T. Tao, M. K. Srirama, J. J. Liu, K. Shaw, and D. Pathak, “Dexwild: Dexterous human interactions for in-the-wild robot policies,” ArXiv, vol. abs/2505.07813, 2025

  8. [8]

    Egozero: Robot learning from smart glasses

    V. Liu, A. Adeniji, H. Zhan, R. M. Bhirangi, P. Abbeel, and L. Pinto, “Egozero: Robot learning from smart glasses,” ArXiv, vol. abs/2505.20290, 2025

  9. [9]

    Masquerade: Learning from in-the-wild human videos using data-editing. arXiv preprint arXiv:2508.09976, 2025

    M. Lepert, J. Fang, and J. Bohg, “Masquerade: Learning from in-the-wild human videos using data-editing,” ArXiv, vol. abs/2508.09976, 2025

  10. [10]

    Zeromimic: Distilling robotic manipulation skills from web videos

    J. Shi, Z. Zhao, T. Wang, I. Pedroza, A. Luo, J. Wang, J. Ma, and D. Jayaraman, “Zeromimic: Distilling robotic manipulation skills from web videos,” 2025 IEEE International Conference on Robotics and Automation (ICRA), pp. 16939–16947, 2025

  11. [11]

    Zero-shot robot manipulation from passive human videos. arXiv preprint arXiv:2302.02011, 2023

    H. Bharadhwaj, A. Gupta, S. Tulsiani, and V. Kumar, “Zero-shot robot manipulation from passive human videos,” vol. abs/2302.02011, 2023

  12. [12]

    Track2act: Predicting point tracks from internet videos enables generalizable robot manipulation

    H. Bharadhwaj, R. Mottaghi, A. Gupta, and S. Tulsiani, “Track2act: Predicting point tracks from internet videos enables diverse zero-shot robot manipulation,” arXiv preprint arXiv:2405.01527, 2024

  13. [13]

    Mimicplay: Long-horizon imitation learning by watching human play

    C. Wang, L. Fan, J. Sun, R. Zhang, L. Fei-Fei, D. Xu, Y. Zhu, and A. Anandkumar, “Mimicplay: Long-horizon imitation learning by watching human play,” 2023

  14. [14]

    Robotic telekinesis: Learning a robotic hand imitator by watching humans on youtube

    A. Sivakumar, K. Shaw, and D. Pathak, “Robotic telekinesis: Learning a robotic hand imitator by watching humans on youtube,” 2022

  15. [15]

    Videodex: Learning dexterity from internet videos

    K. Shaw, S. Bahl, and D. Pathak, “Videodex: Learning dexterity from internet videos,” in Conference on Robot Learning, 2022

  16. [16]

    Dexterous imitation made easy: A learning-based framework for efficient dexterous manipulation

    S. P. Arunachalam, S. Silwal, B. Evans, and L. Pinto, “Dexterous imitation made easy: A learning-based framework for efficient dexterous manipulation,” 2022

  17. [17]

    Learning continuous grasping function with a dexterous hand from human demonstrations

    J. Ye, J. Wang, B. Huang, Y. Qin, and X. Wang, “Learning continuous grasping function with a dexterous hand from human demonstrations,” vol. 8, 2022, pp. 2882–2889

  18. [18]

    Shadow: Leveraging segmentation masks for cross-embodiment policy transfer. arXiv preprint arXiv:2503.00774, 2025

    M. Lepert, R. Doshi, and J. Bohg, “Shadow: Leveraging segmentation masks for cross-embodiment policy transfer,” ArXiv, vol. abs/2503.00774, 2025

  19. [19]

    Human-to-robot imitation in the wild

    S. Bahl, A. Gupta, and D. Pathak, “Human-to-robot imitation in the wild,” 2022

  20. [20]

    Vision-based manipulation from single human video with open-world object graphs

    Y. Zhu, A. Lim, P. Stone, and Y. Zhu, “Vision-based manipulation from single human video with open-world object graphs,” arXiv preprint arXiv:2405.20321, 2024

  21. [21]

    One-shot imitation learning: A pose estimation perspective

    P. Vitiello, K. Dreczkowski, and E. Johns, “One-shot imitation learning: A pose estimation perspective,” in Conference on Robot Learning, 2023

  22. [22]

    Okami: Teaching humanoid robots manipulation skills through single video imitation. arXiv preprint arXiv:2410.11792, 2024

    J. Li, Y. Zhu, Y. Xie, Z. Jiang, M. Seo, G. Pavlakos, and Y. Zhu, “Okami: Teaching humanoid robots manipulation skills through single video imitation,” ArXiv, vol. abs/2410.11792, 2024

  23. [23]

    Deepmimic: Example-guided deep reinforcement learning of physics-based character skills

    X. B. Peng, P. Abbeel, S. Levine, and M. van de Panne, “Deepmimic: Example-guided deep reinforcement learning of physics-based character skills,” ACM Trans. Graph., vol. 37, no. 4, pp. 143:1–143:14, Jul. 2018

  24. [24]

    Hermes: Human-to-robot embodied learning from multi-source motion data for mobile dexterous manipulation

    Z. Yuan, T. Wei, L. Gu, P. Hua, T. Liang, Y. Chen, and H. Xu, “Hermes: Human-to-robot embodied learning from multi-source motion data for mobile dexterous manipulation,” 2025

  25. [25]

    Xirl: Cross-embodiment inverse reinforcement learning

    K. Zakka, A. Zeng, P. R. Florence, J. Tompson, J. Bohg, and D. Dwibedi, “Xirl: Cross-embodiment inverse reinforcement learning,” in Conference on Robot Learning, 2021

  26. [26]

    Rank2reward: Learning shaped reward functions from passive video

    D. Yang, D. Tjia, J. Berg, D. Damen, P. Agrawal, and A. Gupta, “Rank2reward: Learning shaped reward functions from passive video,” in 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 2806–2813

  27. [27]

    Imitation learning from a single temporally misaligned video

    W. Huey, H. Wang, A. Wu, Y. Artzi, and S. Choudhury, “Imitation learning from a single temporally misaligned video,” ArXiv, vol. abs/2502.05397, 2025

  28. [28]

    Concept2robot: Learning manipulation concepts from instructions and human demonstrations

    L. Shao, T. Migimatsu, Q. Zhang, K. Yang, and J. Bohg, “Concept2robot: Learning manipulation concepts from instructions and human demonstrations,” vol. 40, 2020, pp. 1419–1434

  29. [29]

    Learning generalizable robotic reward functions from “in-the-wild” human videos. arXiv preprint arXiv:2103.16817

    A. S. Chen, S. Nair, and C. Finn, “Learning generalizable robotic reward functions from “in-the-wild” human videos,” vol. abs/2103.16817, 2021

  30. [30]

    X-sim: Cross-embodiment learning via real-to-sim-to-real

    P. Dan, K. Kedia, A. Chao, E. W. Duan, M. A. Pace, W.-C. Ma, and S. Choudhury, “X-sim: Cross-embodiment learning via real-to-sim-to-real,” 2025

  31. [31]

    Crossing the human-robot embodiment gap with sim-to-real rl using one human demonstration

    T. Ga, W. Lum, O. Y. Lee, C. K. Liu, J. Bohg, and P.-M. H. Pose, “Crossing the human-robot embodiment gap with sim-to-real rl using one human demonstration,” 2025

  32. [32]

    Flow as the cross-domain manipulation interface

    M. Xu, Z. Xu, Y. Xu, C. Chi, G. Wetzstein, M. Veloso, and S. Song, “Flow as the cross-domain manipulation interface,” in Conference on Robot Learning, 2024

  33. [33]

    Combining self-supervised learning and imitation for vision-based rope manipulation

    A. Nair, D. Chen, P. Agrawal, P. Isola, P. Abbeel, J. Malik, and S. Levine, “Combining self-supervised learning and imitation for vision-based rope manipulation,” 2017, pp. 2146–2153

  34. [34]

    Graph-structured visual imitation

    M. Sieb, X. Zhou, A. Huang, O. Kroemer, and K. Fragkiadaki, “Graph-structured visual imitation,” in Conference on Robot Learning, 2019

  35. [35]

    Learning predictive models from observation and interaction

    K. Schmeckpeper, A. Xie, O. Rybkin, S. Tian, K. Daniilidis, S. Levine, and C. Finn, “Learning predictive models from observation and interaction,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XX. Springer, 2020, pp. 708–725

  36. [36]

    Graph inverse reinforcement learning from diverse videos

    S. Kumar, J. Zamora, N. Hansen, R. Jangir, and X. Wang, “Graph inverse reinforcement learning from diverse videos,” 2022

  37. [37]

    Sam 2: Segment anything in images and videos

    N. Ravi, V. Gabeur, Y.-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. Rädle, C. Rolland, L. Gustafson, E. Mintun, J. Pan, K. V. Alwala, N. Carion, C.-Y. Wu, R. Girshick, P. Dollár, and C. Feichtenhofer, “Sam 2: Segment anything in images and videos,” 2024

  38. [38]

    Bc-z: Zero-shot task generalization with robotic imitation learning, 2022

    E. Jang, A. Irpan, M. Khansari, D. Kappler, F. Ebert, C. Lynch, S. Levine, and C. Finn, “Bc-z: Zero-shot task generalization with robotic imitation learning,” vol. abs/2202.02005, 2022

  39. [39]

    Vid2robot: End-to-end video-conditioned policy learning with cross-attention transformers

    V. Jain, M. Attarian, N. Joshi, A. Wahid, D. Driess, Q. Vuong, P. R. Sanketi, P. Sermanet, S. Welker, C. Chan, I. Gilitschenski, Y. Bisk, and D. Dwibedi, “Vid2robot: End-to-end video-conditioned policy learning with cross-attention transformers,” vol. abs/2403.12943, 2024

  40. [40]

    One-shot imitation under mismatched execution

    K. Kedia, P. Dan, A. Chao, M. A. Pace, and S. Choudhury, “One-shot imitation under mismatched execution,” arXiv preprint arXiv:2409.06615, 2024

  41. [41]

    XSkill: Cross embodiment skill discovery

    M. Xu, Z. Xu, C. Chi, M. Veloso, and S. Song, “XSkill: Cross embodiment skill discovery,” in 7th Annual Conference on Robot Learning, 2023

  42. [42]

    Mimicdroid: In-context learning for humanoid manipulation from human play videos

    R. Shah, S. Liu, Q. Wang, Z. Jiang, S. Kumar, M. Seo, R. Martín-Martín, and Y. Zhu, “Mimicdroid: In-context learning for humanoid manipulation from human play videos,” arXiv preprint arXiv:2509.09769, 2025

  43. [43]

    Instant policy: In-context imitation learning via graph diffusion

    V. Vosylius and E. Johns, “Instant policy: In-context imitation learning via graph diffusion,” 2025

  44. [44]

    Demodiffusion: One-shot human imitation using pre-trained diffusion policy

    S. Park, H. Bharadhwaj, and S. Tulsiani, “Demodiffusion: One-shot human imitation using pre-trained diffusion policy,” 2025

  45. [45]

    Curating demonstrations using online experience

    A. S. Chen, A. M. Lessing, Y. Liu, and C. Finn, “Curating demonstrations using online experience,” 2025

  46. [46]

    Cupid: Curating data your robot loves with influence functions

    C. Agia, R. Sinha, J. Yang, R. Antonova, M. Pavone, H. Nishimura, M. Itkina, and J. Bohg, “Cupid: Curating data your robot loves with influence functions,” 2025

  47. [47]

    Re-mix: Optimizing data mixtures for large scale imitation learning

    J. Hejna, C. Bhateja, Y. Jiang, K. Pertsch, and D. Sadigh, “Re-mix: Optimizing data mixtures for large scale imitation learning,” 2024

  48. [48]

    Emu: Enhancing image generation models using photogenic needles in a haystack

    X. Dai, J. Hou, C.-Y. Ma, S. Tsai, J. Wang, R. Wang, P. Zhang, S. Vandenhende, X. Wang, A. Dubey, M. Yu, A. Kadian, F. Radenovic, D. Mahajan, K. Li, Y. Zhao, V. Petrovic, M. K. Singh, S. Motwani, Y. Wen, Y. Song, R. Sumbaly, V. Ramanathan, Z. He, P. Vajda, and D. Parikh, “Emu: Enhancing image generation models using photogenic needles in a haystack,” 2023

  49. [49]

    Laion-5b: An open large-scale dataset for training next generation image-text models

    C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman, P. Schramowski, S. Kundurthy, K. Crowson, L. Schmidt, R. Kaczmarczyk, and J. Jitsev, “Laion-5b: An open large-scale dataset for training next generation image-text models,” 2022

  50. [50]

    Ambient diffusion omni: Training good models with bad data

    G. Daras, A. Rodriguez-Munoz, A. Klivans, A. Torralba, and C. Daskalakis, “Ambient diffusion omni: Training good models with bad data,” 2025

  51. [51]

    Grounded sam: Assembling open-world models for diverse visual tasks

    T. Ren, S. Liu, A. Zeng, J. Lin, K. Li, H. Cao, J. Chen, X. Huang, Y. Chen, F. Yan et al., “Grounded sam: Assembling open-world models for diverse visual tasks,” arXiv, 2024

  52. [52]

    Emergent correspondence from image diffusion

    L. Tang, M. Jia, Q. Wang, C. P. Phoo, and B. Hariharan, “Emergent correspondence from image diffusion,” Advances in Neural Information Processing Systems, vol. 36, pp. 1363–1389, 2023

  53. [53]

    Cotracker: It is better to track together

    N. Karaev, I. Rocco, B. Graham, N. Neverova, A. Vedaldi, and C. Rupprecht, “Cotracker: It is better to track together,” in European conference on computer vision. Springer, 2024, pp. 18–35.

    APPENDIX A. Contributions • Maximus A. Pace: Investigated different algorithms for using human data in policy learning, set up the data collection pipeline using teleoper...

  54. [54]

    Robot Demonstrations: The robot’s proprioception q_t is computed using forward kinematics given its joint angles and gripper status (e.g. open or closed) at timestep t. Visual observations o_t are obtained by applying Grounded-SAM 2 [51] with language prompts on a single-view RGB capture of the scene and overlaying end-effector keypoint renderings

  55. [55]

    Human Demonstrations: We use HaMeR [6] to detect a set of 21 keypoints in 2D pixel space for each camera. We select 5 of these keypoints along the index finger and thumb to be retargeted into a parallel jaw. Using two cameras with known parameters, we triangulate these keypoints into the same 3D coordinate frame as the robot to obtain p_t and apply the Kabs...

  56. [56]

    Diffusion Policy: This baseline uses the vanilla Diffusion Policy architecture trained only on a small set of robot demonstrations

  57. [57]

    Point Policy: Instead of using segmented images in its visual observation o_t, this baseline represents state via 3D keypoints of relevant objects at each timestep t. The keypoints are annotated in the first frame of one training demonstration, and correspondences are automatically detected at the start of all other demonstrations and at inference time usi...

  58. [58]

    Motion Tracks: This baseline consumes the raw RGB image (without segmentations) and end-effector proprioception as input. The original paper for MOTIONTRACKS uses a keypoint retargeting network to minimize any gap between hand and end-effector keypoints, which we alleviate in our implementation by unifying the proprioception directly into end-effector pos...

  59. [59]

    DemoDiffusion: This baseline leverages two Diffusion Policies: human policy π_H is trained on the full human dataset D_H, and robot policy π_R is trained on the full robot dataset D_R. The reverse diffusion process is completed by using the human policy π_H for the initial denoising steps, followed by the robot policy π_R for the remainder of the denoising steps...
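
The handoff described for the DemoDiffusion baseline can be sketched as a reverse loop that switches denoisers partway through. This is an illustration under assumptions: `human_step`, `robot_step`, and `num_human_steps` are hypothetical names, each step function stands in for one denoising update of the corresponding policy, and the excerpt does not say where the switch happens.

```python
def two_stage_reverse_process(x_init, human_step, robot_step,
                              num_steps, num_human_steps):
    """Denoise from timestep num_steps - 1 down to 0 using two policies.

    The first `num_human_steps` updates (the noisiest timesteps) use the
    human-trained policy; all remaining updates use the robot-trained policy.
    Each step function has signature (x, t) -> x.
    """
    if not 0 <= num_human_steps <= num_steps:
        raise ValueError("num_human_steps must lie in [0, num_steps]")
    x = x_init
    for i, t in enumerate(range(num_steps - 1, -1, -1)):
        step = human_step if i < num_human_steps else robot_step
        x = step(x, t)
    return x
```

The contrast with X-Diffusion, as the Figure 4 caption puts it, is that this baseline does not train the robot policy on human demonstrations; the human data only shapes the early part of sampling.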