
arxiv: 2606.28813 · v1 · pith:W26KKL6Dnew · submitted 2026-06-27 · 💻 cs.RO · cs.AI

Human2Any: Human-to-Robot Transfer via Constraint-Aware Compositional Planning

Pith reviewed 2026-06-30 09:57 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords human video transfer · object-centric manipulation · compositional planning · embodiment adaptation · robot learning from video · interaction priors

The pith

Human videos supply reusable object interaction knowledge that robots can apply to new tasks by combining with their own feasibility planning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that manipulation tasks can be learned from human videos by focusing on how objects interact with each other rather than how the human body moves. This representation lets the same knowledge transfer to different robots and scenes. The method learns interaction priors from video and then uses robot-specific planning to make them work. If successful, it would mean robots can learn many tasks from abundant human video data without needing matching robot demonstrations for each new situation.

Core claim

Human2Any represents manipulation tasks through object-object interaction motion extracted from human videos. These priors are then composed with robot-side feasibility reasoning and motion planning to generate actions that adapt to different robot embodiments and scene geometries, enabling transfer without real-world robot training data in the target contexts.

What carries the argument

Object-object interaction motion, which captures task-relevant scene changes while ignoring embodiment-specific details.
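
One way to make this concrete, as a minimal sketch rather than the paper's formulation: express the tool object's pose in the target object's frame, so the description no longer depends on where the pair sits in the scene or on what moved the tool. The planar (x, y, theta) poses, the `Pose` alias, and all function names below are illustrative assumptions, not taken from the paper.

```python
import math
from typing import Tuple

# Hypothetical planar pose: (x, y, heading in radians).
Pose = Tuple[float, float, float]


def compose(a: Pose, b: Pose) -> Pose:
    """Apply pose b in the frame of pose a (SE(2) composition)."""
    ax, ay, at = a
    bx, by, bt = b
    return (
        ax + math.cos(at) * bx - math.sin(at) * by,
        ay + math.sin(at) * bx + math.cos(at) * by,
        at + bt,
    )


def inverse(p: Pose) -> Pose:
    """Inverse of an SE(2) pose: rotate by -theta, negate the rotated offset."""
    x, y, t = p
    return (
        -(math.cos(t) * x + math.sin(t) * y),
        -(-math.sin(t) * x + math.cos(t) * y),
        -t,
    )


def relative(tool: Pose, target: Pose) -> Pose:
    """Tool pose expressed in the target object's frame."""
    return compose(inverse(target), tool)


tool = (0.5, 0.2, 0.3)
target = (1.0, -0.4, 1.2)
shift = (2.0, 3.0, 0.7)  # hypothetical change of scene layout

rel_before = relative(tool, target)
rel_after = relative(compose(shift, tool), compose(shift, target))
```

Both calls yield the same relative pose up to floating-point error, which is the sense in which an object-object description is unaffected by where the scene is placed.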

If this is right

  • Human-derived interaction knowledge applies directly to a Franka arm and an RBY-1 humanoid without retraining on robot data.
  • The same priors handle variations in scene geometry and task contexts through compositional planning.
  • Robots can perform manipulation in settings where collecting robot demonstrations is difficult or unsafe.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This method could lower the barrier for deploying robots in new environments by leveraging existing video data.
  • It opens the possibility of scaling robot learning to the volume of human video available online.
  • The compositional nature suggests it could be combined with other planning methods for more complex behaviors.

Load-bearing premise

The motion of objects interacting in human videos contains the essential information needed for successful robot execution once feasibility is checked.

What would settle it

The approach would be falsified if composed plans frequently failed on tasks where the human videos show clear object interactions, because robot execution missed constraints or details that the interaction motion does not capture.

Figures

Figures reproduced from arXiv: 2606.28813 by Ajay Mandlekar, Alfred Cueva, Caelan Garrett, Chuye Zhang, Danfei Xu, Shuo Cheng.

Figure 1
Figure 1: Human-to-robot transfer with Human2Any. Human2Any learns object-centric interaction priors from human videos and adapts them to diverse robot embodiments, tasks, and contexts. 1 Introduction. Learning manipulation systems that can generate feasible and purposeful motions for solving everyday tasks remains a fundamental yet challenging problem [1, 2, 3, 4]. While behavior cloning…
Figure 2
Figure 2: Human2Any overview. Given human demonstration videos, Human2Any extracts object-centric interaction motion between tool and target objects and learns object–object interaction priors that are independent of the human embodiment. Separately, robot-specific agent–object priors model how a particular robot embodiment grasps and controls tool objects. At deployment, these learned priors are composed under the…
Figure 3
Figure 3: Toy illustration. Independent samplers produce locally plausible samples, but feasible joint compositions occupy only a small subset of the product space. Steering uses feasibility scores during denoising to concentrate particles around valid compositions.
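
The toy picture in this caption can be imitated in a few lines. The sketch below is an illustration under stated assumptions, not the paper's algorithm: two scalar "factors" are sampled independently, a made-up Gaussian `feasibility` score rewards pairs that agree, shrinking random noise stands in for denoising steps, and multinomial resampling plays the role of steering. All names and constants are hypothetical.

```python
import math
import random


def feasibility(grasp: float, motion: float) -> float:
    """Hypothetical soft score in (0, 1]: high when the two samples agree."""
    return math.exp(-((grasp - motion) ** 2) / (2 * 0.05 ** 2))


def sample_compositions(n: int, steps: int, steer: bool, seed: int = 0):
    rng = random.Random(seed)
    # Independent priors: each factor is drawn without looking at the other.
    particles = [(rng.random(), rng.random()) for _ in range(n)]
    for step in range(steps):
        # Shrinking noise, a stand-in for successive denoising steps.
        noise = 0.1 * (1 - step / steps)
        particles = [
            (g + rng.gauss(0, noise), m + rng.gauss(0, noise))
            for g, m in particles
        ]
        if steer:
            # Steering: resample particles in proportion to joint feasibility.
            weights = [feasibility(g, m) for g, m in particles]
            particles = rng.choices(particles, weights=weights, k=n)
    return particles


def feasible_fraction(particles, tol: float = 0.1) -> float:
    """Share of particles whose two factors are jointly compatible."""
    return sum(abs(g - m) < tol for g, m in particles) / len(particles)


baseline = feasible_fraction(sample_compositions(500, 10, steer=False))
steered = feasible_fraction(sample_compositions(500, 10, steer=True))
print(f"feasible without steering: {baseline:.2f}, with steering: {steered:.2f}")
```

Without resampling, only the pairs that happen to agree count as feasible; with resampling at each step, the particle set drifts toward the compatible region.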
Figure 4
Figure 4: Tasks. We evaluate Human2Any on a diverse set of simulated and real-world manipulation tasks spanning fine-grained object interactions, pick-and-place, tool use, and long-horizon skill chaining. For each simulation task, we show the initial states of three variants, along with one successful final state highlighted by a green border. These tasks cover diverse object shapes, poses, and scene layouts, and a…
Figure 5
Figure 5: Steering, efficiency, and scaling. Left: particle-filtering-based steering guides sampled compositions toward higher-feasibility regions under grasp, kinematic, and scene constraints. Right: Human2Any improves task-level throughput and achieves better performance with more interaction-motion training data. … increasingly feasible, with high-quality samples in green and low-quality samples in red. At the task…
Figure 6
Figure 6 shows the real-world hardware setups used in our experiments. We obtain scene point clouds from calibrated cameras. For the Franka setup, we use an Intel RealSense D435; for the RBY-1 setup, we use Meta Aria Gen 2 glasses and estimate depth maps with FoundationStereo [72]. We segment objects with SAM 2 [18] and identify task-relevant objects by extracting CLIP [73] features for each mask region. Once a val…
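
The object-identification step can be pictured as a nearest-embedding lookup. The sketch below assumes each mask region already has a feature vector and that the task-relevant object is chosen by cosine similarity to a query embedding; the text above does not say how the features are compared, so that choice, the three-dimensional placeholder vectors, and the names `cosine` and `pick_mask` are all assumptions. No real CLIP or SAM 2 calls are made.

```python
import math
from typing import Dict, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pick_mask(mask_features: Dict[str, Sequence[float]],
              query: Sequence[float]) -> str:
    """Return the mask whose feature vector is most similar to the query."""
    return max(mask_features, key=lambda name: cosine(mask_features[name], query))


# Placeholder features standing in for per-mask image embeddings.
mask_features = {
    "mask_0": [0.9, 0.1, 0.0],
    "mask_1": [0.1, 0.8, 0.3],
    "mask_2": [0.0, 0.2, 0.9],
}
query = [0.2, 0.9, 0.2]  # placeholder embedding of the object of interest

best = pick_mask(mask_features, query)
print(best)
```

In a real pipeline the vectors would come from an image-text model and have hundreds of dimensions; the selection logic stays the same.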
Figure 7
Figure 7: Real-world reset settings. Object pose randomization ranges used for Franka real-world evaluation. Task variants and generalization. The simulation domains test both local robustness and broader contextual generalization. Each domain contains three variants with distinct global layouts, object arrangements, and interaction contexts; within each variant, object poses are randomized by approximately 0.1 m i…
Figure 8
Figure 8: Simulation reset settings. Global layout variants and local object pose randomization ranges used in simulation. B.2 Baseline Settings. Training data. We use strong supervised variants of the baselines whenever possible. In simulation, DP3 and Im2Flow2Act are trained with 100 robot demonstrations per task from MimicLab [10]. Although the original Im2Flow2Act trains its flow-conditioned policy from predef…
Figure 9
Figure 9: Human data processing. We visualize RGB observations with depth overlays, 3D point tracks, and recovered object-centric trajectories from human demonstrations.
Figure 10
Figure 10: Real-world rollouts on PourInBowl tasks. The robot avoids the kettle, grasps the cup by its handle, and rotates it during the transfer motion to align the cup for pouring into the bowl.
Figure 11
Figure 11: Real-world rollouts on HangMugTree tasks. The robot aligns its grasp to avoid non-target objects and reorients the mug on the fly to accurately place the handle onto the mug tree.
Figure 12
Figure 12: Real-world rollouts on SortUtensils tasks. The robot sequentially places a bowl onto a plate and a utensil onto the bowl, demonstrating long-horizon skill composition across varying object arrangements and scene layouts.
Figure 13
Figure 13: Real-world rollouts on PourCup tasks. The robot firmly grasps the cup and leverages whole-body motion to pour the pink ball onto the target container.
Figure 15
Figure 15: Simulation rollouts. Qualitative POURINBOWL executions showing precise pouring interactions under varied object poses and scene layouts.
Figure 16
Figure 16: Simulation rollouts. Qualitative HANGMUGTREE executions showing fine-grained interaction motion and precise mug–branch alignment.
Figure 17
Figure 17: Simulation rollouts. Qualitative executions of Human2Any on PREPARETABLE, demonstrating long-horizon skill chaining through multi-stage interaction composition.
Original abstract

Human videos are a scalable source of supervision for robot manipulation, as they are abundant and naturally capture rich object interactions. However, transferring human demonstrations to robots remains challenging due to embodiment mismatch, scene variation, and robot-specific feasibility constraints. We present Human2Any, a framework for learning reusable object-centric interaction priors from human videos without requiring real-world robot demonstrations in the target task contexts. Human2Any represents manipulation through object-object interaction motion, capturing task-relevant scene changes while abstracting away embodiment-specific details. It composes learned interaction priors with robot-side feasibility reasoning and motion planning, allowing the same human-derived knowledge to adapt to different embodiments, scene geometries, and task contexts. We validate Human2Any across diverse manipulation settings, including real-world experiments on a Franka tabletop setup and an RBY-1 humanoid mobile robot, demonstrating robust interaction-centric manipulation without real-world robot training data. Project website: https://human2any.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces Human2Any, a framework for human-to-robot transfer in manipulation that learns reusable object-centric interaction priors from human videos. It represents tasks via object-object interaction motions to abstract embodiment details, then composes these priors with robot-side feasibility reasoning and motion planning to adapt to new embodiments, scenes, and contexts. Validation is claimed via real-world experiments on a Franka arm and RBY-1 humanoid without target-task robot demonstrations.

Significance. If the compositional transfer mechanism holds, the work could meaningfully advance scalable robot learning by reducing reliance on robot-specific data collection. The emphasis on object-object interactions and constraint-aware planning addresses embodiment mismatch in a potentially generalizable way, though the absence of technical details limits assessment of novelty relative to prior video-to-robot methods.

major comments (2)
  1. [Abstract] The central claim that object-object interaction motion captures task-relevant scene changes while abstracting embodiment details is stated at a high level without any formal representation, equations, or pseudocode for the prior learning or composition step, making it impossible to evaluate internal consistency or how feasibility constraints are enforced.
  2. [Abstract] No quantitative metrics, baselines, success rates, or ablation results are reported for the Franka tabletop or RBY-1 experiments, so the claim of 'robust interaction-centric manipulation without real-world robot training data' cannot be assessed for support by the data.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their comments. We address each point below and will revise the abstract to better support evaluation of the claims.

Point-by-point responses
  1. Referee: [Abstract] The central claim that object-object interaction motion captures task-relevant scene changes while abstracting embodiment details is stated at a high level without any formal representation, equations, or pseudocode for the prior learning or composition step, making it impossible to evaluate internal consistency or how feasibility constraints are enforced.

    Authors: The abstract is intentionally high-level due to length constraints. The manuscript provides the formal representation of the object-object interaction prior, its learning objective, the composition with robot-side feasibility constraints, and pseudocode in Sections 3 and 4. To improve accessibility from the abstract alone, we will add a concise reference to the key formulation and constraint enforcement mechanism. revision: yes

  2. Referee: [Abstract] No quantitative metrics, baselines, success rates, or ablation results are reported for the Franka tabletop or RBY-1 experiments, so the claim of 'robust interaction-centric manipulation without real-world robot training data' cannot be assessed for support by the data.

    Authors: We agree that the abstract lacks specific quantitative results. The manuscript reports these details, including success rates, baseline comparisons, and ablations for the Franka and RBY-1 experiments, in Section 5. We will revise the abstract to include key metrics that support the central claim. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The provided manuscript text consists of a high-level description of a compositional framework that learns object-centric interaction priors from human videos and composes them with robot feasibility reasoning and motion planning. No equations, derivations, fitted parameters, or load-bearing self-citations are present in the abstract or summary. The central claim is a methodological composition rather than a mathematical reduction, and the validation is described as empirical across robot platforms without any indication that results are forced by construction from inputs. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no information on free parameters, axioms, or invented entities.



Reference graph

Works this paper leans on

80 extracted references · 34 canonical work pages · 11 internal anchors

  1. [1]

    R. Firoozi, J. Tucker, S. Tian, A. Majumdar, J. Sun, W. Liu, Y. Zhu, S. Song, A. Kapoor, K. Hausman, et al. Foundation models in robotics: Applications, challenges, and the future. The International Journal of Robotics Research, 2024. doi:10.1177/02783649241281508

  2. [2]

    B. Liu, Y. Zhu, C. Gao, Y. Feng, Q. Liu, Y. Zhu, and P. Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems, 36:44776–44791, 2023

  3. [3]

    Z. Zhao, S. Cheng, Y. Ding, Z. Zhou, S. Zhang, D. Xu, and Y. Zhao. A survey of optimization-based task and motion planning: From classical to learning approaches. IEEE/ASME Transactions on Mechatronics, 30(4):2799–2825, Aug. 2025. ISSN 1941-014X. doi:10.1109/tmech.2024.3452509. URL http://dx.doi.org/10.1109/TMECH.2024.3452509

  4. [4]

    T. L. Team, J. Barreiros, A. Beaulieu, A. Bhat, R. Cory, E. Cousineau, H. Dai, C.-H. Fang, K. Hashimoto, M. Z. Irshad, M. Itkina, N. Kuppuswamy, K.-H. Lee, K. Liu, D. McConachie, I. McMahon, H. Nishimura, C. Phillips-Grafflin, C. Richter, P. Shah, K. Srinivasan, B. Wulfe, C. Xu, M. Zhang, A. Alspach, M. Angeles, K. Arora, V. C. Guizilini, A. Castro, D. C...

  5. [5]

    C. Chi, S. Feng, Y. Du, Z. Xu, E. Cousineau, B. Burchfiel, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion. In Proceedings of Robotics: Science and Systems (RSS), 2023

  6. [6]

    What Matters in Learning from Offline Human Demonstrations for Robot Manipulation

    A. Mandlekar, D. Xu, J. Wong, S. Nasiriany, C. Wang, R. Kulkarni, L. Fei-Fei, S. Savarese, Y. Zhu, and R. Martín-Martín. What matters in learning from offline human demonstrations for robot manipulation. arXiv preprint arXiv:2108.03298, 2021

  7. [7]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, A. Dubey, C. Finn, P. Florence, C. Fu, M. G. Arenas, K. Gopalakrishnan, K. Han, K. Hausman, A. Herzog, J. Hsu, B. Ichter, A. Irpan, N. Joshi, R. Julian, D. Kalashnikov, Y. Kuang, I. Leal, L. Lee, T.-W. E. Lee, S. Levine, Y. Lu, H. Michalewski, I. Mordatch, K. Pe...

  8. [8]

    P. Florence, C. Lynch, A. Zeng, O. Ramirez, A. Wahid, L. Downs, A. Wong, J. Lee, I. Mordatch, and J. Tompson. Implicit behavioral cloning. Conference on Robot Learning (CoRL), 2021

  9. [9]

    Y. Ze, G. Zhang, K. Zhang, C. Hu, M. Wang, and H. Xu. 3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations. In Proceedings of Robotics: Science and Systems (RSS), 2024

  10. [10]

    V. Saxena, M. Bronars, N. R. Arachchige, K. Wang, W. C. Shin, S. Nasiriany, A. Mandlekar, and D. Xu. What matters in learning from large-scale datasets for robot manipulation. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=LqhorpRLIm.

  11. [11]

    O. X.-E. Collaboration, A. O’Neill, A. Rehman, A. Gupta, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain, A. Tung, A. Bewley, A. Herzog, A. Irpan, A. Khazatsky, A. Rai, A. Gupta, A. Wang, A. Kolobov, A. Singh, A. Garg, A. Kembhavi, A. Xie, A. Brohan, A. Raffin, A. Sharma, A. Yavary, A. Jain, A. Balakrishna, A. W...

  12. [12]

    DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y. Chen, K. Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945, 2024

  13. [13]

    S. Kareer, D. Patel, R. Punamiya, P. Mathur, S. Cheng, C. Wang, J. Hoffman, and D. Xu. Egomimic: Scaling imitation learning via egocentric video. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 13226–13233. IEEE, 2025

  14. [14]

    M. Xu, Z. Xu, Y. Xu, C. Chi, G. Wetzstein, M. Veloso, and S. Song. Flow as the cross-domain manipulation interface. arXiv preprint arXiv:2407.15208, 2024

  15. [15]

    H. Bharadhwaj, R. Mottaghi, A. Gupta, and S. Tulsiani. Track2act: Predicting point tracks from internet videos enables generalizable robot manipulation. In European Conference on Computer Vision, pages 306–324. Springer, 2024

  16. [16]

    EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video

    R. Hoque, P. Huang, D. J. Yoon, M. Sivapurapu, and J. Zhang. Egodex: Learning dexterous manipulation from large-scale egocentric video. arXiv preprint arXiv:2505.11709, 2025

  17. [17]

    H. Bharadhwaj, A. Gupta, V. Kumar, and S. Tulsiani. Towards generalizable zero-shot manipulation via translating human interaction plans, 2023.

  18. [18]

    N. Ravi, V. Gabeur, Y.-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. Rädle, C. Rolland, L. Gustafson, et al. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024

  19. [19]

    Y. Xiao, J. Wang, N. Xue, N. Karaev, Y. Makarov, B. Kang, X. Zhu, H. Bao, Y. Shen, and X. Zhou. Spatialtrackerv2: 3d point tracking made easy. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025. URL https://arxiv.org/abs/2507.12462

  20. [20]

    B. Wen, W. Yang, J. Kautz, and S. Birchfield. Foundationpose: Unified 6d pose estimation and tracking of novel objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17868–17879, 2024

  21. [21]

    C. Yuan, C. Wen, T. Zhang, and Y. Gao. General flow as foundation affordance for scalable robot learning. arXiv preprint arXiv:2401.11439, 2024

  22. [22]

    S. Noh, D. Nam, K. Kim, G. Lee, Y. Yu, R. Kang, and K. Lee. 3d flow diffusion policy: Visuomotor policy learning via generating flow in 3d space. arXiv preprint arXiv:2509.18676, 2025

  23. [23]

    V. Liu, A. Adeniji, H. Zhan, S. Haldar, R. Bhirangi, P. Abbeel, and L. Pinto. Egozero: Robot learning from smart glasses. arXiv preprint arXiv:2505.20290, 2025

  24. [24]

    J. Shi, Z. Zhao, T. Wang, I. Pedroza, A. Luo, J. Wang, J. Ma, and D. Jayaraman. Zeromimic: Distilling robotic manipulation skills from web videos. arXiv preprint arXiv:2503.23877, 2025

  25. [25]

    C. Wang, L. Fan, J. Sun, R. Zhang, L. Fei-Fei, D. Xu, Y. Zhu, and A. Anandkumar. Mimicplay: Long-horizon imitation learning by watching human play. arXiv preprint arXiv:2302.12422, 2023

  26. [26]

    R. Punamiya, D. Patel, P. Aphiwetsa, P. Kuppili, L. Y. Zhu, S. Kareer, J. Hoffman, and D. Xu. Egobridge: Domain adaptation for generalizable imitation from egocentric human data. In Human to Robot: Workshop on Sensorizing, Modeling, and Learning from Humans, 2025

  27. [27]

    S. Nair, A. Rajeswaran, V. Kumar, C. Finn, and A. Gupta. R3m: A universal visual representation for robot manipulation. arXiv preprint arXiv:2203.12601, 2022

  28. [28]

    I. Radosavovic, T. Xiao, S. James, P. Abbeel, J. Malik, and T. Darrell. Real-world robot learning with masked visual pre-training. In Conference on Robot Learning, pages 416–426. PMLR, 2023

  29. [29]

    Y. J. Ma, S. Sodhani, D. Jayaraman, O. Bastani, V. Kumar, and A. Zhang. Vip: Towards universal visual reward and representation via value-implicit pre-training. arXiv preprint arXiv:2210.00030, 2022

  30. [30]

    D. Xu, A. Mandlekar, R. Martín-Martín, Y. Zhu, S. Savarese, and L. Fei-Fei. Deep affordance foresight: Planning through what can be done in the future. In 2021 IEEE international conference on robotics and automation (ICRA), pages 6206–6213. IEEE, 2021

  31. [31]

    H.-S. Fang, C. Wang, H. Fang, M. Gou, J. Liu, H. Yan, W. Liu, Y. Xie, and C. Lu. Anygrasp: Robust and efficient grasp perception in spatial and temporal domains. IEEE Transactions on Robotics (T-RO), 2023

  32. [32]

    R. Wang, J. Zhang, J. Chen, Y. Xu, P. Li, T. Liu, and H. Wang. Dexgraspnet: A large-scale robotic dexterous grasp dataset for general objects based on simulation. arXiv preprint arXiv:2210.02697, 2022.

  33. [33]

    C. Eppner, A. Mousavian, and D. Fox. Acronym: A large-scale grasp dataset based on simulation. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 6222–6227. IEEE, 2021

  34. [34]

    S. Song, A. Zeng, J. Lee, and T. Funkhouser. Grasping in the wild: Learning 6dof closed-loop grasping from low-cost demonstrations. Robotics and Automation Letters, 2020

  35. [35]

    B. Eisner, H. Zhang, and D. Held. FlowBot3D: Learning 3D Articulation Flow to Manipulate Articulated Objects. In Proceedings of Robotics: Science and Systems, New York City, NY, USA, June 2022. doi:10.15607/RSS.2022.XVIII.018

  36. [36]

    C.-C. Hsu, B. Wen, J. Xu, Y. Narang, X. Wang, Y. Zhu, J. Biswas, and S. Birchfield. Spot: Se (3) pose trajectory diffusion for object-centric manipulation. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 4853–4860. IEEE, 2025

  37. [37]

    D. Seita, Y. Wang, S. J. Shetty, E. Y. Li, Z. Erickson, and D. Held. Toolflownet: Robotic manipulation with tools via predicting tool flow from point clouds. In Conference on Robot Learning, pages 1038–1049. PMLR, 2023

  38. [38]

    H. Zhang, B. Eisner, and D. Held. Flowbot++: Learning generalized articulated objects manipulation via articulation projection. arXiv preprint arXiv:2306.12893, 2023

  39. [39]

    Y. Li, W. H. Leng, Y. Fang, B. Eisner, and D. Held. Flowbothd: History-aware diffuser handling ambiguities in articulated objects manipulation. arXiv preprint arXiv:2410.07078, 2024

  40. [40]

    Z.-H. Yin, S. Yang, and P. Abbeel. Object-centric 3d motion field for robot learning from human videos. arXiv preprint arXiv:2506.04227, 2025

  41. [41]

    Y. Su, X. Zhan, H. Fang, Y.-L. Li, C. Lu, and L. Yang. Motion before action: Diffusing object motion as manipulation condition. IEEE Robotics and Automation Letters, 2025

  42. [42]

    H. Li, L. Sun, Y. Hu, D. Ta, J. Barry, G. Konidaris, and J. Fu. Novaflow: Zero-shot manipulation via actionable flow from generated videos. arXiv preprint arXiv:2510.08568, 2025

  43. [43]

    X. Dong, M. Johnson-Roberson, and W. Zhi. Joint flow trajectory optimization for feasible robot motion generation from video demonstrations. arXiv preprint arXiv:2509.20703, 2025

  44. [44]

    K. Dreczkowski, P. Vitiello, V. Vosylius, and E. Johns. Learning a thousand tasks in a day. Science Robotics, 10(108):eadv7594, 2025

  45. [45]

    L. P. Kaelbling and T. Lozano-Pérez. Hierarchical task and motion planning in the now. In 2011 IEEE international conference on robotics and automation, pages 1470–1477. IEEE, 2011

  46. [46]

    C. R. Garrett, R. Chitnis, R. Holladay, B. Kim, T. Silver, L. P. Kaelbling, and T. Lozano-Pérez. Integrated task and motion planning. Annual review of control, robotics, and autonomous systems, 4(1):265–293, 2021

  47. [47]

    C. R. Garrett, T. Lozano-Pérez, and L. P. Kaelbling. Pddlstream: Integrating symbolic planners and blackbox samplers via optimistic adaptive planning. In Proceedings of the international conference on automated planning and scheduling, volume 30, pages 440–448, 2020

  48. [48]

    S. Cheng and D. Xu. League: Guided skill learning and abstraction for long-horizon manipulation. IEEE Robotics and Automation Letters, 8(10):6451–6458, 2023

  49. [49]

    T. Silver, A. Athalye, J. B. Tenenbaum, T. Lozano-Pérez, and L. P. Kaelbling. Learning neuro-symbolic skills for bilevel planning. arXiv preprint arXiv:2206.10680, 2022

  50. [50]

    A. Mandlekar, C. Garrett, D. Xu, and D. Fox. Human-in-the-loop task and motion planning for imitation learning. In 7th Annual Conference on Robot Learning, 2023.

  51. [51]

    S. Cheng, C. Garrett, A. Mandlekar, and D. Xu. Nod-tamp: Generalizable long-horizon planning with neural object descriptors. arXiv preprint arXiv:2311.01530, 2023

  52. [52]

    A. Mandalika, S. Choudhury, O. Salzman, and S. Srinivasa. Generalized lazy search for robot motion planning: Interleaving search and edge evaluation via event-based toggles. In Proceedings of the International Conference on Automated Planning and Scheduling, volume 29, pages 745–753, 2019

  53. [53]

    R. Chitnis, D. Hadfield-Menell, A. Gupta, S. Srivastava, E. Groshev, C. Lin, and P. Abbeel. Guided search for task and motion plans using learned heuristics. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 447–454. IEEE, 2016

  54. [54]

    R. Chitnis, L. P. Kaelbling, and T. Lozano-Pérez. Learning quickly to plan quickly using modular meta-learning. In 2019 International Conference on Robotics and Automation (ICRA), pages 7865–7871. IEEE, 2019

  55. [55]

    Z. Wang, C. R. Garrett, L. P. Kaelbling, and T. Lozano-Pérez. Learning compositional models of robot skills for task and motion planning. The International Journal of Robotics Research, 40(6-7):866–894, 2021

  56. [56]

    J. Ortiz-Haro, J.-S. Ha, D. Driess, and M. Toussaint. Structured deep generative models for sampling on constraint manifolds in sequential manipulation. In Conference on Robot Learning, pages 213–223. PMLR, 2022

  57. [57]

    X. Fang, C. R. Garrett, C. Eppner, T. Lozano-Pérez, L. P. Kaelbling, and D. Fox. Dimsam: Diffusion models as samplers for task and motion planning under partial observability. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1412–1419. IEEE, 2024

  58. [58]

    N. Kumar, W. Shen, F. Ramos, D. Fox, T. Lozano-Pérez, L. P. Kaelbling, and C. R. Garrett. Open-world task and motion planning via vision-language model generated constraints. IEEE Robotics and Automation Letters, 2026

  59. [59]

    J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020

  60. [60]

    J. Song, C. Meng, and S. Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020

  61. [61]

    Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020

  62. [62]

    U. A. Mishra, S. Xue, Y. Chen, and D. Xu. Generative skill chaining: Long-horizon skill planning with diffusion models. In Conference on Robot Learning, pages 2905–2925. PMLR, 2023

  63. [63]

    U. A. Mishra, Y. Chen, and D. Xu. Generative factor chaining: Coordinated manipulation with diffusion-based factor graph. In ICRA 2024 Workshop - Back to the Future: Robot Learning Going Probabilistic, 2024

  64. [64]

    Y. Luo, U. A. Mishra, Y. Du, and D. Xu. Generative trajectory stitching through diffusion composition. arXiv preprint arXiv:2503.05153, 2025

  65. [65]

    Y. Luo, C. Sun, J. B. Tenenbaum, and Y. Du. Potential based diffusion motion planning. arXiv preprint arXiv:2407.06169, 2024

  66. [66]

    Q. Clark and F. Shkurti. What do you need for diverse trajectory stitching in diffusion planning? arXiv preprint arXiv:2505.18083, 2025.

  67. [67]

    J. Lawrence, J. Bernal, and C. Witzgall. A purely algebraic justification of the kabsch-umeyama algorithm. Journal of research of the National Institute of Standards and Technology, 124:1, 2019

  68. [68]

    M. A. Fischler and R. C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981

  69. [69]

    R. K. Guo, X. Lin, M. Liu, J. Gu, and H. Su. Mplib: a lightweight motion planning library,

  70. [70]

    URL https://github.com/haosulab/MPlib. Software available at https://motion-planning-lib.readthedocs.io/latest/

  71. [71]

    O. Khatib. A unified approach for motion and force control of robot manipulators: The operational space formulation. IEEE Journal on Robotics and Automation, 3(1):43–53, 1987

  72. [72]

    E. Todorov, T. Erez, and Y. Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–

  73. [73]
  74. [74]

    B. Wen, M. Trepte, J. Aribido, J. Kautz, O. Gallo, and S. Birchfield. Foundationstereo: Zero-shot stereo matching. CVPR, 2025

  75. [75]

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021

  76. [76]

    C. Eppner, A. Mousavian, and D. Fox. ACRONYM: A large-scale grasp dataset based on simulation. In 2021 IEEE Int. Conf. on Robotics and Automation, ICRA, 2020

  77. [77]

    A. Murali, B. Sundaralingam, Y.-W. Chao, J. Yamada, W. Yuan, M. Carlson, F. Ramos, S. Birchfield, D. Fox, and C. Eppner. Graspgen: A diffusion-based framework for 6-dof grasping with on-generator training. arXiv preprint arXiv:2507.13097, 2025. URL https://arxiv.org/abs/2507.13097

  78. [78]

    Z. Wei, Z. Xu, J. Guo, Y. Hou, C. Gao, Z. Cai, J. Luo, and L. Shao. D(R,O) Grasp: A unified representation of robot and object interaction for cross-embodiment dexterous grasping. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 4982–4988,

  79. [79]

    doi:10.1109/ICRA55743.2025.11127754

  80. [81]

    URL https://arxiv.org/abs/2406.09509. Supplementary Material. The supplementary material has the following contents: • Hardware Setup (Sec. A): Describes camera, perception, planning, and control details for the Franka and RBY-1 real-world platforms; • Tasks and Baselines (Sec. B): Details simulation and real-world task semantics, reset ranges, evaluation pr...