pith. machine review for the scientific record.

arxiv: 2409.16283 · v1 · submitted 2024-09-24 · 💻 cs.RO · cs.CV · cs.LG · eess.IV

Recognition: no theorem link

Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 12:13 UTC · model grok-4.3

classification 💻 cs.RO · cs.CV · cs.LG · eess.IV
keywords robot manipulation · video generation · generalization · zero-shot · human videos · policy conditioning · web data

The pith

Generating human videos from web data lets a single robot policy manipulate unseen objects and perform novel motions without fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that robot manipulation policies can generalize to novel tasks by generating videos of humans performing those tasks in new scenarios with a video model pre-trained on web data, then executing with a policy conditioned on the generated video. This sidesteps the high cost of collecting large robot datasets by leveraging abundant web videos for motion information. The approach trains the policy on far less robot data and requires no adaptation or fine-tuning of the video model itself. Real-world tests show the method succeeding on tasks and object types absent from the robot training set.

Core claim

Gen2Act casts language-conditioned manipulation as zero-shot human video generation followed by execution with a single policy conditioned on the generated video. The video generation model is used directly without fine-tuning, and the policy is trained on an order of magnitude less robot interaction data.
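
As a concreteness aid, here is a minimal sketch of that two-stage pipeline. It assumes a frozen, web-pretrained image-and-text-to-video generator and a separately trained video-conditioned policy; all names (run_gen2act_episode, video_model.generate, policy.predict, the env interface) are hypothetical and not the paper's code.

```python
# Minimal sketch of the two-stage pipeline described above (hypothetical
# interfaces, not the authors' implementation): generate a human video
# zero-shot, then execute it with a single video-conditioned robot policy.

def run_gen2act_episode(video_model, policy, env, task_description, horizon=100):
    # Stage 1: zero-shot human video generation from the current scene image
    # and the language instruction; the web-pretrained generator is used as-is.
    first_frame = env.get_rgb_observation()              # H x W x 3 scene image
    generated_video = video_model.generate(
        image=first_frame,
        prompt=f"a person {task_description}",
    )                                                    # T x H x W x 3 human video

    # Stage 2: closed-loop execution with a single policy conditioned on the
    # generated video and the robot's current observation at every step.
    for _ in range(horizon):
        observation = env.get_rgb_observation()
        action = policy.predict(video=generated_video, observation=observation)
        done = env.step(action)                          # assumed to return a done flag
        if done:
            break
    return env.task_success()
```

The property the core claim rests on is that nothing in stage 1 is adapted to the robot domain; all robot-specific learning lives in the policy.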

What carries the argument

Zero-shot generation of human videos from a pre-trained web model, used to condition a single robot policy that executes the depicted motions.

If this is right

  • A single policy can perform tasks absent from its robot training data by following motions in the generated videos.
  • Generalization to unseen object types occurs without new robot demonstrations for each case.
  • Motion information from web-scale video data transfers to robot control without domain-specific retraining.
  • Robot data collection requirements drop by an order of magnitude while still supporting novel scenarios.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If video generation quality increases, the same conditioning could support longer-horizon or multi-object tasks.
  • Web video data might act as a scalable substitute for robot experience in learning physical motions.
  • Real-time generation during execution could allow the policy to adapt to changes mid-task.

Load-bearing premise

That videos generated by a pre-trained model from web data provide sufficiently accurate and transferable motion information for a robot policy to execute novel tasks without any fine-tuning of the video model or additional domain adaptation.

What would settle it

An experiment showing that the generated videos depict motions the robot policy cannot physically replicate in the real world due to inaccuracies or domain gaps.

read the original abstract

How can robot manipulation policies generalize to novel tasks involving unseen object types and new motions? In this paper, we provide a solution in terms of predicting motion information from web data through human video generation and conditioning a robot policy on the generated video. Instead of attempting to scale robot data collection which is expensive, we show how we can leverage video generation models trained on easily available web data, for enabling generalization. Our approach Gen2Act casts language-conditioned manipulation as zero-shot human video generation followed by execution with a single policy conditioned on the generated video. To train the policy, we use an order of magnitude less robot interaction data compared to what the video prediction model was trained on. Gen2Act doesn't require fine-tuning the video model at all and we directly use a pre-trained model for generating human videos. Our results on diverse real-world scenarios show how Gen2Act enables manipulating unseen object types and performing novel motions for tasks not present in the robot data. Videos are at https://homangab.github.io/gen2act/

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Gen2Act, a framework that casts language-conditioned robot manipulation as zero-shot generation of human videos from a pre-trained web-scale model, followed by execution via a single policy conditioned on the generated video. The policy is trained on an order of magnitude less robot interaction data than the video model, requires no fine-tuning of the generator, and is claimed to enable manipulation of unseen object types and novel motions absent from the robot training distribution, with supporting real-world demonstrations.

Significance. If the transfer from generated videos to robot actions holds under rigorous evaluation, the work could meaningfully lower the barrier to generalizable manipulation by substituting abundant web video data for expensive robot data collection. The clean separation between an off-the-shelf video generator and a lightweight video-conditioned policy is a pragmatic design that avoids the computational cost of joint training or domain adaptation.

major comments (2)
  1. [Evaluation/Results] Evaluation section (and abstract): the central claim of generalization to novel tasks and objects rests on real-world success, yet no quantitative metrics (success rates, number of trials, variance, or statistical tests), baselines, or evaluation protocol details are supplied. This omission is load-bearing because the abstract-level assertion cannot be assessed for reliability or scope.
  2. [Method] Method section (policy conditioning and video-to-action transfer): the approach assumes that motion signals extracted from human-centric generated videos are directly executable by a robot gripper without any correction for embodiment mismatch, viewpoint, scale, or contact dynamics. No module, loss term, or preprocessing step is described to bridge this gap, leaving the transfer step unsupported for the claimed zero-shot generalization.

minor comments (2)
  1. [Abstract] Abstract: the statement that the policy uses 'an order of magnitude less robot interaction data' should be accompanied by explicit counts (e.g., hours or episodes) and a comparison to the video model's training scale.
  2. [Experiments] The video link is provided but the text contains no summary of what the supplementary videos demonstrate (e.g., specific failure modes or success conditions).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment point-by-point below and commit to revisions that strengthen the manuscript without altering its core claims.

read point-by-point responses
  1. Referee: [Evaluation/Results] Evaluation section (and abstract): the central claim of generalization to novel tasks and objects rests on real-world success, yet no quantitative metrics (success rates, number of trials, variance, or statistical tests), baselines, or evaluation protocol details are supplied. This omission is load-bearing because the abstract-level assertion cannot be assessed for reliability or scope.

    Authors: We agree that the absence of quantitative metrics limits the ability to rigorously assess the generalization claims. The current manuscript emphasizes qualitative real-world video demonstrations to highlight novel object and motion handling. In the revised version, we will expand the evaluation section to report success rates across a fixed number of trials per scenario (with variance), detail the evaluation protocol (including task definitions, success criteria, and trial counts), and include comparisons to relevant baselines where feasible. These additions will be placed in both the main text and abstract as appropriate. revision: yes

  2. Referee: [Method] Method section (policy conditioning and video-to-action transfer): the approach assumes that motion signals extracted from human-centric generated videos are directly executable by a robot gripper without any correction for embodiment mismatch, viewpoint, scale, or contact dynamics. No module, loss term, or preprocessing step is described to bridge this gap, leaving the transfer step unsupported for the claimed zero-shot generalization.

    Authors: The policy is trained end-to-end on paired robot video observations and actions, allowing it to learn direct mappings from visual motion cues to gripper commands. During inference, the generated human videos serve as the conditioning input, with the training distribution providing robustness to the viewpoint and scale variations present in the robot data. We acknowledge that the manuscript does not explicitly describe correction steps or losses for embodiment differences. In the revision, we will add a dedicated subsection detailing the data collection process, input preprocessing (e.g., frame resizing and normalization), and training objective to clarify how the transfer is achieved without additional modules. revision: yes
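
To make the transfer mechanism described in response 2 concrete, the following is an illustrative video-conditioned behavior-cloning update, assuming a shared frame encoder and a regression loss on end-effector actions. The architecture, loss, and all names (VideoConditionedPolicy, bc_step, the batch keys) are assumptions for exposition, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code) of the training setup described
# in response 2: a policy trained end-to-end on paired robot videos and
# actions, conditioned at test time on generated human videos instead.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoConditionedPolicy(nn.Module):
    def __init__(self, frame_dim=512, action_dim=7):
        super().__init__()
        # Shared per-frame encoder applied to both the conditioning video and
        # the robot's current observation (encoder details are hypothetical).
        self.frame_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, frame_dim),
        )
        self.head = nn.Sequential(
            nn.Linear(2 * frame_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),   # e.g. 6-DoF end-effector delta + gripper
        )

    def forward(self, video, observation):
        # video: (B, T, 3, H, W) conditioning clip; observation: (B, 3, H, W)
        B, T = video.shape[:2]
        video_feat = self.frame_encoder(video.flatten(0, 1)).view(B, T, -1).mean(dim=1)
        obs_feat = self.frame_encoder(observation)
        return self.head(torch.cat([video_feat, obs_feat], dim=-1))

def bc_step(policy, optimizer, batch):
    """One behavior-cloning update on a batch of (robot video, observation, action)."""
    pred = policy(batch["video"], batch["observation"])
    loss = F.mse_loss(pred, batch["action"])  # assumed regression objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At test time the same forward pass is conditioned on a generated human video rather than a robot clip, which is precisely the step the referee's embodiment-gap objection targets.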

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper describes a pipeline using a separately pre-trained video generation model on web data to produce human videos for novel tasks, followed by training an independent robot policy conditioned on those videos using an order of magnitude less robot data. No equations, derivations, or self-citations are shown that reduce any prediction or result to fitted parameters defined by the output itself, nor do any steps rely on self-definitional loops or imported uniqueness theorems. The components remain independent, with the video model untouched and the policy trained on distinct robot interaction data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the transferability of motion information from web-trained video models to robot execution without fine-tuning, plus the assumption that limited robot data suffices to train a policy that can interpret generated videos.

axioms (1)
  • domain assumption Pre-trained video generation models produce videos whose depicted motions are sufficiently accurate and robot-executable for novel tasks.
    The method uses the video model zero-shot and assumes its outputs directly support policy conditioning without adaptation.

pith-pipeline@v0.9.0 · 5525 in / 1181 out tokens · 37517 ms · 2026-05-15T12:13:10.717237+00:00 · methodology

discussion (0)


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL

    cs.CV 2026-05 conditional novelty 7.0

    CreFlow combines LTL compositional rewards with credit-aware NFT and corrective reflow losses in online RL to improve embodied video diffusion models, raising downstream task success by 23.8 percentage points on eight...

  2. Being-H0.7: A Latent World-Action Model from Egocentric Videos

    cs.RO 2026-04 unverdicted novelty 7.0

    Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

  3. RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 7.0

    RoboWM-Bench evaluates video world models by converting their manipulation video predictions into executable actions validated in simulation, showing that visual plausibility does not guarantee physical executability.

  4. RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 7.0

    RoboWM-Bench evaluates video world models by converting their outputs into executable robot actions and running them on manipulation tasks, showing that physical inconsistencies remain common.

  5. ViVa: A Video-Generative Value Model for Robot Reinforcement Learning

    cs.RO 2026-04 unverdicted novelty 7.0

    ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.

  6. HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    HarmoWAM unifies predictive and reactive control in world action models via an adaptive gating mechanism to deliver improved zero-shot generalization and precision in robotic manipulation.

  7. SocialDirector: Training-Free Social Interaction Control for Multi-Person Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    SocialDirector uses spatiotemporal actor masking and directional reweighting on cross-attention maps to reduce actor-action mismatches and improve target-directed interactions in generated multi-person videos.

  8. GazeVLA: Learning Human Intention for Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.

  9. Hi-WM: Human-in-the-World-Model for Scalable Robot Post-Training

    cs.RO 2026-04 unverdicted novelty 6.0

    Hi-WM uses human interventions inside an action-conditioned world model with rollback and branching to generate dense corrective data, raising real-world success by 37.9 points on average across three manipulation tasks.

  10. WARPED: Wrist-Aligned Rendering for Robot Policy Learning from Egocentric Human Demonstrations

    cs.RO 2026-04 unverdicted novelty 6.0

    WARPED synthesizes realistic wrist-view observations from monocular egocentric human videos via foundation models, hand-object tracking, retargeting, and Gaussian Splatting to train visuomotor policies that match tele...

  11. Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model

    cs.RO 2026-04 conditional novelty 6.0

    MV-VDP jointly predicts multi-view RGB and heatmap videos via diffusion to achieve data-efficient, robust robotic manipulation policies.

  12. Fast-WAM: Do World Action Models Need Test-time Future Imagination?

    cs.CV 2026-03 unverdicted novelty 6.0

    Fast-WAM shows that explicit future imagination at test time is not required for strong WAM performance; video modeling during training provides the main benefit.

  13. World Action Models are Zero-shot Policies

    cs.RO 2026-02 unverdicted novelty 6.0

    DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...

  14. GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    cs.RO 2025-03 unverdicted novelty 6.0

    GR00T N1 is a new open VLA foundation model for humanoid robots that outperforms imitation learning baselines in simulation and shows strong performance on real-world bimanual manipulation tasks.

  15. Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

    cs.CV 2024-12 unverdicted novelty 6.0

    Video Prediction Policy conditions robot action learning on future-frame predictions inside fine-tuned video diffusion models, yielding 18.6% relative gains on Calvin ABC-D and 31.6% higher real-world success rates.

  16. CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

    cs.RO 2024-11 unverdicted novelty 6.0

    CogACT is a new VLA model that uses a conditioned diffusion action transformer to achieve over 35% higher average success rates than OpenVLA in simulation and 55% in real-robot experiments while generalizing to new ro...

  17. StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement

    cs.RO 2026-04 unverdicted novelty 5.0

    StableIDM stabilizes inverse dynamics models under manipulator truncation by combining robot-centric masking, directional spatial feature aggregation, and temporal dynamics refinement, yielding 12.1% higher strict act...

  18. From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data

    cs.RO 2026-04 accept novelty 5.0

    A survey introduces an interface-centric taxonomy for video-to-control methods in robotic manipulation and identifies the robotics integration layer as the central open challenge.

  19. Motus: A Unified Latent Action World Model

    cs.CV 2025-12 unverdicted novelty 5.0

    Motus unifies understanding, video generation, and action in one latent world model via MoT experts and optical-flow latent actions, reporting gains over prior methods in simulation and real robots.

  20. World Action Models: The Next Frontier in Embodied AI

    cs.RO 2026-05 unverdicted novelty 4.0

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
