pith. machine review for the scientific record.

arxiv: 2409.16283 · v1 · submitted 2024-09-24 · 💻 cs.RO · cs.CV · cs.LG · eess.IV

Recognition: no theorem link

Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 12:13 UTC · model grok-4.3

classification 💻 cs.RO · cs.CV · cs.LG · eess.IV
keywords robot manipulation · video generation · generalization · zero-shot · human videos · policy conditioning · web data

The pith

Generating human videos from web data lets a single robot policy manipulate unseen objects and perform novel motions without fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that robot manipulation policies can generalize to novel tasks by generating videos of humans performing those tasks in new scenarios with a video model pre-trained on web data, then executing with a policy conditioned on the generated video. This sidesteps the high cost of collecting large robot datasets by leveraging abundant web videos for motion information. The approach trains the policy on far less robot data and requires no adaptation or fine-tuning of the video model itself. Real-world tests show the method succeeding on tasks and object types absent from the robot training set.

Core claim

Gen2Act casts language-conditioned manipulation as zero-shot human video generation followed by execution with a single policy conditioned on the generated video. The video generation model is used directly without fine-tuning, and the policy is trained on an order of magnitude less robot interaction data.
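
As a concreteness aid, here is a minimal sketch of that two-stage pipeline. It assumes a frozen, web-pretrained image-and-text-to-video generator and a separately trained video-conditioned policy; all names (run_gen2act_episode, video_model.generate, policy.predict, the env interface) are hypothetical and not the paper's code.

```python
# Minimal sketch of the two-stage pipeline described above (hypothetical
# interfaces, not the authors' implementation): generate a human video
# zero-shot, then execute it with a single video-conditioned robot policy.

def run_gen2act_episode(video_model, policy, env, task_description, horizon=100):
    # Stage 1: zero-shot human video generation from the current scene image
    # and the language instruction; the web-pretrained generator is used as-is.
    first_frame = env.get_rgb_observation()              # H x W x 3 scene image
    generated_video = video_model.generate(
        image=first_frame,
        prompt=f"a person {task_description}",
    )                                                    # T x H x W x 3 human video

    # Stage 2: closed-loop execution with a single policy conditioned on the
    # generated video and the robot's current observation at every step.
    for _ in range(horizon):
        observation = env.get_rgb_observation()
        action = policy.predict(video=generated_video, observation=observation)
        done = env.step(action)                          # assumed to return a done flag
        if done:
            break
    return env.task_success()
```

The property the core claim rests on is that nothing in stage 1 is adapted to the robot domain; all robot-specific learning lives in the policy.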

What carries the argument

Zero-shot generation of human videos from a pre-trained web model, used to condition a single robot policy that executes the depicted motions.

If this is right

  • A single policy can perform tasks absent from its robot training data by following motions in the generated videos.
  • Generalization to unseen object types occurs without new robot demonstrations for each case.
  • Motion information from web-scale video data transfers to robot control without domain-specific retraining.
  • Robot data collection requirements drop by an order of magnitude while still supporting novel scenarios.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If video generation quality increases, the same conditioning could support longer-horizon or multi-object tasks.
  • Web video data might act as a scalable substitute for robot experience in learning physical motions.
  • Real-time generation during execution could allow the policy to adapt to changes mid-task.

Load-bearing premise

That videos generated by a pre-trained model from web data provide sufficiently accurate and transferable motion information for a robot policy to execute novel tasks without any fine-tuning of the video model or additional domain adaptation.

What would settle it

An experiment showing that the generated videos depict motions the robot policy cannot physically replicate in the real world due to inaccuracies or domain gaps.

read the original abstract

How can robot manipulation policies generalize to novel tasks involving unseen object types and new motions? In this paper, we provide a solution in terms of predicting motion information from web data through human video generation and conditioning a robot policy on the generated video. Instead of attempting to scale robot data collection which is expensive, we show how we can leverage video generation models trained on easily available web data, for enabling generalization. Our approach Gen2Act casts language-conditioned manipulation as zero-shot human video generation followed by execution with a single policy conditioned on the generated video. To train the policy, we use an order of magnitude less robot interaction data compared to what the video prediction model was trained on. Gen2Act doesn't require fine-tuning the video model at all and we directly use a pre-trained model for generating human videos. Our results on diverse real-world scenarios show how Gen2Act enables manipulating unseen object types and performing novel motions for tasks not present in the robot data. Videos are at https://homangab.github.io/gen2act/

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Gen2Act, a framework that casts language-conditioned robot manipulation as zero-shot generation of human videos from a pre-trained web-scale model, followed by execution via a single policy conditioned on the generated video. The policy is trained on an order of magnitude less robot interaction data than the video model, requires no fine-tuning of the generator, and is claimed to enable manipulation of unseen object types and novel motions absent from the robot training distribution, with supporting real-world demonstrations.

Significance. If the transfer from generated videos to robot actions holds under rigorous evaluation, the work could meaningfully lower the barrier to generalizable manipulation by substituting abundant web video data for expensive robot data collection. The clean separation between an off-the-shelf video generator and a lightweight video-conditioned policy is a pragmatic design that avoids the computational cost of joint training or domain adaptation.

major comments (2)
  1. [Evaluation/Results] Evaluation section (and abstract): the central claim of generalization to novel tasks and objects rests on real-world success, yet no quantitative metrics (success rates, number of trials, variance, or statistical tests), baselines, or evaluation protocol details are supplied. This omission is load-bearing because the abstract-level assertion cannot be assessed for reliability or scope.
  2. [Method] Method section (policy conditioning and video-to-action transfer): the approach assumes that motion signals extracted from human-centric generated videos are directly executable by a robot gripper without any correction for embodiment mismatch, viewpoint, scale, or contact dynamics. No module, loss term, or preprocessing step is described to bridge this gap, leaving the transfer step unsupported for the claimed zero-shot generalization.

minor comments (2)
  1. [Abstract] Abstract: the statement that the policy uses 'an order of magnitude less robot interaction data' should be accompanied by explicit counts (e.g., hours or episodes) and a comparison to the video model's training scale.
  2. [Experiments] The video link is provided but the text contains no summary of what the supplementary videos demonstrate (e.g., specific failure modes or success conditions).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment point-by-point below and commit to revisions that strengthen the manuscript without altering its core claims.

read point-by-point responses
  1. Referee: [Evaluation/Results] Evaluation section (and abstract): the central claim of generalization to novel tasks and objects rests on real-world success, yet no quantitative metrics (success rates, number of trials, variance, or statistical tests), baselines, or evaluation protocol details are supplied. This omission is load-bearing because the abstract-level assertion cannot be assessed for reliability or scope.

    Authors: We agree that the absence of quantitative metrics limits the ability to rigorously assess the generalization claims. The current manuscript emphasizes qualitative real-world video demonstrations to highlight novel object and motion handling. In the revised version, we will expand the evaluation section to report success rates across a fixed number of trials per scenario (with variance), detail the evaluation protocol (including task definitions, success criteria, and trial counts), and include comparisons to relevant baselines where feasible. These additions will be placed in both the main text and abstract as appropriate. revision: yes

  2. Referee: [Method] Method section (policy conditioning and video-to-action transfer): the approach assumes that motion signals extracted from human-centric generated videos are directly executable by a robot gripper without any correction for embodiment mismatch, viewpoint, scale, or contact dynamics. No module, loss term, or preprocessing step is described to bridge this gap, leaving the transfer step unsupported for the claimed zero-shot generalization.

    Authors: The policy is trained end-to-end on paired robot video observations and actions, allowing it to learn direct mappings from visual motion cues to gripper commands. During inference, the generated human videos serve as the conditioning input, with the training distribution providing robustness to the viewpoint and scale variations present in the robot data. We acknowledge that the manuscript does not explicitly describe correction steps or losses for embodiment differences. In the revision, we will add a dedicated subsection detailing the data collection process, input preprocessing (e.g., frame resizing and normalization), and training objective to clarify how the transfer is achieved without additional modules. revision: yes
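
To make the transfer mechanism described in response 2 concrete, the following is an illustrative video-conditioned behavior-cloning update, assuming a shared frame encoder and a regression loss on end-effector actions. The architecture, loss, and all names (VideoConditionedPolicy, bc_step, the batch keys) are assumptions for exposition, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code) of the training setup described
# in response 2: a policy trained end-to-end on paired robot videos and
# actions, conditioned at test time on generated human videos instead.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoConditionedPolicy(nn.Module):
    def __init__(self, frame_dim=512, action_dim=7):
        super().__init__()
        # Shared per-frame encoder applied to both the conditioning video and
        # the robot's current observation (encoder details are hypothetical).
        self.frame_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, frame_dim),
        )
        self.head = nn.Sequential(
            nn.Linear(2 * frame_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),   # e.g. 6-DoF end-effector delta + gripper
        )

    def forward(self, video, observation):
        # video: (B, T, 3, H, W) conditioning clip; observation: (B, 3, H, W)
        B, T = video.shape[:2]
        video_feat = self.frame_encoder(video.flatten(0, 1)).view(B, T, -1).mean(dim=1)
        obs_feat = self.frame_encoder(observation)
        return self.head(torch.cat([video_feat, obs_feat], dim=-1))

def bc_step(policy, optimizer, batch):
    """One behavior-cloning update on a batch of (robot video, observation, action)."""
    pred = policy(batch["video"], batch["observation"])
    loss = F.mse_loss(pred, batch["action"])  # assumed regression objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At test time the same forward pass is conditioned on a generated human video rather than a robot clip, which is precisely the step the referee's embodiment-gap objection targets.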

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper describes a pipeline using a separately pre-trained video generation model on web data to produce human videos for novel tasks, followed by training an independent robot policy conditioned on those videos using an order of magnitude less robot data. No equations, derivations, or self-citations are shown that reduce any prediction or result to fitted parameters defined by the output itself, nor do any steps rely on self-definitional loops or imported uniqueness theorems. The components remain independent, with the video model untouched and the policy trained on distinct robot interaction data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the transferability of motion information from web-trained video models to robot execution without fine-tuning, plus the assumption that limited robot data suffices to train a policy that can interpret generated videos.

axioms (1)
  • domain assumption Pre-trained video generation models produce videos whose depicted motions are sufficiently accurate and robot-executable for novel tasks.
    The method uses the video model zero-shot and assumes its outputs directly support policy conditioning without adaptation.

pith-pipeline@v0.9.0 · 5525 in / 1181 out tokens · 37517 ms · 2026-05-15T12:13:10.717237+00:00 · methodology

discussion (0)


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL

    cs.CV 2026-05 conditional novelty 7.0

    CreFlow combines LTL compositional rewards with credit-aware NFT and corrective reflow losses in online RL to improve embodied video diffusion models, raising downstream task success by 23.8 percentage points on eight...

  2. Being-H0.7: A Latent World-Action Model from Egocentric Videos

    cs.RO 2026-04 unverdicted novelty 7.0

    Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

  3. RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 7.0

    RoboWM-Bench evaluates video world models by converting their manipulation video predictions into executable actions validated in simulation, showing that visual plausibility does not guarantee physical executability.

  4. RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 7.0

    RoboWM-Bench evaluates video world models by converting their outputs into executable robot actions and running them on manipulation tasks, showing that physical inconsistencies remain common.

  5. ViVa: A Video-Generative Value Model for Robot Reinforcement Learning

    cs.RO 2026-04 unverdicted novelty 7.0

    ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.

  6. HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    HarmoWAM unifies predictive and reactive control in world action models via an adaptive gating mechanism to deliver improved zero-shot generalization and precision in robotic manipulation.

  7. SocialDirector: Training-Free Social Interaction Control for Multi-Person Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    SocialDirector uses spatiotemporal actor masking and directional reweighting on cross-attention maps to reduce actor-action mismatches and improve target-directed interactions in generated multi-person videos.

  8. GazeVLA: Learning Human Intention for Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.

  9. Hi-WM: Human-in-the-World-Model for Scalable Robot Post-Training

    cs.RO 2026-04 unverdicted novelty 6.0

    Hi-WM uses human interventions inside an action-conditioned world model with rollback and branching to generate dense corrective data, raising real-world success by 37.9 points on average across three manipulation tasks.

  10. WARPED: Wrist-Aligned Rendering for Robot Policy Learning from Egocentric Human Demonstrations

    cs.RO 2026-04 unverdicted novelty 6.0

    WARPED synthesizes realistic wrist-view observations from monocular egocentric human videos via foundation models, hand-object tracking, retargeting, and Gaussian Splatting to train visuomotor policies that match tele...

  11. Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model

    cs.RO 2026-04 conditional novelty 6.0

    MV-VDP jointly predicts multi-view RGB and heatmap videos via diffusion to achieve data-efficient, robust robotic manipulation policies.

  12. Fast-WAM: Do World Action Models Need Test-time Future Imagination?

    cs.CV 2026-03 unverdicted novelty 6.0

    Fast-WAM shows that explicit future imagination at test time is not required for strong WAM performance; video modeling during training provides the main benefit.

  13. World Action Models are Zero-shot Policies

    cs.RO 2026-02 unverdicted novelty 6.0

    DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...

  14. GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    cs.RO 2025-03 unverdicted novelty 6.0

    GR00T N1 is a new open VLA foundation model for humanoid robots that outperforms imitation learning baselines in simulation and shows strong performance on real-world bimanual manipulation tasks.

  15. Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

    cs.CV 2024-12 unverdicted novelty 6.0

    Video Prediction Policy conditions robot action learning on future-frame predictions inside fine-tuned video diffusion models, yielding 18.6% relative gains on Calvin ABC-D and 31.6% higher real-world success rates.

  16. CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

    cs.RO 2024-11 unverdicted novelty 6.0

    CogACT is a new VLA model that uses a conditioned diffusion action transformer to achieve over 35% higher average success rates than OpenVLA in simulation and 55% in real-robot experiments while generalizing to new ro...

  17. StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement

    cs.RO 2026-04 unverdicted novelty 5.0

    StableIDM stabilizes inverse dynamics models under manipulator truncation by combining robot-centric masking, directional spatial feature aggregation, and temporal dynamics refinement, yielding 12.1% higher strict act...

  18. From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data

    cs.RO 2026-04 accept novelty 5.0

    A survey introduces an interface-centric taxonomy for video-to-control methods in robotic manipulation and identifies the robotics integration layer as the central open challenge.

  19. Motus: A Unified Latent Action World Model

    cs.CV 2025-12 unverdicted novelty 5.0

    Motus unifies understanding, video generation, and action in one latent world model via MoT experts and optical-flow latent actions, reporting gains over prior methods in simulation and real robots.

  20. World Action Models: The Next Frontier in Embodied AI

    cs.RO 2026-05 unverdicted novelty 4.0

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
