Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations

Patel S, Mohan S, Mai H, Jain U, Lazebnik S, Li Y · 2025 · cs.RO · arXiv 2507.00990

14 Pith papers cite this work. Polarity classification is still indexing.

abstract

This work introduces Robots Imitating Generated Videos (RIGVid), a system that enables robots to perform complex manipulation tasks--such as pouring, wiping, and mixing--purely by imitating AI-generated videos, without requiring any physical demonstrations or robot-specific training. Given a language command and an initial scene image, a video diffusion model generates potential demonstration videos, and a vision-language model (VLM) automatically filters out results that do not follow the command. A 6D pose tracker then extracts object trajectories from the video, and the trajectories are retargeted to the robot in an embodiment-agnostic fashion. Through extensive real-world evaluations, we show that filtered generated videos are as effective as real demonstrations, and that performance improves with generation quality. We also show that relying on generated videos outperforms more compact alternatives such as keypoint prediction using VLMs, and that strong 6D pose tracking outperforms other ways to extract trajectories, such as dense feature point tracking. These findings suggest that videos produced by a state-of-the-art off-the-shelf model can offer an effective source of supervision for robotic manipulation.
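
The retargeting step described in the abstract, turning a tracked 6D object trajectory into robot end-effector targets, can be illustrated with a small sketch. This is not the authors' implementation. It assumes a rigid grasp, so the end-effector pose in the object's frame stays fixed after grasp time, and it represents poses as 4x4 homogeneous transforms in a shared world frame. The function names `make_pose` and `retarget_object_trajectory` are hypothetical and chosen for illustration only.

```python
import numpy as np


def make_pose(rotation_z_rad, translation):
    """Build a 4x4 homogeneous transform: rotation about z plus a translation."""
    c, s = np.cos(rotation_z_rad), np.sin(rotation_z_rad)
    pose = np.eye(4)
    pose[:3, :3] = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    pose[:3, 3] = translation
    return pose


def retarget_object_trajectory(object_poses, ee_pose_at_grasp):
    """Map object poses (world frame) to end-effector poses (world frame).

    Assumption (not from the paper): the grasp is rigid, so the
    end-effector pose expressed in the object frame is constant and
    equals its value at the first trajectory step.
    """
    # End-effector pose relative to the object at grasp time (index 0).
    ee_in_object = np.linalg.inv(object_poses[0]) @ ee_pose_at_grasp
    # Carry that fixed offset along every tracked object pose.
    return [obj_pose @ ee_in_object for obj_pose in object_poses]


# Illustrative trajectory: the object starts at the origin, then is
# rotated 90 degrees about z and moved 1 m along x.
object_poses = [
    make_pose(0.0, [0.0, 0.0, 0.0]),
    make_pose(np.pi / 2, [1.0, 0.0, 0.0]),
]
# Gripper grasps the object 10 cm along the object's x axis.
ee_pose_at_grasp = make_pose(0.0, [0.1, 0.0, 0.0])

ee_targets = retarget_object_trajectory(object_poses, ee_pose_at_grasp)
```

Because the computation only involves the object's motion and one fixed grasp offset, nothing in it depends on a particular arm, which is one way to read "embodiment-agnostic" in the abstract. Each target pose would still have to be passed to the robot's own inverse kinematics or motion planner.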

citation-role summary

background 4

citation-polarity summary

background 4

representative citing papers

EgoEngine: From Egocentric Human Videos to High-Fidelity Dexterous Robot Demonstrations

cs.RO · 2026-06-10 · unverdicted · novelty 7.0

EgoEngine transforms egocentric human videos into high-fidelity robot data enabling zero-shot visuomotor dexterous policy learning without real-robot demonstrations.

PlayWorld: Learning Robot World Models from Autonomous Play

cs.RO · 2026-03-09 · unverdicted · novelty 7.0

PlayWorld learns high-fidelity robot world models from unsupervised self-play, producing physically consistent video predictions that outperform models trained on human data and enabling 65% better real-world policy performance via model-based RL.

Wh0: Generative World Models as Scalable Sources of Egocentric Human Hand Manipulation Data

cs.RO · 2026-06-20 · unverdicted · novelty 6.0

Wh0 generates scalable egocentric human manipulation videos with world models and converts them to boost pretrained VLA models' zero-shot dexterous task success from 8.3% to 38.9% on 18 real-world tasks.

Vesta: A Generalist Embodied Reasoning Model

cs.RO · 2026-06-18 · unverdicted · novelty 6.0

Vesta is a unified embodied generalist model that outperforms specialist baselines by over 20% on average and improves real-world robotic task success by over 35%.

GenHOI: Contact-Aware Humanoid-Object Interaction by Imitating Generated Videos without Task-Specific Training

cs.RO · 2026-06-11 · conditional · novelty 6.0

A humanoid robot can carry out diverse manipulation tasks in a zero-shot way by imitating one AI-generated video, using contact-aware trajectory optimization instead of task-specific policy training.

Imagine2Real: Towards Zero-shot Humanoid-Object Interaction via Video Generative Priors

cs.RO · 2026-05-21 · unverdicted · novelty 6.0 · 2 refs

Imagine2Real enables zero-shot humanoid-object interaction by unifying motions as 4D point trajectories, tracking only base/hands/object keypoints inside a BFM latent space, and training with progressive simple rewards for mocap deployment.

AnchorD: Metric Grounding of Monocular Depth Using Factor Graphs

cs.RO · 2026-05-04 · unverdicted · novelty 6.0

AnchorD anchors monocular depth priors in metric sensor data via patch-wise affine alignment using factor graph optimization, improving accuracy on non-Lambertian objects and introducing a new benchmark dataset with dense ground truth.

From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data

cs.RO · 2026-04-04 · accept · novelty 6.0

Video-to-robot control methods cluster into three interface families, and the field’s main bottleneck is grounding video-derived predictions into dependable closed-loop robot behavior.

HumanEgo: Zero-Shot Robot Learning from Minutes of Human Egocentric Videos

cs.RO · 2026-05-24 · unverdicted · novelty 5.0

HumanEgo reports 92.5% average success on four real robot tasks using only 15-30 minutes of human video per task and zero robot data, with zero-shot transfer to new robots and cameras.

Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms

eess.IV · 2026-03-30 · conditional · novelty 5.0

Examines video generation models as world models, with a focus on efficient paradigms, architectures, and algorithms.

Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey

cs.RO · 2025-08-18 · unverdicted · novelty 5.0

This survey organizes large VLM-based VLA models for robotic manipulation into monolithic and hierarchical paradigms, reviews their integrations and datasets, and outlines future directions.

World Action Models: The Next Frontier in Embodied AI

cs.RO · 2026-05-12 · unverdicted · novelty 4.0

The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

From World Models to World Action Models: A Concise Tutorial for Robotics

cs.RO · 2026-07-01 · conditional · novelty 3.0 · 2 refs

A tutorial defining world models and world action models for robotics, with design axes and a four-paradigm taxonomy of prediction-action coupling.

World Action Models: A Survey

cs.RO · 2026-06-18 · unverdicted · novelty 3.0

A survey that clarifies boundaries and organizes World Action Models by generation requirements and predictive substrates, identifying a trend toward generating less of the future.

citing papers explorer

Showing 14 of 14 citing papers.

EgoEngine: From Egocentric Human Videos to High-Fidelity Dexterous Robot Demonstrations cs.RO · 2026-06-10 · unverdicted · none · ref 44 · internal anchor
EgoEngine transforms egocentric human videos into high-fidelity robot data enabling zero-shot visuomotor dexterous policy learning without real-robot demonstrations.
PlayWorld: Learning Robot World Models from Autonomous Play cs.RO · 2026-03-09 · unverdicted · none · ref 17 · internal anchor
PlayWorld learns high-fidelity robot world models from unsupervised self-play, producing physically consistent video predictions that outperform models trained on human data and enabling 65% better real-world policy performance via model-based RL.
Wh0: Generative World Models as Scalable Sources of Egocentric Human Hand Manipulation Data cs.RO · 2026-06-20 · unverdicted · none · ref 53 · internal anchor
Wh0 generates scalable egocentric human manipulation videos with world models and converts them to boost pretrained VLA models' zero-shot dexterous task success from 8.3% to 38.9% on 18 real-world tasks.
Vesta: A Generalist Embodied Reasoning Model cs.RO · 2026-06-18 · unverdicted · none · ref 90 · internal anchor
Vesta is a unified embodied generalist model that outperforms specialist baselines by over 20% on average and improves real-world robotic task success by over 35%.
GenHOI: Contact-Aware Humanoid-Object Interaction by Imitating Generated Videos without Task-Specific Training cs.RO · 2026-06-11 · conditional · none · ref 13 · internal anchor
A humanoid robot can carry out diverse manipulation tasks in a zero-shot way by imitating one AI-generated video, using contact-aware trajectory optimization instead of task-specific policy training.
Imagine2Real: Towards Zero-shot Humanoid-Object Interaction via Video Generative Priors cs.RO · 2026-05-21 · unverdicted · none · ref 58 · 2 links · internal anchor
Imagine2Real enables zero-shot humanoid-object interaction by unifying motions as 4D point trajectories, tracking only base/hands/object keypoints inside a BFM latent space, and training with progressive simple rewards for mocap deployment.
AnchorD: Metric Grounding of Monocular Depth Using Factor Graphs cs.RO · 2026-05-04 · unverdicted · none · ref 12
AnchorD anchors monocular depth priors in metric sensor data via patch-wise affine alignment using factor graph optimization, improving accuracy on non-Lambertian objects and introducing a new benchmark dataset with dense ground truth.
From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data cs.RO · 2026-04-04 · accept · none · ref 74
Video-to-robot control methods cluster into three interface families, and the field’s main bottleneck is grounding video-derived predictions into dependable closed-loop robot behavior.
HumanEgo: Zero-Shot Robot Learning from Minutes of Human Egocentric Videos cs.RO · 2026-05-24 · unverdicted · none · ref 31 · internal anchor
HumanEgo reports 92.5% average success on four real robot tasks using only 15-30 minutes of human video per task and zero robot data, with zero-shot transfer to new robots and cameras.
Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms eess.IV · 2026-03-30 · conditional · none · ref 202 · internal anchor
Examines video generation models as world models, with a focus on efficient paradigms, architectures, and algorithms.
Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey cs.RO · 2025-08-18 · unverdicted · none · ref 199 · internal anchor
This survey organizes large VLM-based VLA models for robotic manipulation into monolithic and hierarchical paradigms, reviews their integrations and datasets, and outlines future directions.
World Action Models: The Next Frontier in Embodied AI cs.RO · 2026-05-12 · unverdicted · none · ref 81
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
From World Models to World Action Models: A Concise Tutorial for Robotics cs.RO · 2026-07-01 · conditional · none · ref 18 · 2 links · internal anchor
A tutorial defining world models and world action models for robotics, with design axes and a four-paradigm taxonomy of prediction-action coupling.
World Action Models: A Survey cs.RO · 2026-06-18 · unverdicted · none · ref 133 · internal anchor
A survey that clarifies boundaries and organizes World Action Models by generation requirements and predictive substrates, identifying a trend toward generating less of the future.
