Visual Foresight: Model-Based Deep Reinforcement Learning for Vision-Based Robotic Control
Deep reinforcement learning (RL) algorithms can learn complex robotic skills from raw sensory inputs, but have yet to achieve the kind of broad generalization and applicability demonstrated by deep learning methods in supervised domains. We present a deep RL method that is practical for real-world robotics tasks, such as robotic manipulation, and generalizes effectively to never-before-seen tasks and objects. In these settings, ground truth reward signals are typically unavailable, and we therefore propose a self-supervised model-based approach, where a predictive model learns to directly predict the future from raw sensory readings, such as camera images. At test time, we explore three distinct goal specification methods: designated pixels, where a user specifies desired object manipulation tasks by selecting particular pixels in an image and corresponding goal positions; goal images, where the desired goal state is specified with an image; and image classifiers, which define spaces of goal states. Our deep predictive models are trained using data collected autonomously and continuously by a robot interacting with hundreds of objects, without human supervision. We demonstrate that visual model-predictive control (MPC) can generalize to never-before-seen objects, both rigid and deformable, and solve a range of user-defined object manipulation tasks using the same model.
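The planning loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a designated-pixel goal, replaces the learned video-prediction model with a hypothetical toy function (`toy_model`) that simply shifts the tracked pixel by the action, and uses plain random-shooting as the optimizer, since the abstract does not name one. All function names and parameters here are illustrative.

```python
import random


def distance(a, b):
    """Euclidean distance between two 2D pixel positions."""
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5


def toy_model(pixel, action):
    """Hypothetical stand-in for a learned predictive model.

    A real system would predict future camera images; here the
    designated pixel is assumed to move by exactly the action.
    """
    return (pixel[0] + action[0], pixel[1] + action[1])


def rollout_cost(model, start, goal, actions):
    """Sum of predicted pixel-to-goal distances over the horizon."""
    pos, total = start, 0.0
    for action in actions:
        pos = model(pos, action)
        total += distance(pos, goal)
    return total


def plan(model, start, goal, horizon=5, samples=200, rng=None):
    """Random-shooting planner: sample action sequences, keep the cheapest."""
    rng = rng or random.Random(0)
    best_actions, best_cost = None, float("inf")
    for _ in range(samples):
        actions = [(rng.uniform(-1, 1), rng.uniform(-1, 1))
                   for _ in range(horizon)]
        cost = rollout_cost(model, start, goal, actions)
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions, best_cost


def run_mpc(model, start, goal, steps=20, seed=0):
    """Closed-loop MPC: execute only the first planned action, then replan."""
    rng = random.Random(seed)
    pos = start
    for _ in range(steps):
        actions, _ = plan(model, pos, goal, rng=rng)
        pos = model(pos, actions[0])  # the toy model doubles as the "real" world
    return pos


if __name__ == "__main__":
    start, goal = (0.0, 0.0), (3.0, 4.0)
    final = run_mpc(toy_model, start, goal)
    print("start distance:", distance(start, goal))
    print("final distance:", distance(final, goal))
```

Replanning after every executed action is what makes this model-predictive control: prediction errors from one step can be corrected at the next. Under the same assumptions, a goal-image or classifier-based objective would only change `rollout_cost`.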
Forward citations
Cited by 35 Pith papers
-
Pondering the Way: Spatial-perceiving World Action Model for Embodied Navigation
SWAM jointly generates intermediate RGB-D sequences and action trajectories from monocular RGB start/goal observations for embodied navigation.
-
PlayWorld: Learning Robot World Models from Autonomous Play
PlayWorld learns high-fidelity robot world models from unsupervised self-play, producing physically consistent video predictions that outperform models trained on human data and enabling 65% better real-world policy p...
-
Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models
SuSIE uses a finetuned InstructPix2Pix diffusion model to propose subgoal images that guide a low-level goal-conditioned policy, achieving SOTA zero-shot performance on CALVIN and real-world manipulation.
-
Open X-Embodiment: Robotic Learning Datasets and RT-X Models
A collaborative dataset spanning 22 robots and 527 skills enables RT-X models that transfer capabilities across different robot embodiments.
-
Unified Motion-Action Modeling for Heterogeneous Robot Learning
UMA treats object motion and robot actions as co-evolving variables under a masked generative objective with hindsight relabeling and contrastive disentanglement to support multi-task pretraining and deployment across...
-
CAPE: Contrastive Action-conditioned Parallel Encoding for Embodied Planning
CAPE learns action-conditioned visual dynamics via parallel encoding and a goal-convergent contrastive objective, outperforming baselines on retrieval, matching, and closed-loop planning while cutting inference cost.
-
CLAW: Learning Continuous Latent Action World Models via Adversarial Latent Regularization
CLAW is an end-to-end self-supervised method that learns semantically meaningful continuous latent actions and predictive world models from action-free videos to support imitation learning and goal-directed planning.
-
Intercepting the Future: Latent-Space Predictive World Model for Dynamic VLA Manipulation
AHEAD augments frozen VLAs with a 4.9M-parameter latent world model that forecasts future visual features using optical-flow motion cues, achieving 79-97% success on dynamic simulation tasks and high real-robot succes...
-
SKIP: Sparse Keyframe Interpolation Paradigm for Efficient Embodied World Models
SKIP achieves 4.16x faster dense video rollouts for robot world models by synthesizing only multimodal-identified keyframes and interpolating the rest, preserving policy training effectiveness with minimal success rate drops.
-
StressDream: Steering Video World Models for Robust Policy Evaluation and Improvement
StressDream optimizes initial noise in diffusion video world models using VLM semantic and plausibility objectives to steer generations toward specified high-impact outcomes for improved policy evaluation.
-
Parallel Differentiable Reachability for Learning and Planning with Certified Neural Dynamics and Controllers
A JAX-based differentiable reachability primitive for continuous- and discrete-time NN dynamics and controllers that supports certified training and sampling-based MPC with gradient refinement.
-
EgoExo-WM: Unlocking Exo Video for Ego World Models
Method converts exocentric videos to egocentric format via body-pose extraction and kinematics to improve egocentric world-model prediction and planning.
-
EgoExo-WM: Unlocking Exo Video for Ego World Models
Converting exocentric video to egocentric format via body-pose extraction and kinematics prior enables training of action-conditioned egocentric world models that improve prediction quality and goal-directed planning.
-
Physically Native World Models: A Hamiltonian Perspective on Generative World Modeling
Hamiltonian World Models structure latent dynamics around energy-conserving Hamiltonian evolution to produce physically grounded, action-controllable predictions for embodied decision making.
-
Toward Hardware-Agnostic Quadrupedal World Models via Morphology Conditioning
Morphology-conditioned quadrupedal world model enables zero-shot generalization to new robot embodiments for locomotion tasks.
-
Ctrl-World: A Controllable Generative World Model for Robot Manipulation
A controllable world model trained on the DROID dataset generates consistent multi-view robot trajectories for over 20 seconds and improves generalist policy success rates by 44.7% via imagined trajectory fine-tuning.
-
Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation
Genie Envisioner unifies robotic policy learning, simulation, and evaluation inside one instruction-conditioned video diffusion framework using GE-Base, GE-Act, and GE-Sim.
-
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 h...
-
ReSim: Reliable World Simulation for Autonomous Driving
ReSim is a controllable video world model trained on heterogeneous real and simulated driving data that achieves higher fidelity and controllability for both expert and non-expert actions, plus a Video2Reward module f...
-
DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning
DINO-WM builds world models on pre-trained DINOv2 features to enable zero-shot planning from offline data without rewards or demonstrations.
-
DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset
DROID is a new 76k-trajectory in-the-wild robot manipulation dataset spanning 564 scenes and 84 tasks that improves policy performance and generalization when used for training.
-
Is Conditional Generative Modeling all you need for Decision-Making?
Return-conditional diffusion models for policies outperform offline RL on benchmarks by circumventing dynamic programming and enable constraint or skill composition.
-
Bridge Data: Boosting Generalization of Robotic Skills with Cross-Domain Datasets
A large multi-task multi-domain robot dataset combined with 50 new demonstrations yields 2x higher success rates on never-before-seen tasks in new domains.
-
RoboNet: Large-Scale Multi-Robot Learning
RoboNet is a multi-robot video dataset that enables pre-training of vision-based manipulation models which, after fine-tuning on a new robot, outperform robot-specific training that uses 4-20 times more data.
-
Motion-Aware Reinforcement Learning For Object Localization
MARLNet adds a motion prior to observations and smoothness penalty to rewards in a PPO bounding-box agent, producing small gains (+0.011 on VOC, +0.007 on VisDrone) at IoU≥0.5 while exposing a reward-interference fail...
-
VICX: Generalizable Robot Manipulation via Video Generation and In-Context Operator Network
VICX decouples frozen video-based visual planning from in-context visual-to-trajectory mapping via V2T-ICON to achieve cross-task and cross-embodiment generalization in robot manipulation.
-
IMWM: Intuition Models Complement World Models for Latent Planning
IMWM combines a world model with an intuition model from demonstrations to improve sample-based latent planning success rates over world-model-only baselines on pixel control tasks.
-
$\tau_0$-WM: A Unified Video-Action World Model for Robotic Manipulation
A shared video diffusion backbone jointly predicts future latents and continuous actions while also rolling out candidate actions to predict dense task-progress scores, trained on 27,300 hours of mixed robot and human data.
-
Physically Native World Models: A Hamiltonian Perspective on Generative World Modeling
Proposes Hamiltonian World Models as a physically grounded framework encoding observations into latent phase space and evolving them via Hamiltonian dynamics with control and dissipation for embodied prediction and planning.
-
WestWorld: A Knowledge-Encoded Scalable Trajectory World Model for Diverse Robotic Systems
WestWorld introduces a scalable trajectory world model with Sys-MoE routing via system embeddings and structural embeddings for physical knowledge, pretrained on 89 environments to improve zero-shot prediction and rea...
-
Reasoning and Generalization in RL: A Tool Use Perspective
Proposes a tool-use inspired framework with multiple test sets to measure specified types of generalization in RL.
-
Learning to Cope with Adversarial Attacks
MLAH agent in deep RL demonstrates hierarchical coping mechanisms and improved reward maintenance under spaced adversarial attacks, at the expense of stability.
-
Can Predicted Dynamics Exist in the Physical World?
Physical admissibility is defined as a prediction-control interface using kinematic, dynamic, and composed-horizon conditions to reject invalid dynamics proposals, with AUC 0.957 on LeRobot PushT and 87-89% prevention...
-
Planning Robot Motion using Deep Visual Prediction
PROM-Net performs unsupervised visual prediction of robot motion from raw frames and integrates the predictions into model predictive control for navigation in unknown dynamic settings.
-
Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems
Offline RL promises to extract high-utility policies from static datasets but faces fundamental challenges that current methods only partially address.