Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos

Adrien Ecoffet; Bowen Baker; Brandon Houghton; Ilge Akkaya; Jeff Clune; Jie Tang; Joost Huizinga; Peter Zhokhov; Raul Sampedro

arxiv: 2206.11795 · v1 · pith:OWR4IBOTnew · submitted 2022-06-23 · 💻 cs.LG · cs.AI


classification 💻 cs.LG cs.AI

keywords learning online behavioral data pretraining train unlabeled videos


Pretraining on noisy, internet-scale datasets has been heavily studied as a technique for training models with broad, general capabilities for text, images, and other modalities. However, for many sequential decision domains such as robotics, video games, and computer use, publicly available data does not contain the labels required to train behavioral priors in the same way. We extend the internet-scale pretraining paradigm to sequential decision domains through semi-supervised imitation learning wherein agents learn to act by watching online unlabeled videos. Specifically, we show that with a small amount of labeled data we can train an inverse dynamics model accurate enough to label a huge unlabeled source of online data -- here, online videos of people playing Minecraft -- from which we can then train a general behavioral prior. Despite using the native human interface (mouse and keyboard at 20Hz), we show that this behavioral prior has nontrivial zero-shot capabilities and that it can be fine-tuned, with both imitation learning and reinforcement learning, to hard-exploration tasks that are impossible to learn from scratch via reinforcement learning. For many tasks our models exhibit human-level performance, and we are the first to report computer agents that can craft diamond tools, which can take proficient humans upwards of 20 minutes (24,000 environment actions) of gameplay to accomplish.
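The pipeline described in the abstract (fit an inverse dynamics model on a small labeled set, use it to label action-free videos, then train a behavioral prior on the result) can be illustrated with a toy sketch. This is not the authors' implementation: the integer "observations", the three action names, the counting-based models, and every function name below are hypothetical stand-ins for the Minecraft frames and neural networks used in the paper. The last lines check the abstract's arithmetic, that 20 minutes of play at 20 Hz comes to 24,000 actions.

```python
from collections import Counter, defaultdict


def train_idm(labeled):
    """Fit a toy inverse dynamics model: observation change -> most common action."""
    votes = defaultdict(Counter)
    for obs, next_obs, action in labeled:
        votes[next_obs - obs][action] += 1
    return {delta: c.most_common(1)[0][0] for delta, c in votes.items()}


def pseudo_label(idm, video):
    """Label an action-free observation sequence with the IDM's predictions."""
    return [(o, idm[n - o]) for o, n in zip(video, video[1:]) if (n - o) in idm]


def train_prior(pairs):
    """Behavioral cloning by counting: observation -> most common action."""
    votes = defaultdict(Counter)
    for obs, action in pairs:
        votes[obs][action] += 1
    return {obs: c.most_common(1)[0][0] for obs, c in votes.items()}


# Small labeled set of (observation, next observation, action) triples (made up).
labeled = [(0, 1, "right"), (5, 4, "left"), (3, 3, "stay"), (2, 3, "right")]
# Larger unlabeled set: observation sequences with no actions (made up).
videos = [[0, 1, 2, 3], [1, 2, 3, 3], [6, 5, 4, 3]]

idm = train_idm(labeled)
pairs = [p for v in videos for p in pseudo_label(idm, v)]
prior = train_prior(pairs)

# Arithmetic from the abstract: 20 minutes of gameplay at 20 actions per second.
ACTIONS_PER_SECOND = 20
actions_in_20_minutes = 20 * 60 * ACTIONS_PER_SECOND
```

The point of the sketch is the data flow: the four labeled transitions are enough to recover actions for all nine unlabeled transitions, and the prior is trained only on those pseudo-labels. Fine-tuning with imitation learning or reinforcement learning, which the abstract also describes, is outside the scope of this example.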


Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution
cs.CL 2023-09 unverdicted novelty 8.0

Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.
PhysEditWorld: A Large-Scale Dataset Toward Physics-Editable World Models
cs.CV 2026-06 unverdicted novelty 7.0

PhysEditWorld is a new dataset of over 60 million frames from 12 UE5 cinematic scenes with synchronized multimodal signals and explicit gravity labels, built via replay to support physics-editable world models.
ARC-RL: A Reinforcement Learning Playground Inspired by ARC Raiders
cs.RO 2026-05 unverdicted novelty 7.0

ARC-RL provides four new MuJoCo continuous-control environments with hexapod and quadruped morphologies inspired by ARC Raiders, a unified multi-component reward without motion capture, CPG expert demonstrators, and e...
ASH: Agents that Self-Hone via Embodied Learning
cs.AI 2026-05 unverdicted novelty 7.0

ASH reaches 11.2/12 milestones in Pokemon Emerald and 9.9/12 in Zelda by self-improving via an IDM trained on its own trajectories to label internet video, while baselines plateau at roughly 6/12.
DreamGen: Unlocking Generalization in Robot Learning through Video World Models
cs.RO 2025-05 unverdicted novelty 7.0

DreamGen trains robot policies on synthetic trajectories from adapted video world models, enabling a humanoid robot to perform 22 new behaviors in seen and unseen environments from a single pick-and-place teleoperatio...
Learning Interactive Real-World Simulators
cs.AI 2023-10 conditional novelty 7.0

UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.
Voyager: An Open-Ended Embodied Agent with Large Language Models
cs.AI 2023-05 unverdicted novelty 7.0

Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more uniq...
Mastering Diverse Domains through World Models
cs.AI 2023-01 unverdicted novelty 7.0

DreamerV3 uses world models and robustness techniques to solve over 150 tasks across domains with a single configuration, including Minecraft diamond collection from scratch.
A Generalist Agent
cs.AI 2022-05 accept novelty 7.0

Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.
In-Context World Modeling for Robotic Control
cs.RO 2026-06 unverdicted novelty 6.0

ICWM frames system identification as in-context adaptation so VLA policies can infer dynamics from self-generated interactions and handle novel configurations without parameter updates.
ARC-RL: A Reinforcement Learning Playground Inspired by ARC Raiders
cs.RO 2026-05 accept novelty 6.0

ARC-RL is a new suite of four MuJoCo continuous-control environments featuring game-inspired hexapod and quadruped morphologies, a single closed-form multi-component reward function, CPG demonstrators, and empirical c...
CA2: Code-Aware Agent for Automated Game Testing
cs.SE 2026-05 unverdicted novelty 6.0

CA2 integrates call stack information into RL agents for game testing and shows consistent gains over baselines that ignore code signals.
Preference Goal Tuning: Post-Training as Latent Control for Frozen Policies
cs.AI 2024-12 unverdicted novelty 6.0

PGT optimizes latent goal embeddings for frozen policies via trajectory-level preference objectives, reporting 72-81.6% relative gains on 17 Minecraft tasks and 13.4% better OOD performance than fine-tuning.
Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents
cs.AI 2023-02 conditional novelty 6.0

DEPS combines LLM-based interactive planning with a trainable goal selector to create a zero-shot multi-task agent that completes 70+ Minecraft tasks and nearly doubles prior performance.
PhysEditWorld: A Large-Scale Dataset Toward Physics-Editable World Models
cs.CV 2026-06 unverdicted novelty 5.0

PhysEditWorld supplies 12 UE5 scenes, 60+ million frames, and explicit gravity labels via a replay paradigm to support gravity-faithful and physically editable world models.
In-Context World Modeling for Robotic Control
cs.RO 2026-06 unverdicted novelty 5.0

ICWM reframes system identification as in-context adaptation, letting VLA policies capture current world dynamics from task-agnostic interaction histories to generalize to novel configurations.
ASH: Agents that Self-Hone via Embodied Learning
cs.AI 2026-05 unverdicted novelty 5.0

ASH learns long-horizon embodied policies from unlabeled internet video via a self-improvement loop that trains an IDM on its own trajectories and extracts supervision plus key-moment memory from video.