Playing hard exploration games by watching YouTube

David Budden; Nando de Freitas; Tobias Pfaff; Tom Le Paine; Yusuf Aytar; Ziyu Wang

arxiv: 1805.11592 · v2 · pith:ZFPF3RVPnew · submitted 2018-05-29 · cs.LG · cs.AI · cs.CV · stat.ML

Yusuf Aytar, Tobias Pfaff, David Budden, Tom Le Paine, Ziyu Wang, Nando de Freitas

classification cs.LG cs.AI cs.CV stat.ML

keywords agent environment exploration method access demonstrator first games

Deep reinforcement learning methods traditionally struggle with tasks where environment rewards are particularly sparse. One successful method of guiding exploration in these domains is to imitate trajectories provided by a human demonstrator. However, these demonstrations are typically collected under artificial conditions, i.e. with access to the agent's exact environment setup and the demonstrator's action and reward trajectories. Here we propose a two-stage method that overcomes these limitations by relying on noisy, unaligned footage without access to such data. First, we learn to map unaligned videos from multiple sources to a common representation using self-supervised objectives constructed over both time and modality (i.e. vision and sound). Second, we embed a single YouTube video in this representation to construct a reward function that encourages an agent to imitate human gameplay. This method of one-shot imitation allows our agent to convincingly exceed human-level performance on the infamously hard exploration games Montezuma's Revenge, Pitfall! and Private Eye for the first time, even if the agent is not presented with any environment rewards.
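
The second stage described in the abstract, turning one embedded video into a reward signal, can be illustrated with a minimal sketch. It assumes an embedding function already exists and that frames arrive as plain lists of floats. The checkpoint spacing, the cosine-similarity test, the threshold, and the bonus value are illustrative assumptions, as are the names make_checkpoints and imitation_reward; none of these details are stated in the abstract.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def make_checkpoints(demo_embeddings: Sequence[Sequence[float]],
                     every: int) -> List[List[float]]:
    """Keep every `every`-th embedded demonstration frame as a checkpoint.

    The spacing is a hypothetical parameter chosen for illustration.
    """
    return [list(e) for e in demo_embeddings[::every]]


def imitation_reward(agent_embedding: Sequence[float],
                     checkpoints: Sequence[Sequence[float]],
                     next_idx: int,
                     threshold: float = 0.5,
                     bonus: float = 0.5) -> Tuple[float, int]:
    """Return (reward, updated index of the next checkpoint to reach).

    The agent earns `bonus` when its current embedding is close enough to
    the next unvisited checkpoint, which then advances by one. Threshold
    and bonus are assumed values, not taken from the abstract.
    """
    if next_idx >= len(checkpoints):
        return 0.0, next_idx  # demonstration already fully followed
    if cosine_similarity(agent_embedding, checkpoints[next_idx]) > threshold:
        return bonus, next_idx + 1
    return 0.0, next_idx


# Toy usage with 2-D "embeddings" standing in for a learned representation.
demo = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
checkpoints = make_checkpoints(demo, every=2)
reward, idx = imitation_reward([1.0, 0.0], checkpoints, next_idx=0)
```

Because the reward depends only on distances in the shared embedding space, a scheme of this kind needs neither the demonstrator's actions nor the environment's own rewards, which matches the setting the abstract describes.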

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution
cs.CL 2023-09 unverdicted novelty 8.0

Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.

Attentive Multi-Task Deep Reinforcement Learning
cs.LG 2019-07 unverdicted novelty 6.0

Attention mechanism dynamically groups task knowledge at state granularity in multi-task DRL to enable positive transfer and avoid negative transfer, matching or exceeding prior methods with fewer parameters.

Supervise Thyself: Examining Self-Supervised Representations in Interactive Environments
cs.LG 2019-06 unverdicted novelty 5.0

Empirical comparison finds that self-supervised representations vary in capturing agent state and generalizing to new levels or textures depending on environment visuals and dynamics.