Never Give Up: Learning Directed Exploration Strategies

Adrià Puigdomènech Badia, Pablo Sprechmann, Alex Vitvitskyi, Daniel Guo, Bilal Piot, Steven Kapturowski, Olivier Tieleman, Martín Arjovsky, Alexander Pritzel, Andrew Bolt, Charles Blundell

arxiv: 2002.06038 · v1 · pith:5AGUSWYXnew · submitted 2020-02-14 · 💻 cs.LG · stat.ML


keywords: exploration, agent, policies, directed, exploratory, learning, method, score


We propose a reinforcement learning agent to solve hard exploration games by learning a range of directed exploratory policies. We construct an episodic memory-based intrinsic reward using k-nearest neighbours over the agent's recent experience to train the directed exploratory policies, thereby encouraging the agent to repeatedly revisit all states in its environment. A self-supervised inverse dynamics model is used to train the embeddings of the nearest-neighbour lookup, biasing the novelty signal towards what the agent can control. We employ the framework of Universal Value Function Approximators (UVFA) to simultaneously learn many directed exploration policies with the same neural network, each with a different trade-off between exploration and exploitation. Because the same neural network is used for different degrees of exploration/exploitation, transfer is demonstrated from predominantly exploratory policies, yielding effective exploitative policies. The proposed method can be incorporated into modern distributed RL agents that collect large amounts of experience from many actors running in parallel on separate environment instances. Our method doubles the performance of the base agent on all hard exploration games in the Atari-57 suite while maintaining a very high score across the remaining games, obtaining a median human normalised score of 1344.0%. Notably, the proposed method is the first algorithm to achieve non-zero rewards (with a mean score of 8,400) in the game of Pitfall! without using demonstrations or hand-crafted features.
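The abstract names the ingredients (an episodic memory, a k-nearest-neighbour lookup over embeddings, and a family of policies with different exploration/exploitation trade-offs) but gives no formulas. The following is only a rough sketch of how such a novelty bonus and reward mixture might be wired together. The function names `episodic_novelty` and `mixed_reward`, the inverse-distance kernel, the constants `k` and `eps`, and the weight `beta` are all illustrative assumptions, not details taken from the abstract. Embeddings are assumed to be given; in the paper they are trained with an inverse dynamics model.

```python
import numpy as np


def episodic_novelty(embedding, memory, k=10, eps=1e-3):
    """Hypothetical k-NN novelty bonus in (0, 1]; higher means less familiar."""
    if len(memory) == 0:
        return 1.0  # assumption: an empty episodic memory is maximally novel
    mem = np.asarray(memory, dtype=np.float64)
    query = np.asarray(embedding, dtype=np.float64)
    # Squared Euclidean distance from the query to every stored embedding.
    sq_dists = np.sum((mem - query) ** 2, axis=1)
    nearest = np.sort(sq_dists)[:k]
    # Assumed inverse-distance kernel: close neighbours add ~1, distant ones ~0.
    similarity = np.sum(eps / (nearest + eps))
    return float(1.0 / np.sqrt(similarity + 1.0))


def mixed_reward(extrinsic, intrinsic, beta):
    """Combine task reward and novelty bonus; beta sets the trade-off (assumed form)."""
    return extrinsic + beta * intrinsic


# Usage: revisiting a stored state is rewarded less than reaching a distant one.
memory = []
state = np.array([0.0, 0.0])
first = episodic_novelty(state, memory)
memory.append(state)
repeat = episodic_novelty(state, memory)
far = episodic_novelty(np.array([10.0, 10.0]), memory)

# A small family of trade-offs; beta = 0 is the purely exploitative member.
betas = [0.0, 0.1, 0.3]
rewards = [mixed_reward(1.0, repeat, b) for b in betas]
```

In this sketch a single network conditioned on `beta` would be trained on all of the mixed rewards at once, which is one way to read the UVFA idea described above; the specific values in `betas` are placeholders.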


Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Baba in Wonderland: Online Self-Supervised Dynamics Discovery for Executable World Models
cs.AI 2026-05 unverdicted novelty 7.0

Alice uses preservation conflicts from failed candidate updates to create class-stratified hypotheses and guide exploration, improving executable world-model learning under prior misalignment.

Beyond Noisy-TVs: Noise-Robust Exploration Via Learning Progress Monitoring
cs.LG 2025-09 unverdicted novelty 7.0

LPM uses a dual-network design to compute intrinsic rewards from the change in prediction error across iterations, providing a noise-robust signal that is theoretically linked to information gain.

Heuresis: Search Strategies for Autonomous AI Research Agents Across Quality, Diversity and Novelty
cs.AI 2026-06 accept novelty 6.0

Heuresis evaluates six search strategies for LLM research agents and shows they steer ideas along quality-diversity-novelty axes but fail to generate novel ideas that match or exceed known high-performing recipes.

Heuresis: Search Strategies for Autonomous AI Research Agents Across Quality, Diversity and Novelty
cs.AI 2026-06 unverdicted novelty 6.0

Heuresis evaluates six search strategies for autonomous ML research agents and finds that novel ideas are rare, none rated original, and only one reaches top-10 quality while strategies steer axes but do not expand th...