Learning to reinforcement learn

Charles Blundell; Dharshan Kumaran; Dhruva Tirumala; Hubert Soyer; Jane X Wang; Joel Z Leibo; Matt Botvinick; Remi Munos; Zeb Kurth-Nelson

arxiv: 1611.05763 · v3 · pith:WL2ENWWRnew · submitted 2016-11-17 · 💻 cs.LG · cs.AI · stat.ML

Jane X Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z Leibo, Remi Munos, Charles Blundell, Dharshan Kumaran, Matt Botvinick

classification 💻 cs.LG · cs.AI · stat.ML

keywords deep · approach · learning · algorithm · learned · recurrent · reinforcement · second

In recent years deep reinforcement learning (RL) systems have attained superhuman performance in a number of challenging task domains. However, a major limitation of such applications is their demand for massive amounts of training data. A critical present objective is thus to develop deep RL methods that can adapt rapidly to new tasks. In the present work we introduce a novel approach to this challenge, which we refer to as deep meta-reinforcement learning. Previous work has shown that recurrent networks can support meta-learning in a fully supervised context. We extend this approach to the RL setting. What emerges is a system that is trained using one RL algorithm, but whose recurrent dynamics implement a second, quite separate RL procedure. This second, learned RL algorithm can differ from the original one in arbitrary ways. Importantly, because it is learned, it is configured to exploit structure in the training domain. We unpack these points in a series of seven proof-of-concept experiments, each of which examines a key aspect of deep meta-RL. We consider prospects for extending and scaling up the approach, and also point out some potentially important implications for neuroscience.
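The abstract's central idea, a recurrent network whose dynamics carry out a second learning procedure within each task, can be illustrated with a minimal sketch. Everything below is a hypothetical illustration rather than the authors' implementation: it assumes a two-armed Bernoulli bandit task family, a plain tanh recurrent cell, and an input made of the previous action and previous reward. The names `sample_bandit`, `TinyRecurrentPolicy`, and `run_episode` are invented for this example. The weights are random and untrained, and the outer RL algorithm that would train them across many episodes is omitted. The sketch only shows the interface through which within-episode adaptation could emerge: the hidden state is the sole memory of past actions and rewards, and it is reset whenever a new task is drawn.

```python
import math
import random


def sample_bandit(rng):
    """Draw a new task: one payoff probability per arm (assumed task family)."""
    return [rng.random(), rng.random()]


class TinyRecurrentPolicy:
    """Plain tanh RNN with fixed random weights (stand-in for a trained network)."""

    def __init__(self, n_actions, n_hidden, rng):
        self.n_actions = n_actions
        self.n_hidden = n_hidden
        n_in = n_actions + 1  # one-hot previous action plus previous reward

        def mat(rows, cols):
            return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

        self.w_in = mat(n_hidden, n_in)
        self.w_rec = mat(n_hidden, n_hidden)
        self.w_out = mat(n_actions, n_hidden)

    def initial_state(self):
        return [0.0] * self.n_hidden

    def step(self, hidden, prev_action, prev_reward):
        # Build the input: what the agent did last and what it received for it.
        x = [0.0] * (self.n_actions + 1)
        if prev_action is not None:
            x[prev_action] = 1.0
        x[-1] = prev_reward
        # Recurrent update: the new hidden state summarizes the episode so far.
        new_hidden = [
            math.tanh(
                sum(w * v for w, v in zip(self.w_in[i], x))
                + sum(w * v for w, v in zip(self.w_rec[i], hidden))
            )
            for i in range(self.n_hidden)
        ]
        # Softmax over action logits (shifted by the max for numerical stability).
        logits = [sum(w * v for w, v in zip(row, new_hidden)) for row in self.w_out]
        peak = max(logits)
        exps = [math.exp(value - peak) for value in logits]
        total = sum(exps)
        return new_hidden, [e / total for e in exps]


def run_episode(policy, rng, n_trials=20):
    """One episode = one freshly sampled task; the weights stay fixed throughout."""
    arm_probs = sample_bandit(rng)
    hidden = policy.initial_state()  # memory is reset when the task changes
    prev_action, prev_reward = None, 0.0
    actions, rewards = [], []
    for _ in range(n_trials):
        hidden, action_probs = policy.step(hidden, prev_action, prev_reward)
        action = rng.choices(range(policy.n_actions), weights=action_probs)[0]
        reward = 1.0 if rng.random() < arm_probs[action] else 0.0
        actions.append(action)
        rewards.append(reward)
        prev_action, prev_reward = action, reward
    return actions, rewards


rng = random.Random(0)
policy = TinyRecurrentPolicy(n_actions=2, n_hidden=8, rng=rng)
actions, rewards = run_episode(policy, rng)
print(len(actions), sum(rewards))
```

In this framing, an outer RL algorithm would adjust `w_in`, `w_rec`, and `w_out` slowly across many sampled tasks, while any adaptation inside a single episode would have to happen through `hidden` alone, since the weights do not change between trials.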

This paper has not been read by Pith yet.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Risks from Learned Optimization in Advanced Machine Learning Systems
cs.AI 2019-06 accept novelty 9.0

Mesa-optimization arises when learned models act as optimizers with objectives that can differ from their training loss, creating alignment risks in advanced machine learning.

Learned Memory Attenuation in Sage-Husa Kalman Filters for Robust UAV State Estimation
eess.SP 2026-05 unverdicted novelty 7.0

NDR-SHKF replaces the static forgetting factor in Sage-Husa Kalman Filters with a learned vector-valued memory attenuation policy from a bifurcated recurrent network trained end-to-end on whitened innovations to minim...

Harnessing Agentic Evolution
cs.AI 2026-05 unverdicted novelty 7.0

AEvo introduces a meta-agent that edits the evolution procedure or agent context based on accumulated state, achieving a 26% relative improvement over baselines on agentic benchmarks and SOTA on open-ended tasks.

Zero-shot Imitation Learning by Latent Topology Mapping
cs.LG 2026-05 unverdicted novelty 7.0

ZALT learns latent hub states and hub-to-hub dynamics from demonstrations to plan zero-shot solutions for unseen start-goal tasks, achieving 55% success in a 3D maze versus 6% for baselines.

Automated Design of Agentic Systems
cs.AI 2024-08 conditional novelty 7.0

Meta Agent Search uses a meta-agent to iteratively program novel agentic systems in code, producing agents that outperform state-of-the-art hand-designed ones across coding, science, and math while transferring across...

Solving Rubik's Cube with a Robot Hand
cs.LG 2019-10 accept novelty 7.0

Reinforcement learning models trained only in simulation using automatic domain randomization solve Rubik's cube with a real robot hand.

Searching for Activation Functions
cs.NE 2017-10 conditional novelty 7.0

Automated search discovers Swish activation f(x) = x * sigmoid(βx) that improves top-1 ImageNet accuracy over ReLU by 0.9% on Mobile NASNet-A and 0.6% on Inception-ResNet-v2.

From Cumulative Constraints to Adaptive Runtime Safety Control for Nonstationary Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

CPSS projects cumulative safety constraints into time-varying per-state thresholds for online action shielding in nonstationary RL, providing per-state guarantees and cumulative bounds.

Generalised Linear Models in Deep Bayesian RL with Learnable Basis Functions
cs.LG 2025-12 unverdicted novelty 6.0

GLiBRL uses GLMs with learnable basis functions for exact Bayesian inference in deep BRL, derives a closed-form link between L2 task distances and kernel task similarity, and reports up to 1.8x gains over prior meta-R...

RAPTOR: A Foundation Policy for Quadrotor Control
cs.RO 2025-09 unverdicted novelty 6.0

A 2084-parameter recurrent policy trained by distilling 1000 RL teacher policies enables zero-shot control across 10 real quadrotors differing in mass, motors, frames, propellers, and flight controllers.

Environment Probing Interaction Policies
cs.RO 2019-07 unverdicted novelty 6.0

EPI policies use a transition-predictability reward to probe environments and condition task policies, outperforming standard generalization methods on novel test environments.

Evolvability ES: Scalable and Direct Optimization of Evolvability
cs.NE 2019-07 unverdicted novelty 6.0

Evolvability ES is an evolutionary strategy variant that directly optimizes for evolvability by maximizing behavioral diversity under mutations, tested on 2D/3D locomotion tasks and shown competitive with MAML.

Generalizing from a few environments in safety-critical reinforcement learning
cs.LG 2019-07 unverdicted novelty 6.0

RL agents fail dangerously on unseen environments; ensembles reduce catastrophes in gridworld but not CoinRun, with uncertainty enabling intervention prediction.

Relational inductive biases, deep learning, and graph networks
cs.LG 2018-06 conditional novelty 6.0

Graph networks unify graph-based neural methods into a general framework with strong relational inductive biases to support combinatorial generalization and structured reasoning in AI.

Safe Continual Reinforcement Learning under Nonstationarity via Adaptive Safety Constraints
cs.LG 2026-05 unverdicted novelty 5.0

LILAC+ combines context-based, adaptation-speed, and budget-to-state safety constraints to reduce violations in continual RL under nonstationary conditions, demonstrated in simulated driving tasks.

Large Language Models for Sequential Decision-Making: Improving In-Context Learning via Supervised Fine-Tuning
cs.LG 2026-05 unverdicted novelty 5.0

Supervised fine-tuning of pretrained LLMs on offline trajectories yields better few-shot sequential decision-making than in-context-only baselines, with a theoretical suboptimality bound derived for linear MDPs by int...

The Essence of Balance for Self-Improving Agents in Vision-and-Language Navigation
cs.CV 2026-04 unverdicted novelty 5.0

SDB balances behavioral diversity and learning stability in VLN self-improvement by expanding decisions into latent hypotheses, performing reliability-aware aggregation, and applying a regularizer, yielding gains such...

Self-Monitoring Benefits from Structural Integration: Lessons from Metacognition in Continuous-Time Multi-Timescale Agents
cs.AI 2026-04 unverdicted novelty 5.0

Self-monitoring modules in multi-timescale agents fail as auxiliary losses due to collapse but show limited gains when wired into policy decisions, without outperforming simple baselines.

Partner Modelling Emerges in Recurrent Agents (But Only When It Matters)
cs.AI 2025-05 unverdicted novelty 5.0

Model-free RNN agents in Overcooked-AI spontaneously develop structured internal models of partner abilities when they can allocate tasks, enabling adaptation to novel collaborators.