Learning to reinforcement learn
read the original abstract
In recent years deep reinforcement learning (RL) systems have attained superhuman performance in a number of challenging task domains. However, a major limitation of such applications is their demand for massive amounts of training data. A critical present objective is thus to develop deep RL methods that can adapt rapidly to new tasks. In the present work we introduce a novel approach to this challenge, which we refer to as deep meta-reinforcement learning. Previous work has shown that recurrent networks can support meta-learning in a fully supervised context. We extend this approach to the RL setting. What emerges is a system that is trained using one RL algorithm, but whose recurrent dynamics implement a second, quite separate RL procedure. This second, learned RL algorithm can differ from the original one in arbitrary ways. Importantly, because it is learned, it is configured to exploit structure in the training domain. We unpack these points in a series of seven proof-of-concept experiments, each of which examines a key aspect of deep meta-RL. We consider prospects for extending and scaling up the approach, and also point out some potentially important implications for neuroscience.
This paper has not been read by Pith yet.
Forward citations
Cited by 19 Pith papers
-
Risks from Learned Optimization in Advanced Machine Learning Systems
Mesa-optimization arises when learned models act as optimizers with objectives that can differ from their training loss, creating alignment risks in advanced machine learning.
-
Learned Memory Attenuation in Sage-Husa Kalman Filters for Robust UAV State Estimation
NDR-SHKF replaces the static forgetting factor in Sage-Husa Kalman Filters with a learned vector-valued memory attenuation policy from a bifurcated recurrent network trained end-to-end on whitened innovations to minim...
-
Harnessing Agentic Evolution
AEvo introduces a meta-agent that edits the evolution procedure or agent context based on accumulated state, outperforming baselines by 26% relative improvement on agentic benchmarks and achieving SOTA on open-ended tasks.
-
Zero-shot Imitation Learning by Latent Topology Mapping
ZALT learns latent hub states and hub-to-hub dynamics from demonstrations to plan zero-shot solutions for unseen start-goal tasks, achieving 55% success in a 3D maze versus 6% for baselines.
-
Automated Design of Agentic Systems
Meta Agent Search uses a meta-agent to iteratively program novel agentic systems in code, producing agents that outperform state-of-the-art hand-designed ones across coding, science, and math while transferring across...
-
Solving Rubik's Cube with a Robot Hand
Reinforcement learning models trained only in simulation using automatic domain randomization solve Rubik's cube with a real robot hand.
-
Searching for Activation Functions
Automated search discovers Swish activation f(x) = x * sigmoid(βx) that improves top-1 ImageNet accuracy over ReLU by 0.9% on Mobile NASNet-A and 0.6% on Inception-ResNet-v2.
-
From Cumulative Constraints to Adaptive Runtime Safety Control for Nonstationary Reinforcement Learning
CPSS projects cumulative safety constraints into time-varying per-state thresholds for online action shielding in nonstationary RL, providing per-state guarantees and cumulative bounds.
-
Generalised Linear Models in Deep Bayesian RL with Learnable Basis Functions
GLiBRL uses GLMs with learnable basis functions for exact Bayesian inference in deep BRL, derives a closed-form link between L2 task distances and kernel task similarity, and reports up to 1.8x gains over prior meta-R...
-
RAPTOR: A Foundation Policy for Quadrotor Control
A 2084-parameter recurrent policy trained by distilling 1000 RL teacher policies enables zero-shot control across 10 real quadrotors differing in mass, motors, frames, propellers, and flight controllers.
-
Environment Probing Interaction Policies
EPI policies use a transition-predictability reward to probe environments and condition task policies, outperforming standard generalization methods on novel test environments.
-
Evolvability ES: Scalable and Direct Optimization of Evolvability
Evolvability ES is an evolutionary strategy variant that directly optimizes for evolvability by maximizing behavioral diversity under mutations, tested on 2D/3D locomotion tasks and shown competitive with MAML.
-
Generalizing from a few environments in safety-critical reinforcement learning
RL agents fail dangerously on unseen environments; ensembles reduce catastrophes in gridworld but not CoinRun, with uncertainty enabling intervention prediction.
-
Relational inductive biases, deep learning, and graph networks
Graph networks unify graph-based neural methods into a general framework with strong relational inductive biases to support combinatorial generalization and structured reasoning in AI.
-
Safe Continual Reinforcement Learning under Nonstationarity via Adaptive Safety Constraints
LILAC+ combines context-based, adaptation-speed, and budget-to-state safety constraints to reduce violations in continual RL under nonstationary conditions, demonstrated in simulated driving tasks.
-
Large Language Models for Sequential Decision-Making: Improving In-Context Learning via Supervised Fine-Tuning
Supervised fine-tuning of pretrained LLMs on offline trajectories yields better few-shot sequential decision-making than in-context-only baselines, with a theoretical suboptimality bound derived for linear MDPs by int...
-
The Essence of Balance for Self-Improving Agents in Vision-and-Language Navigation
SDB balances behavioral diversity and learning stability in VLN self-improvement by expanding decisions into latent hypotheses, performing reliability-aware aggregation, and applying a regularizer, yielding gains such...
-
Self-Monitoring Benefits from Structural Integration: Lessons from Metacognition in Continuous-Time Multi-Timescale Agents
Self-monitoring modules in multi-timescale agents fail as auxiliary losses due to collapse but show limited gains when wired into policy decisions, without outperforming simple baselines.
-
Partner Modelling Emerges in Recurrent Agents (But Only When It Matters)
Model-free RNN agents in Overcooked-AI spontaneously develop structured internal models of partner abilities when they can allocate tasks, enabling adaptation to novel collaborators.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.