RL$^2$: Fast Reinforcement Learning via Slow Reinforcement Learning

Ilya Sutskever; John Schulman; Peter L. Bartlett; Pieter Abbeel; Xi Chen; Yan Duan

arxiv: 1611.02779 · v2 · pith:3Z6UTH5Jnew · submitted 2016-11-09 · 💻 cs.AI · cs.LG· cs.NE· stat.ML

RL²: Fast Reinforcement Learning via Slow Reinforcement Learning

Yan Duan , John Schulman , Xi Chen , Peter L. Bartlett , Ilya Sutskever , Pieter Abbeel This is my paper

classification 💻 cs.AI cs.LGcs.NEstat.ML

keywords learningalgorithmreinforcementfastproblemsdeeplarge-scalelearn

0 comments

read the original abstract

Deep reinforcement learning (deep RL) has been successful in learning sophisticated behaviors automatically; however, the learning process requires a huge number of trials. In contrast, animals can learn new tasks in just a few trials, benefiting from their prior knowledge about the world. This paper seeks to bridge this gap. Rather than designing a "fast" reinforcement learning algorithm, we propose to represent it as a recurrent neural network (RNN) and learn it from data. In our proposed method, RL$^2$, the algorithm is encoded in the weights of the RNN, which are learned slowly through a general-purpose ("slow") RL algorithm. The RNN receives all information a typical RL algorithm would receive, including observations, actions, rewards, and termination flags; and it retains its state across episodes in a given Markov Decision Process (MDP). The activations of the RNN store the state of the "fast" RL algorithm on the current (previously unseen) MDP. We evaluate RL$^2$ experimentally on both small-scale and large-scale problems. On the small-scale side, we train it to solve randomly generated multi-arm bandit problems and finite MDPs. After RL$^2$ is trained, its performance on new MDPs is close to human-designed algorithms with optimality guarantees. On the large-scale side, we test RL$^2$ on a vision-based navigation task and show that it scales up to high-dimensional problems.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Risks from Learned Optimization in Advanced Machine Learning Systems
cs.AI 2019-06 accept novelty 9.0

Mesa-optimization arises when learned models act as optimizers with objectives that can differ from their training loss, creating alignment risks in advanced machine learning.
Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution
cs.CL 2023-09 unverdicted novelty 8.0

Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.
Language Models are Few-Shot Learners
cs.CL 2020-05 accept novelty 8.0

GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.
CoRMA: Contrastive RMA for Contact-Rich Meta-Adaptation
cs.RO 2026-05 unverdicted novelty 7.0

CoRMA enables within-episode adaptation for contact-rich robotic assembly by inferring semantic contact context with a causal Transformer and force-regime contrastive objective, retaining higher real success than FORG...
ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents
cs.AI 2026-05 conditional novelty 7.0

ClawForge supplies a generator that turns scenario templates into reproducible command-line tasks testing state conflict handling, where the strongest frontier model scores only 45.3 percent strict accuracy.
ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents
cs.AI 2026-05 unverdicted novelty 7.0

ClawForge is a generator framework that creates reproducible executable benchmarks for command-line agents under state conflict, with ClawForge-Bench showing frontier models reach at most 45.3% strict accuracy and tha...
Zero-shot Imitation Learning by Latent Topology Mapping
cs.LG 2026-05 unverdicted novelty 7.0

ZALT learns latent hub states and hub-to-hub dynamics from demonstrations to plan zero-shot solutions for unseen start-goal tasks, achieving 55% success in a 3D maze versus 6% for baselines.
Solving Rubik's Cube with a Robot Hand
cs.LG 2019-10 accept novelty 7.0

Reinforcement learning models trained only in simulation using automatic domain randomization solve Rubik's cube with a real robot hand.
Searching for Activation Functions
cs.NE 2017-10 conditional novelty 7.0

Automated search discovers Swish activation f(x) = x * sigmoid(βx) that improves top-1 ImageNet accuracy over ReLU by 0.9% on Mobile NASNet-A and 0.6% on Inception-ResNet-v2.
Why Does Agentic Safety Fail to Generalize Across Tasks?
cs.LG 2026-05 conditional novelty 6.0

Agentic safety fails to generalize across tasks because the task-to-safe-controller mapping has a higher Lipschitz constant than the task-to-controller mapping alone, as proven in linear-quadratic control and demonstr...
Dual-Timescale Memory in a Spiking Neuron-Astrocyte Network for Efficient Navigation
q-bio.QM 2026-04 unverdicted novelty 6.0

A neuron-astrocyte network with dual-timescale memory reduces median path lengths up to sixfold in partially observable grid-world navigation tasks.
Generalised Linear Models in Deep Bayesian RL with Learnable Basis Functions
cs.LG 2025-12 unverdicted novelty 6.0

GLiBRL uses GLMs with learnable basis functions for exact Bayesian inference in deep BRL, derives a closed-form link between L2 task distances and kernel task similarity, and reports up to 1.8x gains over prior meta-R...
RAPTOR: A Foundation Policy for Quadrotor Control
cs.RO 2025-09 unverdicted novelty 6.0

A 2084-parameter recurrent policy trained by distilling 1000 RL teacher policies enables zero-shot control across 10 real quadrotors differing in mass, motors, frames, propellers, and flight controllers.
Task-Aware Virtual Training: Enhancing Generalization in Meta-Reinforcement Learning for Out-of-Distribution Tasks
cs.LG 2025-02 unverdicted novelty 6.0

TAVT improves OOD task generalization in meta-RL by preserving task characteristics in virtual tasks via metric learning and using state regularization.
Environment Probing Interaction Policies
cs.RO 2019-07 unverdicted novelty 6.0

EPI policies use a transition-predictability reward to probe environments and condition task policies, outperforming standard generalization methods on novel test environments.
Generalizing from a few environments in safety-critical reinforcement learning
cs.LG 2019-07 unverdicted novelty 6.0

RL agents fail dangerously on unseen environments; ensembles reduce catastrophes in gridworld but not CoinRun, with uncertainty enabling intervention prediction.
Large Language Models for Sequential Decision-Making: Improving In-Context Learning via Supervised Fine-Tuning
cs.LG 2026-05 unverdicted novelty 5.0

Supervised fine-tuning of pretrained LLMs on offline trajectories yields better few-shot sequential decision-making than in-context-only baselines, with a theoretical suboptimality bound derived for linear MDPs by int...
Learning Material-Aware Hamiltonian Risk Fields for Safe Navigation
cs.LG 2026-05 unverdicted novelty 5.0

A learned context-energy term in port-Hamiltonian policies creates selective risk navigation that activates evasive forces only when safer paths are available.
Self-Improving Skill Learning for Robust Skill-based Meta-Reinforcement Learning
cs.LG 2025-02 unverdicted novelty 4.0

SISL adds self-improving decoupled policies and return-based prioritization to skill-based meta-RL to achieve stable adaptation from noisy demonstrations on long-horizon tasks.
The Rise and Potential of Large Language Model Based Agents: A Survey
cs.AI 2023-09 accept novelty 4.0

The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.
Meta-Learning and Meta-Reinforcement Learning -- Tracing the Path towards DeepMind's Adaptive Agent
cs.AI 2026-02 unverdicted novelty 2.0

A survey provides a task-based formalization of meta-learning and meta-RL while chronicling algorithms that lead to DeepMind's Adaptive Agent.