Learning to reinforcement learn

Wang, J · 2016 · cs.LG · arXiv 1611.05763

19 Pith papers cite this work. Polarity classification is still indexing.

abstract

In recent years deep reinforcement learning (RL) systems have attained superhuman performance in a number of challenging task domains. However, a major limitation of such applications is their demand for massive amounts of training data. A critical present objective is thus to develop deep RL methods that can adapt rapidly to new tasks. In the present work we introduce a novel approach to this challenge, which we refer to as deep meta-reinforcement learning. Previous work has shown that recurrent networks can support meta-learning in a fully supervised context. We extend this approach to the RL setting. What emerges is a system that is trained using one RL algorithm, but whose recurrent dynamics implement a second, quite separate RL procedure. This second, learned RL algorithm can differ from the original one in arbitrary ways. Importantly, because it is learned, it is configured to exploit structure in the training domain. We unpack these points in a series of seven proof-of-concept experiments, each of which examines a key aspect of deep meta-RL. We consider prospects for extending and scaling up the approach, and also point out some potentially important implications for neuroscience.
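The interaction loop implied by the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the two-armed bandit task, the class name TinyRecurrentPolicy, the function run_episode, and all sizes are hypothetical, and feeding the previous action and reward back into the network is an assumed input scheme that the abstract itself does not spell out. The weights here are random and untrained; the outer RL algorithm that would train them across many tasks is omitted. The point is only that, within an episode, the weights stay fixed and any adaptation has to be carried by the recurrent state.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


class TinyRecurrentPolicy:
    """Elman-style recurrent policy with fixed, untrained random weights (hypothetical)."""

    def __init__(self, n_actions, n_hidden, rng):
        self.n_actions = n_actions
        self.n_hidden = n_hidden
        n_in = n_actions + 1  # one-hot previous action plus previous reward
        self.w_in = [[rng.gauss(0, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
        self.w_rec = [[rng.gauss(0, 0.5) for _ in range(n_hidden)] for _ in range(n_hidden)]
        self.w_out = [[rng.gauss(0, 0.5) for _ in range(n_hidden)] for _ in range(n_actions)]

    def initial_state(self):
        return [0.0] * self.n_hidden

    def step(self, hidden, prev_action, prev_reward):
        # Assumed input scheme: the policy sees what it did and what it earned last step.
        x = [0.0] * self.n_actions + [prev_reward]
        if prev_action is not None:
            x[prev_action] = 1.0
        new_hidden = [
            math.tanh(
                sum(w * v for w, v in zip(self.w_in[i], x))
                + sum(w * h for w, h in zip(self.w_rec[i], hidden))
            )
            for i in range(self.n_hidden)
        ]
        logits = [sum(w * h for w, h in zip(row, new_hidden)) for row in self.w_out]
        return new_hidden, softmax(logits)


def run_episode(policy, arm_probs, n_steps, rng):
    """Play one bandit episode; weights never change, only the hidden state does."""
    hidden = policy.initial_state()
    prev_action, prev_reward = None, 0.0
    history = []
    for _ in range(n_steps):
        hidden, probs = policy.step(hidden, prev_action, prev_reward)
        action = rng.choices(range(len(probs)), weights=probs)[0]
        reward = 1.0 if rng.random() < arm_probs[action] else 0.0
        history.append((action, reward))
        prev_action, prev_reward = action, reward
    return history
```

In a full meta-RL setup, an outer training loop would sample a new arm_probs for every episode and update the weights with a standard RL algorithm, so that the hidden-state dynamics come to behave like a learning procedure of their own.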

citation-role summary

background 4

citation-polarity summary

background 4

representative citing papers

Risks from Learned Optimization in Advanced Machine Learning Systems

cs.AI · 2019-06-05 · accept · novelty 9.0

Mesa-optimization arises when learned models act as optimizers with objectives that can differ from their training loss, creating alignment risks in advanced machine learning.

Learned Memory Attenuation in Sage-Husa Kalman Filters for Robust UAV State Estimation

eess.SP · 2026-05-18 · unverdicted · novelty 7.0

NDR-SHKF replaces the static forgetting factor in Sage-Husa Kalman Filters with a learned vector-valued memory attenuation policy from a bifurcated recurrent network trained end-to-end on whitened innovations to minimize estimation error.

Harnessing Agentic Evolution

cs.AI · 2026-05-13 · unverdicted · novelty 7.0

AEvo introduces a meta-agent that edits the evolution procedure or agent context based on accumulated state, outperforming baselines by 26% relative improvement on agentic benchmarks and achieving SOTA on open-ended tasks.

Automated Design of Agentic Systems

cs.AI · 2024-08-15 · conditional · novelty 7.0

Meta Agent Search uses a meta-agent to iteratively program novel agentic systems in code, producing agents that outperform state-of-the-art hand-designed ones across coding, science, and math while transferring across domains and models.

Solving Rubik's Cube with a Robot Hand

cs.LG · 2019-10-16 · accept · novelty 7.0

Reinforcement learning models trained only in simulation using automatic domain randomization solve Rubik's cube with a real robot hand.

Zero-shot Imitation Learning by Latent Topology Mapping

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

ZALT learns latent hub states and hub-to-hub dynamics from demonstrations to plan zero-shot solutions for unseen start-goal tasks, achieving 55% success in a 3D maze versus 6% for baselines.

Searching for Activation Functions

cs.NE · 2017-10-16 · conditional · novelty 7.0

Automated search discovers Swish activation f(x) = x * sigmoid(βx) that improves top-1 ImageNet accuracy over ReLU by 0.9% on Mobile NASNet-A and 0.6% on Inception-ResNet-v2.

From Cumulative Constraints to Adaptive Runtime Safety Control for Nonstationary Reinforcement Learning

cs.LG · 2026-05-13 · unverdicted · novelty 6.0

CPSS projects cumulative safety constraints into time-varying per-state thresholds for online action shielding in nonstationary RL, providing per-state guarantees and cumulative bounds.

Generalised Linear Models in Deep Bayesian RL with Learnable Basis Functions

cs.LG · 2025-12-24 · unverdicted · novelty 6.0

GLiBRL uses GLMs with learnable basis functions for exact Bayesian inference in deep BRL, derives a closed-form link between L2 task distances and kernel task similarity, and reports up to 1.8x gains over prior meta-RL on MuJoCo and MetaWorld.

RAPTOR: A Foundation Policy for Quadrotor Control

cs.RO · 2025-09-15 · unverdicted · novelty 6.0

A 2084-parameter recurrent policy trained by distilling 1000 RL teacher policies enables zero-shot control across 10 real quadrotors differing in mass, motors, frames, propellers, and flight controllers.

Environment Probing Interaction Policies

cs.RO · 2019-07-26 · unverdicted · novelty 6.0

EPI policies use a transition-predictability reward to probe environments and condition task policies, outperforming standard generalization methods on novel test environments.

Evolvability ES: Scalable and Direct Optimization of Evolvability

cs.NE · 2019-07-13 · unverdicted · novelty 6.0

Evolvability ES is an evolutionary strategy variant that directly optimizes for evolvability by maximizing behavioral diversity under mutations, tested on 2D/3D locomotion tasks and shown competitive with MAML.

Generalizing from a few environments in safety-critical reinforcement learning

cs.LG · 2019-07-02 · unverdicted · novelty 6.0

RL agents fail dangerously on unseen environments; ensembles reduce catastrophes in gridworld but not CoinRun, with uncertainty enabling intervention prediction.

Relational inductive biases, deep learning, and graph networks

cs.LG · 2018-06-04 · conditional · novelty 6.0

Graph networks unify graph-based neural methods into a general framework with strong relational inductive biases to support combinatorial generalization and structured reasoning in AI.

Safe Continual Reinforcement Learning under Nonstationarity via Adaptive Safety Constraints

cs.LG · 2026-05-13 · unverdicted · novelty 5.0

LILAC+ combines context-based, adaptation-speed, and budget-to-state safety constraints to reduce violations in continual RL under nonstationary conditions, demonstrated in simulated driving tasks.

Partner Modelling Emerges in Recurrent Agents (But Only When It Matters)

cs.AI · 2025-05-22 · unverdicted · novelty 5.0

Model-free RNN agents in Overcooked-AI spontaneously develop structured internal models of partner abilities when they can allocate tasks, enabling adaptation to novel collaborators.

Large Language Models for Sequential Decision-Making: Improving In-Context Learning via Supervised Fine-Tuning

cs.LG · 2026-05-09 · unverdicted · novelty 5.0

Supervised fine-tuning of pretrained LLMs on offline trajectories yields better few-shot sequential decision-making than in-context-only baselines, with a theoretical suboptimality bound derived for linear MDPs by interpreting attention as Q-function estimation.

The Essence of Balance for Self-Improving Agents in Vision-and-Language Navigation

cs.CV · 2026-04-21 · unverdicted · novelty 5.0

SDB balances behavioral diversity and learning stability in VLN self-improvement by expanding decisions into latent hypotheses, performing reliability-aware aggregation, and applying a regularizer, yielding gains such as SPL 33.73 to 35.93 on REVERIE val-unseen.

Self-Monitoring Benefits from Structural Integration: Lessons from Metacognition in Continuous-Time Multi-Timescale Agents

cs.AI · 2026-04-13 · unverdicted · novelty 5.0

Self-monitoring modules in multi-timescale agents fail as auxiliary losses due to collapse but show limited gains when wired into policy decisions, without outperforming simple baselines.

citing papers explorer

Showing 19 of 19 citing papers.

Risks from Learned Optimization in Advanced Machine Learning Systems
cs.AI · 2019-06-05 · accept · none · ref 33 · internal anchor
Mesa-optimization arises when learned models act as optimizers with objectives that can differ from their training loss, creating alignment risks in advanced machine learning.

Learned Memory Attenuation in Sage-Husa Kalman Filters for Robust UAV State Estimation
eess.SP · 2026-05-18 · unverdicted · none · ref 39 · internal anchor
NDR-SHKF replaces the static forgetting factor in Sage-Husa Kalman Filters with a learned vector-valued memory attenuation policy from a bifurcated recurrent network trained end-to-end on whitened innovations to minimize estimation error.

Harnessing Agentic Evolution
cs.AI · 2026-05-13 · unverdicted · none · ref 27 · internal anchor
AEvo introduces a meta-agent that edits the evolution procedure or agent context based on accumulated state, outperforming baselines by 26% relative improvement on agentic benchmarks and achieving SOTA on open-ended tasks.

Automated Design of Agentic Systems
cs.AI · 2024-08-15 · conditional · none · ref 217 · internal anchor
Meta Agent Search uses a meta-agent to iteratively program novel agentic systems in code, producing agents that outperform state-of-the-art hand-designed ones across coding, science, and math while transferring across domains and models.

Solving Rubik's Cube with a Robot Hand
cs.LG · 2019-10-16 · accept · none · ref 116 · internal anchor
Reinforcement learning models trained only in simulation using automatic domain randomization solve Rubik's cube with a real robot hand.

Zero-shot Imitation Learning by Latent Topology Mapping
cs.LG · 2026-05-08 · unverdicted · none · ref 54
ZALT learns latent hub states and hub-to-hub dynamics from demonstrations to plan zero-shot solutions for unseen start-goal tasks, achieving 55% success in a 3D maze versus 6% for baselines.

Searching for Activation Functions
cs.NE · 2017-10-16 · conditional · none · ref 18
Automated search discovers Swish activation f(x) = x * sigmoid(βx) that improves top-1 ImageNet accuracy over ReLU by 0.9% on Mobile NASNet-A and 0.6% on Inception-ResNet-v2.

From Cumulative Constraints to Adaptive Runtime Safety Control for Nonstationary Reinforcement Learning
cs.LG · 2026-05-13 · unverdicted · none · ref 46 · internal anchor
CPSS projects cumulative safety constraints into time-varying per-state thresholds for online action shielding in nonstationary RL, providing per-state guarantees and cumulative bounds.

Generalised Linear Models in Deep Bayesian RL with Learnable Basis Functions
cs.LG · 2025-12-24 · unverdicted · none · ref 32 · internal anchor
GLiBRL uses GLMs with learnable basis functions for exact Bayesian inference in deep BRL, derives a closed-form link between L2 task distances and kernel task similarity, and reports up to 1.8x gains over prior meta-RL on MuJoCo and MetaWorld.

RAPTOR: A Foundation Policy for Quadrotor Control
cs.RO · 2025-09-15 · unverdicted · none · ref 56 · internal anchor
A 2084-parameter recurrent policy trained by distilling 1000 RL teacher policies enables zero-shot control across 10 real quadrotors differing in mass, motors, frames, propellers, and flight controllers.

Environment Probing Interaction Policies
cs.RO · 2019-07-26 · unverdicted · none · ref 26 · internal anchor
EPI policies use a transition-predictability reward to probe environments and condition task policies, outperforming standard generalization methods on novel test environments.

Evolvability ES: Scalable and Direct Optimization of Evolvability
cs.NE · 2019-07-13 · unverdicted · none · ref 45 · internal anchor
Evolvability ES is an evolutionary strategy variant that directly optimizes for evolvability by maximizing behavioral diversity under mutations, tested on 2D/3D locomotion tasks and shown competitive with MAML.

Generalizing from a few environments in safety-critical reinforcement learning
cs.LG · 2019-07-02 · unverdicted · none · ref 37 · internal anchor
RL agents fail dangerously on unseen environments; ensembles reduce catastrophes in gridworld but not CoinRun, with uncertainty enabling intervention prediction.

Relational inductive biases, deep learning, and graph networks
cs.LG · 2018-06-04 · conditional · none · ref 7
Graph networks unify graph-based neural methods into a general framework with strong relational inductive biases to support combinatorial generalization and structured reasoning in AI.

Safe Continual Reinforcement Learning under Nonstationarity via Adaptive Safety Constraints
cs.LG · 2026-05-13 · unverdicted · none · ref 45 · internal anchor
LILAC+ combines context-based, adaptation-speed, and budget-to-state safety constraints to reduce violations in continual RL under nonstationary conditions, demonstrated in simulated driving tasks.

Partner Modelling Emerges in Recurrent Agents (But Only When It Matters)
cs.AI · 2025-05-22 · unverdicted · none · ref 37 · internal anchor
Model-free RNN agents in Overcooked-AI spontaneously develop structured internal models of partner abilities when they can allocate tasks, enabling adaptation to novel collaborators.

Large Language Models for Sequential Decision-Making: Improving In-Context Learning via Supervised Fine-Tuning
cs.LG · 2026-05-09 · unverdicted · none · ref 30
Supervised fine-tuning of pretrained LLMs on offline trajectories yields better few-shot sequential decision-making than in-context-only baselines, with a theoretical suboptimality bound derived for linear MDPs by interpreting attention as Q-function estimation.

The Essence of Balance for Self-Improving Agents in Vision-and-Language Navigation
cs.CV · 2026-04-21 · unverdicted · none · ref 2
SDB balances behavioral diversity and learning stability in VLN self-improvement by expanding decisions into latent hypotheses, performing reliability-aware aggregation, and applying a regularizer, yielding gains such as SPL 33.73 to 35.93 on REVERIE val-unseen.

Self-Monitoring Benefits from Structural Integration: Lessons from Metacognition in Continuous-Time Multi-Timescale Agents
cs.AI · 2026-04-13 · unverdicted · none · ref 28
Self-monitoring modules in multi-timescale agents fail as auxiliary losses due to collapse but show limited gains when wired into policy decisions, without outperforming simple baselines.
