pith. machine review for the scientific record.

arxiv: 1611.05763 · v3 · submitted 2016-11-17 · 💻 cs.LG · cs.AI · stat.ML

Recognition: unknown

Learning to reinforcement learn

Authors on Pith: no claims yet
classification 💻 cs.LG · cs.AI · stat.ML
keywords deep, approach, learning, algorithm, learned, recurrent, reinforcement, second
0 comments
read the original abstract

In recent years deep reinforcement learning (RL) systems have attained superhuman performance in a number of challenging task domains. However, a major limitation of such applications is their demand for massive amounts of training data. A critical present objective is thus to develop deep RL methods that can adapt rapidly to new tasks. In the present work we introduce a novel approach to this challenge, which we refer to as deep meta-reinforcement learning. Previous work has shown that recurrent networks can support meta-learning in a fully supervised context. We extend this approach to the RL setting. What emerges is a system that is trained using one RL algorithm, but whose recurrent dynamics implement a second, quite separate RL procedure. This second, learned RL algorithm can differ from the original one in arbitrary ways. Importantly, because it is learned, it is configured to exploit structure in the training domain. We unpack these points in a series of seven proof-of-concept experiments, each of which examines a key aspect of deep meta-RL. We consider prospects for extending and scaling up the approach, and also point out some potentially important implications for neuroscience.
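The core idea in the abstract — a recurrent network that is fed its own previous action and reward, with hidden state carried within an episode but reset between tasks, so that a trained network's recurrent dynamics can implement a second, learned RL procedure — can be sketched as a training-loop skeleton. This is a minimal illustration, not the paper's implementation: the class and function names are hypothetical, the policy is an untrained stub standing in for an LSTM, and the outer RL weight update (the paper uses policy gradient methods) is omitted.

```python
import random

class RecurrentAgentStub:
    """Stand-in for a recurrent policy network (hypothetical; the paper uses an LSTM).

    The hidden state is the memory in which the second, learned RL procedure
    would run: within an episode it accumulates the action/reward history, so a
    trained network could adapt its choices without any weight updates.
    """
    def __init__(self, n_actions, seed=0):
        self.n_actions = n_actions
        self.rng = random.Random(seed)
        self.hidden = None

    def reset(self):
        # Hidden state is reset between episodes (i.e. between sampled tasks),
        # forcing adaptation to happen inside the recurrent dynamics.
        self.hidden = []

    def act(self, prev_action, prev_reward):
        # Key input convention of deep meta-RL: the network observes its own
        # previous action and reward, which is what allows its dynamics to
        # implement a learning algorithm.
        self.hidden.append((prev_action, prev_reward))
        return self.rng.randrange(self.n_actions)  # stub: untrained policy

def run_episode(agent, arm_probs, n_trials, rng):
    """One episode = one bandit task drawn from the task distribution."""
    agent.reset()
    prev_action, prev_reward, total = 0, 0.0, 0.0
    for _ in range(n_trials):
        a = agent.act(prev_action, prev_reward)
        r = 1.0 if rng.random() < arm_probs[a] else 0.0
        prev_action, prev_reward = a, r
        total += r
    return total

# Meta-training skeleton: a fresh task (here, a two-armed Bernoulli bandit) is
# sampled each episode; an outer RL algorithm (omitted) would update the
# network weights on the episode return.
rng = random.Random(1)
agent = RecurrentAgentStub(n_actions=2)
returns = [run_episode(agent, [rng.random(), rng.random()], n_trials=50, rng=rng)
           for _ in range(10)]
```

The within-episode hidden state and the across-episode reset are the two structural commitments that matter here; everything else (bandit tasks, stub policy) is placeholder.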

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 10 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Risks from Learned Optimization in Advanced Machine Learning Systems

    cs.AI 2019-06 accept novelty 9.0

    Mesa-optimization arises when learned models act as optimizers with objectives that can differ from their training loss, creating alignment risks in advanced machine learning.

  2. Harnessing Agentic Evolution

    cs.AI 2026-05 unverdicted novelty 7.0

    AEvo introduces a meta-agent that edits the evolution procedure or agent context based on accumulated state, outperforming baselines by 26% relative improvement on agentic benchmarks and achieving SOTA on open-ended tasks.

  3. Zero-shot Imitation Learning by Latent Topology Mapping

    cs.LG 2026-05 unverdicted novelty 7.0

    ZALT learns latent hub states and hub-to-hub dynamics from demonstrations to plan zero-shot solutions for unseen start-goal tasks, achieving 55% success in a 3D maze versus 6% for baselines.

  4. Automated Design of Agentic Systems

    cs.AI 2024-08 conditional novelty 7.0

    Meta Agent Search uses a meta-agent to iteratively program novel agentic systems in code, producing agents that outperform state-of-the-art hand-designed ones across coding, science, and math while transferring across...

  5. Solving Rubik's Cube with a Robot Hand

    cs.LG 2019-10 accept novelty 7.0

Reinforcement learning models trained only in simulation with automatic domain randomization solve the Rubik's Cube on a real robot hand.

  6. Searching for Activation Functions

    cs.NE 2017-10 conditional novelty 7.0

    Automated search discovers Swish activation f(x) = x * sigmoid(βx) that improves top-1 ImageNet accuracy over ReLU by 0.9% on Mobile NASNet-A and 0.6% on Inception-ResNet-v2.

  7. Relational inductive biases, deep learning, and graph networks

    cs.LG 2018-06 conditional novelty 6.0

    Graph networks unify graph-based neural methods into a general framework with strong relational inductive biases to support combinatorial generalization and structured reasoning in AI.

  8. Large Language Models for Sequential Decision-Making: Improving In-Context Learning via Supervised Fine-Tuning

    cs.LG 2026-05 unverdicted novelty 5.0

    Supervised fine-tuning of pretrained LLMs on offline trajectories yields better few-shot sequential decision-making than in-context-only baselines, with a theoretical suboptimality bound derived for linear MDPs by int...

  9. The Essence of Balance for Self-Improving Agents in Vision-and-Language Navigation

    cs.CV 2026-04 unverdicted novelty 5.0

    SDB balances behavioral diversity and learning stability in VLN self-improvement by expanding decisions into latent hypotheses, performing reliability-aware aggregation, and applying a regularizer, yielding gains such...

  10. Self-Monitoring Benefits from Structural Integration: Lessons from Metacognition in Continuous-Time Multi-Timescale Agents

    cs.AI 2026-04 unverdicted novelty 5.0

    Self-monitoring modules in multi-timescale agents fail as auxiliary losses due to collapse but show limited gains when wired into policy decisions, without outperforming simple baselines.