Asynchronous Methods for Deep Reinforcement Learning
We propose a conceptually simple and lightweight framework for deep reinforcement learning that uses asynchronous gradient descent for optimization of deep neural network controllers. We present asynchronous variants of four standard reinforcement learning algorithms and show that parallel actor-learners have a stabilizing effect on training, allowing all four methods to successfully train neural network controllers. The best performing method, an asynchronous variant of actor-critic, surpasses the current state-of-the-art on the Atari domain while training for half the time on a single multi-core CPU instead of a GPU. Furthermore, we show that asynchronous actor-critic succeeds on a wide variety of continuous motor control problems as well as on a new task of navigating random 3D mazes using a visual input.
Forward citations
Cited by 14 Pith papers
-
KL for a KL: On-Policy Distillation with Control Variate Baseline
vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensiv...
-
Planning in entropy-regularized Markov decision processes and games
SmoothCruiser achieves O~(1/epsilon^4) problem-independent sample complexity for value estimation in entropy-regularized MDPs and games via a generative model.
-
OpenAI Gym
OpenAI Gym introduces a common interface for reinforcement learning environments and a results-sharing website to enable consistent algorithm comparisons.
-
Delay-Empowered Causal Hierarchical Reinforcement Learning
DECHRL models causal structures and stochastic delay distributions within hierarchical RL and incorporates them into a delay-aware empowerment objective to improve performance under temporal uncertainty.
-
Error whitening: Why Gauss-Newton outperforms Newton
Gauss-Newton descent whitens errors by projecting Newton directions or gradients onto the tangent space, replacing JJ^T with the identity and removing parameterization distortions that affect Newton descent.
-
Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities
LQL turns n-step action-sequence lower bounds into a practical hinge-loss stabilizer for off-policy Q-learning without extra networks or forward passes.
-
Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities
LQL stabilizes Q-learning by penalizing violations of n-step action-sequence lower bounds with a hinge loss computed from standard network outputs.
-
A Meta Reinforcement Learning Approach to Goals-Based Wealth Management
MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.
-
Distill-Belief: Closed-Loop Inverse Source Localization and Characterization in Physical Fields
Distill-Belief distills Bayesian information-gain signals from a particle-filter teacher into a compact student policy for fast closed-loop source localization and parameter estimation while avoiding reward hacking.
-
Language Models (Mostly) Know What They Know
Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
-
A General Language Assistant as a Laboratory for Alignment
Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
-
DeepMind Control Suite
The DeepMind Control Suite supplies a standardized collection of continuous control tasks with interpretable rewards for benchmarking reinforcement learning agents.
-
Beyond Distribution Sharpening: The Importance of Task Rewards
Task-reward reinforcement learning yields robust gains on math benchmarks for models like Llama-3.2-3B while distribution sharpening alone delivers only limited and unstable improvements.
-
Closed-Loop CO2 Storage Control With History-Based Reinforcement Learning and Latent Model-Based Adaptation
History-conditioned RL policies recover nearly all privileged-state performance with deployable well data, and latent model-based retuning outperforms direct model-free retuning under abnormal reservoir conditions.