Asynchronous Methods for Deep Reinforcement Learning
We propose a conceptually simple and lightweight framework for deep reinforcement learning that uses asynchronous gradient descent for optimization of deep neural network controllers. We present asynchronous variants of four standard reinforcement learning algorithms and show that parallel actor-learners have a stabilizing effect on training, allowing all four methods to successfully train neural network controllers. The best performing method, an asynchronous variant of actor-critic, surpasses the current state-of-the-art on the Atari domain while training for half the time on a single multi-core CPU instead of a GPU. Furthermore, we show that asynchronous actor-critic succeeds on a wide variety of continuous motor control problems as well as on a new task of navigating random 3D mazes using a visual input.
Forward citations
Cited by 14 Pith papers
-
KL for a KL: On-Policy Distillation with Control Variate Baseline
vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensiv...
-
Planning in entropy-regularized Markov decision processes and games
SmoothCruiser achieves O~(1/epsilon^4) problem-independent sample complexity for value estimation in entropy-regularized MDPs and games via a generative model.
-
OpenAI Gym
OpenAI Gym introduces a common interface for reinforcement learning environments and a results-sharing website to enable consistent algorithm comparisons.
-
Delay-Empowered Causal Hierarchical Reinforcement Learning
DECHRL models causal structures and stochastic delay distributions within hierarchical RL and incorporates them into a delay-aware empowerment objective to improve performance under temporal uncertainty.
-
Error whitening: Why Gauss-Newton outperforms Newton
Gauss-Newton descent whitens errors by projecting Newton directions or gradients onto the tangent space, replacing JJ^T with the identity and removing parameterization distortions that affect Newton descent.
-
Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities
LQL turns n-step action-sequence lower bounds into a practical hinge-loss stabilizer for off-policy Q-learning without extra networks or forward passes.
-
Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities
LQL stabilizes Q-learning by penalizing violations of n-step action-sequence lower bounds with a hinge loss computed from standard network outputs.
-
A Meta Reinforcement Learning Approach to Goals-Based Wealth Management
MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.
-
Distill-Belief: Closed-Loop Inverse Source Localization and Characterization in Physical Fields
Distill-Belief distills Bayesian information-gain signals from a particle-filter teacher into a compact student policy for fast closed-loop source localization and parameter estimation while avoiding reward hacking.
-
Language Models (Mostly) Know What They Know
Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
-
A General Language Assistant as a Laboratory for Alignment
Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
-
DeepMind Control Suite
The DeepMind Control Suite supplies a standardized collection of continuous control tasks with interpretable rewards for benchmarking reinforcement learning agents.
-
Beyond Distribution Sharpening: The Importance of Task Rewards
Task-reward reinforcement learning yields robust gains on math benchmarks for models like Llama-3.2-3B while distribution sharpening alone delivers only limited and unstable improvements.
-
Closed-Loop CO2 Storage Control With History-Based Reinforcement Learning and Latent Model-Based Adaptation
History-conditioned RL policies recover nearly all privileged-state performance with deployable well data, and latent model-based retuning outperforms direct model-free retuning under abnormal reservoir conditions.