Sample Efficient Actor-Critic with Experience Replay

Sample Efficient Actor-Critic with Experience Replay · 2016 · cs.LG · arXiv 1611.01224

13 Pith papers cite this work. Polarity classification is still indexing.

abstract

This paper presents an actor-critic deep reinforcement learning agent with experience replay that is stable, sample efficient, and performs remarkably well on challenging environments, including the discrete 57-game Atari domain and several continuous control problems. To achieve this, the paper introduces several innovations, including truncated importance sampling with bias correction, stochastic dueling network architectures, and a new trust region policy optimization method.
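
The abstract names truncated importance sampling with bias correction without spelling it out. The following is a minimal sketch, assuming the commonly described form for a discrete action set: the importance ratio rho = pi / mu is clipped at a threshold c for actions drawn from the behavior policy mu, and a correction term weighted by max(0, 1 - c / rho) is added under the target policy pi. The function name, the variable names, and the numbers are hypothetical illustrations, not the authors' code. The sketch computes both terms as exact expectations and checks that their sum equals the on-policy expectation.

```python
def truncated_is_terms(pi, mu, f, c):
    """Exact expectations of the truncated term and the bias-correction term.

    pi, mu: target and behavior action probabilities (mu must be > 0 everywhere).
    f:      per-action quantity being averaged (e.g. an advantage estimate).
    c:      truncation threshold for the importance ratio rho = pi / mu.
    All names here are hypothetical and chosen for illustration only.
    """
    truncated = 0.0   # E_{a ~ mu}[ min(c, rho(a)) * f(a) ]
    correction = 0.0  # E_{a ~ pi}[ max(0, 1 - c / rho(a)) * f(a) ]
    for p, m, fa in zip(pi, mu, f):
        if p == 0.0:
            continue  # rho = 0: this action contributes nothing to either term
        rho = p / m
        truncated += m * min(c, rho) * fa
        correction += p * max(0.0, 1.0 - c / rho) * fa
    return truncated, correction


# Made-up example: the behavior policy rarely picks the action the target prefers.
pi = [0.7, 0.2, 0.1]
mu = [0.1, 0.3, 0.6]
f = [1.0, -2.0, 0.5]
c = 2.0

truncated, correction = truncated_is_terms(pi, mu, f, c)
on_policy = sum(p * fa for p, fa in zip(pi, f))  # E_{a ~ pi}[ f(a) ]

# The two terms together recover the on-policy expectation up to rounding.
assert abs((truncated + correction) - on_policy) < 1e-12
```

In this sketch the weight on behavior-policy samples never exceeds c, and the correction term is nonzero only for actions whose ratio exceeds c, where its weight stays below 1. That is why both terms have bounded weights while their sum remains unbiased.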

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Concentration of General Stochastic Approximation Under Heavy-Tailed Markovian Noise

math.PR · 2026-05-20 · unverdicted · novelty 7.0

Establishes maximal concentration bounds for stochastic approximation under heavy-tailed Markovian noise, with tails ranging from sub-Gaussian to heavier than Weibull depending on step sizes and contractivity properties, plus a truncation argument for unbounded noise.

Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.

Does Synthetic Data Help? Empirical Evidence from Deep Learning Time Series Forecasters

cs.LG · 2026-05-07 · accept · novelty 7.0

Synthetic data augmentation helps channel-mixing time series models but degrades channel-independent ones, with reliable gains only from seasonal-trend generators and gradual schedules in low-resource settings.

Dissecting Discrete Soft Actor-Critic: Limitations and Principled Alternatives

cs.LG · 2025-09-11 · conditional · novelty 6.0

Shows entropy coupling limits DSAC on discrete tasks and introduces a generalized actor-critic framework with m-step critics and novel entropy-regularized objectives that perform robustly on Atari.

Beyond Importance Sampling: Rejection-Gated Policy Optimization

cs.LG · 2026-04-16 · unverdicted · novelty 6.0

RGPO replaces importance sampling with a smooth [0,1] acceptance gate in policy gradients, unifying TRPO/PPO/REINFORCE, bounding variance for heavy-tailed ratios, and showing gains in online RLHF experiments.

Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning

cs.LG · 2019-10-01 · conditional · novelty 6.0

AWR learns policies via advantage-weighted supervised regression on actions, achieving competitive off-policy performance on Gym tasks and strong results from static data alone.

Polychromic Objectives for Reinforcement Learning

cs.LG · 2025-09-29 · unverdicted · novelty 5.0

Introduces polychromic objectives adapted into PPO via vine sampling and modified advantages, showing higher success rates and better coverage under perturbations on BabyAI, Minigrid, and algorithmic tasks.

Multi-Agent Deep Reinforcement Learning for Liquidation Strategy Analysis

q-fin.TR · 2019-06-24 · unverdicted · novelty 5.0

The authors extend the Almgren-Chriss model to a multi-agent setting and apply deep reinforcement learning to simulate and optimize liquidation strategies under practical constraints.

ASymPO: Asymmetric-Scale Policy Optimization for Asynchronous LLM Post-Training Without Behavior Information

cs.LG · 2026-06-02 · unverdicted · novelty 4.0

ASymPO normalizes token losses by average current-policy negative log-probability to restore zero-sum balance in asynchronous LLM RL without behavior information.

To Learn or Not to Learn: Analyzing the Role of Learning for Navigation in Virtual Environments

cs.CV · 2019-07-26 · unverdicted · novelty 4.0

Classical agents outperform learning-based ones on MINOS and Stanford 3D Indoor Spaces, with learned agents weaker at collision avoidance and memory but stronger at handling ambiguity and noise.

A Dual Memory Structure for Efficient Use of Replay Memory in Deep Reinforcement Learning

cs.LG · 2019-07-15 · unverdicted · novelty 4.0

Dual memory (main plus cache) for replay memory in DRL yields higher scores than single memory across three Gym environments.

Optimal Use of Experience in First Person Shooter Environments

cs.LG · 2019-06-24 · unverdicted · novelty 2.0

Empirical tests in VizDoom show multiple DQN updates per step do not improve performance after learning rate adjustment, with a 4:1 update-to-step ratio optimal before significant degradation.

Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

cs.LG · 2020-05-04 · unverdicted · novelty 2.0

Offline RL promises to extract high-utility policies from static datasets but faces fundamental challenges that current methods only partially address.

citing papers explorer

Showing 13 of 13 citing papers.

Concentration of General Stochastic Approximation Under Heavy-Tailed Markovian Noise · math.PR · 2026-05-20 · unverdicted · none · ref 133 · internal anchor
Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling · cs.LG · 2026-05-14 · unverdicted · none · ref 31 · internal anchor
Does Synthetic Data Help? Empirical Evidence from Deep Learning Time Series Forecasters · cs.LG · 2026-05-07 · accept · none · ref 280
Dissecting Discrete Soft Actor-Critic: Limitations and Principled Alternatives · cs.LG · 2025-09-11 · conditional · none · ref 33 · internal anchor
Beyond Importance Sampling: Rejection-Gated Policy Optimization · cs.LG · 2026-04-16 · unverdicted · none · ref 10
Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning · cs.LG · 2019-10-01 · conditional · none · ref 19
Polychromic Objectives for Reinforcement Learning · cs.LG · 2025-09-29 · unverdicted · none · ref 41 · internal anchor
Multi-Agent Deep Reinforcement Learning for Liquidation Strategy Analysis · q-fin.TR · 2019-06-24 · unverdicted · none · ref 21 · internal anchor
ASymPO: Asymmetric-Scale Policy Optimization for Asynchronous LLM Post-Training Without Behavior Information · cs.LG · 2026-06-02 · unverdicted · none · ref 29 · internal anchor
To Learn or Not to Learn: Analyzing the Role of Learning for Navigation in Virtual Environments · cs.CV · 2019-07-26 · unverdicted · none · ref 34 · internal anchor
A Dual Memory Structure for Efficient Use of Replay Memory in Deep Reinforcement Learning · cs.LG · 2019-07-15 · unverdicted · none · ref 6 · internal anchor
Optimal Use of Experience in First Person Shooter Environments · cs.LG · 2019-06-24 · unverdicted · none · ref 13 · internal anchor
Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems · cs.LG · 2020-05-04 · unverdicted · none · ref 215