Experience Replay Optimization

Daochen Zha; Kaixiong Zhou; Kwei-Herng Lai; Xia Hu

arxiv: 1906.08387 · v1 · pith:HV6QLK3Bnew · submitted 2019-06-19 · 💻 cs.LG · stat.ML


Pith reviewed 2026-05-25 20:01 UTC · model grok-4.3

classification 💻 cs.LG stat.ML

keywords: experience replay, reinforcement learning, off-policy algorithms, replay policy, continuous control


The pith

Learning a replay policy by alternating updates with the agent policy improves off-policy reinforcement learning performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that experience replay in off-policy RL can be improved by training a separate replay policy whose goal is to select past experiences that maximize the agent's eventual cumulative reward. Current uniform or rule-based replay strategies are treated as potentially suboptimal, so the method instead learns the selection process directly. Challenges of noisy large buffers and unstable reward signals are addressed by alternately updating the agent policy on replayed data and then updating the replay policy to supply better data. Experiments on continuous control tasks are presented as evidence that this yields measurable gains over standard replay methods.

Core claim

The central claim is that a replay policy can be learned to optimize the cumulative reward by alternately updating two policies: the agent policy is trained to maximize reward from the selected experiences, while the replay policy is trained to choose the most useful experiences from the memory buffer for the agent.

What carries the argument

The ERO framework that alternately updates the agent policy on replayed data and the replay policy to supply experiences maximizing the agent's cumulative reward.
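The alternation described above can be illustrated with a minimal sketch. None of the mechanics below are taken from the paper, since the text at hand does not give them: the Bernoulli keep/drop mask over buffer entries, the REINFORCE-style update (Williams, 1992), the use of the change in agent return as the replay policy's reward, and the names `ReplayPolicy`, `alternate`, `agent_update`, and `evaluate` are all assumptions chosen for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class ReplayPolicy:
    """Hypothetical replay policy: one keep-probability per stored transition."""

    def __init__(self, n_features, lr=0.1, seed=0):
        self.w = [0.0] * n_features
        self.lr = lr
        self.rng = random.Random(seed)

    def scores(self, features):
        # features: list of per-transition feature vectors (lists of floats).
        return [sigmoid(sum(wj * xj for wj, xj in zip(self.w, x))) for x in features]

    def sample_mask(self, features):
        # Independent Bernoulli keep/drop decision for each transition.
        return [1.0 if p > self.rng.random() else 0.0 for p in self.scores(features)]

    def update(self, features, mask, replay_reward):
        # REINFORCE step: d/dw log Bernoulli(m | p) = (m - p) * x,
        # averaged over the buffer and scaled by the replay reward.
        probs = self.scores(features)
        n = len(features)
        for j in range(len(self.w)):
            grad_j = sum((m - p) * x[j] for m, p, x in zip(mask, probs, features)) / n
            self.w[j] += self.lr * replay_reward * grad_j


def alternate(agent_update, evaluate, features, policy, n_rounds=10):
    """Alternate agent updates on selected data with replay-policy updates."""
    prev_return = evaluate()
    for _ in range(n_rounds):
        mask = policy.sample_mask(features)
        selected = [i for i, m in enumerate(mask) if m == 1.0]
        if selected:
            agent_update(selected)  # (i) train the agent on the replayed subset
        new_return = evaluate()
        # (ii) assumed replay reward: improvement in the agent's return
        policy.update(features, mask, new_return - prev_return)
        prev_return = new_return
    return prev_return
```

Here `agent_update` stands in for one training step of any off-policy learner on the selected indices, and `evaluate` for an estimate of its cumulative reward. Using the return difference rather than the raw return is one plausible way to damp the unstable reward signal the paper mentions, but it is a design choice of this sketch.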

If this is right

Off-policy algorithms gain a learned mechanism for prioritizing experiences instead of relying on fixed rules.
The alternating update scheme enables the replay policy to adapt as the agent improves.
Performance gains appear across multiple continuous control environments when the replay policy is optimized this way.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same alternating-optimization idea could be tested in settings with very large replay buffers where manual prioritization becomes impractical.
If the replay policy generalizes across tasks, it might reduce the need for environment-specific replay heuristics.
Extending the approach to discrete-action or model-based RL would test whether the core alternation pattern transfers.

Load-bearing premise

A replay policy can be stably learned and provide net benefit despite the replay memory being noisy and large and the cumulative reward signal being unstable.

What would settle it

Running the method on standard continuous control benchmarks and finding that final performance or sample efficiency does not exceed that of uniform replay or prioritized experience replay.
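For reference, the two baselines named in that test can be sketched as follows. This is a minimal illustration, not the paper's experimental code: the proportional prioritization rule follows the standard formulation of prioritized experience replay (Schaul et al., 2016), while the function names and the default values of `alpha` and `eps` are assumptions made here.

```python
import random


def uniform_sample(buffer_size, batch_size, rng):
    """Uniform replay: every stored transition is equally likely."""
    return [rng.randrange(buffer_size) for _ in range(batch_size)]


def prioritized_probs(td_errors, alpha=0.6, eps=1e-6):
    """Proportional prioritization.

    P(i) = p_i**alpha / sum_k p_k**alpha, with p_i = |delta_i| + eps.
    alpha = 0 recovers uniform sampling.
    """
    priorities = [(abs(d) + eps) ** alpha for d in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


def prioritized_sample(td_errors, batch_size, rng, alpha=0.6):
    """Draw indices with probability proportional to the priorities."""
    probs = prioritized_probs(td_errors, alpha)
    return rng.choices(range(len(td_errors)), weights=probs, k=batch_size)
```

The sketch omits the importance-sampling correction that prioritized replay normally applies to the loss, so it shows only how the sampling distributions differ.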

Figures

Figures reproduced from arXiv: 1906.08387 by Daochen Zha, Kaixiong Zhou, Kwei-Herng Lai, Xia Hu.

**Figure 2.** Each task is run 5 times for 2 × 10^6 timesteps using different random seeds, and the average return over episodes is reported. We make the following observations. First, the proposed ERO consistently outperforms all the baselines on most of the continuous control tasks in terms of sample efficiency. On HalfCheetah, InvertedPendulum, and InvertedDoublePendulum, ERO performs clearly better than Vanilla-…

**Figure 2.** Performance comparison of ERO against baselines on 8 continuous control tasks. The shaded area represents mean…

**Figure 3.** Comparison of running time in seconds for ERO and…

**Figure 4.** The evolution of TD error (top), timestep difference (mid…
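The first caption above reports returns averaged over runs with different random seeds. A minimal sketch of that kind of aggregation follows. The function name `aggregate_runs` is hypothetical, and the choice of the sample standard deviation as the spread measure is an assumption, since the caption describing the shaded area is truncated.

```python
import statistics


def aggregate_runs(returns_per_seed):
    """Combine learning curves from several seeds.

    returns_per_seed: one list of evaluation returns per random seed,
    all of equal length (one entry per evaluation point).
    Returns the per-point mean and sample standard deviation.
    """
    means, stds = [], []
    for values in zip(*returns_per_seed):
        means.append(statistics.mean(values))
        stds.append(statistics.stdev(values))  # requires at least 2 seeds
    return means, stds
```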

Original abstract

Experience replay enables reinforcement learning agents to memorize and reuse past experiences, just as humans replay memories for the situation at hand. Contemporary off-policy algorithms either replay past experiences uniformly or utilize a rule-based replay strategy, which may be sub-optimal. In this work, we consider learning a replay policy to optimize the cumulative reward. Replay learning is challenging because the replay memory is noisy and large, and the cumulative reward is unstable. To address these issues, we propose a novel experience replay optimization (ERO) framework which alternately updates two policies: the agent policy, and the replay policy. The agent is updated to maximize the cumulative reward based on the replayed data, while the replay policy is updated to provide the agent with the most useful experiences. The conducted experiments on various continuous control tasks demonstrate the effectiveness of ERO, empirically showing promise in experience replay learning to improve the performance of off-policy reinforcement learning algorithms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ERO's alternating-update scheme for learning a replay policy is a fresh construction on top of standard off-policy methods, but the abstract gives almost no mechanics or numbers, so the practical payoff is still unclear.


The one thing to know is that the paper frames replay selection itself as a learnable policy and updates it in alternation with the agent policy. That construction does not appear in the uniform or prioritized replay baselines they cite, so the core idea is new rather than a minor tweak. They motivate it by pointing out that the replay buffer is noisy and large while the reward signal is unstable, and they claim the alternating scheme lets the replay policy supply more useful data to the agent. Experiments on continuous-control tasks are said to show gains over existing off-policy algorithms. That is the extent of what is actually demonstrated in the text we have. The method is presented as a modular addition that could sit on top of existing algorithms, which is a reasonable scope. What is missing is any visible description of the replay policy's objective, how the alternation is scheduled, what architecture is used for the replay policy, or even a single quantitative result with error bars. Without those pieces the central claim that the method stably improves performance rests on an uncheckable empirical assertion. The paper is aimed at people already running off-policy continuous-control experiments who might want to try a learned replay component. It is coherent on its own terms and engages the literature at the level of the abstract, so it is worth sending to referees who can ask for the missing implementation and result details. I would not cite it yet, but a serious editor should let it go to review rather than desk-reject.

Referee Report

0 major / 3 minor

Summary. The manuscript introduces Experience Replay Optimization (ERO), a framework for off-policy RL that learns a replay policy to select experiences from a noisy, large replay buffer in order to maximize the agent's cumulative reward. The method alternates between (i) updating the agent policy on data sampled by the current replay policy and (ii) updating the replay policy to supply the experiences most useful to the agent. Experiments on continuous-control tasks are presented as empirical evidence that ERO improves performance relative to uniform or rule-based replay.

Significance. If the reported gains are reproducible and the alternating optimization is stable, the work supplies a concrete, learnable alternative to hand-designed replay heuristics. This is a direct attack on a long-standing practical bottleneck in off-policy methods; a successful instantiation would be of immediate interest to the continuous-control and robotics communities.

minor comments (3)

The abstract states that the replay memory is 'noisy and large' and the cumulative reward is 'unstable,' yet the manuscript does not quantify these difficulties (e.g., variance of TD targets or buffer size relative to episode length) before claiming that the alternating scheme resolves them.
Section 4 (experiments) should report the precise off-policy base algorithms, the number of random seeds, and whether the replay-policy parameters are shared across tasks or re-learned per environment.
The notation distinguishing the agent policy π and the replay policy ρ is introduced only in the method section; a short table of symbols at the beginning would improve readability.
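The first comment asks for the difficulties to be quantified. One way this could be done is sketched below, under assumptions: the one-step TD target is the standard `r + gamma * Q(s', a')` form with terminal states masked, and the function names, the dictionary keys, and the default `gamma` are illustrative rather than taken from the manuscript.

```python
import statistics


def td_targets(rewards, next_q_values, dones, gamma=0.99):
    """One-step TD targets y = r + gamma * (1 - done) * Q(s', a')."""
    return [
        r + gamma * (0.0 if done else q)
        for r, q, done in zip(rewards, next_q_values, dones)
    ]


def buffer_diagnostics(rewards, next_q_values, dones, episode_lengths, gamma=0.99):
    """Summarize how noisy and how large a replay buffer is."""
    targets = td_targets(rewards, next_q_values, dones, gamma)
    return {
        # Spread of the regression targets the critic is trained on.
        "td_target_variance": statistics.pvariance(targets),
        # Number of stored transitions per average episode.
        "buffer_to_episode_ratio": len(rewards) / statistics.mean(episode_lengths),
    }
```

Reporting these two numbers per environment, before and after training, would give the "noisy and large" claim a concrete scale.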

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. We are pleased that the potential of ERO as a learnable alternative to hand-designed replay heuristics is recognized.

Circularity Check

0 steps flagged

No significant circularity identified


The provided abstract and description present ERO as an alternating-update framework for learning a replay policy alongside the agent policy, with effectiveness shown via experiments on continuous control tasks. No equations, parameter fits, self-citations, or uniqueness theorems appear in the text that would reduce any claimed prediction or result to its own inputs by construction. The load-bearing elements are the empirical demonstrations and the stated handling of noisy replay memory, both of which are externally falsifiable and independent of the method's internal definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities are stated. The central claim rests on the unstated assumption that the replay policy can be optimized despite noisy memory and unstable reward signals.

pith-pipeline@v0.9.0 · 5681 in / 1003 out tokens · 16593 ms · 2026-05-25T20:01:56.005013+00:00 · methodology


Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · 3 internal anchors

[1] [Andrychowicz et al., 2017] Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In NeurIPS, 2017.
[2] [Brockman et al., 2016] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
[3] [Fan et al., 2018] Yang Fan, Fei Tian, Tao Qin, Xiang-Yang Li, and Tie-Yan Liu. Learning to teach. In ICLR, 2018.
[4] [Hessel et al., 2018] Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In AAAI, 2018.
[5] [Isele and Cosgun, 2018] David Isele and Akansel Cosgun. Selective experience replay for lifelong learning. In AAAI, 2018.
[6] [Lillicrap et al., 2016] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In ICLR, 2016.
[7] [Lin, 1992] Long-Ji Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning, 8(3-4):293–321, 1992.
[8] [Lin, 1993] Long-Ji Lin. Reinforcement learning for robots using neural networks. Technical report, Carnegie-Mellon Univ Pittsburgh PA School of Computer Science, 1993.
[9] [Liu and Zou, 2017] Ruishan Liu and James Zou. The effects of memory replay in reinforcement learning. arXiv preprint arXiv:1710.06574, 2017.
[10] [Mattar and Daw, 2018] Marcelo Gomes Mattar and Nathaniel D Daw. Prioritized memory access explains planning and hippocampal replay. bioRxiv, page 225664, 2018.
[11] [Mnih et al., 2013] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. In NIPS Deep Learning Workshop, 2013.
[12] [Mnih et al., 2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
[13] [Novati and Koumoutsakos, 2018] Guido Novati and Petros Koumoutsakos. Remember and forget for experience replay. arXiv preprint arXiv:1807.05827, 2018.
[14] [Pan et al., 2018] Yangchen Pan, Muhammad Zaheer, Adam White, Andrew Patterson, and Martha White. Organizing experience: a deeper look at replay mechanisms for sample-based planning in continuous state domains. In IJCAI, 2018.
[15] [Schaul et al., 2016] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. In ICML, 2016.
[16] [Shohamy and Daw, 2015] Daphna Shohamy and Nathaniel D Daw. Integrating memories to guide decisions. Current Opinion in Behavioral Sciences, 5:85–90, 2015.
[17] [Silver et al., 2014] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In ICML, 2014.
[18] [Sutton and Barto, 2018] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
[19] [Todorov et al., 2012] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In IROS, 2012.
[20] [Van Hasselt et al., 2016] Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning. In AAAI, 2016.
[21] [Wang et al., 2017] Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Remi Munos, Koray Kavukcuoglu, and Nando de Freitas. Sample efficient actor-critic with experience replay. In ICLR, 2017.
[22] [Williams, 1992] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
[23] [Wu et al., 2018] Lijun Wu, Fei Tian, Yingce Xia, Yang Fan, Tao Qin, Lai Jian-Huang, and Tie-Yan Liu. Learning to teach with dynamic loss functions. In NeurIPS, 2018.
[24] [Yin and Pan, 2017] Haiyan Yin and Sinno Jialin Pan. Knowledge transfer for deep reinforcement learning with hierarchical experience replay. In AAAI, 2017.
[25] [Zhang and Sutton, 2017] Shangtong Zhang and Richard S Sutton. A deeper look at experience replay. NIPS Deep Reinforcement Learning Symposium, 2017.
[26] [Zheng et al., 2018] Zeyu Zheng, Junhyuk Oh, and Satinder Singh. On learning intrinsic rewards for policy gradient methods. In NeurIPS, 2018.
