Experience Replay Optimization
Pith reviewed 2026-05-25 20:01 UTC · model grok-4.3
The pith
Learning a replay policy by alternating updates with the agent policy improves off-policy reinforcement learning performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a replay policy can be learned to optimize the cumulative reward by alternately updating two policies: the agent policy is trained to maximize reward from the selected experiences, while the replay policy is trained to choose the most useful experiences from the memory buffer for the agent.
What carries the argument
The ERO framework, which alternates two updates: the agent policy is trained on the replayed data, and the replay policy is trained to supply the experiences that most increase the agent's cumulative reward.
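The alternation can be sketched on a toy problem. This is a minimal illustration, not the authors' implementation: the Bernoulli keep/drop mask, the logistic scoring function, the REINFORCE-style replay update, the scalar stand-in for the agent, and every function name are assumptions introduced here, because the summary does not specify them.

```python
import numpy as np

def sigmoid(z):
    # Clip to avoid overflow warnings for extreme scores.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

def sample_mask(features, w, rng):
    """Assumed replay policy: score each transition, then draw a Bernoulli mask."""
    probs = sigmoid(features @ w)
    mask = (rng.random(probs.shape) < probs).astype(float)
    return mask, probs

def agent_update(theta, targets, mask, lr=0.5):
    """Hypothetical agent step: move theta toward the mean of the selected targets."""
    if mask.sum() == 0:
        return theta  # nothing was replayed, so the agent is unchanged
    selected_mean = (mask * targets).sum() / mask.sum()
    return theta + lr * (selected_mean - theta)

def evaluate(theta, optimum=1.0):
    """Hypothetical stand-in for the agent's cumulative reward."""
    return -(theta - optimum) ** 2

def replay_update(w, features, mask, probs, replay_reward, lr=0.1):
    """REINFORCE step; for a logistic Bernoulli, grad log p(mask) = (mask - probs) * x."""
    grad = ((mask - probs)[:, None] * features).sum(axis=0)
    return w + lr * replay_reward * grad

def run_ero_sketch(steps=200, n=64, seed=0):
    rng = np.random.default_rng(seed)
    # Toy buffer: "useful" transitions point at the optimum, the rest are noise.
    useful = rng.random(n) < 0.5
    targets = np.where(useful, 1.0, rng.normal(0.0, 3.0, n))
    features = np.stack([useful.astype(float), np.ones(n)], axis=1)
    w, theta = np.zeros(2), 0.0
    for _ in range(steps):
        mask, probs = sample_mask(features, w, rng)        # replay policy selects
        before = evaluate(theta)
        theta = agent_update(theta, targets, mask)         # agent learns from selection
        replay_reward = evaluate(theta) - before           # improvement credited to replay
        w = replay_update(w, features, mask, probs, replay_reward)
    return w, theta
```

Using the change in evaluated return as the replay policy's reward is one plausible way to express "most useful experiences"; other credit signals would fit the same loop.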
If this is right
- Off-policy algorithms gain a learned mechanism for prioritizing experiences instead of relying on fixed rules.
- The alternating update scheme enables the replay policy to adapt as the agent improves.
- Performance gains appear across multiple continuous control environments when the replay policy is optimized this way.
Where Pith is reading between the lines
- The same alternating-optimization idea could be tested in settings with very large replay buffers where manual prioritization becomes impractical.
- If the replay policy generalizes across tasks, it might reduce the need for environment-specific replay heuristics.
- Extending the approach to discrete-action or model-based RL would test whether the core alternation pattern transfers.
Load-bearing premise
A replay policy can be stably learned and provide net benefit despite the replay memory being noisy and large and the cumulative reward signal being unstable.
What would settle it
Running the method on standard continuous control benchmarks and finding that final performance or sample efficiency does not exceed that of uniform replay or prioritized experience replay.
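One way to make this test concrete is a bootstrap comparison of per-seed final returns. The sketch below is an assumption about how such a comparison could be run, not a procedure taken from the paper; the function name and the 95% percentile interval are illustrative choices.

```python
import numpy as np

def bootstrap_mean_diff(method_returns, baseline_returns, n_boot=10_000, seed=0):
    """Mean difference in final return with a 95% percentile bootstrap interval.

    Each input holds one final return per random seed (hypothetical layout).
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(method_returns, dtype=float)
    b = np.asarray(baseline_returns, dtype=float)
    # Resample seeds with replacement, independently for each method.
    idx_a = rng.integers(0, a.size, size=(n_boot, a.size))
    idx_b = rng.integers(0, b.size, size=(n_boot, b.size))
    diffs = a[idx_a].mean(axis=1) - b[idx_b].mean(axis=1)
    low, high = np.percentile(diffs, [2.5, 97.5])
    return a.mean() - b.mean(), low, high
```

If the interval for ERO minus uniform replay (or ERO minus prioritized experience replay) contains zero or lies below it, the claim of a net benefit would not be supported on that benchmark.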
Original abstract
Experience replay enables reinforcement learning agents to memorize and reuse past experiences, just as humans replay memories for the situation at hand. Contemporary off-policy algorithms either replay past experiences uniformly or utilize a rule-based replay strategy, which may be sub-optimal. In this work, we consider learning a replay policy to optimize the cumulative reward. Replay learning is challenging because the replay memory is noisy and large, and the cumulative reward is unstable. To address these issues, we propose a novel experience replay optimization (ERO) framework which alternately updates two policies: the agent policy, and the replay policy. The agent is updated to maximize the cumulative reward based on the replayed data, while the replay policy is updated to provide the agent with the most useful experiences. The conducted experiments on various continuous control tasks demonstrate the effectiveness of ERO, empirically showing promise in experience replay learning to improve the performance of off-policy reinforcement learning algorithms.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Experience Replay Optimization (ERO), a framework for off-policy RL that learns a replay policy to select experiences from a noisy, large replay buffer in order to maximize the agent's cumulative reward. The method alternates between (i) updating the agent policy on data sampled by the current replay policy and (ii) updating the replay policy to supply the experiences most useful to the agent. Experiments on continuous-control tasks are presented as empirical evidence that ERO improves performance relative to uniform or rule-based replay.
Significance. If the reported gains are reproducible and the alternating optimization is stable, the work supplies a concrete, learnable alternative to hand-designed replay heuristics. This is a direct attack on a long-standing practical bottleneck in off-policy methods; a successful instantiation would be of immediate interest to the continuous-control and robotics communities.
minor comments (3)
- The abstract states that the replay memory is 'noisy and large' and the cumulative reward is 'unstable,' yet the manuscript does not quantify these difficulties (e.g., variance of TD targets or buffer size relative to episode length) before claiming that the alternating scheme resolves them.
- Section 4 (experiments) should report the precise off-policy base algorithms, the number of random seeds, and whether the replay-policy parameters are shared across tasks or re-learned per environment.
- The notation distinguishing the agent policy π and the replay policy ρ is introduced only in the method section; a short table of symbols at the beginning would improve readability.
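The quantification requested in the first comment could take a form like the following. This is a hedged sketch: the one-step TD target with a terminal mask is the standard form, but the function name, the argument layout, and the choice of mean and variance as summary statistics are assumptions made here.

```python
import numpy as np

def td_target_stats(rewards, dones, next_q, gamma=0.99):
    """Mean and variance of one-step TD targets over a sampled batch.

    next_q is assumed to hold the target critic's value at the next state
    and the target policy's action, one entry per transition.
    """
    rewards = np.asarray(rewards, dtype=float)
    dones = np.asarray(dones, dtype=float)
    next_q = np.asarray(next_q, dtype=float)
    # Bootstrapped term is dropped on terminal transitions.
    targets = rewards + gamma * (1.0 - dones) * next_q
    return targets.mean(), targets.var()
```

Tracking these statistics over training, alongside the ratio of buffer size to episode length, would give the "noisy" and "unstable" descriptions a measurable meaning.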
Simulated Author's Rebuttal
We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. We are pleased that the potential of ERO as a learnable alternative to hand-designed replay heuristics is recognized.
Circularity Check
No significant circularity identified
The provided abstract and description present ERO as an alternating-update framework for learning a replay policy alongside the agent policy, with effectiveness shown via experiments on continuous control tasks. No equations, parameter fits, self-citations, or uniqueness theorems appear in the text that would reduce any claimed prediction or result to its own inputs by construction. The load-bearing elements are the empirical demonstrations and the stated handling of noisy replay memory, both of which are externally falsifiable and independent of the method's internal definitions.