A Dual Memory Structure for Efficient Use of Replay Memory in Deep Reinforcement Learning

Dong Eui Chang; Wonshick Ko

arxiv: 1907.06396 · v1 · pith:PKRUKAE6new · submitted 2019-07-15 · 💻 cs.LG · stat.ML


Pith reviewed 2026-05-24 21:34 UTC · model grok-4.3

classification 💻 cs.LG stat.ML

keywords dual memory · replay memory · deep reinforcement learning · cache memory · OpenAI Gym · training efficiency · memory structure


The pith

A dual memory structure with main storage and cache management raises training and test scores over standard replay memory in reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a dual memory setup for replay-based reinforcement learning that splits functions between a main memory holding stored data and a cache memory that handles selection and training steps. Tests across three OpenAI Gym environments produce higher scores during both training and evaluation than the usual single-memory baseline. This separation is presented as the mechanism that makes training more efficient while keeping the core learning algorithm unchanged.

Core claim

The dual memory structure, consisting of a main memory that stores various data and a cache memory that manages the data and trains the reinforcement learning agent efficiently, achieves higher training and test scores than the conventional single memory structure in three selected environments of OpenAI Gym.

What carries the argument

Dual memory structure: main memory stores data while cache memory manages selection and performs efficient agent training.
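
The split described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `DualReplayMemory` and the methods `store`, `refresh_cache`, and `sample_batch` are hypothetical, the main memory is assumed to evict its oldest entry first, and the cache is refilled by uniform random sampling, which stands in for the paper's own cache selection method.

```python
import random
from collections import deque


class DualReplayMemory:
    """Illustrative dual memory: a large main memory plus a small cache.

    Assumption: the cache is refreshed by uniform sampling from the main
    memory. The paper's actual selection rule is not reproduced here.
    """

    def __init__(self, main_capacity, cache_capacity, seed=None):
        if cache_capacity > main_capacity:
            raise ValueError("cache_capacity must not exceed main_capacity")
        # Main memory: time-ordered store, oldest transition evicted first.
        self.main = deque(maxlen=main_capacity)
        self.cache = []
        self.cache_capacity = cache_capacity
        self.rng = random.Random(seed)

    def store(self, transition):
        """Append a transition, e.g. (state, action, reward, next_state, done)."""
        self.main.append(transition)

    def refresh_cache(self):
        """Refill the cache with transitions drawn from the main memory."""
        k = min(self.cache_capacity, len(self.main))
        self.cache = self.rng.sample(list(self.main), k)

    def sample_batch(self, batch_size):
        """Draw a training mini-batch from the cache only."""
        if not self.cache:
            raise RuntimeError("cache is empty; call refresh_cache() first")
        k = min(batch_size, len(self.cache))
        return self.rng.sample(self.cache, k)
```

In a training loop built on this sketch, the agent would call `store` after every environment step, call `refresh_cache` on some schedule, and take every gradient step on batches from `sample_batch`. How often the cache is refreshed is a further free choice that the summary above does not fix.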

If this is right

Reinforcement learning agents reach higher performance levels in the tested environments during both training and testing phases.
Training proceeds more efficiently by delegating data management to the cache component.
The dual structure applies directly to any replay-memory algorithm without altering the underlying policy or value updates.
Score gains appear consistently across the three chosen environments when the cache is active.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The separation could reduce harmful correlations in sampled experiences by letting the cache apply its own filtering rules.
Similar main-plus-cache splits might be tested in other memory-intensive settings such as experience replay for offline learning.
If the cache overhead scales linearly, the method could extend to larger state spaces where single buffers become bottlenecks.

Load-bearing premise

The cache memory can be designed and tuned to manage data in a way that produces net efficiency gains without introducing unmeasured overheads or selection effects.

What would settle it

Reproducing the experiments in the same three Gym environments and obtaining equal or lower scores with the dual structure would falsify the performance claim.

Figures

Figures reproduced from arXiv: 1907.06396 by Dong Eui Chang, Wonshick Ko.

**Figure 2.** Cache data selection method. It is assumed that data in the main memory with capacity m are arranged in order of storage time. That is, if the i-th datum in the main memory is denoted by D(i), the dataset M_m of the main memory can be expressed as M_m = {D(1), D(2), …, D(m−1), D(m)}.

**Figure 3.** The result on Assault-v0: (a) mean training score for the past 100 consecutive episodes; (b) mean test score for 10 test episodes.

**Figure 4.** The result on SpaceInvaders-v0: (a) mean training score for the past 100 consecutive episodes; (b) mean test score for 10 test episodes.

**Figure 5.** The result on KungFuMaster-v0. Accompanying text, from the section on the dual memory structure with PER and PSMM: in Figs. 3−5, the DMS method has the highest mean training and test scores in all three environments.
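
The training curves in Figures 3 and 4 report a mean score over the past 100 consecutive episodes. A sketch of how such a curve can be computed from raw per-episode scores is given here. The function name `running_mean_scores` is hypothetical, and averaging the early episodes over however many scores exist so far is an assumption, since the captions do not say how the first 99 episodes are handled.

```python
from collections import deque


def running_mean_scores(episode_scores, window=100):
    """Return, for each episode, the mean score over the last `window` episodes.

    Early episodes average over the scores seen so far (an assumption).
    """
    recent = deque(maxlen=window)
    total = 0.0
    means = []
    for score in episode_scores:
        if len(recent) == window:
            # The oldest score is about to fall out of the window.
            total -= recent[0]
        recent.append(score)
        total += score
        means.append(total / len(recent))
    return means
```

For example, `running_mean_scores([1, 2, 3, 4], window=2)` returns `[1.0, 1.5, 2.5, 3.5]`.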

Original abstract

In this paper, we propose a dual memory structure for reinforcement learning algorithms with replay memory. The dual memory consists of a main memory that stores various data and a cache memory that manages the data and trains the reinforcement learning agent efficiently. Experimental results show that the dual memory structure achieves higher training and test scores than the conventional single memory structure in three selected environments of OpenAI Gym. This implies that the dual memory structure enables better and more efficient training than the single memory structure.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The dual-memory replay idea is a minor code-level tweak whose performance claim rests on unverified score improvements with no stats or controls shown.


The paper splits replay memory into a main buffer that holds data and a cache that manages selection and training. It reports better training and test scores than standard single-buffer replay on three Gym environments. That is the entire contribution: a practical suggestion for handling the buffer differently, with the cache supposedly making updates more efficient.

If the cache can be implemented without extra overhead or selection bias, the change might help some training runs in practice. The authors get credit for framing the split as a distinct structure rather than just another sampling trick. The work stays grounded in existing DQN-style replay and does not claim new theory.

The soft spot is the evidence. The abstract states the score advantage but gives no run counts, variance, error bars, ablation on cache size or eviction rules, or comparison to other memory-management baselines. Without those, the reported gains cannot be checked against noise or implementation details. The cache component is described at a high level, so it is unclear whether it actually reduces forgetting or simply changes the effective batch composition in an unmeasured way. No equations or derivations appear, and the citation pattern is minimal.

This paper is for people already writing DRL training loops who are looking for small implementation ideas. Most readers will not find enough detail to adopt or build on the method. It does not rise to the level that needs a serious referee; the empirical claim is too lightly supported to justify the time.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes a dual memory structure for replay memory in deep reinforcement learning algorithms. It consists of a main memory that stores various data and a cache memory that manages the data to train the RL agent more efficiently. The central claim is that this dual structure yields higher training and test scores than the conventional single-memory replay buffer in three selected OpenAI Gym environments.

Significance. If the empirical claim is substantiated with reproducible controls and ablations, the dual-memory idea could provide a lightweight, architecture-agnostic improvement to experience replay that is easy to implement on top of existing DQN-style agents. The absence of any parameter-free derivation or machine-checked component means the contribution would rest entirely on the strength of the experimental evidence.

major comments (3)

[Experimental results] The experimental results section provides no implementation details, hyperparameter settings, or pseudocode for the cache memory component (including its size, eviction policy, or sampling mechanism). Without these, it is impossible to determine whether the reported score gains arise from the dual structure itself or from unmeasured differences in effective replay ratio or data selection.
[Experimental results] No error bars, number of random seeds, or statistical significance tests are reported for the training and test scores across the three Gym environments. The claim that the dual structure 'achieves higher' scores therefore cannot be evaluated against the conventional single-memory baseline.
[Experimental results] The manuscript contains no ablation studies that isolate the contribution of the cache memory (e.g., varying cache size while keeping total memory fixed, or comparing against a single memory of equivalent total capacity). This omission leaves open the possibility that any observed improvement is explained by increased total memory rather than the dual structure.
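
The second major comment asks for seed counts, error bars, and significance tests. One way such a comparison could be set up is sketched here, using only the Python standard library. The function names `summarize` and `welch_t` are hypothetical, and Welch's unequal-variance t-test is one reasonable choice among several; neither the paper nor the report prescribes a specific test.

```python
import math
import statistics


def summarize(scores):
    """Mean and sample standard deviation of per-seed scores."""
    return statistics.mean(scores), statistics.stdev(scores)


def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two samples.

    `a` and `b` are per-seed scores for the two methods being compared;
    each needs at least two entries and non-zero combined variance.
    """
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    se_a, se_b = var_a / len(a), var_b / len(b)
    t = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(a) - 1) + se_b ** 2 / (len(b) - 1)
    )
    return t, df
```

Here `a` would hold the final test scores of the dual memory agent across independent seeds and `b` those of the single-memory baseline. The resulting statistic and degrees of freedom can then be looked up against a t distribution to obtain a p-value.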

minor comments (2)

[Abstract] The abstract and introduction use the phrase 'manages the data and trains the reinforcement learning agent efficiently' without defining what 'efficiently' means (wall-clock time, sample efficiency, or both).
[Experimental results] The abstract describes the environments only as 'three selected environments of OpenAI Gym'; the specific names (Assault-v0, SpaceInvaders-v0, and KungFuMaster-v0, according to the captions of Figs. 3−5) and their difficulty levels should be stated explicitly.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below, indicating the revisions we will make to improve clarity and rigor.


Referee: [Experimental results] The experimental results section provides no implementation details, hyperparameter settings, or pseudocode for the cache memory component (including its size, eviction policy, or sampling mechanism). Without these, it is impossible to determine whether the reported score gains arise from the dual structure itself or from unmeasured differences in effective replay ratio or data selection.

Authors: We agree that these details are essential for reproducibility. In the revised manuscript we will add a new subsection (and appendix with pseudocode) that fully specifies the cache memory implementation, including relative size to the main memory, eviction policy, sampling mechanism, and all hyperparameters used in the experiments. revision: yes
Referee: [Experimental results] No error bars, number of random seeds, or statistical significance tests are reported for the training and test scores across the three Gym environments. The claim that the dual structure 'achieves higher' scores therefore cannot be evaluated against the conventional single-memory baseline.

Authors: We accept this criticism. The revised version will report results from multiple independent random seeds, include mean scores with standard-deviation error bars on all figures, and add appropriate statistical significance tests to support the performance comparisons. revision: yes
Referee: [Experimental results] The manuscript contains no ablation studies that isolate the contribution of the cache memory (e.g., varying cache size while keeping total memory fixed, or comparing against a single memory of equivalent total capacity). This omission leaves open the possibility that any observed improvement is explained by increased total memory rather than the dual structure.

Authors: This is a valid concern. We will incorporate ablation experiments in the revision: (i) varying cache size while holding total memory capacity constant and (ii) direct comparison against a single replay buffer whose capacity equals the sum of main memory plus cache. These results will be added to the experimental section to isolate the benefit of the dual structure. revision: yes
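
The two ablations promised in the last response can be laid out as a small configuration grid. This is a sketch under assumptions: the function name `fixed_total_ablation` and the dictionary keys are hypothetical, and a cache fraction of 0.0 is used to represent the single-buffer baseline whose capacity equals main memory plus cache.

```python
def fixed_total_ablation(total_capacity, cache_fractions):
    """Split a fixed total capacity between main memory and cache.

    Returns one config per fraction. A fraction of 0.0 is the
    single-memory baseline with the same total capacity.
    """
    configs = []
    for frac in cache_fractions:
        if not 0.0 <= frac < 1.0:
            raise ValueError("cache fraction must be in [0, 1)")
        cache = int(round(total_capacity * frac))
        configs.append({
            "cache_fraction": frac,
            "cache_capacity": cache,
            "main_capacity": total_capacity - cache,
            "structure": "single" if cache == 0 else "dual",
        })
    return configs
```

Because every configuration sums to the same `total_capacity`, any score difference between rows cannot be attributed to extra memory, which is the confound raised in the third major comment.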

Circularity Check

0 steps flagged

No significant circularity


The paper proposes an empirical dual-memory architecture for replay buffers in deep RL and supports its claim solely via reported score improvements on three Gym environments. No equations, parameter fits, derivations, or self-citations appear in the abstract or the described full text. The central argument is therefore an external empirical comparison and does not reduce to any input by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

The central claim rests on the effectiveness of an invented dual-memory architecture whose internal management rules are not specified in the abstract; no free parameters or background axioms are stated.

invented entities (1)

cache memory · no independent evidence
purpose: to manage data and train the reinforcement learning agent efficiently
note: introduced as the second component of the proposed dual memory structure.

pith-pipeline@v0.9.0 · 5598 in / 994 out tokens · 19077 ms · 2026-05-24T21:34:28.007073+00:00 · methodology


Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · 4 internal anchors

[1] R. Liu and J. Zou, “The effects of memory replay in reinforcement learning,” in 2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 478–485, IEEE, 2018.

[2] A. L. Caterini and D. E. Chang, Deep Neural Networks in a Mathematical Framework. Springer, 2018.

[3] L.-J. Lin, “Reinforcement learning for robots using neural networks,” tech. rep., Carnegie-Mellon Univ Pittsburgh PA School of Computer Science, 1993.

[4] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.

[5] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.

[6] Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, K. Kavukcuoglu, and N. de Freitas, “Sample efficient actor-critic with experience replay,” arXiv preprint arXiv:1611.01224, 2016.

[7] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience replay,” arXiv preprint arXiv:1511.05952, 2015.

[8] T. Kwon and D. E. Chang, “Prioritized stochastic memory management for enhanced reinforcement learning,” in 2018 IEEE International Conference on Consumer Electronics-Asia (ICCE-Asia), pp. 206–212, IEEE, 2018.

[9] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “OpenAI Gym,” arXiv preprint arXiv:1606.01540, 2016.

[10] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” in Advances in Neural Information Processing Systems, pp. 1057–1063, 2000.

[11] J. Peters and S. Schaal, “Natural actor-critic,” Neurocomputing, vol. 71, no. 7-9, pp. 1180–1190, 2008.

[12] P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, Y. Wu, and P. Zhokhov, “OpenAI baselines.” https://github.com/openai/baselines, 2017.
