Optimal Use of Experience in First Person Shooter Environments

Matthew Aitchison

arxiv: 1906.09734 · v1 · pith:OOO6Y72Qnew · submitted 2019-06-24 · 💻 cs.LG · stat.ML

Pith reviewed 2026-05-25 17:42 UTC · model grok-4.3

keywords reinforcement learning · deep q-learning · experience replay · update frequency · vizdoom · first person shooter · sample efficiency · dqn

The pith

Deep Q-Learning in the VizDoom first-person shooter loses nothing when the model updates only every fourth environmental step: more frequent updates do not help, and less frequent ones hurt.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether reusing experience from the replay buffer by running multiple learning updates per environment step can reduce the high number of interactions reinforcement learning needs. Experiments in the VizDoom first-person shooter environment using Deep Q-Learning show that extra updates per step require lowering the learning rate yet still fail to raise agent performance. Updating less often remains effective up to a four-to-one ratio of environmental steps to learning updates, after which results drop sharply. This matters because most reinforcement learning training consumes millions of environment steps, so identifying when reuse stops helping could cut wasted computation without losing learning progress.

Core claim

The authors demonstrate that applying learning update steps multiple times per environmental step in the VizDoom environment requires a change in the learning rate but does not improve the performance of the agent. They further show that updating less frequently is effective up to a ratio of 4:1, after which performance degrades significantly, thereby confirming the widespread practice of performing learning updates every fourth environmental step.

What carries the argument

The ratio of learning updates to environmental steps when reusing samples from the experience replay buffer in Deep Q-Learning.
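
One way to make this ratio concrete is the sketch below, which counts how many learning updates fall due after each environmental step for a given ratio. It is a minimal illustration, not the paper's code: the function names `updates_due` and `total_updates`, and the use of `Fraction` to express the ratio, are assumptions introduced here.

```python
from fractions import Fraction
from math import floor


def updates_due(step: int, updates_per_step: Fraction) -> int:
    """Learning updates to run right after environment step `step` (1-indexed).

    `updates_per_step` is a hypothetical knob: Fraction(1, 4) means one update
    every fourth environmental step, Fraction(2) means two updates per step.
    """
    # Updates owed in total after `step`, minus those already owed after `step - 1`.
    return floor(step * updates_per_step) - floor((step - 1) * updates_per_step)


def total_updates(env_steps: int, updates_per_step: Fraction) -> int:
    """Total learning updates performed over `env_steps` environmental steps."""
    return sum(updates_due(s, updates_per_step) for s in range(1, env_steps + 1))


if __name__ == "__main__":
    for ratio in (Fraction(2), Fraction(1), Fraction(1, 4), Fraction(1, 8)):
        print(ratio, total_updates(1000, ratio))
```

In a training loop, the agent would call `updates_due` after every environmental step and sample that many minibatches from the replay buffer.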

If this is right

Extra updates per step require a lower learning rate to keep training stable.
Performance stays comparable when updates occur only every fourth step.
Ratios above 4:1 (more than four environmental steps per learning update) produce clear drops in final agent scores.
The 4:1 schedule reduces the number of gradient steps while preserving learning quality in this setting.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same 4:1 limit may serve as a useful default in other environments that rely on experience replay.
The result underscores the need to retune the learning rate whenever the update schedule changes.
Repeating the test on different network sizes or game complexities would show whether the ratio is task-specific.
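
The retuning point above could be organised as a joint sweep over update schedule and learning rate, in the spirit of the learning-rate modifiers shown in Figure 4. The sketch below is an illustration built on assumptions: the base learning rate, the modifier values, the schedule values, and the names `SweepConfig` and `build_sweep` are all hypothetical and are not taken from the paper. Only the 5 runs per setting comes from the figure captions.

```python
from dataclasses import dataclass
from fractions import Fraction
from itertools import product


@dataclass(frozen=True)
class SweepConfig:
    updates_per_step: Fraction  # learning updates per environmental step
    learning_rate: float        # base rate times a modifier
    seed: int                   # one independent run per seed


def build_sweep(base_lr, lr_modifiers, schedules, n_seeds):
    """Cross every update schedule with every learning-rate modifier and seed."""
    return [
        SweepConfig(updates_per_step=s, learning_rate=base_lr * m, seed=seed)
        for s, m, seed in product(schedules, lr_modifiers, range(n_seeds))
    ]


# All values below are placeholders chosen for illustration only.
configs = build_sweep(
    base_lr=1e-4,
    lr_modifiers=[0.25, 0.5, 1.0, 2.0],
    schedules=[Fraction(2), Fraction(1), Fraction(1, 4), Fraction(1, 8)],
    n_seeds=5,
)
```

Sweeping both axes together is what lets a drop in score be attributed to the schedule rather than to a learning rate that only suits one schedule.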

Load-bearing premise

Observed performance differences arise solely from the chosen update frequency rather than from other experimental choices such as learning-rate adjustments or the particular VizDoom tasks.

What would settle it

An experiment that applies an 8:1 ratio of environmental steps to learning updates in the same VizDoom DQN setup and records no significant performance drop would falsify the claim that degradation begins after 4:1.
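
Whether a drop counts as significant needs a concrete test. A minimal sketch, assuming per-run final scores are available for the two schedules, is Welch's unequal-variance t statistic computed with the standard library. The paper does not state which test, if any, it used, so the choice of Welch's test and the function name `welch_t` are assumptions made here.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t-test on two samples."""
    if len(a) < 2 or len(b) < 2:
        raise ValueError("each sample needs at least two runs")
    # Squared standard error of each sample mean (variance() is the n-1 form).
    se_a = variance(a) / len(a)
    se_b = variance(b) / len(b)
    if se_a + se_b == 0:
        raise ValueError("both samples have zero variance")
    t = (mean(a) - mean(b)) / sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (len(a) - 1) + se_b ** 2 / (len(b) - 1))
    return t, df
```

Here `a` and `b` would hold the final scores of the runs at the 4:1 and 8:1 schedules; the resulting `t` and `df` are then compared against a t distribution to obtain a p-value.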

Figures

Figures reproduced from arXiv: 1906.09734 by Matthew Aitchison.

**Figure 2.** Architecture of the model used in these experiments.

**Figure 3.** Agents test scores during training taken from the optimal learning rate, and averaged over the 5 runs. The every ...

**Figure 4.** Heat map showing average final score for of the 5 runs over each learning rate modifier

Original abstract

Although reinforcement learning has made great strides recently, a continuing limitation is that it requires an extremely high number of interactions with the environment. In this paper, we explore the effectiveness of reusing experience from the experience replay buffer in the Deep Q-Learning algorithm. We test the effectiveness of applying learning update steps multiple times per environmental step in the VizDoom environment and show first, this requires a change in the learning rate, and second that it does not improve the performance of the agent. Furthermore, we show that updating less frequently is effective up to a ratio of 4:1, after which performance degrades significantly. These results quantitatively confirm the widespread practice of performing learning updates every 4th environmental step.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper confirms the standard 4:1 DQN update ratio in VizDoom, but the results are weakened by a learning rate that may not have been controlled across conditions.

The paper confirms that in a VizDoom DQN setup you can update the network once every four environment steps without losing performance, and that doing more updates per step does not help. It also shows degradation beyond that ratio.

What the paper does well is run controlled tests on the update frequency and report a clear threshold. This gives some quantitative support for a rule of thumb that many people already use.

The main soft spot is the learning rate. The abstract notes that multiple updates per step require adjusting the learning rate, but then presents the performance results as a function of the update ratio. If the learning rate was changed along with the ratio, then the drop in performance after 4:1 could be due to the learning rate choice rather than the frequency itself. The stress test note points this out, and the abstract does not say the learning rate was held fixed. That makes the central claim harder to trust. There are also no details on error bars, number of trials, or statistical tests in the abstract. Without those, it is difficult to know how reliable the "degrades significantly" finding is.

This kind of work is aimed at people implementing DQN agents in similar game environments who need practical guidance on hyperparameters. It does not introduce new algorithms or theoretical insights.

I would not bring this to a reading group. The result is confirmatory rather than novel, and the potential confound with learning rate means it does not stand out as particularly sharp evidence. I would not cite it in my own work. It does not deserve peer review in its current state because the evidence does not cleanly support the claimed isolation of the update ratio effect.

Referee Report

2 major / 0 minor

Summary. The paper examines reuse of experience in DQN via multiple learning updates per environmental step in the VizDoom FPS environment. It reports that such reuse requires a learning-rate adjustment and yields no performance gain. Separately, it finds that less frequent updating remains effective up to a 4:1 ratio of environmental steps to learning updates, after which performance drops sharply, thereby providing quantitative support for the common practice of updating every fourth step.

Significance. If the central empirical claims hold after controlling for confounding factors, the work supplies a concrete, environment-specific calibration of the update ratio that is already standard in DQN implementations. This could serve as a reference point for practitioners tuning experience-replay schedules in similar visual RL tasks.

major comments (2)

[Abstract] Abstract: the claim that performance degrades beyond the 4:1 ratio is attributed solely to update frequency, yet the same paragraph states that multiple updates 'requires a change in the learning rate.' No information is given on whether the learning rate (or other hyperparameters) was held constant when the ratio was varied; because DQN performance is known to be sensitive to LR–update interactions, this leaves the isolation of the frequency effect unverified.
[Abstract] Experimental results (implicit in the abstract's quantitative claims): the reported performance differences lack any mention of the number of independent runs, error bars, or statistical tests. Without these, it is impossible to determine whether the observed degradation past 4:1 is reliable or could be explained by run-to-run variance.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and indicate where revisions will be made to improve clarity and completeness.

Referee: [Abstract] Abstract: the claim that performance degrades beyond the 4:1 ratio is attributed solely to update frequency, yet the same paragraph states that multiple updates 'requires a change in the learning rate.' No information is given on whether the learning rate (or other hyperparameters) was held constant when the ratio was varied; because DQN performance is known to be sensitive to LR–update interactions, this leaves the isolation of the frequency effect unverified.

Authors: The learning-rate adjustment applied only to the multiple-updates-per-step regime. In the separate experiments that varied update frequency (less frequent updating up to the 4:1 ratio), the learning rate was held fixed at the value tuned for the standard DQN configuration. We will revise the abstract to state this explicitly so that the isolation of the frequency effect is clear. revision: yes
Referee: [Abstract] Experimental results (implicit in the abstract's quantitative claims): the reported performance differences lack any mention of the number of independent runs, error bars, or statistical tests. Without these, it is impossible to determine whether the observed degradation past 4:1 is reliable or could be explained by run-to-run variance.

Authors: We agree that the manuscript should report the experimental protocol in more detail. The results were obtained from multiple independent runs with different random seeds; we will revise the text and figures to state the number of runs, include error bars (standard deviation), and note that the degradation past 4:1 is consistent across seeds. revision: yes
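
The error bars promised here could be produced along the following lines. This is a minimal sketch, assuming each run logs a test score at the same evaluation points; the function name `aggregate_runs` is hypothetical, and the sample standard deviation is one reasonable reading of "standard deviation".

```python
from statistics import mean, stdev


def aggregate_runs(curves):
    """Per-evaluation-point mean and sample standard deviation across runs.

    `curves` is a list of runs, each a list of test scores of equal length.
    """
    if len(curves) < 2:
        raise ValueError("need at least two runs to estimate spread")
    if len({len(c) for c in curves}) != 1:
        raise ValueError("all runs must share the same evaluation points")
    means, stds = [], []
    # zip(*curves) groups the scores that different runs got at the same point.
    for scores in zip(*curves):
        means.append(mean(scores))
        stds.append(stdev(scores))
    return means, stds
```

The two returned lists can be plotted as a mean curve with a shaded band of one standard deviation on either side.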

Circularity Check

0 steps flagged

No significant circularity; purely empirical experimental results

The paper reports direct experimental comparisons in the VizDoom environment using a DQN agent, testing the effects of varying the ratio of learning updates to environmental steps. No mathematical derivations, fitted parameters presented as predictions, or self-citation chains are present that would reduce claims to inputs by construction. The central findings on update frequency (effective up to 4:1) are based on observed performance metrics rather than any self-referential definitions or renamings. This is a standard self-contained empirical study.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The empirical results depend on the assumption that the DQN algorithm and VizDoom setup allow isolation of the update frequency effect through learning rate tuning; no new entities or axioms are introduced.

free parameters (1)

learning rate adjustment
The paper states that multiple updates per step require a change in the learning rate to maintain stability, implying it was tuned for the experiment.
