Optimal Use of Experience in First Person Shooter Environments
Pith reviewed 2026-05-25 17:42 UTC · model grok-4.3
The pith
Deep Q-Learning in first-person shooter games performs best when the model updates only every fourth environmental step.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors demonstrate that applying learning update steps multiple times per environmental step in the VizDoom environment requires a change in the learning rate but does not improve the performance of the agent. They further show that updating less frequently is effective up to a ratio of 4:1, after which performance degrades significantly, thereby confirming the widespread practice of performing learning updates every fourth environmental step.
What carries the argument
The ratio of learning updates to environmental steps when reusing samples from the experience replay buffer in Deep Q-Learning.
If this is right
- Extra updates per step require a lower learning rate to keep training stable.
- Performance stays comparable when updates occur only every fourth step.
- Ratios higher than 4:1 produce clear drops in final agent scores.
- The 4:1 schedule reduces the number of gradient steps while preserving learning quality in this setting.
Where Pith is reading between the lines
- The same 4:1 limit may serve as a useful default in other environments that rely on experience replay.
- The result underscores the need to retune the learning rate whenever the update schedule changes.
- Repeating the test on different network sizes or game complexities would show whether the ratio is task-specific.
Load-bearing premise
Observed performance differences arise solely from the chosen update frequency rather than from other experimental choices such as learning-rate adjustments or the particular VizDoom tasks.
What would settle it
An experiment that applies an 8:1 update-to-step ratio in the same VizDoom DQN setup and records no significant performance drop would falsify the claim that degradation begins after 4:1.
Figures
read the original abstract
Although reinforcement learning has made great strides recently, a continuing limitation is that it requires an extremely high number of interactions with the environment. In this paper, we explore the effectiveness of reusing experience from the experience replay buffer in the Deep Q-Learning algorithm. We test the effectiveness of applying learning update steps multiple times per environmental step in the VizDoom environment and show first, this requires a change in the learning rate, and second that it does not improve the performance of the agent. Furthermore, we show that updating less frequently is effective up to a ratio of 4:1, after which performance degrades significantly. These results quantitatively confirm the widespread practice of performing learning updates every 4th environmental step.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper examines reuse of experience in DQN via multiple learning updates per environmental step in the VizDoom FPS environment. It reports that such reuse requires a learning-rate adjustment and yields no performance gain; separately, it finds that less frequent updating remains effective up to a 4:1 update-to-environment ratio, after which performance drops sharply, thereby providing quantitative support for the common practice of updating every fourth step.
Significance. If the central empirical claims hold after controlling for confounding factors, the work supplies a concrete, environment-specific calibration of the update ratio that is already standard in DQN implementations. This could serve as a reference point for practitioners tuning experience-replay schedules in similar visual RL tasks.
major comments (2)
- [Abstract] Abstract: the claim that performance degrades beyond the 4:1 ratio is attributed solely to update frequency, yet the same paragraph states that multiple updates 'requires a change in the learning rate.' No information is given on whether the learning rate (or other hyperparameters) was held constant when the ratio was varied; because DQN performance is known to be sensitive to LR–update interactions, this leaves the isolation of the frequency effect unverified.
- [Abstract] Experimental results (implicit in the abstract's quantitative claims): the reported performance differences lack any mention of the number of independent runs, error bars, or statistical tests. Without these, it is impossible to determine whether the observed degradation past 4:1 is reliable or could be explained by run-to-run variance.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major comment below and indicate where revisions will be made to improve clarity and completeness.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claim that performance degrades beyond the 4:1 ratio is attributed solely to update frequency, yet the same paragraph states that multiple updates 'requires a change in the learning rate.' No information is given on whether the learning rate (or other hyperparameters) was held constant when the ratio was varied; because DQN performance is known to be sensitive to LR–update interactions, this leaves the isolation of the frequency effect unverified.
Authors: The learning-rate adjustment applied only to the multiple-updates-per-step regime. In the separate experiments that varied update frequency (less frequent updating up to the 4:1 ratio), the learning rate was held fixed at the value tuned for the standard DQN configuration. We will revise the abstract to state this explicitly so that the isolation of the frequency effect is clear. revision: yes
-
Referee: [Abstract] Experimental results (implicit in the abstract's quantitative claims): the reported performance differences lack any mention of the number of independent runs, error bars, or statistical tests. Without these, it is impossible to determine whether the observed degradation past 4:1 is reliable or could be explained by run-to-run variance.
Authors: We agree that the manuscript should report the experimental protocol in more detail. The results were obtained from multiple independent runs with different random seeds; we will revise the text and figures to state the number of runs, include error bars (standard deviation), and note that the degradation past 4:1 is consistent across seeds. revision: yes
Circularity Check
No significant circularity; purely empirical experimental results
full rationale
The paper reports direct experimental comparisons in the VizDoom environment using a DQN agent, testing the effects of varying the ratio of learning updates to environmental steps. No mathematical derivations, fitted parameters presented as predictions, or self-citation chains are present that would reduce claims to inputs by construction. The central findings on update frequency (effective up to 4:1) are based on observed performance metrics rather than any self-referential definitions or renamings. This is a standard self-contained empirical study.
Axiom & Free-Parameter Ledger
free parameters (1)
- learning rate adjustment
Reference graph
Works this paper leans on
-
[1]
Human-level control through deep reinforcement learning
V . Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning”, Nature, vol. 518, 2015. DOI: 10.1038/nature14236. [Onl...
-
[2]
A Survey of Real-Time Strategy Game AI Research and Competition in StarCraft
S. Ontanon, G. Synnaeve, A. Uriarte, F. Richoux, D. Churchill, and M. Preuss, “A Survey of Real-Time Strategy Game AI Research and Competition in StarCraft”, IEEE Transactions on Computational Intelligence and AI in Games, vol. 5, no. 4, pp. 293–311, Dec. 2013, ISSN : 1943-068X. DOI: 10.1109/TCIAIG.2013.2286295. [Online]. Available: http://ieeexplore.ieee...
-
[3]
A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play
D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis, “A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play.”,Science (New York, N.Y.), vol. 362, no. 6419, pp. 1140–1144, Dec. 2018, ISSN : 1095-9203. DOI: 10...
-
[4]
Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates
S. Gu, E. Holly, T. Lillicrap, and S. Levine, “Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates”, in 2017 IEEE international conference on robotics and automation (ICRA), IEEE, 2017, pp. 3389–3396
work page 2017
-
[5]
The MineRL Competition on Sample Efficient Reinforcement Learning using Human Priors
W. H. Guss, C. Codel, K. Hofmann, B. Houghton, N. Kuno, S. Milani, S. Mohanty, D. P. Liebana, R. Salakhutdi- nov, N. Topin, M. Veloso, and P. Wang, “The MineRL Competition on Sample Efficient Reinforcement Learning using Human Priors”, Apr. 2019. [Online]. Available:http://arxiv.org/abs/1904.10079
-
[6]
Self-improving reactive agents based on reinforcement learning, planning and teaching
L.-J. Lin, “Self-improving reactive agents based on reinforcement learning, planning and teaching”, Machine learning, vol. 8, no. 3-4, pp. 293–321, 1992
work page 1992
-
[7]
Deep Reinforcement Learning with Double Q-learning
H. Van Hasselt, A. Guez, and D. Silver, “Deep Reinforcement Learning with Double Q-learning”, Tech. Rep. [Online]. Available: www.aaai.org
-
[8]
Playing fps games with deep reinforcement learning
G. Lample and D. S. Chaplot, “Playing fps games with deep reinforcement learning”, in Thirty-First AAAI Conference on Artificial Intelligence, 2017
work page 2017
-
[9]
ViZDoom: A Doom-based AI research platform for visual reinforcement learning
M. Kempka, M. Wydmuch, G. Runc, J. Toczek, and W. Jaskowski, “ViZDoom: A Doom-based AI research platform for visual reinforcement learning”, in 2016 IEEE Conference on Computational Intelligence and Games (CIG), IEEE, Sep. 2016, pp. 1–8, ISBN : 978-1-5090-1883-3. DOI: 10.1109/CIG.2016.7860433 . [Online]. Available: http://ieeexplore.ieee.org/document/7860433/
-
[10]
The arcade learning environment: An evaluation platform for general agents
M. G. Bellemare, Y . Naddaf, J. Veness, and M. Bowling, “The arcade learning environment: An evaluation platform for general agents”, Journal of Artificial Intelligence Research, vol. 47, pp. 253–279, 2013
work page 2013
-
[11]
Asynchronous methods for deep reinforcement learning
V . Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning”, in International conference on machine learning, 2016, pp. 1928– 1937
work page 2016
-
[12]
T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized Experience Replay”, Nov. 2015. [Online]. Available: http://arxiv.org/abs/1511.05952
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[13]
Sample Efficient Actor-Critic with Experience Replay
Z. Wang, V . Bapst, N. Heess, V . Mnih, R. Munos, K. Kavukcuoglu, and N. de Freitas, “Sample Efficient Actor-Critic with Experience Replay”, Nov. 2016. [Online]. Available:http://arxiv.org/abs/1611.01224
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[14]
Learning to Act by Predicting the Future
A. Dosovitskiy and V . Koltun, “Learning to Act by Predicting the Future”, Nov. 2016. [Online]. Available: http://arxiv.org/abs/1611.01779. 6
work page internal anchor Pith review Pith/arXiv arXiv 2016
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.