
arxiv: 1906.09734 · v1 · pith:OOO6Y72Qnew · submitted 2019-06-24 · 💻 cs.LG · stat.ML

Optimal Use of Experience in First Person Shooter Environments

Pith reviewed 2026-05-25 17:42 UTC · model grok-4.3

keywords reinforcement learning · deep q-learning · experience replay · update frequency · vizdoom · first person shooter · sample efficiency · dqn

The pith

Deep Q-Learning in first-person shooter games gains nothing from extra learning updates per step, and loses nothing when the model updates only every fourth environmental step.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether reusing experience from the replay buffer by running multiple learning updates per environment step can reduce the high number of interactions reinforcement learning needs. Experiments in the VizDoom first-person shooter environment using Deep Q-Learning show that extra updates per step require lowering the learning rate yet still fail to raise agent performance. Updating less often remains effective up to a four-to-one ratio of environmental steps to learning updates, after which results drop sharply. This matters because most reinforcement learning training consumes millions of environment steps, so identifying when reuse stops helping could cut wasted computation without losing learning progress.

Core claim

The authors demonstrate that applying learning update steps multiple times per environmental step in the VizDoom environment requires a change in the learning rate but does not improve the performance of the agent. They further show that updating less frequently is effective up to a ratio of 4:1, after which performance degrades significantly, thereby confirming the widespread practice of performing learning updates every fourth environmental step.

What carries the argument

The ratio of learning updates to environmental steps when reusing samples from the experience replay buffer in Deep Q-Learning.
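
The schedule in question can be made concrete with a minimal Python sketch. It is not the paper's code: the environment and the gradient step are stubbed out, and the names `run_schedule`, `env_steps_per_update`, and `updates_per_step`, along with the batch size and buffer capacity, are illustrative assumptions. Only the scheduling logic is meant to be read.

```python
import random
from collections import deque


def run_schedule(total_env_steps, env_steps_per_update=4, updates_per_step=1,
                 batch_size=32, buffer_capacity=10_000, seed=0):
    """Return how many learning updates a given schedule performs.

    All names and default values are hypothetical, not taken from the paper.
    """
    rng = random.Random(seed)
    replay = deque(maxlen=buffer_capacity)  # oldest transitions fall out first
    n_updates = 0

    for step in range(1, total_env_steps + 1):
        # Stub for one environment interaction.
        replay.append(("transition", step))

        # Wait until the buffer holds at least one batch.
        if len(replay) < batch_size:
            continue

        # Update only on every `env_steps_per_update`-th environment step.
        if step % env_steps_per_update != 0:
            continue

        for _ in range(updates_per_step):
            # Uniform sampling with replacement from the replay buffer.
            batch = [replay[rng.randrange(len(replay))] for _ in range(batch_size)]
            assert len(batch) == batch_size  # a real agent would take a gradient step here
            n_updates += 1

    return n_updates
```

Under these assumptions, `run_schedule(1000, env_steps_per_update=4)` performs 243 updates, while `run_schedule(1000, env_steps_per_update=1, updates_per_step=2)` performs 1938.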

If this is right

  • Extra updates per step require a lower learning rate to keep training stable.
  • Performance stays comparable when updates occur only every fourth step.
  • Ratios above 4:1, meaning more than four environmental steps per update, produce clear drops in final agent scores.
  • The 4:1 schedule reduces the number of gradient steps while preserving learning quality in this setting.
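
One way to see what the ratio controls is to count how often each stored transition is reused. Over a long run, the average number of times a transition is drawn into a batch is the batch size multiplied by the number of learning updates per environmental step. The function name and the batch size of 32 used below are illustrative assumptions, not values taken from the paper.

```python
def average_reuse(batch_size, env_steps_per_update=1, updates_per_step=1):
    """Average number of times each transition is drawn into a batch.

    Total draws are batch_size * updates; total transitions equal the
    number of environment steps, so the ratio of the two is the average reuse.
    """
    updates_per_env_step = updates_per_step / env_steps_per_update
    return batch_size * updates_per_env_step
```

With a hypothetical batch of 32, a 4:1 schedule reuses each transition 8 times on average, an 8:1 schedule 4 times, and two updates on every step 64 times.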

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same 4:1 limit may serve as a useful default in other environments that rely on experience replay.
  • The result underscores the need to retune the learning rate whenever the update schedule changes.
  • Repeating the test on different network sizes or game complexities would show whether the ratio is task-specific.
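
Retuning the learning rate alongside the schedule amounts to a grid sweep over both, similar in shape to the heat map in Figure 4. A minimal sketch follows; the base learning rate, the modifier values, and the schedule list are placeholders chosen for illustration, not the paper's settings. The default of five seeds mirrors the "5 runs" mentioned in the figure captions.

```python
from itertools import product


def build_sweep(base_lr, lr_modifiers, env_steps_per_update_options, n_seeds=5):
    """Enumerate one config per (learning rate, schedule, seed) combination."""
    return [
        {
            "learning_rate": base_lr * modifier,
            "env_steps_per_update": ratio,
            "seed": seed,
        }
        for modifier, ratio, seed in product(
            lr_modifiers, env_steps_per_update_options, range(n_seeds)
        )
    ]


# Hypothetical values for illustration only.
sweep = build_sweep(
    base_lr=1e-4,
    lr_modifiers=[0.25, 0.5, 1.0, 2.0, 4.0],
    env_steps_per_update_options=[1, 2, 4, 8, 16],
)
print(len(sweep))  # 5 modifiers x 5 schedules x 5 seeds = 125 configs
```

Comparing schedules at each one's best learning rate, rather than at a single shared value, is what keeps the schedule comparison from being confounded by a poorly matched learning rate.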

Load-bearing premise

Observed performance differences arise solely from the chosen update frequency rather than from other experimental choices such as learning-rate adjustments or the particular VizDoom tasks.

What would settle it

An experiment that applies an 8:1 ratio of environmental steps to learning updates in the same VizDoom DQN setup and records no significant performance drop would falsify the claim that degradation begins after 4:1.
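
Whether a drop counts as significant depends on the test used. As one hedged possibility, the final scores of the two schedules across seeds could be compared with Welch's unequal-variance t-test. The sketch below computes only the t statistic and the degrees of freedom with the standard library; the function name is an assumption, and turning the result into a p-value needs a t-distribution, for example via `scipy.stats.ttest_ind(a, b, equal_var=False)`.

```python
import math
import statistics


def welch_t(scores_a, scores_b):
    """Return (t statistic, degrees of freedom) for Welch's t-test."""
    n_a, n_b = len(scores_a), len(scores_b)
    # Squared standard error of each group's mean.
    se2_a = statistics.variance(scores_a) / n_a
    se2_b = statistics.variance(scores_b) / n_b
    t = (statistics.mean(scores_a) - statistics.mean(scores_b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df
```

With only a handful of seeds per schedule the test has little power, so a non-significant result on its own is weak evidence that the two schedules perform equally.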

Figures

Figures reproduced from arXiv: 1906.09734 by Matthew Aitchison.

Figure 1: ViZDoom ‘Health Gathering Supreme’ scenario showing health kits and poison bottles.
Figure 2: Architecture of the model used in these experiments.
Figure 3: Agents test scores during training taken from the optimal learning rate, and averaged over the 5 runs. The every ...
Figure 4: Heat map showing average final score for of the 5 runs over each learning rate modifier

Original abstract

Although reinforcement learning has made great strides recently, a continuing limitation is that it requires an extremely high number of interactions with the environment. In this paper, we explore the effectiveness of reusing experience from the experience replay buffer in the Deep Q-Learning algorithm. We test the effectiveness of applying learning update steps multiple times per environmental step in the VizDoom environment and show first, this requires a change in the learning rate, and second that it does not improve the performance of the agent. Furthermore, we show that updating less frequently is effective up to a ratio of 4:1, after which performance degrades significantly. These results quantitatively confirm the widespread practice of performing learning updates every 4th environmental step.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper examines reuse of experience in DQN via multiple learning updates per environmental step in the VizDoom FPS environment. It reports that such reuse requires a learning-rate adjustment and yields no performance gain. Separately, it finds that less frequent updating remains effective up to a 4:1 ratio of environmental steps to learning updates, after which performance drops sharply, thereby providing quantitative support for the common practice of updating every fourth step.

Significance. If the central empirical claims hold after controlling for confounding factors, the work supplies a concrete, environment-specific calibration of the update ratio that is already standard in DQN implementations. This could serve as a reference point for practitioners tuning experience-replay schedules in similar visual RL tasks.

major comments (2)
  1. [Abstract] The claim that performance degrades beyond the 4:1 ratio is attributed solely to update frequency, yet the same paragraph states that multiple updates 'requires a change in the learning rate.' No information is given on whether the learning rate (or other hyperparameters) was held constant when the ratio was varied; because DQN performance is known to be sensitive to LR–update interactions, this leaves the isolation of the frequency effect unverified.
  2. [Abstract] Experimental results (implicit in the abstract's quantitative claims): the reported performance differences lack any mention of the number of independent runs, error bars, or statistical tests. Without these, it is impossible to determine whether the observed degradation past 4:1 is reliable or could be explained by run-to-run variance.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and indicate where revisions will be made to improve clarity and completeness.

  1. Referee: [Abstract] The claim that performance degrades beyond the 4:1 ratio is attributed solely to update frequency, yet the same paragraph states that multiple updates 'requires a change in the learning rate.' No information is given on whether the learning rate (or other hyperparameters) was held constant when the ratio was varied; because DQN performance is known to be sensitive to LR–update interactions, this leaves the isolation of the frequency effect unverified.

    Authors: The learning-rate adjustment applied only to the multiple-updates-per-step regime. In the separate experiments that varied update frequency (less frequent updating up to the 4:1 ratio), the learning rate was held fixed at the value tuned for the standard DQN configuration. We will revise the abstract to state this explicitly so that the isolation of the frequency effect is clear. revision: yes

  2. Referee: [Abstract] Experimental results (implicit in the abstract's quantitative claims): the reported performance differences lack any mention of the number of independent runs, error bars, or statistical tests. Without these, it is impossible to determine whether the observed degradation past 4:1 is reliable or could be explained by run-to-run variance.

    Authors: We agree that the manuscript should report the experimental protocol in more detail. The results were obtained from multiple independent runs with different random seeds; we will revise the text and figures to state the number of runs, include error bars (standard deviation), and note that the degradation past 4:1 is consistent across seeds. revision: yes
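
The error bars discussed in this exchange could be computed along the following lines. This is a minimal sketch, not the authors' analysis code: the function name and the layout of `runs` (one equal-length score curve per seed) are assumptions.

```python
import statistics


def summarize_runs(runs):
    """Per-evaluation-point mean and sample standard deviation across seeds.

    `runs` is a list of score curves, one per seed, all of equal length.
    At least two runs are needed for a standard deviation.
    """
    means, stds = [], []
    for scores_at_point in zip(*runs):
        means.append(statistics.mean(scores_at_point))
        stds.append(statistics.stdev(scores_at_point))
    return means, stds
```

Plotting `means` with a band of plus or minus `stds` at each evaluation point would show whether the curves for different schedules separate by more than the spread between seeds.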

Circularity Check

0 steps flagged

No significant circularity; purely empirical experimental results


The paper reports direct experimental comparisons in the VizDoom environment using a DQN agent, testing the effects of varying the ratio of learning updates to environmental steps. No mathematical derivations, fitted parameters presented as predictions, or self-citation chains are present that would reduce claims to inputs by construction. The central findings on update frequency (effective up to 4:1) are based on observed performance metrics rather than any self-referential definitions or renamings. This is a standard self-contained empirical study.

Axiom & Free-Parameter Ledger

1 free parameters · 0 axioms · 0 invented entities

The empirical results depend on the assumption that the DQN algorithm and VizDoom setup allow isolation of the update frequency effect through learning rate tuning; no new entities or axioms are introduced.

free parameters (1)
  • learning rate adjustment
    The paper states that multiple updates per step require a change in the learning rate to maintain stability, implying it was tuned for the experiment.


