
arxiv: 1907.03098 · v1 · pith:OQHPSZUMnew · submitted 2019-07-06 · 💻 cs.LG · cs.NE

Playing Flappy Bird via Asynchronous Advantage Actor Critic Algorithm

Pith reviewed 2026-05-25 01:39 UTC · model grok-4.3

classification 💻 cs.LG cs.NE
keywords flappy bird · reinforcement learning · a3c · deep q-network · raw pixel input · game playing

The pith

An agent is trained to play Flappy Bird with the Deep Q-Network and Asynchronous Advantage Actor Critic algorithms, directly from raw game images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reports that reinforcement learning agents can learn to play Flappy Bird by processing raw pixel inputs and receiving only a reward or penalty after each action. It applies both Deep Q-Network and Asynchronous Advantage Actor Critic methods and states that the model learns appropriate decisions through this end-to-end process. A sympathetic reader would care because the setup demonstrates game play without hand-engineered features or custom state representations. The work focuses on completing training via standard reinforcement signals from visual data alone.

Core claim

The Flappy Bird game was trained with the Reinforcement Learning algorithm Deep Q-Network and Asynchronous Advantage Actor Critic (A3C) algorithms from raw game images. The trained model has learned as reinforcement when to make which decision. As an input to the model, the reward or penalty at the end of each step was returned and the training was completed.

What carries the argument

Asynchronous Advantage Actor Critic (A3C) and Deep Q-Network algorithms that process raw game images and update policies based on per-step rewards or penalties.
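
As a rough illustration of the update signals these two algorithms rely on, the sketch below computes the one-step Q-learning target that DQN regresses toward and the n-step advantage estimate that A3C uses to weight its policy gradient. The discount factor `GAMMA` and all function names are assumptions made for this example; the paper's own hyperparameters and network code are not available here.

```python
from typing import List, Sequence

GAMMA = 0.99  # assumed discount factor; the source does not report one


def q_learning_target(reward: float, next_q_values: Sequence[float],
                      done: bool, gamma: float = GAMMA) -> float:
    """One-step Q-learning target: r + gamma * max_a' Q(s', a').

    At the end of an episode there is no next state, so the target is just r.
    """
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def n_step_advantages(rewards: Sequence[float], values: Sequence[float],
                      bootstrap_value: float,
                      gamma: float = GAMMA) -> List[float]:
    """Advantage A_t = R_t - V(s_t) for a short rollout.

    R_t is the discounted return, bootstrapped from the critic's value of the
    state that follows the last step (use 0.0 if the episode ended there).
    """
    advantages = [0.0] * len(rewards)
    ret = bootstrap_value
    # Walk the rollout backwards so each return reuses the one after it.
    for t in reversed(range(len(rewards))):
        ret = rewards[t] + gamma * ret
        advantages[t] = ret - values[t]
    return advantages
```

In a DQN-style setup the network output for the chosen action would be regressed toward `q_learning_target`; in an A3C-style setup each worker would compute `n_step_advantages` over its own rollout before sending gradients to the shared model.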

If this is right

  • The model acquires the ability to choose actions based solely on visual input.
  • Training finishes when the agent consistently receives positive cumulative rewards from correct decisions.
  • Both DQN and A3C can be applied to the same raw-image game task with per-step feedback.
  • No additional state attributes beyond pixels are needed for the learning process described.
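
The pixel-only input described in the list above is commonly prepared by converting frames to grayscale, shrinking them, and stacking the last few frames so that motion is visible to the network. The sketch below shows that pipeline in plain Python. The stack depth, the downsampling factor, and every name in it are assumptions; the paper shows a pre-processed image but the exact steps are not stated here.

```python
from collections import deque
from typing import List, Sequence, Tuple

STACK_SIZE = 4  # assumed; the source does not say how many frames are stacked


def to_grayscale(rgb_frame: Sequence[Sequence[Tuple[int, int, int]]]) -> List[List[float]]:
    """Convert rows of (r, g, b) pixels to luminance (ITU-R BT.601 weights)."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in rgb_frame]


def downsample(frame: Sequence[Sequence[float]], factor: int) -> List[List[float]]:
    """Keep every `factor`-th pixel in both directions (crude resize)."""
    return [list(row[::factor]) for row in frame[::factor]]


class FrameStack:
    """Holds the most recent frames; the stacked frames form the agent's state."""

    def __init__(self, size: int = STACK_SIZE) -> None:
        self.frames = deque(maxlen=size)

    def reset(self, first_frame):
        # Repeat the first frame so the state has full depth from step 0.
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return list(self.frames)

    def push(self, frame):
        # The oldest frame is dropped automatically once the deque is full.
        self.frames.append(frame)
        return list(self.frames)
```

A single frame carries no velocity information, which is why stacking is a typical choice for a game where the bird's vertical speed matters.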

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar raw-pixel training might apply to other simple 2D games that use basic reward structures.
  • The approach leaves open whether the same methods would require more steps or different rewards in visually richer settings.
  • Results connect to questions of how much preprocessing RL needs for visual control tasks.

Load-bearing premise

That standard reward signals and raw pixel inputs alone enable the algorithms to reach stable successful play without extra input engineering or hyperparameter details.

What would settle it

Running the trained agent in the game and finding that it achieves only random-level scores or fails to clear any pipes would show the training did not succeed.
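
One way to phrase that check in code is sketched below: collect per-episode scores (pipes cleared) for the trained agent and for a random policy, then flag failure if the agent never clears a pipe or does no better than random on average. The function names and the idea of passing an episode-playing callable are assumptions for illustration; no evaluation harness from the paper is available here.

```python
from statistics import mean
from typing import Callable, List, Sequence


def collect_scores(play_episode: Callable[[], int], episodes: int) -> List[int]:
    """Run `play_episode` repeatedly and record the pipes cleared each time."""
    return [play_episode() for _ in range(episodes)]


def training_failed(agent_scores: Sequence[int],
                    random_scores: Sequence[int]) -> bool:
    """True if the agent clears no pipes or does not beat the random average."""
    clears_nothing = max(agent_scores) == 0
    no_better_than_random = mean(agent_scores) <= mean(random_scores)
    return clears_nothing or no_better_than_random
```

A stricter version would add a statistical test over many episodes, since a small sample of a high-variance game can mislead in either direction.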

Figures

Figures reproduced from arXiv: 1907.03098 by Elit Cenk Alp, Mehmet Serdar Guzel.

Figure 1: The raw image of the game.
Figure 2: The pre-processed image of Flappy Bird.
Figure 3: The flow chart of the Q-learning algorithm.
Figure 4: Actor-critic model.
Figure 6: Asynchronous Advantage Actor Critic model.
Figure 7: DQN average score.
Original abstract

Flappy Bird, which has a very high popularity, has been trained in many algorithms. Some of these studies were trained from raw pixel values of game and some from specific attributes. In this study, the model was trained with raw game images, which had not been seen before. The trained model has learned as reinforcement when to make which decision. As an input to the model, the reward or penalty at the end of each step was returned and the training was completed. Flappy Bird game was trained with the Reinforcement Learning algorithm Deep Q-Network and Asynchronous Advantage Actor Critic (A3C) algorithms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript claims to apply the Deep Q-Network (DQN) and Asynchronous Advantage Actor-Critic (A3C) reinforcement learning algorithms to the Flappy Bird game using raw pixel inputs from game images. It states that per-step reward or penalty signals were provided and asserts that training was completed, enabling the model to learn appropriate actions.

Significance. If the central claim of successful training were supported by data, the work would constitute a straightforward empirical application of established RL methods to a pixel-based game, adding to the existing literature on game-playing agents but offering no novel algorithmic contributions or comparisons to prior baselines on this environment.

major comments (2)
  1. [Abstract] Abstract: The assertion that 'the training was completed' and 'the model has learned' when to act is presented without any quantitative results (e.g., scores, episode counts, learning curves, success rates, or baseline comparisons). This directly undermines evaluation of the central empirical claim that learning occurred under the described setup.
  2. [Abstract] Abstract and methods description: No details are supplied on network architecture, hyperparameters, training duration, or implementation of either DQN or A3C, nor is there any indication of how raw images were preprocessed or how the reward signal was defined. These omissions make the training claim impossible to assess or reproduce.
minor comments (1)
  1. [Abstract] Abstract: The sentence 'Flappy Bird game was trained with the Reinforcement Learning algorithm Deep Q-Network and Asynchronous Advantage Actor Critic (A3C) algorithms' contains redundant wording and should be clarified to distinguish the two algorithms.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed comments. We acknowledge that the submitted manuscript is brief and lacks the quantitative results and implementation details necessary to fully support and reproduce the claims. We will revise the manuscript to address these issues.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The assertion that 'the training was completed' and 'the model has learned' when to act is presented without any quantitative results (e.g., scores, episode counts, learning curves, success rates, or baseline comparisons). This directly undermines evaluation of the central empirical claim that learning occurred under the described setup.

    Authors: We agree that the abstract and manuscript do not include quantitative results to support the claim that training was completed and the model learned appropriate actions. In the revised manuscript, we will add learning curves, achieved scores over episodes, number of training episodes, success rates, and any available baseline comparisons to substantiate the empirical results. revision: yes

  2. Referee: [Abstract] Abstract and methods description: No details are supplied on network architecture, hyperparameters, training duration, or implementation of either DQN or A3C, nor is there any indication of how raw images were preprocessed or how the reward signal was defined. These omissions make the training claim impossible to assess or reproduce.

    Authors: We agree that the manuscript omits critical details on network architecture, hyperparameters, training duration, implementation specifics for DQN and A3C, image preprocessing, and reward signal definition, which are needed for assessment and reproducibility. In the revised version, we will expand the methods section to provide these details, including convolutional network layers, specific hyperparameter values, training steps, preprocessing steps such as frame resizing and stacking, and the exact per-step reward/penalty formulation. revision: yes

Circularity Check

0 steps flagged

No significant circularity; purely empirical application

Full rationale

The manuscript applies standard DQN and A3C algorithms to Flappy Bird using raw pixel inputs and per-step reward/penalty signals. No equations, derivations, parameter fittings, or uniqueness theorems are presented. The work contains no load-bearing steps that reduce by construction to their own inputs, self-citations, or ansatzes. It is a straightforward empirical RL experiment whose validity rests on reported outcomes rather than any definitional or fitted circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. Standard RL training implicitly relies on reward design and network architecture choices that are not detailed.
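
The reward design is one of those implicit choices. The paper's experiments text, excerpted in the reference graph below, gives 1 point per pipe and a penalty of -1 on death. A minimal sketch of such a per-step reward follows; the value returned for an ordinary step (`STEP_REWARD`) is an assumption, since the excerpt does not say what the agent receives when nothing happens.

```python
from typing import Iterable, Tuple

PIPE_REWARD = 1.0     # from the experiments excerpt: 1 point per pipe
DEATH_PENALTY = -1.0  # from the experiments excerpt: -1 on death
STEP_REWARD = 0.0     # assumption: value for a step with no event is not stated


def step_reward(passed_pipe: bool, died: bool) -> float:
    """Reward for a single step; death takes priority over passing a pipe."""
    if died:
        return DEATH_PENALTY
    if passed_pipe:
        return PIPE_REWARD
    return STEP_REWARD


def episode_return(events: Iterable[Tuple[bool, bool]]) -> float:
    """Undiscounted sum of step rewards over (passed_pipe, died) events."""
    return sum(step_reward(passed, died) for passed, died in events)
```

Some implementations give a small positive reward for every surviving step instead of zero, which changes how quickly the agent gets a learning signal; which variant the authors used is not known from this page.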


Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · 4 internal anchors

  1. [1]

    It was the most downloaded mobile game at the beginning of 2014

    INTRODUCTION Flappy Bird entered the market very quickly. It was the most downloaded mobile game at the beginning of 2014, but it was withdrawn from the market within a very short time. Flappy Bird is a single-player game with only one action: jump. The game is about deciding when the bird should jump. The bird ends as soon as it strikes down, ...

  2. [2]

    [2] developed an algorithm called DQN

    RELATED WORKS In 2013, Mnih et al. [2] developed an algorithm called DQN. In that study, the agent is trained from images it has never seen before. The algorithm, tested on Atari games, produced results far above human level. Appiah and Vare [5] trained Flappy Bird with DQN in their study. They have found much better human results ...

  3. [3]

    at”, observes a rewards “rt

    METHOD The reinforcement learning environment is PyGame. The Flappy Bird game was run with PyGame, and the software was developed in Python. The deep learning models are written in Keras running on TensorFlow, and they are trained through these libraries. A. Q Learning It determines the reward for performing a specifi...

  4. [4]

    Pygame was used to develop Flappy Bird game on Python

    EXPERIMENTS TensorFlow was selected as the backend of the Keras library. Pygame was used to develop the Flappy Bird game in Python, and this environment was then used for training. In the game, each pipe passed earns 1 point, and the bird's death incurs a penalty of -1 point. Various modifications and experiments were performed in...

  5. [5]

    Very little information (images in the game) was used during the training

    CONCLUSIONS In this paper, Flappy Bird agents were trained with DQN and A3C. Very little information (only the images in the game) was used during training. Experiments with A3C resulted in much faster training. In addition to being fast, A3C has been shown to deliver much better results. The reason why A3C gives better and faster results is that there is a reward...

  6. [6]

    Q-learning,

    C. J. C. H. Watkins and P. Dayan, “Q-learning,” in Machine Learning, 1992

  7. [7]

    Playing Atari with Deep Reinforcement Learning

    V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves and I. Antonoglou, “Playing Atari with Deep Reinforcement Learning,” ArXiv, vol. abs/1312.5602, 2013

  8. [8]

    Human-level control through deep reinforcement learning,

    V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg and D. Hassabis, “Human-level control through deep reinforcement learning,” Nature, vol. 518, p. 529, February 25, 2015

  9. [9]

    Asynchronous Methods for Deep Reinforcement Learning

    V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver and K. Kavukcuoglu, “Asynchronous Methods for Deep Reinforcement Learning,” arxiv, vol. 1602.01783v2, 2016

  10. [10]

    Playing FlappyBird with Deep Reinforcement Learning,

    N. Appiah and S. Vare, “Playing FlappyBird with Deep Reinforcement Learning,” http://cs231n.stanford.edu/reports/2016/pdfs/111Report.pdf

  11. [12]

    Applying Q-Learning to Flappy Bird,

    M. Ebeling-Rump and Z. Hervieux-Moore, “Applying Q-Learning to Flappy Bird,” https://pdfs.semanticscholar.org/c8d8/45063aedd44e8dbf668774532aa0c01baa4f.pdf, 2016

  12. [13]

    Cooperative Multi-agent Reinforcement Learning for Flappy Bird,

    C. Rosset, C. Cevallos and I. Mukherjee, “Cooperative Multi-agent Reinforcement Learning for Flappy Bird,” 2016

  13. [14]

    Deep Reinforcement Learning to play Flappy Bird using A3C algorithm,

    S. Singh, “Deep Reinforcement Learning to play Flappy Bird using A3C algorithm,” [Online]. Available: https://shalabhsingh.github.io/Deep-RL-Flappy-Bird/

  14. [15]

    OpenAI Gym

    G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang and W. Zaremba, “OpenAI Gym,” arXiv, vol. 1606.01540, 2016

  15. [16]

    A new generalized deep learning framework combining sparse autoencoder and Taguchi method for novel data classification and processing

    Karim, A. M., Güzel, M. S., Tolun, M. R., Kaya, H., & Çelebi, F. V. (2018). A new generalized deep learning framework combining sparse autoencoder and Taguchi method for novel data classification and processing. Mathematical Problems in Engineering, Article ID 3145947, 13 pages

  16. [17]

    A new framework using deep auto-encoder and energy spectral density for medical waveform data classification and processing

    Karim, A. M., Güzel, M. S., Tolun, M. R., Kaya, H., & Çelebi, F. V. (2019). A new framework using deep auto-encoder and energy spectral density for medical waveform data classification and processing. Biocybernetics and Biomedical Engineering, 39, 148–159

  17. [18]

    Hamide Ozlem Dalgic, Erkan Bostanci, Mehmet Serdar Guzel, Genetic Algorithm Based Floor Planning System, arXiv preprint arXiv:1704.06016, 2017

  18. [19]

    An adaptive framework for mobile robot navigation

    Guzel, M. S., Kara, M. and Beyazkılıç, M. S., “An adaptive framework for mobile robot navigation,” Adapt. Behav. 25(1), 30-39 (2017)