Playing Flappy Bird via Asynchronous Advantage Actor Critic Algorithm

Elit Cenk Alp; Mehmet Serdar Guzel

arxiv: 1907.03098 · v1 · pith:OQHPSZUMnew · submitted 2019-07-06 · 💻 cs.LG · cs.NE

Pith reviewed 2026-05-25 01:39 UTC · model grok-4.3

classification 💻 cs.LG cs.NE

keywords: Flappy Bird · reinforcement learning · A3C · deep Q-network · raw pixel input · game playing

The pith

An agent is trained to play Flappy Bird with the Deep Q-Network and Asynchronous Advantage Actor Critic algorithms, directly from raw game images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that reinforcement learning agents can master Flappy Bird by processing raw pixel inputs and receiving only a reward or penalty after each action. It applies both Deep Q-Network and Asynchronous Advantage Actor Critic methods to show that the model learns appropriate decisions through this end-to-end process. A sympathetic reader would care because the setup demonstrates game mastery without hand-engineered features or custom state representations. The work focuses on completing training via standard reinforcement signals from visual data alone.

Core claim

An agent was trained to play Flappy Bird from raw game images using two reinforcement learning algorithms, Deep Q-Network (DQN) and Asynchronous Advantage Actor Critic (A3C). Through reinforcement, the trained model learned which action to take and when. After each step, the resulting reward or penalty was fed back to the model, and training ran to completion.
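The per-step feedback in this claim can be illustrated with the standard one-step Q-learning target that DQN regresses toward. This is a minimal sketch, not the authors' code: the function name, the discount factor `gamma=0.99`, and the sample Q-values are assumptions made for illustration.

```python
def q_learning_target(reward, next_q_values, done, gamma=0.99):
    """One-step Q-learning target: r + gamma * max_a' Q(s', a').

    gamma=0.99 is an assumed discount factor, not a value taken from the paper.
    """
    if done:
        # Terminal step (the bird died): there is no future value to bootstrap from.
        return reward
    return reward + gamma * max(next_q_values)


# Hypothetical Q-values for two actions (do nothing, jump) in the next state.
print(q_learning_target(reward=1.0, next_q_values=[0.5, 2.0], done=False))
print(q_learning_target(reward=-1.0, next_q_values=[0.5, 2.0], done=True))
```

In a DQN, the network's prediction for the action actually taken would be pulled toward this target, typically with a squared-error or Huber loss.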

What carries the argument

Asynchronous Advantage Actor Critic (A3C) and Deep Q-Network algorithms that process raw game images and update policies based on per-step rewards or penalties.
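To make the "advantage" in A3C concrete, the sketch below computes discounted n-step returns for a short rollout and subtracts the critic's value estimates. It follows the general A3C recipe rather than anything specified in this paper; the function names, the rollout rewards, the critic values, and the discount factor are all illustrative assumptions.

```python
def n_step_returns(rewards, bootstrap_value, gamma=0.99):
    """Discounted returns R_t = r_t + gamma * R_{t+1}, seeded with V(s_T)."""
    returns = []
    running = bootstrap_value  # use 0.0 when the rollout ended in a terminal state
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def advantages(rewards, values, bootstrap_value, gamma=0.99):
    """Advantage A_t = R_t - V(s_t): how much better the rollout went than predicted."""
    returns = n_step_returns(rewards, bootstrap_value, gamma)
    return [ret - v for ret, v in zip(returns, values)]


# Hypothetical 3-step rollout: nothing happens, a pipe is passed (+1), the bird dies (-1).
rollout_rewards = [0.0, 1.0, -1.0]
critic_values = [0.2, 0.1, -0.5]  # made-up critic estimates V(s_t)
print(advantages(rollout_rewards, critic_values, bootstrap_value=0.0, gamma=0.5))
```

The actor's gradient is weighted by these advantages, while the critic is trained to bring its value estimates closer to the returns. In A3C, several workers compute such updates in parallel and apply them asynchronously to shared parameters.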

If this is right

The model acquires the ability to choose actions based solely on visual input.
Training finishes when the agent consistently receives positive cumulative rewards from correct decisions.
Both DQN and A3C can be applied to the same raw-image game task with per-step feedback.
No additional state attributes beyond pixels are needed for the learning process described.
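Learning from pixels alone usually still involves light image preprocessing before frames reach the network. The sketch below shows one common pattern (grayscale conversion, downsampling, and stacking the most recent frames) in plain Python. The paper's actual pipeline is not described here, so the stride of 2, the stack depth of 4, and all names are assumptions.

```python
from collections import deque


def to_grayscale(frame):
    """Convert an RGB frame (rows of (r, g, b) tuples) to integer luminance in [0, 255]."""
    return [[(299 * r + 587 * g + 114 * b) // 1000 for (r, g, b) in row] for row in frame]


def downsample(gray, step=2):
    """Keep every `step`-th pixel in both directions (a crude stand-in for resizing)."""
    return [row[::step] for row in gray[::step]]


class FrameStack:
    """Holds the last `k` processed frames so a network could infer motion."""

    def __init__(self, k=4):
        self.frames = deque(maxlen=k)

    def push(self, frame):
        processed = downsample(to_grayscale(frame))
        if not self.frames:
            # At episode start, fill the stack with copies of the first frame.
            for _ in range(self.frames.maxlen):
                self.frames.append(processed)
        else:
            self.frames.append(processed)
        return list(self.frames)


# Hypothetical 4x4 all-white RGB frame.
white = [[(255, 255, 255)] * 4 for _ in range(4)]
stack = FrameStack(k=4)
state = stack.push(white)
print(len(state), len(state[0]), len(state[0][0]))
```

Stacking matters because a single frame does not reveal whether the bird is rising or falling; a real implementation would typically use NumPy or an image library rather than nested lists.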

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar raw-pixel training might apply to other simple 2D games that use basic reward structures.
The approach leaves open whether the same methods would require more steps or different rewards in visually richer settings.
Results connect to questions of how much preprocessing RL needs for visual control tasks.

Load-bearing premise

That standard reward signals and raw pixel inputs alone enable the algorithms to reach stable successful play without extra input engineering or hyperparameter details.

What would settle it

Running the trained agent in the game and finding that it achieves only random-level scores or fails to clear any pipes would show the training did not succeed.
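A check of this kind can be sketched as a comparison between an agent's average score and a random baseline over many episodes. The code below uses a toy stand-in for the game, not Flappy Bird itself: `play_episode`, the two policies, and the episode counts are hypothetical and only illustrate the shape of the comparison.

```python
import random


def play_episode(policy, rng, max_pipes=50):
    """Toy stand-in for the game: each 'pipe' needs the right action (0 or 1) to pass.

    Returns the number of pipes cleared before the first mistake.
    """
    score = 0
    for _ in range(max_pipes):
        gap = rng.randint(0, 1)      # which action clears this pipe
        if policy(gap, rng) != gap:
            break                    # collision: the episode ends
        score += 1
    return score


def average_score(policy, episodes=200, seed=0):
    """Mean score over `episodes` runs, with a fixed seed for repeatability."""
    rng = random.Random(seed)
    return sum(play_episode(policy, rng) for _ in range(episodes)) / episodes


def random_policy(observation, rng):
    return rng.randint(0, 1)         # ignores the observation entirely


def oracle_policy(observation, rng):
    return observation               # stand-in for a well-trained agent


baseline = average_score(random_policy)
trained = average_score(oracle_policy)
print(f"random: {baseline:.2f}, trained stand-in: {trained:.2f}")
```

If a trained agent's average sat near the random baseline in the real game, that would indicate the training had not succeeded.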

Figures

Figures reproduced from arXiv: 1907.03098 by Elit Cenk Alp, Mehmet Serdar Guzel.

**Figure 2.** The pre-processed image of Flappy Bird.

**Figure 3.** The flow chart of the Q-learning algorithm.

**Figure 4.** Actor Critic model.

**Figure 6.** Asynchronous Advantage Actor Critic model.

**Figure 7.** DQN average score.

read the original abstract

Flappy Bird, which has a very high popularity, has been trained in many algorithms. Some of these studies were trained from raw pixel values of game and some from specific attributes. In this study, the model was trained with raw game images, which had not been seen before. The trained model has learned as reinforcement when to make which decision. As an input to the model, the reward or penalty at the end of each step was returned and the training was completed. Flappy Bird game was trained with the Reinforcement Learning algorithm Deep Q-Network and Asynchronous Advantage Actor Critic (A3C) algorithms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper applies A3C to Flappy Bird from raw pixels but reports no scores, curves, or comparisons, so the claim that training succeeded cannot be checked.

read the letter

The main thing here is that the authors ran A3C on Flappy Bird using raw game images and per-step rewards, yet the manuscript contains no numbers at all to show whether the agent actually learned anything useful. The abstract simply states that training completed and the model has learned when to act. That is the entire result offered. The work also mentions DQN in passing but focuses on A3C.

The setup itself is ordinary: raw pixels in, standard reward signal out, off-the-shelf algorithm. By 2019 this combination had already appeared in the literature, so the paper adds no new method or benchmark variant. What it does is lay out a basic experimental description without any supporting data.

The obvious weakness is the total absence of evidence. No average scores, no episode lengths, no learning curves, no baseline runs, and no mention of how many frames or episodes were needed. Without those details the central claim stays unevaluated. The paper also does not compare against earlier Flappy Bird results or discuss why raw pixels were chosen over hand-crafted features in this specific case.

This kind of write-up might serve as a quick classroom example for someone first learning A3C, but it does not move the research conversation forward. I would not bring it to a reading group or cite it. It does not merit sending out for peer review because the reader has nothing concrete to referee.

Referee Report

2 major / 1 minor

Summary. The manuscript claims to apply the Deep Q-Network (DQN) and Asynchronous Advantage Actor-Critic (A3C) reinforcement learning algorithms to the Flappy Bird game using raw pixel inputs from game images. It states that per-step reward or penalty signals were provided and asserts that training was completed, enabling the model to learn appropriate actions.

Significance. If the central claim of successful training were supported by data, the work would constitute a straightforward empirical application of established RL methods to a pixel-based game, adding to the existing literature on game-playing agents but offering no novel algorithmic contributions or comparisons to prior baselines on this environment.

major comments (2)

[Abstract] Abstract: The assertion that 'the training was completed' and 'the model has learned' when to act is presented without any quantitative results (e.g., scores, episode counts, learning curves, success rates, or baseline comparisons). This directly undermines evaluation of the central empirical claim that learning occurred under the described setup.
[Abstract] Abstract and methods description: No details are supplied on network architecture, hyperparameters, training duration, or implementation of either DQN or A3C, nor is there any indication of how raw images were preprocessed or how the reward signal was defined. These omissions make the training claim impossible to assess or reproduce.

minor comments (1)

[Abstract] Abstract: The sentence 'Flappy Bird game was trained with the Reinforcement Learning algorithm Deep Q-Network and Asynchronous Advantage Actor Critic (A3C) algorithms' contains redundant wording and should be clarified to distinguish the two algorithms.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed comments. We acknowledge that the submitted manuscript is brief and lacks the quantitative results and implementation details necessary to fully support and reproduce the claims. We will revise the manuscript to address these issues.

read point-by-point responses

Referee: [Abstract] Abstract: The assertion that 'the training was completed' and 'the model has learned' when to act is presented without any quantitative results (e.g., scores, episode counts, learning curves, success rates, or baseline comparisons). This directly undermines evaluation of the central empirical claim that learning occurred under the described setup.

Authors: We agree that the abstract and manuscript do not include quantitative results to support the claim that training was completed and the model learned appropriate actions. In the revised manuscript, we will add learning curves, achieved scores over episodes, number of training episodes, success rates, and any available baseline comparisons to substantiate the empirical results. (revision: yes)

Referee: [Abstract] Abstract and methods description: No details are supplied on network architecture, hyperparameters, training duration, or implementation of either DQN or A3C, nor is there any indication of how raw images were preprocessed or how the reward signal was defined. These omissions make the training claim impossible to assess or reproduce.

Authors: We agree that the manuscript omits critical details on network architecture, hyperparameters, training duration, implementation specifics for DQN and A3C, image preprocessing, and reward signal definition, which are needed for assessment and reproducibility. In the revised version, we will expand the methods section to provide these details, including convolutional network layers, specific hyperparameter values, training steps, preprocessing steps such as frame resizing and stacking, and the exact per-step reward/penalty formulation. (revision: yes)

Circularity Check

0 steps flagged

No significant circularity; purely empirical application

full rationale

The manuscript applies standard DQN and A3C algorithms to Flappy Bird using raw pixel inputs and per-step reward/penalty signals. No equations, derivations, parameter fittings, or uniqueness theorems are presented. The work contains no load-bearing steps that reduce by construction to their own inputs, self-citations, or ansatzes. It is a straightforward empirical RL experiment whose validity rests on reported outcomes rather than any definitional or fitted circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. Standard RL training implicitly relies on reward design and network architecture choices that are not detailed.

pith-pipeline@v0.9.0 · 5624 in / 1030 out tokens · 16572 ms · 2026-05-25T01:39:18.130606+00:00 · methodology

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · 4 internal anchors

[1]

It was the most downloaded mobile game at the beginning of 2014

INTRODUCTION Flappy Bird made a very fast entry into the market. It was the most downloaded mobile game at the beginning of 2014. But within a very short time the market has withdrawn. Flappy Bird game is a single player game. There is only one action that jump. The game is about deciding when a bird should jump. The bird ends as soon as it strikes down, ...

work page 2014
[2]

[2] developed an algorithm called DQN

RELATED WORKS In 2013, Mnih et al. [2] developed an algorithm called DQN. In this study, the agent is trained from the images that have never been seen before. The algorithm that is tested on Atari games has produced results far above human results. Appiah and Vare [5] trained Flappy Bird with DQN in their study. They have found much better human results ...

work page 2013
[3]

at”, observes a rewards “rt

METHOD The reinforcement learning environment is PyGame. Flappy Bird game was run with PyGame. The environment in which the software is developed is Python. Deep Learning models are written in Keras, which is currently working on Tensorflow, and their training is done through these libraries. A. Q Learning It determines the reward for performing a specifi...

work page
[4]

Pygame was used to develop Flappy Bird game on Python

EXPERIMENTS Tensorflow was selected in the backend of the Keras library. Pygame was used to develop Flappy Bird game on Python. After the game, this environment was used for education. In the game, each pipe is given a point of 1 point and the bird is hit as a result of the death of -1 point penalty. Various modifications and experiments were performed in...

work page
[5]

Very little information (images in the game) was used during the training

CONCLUSIONS Flappy Bird, DQN and A3C were trained in this paper. Very little information (images in the game) was used during the training. Experiments with the A3C resulted in much faster training. In addition to being fast, the A3C has been shown to deliver much better results. The reason why A3C gives better and faster results is that there is a reward...

work page
[6]

Q-learning,

C. J. C. H. Watkins and P. Dayan, “Q-learning,” in Machine Learning, 1992

work page 1992
[7]

Playing Atari with Deep Reinforcement Learning

V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves and I. Antonoglou, “Playing Atari with Deep Reinforcement Learning,” ArXiv, vol. abs/1312.5602, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013
[8]

Human-level control through deep reinforcement learning,

V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg and D. Hassabis, “Human-level control through deep reinforcement learning,” Nature, vol. 518, p. 529, February 25, 2015

work page 2015
[9]

Asynchronous Methods for Deep Reinforcement Learning

V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver and K. Kavukcuoglu, “Asynchronous Methods for Deep Reinforcement Learning,” arxiv, vol. 1602.01783v2, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[10]

Playing FlappyBird with Deep Reinforcement Learning,

N. Appiah and S. Vare, “Playing FlappyBird with Deep Reinforcement Learning,” http://cs231n.stanford.edu/reports/2016/pdfs/111Report.pdf

work page 2016
[12]

Applying Q-Learning to Flappy Bird,

M. Ebeling-Rump and Z. Hervieux-Moore, “Applying Q-Learning to Flappy Bird,” https://pdfs.semanticscholar.org/c8d8/45063aedd44e8dbf668774532aa0c01baa4f.pdf, 2016

work page 2016
[13]

Cooperative Multi-agent Reinforcement Learning for Flappy Bird,

C. Rosset, C. Cevallos and I. Mukherjee, “Cooperative Multi-agent Reinforcement Learning for Flappy Bird,” 2016

work page 2016
[14]

Deep Reinforcement Learning to play Flappy Bird using A3C algorithm,

S. Singh, “Deep Reinforcement Learning to play Flappy Bird using A3C algorithm,” [Online]. Available: https://shalabhsingh.github.io/Deep-RL-Flappy-Bird/

work page
[15]

OpenAI Gym

G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang and W. Zaremba, “OpenAI Gym,” arXiv, vol. 1606.01540, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[16]

M., Güzel, M

Karim, A. M., Güzel, M. S., Tolun, M. R., Kaya, H., & Çelebi, F. V. (2018). A new generalized deep learning framework combining sparse autoencoder and taguchi method for novel data classification and processing mathematical problems in engineering (p. Article ID 3145947, 13 pages)

work page 2018
[17]

M., Güzel, M

Karim, A. M., Güzel, M. S., Tolun, M. R., Kaya, H., & Çelebi, F. V. (2019). A new framework using deep auto-encoder and energy spectral density for medical waveform data classification and processing. Biocybernetics and Biomedical Engineering, 39, 148–159

work page 2019
[18]

Hamide Ozlem Dalgic, Erkan Bostanci, Mehmet Serdar Guzel, Genetic Algorithm Based Floor Planning System, arXiv preprint arXiv:1704.06016, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[19]

S., Kara M

Guzel, M. S., Kara M. and Beyazkılıç, M. S., “An adaptive framework for mobile robot navigation,” Adapt. Behav. 25(1), 30-39 (2017)

work page 2017
