Playing Flappy Bird via Asynchronous Advantage Actor Critic Algorithm
Pith reviewed 2026-05-25 01:39 UTC · model grok-4.3
The pith
An agent is trained to play Flappy Bird directly from raw game images using the Deep Q-Network and Asynchronous Advantage Actor Critic algorithms.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
An agent was trained to play Flappy Bird from raw game images using two reinforcement learning algorithms: Deep Q-Network (DQN) and Asynchronous Advantage Actor Critic (A3C). Through reinforcement, the trained model learned which decision to make and when. The reward or penalty returned at the end of each step served as the training signal.
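As a rough illustration of the per-step signal, the sketch below encodes the scoring given in the paper's Experiments excerpt quoted further down this page (1 point per pipe, a -1 penalty on death). The `StepOutcome` type, the function names, and the `alive_reward` default of 0.0 are hypothetical; the source does not state what an ordinary step returns.

```python
from dataclasses import dataclass


@dataclass
class StepOutcome:
    """Hypothetical summary of what happened during one game step."""
    passed_pipe: bool  # the bird cleared a pipe on this step
    died: bool         # the bird hit a pipe or the ground on this step


def step_reward(outcome: StepOutcome, alive_reward: float = 0.0) -> float:
    """Return the scalar reward for one step.

    +1 per pipe and -1 on death follow the Experiments excerpt;
    alive_reward is an assumption, since the source does not state it.
    """
    if outcome.died:
        return -1.0
    if outcome.passed_pipe:
        return 1.0
    return alive_reward


def episode_return(outcomes, alive_reward: float = 0.0) -> float:
    """Sum the per-step rewards over one episode (no discounting)."""
    return sum(step_reward(o, alive_reward) for o in outcomes)
```

Under these assumptions, an episode in which the bird clears two pipes and then crashes has an undiscounted return of 1 + 1 - 1 = 1.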
What carries the argument
Asynchronous Advantage Actor Critic (A3C) and Deep Q-Network algorithms that process raw game images and update policies based on per-step rewards or penalties.
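To make "update policies based on per-step rewards" concrete, here is a minimal sketch of the two learning targets these algorithms are generally built on: the one-step temporal-difference target used by DQN and the n-step advantage used by A3C. The function names and the discount factor `gamma=0.99` are assumptions for illustration; the paper reports no hyperparameters, and this is not the authors' implementation.

```python
def dqn_target(reward, next_q_values, done, gamma=0.99):
    """One-step TD target: r + gamma * max_a' Q(s', a'), or just r if terminal."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def n_step_advantages(rewards, values, bootstrap_value, gamma=0.99):
    """Advantages A_t = R_t - V(s_t) for one rollout segment.

    R_t is the discounted return accumulated backwards from bootstrap_value,
    which should be 0.0 when the segment ended in a terminal state.
    """
    advantages = []
    running_return = bootstrap_value
    for reward, value in zip(reversed(rewards), reversed(values)):
        running_return = reward + gamma * running_return
        advantages.append(running_return - value)
    advantages.reverse()
    return advantages
```

In DQN the target is regressed against the network's Q-value for the action taken; in A3C the advantage scales the policy-gradient term while the return serves as the regression target for the value head.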
If this is right
- The model acquires the ability to choose actions based solely on visual input.
- Training finishes when the agent consistently receives positive cumulative rewards from correct decisions.
- Both DQN and A3C can be applied to the same raw-image game task with per-step feedback.
- No additional state attributes beyond pixels are needed for the learning process described.
Where Pith is reading between the lines
- Similar raw-pixel training might apply to other simple 2D games that use basic reward structures.
- The approach leaves open whether the same methods would require more steps or different rewards in visually richer settings.
- Results connect to questions of how much preprocessing RL needs for visual control tasks.
Load-bearing premise
That standard reward signals and raw pixel inputs alone enable the algorithms to reach stable successful play without extra input engineering or hyperparameter details.
What would settle it
Running the trained agent in the game and finding that it achieves only random-level scores or fails to clear any pipes would show the training did not succeed.
Original abstract
Flappy Bird, which has a very high popularity, has been trained in many algorithms. Some of these studies were trained from raw pixel values of game and some from specific attributes. In this study, the model was trained with raw game images, which had not been seen before. The trained model has learned as reinforcement when to make which decision. As an input to the model, the reward or penalty at the end of each step was returned and the training was completed. Flappy Bird game was trained with the Reinforcement Learning algorithm Deep Q-Network and Asynchronous Advantage Actor Critic (A3C) algorithms.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims to apply the Deep Q-Network (DQN) and Asynchronous Advantage Actor-Critic (A3C) reinforcement learning algorithms to the Flappy Bird game using raw pixel inputs from game images. It states that per-step reward or penalty signals were provided and asserts that training was completed, enabling the model to learn appropriate actions.
Significance. If the central claim of successful training were supported by data, the work would constitute a straightforward empirical application of established RL methods to a pixel-based game, adding to the existing literature on game-playing agents but offering no novel algorithmic contributions or comparisons to prior baselines on this environment.
major comments (2)
- [Abstract] Abstract: The assertion that 'the training was completed' and 'the model has learned' when to act is presented without any quantitative results (e.g., scores, episode counts, learning curves, success rates, or baseline comparisons). This directly undermines evaluation of the central empirical claim that learning occurred under the described setup.
- [Abstract] Abstract and methods description: No details are supplied on network architecture, hyperparameters, training duration, or implementation of either DQN or A3C, nor is there any indication of how raw images were preprocessed or how the reward signal was defined. These omissions make the training claim impossible to assess or reproduce.
minor comments (1)
- [Abstract] Abstract: The sentence 'Flappy Bird game was trained with the Reinforcement Learning algorithm Deep Q-Network and Asynchronous Advantage Actor Critic (A3C) algorithms' contains redundant wording and should be clarified to distinguish the two algorithms.
Simulated Author's Rebuttal
We thank the referee for the detailed comments. We acknowledge that the submitted manuscript is brief and lacks the quantitative results and implementation details necessary to fully support and reproduce the claims. We will revise the manuscript to address these issues.
Point-by-point responses
- Referee: [Abstract] Abstract: The assertion that 'the training was completed' and 'the model has learned' when to act is presented without any quantitative results (e.g., scores, episode counts, learning curves, success rates, or baseline comparisons). This directly undermines evaluation of the central empirical claim that learning occurred under the described setup.
  Authors: We agree that the abstract and manuscript do not include quantitative results to support the claim that training was completed and the model learned appropriate actions. In the revised manuscript, we will add learning curves, achieved scores over episodes, number of training episodes, success rates, and any available baseline comparisons to substantiate the empirical results.
  Revision: yes
- Referee: [Abstract] Abstract and methods description: No details are supplied on network architecture, hyperparameters, training duration, or implementation of either DQN or A3C, nor is there any indication of how raw images were preprocessed or how the reward signal was defined. These omissions make the training claim impossible to assess or reproduce.
  Authors: We agree that the manuscript omits critical details on network architecture, hyperparameters, training duration, implementation specifics for DQN and A3C, image preprocessing, and reward signal definition, which are needed for assessment and reproducibility. In the revised version, we will expand the methods section to provide these details, including convolutional network layers, specific hyperparameter values, training steps, preprocessing steps such as frame resizing and stacking, and the exact per-step reward/penalty formulation.
  Revision: yes
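The preprocessing steps named in the response above (frame resizing and stacking) could look roughly like the following. This is an illustrative sketch only: the grayscale conversion, the downsampling by pixel skipping, the stack size of 4, and all names are assumptions, since the manuscript does not describe its pipeline. Frames are modelled as nested lists of `(r, g, b)` tuples to keep the example dependency-free.

```python
from collections import deque


def to_grayscale(frame):
    """Convert a frame of (r, g, b) pixels to luminance values in [0, 255]."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in frame]


def downsample(gray, factor):
    """Shrink a grayscale frame by keeping every factor-th pixel."""
    return [row[::factor] for row in gray[::factor]]


class FrameStack:
    """Keep the last `size` preprocessed frames as the agent's observation."""

    def __init__(self, size=4, factor=2):
        self.size = size
        self.factor = factor
        self.frames = deque(maxlen=size)

    def reset(self, frame):
        """Fill the stack with copies of the first frame of an episode."""
        processed = downsample(to_grayscale(frame), self.factor)
        self.frames.clear()
        for _ in range(self.size):
            self.frames.append(processed)
        return list(self.frames)

    def push(self, frame):
        """Add a new frame, dropping the oldest, and return the stack."""
        self.frames.append(downsample(to_grayscale(frame), self.factor))
        return list(self.frames)
```

Stacking several consecutive frames is a common way to let a feed-forward network infer motion, such as the bird's vertical velocity, which a single still image does not reveal.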
Circularity Check
No significant circularity; purely empirical application
The manuscript applies standard DQN and A3C algorithms to Flappy Bird using raw pixel inputs and per-step reward/penalty signals. No equations, derivations, parameter fittings, or uniqueness theorems are presented. The work contains no load-bearing steps that reduce by construction to their own inputs, self-citations, or ansatzes. It is a straightforward empirical RL experiment whose validity rests on reported outcomes rather than any definitional or fitted circularity.
Reference graph
Works this paper leans on
- [1] INTRODUCTION. Flappy Bird made a very fast entry into the market. It was the most downloaded mobile game at the beginning of 2014, but within a very short time it was withdrawn from the market. Flappy Bird is a single-player game with only one action: jump. The game is about deciding when the bird should jump. The bird ends as soon as it strikes down, ...
- [2] RELATED WORKS. In 2013, Mnih et al. [2] developed an algorithm called DQN, in which the agent is trained from images it has never seen before. Tested on Atari games, the algorithm produced results far above human results. Appiah and Vare [5] trained Flappy Bird with DQN in their study. They have found much better human results ...
- [3] METHOD. The reinforcement learning environment is PyGame, which was used to run the Flappy Bird game. The software was developed in Python. The deep learning models are written in Keras running on top of TensorFlow, and training is done through these libraries. A. Q Learning: It determines the reward for performing a specifi...
- [4] EXPERIMENTS. TensorFlow was selected as the backend of the Keras library. Pygame was used to develop the Flappy Bird game in Python. Once the game was built, this environment was used for training. Each pipe passed earns 1 point, and the bird's death on a collision incurs a penalty of -1 point. Various modifications and experiments were performed in...
- [5] CONCLUSIONS. In this paper, Flappy Bird was trained with DQN and A3C. Very little information (only the images in the game) was used during training. Experiments with A3C resulted in much faster training. In addition to being fast, A3C was shown to deliver much better results. The reason why A3C gives better and faster results is that there is a reward...
- [6] C. J. C. H. Watkins and P. Dayan, “Q-learning,” Machine Learning, 1992.
- [7] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves and I. Antonoglou, “Playing Atari with Deep Reinforcement Learning,” arXiv, vol. abs/1312.5602, 2013.
- [8] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran and D. Wier, “Human-level control through deep reinforcement learning,” Nature, vol. 518, p. 529, February 25, 2015.
- [9] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver and K. Kavukcuoglu, “Asynchronous Methods for Deep Reinforcement Learning,” arXiv, vol. 1602.01783v2, 2016.
- [10] N. Appiah and S. Vare, “Playing FlappyBird with Deep Reinforcement Learning,” http://cs231n.stanford.edu/reports/2016/pdfs/111Report.pdf, 2016.
- [12] M. Ebeling-Rump and Z. Hervieux-Moore, “Applying Q-Learning to Flappy Bird,” https://pdfs.semanticscholar.org/c8d8/45063aedd44e8dbf668774532aa0c01baa4f.pdf, 2016.
- [13] C. Rosset, C. Cevallos and I. Mukherjee, “Cooperative Multi-agent Reinforcement Learning for Flappy Bird,” 2016.
- [14] S. Singh, “Deep Reinforcement Learning to play Flappy Bird using A3C algorithm,” [Online]. Available: https://shalabhsingh.github.io/Deep-RL-Flappy-Bird/
- [15] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang and W. Zaremba, “OpenAI Gym,” arXiv, vol. 1606.01540, 2016.
- [16] A. M. Karim, M. S. Güzel, M. R. Tolun, H. Kaya and F. V. Çelebi, “A new generalized deep learning framework combining sparse autoencoder and Taguchi method for novel data classification and processing,” Mathematical Problems in Engineering, Article ID 3145947, 13 pages, 2018.
- [17] A. M. Karim, M. S. Güzel, M. R. Tolun, H. Kaya and F. V. Çelebi, “A new framework using deep auto-encoder and energy spectral density for medical waveform data classification and processing,” Biocybernetics and Biomedical Engineering, vol. 39, pp. 148–159, 2019.
- [18] H. O. Dalgic, E. Bostanci and M. S. Guzel, “Genetic Algorithm Based Floor Planning System,” arXiv preprint arXiv:1704.06016, 2017.
- [19] M. S. Guzel, M. Kara and M. S. Beyazkılıç, “An adaptive framework for mobile robot navigation,” Adapt. Behav., vol. 25, no. 1, pp. 30–39, 2017.