Enhanced Deep Q-Learning for 2D Self-Driving Cars: Implementation and Evaluation on a Custom Track Environment
Pith reviewed 2026-05-24 03:09 UTC · model grok-4.3
The pith
A modified Deep Q-Network using priority-based action selection reaches an average reward of around 40 after 1000 episodes in a 2D car simulation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the custom Pygame track environment, the modified DQN with priority-based action selection achieves an average reward of around 40 after 1000 episodes, approximately 60 percent higher than the original DQN and 50 percent higher than the vanilla neural network.
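The two percentages can be turned back into the baseline scores they imply. The sketch below assumes that "around 40" is exactly 40 and that both percentages are relative gains over each baseline's own average reward; neither assumption is confirmed by the paper, and the helper name `implied_baseline` is illustrative.

```python
def implied_baseline(reported: float, pct_higher: float) -> float:
    """Return the baseline b such that reported = b * (1 + pct_higher / 100)."""
    return reported / (1.0 + pct_higher / 100.0)


# "Around 40" in the abstract, treated as exact here (an assumption).
modified_dqn = 40.0
original_dqn = implied_baseline(modified_dqn, 60.0)  # 60% higher than original DQN
vanilla_nn = implied_baseline(modified_dqn, 50.0)    # 50% higher than vanilla network

print(f"implied original DQN reward: {original_dqn:.2f}")
print(f"implied vanilla NN reward:   {vanilla_nn:.2f}")
```

Read this way, the original DQN would sit near 25 and the vanilla network near 26.7, so the plain network would slightly outscore the unmodified DQN. Whether that ordering is real or an artifact of rounding cannot be told from the abstract alone.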
What carries the argument
The priority-based action selection mechanism added to the standard DQN, which alters how the agent chooses actions during training.
If this is right
- The priority-based modification improves final performance relative to both standard DQN and a plain neural network under the same training budget.
- Sensor readings spaced 20 degrees apart supply sufficient state information for the agent to learn a driving policy that yields positive average reward.
- Training runs of 1000 episodes are long enough for the modified DQN to demonstrate a clear numerical advantage over the baselines.
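To make the sensor layout concrete, here is a minimal sketch of seven rays spaced 20 degrees apart, assuming they are centred on the car's heading (so they span -60 to +60 degrees) and that the track is available as a boolean grid. The grid representation, the function names, and the `max_range` and `step` values are all hypothetical; the paper does not describe its ray-casting implementation.

```python
import math


def sensor_angles(n_sensors: int = 7, spacing_deg: float = 20.0) -> list[float]:
    """Angles in degrees relative to the heading, centred on straight ahead."""
    half_span = spacing_deg * (n_sensors - 1) / 2.0
    return [i * spacing_deg - half_span for i in range(n_sensors)]


def cast_ray(track, x, y, heading_deg, offset_deg, max_range=50.0, step=0.5):
    """March along one ray; return the distance covered before leaving the track.

    `track` is a hypothetical grid: track[row][col] is True for drivable cells.
    """
    angle = math.radians(heading_deg + offset_deg)
    dx, dy = math.cos(angle), math.sin(angle)
    dist = 0.0
    while dist < max_range:
        col = math.floor(x + dx * dist)
        row = math.floor(y + dy * dist)
        off_grid = not (0 <= row < len(track) and 0 <= col < len(track[0]))
        if off_grid or not track[row][col]:
            return dist
        dist += step
    return max_range


def read_sensors(track, x, y, heading_deg):
    """Return the seven distance readings that would form the agent's state."""
    return [cast_ray(track, x, y, heading_deg, a) for a in sensor_angles()]


# A straight corridor 5 cells wide and 20 cells long, car facing along +x.
corridor = [[True] * 20 for _ in range(5)]
readings = read_sensors(corridor, 2.5, 2.5, 0.0)
```

In this corridor the centre ray runs the length of the track while the outer rays hit the side walls almost immediately, which is the kind of contrast a driving policy would need to read. In screen coordinates the y axis usually points down, so the sign convention for left and right would have to match the real environment.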
Where Pith is reading between the lines
- The same priority mechanism could be tested on tracks with different layouts or with added sensors to check whether the reported gain persists.
- If the reward function penalizes crashes or off-track time, the higher average reward implies the modified agent spends more time on the track.
- The gap versus the vanilla neural network isolates the contribution of the Q-learning update combined with the priority rule.
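For reference, the Q-learning update mentioned in the last point is usually built on a one-step bootstrapped target. The sketch below shows that textbook target only; the discount factor, the use of a target network, and the function name `td_target` are assumptions, since the paper reports none of them.

```python
def td_target(reward: float, next_q_values: list[float], gamma: float, done: bool) -> float:
    """One-step Q-learning target: r + gamma * max over a' of Q(s', a').

    For a terminal transition there is no next state, so the target is just r.
    `next_q_values` would come from the target network in a standard DQN.
    """
    if done:
        return reward
    return reward + gamma * max(next_q_values)


# Illustrative numbers only, not values from the paper.
example = td_target(reward=1.0, next_q_values=[0.5, 2.0, -1.0], gamma=0.9, done=False)
```

A network trained without this bootstrapped target has no such recursion, which is why the comparison against the vanilla network bears on the value of the update itself.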
Load-bearing premise
The custom Pygame track, seven fixed-angle sensors, and unspecified reward function create a stable testbed that supports direct comparison of the DQN variants.
What would settle it
Re-training the three agents in the identical environment for 1000 episodes and observing that the modified DQN no longer records a higher average reward than the original DQN.
Original abstract
This research project presents the implementation of a Deep Q-Learning Network (DQN) for a self-driving car on a 2-dimensional (2D) custom track, with the objective of enhancing the DQN network's performance. It encompasses the development of a custom driving environment using Pygame on a track surrounding the University of Memphis map, as well as the design and implementation of the DQN model. The algorithm utilizes data from 7 sensors installed in the car, which measure the distance between the car and the track. These sensors are positioned in front of the vehicle, spaced 20 degrees apart, enabling them to sense a wide area ahead. We successfully implemented the DQN and also a modified version of the DQN with a priority-based action selection mechanism, which we refer to as modified DQN. The model was trained over 1000 episodes, and the average reward received by the agent was found to be around 40, which is approximately 60% higher than the original DQN and around 50% higher than the vanilla neural network.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper describes the implementation of a standard Deep Q-Network (DQN) and a modified DQN incorporating priority-based action selection for controlling a 2D self-driving car in a custom Pygame environment. The car is equipped with seven distance sensors spaced 20 degrees apart. After training for 1000 episodes, the modified DQN is reported to achieve an average reward of approximately 40, stated to be 60% higher than the original DQN and 50% higher than a vanilla neural network.
Significance. If the performance improvements could be verified with complete and reproducible details, the work would provide a concrete example of a simple modification to DQN yielding gains in a simulated 2D driving task. The current lack of methodological transparency, however, prevents any assessment of whether the result holds or generalizes, limiting its potential contribution to reinforcement learning for autonomous agents.
major comments (5)
- [Abstract and Results] The reward function is never defined, whether in the abstract, the environment section, or the results. This is load-bearing for the central claim because the reported average reward of ~40 and the 60%/50% improvements cannot be interpreted, reproduced, or attributed to the priority mechanism without knowing the reward terms for progress, collisions, time, and any shaping.
- [Methodology] No neural network architecture is specified (layers, units per layer, activations, or input/output dimensions). This prevents evaluation of whether the modified DQN differs meaningfully from the baseline beyond the action-selection rule.
- [Experimental Setup] Hyperparameters (learning rate, discount factor, replay buffer size, epsilon schedule, batch size, target network update frequency) and training procedure details are absent. Without these, the 1000-episode result cannot be assessed for stability or compared fairly across the three methods.
- [Results] The performance numbers lack any report of independent runs, standard deviations, error bars, or statistical tests. The 60% improvement claim therefore cannot be distinguished from run-to-run variance or selective reporting.
- [Environment Description] The custom Pygame environment (track geometry, exact sensor ray-casting implementation, action space, collision model, episode termination conditions) is described only at a high level. This undermines the claim that the testbed allows unbiased comparison of the DQN variants.
minor comments (3)
- Include learning curves or per-episode reward plots to document convergence behavior rather than reporting only the final average.
- Define the precise priority-based action selection rule (e.g., how priorities are computed and how they alter the standard argmax or epsilon-greedy policy).
- Add a table listing all hyperparameters and environment constants to support replication.
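To show what the second minor comment is asking for, the sketch below places the standard epsilon-greedy rule next to one purely hypothetical priority variant in which exploration samples actions in proportion to fixed weights. The variant, its `priorities` argument, and both function names are invented for illustration; the paper does not define its rule, and this should not be read as the authors' method.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Standard rule: explore uniformly with probability epsilon, else take argmax Q."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def priority_epsilon_greedy(q_values, epsilon, priorities, rng=random):
    """HYPOTHETICAL variant: exploration samples actions in proportion to `priorities`.

    One possible reading of "priority-based action selection", not the paper's rule.
    """
    if rng.random() < epsilon:
        return rng.choices(range(len(q_values)), weights=priorities, k=1)[0]
    return max(range(len(q_values)), key=lambda a: q_values[a])


q = [0.1, 0.9, 0.3]
greedy_action = epsilon_greedy(q, epsilon=0.0)                               # never explores
forced_action = priority_epsilon_greedy(q, epsilon=1.0, priorities=[0, 0, 1])  # always explores
```

A definition at this level of detail would let a reader see exactly where the modified agent departs from the baseline, and whether the priorities act during exploration, exploitation, or both.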
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting areas where methodological transparency can be improved. We agree that several details were omitted from the initial manuscript and will incorporate them in a major revision to enhance reproducibility and allow proper evaluation of the claims.
- Referee: [Abstract and Results] The reward function is never defined, whether in the abstract, the environment section, or the results. This is load-bearing for the central claim because the reported average reward of ~40 and the 60%/50% improvements cannot be interpreted, reproduced, or attributed to the priority mechanism without knowing the reward terms for progress, collisions, time, and any shaping.
  Authors: We agree the reward function must be explicitly defined to interpret the reported rewards and improvements. The original manuscript omitted this. In revision we will add a complete description of the reward function, including all components for track distance, collision penalties, time costs, and progress shaping terms.
  Revision: yes
- Referee: [Methodology] No neural network architecture is specified (layers, units per layer, activations, or input/output dimensions). This prevents evaluation of whether the modified DQN differs meaningfully from the baseline beyond the action-selection rule.
  Authors: The neural network architecture was not detailed. We will add a full specification in the revised manuscript, covering layer counts, units per layer, activation functions, input dimensions derived from the seven sensors, and output dimensions matching the action space.
  Revision: yes
- Referee: [Experimental Setup] Hyperparameters (learning rate, discount factor, replay buffer size, epsilon schedule, batch size, target network update frequency) and training procedure details are absent. Without these, the 1000-episode result cannot be assessed for stability or compared fairly across the three methods.
  Authors: Hyperparameters and training details were omitted. The revision will include a comprehensive list of all hyperparameters together with the training procedure to support reproducibility and fair comparison.
  Revision: yes
- Referee: [Results] The performance numbers lack any report of independent runs, standard deviations, error bars, or statistical tests. The 60% improvement claim therefore cannot be distinguished from run-to-run variance or selective reporting.
  Authors: Results were reported from single runs without variance measures. We will rerun experiments with multiple independent seeds, report means and standard deviations, and add error bars to strengthen the improvement claims.
  Revision: yes
- Referee: [Environment Description] The custom Pygame environment (track geometry, exact sensor ray-casting implementation, action space, collision model, episode termination conditions) is described only at a high level. This undermines the claim that the testbed allows unbiased comparison of the DQN variants.
  Authors: The environment section will be expanded with precise details on track geometry, ray-casting sensor implementation, discrete action space, collision detection model, and episode termination rules to enable unbiased comparisons.
  Revision: yes
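The multi-seed reporting promised in the response on results could look like the sketch below, which summarises per-seed final rewards as a mean and sample standard deviation and then computes the relative gain. The reward lists are placeholders chosen only to make the arithmetic easy to follow; they are not results from the paper, and the names `summarize` and `percent_gain` are illustrative.

```python
import statistics


def summarize(final_rewards):
    """Mean and sample standard deviation of per-seed final average rewards."""
    return statistics.mean(final_rewards), statistics.stdev(final_rewards)


def percent_gain(candidate_mean, baseline_mean):
    """Relative improvement of the candidate over the baseline, in percent."""
    return 100.0 * (candidate_mean - baseline_mean) / baseline_mean


# Placeholder numbers for illustration only; these are NOT results from the paper.
runs = {
    "candidate": [10.0, 12.0, 14.0],
    "baseline": [8.0, 10.0, 12.0],
}
stats = {name: summarize(values) for name, values in runs.items()}
for name, (mean, std) in stats.items():
    print(f"{name}: {mean:.1f} +/- {std:.1f} over {len(runs[name])} seeds")

gain = percent_gain(stats["candidate"][0], stats["baseline"][0])
print(f"relative gain: {gain:.0f}%")
```

With the spread printed next to each mean, a reader can judge whether a reported gain is large relative to seed-to-seed variation, which a single run cannot show.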
Circularity Check
No circularity: empirical implementation results with no derivations or fitted equations
The paper is an implementation and evaluation study reporting average rewards from training DQN variants over 1000 episodes in a custom Pygame environment. No mathematical derivations, equations, or parameter-fitting steps are present that could reduce any claim to its inputs by construction. The central performance numbers are direct training outcomes, not predictions derived from fitted quantities or self-referential definitions. Self-citations are absent from the provided text, and the work is self-contained as an empirical report.