Enhanced Deep Q-Learning for 2D Self-Driving Cars: Implementation and Evaluation on a Custom Track Environment

Bidhya Shrestha; Sagar Pathak

arxiv: 2402.08780 · v2 · submitted 2024-02-13 · 💻 cs.AI

Pith reviewed 2026-05-24 03:09 UTC · model grok-4.3

classification 💻 cs.AI

keywords deep q-learning · reinforcement learning · self-driving simulation · 2d track environment · priority action selection · pygame · distance sensors

The pith

A modified Deep Q-Network using priority-based action selection reaches an average reward of around 40 after 1000 episodes in a 2D car simulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a custom 2D driving track in Pygame and equips a car with seven fixed-angle distance sensors to gather input for a reinforcement learning agent. It implements both a standard Deep Q-Network and a modified version that adds priority-based action selection, then trains each for 1000 episodes. The modified version records an average reward of around 40, reported as 60 percent above the standard DQN and 50 percent above a basic neural network. A sympathetic reader would care because the result suggests a simple change to action selection can measurably improve learning speed and final performance inside a controlled sensor-based driving task.

Core claim

In the custom Pygame track environment, the modified DQN with priority-based action selection achieves an average reward of around 40 after 1000 episodes, approximately 60 percent higher than the original DQN and 50 percent higher than the vanilla neural network.
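
The percentages in this claim pin down the baseline scores they imply. A quick back-of-envelope check, assuming that "X percent higher" means modified = baseline × (1 + X/100) and treating the reported "around 40" as exactly 40 (both are assumptions, and the function name is illustrative):

```python
def implied_baseline(score: float, percent_higher: float) -> float:
    """Baseline b such that `score` is `percent_higher` percent above b."""
    return score / (1.0 + percent_higher / 100.0)


modified = 40.0  # "around 40" in the paper, treated as exact here (assumption)
dqn = implied_baseline(modified, 60.0)      # original DQN
vanilla = implied_baseline(modified, 50.0)  # vanilla neural network

print(f"implied original DQN average: {dqn:.2f}")
print(f"implied vanilla NN average:   {vanilla:.2f}")
```

Under these assumptions the implied averages are 25 for the original DQN and roughly 26.7 for the vanilla network, which would place the vanilla network slightly above the original DQN.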

What carries the argument

The priority-based action selection mechanism added to the standard DQN, which alters how the agent chooses actions during training.
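
The paper does not spell out its priority rule, so the following is only one plausible reading and not the authors' method: an epsilon-greedy policy whose exploration step samples actions in proportion to hand-set priority weights rather than uniformly. The action names, the weights, and the function `select_action` are all hypothetical.

```python
import random

# Hypothetical discrete action set and priority weights (assumptions, not from the paper).
ACTIONS = ["forward", "left", "right", "brake"]
PRIORITY = [0.5, 0.2, 0.2, 0.1]


def select_action(q_values, epsilon, rng=random):
    """Epsilon-greedy with priority-weighted exploration.

    With probability 1 - epsilon, exploit: return the argmax of q_values.
    Otherwise explore: sample an action index in proportion to PRIORITY
    (standard epsilon-greedy would sample uniformly at this step).
    """
    if rng.random() >= epsilon:
        return max(range(len(q_values)), key=lambda i: q_values[i])
    return rng.choices(range(len(ACTIONS)), weights=PRIORITY, k=1)[0]


rng = random.Random(0)
greedy = select_action([0.1, 0.9, 0.3, 0.0], epsilon=0.0, rng=rng)
print(ACTIONS[greedy])  # epsilon = 0 always exploits
```

Whatever the actual rule is, the point of comparison is the same: only the action-choice step differs from standard DQN, while the Q-value estimates are learned as before.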

If this is right

The priority-based modification improves final performance relative to both standard DQN and a plain neural network under the same training budget.
Sensor readings spaced 20 degrees apart supply sufficient state information for the agent to learn a driving policy that yields positive average reward.
Training runs of 1000 episodes are long enough for the modified DQN to demonstrate a clear numerical advantage over the baselines.
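
To make the sensor layout concrete: seven sensors at 20-degree spacing span 120 degrees in total. The sketch below assumes they are centered on the car's heading (so offsets run from -60 to +60 degrees) and measures distance by stepping along each ray over a boolean track mask. The grid encoding, step size, and function names are illustrative assumptions; the paper's actual ray-casting is not described.

```python
import math


def sensor_offsets(n_sensors=7, spacing_deg=20.0):
    """Angular offsets (degrees) relative to the heading, assumed symmetric."""
    half_span = spacing_deg * (n_sensors - 1) / 2.0
    return [-half_span + i * spacing_deg for i in range(n_sensors)]


def cast_ray(track, x, y, angle_deg, max_dist=50.0, step=0.5):
    """Distance travelled along a ray before leaving the drivable area.

    `track[row][col]` is True where the car may drive (an assumed encoding).
    """
    dx = math.cos(math.radians(angle_deg))
    dy = math.sin(math.radians(angle_deg))
    dist = 0.0
    while dist < max_dist:
        col = math.floor(x + dx * dist)
        row = math.floor(y + dy * dist)
        off_grid = not (0 <= row < len(track) and 0 <= col < len(track[0]))
        if off_grid or not track[row][col]:
            return dist
        dist += step
    return max_dist


def read_sensors(track, x, y, heading_deg):
    """State vector for the agent: one distance per sensor."""
    return [cast_ray(track, x, y, heading_deg + off) for off in sensor_offsets()]


# Toy 20x20 track: everything drivable except a wall at column 15.
track = [[col != 15 for col in range(20)] for _ in range(20)]
state = read_sensors(track, x=10.0, y=10.0, heading_deg=0.0)
print(sensor_offsets())
print(state)
```

The resulting seven-element list is the kind of state vector a DQN for this task would take as input.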

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same priority mechanism could be tested on tracks with different layouts or with added sensors to check whether the reported gain persists.
If the reward function penalizes crashes or off-track time, the higher average reward implies the modified agent spends more time on the track.
The gap versus the vanilla neural network isolates the contribution of the Q-learning update combined with the priority rule.
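
For reference, the Q-learning update mentioned in the last point is, in standard DQN, a regression of Q(s, a) toward the one-step target y = r + gamma * max over a' of Q_target(s', a'), with y = r at terminal states. A minimal sketch in plain Python follows; the discount value and sample numbers are illustrative, and how the paper's vanilla neural network is trained is not stated.

```python
def td_target(reward, next_q_values, done, gamma=0.99):
    """One-step Q-learning target used by standard DQN.

    reward:        reward received for the transition
    next_q_values: target-network Q-values for every action in the next state
    done:          True if the episode ended on this transition
    gamma:         discount factor (0.99 is a common default, assumed here)
    """
    if done:
        return reward  # no bootstrapping past a terminal state
    return reward + gamma * max(next_q_values)


def squared_td_error(q_value, target):
    """Per-sample squared error; Huber loss is a common alternative."""
    return (q_value - target) ** 2


# Illustrative numbers only.
y = td_target(reward=1.0, next_q_values=[0.2, 0.5, 0.1], done=False)
print(y, squared_td_error(q_value=1.0, target=y))
```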

Load-bearing premise

The custom Pygame track, seven fixed-angle sensors, and unspecified reward function create a stable testbed that supports direct comparison of the DQN variants.

What would settle it

Re-training the three agents in the identical environment for 1000 episodes and observing that the modified DQN no longer records a higher average reward than the original DQN.

Figures

Figures reproduced from arXiv: 2402.08780 by Bidhya Shrestha, Sagar Pathak.

**Figure 1.** Flowchart of reinforcement learning. Adjacent text from the paper: Reinforcement learning is a machine learning technique that enables an agent to learn by interacting with the environment. The agent decides which actions to take based on the rewards and penalties it gets from taking a particular action in a particular state. It is an interactive process by trial and error using feedback from its own actions and e…

**Figure 2.** Proposed Deep RL Framework for autonomous driving by

**Figure 3.** Double DQN structure proposed by Max Peter et al.

**Figure 4.** Selected region for the vehicle simulation around the University

**Figure 5.** Network map after conversion using the SUMO netconvert tool

**Figure 7.** Self-driving car in the environment with sensors.

**Figure 8.** Performance of vanilla neural network

**Figure 9.** Performance of modified DQN. Adjacent text from the paper's Discussion: The experiment showed that it was hard for the original DQN to get from source to destination, that is, to complete the whole track. However, tweaking the action selection mechanism improved performance, and the car could complete a full round of the track. The vanilla neural network was also able to complete a round, but it took longer to learn. It was…

Original abstract

This research project presents the implementation of a Deep Q-Learning Network (DQN) for a self-driving car on a 2-dimensional (2D) custom track, with the objective of enhancing the DQN network's performance. It encompasses the development of a custom driving environment using Pygame on a track surrounding the University of Memphis map, as well as the design and implementation of the DQN model. The algorithm utilizes data from 7 sensors installed in the car, which measure the distance between the car and the track. These sensors are positioned in front of the vehicle, spaced 20 degrees apart, enabling them to sense a wide area ahead. We successfully implemented the DQN and also a modified version of the DQN with a priority-based action selection mechanism, which we refer to as modified DQN. The model was trained over 1000 episodes, and the average reward received by the agent was found to be around 40, which is approximately 60% higher than the original DQN and around 50% higher than the vanilla neural network.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The 60% improvement claim can't be checked because the reward function and all training details are missing from this DQN driving implementation.

The central problem is that the reported performance gap rests on an undefined reward function and zero supporting statistics, so we have no basis to attribute anything to the priority tweak. They built a Pygame environment on a University of Memphis map, equipped the car with seven fixed-angle distance sensors spaced 20 degrees apart, trained a standard DQN plus a modified version that changes action selection priority, and ran both for 1000 episodes. The modified agent reached average reward around 40, which they say is 60% above plain DQN and 50% above a vanilla network. That is the entire result.

What the paper actually supplies is a working custom environment and two runnable agents. The implementation itself appears complete at the level of getting the car to drive without crashing in their track. The priority mechanism is a simple heuristic addition that could be worth testing in other settings.

Everything else is thin. The reward is never specified (no terms for forward progress, collision penalty, time cost, or any shaping), so the numerical difference could come from environment quirks rather than the change in action selection. No network sizes, optimizer settings, replay buffer details, or epsilon schedule appear. There are no training curves, no multiple seeds, and no variance numbers. The comparison to a vanilla neural network is also left vague.

This is a clean student-style project that demonstrates basic DQN on a new 2D track. It could serve as a starting point for someone who wants to code their own Pygame driving agent, but it adds nothing new to the methods or the literature on DQN variants. I would not bring it to a reading group, would not cite it, and would not send it to referees. An editor should desk-reject.

Referee Report

5 major / 3 minor

Summary. The paper describes the implementation of a standard Deep Q-Network (DQN) and a modified DQN incorporating priority-based action selection for controlling a 2D self-driving car in a custom Pygame environment. The car is equipped with seven distance sensors spaced 20 degrees apart. After training for 1000 episodes, the modified DQN is reported to achieve an average reward of approximately 40, stated to be 60% higher than the original DQN and 50% higher than a vanilla neural network.

Significance. If the performance improvements could be verified with complete and reproducible details, the work would provide a concrete example of a simple modification to DQN yielding gains in a simulated 2D driving task. The current lack of methodological transparency, however, prevents any assessment of whether the result holds or generalizes, limiting its potential contribution to reinforcement learning for autonomous agents.

major comments (5)

[Abstract and Results] The reward function is never defined (neither in the abstract, environment section, nor results). This is load-bearing for the central claim because the reported average reward of ~40 and the 60%/50% improvements cannot be interpreted, reproduced, or attributed to the priority mechanism without knowing the components for progress, collisions, time, or any shaping terms.
[Methodology] No neural network architecture is specified (layers, units per layer, activations, or input/output dimensions). This prevents evaluation of whether the modified DQN differs meaningfully from the baseline beyond the action-selection rule.
[Experimental Setup] Hyperparameters (learning rate, discount factor, replay buffer size, epsilon schedule, batch size, target network update frequency) and training procedure details are absent. Without these, the 1000-episode result cannot be assessed for stability or compared fairly across the three methods.
[Results] The performance numbers lack any report of independent runs, standard deviations, error bars, or statistical tests. The 60% improvement claim therefore cannot be distinguished from run-to-run variance or selective reporting.
[Environment Description] The custom Pygame environment (track geometry, exact sensor ray-casting implementation, action space, collision model, episode termination conditions) is described only at a high level. This undermines the claim that the testbed allows unbiased comparison of the DQN variants.
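
The fourth comment asks for run-to-run statistics. A minimal sketch of that kind of reporting, using only the standard library, is shown below. The per-seed numbers are synthetic placeholders chosen for easy arithmetic, not results from the paper, and `summarize` is a hypothetical helper.

```python
import statistics


def summarize(final_rewards):
    """Mean and sample standard deviation of final average reward across seeds."""
    mean = statistics.mean(final_rewards)
    std = statistics.stdev(final_rewards) if len(final_rewards) > 1 else 0.0
    return mean, std


# Synthetic placeholder values for illustration only; NOT measurements from the paper.
runs = {
    "agent_a": [10.0, 12.0, 14.0],
    "agent_b": [9.0, 10.0, 11.0],
}

for name, rewards in runs.items():
    mean, std = summarize(rewards)
    print(f"{name}: {mean:.1f} +/- {std:.1f} over {len(rewards)} seeds")
```

Reporting each method as mean plus or minus standard deviation over several seeds is what would let a reader judge whether a gap between two agents exceeds ordinary run-to-run variation.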

minor comments (3)

Include learning curves or per-episode reward plots to document convergence behavior rather than reporting only the final average.
Define the precise priority-based action selection rule (e.g., how priorities are computed and how they alter the standard argmax or epsilon-greedy policy).
Add a table listing all hyperparameters and environment constants to support replication.
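
The first minor comment can be addressed with very little code. As a sketch, a trailing moving average over per-episode rewards gives the smoothed curve that is typically plotted; the window size and the toy reward list are assumptions.

```python
from collections import deque


def moving_average(rewards, window=100):
    """Trailing mean over the last `window` episodes (fewer at the start)."""
    recent = deque(maxlen=window)
    running_sum = 0.0
    smoothed = []
    for r in rewards:
        if len(recent) == window:
            running_sum -= recent[0]  # value about to be evicted by append
        recent.append(r)
        running_sum += r
        smoothed.append(running_sum / len(recent))
    return smoothed


# Toy per-episode rewards, for illustration only.
episode_rewards = [0.0, 2.0, 4.0, 6.0, 8.0]
print(moving_average(episode_rewards, window=2))
```

Plotting the smoothed series for each agent on the same axes would document convergence behavior rather than only the final average.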

Simulated Authors' Rebuttal

5 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting areas where methodological transparency can be improved. We agree that several details were omitted from the initial manuscript and will incorporate them in a major revision to enhance reproducibility and allow proper evaluation of the claims.

Referee: [Abstract and Results] The reward function is never defined (neither in the abstract, environment section, nor results). This is load-bearing for the central claim because the reported average reward of ~40 and the 60%/50% improvements cannot be interpreted, reproduced, or attributed to the priority mechanism without knowing the components for progress, collisions, time, or any shaping terms.

Authors: We agree the reward function must be explicitly defined to interpret the reported rewards and improvements. The original manuscript omitted this. In revision we will add a complete description of the reward function, including all components for track distance, collision penalties, time costs, and progress shaping terms. revision: yes
Referee: [Methodology] No neural network architecture is specified (layers, units per layer, activations, or input/output dimensions). This prevents evaluation of whether the modified DQN differs meaningfully from the baseline beyond the action-selection rule.

Authors: The neural network architecture was not detailed. We will add a full specification in the revised manuscript, covering layer counts, units per layer, activation functions, input dimensions derived from the seven sensors, and output dimensions matching the action space. revision: yes
Referee: [Experimental Setup] Hyperparameters (learning rate, discount factor, replay buffer size, epsilon schedule, batch size, target network update frequency) and training procedure details are absent. Without these, the 1000-episode result cannot be assessed for stability or compared fairly across the three methods.

Authors: Hyperparameters and training details were omitted. The revision will include a comprehensive list of all hyperparameters together with the training procedure to support reproducibility and fair comparison. revision: yes
Referee: [Results] The performance numbers lack any report of independent runs, standard deviations, error bars, or statistical tests. The 60% improvement claim therefore cannot be distinguished from run-to-run variance or selective reporting.

Authors: Results were reported from single runs without variance measures. We will rerun experiments with multiple independent seeds, report means and standard deviations, and add error bars to strengthen the improvement claims. revision: yes
Referee: [Environment Description] The custom Pygame environment (track geometry, exact sensor ray-casting implementation, action space, collision model, episode termination conditions) is described only at a high level. This undermines the claim that the testbed allows unbiased comparison of the DQN variants.

Authors: The environment section will be expanded with precise details on track geometry, ray-casting sensor implementation, discrete action space, collision detection model, and episode termination rules to enable unbiased comparisons. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical implementation results with no derivations or fitted equations

The paper is an implementation and evaluation study reporting average rewards from training DQN variants over 1000 episodes in a custom Pygame environment. No mathematical derivations, equations, or parameter-fitting steps are present that could reduce any claim to its inputs by construction. The central performance numbers are direct training outcomes, not predictions derived from fitted quantities or self-referential definitions. Self-citations are absent from the provided text, and the work is self-contained as an empirical report.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim depends on the definition of the custom track, the reward function, and the precise implementation of the priority mechanism, none of which are specified or justified in the provided abstract.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages

[1] 2020. CarRacing DQN: A DQN Agent to play CarRacing 2d using TensorFlow and Keras. (2020). https://github.com/andywu0913/OpenAI-GYM-CarRacing-DQN

[2] 2023a. Documentation - SUMO Documentation. https://sumo.dlr.de/docs/index.html. (2023). (Accessed on 05/05/2023)

[3] 2023. duarouter - SUMO Documentation. https://sumo.dlr.de/docs/duarouter.html. (2023). (Accessed on 05/05/2023)

[4] 2023. netconvert - SUMO Documentation. https://sumo.dlr.de/docs/netconvert.html. (2023). (Accessed on 05/05/2023)

[5] 2023. OpenStreetMap Foundation. https://wiki.osmfoundation.org/wiki/Main_Page. (2023). (Accessed on 05/05/2023)

[6] 2023. pygame · PyPI. https://pypi.org/project/pygame/. (2023). (Accessed on 05/05/2023)

[7] 2023b. sumo - SUMO Documentation. https://sumo.dlr.de/docs/sumo.html. (2023). (Accessed on 05/05/2023)

[8] 2023c. sumo-gui - SUMO Documentation. https://sumo.dlr.de/docs/sumo-gui.html. (2023). (Accessed on 05/05/2023)

[9] 2023. TraCI - SUMO Documentation. https://sumo.dlr.de/docs/TraCI.html. (2023). (Accessed on 05/05/2023)

[10] 2023. Trip - SUMO Documentation. https://sumo.dlr.de/docs/Tools/Trip.html. (2023). (Accessed on 05/05/2023)

[11] Max Peter Ronecker and Yuan Zhu. 2019. Deep Q-Network Based Decision Making for Autonomous Driving. In 2019 3rd International Conference on Robotics and Automation Sciences (ICRAS). 154–160. DOI: http://dx.doi.org/10.1109/ICRAS.2019.8808950

[12] Ahmad EL Sallab, Mohammed Abdou, Etienne Perot, and Senthil Yogamani. 2017. Deep Reinforcement Learning framework for Autonomous Driving. Electronic Imaging 29, 19 (jan 2017), 70–76. DOI: http://dx.doi.org/10.2352/issn.2470-1173.2017.19.avm-023

[13] Lei Tai and Ming Liu. 2016. A robot exploration strategy based on Q-learning network. In 2016 IEEE International Conference on Real-time Computing and Robotics (RCAR). 57–62. DOI: http://dx.doi.org/10.1109/RCAR.2016.7784001

[14] Li Ling Koh Yang Thee Quek. 2021. Deep Q-network implementation for simulated autonomous vehicle control - Quek - 2021 - IET Intelligent Transport Systems - Wiley Online Library. https://ietresearch.onlinelibrary.wiley.com/doi/10.1049/itr2.12067. (2021). (Accessed on 05/05/2023)
