A Hierarchical Architecture for Sequential Decision-Making in Autonomous Driving using Deep Reinforcement Learning

Gabriel Hugh Elkaim; Majid Moghadam

arxiv: 1906.08464 · v1 · pith:URV7QBLQnew · submitted 2019-06-20 · 💻 cs.RO · cs.AI · cs.LG · cs.SY · eess.SY

Pith reviewed 2026-05-25 19:57 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.LG · cs.SY · eess.SY

keywords autonomous driving · deep reinforcement learning · hierarchical architecture · occupancy grid · tactical decision making · highway driving · sequential decision-making · multi-layer control

The pith

A hierarchical architecture allows deep reinforcement learning to make reliable high-level driving decisions by processing occupancy grids and delegating execution to lower layers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a multi-modal architecture for tactical decision making in autonomous driving. It trains a deep reinforcement learning agent that takes occupancy grids of the vehicle's surroundings and outputs high-level commands such as lane changes. These commands are then sent to lower-level controllers for execution. By dividing the problem into separate layers, the approach aims to achieve better reliability than end-to-end systems, making it more suitable for real self-driving cars. A sympathetic reader would care because this separation could address the complexity and uncertainty challenges in driving environments.

Core claim

The central claim is that dividing the autonomous driving problem into a multi-layer control architecture makes it possible to apply AI to each layer separately and to achieve an admissible reliability score. Specifically, the DRL agent fed with occupancy grids yields consistent performance in stochastic highway driving scenarios, and lower-level controllers can reliably execute the resulting high-level commands. The outcome, according to the paper, is a system that is more reliable than end-to-end approaches and can be implemented in actual self-driving cars.

What carries the argument

The multi-layer control architecture, where a deep reinforcement learning agent processes occupancy grids to generate high-level sequential commands like lane changes for lower-level controllers.
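The layered interface described above can be sketched in a few lines. This is only an illustration under stated assumptions: `Command`, `tactical_policy`, and `LaneController` are hypothetical names, the grid layout (rows are cells ahead of the ego vehicle, columns are lanes, 1 means occupied) is assumed, and the hand-written rule merely stands in for the trained DRL agent, which the page does not specify.

```python
from enum import Enum
from typing import List

# Assumed layout: rows = cells ahead of ego, columns = lanes, 1 = occupied.
Grid = List[List[int]]


class Command(Enum):
    """Hypothetical high-level command set for the tactical layer."""
    KEEP_LANE = 0
    CHANGE_LEFT = 1
    CHANGE_RIGHT = 2


def tactical_policy(grid: Grid, ego_lane: int) -> Command:
    """Stand-in for the trained DRL agent: a hand-written rule, not the paper's policy."""
    n_lanes = len(grid[0])

    def lane_free(lane: int) -> bool:
        return all(row[lane] == 0 for row in grid)

    if lane_free(ego_lane):
        return Command.KEEP_LANE
    if ego_lane > 0 and lane_free(ego_lane - 1):
        return Command.CHANGE_LEFT
    if ego_lane < n_lanes - 1 and lane_free(ego_lane + 1):
        return Command.CHANGE_RIGHT
    return Command.KEEP_LANE


class LaneController:
    """Hypothetical lower layer: maps a command to a target lane within road bounds."""

    def __init__(self, n_lanes: int) -> None:
        self.n_lanes = n_lanes

    def target_lane(self, ego_lane: int, command: Command) -> int:
        delta = {Command.KEEP_LANE: 0, Command.CHANGE_LEFT: -1, Command.CHANGE_RIGHT: 1}[command]
        # Clamp so an infeasible command never leaves the road.
        return min(max(ego_lane + delta, 0), self.n_lanes - 1)


grid = [[0, 1, 0],
        [0, 1, 0]]                      # ego lane (index 1) is blocked ahead
command = tactical_policy(grid, ego_lane=1)
print(command, LaneController(3).target_lane(1, command))
```

The point of the split is that the upper layer only ever emits one of a few discrete commands, so the lower layer can be checked against that small command set independently of how the policy was trained.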

If this is right

The DRL agent achieves consistent performance in stochastic highway driving scenarios.
High-level commands are sent to and executed by lower-level controllers.
The system achieves an admissible reliability score.
It results in a more reliable system compared to end-to-end approaches.
The architecture can be implemented in actual self-driving cars.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This layered approach might integrate more easily with existing vehicle control systems that already handle low-level tasks.
It could enable testing and validation of the decision-making layer independently from perception and control modules.
Extending the occupancy grid input to include more environmental details might further improve decision consistency in complex scenarios.
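To make the last extension concrete, here is a minimal sketch of an ego-centred occupancy grid with one extra channel. Everything in it is an assumption made for illustration: the function name `build_grid`, the vehicle tuple layout, the cell length, and the choice of relative speed as the added detail are not taken from the paper.

```python
from typing import List, Tuple

# Hypothetical vehicle record: (lane index, longitudinal offset from ego in m, relative speed in m/s)
Vehicle = Tuple[int, float, float]


def build_grid(vehicles: List[Vehicle], n_lanes: int = 3, n_rows: int = 5,
               cell_len: float = 10.0) -> Tuple[List[List[int]], List[List[float]]]:
    """Return (occupancy, rel_speed) grids; row 0 is the cell directly ahead of ego."""
    occupancy = [[0] * n_lanes for _ in range(n_rows)]
    rel_speed = [[0.0] * n_lanes for _ in range(n_rows)]
    for lane, offset, speed in vehicles:
        row = int(offset // cell_len)
        # Vehicles behind ego or beyond the grid horizon are ignored in this sketch.
        if 0 <= lane < n_lanes and 0 <= row < n_rows:
            occupancy[row][lane] = 1
            rel_speed[row][lane] = speed
    return occupancy, rel_speed


occupancy, rel_speed = build_grid([(1, 12.0, -3.0), (0, 55.0, 1.0), (2, -4.0, 0.0)])
for row in occupancy:
    print(row)
```

Keeping the extra detail in a second grid of the same shape means the decision layer's input stays a fixed-size array, whatever the number of surrounding vehicles.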

Load-bearing premise

The deep reinforcement learning agent produces consistent performance when given occupancy grids of the surroundings in stochastic highway scenarios, and the high-level commands it generates can be reliably executed by the lower-level controllers.

What would settle it

Observing whether the trained DRL agent maintains consistent performance across multiple stochastic highway driving simulations using occupancy grid inputs, or testing if lower-level controllers can execute the generated lane change commands without failure in real or simulated conditions.
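One way to run the first of these checks is to score a fixed policy over many random seeds and report the spread. The sketch below assumes a toy stochastic highway that is not the DeepCars environment, and the names `run_episode`, `consistency`, `dodge`, and `stay` are hypothetical; a trained agent would be plugged in where the hand-written policies are.

```python
import random
import statistics
from typing import Callable, Iterable, Tuple

# Hypothetical policy signature: (ego_lane, obstacle_lane) -> lane delta in {-1, 0, +1}
Policy = Callable[[int, int], int]


def run_episode(policy: Policy, seed: int, n_lanes: int = 3, steps: int = 200) -> float:
    """Toy stochastic highway: returns the fraction of collision-free steps."""
    rng = random.Random(seed)
    ego, safe = n_lanes // 2, 0
    for _ in range(steps):
        obstacle = rng.randrange(n_lanes)          # random lane is blocked each step
        ego = min(max(ego + policy(ego, obstacle), 0), n_lanes - 1)
        safe += ego != obstacle
    return safe / steps


def consistency(policy: Policy, seeds: Iterable[int]) -> Tuple[float, float]:
    """Mean and population standard deviation of the score across seeds."""
    scores = [run_episode(policy, s) for s in seeds]
    return statistics.mean(scores), statistics.pstdev(scores)


def dodge(ego: int, obstacle: int) -> int:
    """Hand-written baseline: step away only when the obstacle is in ego's lane."""
    if ego != obstacle:
        return 0
    return 1 if ego == 0 else -1


def stay(ego: int, obstacle: int) -> int:
    """Baseline that never changes lane."""
    return 0


seeds = range(10)
print("dodge:", consistency(dodge, seeds))
print("stay: ", consistency(stay, seeds))
```

A low standard deviation across seeds is the kind of evidence the "consistent performance" claim would need; the mean alone does not show it.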

Figures

Figures reproduced from arXiv: 1906.08464 by Gabriel Hugh Elkaim, Majid Moghadam.

**Figure 1.** A sketch of our hierarchical approach for the autonomous driving problem for both fully and partially autonomous driving systems (Alizadeh et al., 2019).

Excerpt from the paper: Before AI, control and orientation of ground vehicles were tackled using feedback control techniques (Falcone et al., 2007; Broggi et al., 1999; Moghadam & Caliskan, 2015) that attempt to stabilize the vehicle using the information collected from sensory…

**Figure 2.** Hierarchical architecture of the general ADAS systems vs. end-to-end approaches.

Excerpt from the paper: In this study, we address the problem of high-level decision making for an autonomous car using classical reinforcement learning technique known as Q-learning. We implement the ε-greedy algorithm to the problem defined in DeepCars simulation environment which is also designed and implemented by the authors. After commenting on…

**Figure 4.** Tabular Q-learning algorithm (Busoniu et al., 2017), where x and u indicate the observed state and input respectively.

Excerpt from the paper: The main objective is to train the agent to avoid making collisions with other vehicles in the environment. Thus, we define a simple reward function

$$\rho(s, a, s') = \begin{cases} +1, & s' \neq s_T \\ -1, & s' = s_T \end{cases} \qquad (6)$$

where $s_T$ indicates the terminal state in which the agent makes a collision. For the hyper-parameter t…

**Figure 5.** A crop of sparse Q-table in tabular RL (# of lanes: 3). Red rectangle indicates the greedy optimal action where s = 1 6 3 0.

Excerpt from the paper: 3.2. Deep Reinforcement Learning on DeepCars3. As discussed in previous section, the tabular Q-learning approach lacks the generalization property. In addition, the curse of dimensionality is another problem while using Q-learning techniques. In our case, the number of Markov states a…

**Figure 6.** DQN performance for three network architectures.

**Figure 7.** DQN vs. DDQN with different network architectures. Here 16x16 indicates two dense layers with 16 neurons at each.

Excerpt from the paper: …reward peak value comparing with others, while shallow DDQN converged much faster than its rivals. DQN has shown a mediocre performance between these agents. Following the real-time validation policy in algorithm 1, we recorded the best model in all training phases and evaluated their performan…
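The reward quoted under Figure 4 and the ε-greedy Q-learning mentioned under Figure 2 can be combined into a short sketch. It uses the standard tabular Q-learning update with (s, a) notation rather than the (x, u) notation of the figure; the action encoding, the step size `alpha`, and the discount `gamma` are placeholder assumptions, since the excerpt is cut off before the paper's hyper-parameters.

```python
import random
from collections import defaultdict

ACTIONS = (0, 1, 2)  # hypothetical encoding: keep lane, move left, move right


def reward(next_is_terminal: bool) -> float:
    """Reward from the Figure 4 excerpt: +1 unless the next state is the collision state s_T."""
    return -1.0 if next_is_terminal else 1.0


def epsilon_greedy(q, state, epsilon: float, rng: random.Random) -> int:
    """Explore with probability epsilon, otherwise pick the greedy action."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q[(state, a)])


def q_update(q, s, a, r: float, s_next, terminal: bool,
             alpha: float = 0.1, gamma: float = 0.9) -> None:
    """Standard tabular update: Q(s,a) += alpha * (r + gamma * max_b Q(s',b) - Q(s,a))."""
    # No bootstrapping from a terminal (collision) state.
    best_next = 0.0 if terminal else max(q[(s_next, b)] for b in ACTIONS)
    q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])


q = defaultdict(float)                      # sparse Q-table, zero-initialised
q_update(q, "s0", 1, reward(False), "s1", terminal=False)
q_update(q, "s0", 2, reward(True), "sT", terminal=True)
print(epsilon_greedy(q, "s0", epsilon=0.0, rng=random.Random(0)))
```

Storing the table as a dictionary keyed by (state, action) mirrors the sparse Q-table shown in Figure 5: only visited pairs take up space, which also makes the lack of generalization to unvisited states easy to see.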

Original abstract

Tactical decision making is a critical feature for advanced driving systems, that incorporates several challenges such as complexity of the uncertain environment and reliability of the autonomous system. In this work, we develop a multi-modal architecture that includes the environmental modeling of ego surrounding and train a deep reinforcement learning (DRL) agent that yields consistent performance in stochastic highway driving scenarios. To this end, we feed the occupancy grid of the ego surrounding into the DRL agent and obtain the high-level sequential commands (i.e. lane change) to send them to lower-level controllers. We will show that dividing the autonomous driving problem into a multi-layer control architecture enables us to leverage the AI power to solve each layer separately and achieve an admissible reliability score. Comparing with end-to-end approaches, this architecture enables us to end up with a more reliable system which can be implemented in actual self-driving cars.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The paper sketches a hierarchical DRL setup for highway driving but supplies no experiments, metrics, or results to support its reliability claims over end-to-end methods.

The core of this paper is a multi-layer architecture: occupancy grids feed a DRL agent that outputs high-level commands such as lane changes, which then go to lower-level controllers. The stated goal is to get better reliability than end-to-end DRL by handling each layer separately, with the claim that the result can reach an admissible reliability score and run on actual cars. That split is a reasonable engineering intuition for modularity in driving systems, and occupancy grids were already a standard input representation by 2019.

The paper does not appear to introduce new algorithms or theorems; it applies established DRL techniques to tactical highway decisions. No new empirical result is shown either. The main limitation is the complete absence of any training details, reward function, network description, simulation setup, quantitative metrics, or comparisons. The strongest claim, that the architecture is more reliable and ready for real vehicles, therefore rests on an untested assumption about consistent DRL performance and safe handoff to lower controllers under uncertainty. The stress-test note correctly flags this gap; nothing in the text closes it. Without closed-loop results or safety data, the reliability advantage cannot be assessed.

This work would interest readers already exploring modular DRL for autonomous vehicles who want a high-level architecture sketch, but it offers little concrete value to anyone needing methods or evidence. It does not show enough substance or grounding to merit a serious referee at this stage.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes a hierarchical multi-layer architecture for autonomous driving in which a deep reinforcement learning (DRL) agent receives occupancy-grid representations of the ego vehicle's surroundings and outputs high-level tactical commands (e.g., lane-change decisions) that are passed to unspecified lower-level controllers. The central claim is that this separation yields consistent performance in stochastic highway scenarios, an 'admissible reliability score,' and a system that is more reliable than end-to-end approaches and therefore suitable for deployment on actual self-driving cars.

Significance. If the reliability and consistency claims were quantitatively demonstrated with closed-loop results under sensor/actuator noise and compared against end-to-end baselines, the modular approach could meaningfully advance practical DRL deployment in autonomous vehicles by allowing independent verification and tuning of each layer.

major comments (3)

[Abstract] Abstract: the assertion that the architecture 'achieves an admissible reliability score' and is 'more reliable' than end-to-end methods for 'actual self-driving cars' is unsupported; the text supplies neither quantitative reliability metrics, success rates, nor any closed-loop simulation or real-vehicle results.
[Abstract] Abstract / architecture description: the claim that 'the DRL agent yields consistent performance in stochastic highway driving scenarios' rests on an untested interface assumption between high-level commands and lower-level controllers; no details of controller execution under uncertainty, reward function, network architecture, or evaluation protocol are provided.
[Abstract] Abstract: the comparison to end-to-end approaches is stated without any baseline experiments, safety metrics, or success-rate tables, rendering the 'more reliable' conclusion unevaluable.

minor comments (1)

[Abstract] Clarify whether 'multi-modal' refers to additional sensor inputs beyond occupancy grids, as the description mentions only grids.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below, providing clarifications from the full paper where applicable and indicating revisions to strengthen the presentation of results and claims.

Referee: [Abstract] Abstract: the assertion that the architecture 'achieves an admissible reliability score' and is 'more reliable' than end-to-end methods for 'actual self-driving cars' is unsupported; the text supplies neither quantitative reliability metrics, success rates, nor any closed-loop simulation or real-vehicle results.

Authors: We acknowledge that the abstract phrasing is overly strong and not fully supported by quantitative metrics in a self-contained manner. The manuscript body presents simulation results on stochastic highway scenarios using occupancy grids, but does not include real-vehicle tests or explicit 'admissible reliability score' calculations. We will revise the abstract to remove deployment-oriented claims and ensure all reliability assertions are tied directly to the reported simulation metrics. revision: yes
Referee: [Abstract] Abstract / architecture description: the claim that 'the DRL agent yields consistent performance in stochastic highway driving scenarios' rests on an untested interface assumption between high-level commands and lower-level controllers; no details of controller execution under uncertainty, reward function, network architecture, or evaluation protocol are provided.

Authors: The full manuscript details the DRL network architecture, reward function, and evaluation protocol in the methods and experiments sections. The interface to lower-level controllers is described at a high level with the assumption that they can execute the tactical commands. We agree more explicit discussion of execution under uncertainty is warranted and will add this in a revision. revision: partial
Referee: [Abstract] Abstract: the comparison to end-to-end approaches is stated without any baseline experiments, safety metrics, or success-rate tables, rendering the 'more reliable' conclusion unevaluable.

Authors: The comparison is presented conceptually, highlighting the benefit of modular verification. No direct baseline experiments against end-to-end methods are included. We will revise the abstract to qualify this statement and add discussion of related end-to-end metrics from the literature for context. revision: partial

Circularity Check

0 steps flagged

No circularity: architecture proposal and DRL training claims rest on empirical validation, not definitional reduction or self-citation chains.

The paper presents a hierarchical DRL architecture for high-level tactical decisions (occupancy-grid input to lane-change commands) and asserts improved reliability over end-to-end methods for real-car deployment. No equations, derivations, fitted parameters renamed as predictions, or uniqueness theorems appear in the provided text. The central claim is an empirical assertion about system reliability that would require simulation or hardware results; it does not reduce by construction to its own inputs or to self-citations. No load-bearing self-citation, ansatz smuggling, or renaming of known results is present. This is the normal case of a methods/architecture paper whose validity hinges on external validation rather than internal definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no mathematical model, so no free parameters, axioms, or invented entities can be identified.

pith-pipeline@v0.9.0 · 5691 in / 972 out tokens · 22593 ms · 2026-05-25T19:57:58.519112+00:00 · methodology

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · 2 internal anchors

[2] Alizadeh, A., Moghadam, M., Bicer, Y., Ure, N. K., Yavas, U., and Kurtulus, C. Tactical lane changing with deep reinforcement learning in dynamic and uncertain traffic scenarios. In 22nd Intelligent Transportation Systems Conference (ITSC2019-submitted), 2019.

[3] Bicer, Y., Moghadam, M., Sahin, C., Eroglu, B., and Üre, N. K. Vision-based UAV guidance for autonomous landing with deep neural networks. In AIAA Scitech 2019 Forum, pp. 0140, 2019.

[4] Bojarski, M., Del Testa, D., Dworakowski, D., Firner, B., Flepp, B., Goyal, P., Jackel, L. D., Monfort, M., Muller, U., Zhang, J., et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.

[5] Broggi, A., Bertozzi, M., Fascioli, A., Bianco, C. G. L., and Piazzi, A. The ARGO autonomous vehicle’s vision and control systems. International Journal of Intelligent Control and Systems, 3(4):409–441, 1999.

[6] Busoniu, L., Babuska, R., De Schutter, B., and Ernst, D. Reinforcement learning and dynamic programming using function approximators. CRC Press, 2017.

[7] Falcone, P., Borrelli, F., Asgari, J., Tseng, H. E., and Hrovat, D. Predictive active steering control for autonomous vehicle systems. IEEE Transactions on Control Systems Technology, 15(3):566–580, 2007.

[8] Koutník, J., Cuccu, G., Schmidhuber, J., and Gomez, F. Evolving large-scale neural networks for vision-based TORCS. 2013.

[9] Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.

[10] Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

[11] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

[12] Moghadam, M. and Caliskan, F. Actuator and sensor fault detection and diagnosis of quadrotor based on two-stage Kalman filter. In 2015 5th Australian Control Conference (AUCC), pp. 182–187. IEEE, 2015.

[13] Moghadam, M., Ure, N. K., and Inalhan, G. Autonomous execution of aircraft supermaneuvers with switching nonlinear backstepping control. In 2018 AIAA Guidance, Navigation, and Control Conference, pp. 1594, 2018.

[14] Perot, E., Jaritz, M., Toromanoff, M., and De Charette, R. End-to-end driving in a realistic racing game with deep reinforcement learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 3–4, 2017.

[15] Slotine, J.-J. E., Li, W., et al. Applied nonlinear control, volume 199. Prentice Hall, Englewood Cliffs, NJ, 1991.

[16] Specht, D. F. A general regression neural network. IEEE Transactions on Neural Networks, 2(6):568–576, 1991.

[17] Van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double Q-learning. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
