Cooperation-Aware Reinforcement Learning for Merging in Dense Traffic

Alireza Nakhaei; Kikuo Fujimura; Maxime Bouton; Mykel J. Kochenderfer

arXiv: 1906.11021 · v1 · pith:AFXAX2BNnew · submitted 2019-06-26 · 💻 cs.RO · cs.AI · cs.LG

Pith reviewed 2026-05-25 15:44 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.LG

keywords reinforcement learning · autonomous driving · dense traffic merging · cooperation modeling · belief tracking · deadlock avoidance · interactive decision making

The pith

Reinforcement learning that tracks beliefs over other drivers' cooperation levels enables merging in dense traffic with fewer deadlocks than online planning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that an autonomous vehicle can learn effective merging behavior in heavy traffic by using reinforcement learning augmented with a belief over how cooperative nearby drivers are. Standard approaches that ignore interaction often result in the vehicle freezing, but reasoning about varying cooperation allows the agent to anticipate changes in other drivers' behavior. Maintaining and updating this belief distribution during the maneuver leads to successful navigation with reduced deadlock rates in simulation compared to planning baselines.

Core claim

The reinforcement learning agent that maintains a belief over the level of cooperation of other drivers successfully learns how to navigate a dense merging scenario with fewer deadlocks than online planning methods.

What carries the argument

Belief distribution over discrete cooperation levels of other drivers, maintained and used to condition the reinforcement learning policy for interaction.
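
To make the belief mechanism concrete, here is a minimal sketch of one recursive Bayesian update over discrete cooperation levels. It is an illustration under stated assumptions, not the authors' implementation: the two level names, the Gaussian observation model on acceleration, the noise value, and every function name are hypothetical.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution N(mean, std^2) evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def update_belief(belief, observed_accel, predicted_accel, noise_std=0.5):
    """One Bayes-filter step over discrete cooperation levels.

    belief:          dict mapping level name -> prior probability
    observed_accel:  acceleration actually observed for the other driver
    predicted_accel: dict mapping level name -> acceleration that a driver
                     of that level would be expected to apply (assumed to
                     come from some driver model; hypothetical here)
    noise_std:       assumed observation noise (hypothetical value)
    """
    # Posterior is proportional to prior times observation likelihood.
    unnormalized = {
        level: prior * gaussian_pdf(observed_accel, predicted_accel[level], noise_std)
        for level, prior in belief.items()
    }
    total = sum(unnormalized.values())
    if total == 0.0:
        # Observation is numerically impossible under every level: keep the prior.
        return dict(belief)
    return {level: weight / total for level, weight in unnormalized.items()}


# Usage: a driver who brakes about as hard as a cooperative driver would.
belief = {"cooperative": 0.5, "non_cooperative": 0.5}
predicted = {"cooperative": -1.0, "non_cooperative": 0.5}
belief = update_belief(belief, observed_accel=-0.9, predicted_accel=predicted)
```

After this single observation the probability mass shifts toward the cooperative level. In a belief-conditioned agent, the resulting probabilities would be passed to the policy alongside the usual state features.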

Load-bearing premise

Other drivers exhibit discrete levels of cooperation that can be inferred from their observed behavior and tracked via a belief distribution, and this modeling choice is the main driver of reduced deadlocks in the simulated environment.

What would settle it

A controlled simulation run that removes the belief component or makes cooperation levels continuous and unobservable, then measures whether deadlock rates remain lower than online planning methods.
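
One way such a comparison could be reported is with a confidence interval on each variant's deadlock rate. The sketch below computes a Wilson score interval from episode counts. The function names are hypothetical, and the counts in the usage lines are placeholders for illustration only, not results from the paper.

```python
import math


def wilson_interval(deadlocks, episodes, z=1.96):
    """Wilson score interval for a deadlock rate (z=1.96 gives roughly 95%)."""
    if episodes <= 0:
        raise ValueError("episodes must be positive")
    p = deadlocks / episodes
    denom = 1.0 + z * z / episodes
    center = (p + z * z / (2.0 * episodes)) / denom
    half_width = (
        z * math.sqrt(p * (1.0 - p) / episodes + z * z / (4.0 * episodes * episodes))
    ) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


def intervals_overlap(a, b):
    """True if two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Placeholder counts, NOT taken from the paper.
with_belief = wilson_interval(deadlocks=10, episodes=100)
without_belief = wilson_interval(deadlocks=40, episodes=100)
clearly_different = not intervals_overlap(with_belief, without_belief)
```

Non-overlapping intervals are a conservative sign that two rates differ. A formal two-proportion test would be the stricter check.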

Figures

Figures reproduced from arXiv: 1906.11021 by Alireza Nakhaei, Kikuo Fujimura, Maxime Bouton, Mykel J. Kochenderfer.

**Figure 2.** Illustration of the vehicles observed by the ego vehicle. The …

**Figure 3.** Illustration of the C-IDM model where a cooperative vehicle (in …

**Figure 4.** Example of a trajectory when executing the reinforcement learning …

**Figure 5.** Performance of the different policies on a dense traffic scenario.

Original abstract

Decision making in dense traffic can be challenging for autonomous vehicles. An autonomous system only relying on predefined road priorities and considering other drivers as moving objects will cause the vehicle to freeze and fail the maneuver. Human drivers leverage the cooperation of other drivers to avoid such deadlock situations and convince others to change their behavior. Decision making algorithms must reason about the interaction with other drivers and anticipate a broad range of driver behaviors. In this work, we present a reinforcement learning approach to learn how to interact with drivers with different cooperation levels. We enhance the performance of traditional reinforcement learning algorithms by maintaining a belief over the level of cooperation of other drivers. We show that our agent successfully learns how to navigate a dense merging scenario with fewer deadlocks than with online planning methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

Belief-augmented RL for merging reduces deadlocks in simulation but lacks ablations and metrics to confirm the contribution.

The key takeaway is that this paper uses reinforcement learning augmented with a belief over other drivers' cooperation levels to handle dense merging, and reports fewer deadlocks than planning baselines. The integration is presented as the main novelty. The paper does a good job of framing how non-interactive assumptions lead to deadlock and of showing how modeling varying cooperation can help in simulation. The method description focuses on maintaining that belief to inform actions.

The soft spots are clear from the abstract: it provides no quantitative metrics, experiment details, or significance tests, so the size of the improvement cannot be assessed. The stress-test concern is on point: no ablation removes the belief component to isolate its effect, which leaves open whether RL alone suffices. The discrete cooperation model is central but untested against more flexible alternatives. The citation pattern looks standard, drawing from the RL and traffic planning literature.

This work is for researchers in autonomous driving and multi-agent RL who focus on human-like interactions. A reader looking for practical applications in traffic scenarios would get value from the setup, particularly the way it combines belief maintenance with RL. The paper shows clear thinking on the problem and deserves a serious referee to evaluate the full experiments and controls. I recommend sending it for peer review.

Referee Report

2 major / 0 minor

Summary. The manuscript presents a reinforcement learning approach for autonomous vehicle decision-making during merging in dense traffic. It augments standard RL by maintaining a belief distribution over discrete cooperation levels of other drivers, with the central claim that this enables successful navigation with fewer deadlocks than online planning methods.

Significance. If the experimental results hold with proper validation, the work could contribute to more robust interactive behaviors for AVs in scenarios where treating other vehicles as non-cooperative leads to deadlocks, by explicitly reasoning about driver cooperation.

major comments (2)

[Abstract] The performance improvement is asserted without any quantitative metrics, experiment details, baselines, statistical significance, or simulation parameters, so the central claim cannot be evaluated from the provided text.
[Method] The claim that belief tracking over discrete cooperation levels drives the deadlock reduction is load-bearing, yet no ablation is described that removes the belief update (e.g., replacing it with a fixed prior) while holding the RL policy and reward fixed to isolate its contribution versus standard RL.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve clarity and strengthen the claims.

Referee: [Abstract] The performance improvement is asserted without any quantitative metrics, experiment details, baselines, statistical significance, or simulation parameters, so the central claim cannot be evaluated from the provided text.

Authors: We agree that the abstract should provide quantitative support for the central claim to allow evaluation. In the revised version, we will expand the abstract to include key metrics (e.g., deadlock rates and success percentages), brief experiment details, baseline comparisons, and notes on statistical significance and simulation parameters drawn from the results section.

Revision: yes

Referee: [Method] The claim that belief tracking over discrete cooperation levels drives the deadlock reduction is load-bearing, yet no ablation is described that removes the belief update (e.g., replacing it with a fixed prior) while holding the RL policy and reward fixed to isolate its contribution versus standard RL.

Authors: We acknowledge the value of an ablation to isolate the belief update's contribution. We will add such an experiment in the revised manuscript, comparing the full belief-maintenance agent against a variant with a fixed prior (holding policy and reward fixed) to quantify the impact on deadlock reduction.

Revision: yes
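
The fixed-prior ablation discussed in this exchange could be wired up roughly as follows. This is a hedged sketch, not the authors' code: it assumes the policy receives the belief probabilities appended to its other input features, and all names (`bayes_tracker`, `make_fixed_prior_tracker`, `policy_input`, `run_tracker`) and the likelihood values are hypothetical.

```python
from typing import Callable, Dict, List

Belief = Dict[str, float]
Tracker = Callable[[Belief, Dict[str, float]], Belief]


def bayes_tracker(belief: Belief, likelihoods: Dict[str, float]) -> Belief:
    """Full variant: reweight each level by its observation likelihood."""
    weights = {level: p * likelihoods[level] for level, p in belief.items()}
    total = sum(weights.values())
    if total <= 0.0:
        return dict(belief)  # degenerate observation: keep the current belief
    return {level: w / total for level, w in weights.items()}


def make_fixed_prior_tracker(prior: Belief) -> Tracker:
    """Ablated variant: ignores all observations and always returns the prior."""
    def tracker(_belief: Belief, _likelihoods: Dict[str, float]) -> Belief:
        return dict(prior)
    return tracker


def policy_input(ego_features: List[float], belief: Belief, levels: List[str]) -> List[float]:
    """Append belief probabilities, in a fixed level order, to the other features."""
    return list(ego_features) + [belief[level] for level in levels]


def run_tracker(tracker: Tracker, prior: Belief, likelihood_sequence) -> Belief:
    """Apply a tracker to a sequence of per-level observation likelihoods."""
    belief = dict(prior)
    for likelihoods in likelihood_sequence:
        belief = tracker(belief, likelihoods)
    return belief


# Usage with illustrative likelihoods (not data from the paper).
LEVELS = ["cooperative", "non_cooperative"]
prior = {"cooperative": 0.5, "non_cooperative": 0.5}
evidence = [{"cooperative": 0.8, "non_cooperative": 0.2}] * 3

full_belief = run_tracker(bayes_tracker, prior, evidence)
ablated_belief = run_tracker(make_fixed_prior_tracker(prior), prior, evidence)
```

Because both variants expose the same interface and feed the same `policy_input`, the policy architecture and reward can be held fixed while only the belief source changes, which is the isolation the referee asks for.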

Circularity Check

0 steps flagged

No significant circularity; method is modeling choice with external validation

The paper describes a standard RL policy augmented by a belief distribution over discrete cooperation levels of other agents. No equations, derivations, or self-citations are shown that reduce the central claim (fewer deadlocks) to a fitted parameter renamed as prediction or to a self-referential definition. The belief model is presented as an explicit design decision whose performance is evaluated in simulation against online planning baselines; the derivation chain does not collapse to its own inputs by construction. This is the common case of an honest empirical RL contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on an unstated model of driver cooperation levels and the fidelity of the simulation environment; no free parameters or invented entities are described in the abstract.

axioms (1)

Domain assumption: Drivers exhibit discrete cooperation levels that can be represented by a maintainable belief distribution.
Central to the claimed performance gain over standard methods.
