Safe, Multi-Agent, Reinforcement Learning for Autonomous Driving

Amnon Shashua; Shai Shalev-Shwartz; Shaked Shammah

arXiv: 1610.03295 · v1 · pith:NCLQY5DOnew · submitted 2016-10-11 · cs.AI · cs.LG · stat.ML

classification cs.AI · cs.LG · stat.ML

keywords driving · other · autonomous · learning · multi-agent · policy · apply · behavior

Autonomous driving is a multi-agent setting where the host vehicle must apply sophisticated negotiation skills with other road users when overtaking, giving way, merging, taking left and right turns, and while pushing ahead in unstructured urban roadways. Since there are many possible scenarios, manually tackling all possible cases will likely yield a too simplistic policy. Moreover, one must account for the unexpected behavior of other drivers and pedestrians while not being so defensive that normal traffic flow is impaired. In this paper we apply deep reinforcement learning to the problem of forming long-term driving strategies. We note that there are two major challenges that make autonomous driving different from other robotic tasks. First is the necessity of ensuring functional safety, something that machine learning has difficulty with given that performance is optimized at the level of an expectation over many instances. Second, the Markov Decision Process model often used in robotics is problematic in our case because of the unpredictable behavior of other agents in this multi-agent scenario. We make three contributions in our work. First, we show how policy gradient iterations can be used without Markovian assumptions. Second, we decompose the problem into a composition of a Policy for Desires (which is to be learned) and trajectory planning with hard constraints (which is not learned). The goal of Desires is to enable comfortable driving, while the hard constraints guarantee the safety of driving. Third, we introduce a hierarchical temporal abstraction we call an "Option Graph" with a gating mechanism that significantly reduces the effective horizon and thereby reduces the variance of the gradient estimation even further.
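
As a rough illustration of the first contribution, the sketch below estimates a policy gradient with the standard likelihood-ratio (REINFORCE-style) estimator in a toy environment whose reward depends on the full action history, which the policy's observation does not reveal. It is not the authors' implementation: the environment, the one-parameter logistic policy, and the names `rollout` and `policy_gradient_estimate` are assumptions made for this example. The point it shows is that the estimator only needs the gradient of the policy's log-probabilities along each sampled episode, so no Markov property of the environment is used.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def rollout(theta, horizon, rng):
    """Run one episode; return (total reward, summed score function)."""
    ones = 0            # number of 1-actions so far; hidden from the policy
    score = 0.0         # sum over t of d/dtheta log pi_theta(a_t | obs_t)
    total_reward = 0.0
    obs = 1.0
    for _ in range(horizon):
        p = sigmoid(theta * obs)          # probability of choosing action 1
        a = 1 if rng.random() < p else 0
        score += (a - p) * obs            # derivative of the Bernoulli log-likelihood
        ones += a
        # Non-Markovian toy reward (assumption): it depends on the parity of
        # the whole action history, which the observation does not encode.
        if a == 1 and ones % 2 == 1:
            total_reward += 1.0
        obs = 1.0 + 0.5 * a               # observation exposes only the last action
    return total_reward, score


def policy_gradient_estimate(theta, episodes=2000, horizon=5, seed=0):
    """Monte Carlo estimate of d/dtheta E[R] as the mean of R * score."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(episodes):
        reward, score = rollout(theta, horizon, rng)
        total += reward * score
    return total / episodes
```

Calling `policy_gradient_estimate(0.0)` returns a single float, the estimated derivative of the expected episode return with respect to `theta`. No baseline is subtracted here, so the estimate is noisy; the abstract's third contribution is concerned with reducing exactly this kind of variance.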

Forward citations

Cited by 11 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Concentration of General Stochastic Approximation Under Heavy-Tailed Markovian Noise
math.PR · 2026-05 · unverdicted · novelty 7.0
Establishes maximal concentration bounds for stochastic approximation under heavy-tailed Markovian noise, with tails ranging from sub-Gaussian to heavier than Weibull depending on step sizes and contractivity properti...

NePPO: Near-Potential Policy Optimization for General-Sum Multi-Agent Reinforcement Learning
cs.LG · 2026-03 · unverdicted · novelty 7.0
NePPO learns a player-independent potential function via a novel objective whose minimization yields an approximate Nash equilibrium for general-sum multi-agent games.

Corruption-Tolerant Asynchronous Q-Learning with Near-Optimal Rates
cs.LG · 2025-09 · unverdicted · novelty 6.0
A novel robust asynchronous Q-learning algorithm achieves finite-time convergence rates that match clean-data bounds up to an additive term proportional to the corruption fraction, with a matching information-theoreti...

SENIOR: Efficient Query Selection and Preference-Guided Exploration in Preference-based Reinforcement Learning
cs.RO · 2025-06 · unverdicted · novelty 6.0
SENIOR improves feedback efficiency and policy learning speed in PbRL by combining motion-distinction query selection via kernel density estimation with preference-guided intrinsic rewards, showing gains on simulated ...

Optimistic ε-Greedy Exploration for Cooperative Multi-Agent Reinforcement Learning
cs.MA · 2025-02 · unverdicted · novelty 6.0
Optimistic ε-Greedy Exploration adds decoupled optimistic networks that converge in probability to maximum returns and samples from them with probability ε to increase optimal joint-action frequency in CTDE MARL.

Artificial Intelligence for Modeling and Simulation of Mixed Automated and Human Traffic
cs.AI · 2026-04 · unverdicted · novelty 5.0
This survey synthesizes AI techniques for mixed autonomy traffic simulation and introduces a taxonomy spanning agent-level behavior models, environment-level methods, and cognitive/physics-informed approaches.

When Altruism Meets Autonomy: Managing Bottleneck Congestion with Strategic Autonomous Vehicles
eess.SY · 2026-04 · unverdicted · novelty 5.0
Under selfish human driver behavior, the congestion relief from strategic autonomous vehicles at bottlenecks is non-increasing with AV penetration, featuring plateaus and gains only at critical thresholds.

Scalable Quantum Reinforcement Learning on NISQ Devices with Dynamic-Circuit Qubit Reuse and Grover Optimization
quant-ph · 2025-09 · unverdicted · novelty 5.0
A dynamic-circuit framework for multi-step quantum Markov decision processes reduces physical qubit count from O(T) to O(1) while preserving trajectory fidelity and applying Grover amplification for high-return paths.

Focusing Influence Mechanism for Multi-Agent Reinforcement Learning
cs.LG · 2025-06 · unverdicted · novelty 5.0
The Focusing Influence Mechanism (FIM) uses an entropy-based criterion and eligibility traces to help multiple agents in reinforcement learning focus and maintain their influence on under-explored parts of the state s...

Quantum framework for Reinforcement Learning: Integrating Markov decision process, quantum arithmetic, and trajectory search
quant-ph · 2024-12 · unverdicted · novelty 3.0
The paper claims a fully quantum MDP model for RL with quantum state transitions, return calculation, and trajectory search that achieves quantum enhancement.

Generative Models and Connected and Automated Vehicles: A Survey in Exploring the Intersection of Transportation and AI
cs.LG · 2024-03 · unverdicted · novelty 2.0
A survey reviewing the integration of generative models with connected and automated vehicles to enhance predictive modeling, simulation accuracy, and decision-making.