Multi-Agent Reinforcement Learning for Safe Autonomous Driving Under Pedestrian Behavioral Uncertainty

Kaushik Raghupathruni; Prakash Aryan; Sebastiano Panichella; Timo Kehrer

arXiv: 2605.20255 · v2 · pith:RV3UIBPVnew · submitted 2026-05-18 · cs.LG · cs.AI · cs.HC · cs.RO

Pith reviewed 2026-05-21 08:11 UTC · model grok-4.3

keywords: multi-agent reinforcement learning, autonomous driving, pedestrian behavior modeling, jaywalking simulation, safety testing, MAPPO, trajectory metrics

The pith

Jointly training self-driving cars and pedestrians with multi-agent reinforcement learning produces more realistic crossing scenarios and measurable behavior gaps than fixed policies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that co-training an autonomous vehicle and a group of pedestrians in a shared reinforcement learning environment captures the hidden uncertainties in human crossing decisions better than training the vehicle against static pedestrian rules. Pedestrians follow fixed paths but learn go-or-wait choices influenced by an unobserved personality trait that raises jaywalking odds, while the vehicle must respond without seeing that trait. This setup matters for safety testing because current simulations often underestimate jaywalking risks that cause most collisions in the evaluations. Results show the jointly trained vehicle completes more goals with fewer crashes and exhibits clear speed differences near unpredictable crossings. The approach also demonstrates that these gaps can be read directly from recorded trajectories without additional sensors.

Core claim

A self-driving car (SDC) and 12 pedestrians are co-trained with MAPPO in an environment where pedestrian locomotion uses Dijkstra pathfinding and an RL policy governs go/wait decisions modulated by a per-pedestrian hidden personality trait. In 500-episode tests the co-trained SDC reaches 78 percent of goals with a 14 percent collision rate, outperforming the best rule-based baseline of 35 percent goals and 33 percent collisions. A speed differential metric reveals the SDC moves 2.65 m/s faster near jaywalkers than near crosswalk users at 0-3 m range, while jaywalking events comprise only 13 percent of crossings yet account for 62 percent of collisions. Co-training reduces collisions by 30 percent relative to single-agent RL.

What carries the argument

MAPPO-based multi-agent co-training in which pedestrians learn go/wait policies conditioned on a hidden personality trait while the SDC must infer behavior from observed motion alone.
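The information asymmetry described here can be sketched in a few lines of Python. This is a minimal illustration, not the paper's environment: the `Pedestrian` class, the uniform trait range, the linear `jaywalk_probability` mapping with its `base` and `gain` values, and the observation layouts are all assumptions made for the example. The source does not state the trait distribution or the functional form.

```python
import random
from dataclasses import dataclass


@dataclass
class Pedestrian:
    position: tuple[float, float]
    speed: float
    trait: float  # hypothetical hidden trait in [0, 1]; higher means more prone to jaywalk


def sample_pedestrians(n, rng):
    """Sample one hidden trait per pedestrian at episode start (uniform is an assumption)."""
    return [
        Pedestrian(position=(rng.uniform(0, 50), rng.uniform(0, 50)), speed=0.0, trait=rng.random())
        for _ in range(n)
    ]


def jaywalk_probability(trait, base=0.05, gain=0.4):
    """Hypothetical linear trait-to-probability mapping, capped at 1.0."""
    return min(1.0, base + gain * trait)


def sdc_observation(pedestrians):
    """The SDC sees only observable motion (x, y, speed); the trait is left out."""
    return [(p.position[0], p.position[1], p.speed) for p in pedestrians]


def pedestrian_observation(ped, sdc_distance, sdc_speed):
    """A pedestrian's go/wait policy input includes its own trait."""
    return (ped.trait, sdc_distance, sdc_speed)


rng = random.Random(0)
pedestrians = sample_pedestrians(12, rng)
sdc_obs = sdc_observation(pedestrians)
```

The point of the split is that `sdc_observation` never carries `trait`, so any anticipation of jaywalking has to be inferred from motion alone.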

If this is right

Jaywalking accounts for a small fraction of crossings yet drives the majority of collisions, highlighting the need for anticipation models that treat personality-driven deviations separately.
The co-trained SDC shows higher speeds near jaywalkers at close range, indicating that interaction learning still leaves measurable anticipation gaps.
Collision rates drop 30 percent when pedestrians also adapt during training, suggesting that mutual policy learning improves overall safety metrics.
Trajectory-derived speed differentials provide a direct, sensor-free way to quantify the predictability gap between crossing types.
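One way to read the trajectory-derived speed differential is as a difference of mean SDC speeds within a distance band, split by crossing type. The sketch below assumes a flat log of `(sdc_speed_mps, distance_m, crossing_type)` tuples and the labels `"jaywalk"` and `"crosswalk"`; both the log layout and the toy numbers are invented for illustration and are not taken from the paper.

```python
from statistics import mean


def speed_differential(samples, near=(0.0, 3.0)):
    """Mean SDC speed near jaywalkers minus mean SDC speed near crosswalk users.

    samples: iterable of (sdc_speed_mps, distance_m, crossing_type) tuples.
    near:    inclusive distance band in metres (0-3 m mirrors the reported range).
    Returns None when either group has no samples in the band.
    """
    lo, hi = near
    jaywalk = [s for s, d, kind in samples if lo <= d <= hi and kind == "jaywalk"]
    crosswalk = [s for s, d, kind in samples if lo <= d <= hi and kind == "crosswalk"]
    if not jaywalk or not crosswalk:
        return None
    return mean(jaywalk) - mean(crosswalk)


# Toy log, invented for this example only.
samples = [
    (6.0, 2.0, "jaywalk"),
    (5.0, 1.0, "jaywalk"),
    (3.0, 2.5, "crosswalk"),
    (2.0, 0.5, "crosswalk"),
    (9.0, 10.0, "jaywalk"),  # outside the 0-3 m band, ignored
]
diff = speed_differential(samples)  # 5.5 - 2.5 = 3.0 m/s on the toy data
```

A positive value means the vehicle was moving faster when close to jaywalkers than when close to crosswalk users, which is the direction of the reported 2.65 m/s gap.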

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The hidden-trait mechanism could be extended to model other unobservable factors such as distraction or group influence in future multi-agent driving simulators.
Trajectory-based metrics might transfer to post-hoc analysis of real-world fleet data to flag high-risk interaction patterns without new instrumentation.
Scaling the same co-training loop to include cyclists or other road users could reveal similar behavior gaps in more complex mixed-traffic scenes.

Load-bearing premise

The combination of scripted Dijkstra paths, RL-controlled go/wait decisions, and one hidden personality trait per pedestrian is enough to represent real human crossing uncertainty and heterogeneity.

What would settle it

A direct comparison of speed profiles and collision rates in the simulated environment against video recordings of actual urban crossings under comparable visibility and speed conditions.

Figures

Figures reproduced from arXiv: 2605.20255 by Kaushik Raghupathruni, Prakash Aryan, Sebastiano Panichella, Timo Kehrer.

**Figure 2.** System architecture. (a) CTDE: the centralized critic uses global state during training and is discarded at execution.

**Figure 3.** Goal and collision rates across methods (500 episodes …

**Figure 4.** SDC speed vs. distance to nearest pedestrian, separated …

Abstract

Simulation-based testing of self-driving cars (SDCs) typically relies on scripted pedestrian models that do not capture the heterogeneity and uncertainty of real crossing behavior, limiting the realism of safety assessments, especially for jaywalking, which is governed by latent personality traits the vehicle cannot observe. We hypothesize that jointly training pedestrians and the SDC with multi-agent reinforcement learning (MARL) yields more realistic interaction scenarios than training against fixed pedestrian policies, and that the behavior gap between predictable and unpredictable crossings can be measured directly from trajectories. We co-train an SDC and 12 pedestrians using Multi-Agent Proximal Policy Optimization (MAPPO): pedestrian locomotion follows scripted Dijkstra pathfinding while an RL policy controls high-level go/wait decisions, and jaywalking probability depends on a per-pedestrian trait sampled at episode start and hidden from the SDC. In 500-episode evaluations, the co-trained SDC reached 78% of goals with a 14% collision rate, versus 35%/33% for the best rule-based baseline. A speed differential metric shows the SDC traveled 2.65 m/s faster near jaywalkers than near crosswalk users at close range (0-3 m), indicating jaywalking encounters were not anticipated. Jaywalking was 13% of crossing events but 62% of collisions, and co-training reduced collisions by 30% relative to single-agent RL as pedestrians learned to wait when the SDC approached at speed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Co-training the SDC with MARL pedestrians that have hidden jaywalking traits produces clearer simulation gains than fixed policies, but the realism of those pedestrians is not checked against real crossing data.

The main thing to know is that this setup shows measurable differences in how an SDC handles predictable versus unpredictable crossings when both sides are trained together. In 500 episodes the co-trained agent hits 78% goals with 14% collisions, against 35% goals and 33% collisions for the strongest rule-based baseline. Jaywalking events are only 13% of crossings but drive 62% of the collisions, and the speed-differential metric flags that the SDC is moving 2.65 m/s faster near those events at close range.
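The 13 percent and 62 percent figures imply a per-crossing risk ratio that is worth working out explicitly. The arithmetic below assumes every collision is attributed to exactly one crossing event, either a jaywalking crossing or a non-jaywalking one; that attribution rule is an assumption of this sketch, not something the source states.

```python
# Shares reported in the source.
jaywalk_share_of_crossings = 0.13
jaywalk_share_of_collisions = 0.62

# Collisions per crossing for each group, up to a common constant
# (total collisions / total crossings), which cancels in the ratio.
jaywalk_rate = jaywalk_share_of_collisions / jaywalk_share_of_crossings
other_rate = (1 - jaywalk_share_of_collisions) / (1 - jaywalk_share_of_crossings)

relative_risk = jaywalk_rate / other_rate
print(round(relative_risk, 1))  # about 10.9
```

Under that assumption a jaywalking crossing is roughly eleven times as likely to end in a collision as any other crossing, which is a more direct statement of the imbalance than the two shares alone.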

Referee Report

3 major / 2 minor

Summary. The paper introduces a MARL environment using MAPPO to co-train an SDC and 12 pedestrians, where pedestrian locomotion uses scripted Dijkstra paths and RL controls go/wait decisions modulated by a hidden per-pedestrian personality trait that governs jaywalking probability. The central claim is that this co-training yields more realistic interaction scenarios than fixed pedestrian policies, with the behavior gap measurable from trajectories; 500-episode results report the co-trained SDC reaching 78% goals at 14% collision rate versus 35% goals and 33% collisions for the best rule-based baseline, plus a 2.65 m/s speed differential near jaywalkers and jaywalking (13% of events) linked to 62% of collisions.

Significance. If the performance gains hold under equivalent training budgets and the pedestrian model is shown to better approximate real human heterogeneity, the work could advance simulation-based safety validation for autonomous driving by incorporating latent behavioral uncertainty. The numerical differences are clear, but without real-data calibration or statistical rigor the significance for practical deployment remains provisional.

major comments (3)

Abstract: the claim that co-training 'produces more realistic interaction scenarios' rests on the unvalidated assumption that the Dijkstra+RL+hidden-trait pedestrian model captures human crossing heterogeneity; no calibration or statistical comparison to real pedestrian trajectories is provided, so the metrics only demonstrate SDC behavioral differences rather than improved fidelity.
Abstract (evaluation paragraph): the reported 78% vs 35% goal-reaching and 14% vs 33% collision rates do not state whether the rule-based baseline received an equivalent training budget or number of environment steps, leaving the superiority claim only partially supported.
Abstract: no variance, confidence intervals, or statistical significance tests accompany the 500-episode aggregate metrics (e.g., the 2.65 m/s speed differential or 62% collision attribution), so it is unclear whether the observed gaps are robust or sensitive to random seeds.
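As a rough illustration of the last comment, a Wilson score interval can be put around the reported collision rates. The sketch assumes each of the 500 evaluation episodes is an independent trial with a binary collision outcome, and that 14 percent and 33 percent correspond to 70 and 165 colliding episodes; neither assumption is confirmed by the source. The interval reflects evaluation sampling noise only and says nothing about sensitivity to training seeds.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Episode counts inferred from the reported percentages (an assumption).
cotrained_collisions = wilson_interval(70, 500)   # 14% of 500 episodes
baseline_collisions = wilson_interval(165, 500)   # 33% of 500 episodes

print(cotrained_collisions, baseline_collisions)
```

If the two intervals do not overlap, the evaluation-level gap is unlikely to be an artifact of episode sampling, though a single training seed could still drive it.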

minor comments (2)

Abstract: the description of the personality trait sampling and its effect on jaywalking probability could be expanded with the exact functional form or probability mapping used.
The manuscript should clarify the observation space available to the SDC (e.g., whether personality traits or go/wait intentions are fully hidden) to support reproducibility.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for their insightful comments on our work. We address each of the major comments in detail below, indicating where we agree and the revisions we will implement to strengthen the manuscript.

Referee: Abstract: the claim that co-training 'produces more realistic interaction scenarios' rests on the unvalidated assumption that the Dijkstra+RL+hidden-trait pedestrian model captures human crossing heterogeneity; no calibration or statistical comparison to real pedestrian trajectories is provided, so the metrics only demonstrate SDC behavioral differences rather than improved fidelity.

Authors: We agree that the pedestrian model is not calibrated against real data, and thus the assertion of 'more realistic' is an assumption based on the inclusion of hidden personality traits that modulate jaywalking. The manuscript presents this as a hypothesis, with results showing that co-training leads to different SDC behaviors and reduced collisions. To address the concern, we will revise the abstract to replace 'produces more realistic interaction scenarios' with 'leads to interaction scenarios with greater behavioral diversity' and add a sentence noting the lack of real-world calibration as a limitation. This change will be reflected in the revised manuscript.

revision: yes

Referee: Abstract (evaluation paragraph): the reported 78% vs 35% goal-reaching and 14% vs 33% collision rates do not state whether the rule-based baseline received an equivalent training budget or number of environment steps, leaving the superiority claim only partially supported.

Authors: The rule-based baselines are fixed, non-adaptive policies and therefore do not receive training or environment steps in the same manner as the RL agents. The training is performed only for the MAPPO policies in the co-training setup and the single-agent RL comparison. All methods are evaluated over the same 500 episodes. We will revise the abstract and methods to explicitly state that the baselines are rule-based fixed policies without training, ensuring the comparison is clearly between learned policies against different pedestrian behaviors.

revision: yes

Referee: Abstract: no variance, confidence intervals, or statistical significance tests accompany the 500-episode aggregate metrics (e.g., the 2.65 m/s speed differential or 62% collision attribution), so it is unclear whether the observed gaps are robust or sensitive to random seeds.

Authors: We concur that variability measures are important for assessing robustness. The reported figures are averages over 500 episodes that incorporate stochasticity from personality trait sampling and environment initialization. In the revision, we will add standard deviations to the key metrics in the abstract and results section. We will also clarify that the training used a single random seed for reproducibility, and note that sensitivity to seeds is a potential limitation that could be explored with additional computational resources.

revision: partial

standing simulated objections not resolved

Providing a full statistical comparison and calibration of the simulated pedestrian behaviors to real human trajectory data, as this would necessitate new experiments with external datasets beyond the scope of the current simulation study.

Circularity Check

0 steps flagged

No significant circularity; results are direct simulation rollouts independent of model inputs

The paper defines a MARL setup (MAPPO co-training of SDC and 12 pedestrians, Dijkstra locomotion plus RL go/wait decisions, hidden per-pedestrian personality trait controlling jaywalking probability) as an input modeling choice. It then reports independent empirical outcomes from 500-episode rollouts: 78% goal success and 14% collisions for the co-trained agent versus 35% goals and 33% collisions for the best rule-based baseline, plus derived trajectory metrics such as 2.65 m/s speed differential near jaywalkers and 62% of collisions from 13% jaywalking events. These quantities are computed directly from executed trajectories and do not reduce algebraically or definitionally to the personality parameters or training procedure. No self-citations, uniqueness theorems, or fitted-input renamings are invoked to support the central performance claims; the comparison to single-agent RL and rule-based baselines supplies external reference points within the simulation. The derivation chain is therefore self-contained against its own benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that the chosen pedestrian model (scripted paths plus RL go/wait plus hidden trait) is a faithful proxy for real behavioral uncertainty; no free parameters are explicitly fitted to external data in the reported results.

free parameters (1)

per-pedestrian personality trait
Sampled once per episode to set jaywalking probability; value distribution not specified in abstract.

axioms (1)

domain assumption: Pedestrian locomotion follows scripted Dijkstra pathfinding while RL controls only high-level go/wait decisions
Stated directly in the environment description.
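The split between scripted locomotion and a learned go/wait decision can be sketched as follows. This assumes a 4-connected occupancy grid with unit step costs and a policy that emits `"go"` or `"wait"` each step; the grid representation, the `follow_path` helper, and the decision labels are hypothetical and stand in for whatever the paper's environment actually uses.

```python
import heapq


def dijkstra_path(grid, start, goal):
    """Shortest path on a grid (0 = walkable, 1 = blocked); returns a list of cells or None."""
    rows, cols = len(grid), len(grid[0])
    dist = {start: 0}
    prev = {}
    heap = [(0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == goal:
            break
        if d > dist[node]:
            continue  # stale heap entry
        r, c = node
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                nd = d + 1
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    prev[(nr, nc)] = node
                    heapq.heappush(heap, (nd, (nr, nc)))
    if goal not in dist:
        return None
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]


def follow_path(path, decisions):
    """Advance one waypoint per "go", hold position on "wait"; the route itself never changes."""
    idx = 0
    trace = [path[0]]
    for decision in decisions:
        if decision == "go" and idx < len(path) - 1:
            idx += 1
        trace.append(path[idx])
    return trace


grid = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
path = dijkstra_path(grid, (0, 0), (2, 0))
trace = follow_path(path, ["go", "wait", "go"])
```

In this arrangement the learned policy only controls timing along a fixed route, which is why the premise matters: any uncertainty about where a pedestrian walks is outside what the RL component can express.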

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: We hypothesize that jointly training pedestrians and the SDC with multi-agent reinforcement learning (MARL) produces more realistic interaction scenarios than training the SDC against fixed pedestrian policies

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: Jaywalking probability depends on a per-pedestrian personality trait sampled at episode start and hidden from the SDC

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.