Safe2Drive: Evaluating Safe Driving Behaviors of E2E Autonomous Driving Models

Changzhong Qian; Congyuan Yu; Kalpana Panda; Nishad Sahu; Ragunathan Rajkumar; Shounak Sural

arxiv: 2606.00191 · v1 · pith:E3YU3HLGnew · submitted 2026-05-29 · 💻 cs.RO · cs.CV

Nishad Sahu, Kalpana Panda, Congyuan Yu, Changzhong Qian, Shounak Sural, Ragunathan Rajkumar

Pith reviewed 2026-06-28 22:13 UTC · model grok-4.3

keywords: autonomous driving · safety evaluation · end-to-end policies · simulation benchmarks · CARLA · driving metrics · work zones · pedestrian scenarios

The pith

Two leading end-to-end driving models lose most of their benchmark performance when tested on work zones, jaywalking pedestrians, and occluded road users.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper adds one hundred new scenarios to an existing driving benchmark to target three common hazard types that current tests under-sample. It pairs those scenarios with a new safety metric that checks braking timing, object contact, lane position, and smoothness in addition to standard progress and collision counts. When two published policies run on the extended set, their overall scores fall by more than half and the safety metric stays near the bottom of its range, matching observed failures such as ignoring work-zone signs and late braking for pedestrians. The evaluation uses the same simulator towns the models were trained on, so the drop cannot be blamed on distribution shift alone. The results indicate that high scores on existing closed-loop tests do not guarantee safe responses to frequent real-road risks.

Core claim

Evaluating two state-of-the-art policies on Safe2Drive reveals sharp drops relative to their Bench2Drive baselines (LEAD from 94.70 to 39.95 driving score, SimLingo from 85.07 to 41.00) together with low SafeDriving Scores (11.85 and 15.27). These numbers align with concrete behavioral failures including poor work-zone understanding, red-light violations, and late or absent braking for pedestrians. The study concludes that current end-to-end models still lack reliable safe behavioral reasoning even inside the CARLA towns used during training.
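As a quick check of the "more than half" framing, the relative drops can be computed from the four driving scores quoted above. This is a minimal sketch: the dictionary layout and the helper name `relative_drop` are illustrative and do not come from the paper.

```python
# Driving scores (DS) as quoted in the abstract: (Bench2Drive, Safe2Drive).
scores = {
    "LEAD": (94.70, 39.95),
    "SimLingo": (85.07, 41.00),
}


def relative_drop(baseline: float, new: float) -> float:
    """Fraction of the baseline score that is lost (hypothetical helper)."""
    return (baseline - new) / baseline


drops = {name: relative_drop(b, s) for name, (b, s) in scores.items()}

for name, drop in drops.items():
    print(f"{name}: {drop:.1%} relative drop in DS")
```

Both policies lose a little over half of their Bench2Drive driving score (roughly 58% for LEAD and 52% for SimLingo), which is consistent with the summary above.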

What carries the argument

Safe2Drive benchmark of 100 added scenarios covering work zones, pedestrian jaywalking, and occluded vulnerable road users, together with the SafeDriving Score metric that augments prior evaluators with pre-crash braking, work-zone contact, lane-centering, and smoothness checks.
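This page does not give the formulas or thresholds behind the SafeDriving Score, so the following is only a sketch of how two of the named checks (pre-crash braking and smoothness) could be evaluated from a sampled speed trace. Every function name, threshold, and sample trace here is an assumption made for illustration, not the paper's definition.

```python
from typing import List


def longitudinal_accel(speeds: List[float], dt: float) -> List[float]:
    """Finite-difference acceleration (m/s^2) from a speed trace (m/s)."""
    return [(b - a) / dt for a, b in zip(speeds, speeds[1:])]


def braked_before_collision(
    speeds: List[float],
    dt: float,
    collision_idx: int,
    brake_thresh: float = -3.0,  # hypothetical threshold, m/s^2
    window_s: float = 2.0,       # hypothetical look-back window, s
) -> bool:
    """True if a deceleration at least as strong as brake_thresh occurs
    within window_s before the collision sample."""
    accel = longitudinal_accel(speeds, dt)
    start = max(0, collision_idx - int(round(window_s / dt)))
    return any(a <= brake_thresh for a in accel[start:collision_idx])


def max_abs_jerk(speeds: List[float], dt: float) -> float:
    """Largest absolute jerk (m/s^3), used here as a crude smoothness proxy."""
    accel = longitudinal_accel(speeds, dt)
    jerks = [(b - a) / dt for a, b in zip(accel, accel[1:])]
    return max((abs(j) for j in jerks), default=0.0)


# Made-up traces sampled every 0.5 s, with a collision at the last sample.
dt = 0.5
no_brake = [10.0, 10.0, 10.0, 10.0, 10.0, 10.0]
late_brake = [10.0, 10.0, 10.0, 10.0, 8.0, 5.0]

print(braked_before_collision(no_brake, dt, collision_idx=5))
print(braked_before_collision(late_brake, dt, collision_idx=5))
print(max_abs_jerk(late_brake, dt))
```

Checks of this kind separate "hit the pedestrian without reacting" from "reacted, but too late", which a plain collision count cannot do. How the paper combines such checks into a single score is not stated on this page.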

If this is right

Standard benchmark scores alone do not reveal whether an end-to-end policy will respond safely to common road hazards.
End-to-end models can still produce red-light violations and absent braking even inside training-town environments.
Adding targeted hazard scenarios to evaluation suites exposes behavioral weaknesses that are invisible on prior test sets.
Safety-centric metrics that penalize late braking and work-zone contact produce lower scores than progress-only metrics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training pipelines could be extended to include the new scenarios as negative examples to improve hazard responses without changing the base architecture.
The same evaluation approach could be applied to other published end-to-end policies to test whether the observed safety gap is widespread.
If the safety failures persist on physical vehicles, regulators may need scenario-specific certification tests beyond aggregate driving scores.

Load-bearing premise

The one hundred added scenarios form a representative sample of frequent real-world safety-critical events and the chosen simulator plus metric thresholds correctly capture the safety-relevant parts of vehicle behavior.

What would settle it

Re-running the same two policies on a different simulator or on recorded real-world traces of work-zone and pedestrian events and checking whether the performance drop and safety-score pattern remain.

Figures

Figures reproduced from arXiv: 2606.00191 by Changzhong Qian, Congyuan Yu, Kalpana Panda, Nishad Sahu, Ragunathan Rajkumar, Shounak Sural.

Figure 2: Collision analysis for a child jaywalking case in stormy night in S2D. The speed and longitudinal-acceleration traces are aligned

Original abstract

Recent end-to-end (E2E) autonomous driving policies achieve high driving scores in closed-loop simulations. Yet it remains unclear whether these policies handle common safety-critical scenarios. We present Safe2Drive (S2D), a set of Bench2Drive-aligned scenario extensions focused on three frequent families of road hazards: work zones, pedestrian jaywalking, and occluded vulnerable road users (VRUs). Safe2Drive adds 100 common but challenging scenarios and introduces SafeDriving Score (SDS), a safety-centric metric that augments prior evaluators with pre-crash braking, work zone-object contact, lane centering, and smoothness checks. Evaluating two state-of-the-art policies (LEAD and SimLingo) on S2D, we find that their driving scores drop sharply relative to their reported Bench2Drive baselines (LEAD: from 94.70 DS on Bench2Drive to 39.95 DS on S2D; SimLingo: from 85.07 DS on Bench2Drive to 41.00 DS on S2D) and that SDS on S2D is low (11.85 for LEAD and 15.27 for SimLingo). These results are consistent with brittle safe-driving behaviors such as poor work-zone understanding, red-light violations, and late or absent braking for pedestrians. This study highlights a lack of safe behavioral reasoning in E2E models even when tested on CARLA towns that are part of the training set. We plan to release the code and videos for all 100 S2D scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows sharp drops in model performance on new safety scenarios but provides little detail on how those scenarios were built or how the metric thresholds were chosen.

The main point is that two state-of-the-art end-to-end driving models drop from driving scores above 85 on Bench2Drive to around 40 on Safe2Drive, with SafeDriving Scores below 16. This suggests the models struggle with work zones, red lights, and pedestrian braking.

What is new is the set of 100 scenarios focused on three hazard families and the SDS metric that adds safety-specific checks. The paper does a decent job of listing specific failure modes like poor work-zone understanding and late braking.

The numbers are presented clearly and the idea of extending an existing benchmark is straightforward. Releasing the scenarios is a plus.

The main weakness is the lack of information on scenario selection and metric definition. There is no description of how the 100 scenarios were generated, whether they match real-world frequencies, or how the SDS thresholds were calibrated. This makes it hard to judge if the performance drop reflects a true safety gap or just the choice of test cases. The abstract also omits any statistical tests on the differences.

This work is aimed at people evaluating autonomous driving systems for safety. Readers who care about benchmark gaps will find the concrete comparisons useful.

The paper shows clear thinking in identifying a potential issue with current evaluations. It deserves peer review so the details can be examined.

I recommend sending it to referees rather than desk rejecting it.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Safe2Drive (S2D), a Bench2Drive-aligned extension adding 100 scenarios focused on three families of road hazards (work zones, pedestrian jaywalking, occluded VRUs). It defines a new SafeDriving Score (SDS) that augments prior metrics with checks for pre-crash braking, work-zone contact, lane centering, and smoothness. Evaluation of LEAD and SimLingo on S2D reports sharp drops in driving score relative to their Bench2Drive baselines (LEAD 94.70→39.95; SimLingo 85.07→41.00) together with low SDS values (11.85 and 15.27), which the authors interpret as evidence of brittle safe-driving behaviors such as poor work-zone understanding and absent pedestrian braking.

Significance. If the 100 scenarios prove representative of frequent real-world hazards and the SDS thresholds are shown to be appropriately calibrated, the work would usefully document concrete safety shortcomings in current E2E driving policies even on training-town CARLA environments. The explicit numeric drops and failure-mode examples provide a clear, falsifiable starting point for follow-up safety research.

major comments (3)

[Abstract] Abstract (paragraph describing S2D construction): the statement that the 100 scenarios target “three frequent families” of road hazards supplies no selection methodology, real-world frequency statistics, or explicit alignment procedure with Bench2Drive; without this, the observed DS drops cannot be interpreted as evidence that the policies fail on common safety-critical events.
[Abstract] Abstract (SDS definition): the added SDS components (pre-crash braking, work-zone contact, lane centering, smoothness) and their numerical thresholds are presented without calibration against accident databases, human-driver baselines, or sensitivity analysis; this directly affects whether the reported low SDS values (11.85/15.27) correctly identify unsafe behavior rather than penalizing acceptable CARLA dynamics.
[Results] Results (comparison with Bench2Drive baselines): the manuscript reports large DS drops but provides no statistical significance tests, confidence intervals, or per-scenario variance; given that post-hoc scenario selection is a recognized risk, the central claim that the drops demonstrate “brittle safe-driving behaviors” rests on unquantified differences.

minor comments (2)

The abstract states that code and videos will be released, yet the manuscript contains no reproducibility checklist, exact CARLA version, or seed information for the 100 scenarios.
Notation for SDS components is introduced only in the abstract; a dedicated methods subsection would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. We address each of the major comments below, indicating where revisions will be made to the paper.

Referee: [Abstract] Abstract (paragraph describing S2D construction): the statement that the 100 scenarios target “three frequent families” of road hazards supplies no selection methodology, real-world frequency statistics, or explicit alignment procedure with Bench2Drive; without this, the observed DS drops cannot be interpreted as evidence that the policies fail on common safety-critical events.

Authors: We agree that the abstract would benefit from more detail on how the scenarios were constructed. The 100 scenarios were created by extending the Bench2Drive scenario suite with additional parameters for the three hazard families, using the same CARLA towns and route structures for alignment. We will revise the abstract to include a brief description of this procedure and clarify that the scenarios represent common hazard types in the simulation environment. We do not have access to new real-world frequency statistics in this study and will qualify the language accordingly.
Revision: partial

Referee: [Abstract] Abstract (SDS definition): the added SDS components (pre-crash braking, work-zone contact, lane centering, smoothness) and their numerical thresholds are presented without calibration against accident databases, human-driver baselines, or sensitivity analysis; this directly affects whether the reported low SDS values (11.85/15.27) correctly identify unsafe behavior rather than penalizing acceptable CARLA dynamics.

Authors: The SDS components and thresholds were designed to capture obvious safety violations in CARLA, such as colliding with work zone objects or failing to brake before impact. We will expand the methods section to provide the rationale for each addition and the chosen thresholds. We acknowledge the absence of formal calibration against external databases and will add this as a limitation of the current metric. A sensitivity analysis is planned for future extensions of this work.
Revision: yes

Referee: [Results] Results (comparison with Bench2Drive baselines): the manuscript reports large DS drops but provides no statistical significance tests, confidence intervals, or per-scenario variance; given that post-hoc scenario selection is a recognized risk, the central claim that the drops demonstrate “brittle safe-driving behaviors” rests on unquantified differences.

Authors: We will revise the results section to include per-scenario standard deviations and confidence intervals for the reported driving scores and SDS values. Given the closed-loop nature of the evaluations, we will also discuss the consistency of failures across scenarios to support the interpretation of brittle behaviors. This addresses the concern about unquantified differences.
Revision: yes
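One common way to attach a confidence interval to a mean per-scenario score is a percentile bootstrap. The sketch below is illustrative only: the scores are placeholder values rather than data from the paper, and `bootstrap_ci` is a hypothetical helper, not the authors' analysis.

```python
import random
import statistics


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(values)
    # Resample scenarios with replacement and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Placeholder per-scenario scores on a 0-100 scale, NOT results from the paper.
demo_scores = [0.0, 12.5, 40.0, 100.0, 35.0, 0.0, 60.0, 22.0, 80.0, 45.0]

mean_score = statistics.fmean(demo_scores)
ci_low, ci_high = bootstrap_ci(demo_scores)
print(f"mean = {mean_score:.2f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
```

Resampling at the scenario level treats each scenario as one observation, which suits a fixed suite of 100 routes. It does not address the separate concern about how those scenarios were selected.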

Circularity Check

0 steps flagged

No circularity: pure empirical evaluation with no derivations or fitted predictions

The paper is an empirical benchmark study that constructs S2D scenarios and SDS metric, then runs existing policies (LEAD, SimLingo) in CARLA to report observed score drops. No equations, parameter fitting, or first-principles derivations are present whose outputs reduce to inputs by construction. Any references to Bench2Drive are external baselines for comparison, not self-citations that justify a uniqueness theorem or ansatz. The central claims rest on simulation outcomes rather than tautological re-labeling or self-referential fitting, satisfying the self-contained empirical case.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the representativeness of the 100 scenarios for real hazards and on the validity of CARLA as a proxy for physical vehicle dynamics; no free parameters or invented physical entities are introduced.

axioms (1)

Domain assumption: CARLA simulator physics and sensor models are sufficiently accurate for the tested safety-critical scenarios.
All quantitative results are obtained inside CARLA; the abstract invokes this without additional validation.

Reference graph

Works this paper leans on

19 extracted references · 3 canonical work pages · 3 internal anchors

[1] Dosovitskiy, Alexey; Ros, German; Codevilla, Felipe; Lopez, Antonio; Koltun, Vladlen. CARLA:
[2] Bench2Drive: Towards Multi-Ability Benchmarking of Closed-Loop End-To-End Autonomous Driving. Advances in Neural Information Processing Systems.
[3] LEAD: Minimizing Learner-Expert Asymmetry in End-to-End Driving. arXiv preprint arXiv:2512.20563.
[4] SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[5] Hidden Biases of End-to-End Driving Datasets. ArXiv.org.
[6] Hidden Biases of End-to-End Driving Models. Proceedings of the IEEE/CVF International Conference on Computer Vision.
[7] PlanT 2.0: Exposing Biases and Structural Flaws in Closed-Loop Driving. 2025.
[8] Safety2Drive: Safety-Critical Scenario Benchmark for the Evaluation of Autonomous Driving. 2025.
[9] Fail2Drive: Benchmarking Closed-Loop Driving Generalization. 2026.
[10] CARLA Autonomous Driving Leaderboard 2.0.
[11] End to End Learning for Self-Driving Cars. arXiv preprint arXiv:1604.07316.
[12] End-to-end Driving via Conditional Imitation Learning. arXiv preprint arXiv:1710.02410.
[13] Kendall, Alex; Gal, Yarin. What Uncertainties Do We Need in
[14] Manual on Uniform Traffic Control Devices.
[15] ROADWork: A Benchmark for Work Zone Understanding in Autonomous Driving. 2024.
[16] JaywalkerVR: A Large-Scale Dataset for Pedestrian Trajectory Prediction in Urban Driving Scenarios. Proceedings of the IEEE International Conference on Robotics and Automation (ICRA).
[17] Occlusion-Aware Risk Assessment for Autonomous Driving in Urban Environments. IEEE International Conference on Intelligent Transportation Systems (ITSC).
[18] A Driving Comfort and Acceptance Evaluation Method for Automated Vehicles. Transportation Research Part F: Traffic Psychology and Behaviour.
[19] WorkZone3D: A Multimodal Dataset for 3D Work Zone Perception in Autonomous Driving. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV).
