arxiv: 2606.00191 · v1 · pith:E3YU3HLGnew · submitted 2026-05-29 · 💻 cs.RO · cs.CV

Safe2Drive: Evaluating Safe Driving Behaviors of E2E Autonomous Driving Models

Pith reviewed 2026-06-28 22:13 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords autonomous driving · safety evaluation · end-to-end policies · simulation benchmarks · CARLA · driving metrics · work zones · pedestrian scenarios

The pith

Two leading end-to-end driving models lose most of their benchmark performance when tested on work zones, jaywalking pedestrians, and occluded road users.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper adds 100 scenarios to the Bench2Drive benchmark, targeting three common hazard types that current tests under-sample: work zones, jaywalking pedestrians, and occluded vulnerable road users. It pairs those scenarios with a new safety metric, the SafeDriving Score (SDS), which checks pre-crash braking, work-zone object contact, lane centering, and smoothness on top of what prior evaluators measure. When two published policies, LEAD and SimLingo, run on the extended set, their driving scores fall by more than half and SDS stays low (11.85 and 15.27), consistent with observed failures such as poor work-zone understanding and late or absent braking for pedestrians. The evaluation uses CARLA towns that are part of the models' training set, so unfamiliar maps cannot explain the drop. The results indicate that high scores on existing closed-loop tests do not guarantee safe responses to frequent road hazards.

Core claim

Evaluating two state-of-the-art policies on Safe2Drive reveals sharp drops relative to their Bench2Drive baselines (LEAD from 94.70 to 39.95 driving score, SimLingo from 85.07 to 41.00) together with low SafeDriving Scores (11.85 and 15.27). These numbers align with concrete behavioral failures including poor work-zone understanding, red-light violations, and late or absent braking for pedestrians. The study concludes that current end-to-end models still lack reliable safe behavioral reasoning even inside the CARLA towns used during training.
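
The "more than half" reading can be checked directly from the reported driving scores. The following is a minimal arithmetic sketch; the helper name `relative_drop` is illustrative and does not come from the paper.

```python
# Driving scores (DS) as reported in the abstract: (Bench2Drive, Safe2Drive).
reported_ds = {
    "LEAD": (94.70, 39.95),
    "SimLingo": (85.07, 41.00),
}


def relative_drop(baseline: float, extended: float) -> float:
    """Fraction of the baseline score that is lost on the extended benchmark."""
    return (baseline - extended) / baseline


drops = {name: relative_drop(b, s) for name, (b, s) in reported_ds.items()}

for name, drop in drops.items():
    print(f"{name}: {drop:.1%} of baseline DS lost")
# LEAD: 57.8% of baseline DS lost
# SimLingo: 51.8% of baseline DS lost
```

Both policies lose more than half of their Bench2Drive driving score. Note that the two numbers in each pair come from different scenario sets, so this is a comparison of reported aggregates, not a paired test.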

What carries the argument

The argument rests on two pieces: the Safe2Drive benchmark, 100 added scenarios covering work zones, pedestrian jaywalking, and occluded vulnerable road users; and the SafeDriving Score (SDS), a metric that augments prior evaluators with pre-crash braking, work-zone contact, lane-centering, and smoothness checks.
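
This summary does not give the SDS formula or its thresholds. Purely as an illustration of what a pre-crash braking check could look like on a logged trace, the sketch below assumes a list of (time, longitudinal acceleration) samples and a known collision timestamp. The function name `braked_before_impact`, the 2.0 s look-back window, and the -3.0 m/s^2 threshold are all hypothetical values chosen for the example, not values from the paper.

```python
from typing import Sequence, Tuple


def braked_before_impact(
    trace: Sequence[Tuple[float, float]],  # (time in s, longitudinal accel in m/s^2)
    t_collision: float,
    window_s: float = 2.0,       # hypothetical look-back window
    decel_thresh: float = -3.0,  # hypothetical "meaningful braking" threshold
) -> bool:
    """Return True if any sample in [t_collision - window_s, t_collision]
    shows a deceleration at or below decel_thresh."""
    start = t_collision - window_s
    return any(
        start <= t <= t_collision and accel <= decel_thresh
        for t, accel in trace
    )


# Synthetic traces for illustration only (not data from the paper).
late_or_absent = [(0.0, 0.2), (1.0, 0.1), (2.0, 0.0), (3.0, -0.5)]
firm_braking = [(0.0, 0.2), (1.0, 0.0), (2.0, -4.5), (3.0, -6.0)]

print(braked_before_impact(late_or_absent, t_collision=3.0))  # False
print(braked_before_impact(firm_braking, t_collision=3.0))    # True
```

A check of this shape separates "collided while braking hard" from "collided without reacting", which a plain collision count cannot do. How the actual metric weighs braking against the other three checks is not stated here.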

If this is right

  • Standard benchmark scores alone do not reveal whether an end-to-end policy will respond safely to common road hazards.
  • End-to-end models can still produce red-light violations and absent braking even inside training-town environments.
  • Adding targeted hazard scenarios to evaluation suites exposes behavioral weaknesses that are invisible on prior test sets.
  • Safety-centric metrics that penalize late braking and work-zone contact yield much lower scores than the standard driving score on the same benchmark (SDS of 11.85 and 15.27 versus DS of 39.95 and 41.00 on S2D).

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training pipelines could be extended to include the new scenarios as negative examples to improve hazard responses without changing the base architecture.
  • The same evaluation approach could be applied to other published end-to-end policies to test whether the observed safety gap is widespread.
  • If the safety failures persist on physical vehicles, regulators may need scenario-specific certification tests beyond aggregate driving scores.

Load-bearing premise

The one hundred added scenarios form a representative sample of frequent real-world safety-critical events and the chosen simulator plus metric thresholds correctly capture the safety-relevant parts of vehicle behavior.

What would settle it

Re-running the same two policies on a different simulator or on recorded real-world traces of work-zone and pedestrian events and checking whether the performance drop and safety-score pattern remain.

Figures

Figures reproduced from arXiv: 2606.00191 by Changzhong Qian, Congyuan Yu, Kalpana Panda, Nishad Sahu, Ragunathan Rajkumar, Shounak Sural.

Figure 1: LEAD and SimLingo ego-view scenes while testing with …
Figure 2: Collision analysis for a child jaywalking case in stormy night in S2D. The speed and longitudinal-acceleration traces are aligned …

Original abstract

Recent end-to-end (E2E) autonomous driving policies achieve high driving scores in closed-loop simulations. Yet it remains unclear whether these policies handle common safety-critical scenarios. We present Safe2Drive (S2D), a set of Bench2Drive-aligned scenario extensions focused on three frequent families of road hazards: work zones, pedestrian jaywalking, and occluded vulnerable road users (VRUs). Safe2Drive adds 100 common but challenging scenarios and introduces SafeDriving Score (SDS), a safety-centric metric that augments prior evaluators with pre-crash braking, work zone-object contact, lane centering, and smoothness checks. Evaluating two state-of-the-art policies (LEAD and SimLingo) on S2D, we find that their driving scores drop sharply relative to their reported Bench2Drive baselines (LEAD: from 94.70 DS on Bench2Drive to 39.95 DS on S2D; SimLingo: from 85.07 DS on Bench2Drive to 41.00 DS on S2D) and that SDS on S2D is low (11.85 for LEAD and 15.27 for SimLingo). These results are consistent with brittle safe-driving behaviors such as poor work-zone understanding, red-light violations, and late or absent braking for pedestrians. This study highlights a lack of safe behavioral reasoning in E2E models even when tested on CARLA towns that are part of the training set. We plan to release the code and videos for all 100 S2D scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Safe2Drive (S2D), a Bench2Drive-aligned extension adding 100 scenarios focused on three families of road hazards (work zones, pedestrian jaywalking, occluded VRUs). It defines a new SafeDriving Score (SDS) that augments prior metrics with checks for pre-crash braking, work-zone contact, lane centering, and smoothness. Evaluation of LEAD and SimLingo on S2D reports sharp drops in driving score relative to their Bench2Drive baselines (LEAD 94.70→39.95; SimLingo 85.07→41.00) together with low SDS values (11.85 and 15.27), which the authors interpret as evidence of brittle safe-driving behaviors such as poor work-zone understanding and absent pedestrian braking.

Significance. If the 100 scenarios prove representative of frequent real-world hazards and the SDS thresholds are shown to be appropriately calibrated, the work would usefully document concrete safety shortcomings in current E2E driving policies even on training-town CARLA environments. The explicit numeric drops and failure-mode examples provide a clear, falsifiable starting point for follow-up safety research.

major comments (3)
  1. [Abstract] Abstract (paragraph describing S2D construction): the statement that the 100 scenarios target “three frequent families” of road hazards supplies no selection methodology, real-world frequency statistics, or explicit alignment procedure with Bench2Drive; without this, the observed DS drops cannot be interpreted as evidence that the policies fail on common safety-critical events.
  2. [Abstract] Abstract (SDS definition): the added SDS components (pre-crash braking, work-zone contact, lane centering, smoothness) and their numerical thresholds are presented without calibration against accident databases, human-driver baselines, or sensitivity analysis; this directly affects whether the reported low SDS values (11.85/15.27) correctly identify unsafe behavior rather than penalizing acceptable CARLA dynamics.
  3. [Results] Results (comparison with Bench2Drive baselines): the manuscript reports large DS drops but provides no statistical significance tests, confidence intervals, or per-scenario variance; given that post-hoc scenario selection is a recognized risk, the central claim that the drops demonstrate “brittle safe-driving behaviors” rests on unquantified differences.
minor comments (2)
  1. The abstract states that code and videos will be released, yet the manuscript contains no reproducibility checklist, exact CARLA version, or seed information for the 100 scenarios.
  2. Notation for SDS components is introduced only in the abstract; a dedicated methods subsection would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. We address each of the major comments below, indicating where revisions will be made to the paper.

  1. Referee: [Abstract] Abstract (paragraph describing S2D construction): the statement that the 100 scenarios target “three frequent families” of road hazards supplies no selection methodology, real-world frequency statistics, or explicit alignment procedure with Bench2Drive; without this, the observed DS drops cannot be interpreted as evidence that the policies fail on common safety-critical events.

    Authors: We agree that the abstract would benefit from more detail on how the scenarios were constructed. The 100 scenarios were created by extending the Bench2Drive scenario suite with additional parameters for the three hazard families, using the same CARLA towns and route structures for alignment. We will revise the abstract to include a brief description of this procedure and clarify that the scenarios represent common hazard types in the simulation environment. We do not have access to new real-world frequency statistics in this study and will qualify the language accordingly. revision: partial

  2. Referee: [Abstract] Abstract (SDS definition): the added SDS components (pre-crash braking, work-zone contact, lane centering, smoothness) and their numerical thresholds are presented without calibration against accident databases, human-driver baselines, or sensitivity analysis; this directly affects whether the reported low SDS values (11.85/15.27) correctly identify unsafe behavior rather than penalizing acceptable CARLA dynamics.

    Authors: The SDS components and thresholds were designed to capture obvious safety violations in CARLA, such as colliding with work zone objects or failing to brake before impact. We will expand the methods section to provide the rationale for each addition and the chosen thresholds. We acknowledge the absence of formal calibration against external databases and will add this as a limitation of the current metric. A sensitivity analysis is planned for future extensions of this work. revision: yes

  3. Referee: [Results] Results (comparison with Bench2Drive baselines): the manuscript reports large DS drops but provides no statistical significance tests, confidence intervals, or per-scenario variance; given that post-hoc scenario selection is a recognized risk, the central claim that the drops demonstrate “brittle safe-driving behaviors” rests on unquantified differences.

    Authors: We will revise the results section to include per-scenario standard deviations and confidence intervals for the reported driving scores and SDS values. Given the closed-loop nature of the evaluations, we will also discuss the consistency of failures across scenarios to support the interpretation of brittle behaviors. This addresses the concern about unquantified differences. revision: yes
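
The rebuttal promises per-scenario standard deviations and confidence intervals but does not say how they will be computed. One common option, assumed here for illustration and not confirmed by the source, is a percentile bootstrap over per-scenario scores. The scores below are synthetic, and `bootstrap_ci` is a hypothetical helper.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean score."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample scenarios with replacement and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Synthetic per-scenario driving scores for 100 scenarios (illustration only,
# not results from the paper).
rng = random.Random(42)
scores = [rng.choice([0.0, 20.0, 50.0, 100.0]) for _ in range(100)]

mean = statistics.fmean(scores)
std = statistics.stdev(scores)
lo, hi = bootstrap_ci(scores)
print(f"mean={mean:.2f} std={std:.2f} 95% CI=({lo:.2f}, {hi:.2f})")
```

Resampling at the scenario level treats the 100 scenarios as the unit of variation, which matches the referee's request for per-scenario variance. It does not address the separate concern about post-hoc scenario selection.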

Circularity Check

0 steps flagged

No circularity: pure empirical evaluation with no derivations or fitted predictions

The paper is an empirical benchmark study that constructs the S2D scenarios and the SDS metric, then runs existing policies (LEAD, SimLingo) in CARLA to report observed score drops. It contains no equations, parameter fitting, or first-principles derivations whose outputs reduce to their inputs by construction. References to Bench2Drive serve as external baselines for comparison, not as self-citations that justify a uniqueness theorem or ansatz. The central claims rest on simulation outcomes rather than on tautological relabeling or self-referential fitting, so the empirical case is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the representativeness of the 100 scenarios for real hazards and on the validity of CARLA as a proxy for physical vehicle dynamics; no free parameters or invented physical entities are introduced.

axioms (1)
  • domain assumption: CARLA simulator physics and sensor models are sufficiently accurate for the tested safety-critical scenarios.
    All quantitative results are obtained inside CARLA; the abstract invokes this without additional validation.


