Recognition: 2 theorem links · Lean Theorem
Rethinking the Open-Loop Evaluation of End-to-End Autonomous Driving in nuScenes
Pith reviewed 2026-05-17 08:12 UTC · model grok-4.3
The pith
A simple MLP using only past trajectories and velocity matches perception-based planners on nuScenes L2 error.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
An MLP that maps raw sensor inputs such as past ego-vehicle trajectory and velocity directly to future trajectory predictions attains end-to-end planning performance on the nuScenes dataset comparable to that of perception-based methods, reducing average L2 error by roughly 20 percent, while perception-based methods hold an edge on collision-rate metrics.
What carries the argument
An MLP baseline that ingests only historical trajectory, velocity, and similar raw signals and outputs the future ego trajectory, with no perception or prediction stages.
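The baseline described here is simple enough to sketch directly. Below is a minimal, illustrative reconstruction in PyTorch; the history length, hidden sizes, and feature layout are assumptions rather than the authors' exact configuration, which lives in the linked AD-MLP repository.

```python
# Illustrative sketch of a perception-free planning baseline: an MLP mapping
# past ego trajectory and velocity to future waypoints. Sizes are assumed.
import torch
import torch.nn as nn

class EgoStateMLP(nn.Module):
    def __init__(self, history_steps=4, future_steps=6, state_dim=3, hidden=512):
        # state_dim: (x, y) offset plus scalar speed per past step (assumed layout)
        super().__init__()
        self.future_steps = future_steps
        self.net = nn.Sequential(
            nn.Linear(history_steps * state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, future_steps * 2),  # (x, y) per future waypoint
        )

    def forward(self, past_states):
        # past_states: (batch, history_steps, state_dim) -> (batch, future_steps, 2)
        flat = past_states.flatten(start_dim=1)
        return self.net(flat).view(-1, self.future_steps, 2)
```

Trained with a plain L2 loss against ground-truth future waypoints, a model of this form never touches camera or LiDAR input, which is the property the argument rests on.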
Load-bearing premise
The MLP baseline is trained and evaluated under exactly the same data splits, preprocessing, and conditions as the perception-based competitors, so that its L2 numbers do not stem from an unmatched advantage in how trajectory history is modeled.
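Part of this premise is directly checkable from evaluation artifacts. The sketch below, assuming each method exports the list of nuScenes sample tokens it was evaluated on (the JSON layout here is hypothetical), verifies that all methods score the same samples.

```python
# Hypothetical split-identity check: confirm every compared method was
# evaluated on the same set of nuScenes sample tokens. The "sample_tokens"
# key and per-method JSON files are assumptions for illustration.
import json

def same_eval_set(paths):
    """paths: per-method JSON files listing evaluated nuScenes sample tokens."""
    token_sets = []
    for p in paths:
        with open(p) as f:
            token_sets.append(set(json.load(f)["sample_tokens"]))
    return bool(token_sets) and all(s == token_sets[0] for s in token_sets)
```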
What would settle it
Re-train the MLP and all compared perception methods from scratch on identical data splits and input formats, then recompute both L2 error and collision rate to check whether the simple model still matches or exceeds the others.
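Recomputing the two metrics from saved trajectories is the mechanical part of this check. The sketch below is a simplified, illustrative version, not the benchmark's reference implementation: the 2 Hz waypoint sampling, the 1/2/3-second horizons, and the waypoint-in-occupied-cell collision test are assumptions (published evaluations typically test the full ego bounding box against BEV occupancy).

```python
# Simplified open-loop metrics: average L2 error at fixed horizons and a
# waypoint-level collision check against BEV occupancy grids.
import numpy as np

def average_l2(pred, gt, horizons_s=(1.0, 2.0, 3.0), hz=2.0):
    """pred, gt: (N, T, 2) future (x, y) waypoints in metres, sampled at `hz`."""
    step_err = np.linalg.norm(pred - gt, axis=-1)                  # (N, T) per-step L2
    per_h = {h: step_err[:, int(h * hz) - 1].mean() for h in horizons_s}
    return per_h, float(np.mean(list(per_h.values())))

def collision_rate(pred, occ, origin, resolution):
    """pred: (N, T, 2) waypoints; occ: (N, T, H, W) boolean occupancy of other
    agents; origin: (x0, y0) of cell (0, 0); resolution: metres per cell."""
    ij = np.floor((pred - np.asarray(origin)) / resolution).astype(int)  # (N, T, 2)
    n, t = pred.shape[:2]
    hits = np.zeros(n, dtype=bool)
    for s in range(n):
        for k in range(t):
            i, j = ij[s, k]
            if 0 <= i < occ.shape[2] and 0 <= j < occ.shape[3]:
                hits[s] |= bool(occ[s, k, i, j])
    return float(hits.mean())
```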
read the original abstract
Modern autonomous driving systems are typically divided into three main tasks: perception, prediction, and planning. The planning task involves predicting the trajectory of the ego vehicle based on inputs from both internal intention and the external environment, and manipulating the vehicle accordingly. Most existing works evaluate their performance on the nuScenes dataset using the L2 error and collision rate between the predicted trajectories and the ground truth. In this paper, we reevaluate these existing evaluation metrics and explore whether they accurately measure the superiority of different methods. Specifically, we design an MLP-based method that takes raw sensor data (e.g., past trajectory, velocity, etc.) as input and directly outputs the future trajectory of the ego vehicle, without using any perception or prediction information such as camera images or LiDAR. Our simple method achieves similar end-to-end planning performance on the nuScenes dataset with other perception-based methods, reducing the average L2 error by about 20%. Meanwhile, the perception-based methods have an advantage in terms of collision rate. We further conduct in-depth analysis and provide new insights into the factors that are critical for the success of the planning task on nuScenes dataset. Our observation also indicates that we need to rethink the current open-loop evaluation scheme of end-to-end autonomous driving in nuScenes. Codes are available at https://github.com/E2E-AD/AD-MLP.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that a simple MLP baseline using only past trajectory and velocity (no perception or prediction modules) achieves similar or superior end-to-end planning performance on nuScenes, reducing average L2 error by ~20% relative to published perception-based methods, while perception-based approaches show an advantage on collision rate. The authors conclude that current open-loop metrics may not properly credit perception and that the evaluation scheme on nuScenes should be rethought, supported by additional analysis of critical factors for planning success and public code release.
Significance. If the empirical comparisons hold under matched conditions, the work is significant for exposing that strong trajectory-history baselines can match or exceed perception-based L2 numbers on nuScenes, thereby questioning whether L2 error alone is a sufficient proxy for planning quality. The public implementation and direct reproduction of the MLP strengthen reproducibility and provide a useful reference point for the community.
major comments (1)
- [Section 3, Table 1] Section 3 and Table 1: the headline claim that the MLP reduces L2 error by ~20% without perception requires that the cited perception baselines were evaluated with identical trajectory-history length, velocity features, data splits, and loss functions. The manuscript does not tabulate these details for the baselines, leaving open the possibility that the reported L2 gap reflects differences in history modeling or auxiliary objectives rather than the absence of perception.
minor comments (2)
- [Abstract] The abstract states a 20% L2 reduction but does not list the exact per-method numbers or the set of compared methods; adding a short table or explicit references in the abstract would improve clarity.
- [Figures] Figure captions and axis labels in the analysis sections could more explicitly state the history length used by the MLP to aid readers in reproducing the conditions.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comment. The point about ensuring matched experimental conditions is well taken, and we address it directly below. We will incorporate the requested clarification in the revised manuscript.
read point-by-point responses
-
Referee: Section 3, Table 1: the headline claim that the MLP reduces L2 error by ~20% without perception requires that the cited perception baselines were evaluated with identical trajectory-history length, velocity features, data splits, and loss functions. The manuscript does not tabulate these details for the baselines, leaving open the possibility that the reported L2 gap reflects differences in history modeling or auxiliary objectives rather than the absence of perception.
Authors: We agree that explicit documentation of these settings is required to substantiate the comparison. In the experiments, we reproduced each baseline using the official nuScenes train/val/test splits, the same 2-second history length for past trajectory and velocity inputs, and the primary L2 loss as reported in the respective original papers. Velocity features were incorporated consistently with the baselines that utilized them. To remove any ambiguity, we will add a new table (or expanded caption) in Section 3 that tabulates history length, input features, data splits, and loss functions for every method in Table 1. This will make clear that the L2 differences arise under matched conditions rather than from mismatched history modeling or objectives.
Revision: yes
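As a concrete illustration of the promised table, the sketch below records per-method evaluation settings as a small structure. The field names are assumptions, and the example row uses only the settings the rebuttal states for the MLP baseline (2-second history, official splits, L2 loss).

```python
# Sketch of the per-method settings table the rebuttal promises for Section 3.
# Field names are illustrative; only the example values come from the rebuttal.
from dataclasses import dataclass

@dataclass
class EvalConditions:
    method: str
    history_seconds: float   # length of past-trajectory / velocity history
    input_features: tuple    # e.g. ("past_trajectory", "velocity")
    data_split: str          # nuScenes split used for evaluation
    loss: str                # primary training objective

mlp_row = EvalConditions(
    method="AD-MLP",
    history_seconds=2.0,
    input_features=("past_trajectory", "velocity"),
    data_split="official nuScenes train/val",
    loss="L2",
)
```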
Circularity Check
No circularity: empirical baseline comparison on public benchmark
full rationale
The paper conducts an empirical reevaluation by training a simple MLP on raw trajectory/velocity inputs and directly measuring L2 error and collision rate against published perception-based methods on the nuScenes dataset. No equations, fitted parameters renamed as predictions, or self-citation chains are used to derive the central claim; the ~20% L2 reduction is presented as an observed experimental outcome. The derivation chain is self-contained against external benchmarks and does not reduce to its own inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- MLP hidden-layer sizes and learning rate
axioms (1)
- domain assumption: nuScenes dataset splits and evaluation protocol remain unchanged and comparable across methods
Lean theorems connected to this paper
-
IndisputableMonolith.Cost.FunctionalEquation · washburn_uniqueness_aczel (unclear)
Unclear relation between the paper passage and the cited Recognition theorem.
Our simple method achieves similar end-to-end planning performance on the nuScenes dataset with other perception-based methods, reducing the average L2 error by about 20%.
-
IndisputableMonolith.Foundation.DAlembert.Inevitability · bilinear_family_forced (unclear)
Unclear relation between the paper passage and the cited Recognition theorem.
we design an MLP-based method that takes raw sensor data (e.g., past trajectory, velocity, etc.) as input and directly outputs the future trajectory of the ego vehicle, without using any perception or prediction information such as camera images or LiDAR.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 19 Pith papers
-
VECTOR-Drive: Tightly Coupled Vision-Language and Trajectory Expert Routing for End-to-End Autonomous Driving
VECTOR-DRIVE couples vision-language reasoning and trajectory planning in one Transformer via semantic expert routing and flow-matching, reaching 88.91 driving score on Bench2Drive.
-
Unveiling the Surprising Efficacy of Navigation Understanding in End-to-End Autonomous Driving
The SNG framework and SNG-VLA model enable end-to-end driving systems to better incorporate global navigation for state-of-the-art route following without auxiliary perception losses.
-
ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving
ReCogDrive unifies VLM scene understanding with a diffusion planner reinforced by DiffGRPO to reach state-of-the-art results on NAVSIM and Bench2Drive benchmarks.
-
VLADriver-RAG: Retrieval-Augmented Vision-Language-Action Models for Autonomous Driving
VLADriver-RAG reaches a new state-of-the-art Driving Score of 89.12 on Bench2Drive by retrieving structure-aware historical knowledge through spatiotemporal semantic graphs and Graph-DTW alignment.
-
HERMES++: Toward a Unified Driving World Model for 3D Scene Understanding and Generation
HERMES++ unifies 3D scene understanding and future geometry prediction in driving scenes via BEV representations, LLM-enhanced queries, a temporal link, and joint geometric optimization.
-
ProDrive: Proactive Planning for Autonomous Driving via Ego-Environment Co-Evolution
ProDrive couples a query-centric planner with a BEV world model for end-to-end ego-environment co-evolution, enabling future-outcome assessment that improves safety and efficiency over reactive baselines on NAVSIM v1.
-
BridgeSim: Unveiling the OL-CL Gap in End-to-End Autonomous Driving
The primary OL-CL gap in end-to-end autonomous driving arises from objective mismatch creating structural inability to model reactive behaviors, which a test-time adaptation method can mitigate.
-
Orion-Lite: Distilling LLM Reasoning into Efficient Vision-Only Driving Models
Orion-Lite uses latent feature distillation and trajectory supervision to create a vision-only model that surpasses its LLM-based teacher on closed-loop Bench2Drive evaluation, achieving a new SOTA driving score of 80.6.
-
AlignDrive: Aligned Lateral-Longitudinal Planning for End-to-End Autonomous Driving
A cascaded end-to-end driving model conditions longitudinal planning on the lateral path via anchor-based regression and path-conditioned 1D displacement prediction, achieving SOTA driving score of 89.07 and 73.18% su...
-
AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning
AutoVLA unifies semantic reasoning and trajectory planning in one autoregressive VLA model for end-to-end autonomous driving by tokenizing trajectories into discrete actions and using GRPO reinforcement fine-tuning to...
-
ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation
ORION reports 77.74 Driving Score and 54.62% Success Rate on Bench2Drive, outperforming prior end-to-end methods by 14.28 DS and 19.61% SR through unified VQA and planning optimization.
-
EMMA: End-to-End Multimodal Model for Autonomous Driving
EMMA is an end-to-end multimodal LLM that converts camera data into trajectories, objects, and road graphs via text prompts and reports state-of-the-art motion planning on nuScenes plus competitive detection results on Waymo.
-
DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models
DriveVLM adds vision-language models with scene description, analysis, and hierarchical planning modules to autonomous driving, paired with a hybrid DriveVLM-Dual system tested on nuScenes and SUP-AD datasets and depl...
-
Can Attribution Predict Risk? From Multi-View Attribution to Planning Risk Signals in End-to-End Autonomous Driving
Attribution statistics derived from multi-view inputs in end-to-end planners can predict planning risks, with reported Spearman correlation of 0.30 with trajectory error and AUROC of 0.77 for collision detection.
-
VLADriver-RAG: Retrieval-Augmented Vision-Language-Action Models for Autonomous Driving
VLADriver-RAG achieves state-of-the-art performance on Bench2Drive by grounding VLA planning in structure-aware retrieved priors via spatiotemporal semantic graphs and Graph-DTW alignment.
-
OVPD: A Virtual-Physical Fusion Testing Dataset of OnSite Autonomous Driving Challenge
OVPD is a new virtual-physical fusion dataset with 20 testing clips totaling nearly 3 hours of multi-modal autonomous driving data for closed-loop evaluation.
-
Artificial Intelligence for Modeling and Simulation of Mixed Automated and Human Traffic
This survey synthesizes AI techniques for mixed autonomy traffic simulation and introduces a taxonomy spanning agent-level behavior models, environment-level methods, and cognitive/physics-informed approaches.
-
DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving
DeepSight uses parallel latent feature prediction in BEV for long-horizon world modeling and adaptive text reasoning to reach state-of-the-art closed-loop performance on the Bench2drive benchmark.
-
Conditional Flow-VAE for Safety-Critical Traffic Scenario Generation
A conditional flow matching model generates realistic safety-critical traffic scenarios by turning nominal scenes into dangerous rollouts using combined simulation and real data.
Reference graph
Works this paper leans on
-
[1]
nuScenes: A multimodal dataset for autonomous driving
Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. In CVPR, 2020.
work page 2020
-
[2]
What data do we need for training an AV motion planner?
Long Chen, Lukas Platinsky, Stefanie Speichert, Blazej Osinski, Oliver Scheel, Yawei Ye, Hugo Grimmett, Luca Del Pero, and Peter Ondruska. What data do we need for training an AV motion planner? In ICRA, 2021.
work page 2021
-
[3]
Transfuser: Imitation with transformer-based sensor fusion for autonomous driving
Kashyap Chitta, Aditya Prakash, Bernhard Jaeger, Zehao Yu, Katrin Renz, and Andreas Geiger. Transfuser: Imitation with transformer-based sensor fusion for autonomous driving. IEEE TPAMI, 2022.
work page 2022
-
[4]
Densetnt: End-to-end trajectory prediction from dense goal sets
Junru Gu, Chen Sun, and Hang Zhao. Densetnt: End-to-end trajectory prediction from dense goal sets. In ICCV, 2021.
work page 2021
-
[5]
FIERY: Future instance prediction in bird's-eye view from surround monocular cameras
Anthony Hu, Zak Murez, Nikhil Mohan, Sofía Dudas, Jeffrey Hawke, Vijay Badrinarayanan, Roberto Cipolla, and Alex Kendall. FIERY: Future instance prediction in bird's-eye view from surround monocular cameras. In ICCV, 2021.
-
[6]
Safe local motion planning with self-supervised freespace forecasting
Peiyun Hu, Aaron Huang, John Dolan, David Held, and Deva Ramanan. Safe local motion planning with self-supervised freespace forecasting. In CVPR, 2021.
work page 2021
-
[7]
St-p3: End-to-end vision-based autonomous driving via spatial-temporal feature learning
Shengchao Hu, Li Chen, Penghao Wu, Hongyang Li, Junchi Yan, and Dacheng Tao. St-p3: End-to-end vision-based autonomous driving via spatial-temporal feature learning. In ECCV, 2022.
work page 2022
-
[8]
Planning-oriented autonomous driving
Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, Lewei Lu, Xiaosong Jia, Qiang Liu, Jifeng Dai, Yu Qiao, and Hongyang Li. Planning-oriented autonomous driving. In CVPR, 2023.
work page 2023
-
[9]
Vad: Vectorized scene representation for efficient autonomous driving
Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. Vad: Vectorized scene representation for efficient autonomous driving. arXiv preprint arXiv:2303.12077, 2023.
-
[10]
Differentiable raycasting for self-supervised occupancy forecasting
Tarasha Khurana, Peiyun Hu, Achal Dave, Jason Ziglar, David Held, and Deva Ramanan. Differentiable raycasting for self-supervised occupancy forecasting. In ECCV, 2022.
work page 2022
-
[11]
Bevformer: Learning bird's-eye-view representation from multi-camera images via spatiotemporal transformers
Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Yu Qiao, and Jifeng Dai. Bevformer: Learning bird's-eye-view representation from multi-camera images via spatiotemporal transformers. In ECCV, 2022.
work page 2022
-
[12]
Pnpnet: End-to-end perception and prediction with tracking in the loop
Ming Liang, Bin Yang, Wenyuan Zeng, Yun Chen, Rui Hu, Sergio Casas, and Raquel Urtasun. Pnpnet: End-to-end perception and prediction with tracking in the loop. In CVPR, 2020.
-
[13]
Sgdr: Stochastic gradient descent with warm restarts
Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. In ICLR, 2017.
work page 2017
-
[14]
Decoupled weight decay regularization
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.
work page 2019
-
[15]
Fast and furious: Real time end-to-end 3d detection, tracking and motion forecasting with a single convolutional net
Wenjie Luo, Bin Yang, and Raquel Urtasun. Fast and furious: Real time end-to-end 3d detection, tracking and motion forecasting with a single convolutional net. In CVPR, 2018.
work page 2018
-
[16]
Perceive, predict, and plan: Safe motion planning through interpretable semantic representations
Abbas Sadat, Sergio Casas, Mengye Ren, Xinyu Wu, Pranaab Dhawan, and Raquel Urtasun. Perceive, predict, and plan: Safe motion planning through interpretable semantic representations. In ECCV, 2020.
work page 2020
-
[17]
Cape: Camera view position embedding for multi-view 3d object detection
Kaixin Xiong, Shi Gong, Xiaoqing Ye, Xiao Tan, Ji Wan, Errui Ding, Jingdong Wang, and Xiang Bai. Cape: Camera view position embedding for multi-view 3d object detection. In CVPR, 2023.
work page 2023
-
[18]
End-to-end interpretable neural motion planner
Wenyuan Zeng, Wenjie Luo, Simon Suo, Abbas Sadat, Bin Yang, Sergio Casas, and Raquel Urtasun. End-to-end interpretable neural motion planner. In CVPR, 2019.
work page 2019
discussion (0)