Recognition: 2 theorem links · Lean Theorem
Rethinking the Open-Loop Evaluation of End-to-End Autonomous Driving in nuScenes
Pith reviewed 2026-05-17 08:12 UTC · model grok-4.3
The pith
A simple MLP using only past trajectories and velocity matches perception-based planners on nuScenes L2 error.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
An MLP that maps raw sensor inputs such as past ego-vehicle trajectory and velocity directly to future trajectory predictions attains end-to-end planning performance on the nuScenes dataset comparable to that of perception-based methods, reducing average L2 error by roughly 20 percent, while perception-based methods hold an edge on collision-rate metrics.
What carries the argument
An MLP baseline that ingests only historical trajectory, velocity, and similar raw signals and outputs the future ego trajectory, with no perception or prediction stages.
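The baseline described here is simple enough to sketch directly. Below is a minimal, illustrative reconstruction in PyTorch; the history length, hidden sizes, and feature layout are assumptions rather than the authors' exact configuration, which lives in the linked AD-MLP repository.

```python
# Illustrative sketch of a perception-free planning baseline: an MLP mapping
# past ego trajectory and velocity to future waypoints. Sizes are assumed.
import torch
import torch.nn as nn

class EgoStateMLP(nn.Module):
    def __init__(self, history_steps=4, future_steps=6, state_dim=3, hidden=512):
        # state_dim: (x, y) offset plus scalar speed per past step (assumed layout)
        super().__init__()
        self.future_steps = future_steps
        self.net = nn.Sequential(
            nn.Linear(history_steps * state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, future_steps * 2),  # (x, y) per future waypoint
        )

    def forward(self, past_states):
        # past_states: (batch, history_steps, state_dim) -> (batch, future_steps, 2)
        flat = past_states.flatten(start_dim=1)
        return self.net(flat).view(-1, self.future_steps, 2)
```

Trained with a plain L2 loss against ground-truth future waypoints, a model of this form never touches camera or LiDAR input, which is the property the argument rests on.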
Load-bearing premise
The MLP baseline is trained and evaluated under exactly the same data splits, preprocessing, and conditions as the perception-based competitors, so that its L2 numbers do not stem from an unmatched advantage in how trajectory history is modeled.
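Part of this premise is directly checkable from evaluation artifacts. The sketch below, assuming each method exports the list of nuScenes sample tokens it was evaluated on (the JSON layout here is hypothetical), verifies that all methods score the same samples.

```python
# Hypothetical split-identity check: confirm every compared method was
# evaluated on the same set of nuScenes sample tokens. The "sample_tokens"
# key and per-method JSON files are assumptions for illustration.
import json

def same_eval_set(paths):
    """paths: per-method JSON files listing evaluated nuScenes sample tokens."""
    token_sets = []
    for p in paths:
        with open(p) as f:
            token_sets.append(set(json.load(f)["sample_tokens"]))
    return bool(token_sets) and all(s == token_sets[0] for s in token_sets)
```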
What would settle it
Re-train the MLP and all compared perception methods from scratch on identical data splits and input formats, then recompute both L2 error and collision rate to check whether the simple model still matches or exceeds the others.
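Recomputing the two metrics from saved trajectories is the mechanical part of this check. The sketch below is a simplified, illustrative version, not the benchmark's reference implementation: the 2 Hz waypoint sampling, the 1/2/3-second horizons, and the waypoint-in-occupied-cell collision test are assumptions (published evaluations typically test the full ego bounding box against BEV occupancy).

```python
# Simplified open-loop metrics: average L2 error at fixed horizons and a
# waypoint-level collision check against BEV occupancy grids.
import numpy as np

def average_l2(pred, gt, horizons_s=(1.0, 2.0, 3.0), hz=2.0):
    """pred, gt: (N, T, 2) future (x, y) waypoints in metres, sampled at `hz`."""
    step_err = np.linalg.norm(pred - gt, axis=-1)                  # (N, T) per-step L2
    per_h = {h: step_err[:, int(h * hz) - 1].mean() for h in horizons_s}
    return per_h, float(np.mean(list(per_h.values())))

def collision_rate(pred, occ, origin, resolution):
    """pred: (N, T, 2) waypoints; occ: (N, T, H, W) boolean occupancy of other
    agents; origin: (x0, y0) of cell (0, 0); resolution: metres per cell."""
    ij = np.floor((pred - np.asarray(origin)) / resolution).astype(int)  # (N, T, 2)
    n, t = pred.shape[:2]
    hits = np.zeros(n, dtype=bool)
    for s in range(n):
        for k in range(t):
            i, j = ij[s, k]
            if 0 <= i < occ.shape[2] and 0 <= j < occ.shape[3]:
                hits[s] |= bool(occ[s, k, i, j])
    return float(hits.mean())
```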
read the original abstract
Modern autonomous driving systems are typically divided into three main tasks: perception, prediction, and planning. The planning task involves predicting the trajectory of the ego vehicle based on inputs from both internal intention and the external environment, and manipulating the vehicle accordingly. Most existing works evaluate their performance on the nuScenes dataset using the L2 error and collision rate between the predicted trajectories and the ground truth. In this paper, we reevaluate these existing evaluation metrics and explore whether they accurately measure the superiority of different methods. Specifically, we design an MLP-based method that takes raw sensor data (e.g., past trajectory, velocity, etc.) as input and directly outputs the future trajectory of the ego vehicle, without using any perception or prediction information such as camera images or LiDAR. Our simple method achieves similar end-to-end planning performance on the nuScenes dataset with other perception-based methods, reducing the average L2 error by about 20%. Meanwhile, the perception-based methods have an advantage in terms of collision rate. We further conduct in-depth analysis and provide new insights into the factors that are critical for the success of the planning task on nuScenes dataset. Our observation also indicates that we need to rethink the current open-loop evaluation scheme of end-to-end autonomous driving in nuScenes. Codes are available at https://github.com/E2E-AD/AD-MLP.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that a simple MLP baseline using only past trajectory and velocity (no perception or prediction modules) achieves similar or superior end-to-end planning performance on nuScenes, reducing average L2 error by ~20% relative to published perception-based methods, while perception-based approaches show an advantage on collision rate. The authors conclude that current open-loop metrics may not properly credit perception and that the evaluation scheme on nuScenes should be rethought, supported by additional analysis of critical factors for planning success and public code release.
Significance. If the empirical comparisons hold under matched conditions, the work is significant for exposing that strong trajectory-history baselines can match or exceed perception-based L2 numbers on nuScenes, thereby questioning whether L2 error alone is a sufficient proxy for planning quality. The public implementation and direct reproduction of the MLP strengthen reproducibility and provide a useful reference point for the community.
major comments (1)
- [Section 3, Table 1] Section 3 and Table 1: the headline claim that the MLP reduces L2 error by ~20% without perception requires that the cited perception baselines were evaluated with identical trajectory-history length, velocity features, data splits, and loss functions. The manuscript does not tabulate these details for the baselines, leaving open the possibility that the reported L2 gap reflects differences in history modeling or auxiliary objectives rather than the absence of perception.
minor comments (2)
- [Abstract] The abstract states a 20% L2 reduction but does not list the exact per-method numbers or the set of compared methods; adding a short table or explicit references in the abstract would improve clarity.
- [Figures] Figure captions and axis labels in the analysis sections could more explicitly state the history length used by the MLP to aid readers in reproducing the conditions.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comment. The point about ensuring matched experimental conditions is well taken, and we address it directly below. We will incorporate the requested clarification in the revised manuscript.
read point-by-point responses
-
Referee: Section 3, Table 1: the headline claim that the MLP reduces L2 error by ~20% without perception requires that the cited perception baselines were evaluated with identical trajectory-history length, velocity features, data splits, and loss functions. The manuscript does not tabulate these details for the baselines, leaving open the possibility that the reported L2 gap reflects differences in history modeling or auxiliary objectives rather than the absence of perception.
Authors: We agree that explicit documentation of these settings is required to substantiate the comparison. In the experiments, we reproduced each baseline using the official nuScenes train/val/test splits, the same 2-second history length for past trajectory and velocity inputs, and the primary L2 loss as reported in the respective original papers. Velocity features were incorporated consistently with the baselines that utilized them. To remove any ambiguity, we will add a new table (or expanded caption) in Section 3 that tabulates history length, input features, data splits, and loss functions for every method in Table 1. This will make clear that the L2 differences arise under matched conditions rather than from mismatched history modeling or objectives.
Revision: yes
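As a concrete illustration of the promised table, the sketch below records per-method evaluation settings as a small structure. The field names are assumptions, and the example row uses only the settings the rebuttal states for the MLP baseline (2-second history, official splits, L2 loss).

```python
# Sketch of the per-method settings table the rebuttal promises for Section 3.
# Field names are illustrative; only the example values come from the rebuttal.
from dataclasses import dataclass

@dataclass
class EvalConditions:
    method: str
    history_seconds: float   # length of past-trajectory / velocity history
    input_features: tuple    # e.g. ("past_trajectory", "velocity")
    data_split: str          # nuScenes split used for evaluation
    loss: str                # primary training objective

mlp_row = EvalConditions(
    method="AD-MLP",
    history_seconds=2.0,
    input_features=("past_trajectory", "velocity"),
    data_split="official nuScenes train/val",
    loss="L2",
)
```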
Circularity Check
No circularity: empirical baseline comparison on public benchmark
full rationale
The paper conducts an empirical reevaluation by training a simple MLP on raw trajectory/velocity inputs and directly measuring L2 error and collision rate against published perception-based methods on the nuScenes dataset. No equations, fitted parameters renamed as predictions, or self-citation chains are used to derive the central claim; the ~20% L2 reduction is presented as an observed experimental outcome. The derivation chain is self-contained against external benchmarks and does not reduce to its own inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- MLP hidden-layer sizes and learning rate
axioms (1)
- domain assumption: nuScenes dataset splits and evaluation protocol remain unchanged and comparable across methods
Lean theorems connected to this paper
-
IndisputableMonolith.Cost.FunctionalEquation · washburn_uniqueness_aczel (unclear)
Unclear relation between the paper passage and the cited Recognition theorem.
Our simple method achieves similar end-to-end planning performance on the nuScenes dataset with other perception-based methods, reducing the average L2 error by about 20%.
-
IndisputableMonolith.Foundation.DAlembert.Inevitability · bilinear_family_forced (unclear)
Unclear relation between the paper passage and the cited Recognition theorem.
we design an MLP-based method that takes raw sensor data (e.g., past trajectory, velocity, etc.) as input and directly outputs the future trajectory of the ego vehicle, without using any perception or prediction information such as camera images or LiDAR.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 19 Pith papers
-
VECTOR-Drive: Tightly Coupled Vision-Language and Trajectory Expert Routing for End-to-End Autonomous Driving
VECTOR-DRIVE couples vision-language reasoning and trajectory planning in one Transformer via semantic expert routing and flow-matching, reaching 88.91 driving score on Bench2Drive.
-
Unveiling the Surprising Efficacy of Navigation Understanding in End-to-End Autonomous Driving
The SNG framework and SNG-VLA model enable end-to-end driving systems to better incorporate global navigation for state-of-the-art route following without auxiliary perception losses.
-
ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving
ReCogDrive unifies VLM scene understanding with a diffusion planner reinforced by DiffGRPO to reach state-of-the-art results on NAVSIM and Bench2Drive benchmarks.
-
VLADriver-RAG: Retrieval-Augmented Vision-Language-Action Models for Autonomous Driving
VLADriver-RAG reaches a new state-of-the-art Driving Score of 89.12 on Bench2Drive by retrieving structure-aware historical knowledge through spatiotemporal semantic graphs and Graph-DTW alignment.
-
HERMES++: Toward a Unified Driving World Model for 3D Scene Understanding and Generation
HERMES++ unifies 3D scene understanding and future geometry prediction in driving scenes via BEV representations, LLM-enhanced queries, a temporal link, and joint geometric optimization.
-
ProDrive: Proactive Planning for Autonomous Driving via Ego-Environment Co-Evolution
ProDrive couples a query-centric planner with a BEV world model for end-to-end ego-environment co-evolution, enabling future-outcome assessment that improves safety and efficiency over reactive baselines on NAVSIM v1.
-
BridgeSim: Unveiling the OL-CL Gap in End-to-End Autonomous Driving
The primary OL-CL gap in end-to-end autonomous driving arises from objective mismatch creating structural inability to model reactive behaviors, which a test-time adaptation method can mitigate.
-
Orion-Lite: Distilling LLM Reasoning into Efficient Vision-Only Driving Models
Orion-Lite uses latent feature distillation and trajectory supervision to create a vision-only model that surpasses its LLM-based teacher on closed-loop Bench2Drive evaluation, achieving a new SOTA driving score of 80.6.
-
AlignDrive: Aligned Lateral-Longitudinal Planning for End-to-End Autonomous Driving
A cascaded end-to-end driving model conditions longitudinal planning on the lateral path via anchor-based regression and path-conditioned 1D displacement prediction, achieving SOTA driving score of 89.07 and 73.18% su...
-
AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning
AutoVLA unifies semantic reasoning and trajectory planning in one autoregressive VLA model for end-to-end autonomous driving by tokenizing trajectories into discrete actions and using GRPO reinforcement fine-tuning to...
-
ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation
ORION reports 77.74 Driving Score and 54.62% Success Rate on Bench2Drive, outperforming prior end-to-end methods by 14.28 DS and 19.61% SR through unified VQA and planning optimization.
-
EMMA: End-to-End Multimodal Model for Autonomous Driving
EMMA is an end-to-end multimodal LLM that converts camera data into trajectories, objects, and road graphs via text prompts and reports state-of-the-art motion planning on nuScenes plus competitive detection results on Waymo.
-
DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models
DriveVLM adds vision-language models with scene description, analysis, and hierarchical planning modules to autonomous driving, paired with a hybrid DriveVLM-Dual system tested on nuScenes and SUP-AD datasets and depl...
-
Can Attribution Predict Risk? From Multi-View Attribution to Planning Risk Signals in End-to-End Autonomous Driving
Attribution statistics derived from multi-view inputs in end-to-end planners can predict planning risks, with reported Spearman correlation of 0.30 with trajectory error and AUROC of 0.77 for collision detection.
-
VLADriver-RAG: Retrieval-Augmented Vision-Language-Action Models for Autonomous Driving
VLADriver-RAG achieves state-of-the-art performance on Bench2Drive by grounding VLA planning in structure-aware retrieved priors via spatiotemporal semantic graphs and Graph-DTW alignment.
-
OVPD: A Virtual-Physical Fusion Testing Dataset of OnSite Autonomous Driving Challenge
OVPD is a new virtual-physical fusion dataset with 20 testing clips totaling nearly 3 hours of multi-modal autonomous driving data for closed-loop evaluation.
-
Artificial Intelligence for Modeling and Simulation of Mixed Automated and Human Traffic
This survey synthesizes AI techniques for mixed autonomy traffic simulation and introduces a taxonomy spanning agent-level behavior models, environment-level methods, and cognitive/physics-informed approaches.
-
DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving
DeepSight uses parallel latent feature prediction in BEV for long-horizon world modeling and adaptive text reasoning to reach state-of-the-art closed-loop performance on the Bench2drive benchmark.
-
Conditional Flow-VAE for Safety-Critical Traffic Scenario Generation
A conditional flow matching model generates realistic safety-critical traffic scenarios by turning nominal scenes into dangerous rollouts using combined simulation and real data.
Reference graph
Works this paper leans on
-
[1]
nuScenes: A multimodal dataset for autonomous driving
Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. In CVPR, 2020.
work page 2020
-
[2]
What data do we need for training an AV motion planner?
Long Chen, Lukas Platinsky, Stefanie Speichert, Blazej Osinski, Oliver Scheel, Yawei Ye, Hugo Grimmett, Luca Del Pero, and Peter Ondruska. What data do we need for training an AV motion planner? In ICRA, 2021.
work page 2021
-
[3]
Transfuser: Imitation with transformer-based sensor fusion for autonomous driving
Kashyap Chitta, Aditya Prakash, Bernhard Jaeger, Zehao Yu, Katrin Renz, and Andreas Geiger. Transfuser: Imitation with transformer-based sensor fusion for autonomous driving. IEEE TPAMI, 2022.
work page 2022
-
[4]
Densetnt: End-to-end trajectory prediction from dense goal sets
Junru Gu, Chen Sun, and Hang Zhao. Densetnt: End-to-end trajectory prediction from dense goal sets. In ICCV, 2021.
work page 2021
-
[5]
FIERY: Future instance prediction in bird's-eye view from surround monocular cameras
Anthony Hu, Zak Murez, Nikhil Mohan, Sofía Dudas, Jeffrey Hawke, Vijay Badrinarayanan, Roberto Cipolla, and Alex Kendall. FIERY: Future instance prediction in bird's-eye view from surround monocular cameras. In ICCV, 2021.
-
[6]
Safe local motion planning with self-supervised freespace forecasting
Peiyun Hu, Aaron Huang, John Dolan, David Held, and Deva Ramanan. Safe local motion planning with self-supervised freespace forecasting. In CVPR, 2021.
work page 2021
-
[7]
St-p3: End-to-end vision-based autonomous driving via spatial-temporal feature learning
Shengchao Hu, Li Chen, Penghao Wu, Hongyang Li, Junchi Yan, and Dacheng Tao. St-p3: End-to-end vision-based autonomous driving via spatial-temporal feature learning. In ECCV, 2022.
work page 2022
-
[8]
Planning-oriented autonomous driving
Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, Lewei Lu, Xiaosong Jia, Qiang Liu, Jifeng Dai, Yu Qiao, and Hongyang Li. Planning-oriented autonomous driving. In CVPR, 2023.
work page 2023
-
[9]
Vad: Vectorized scene representation for efficient autonomous driving
Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. Vad: Vectorized scene representation for efficient autonomous driving. arXiv preprint arXiv:2303.12077, 2023.
-
[10]
Differentiable raycasting for self-supervised occupancy forecasting
Tarasha Khurana, Peiyun Hu, Achal Dave, Jason Ziglar, David Held, and Deva Ramanan. Differentiable raycasting for self-supervised occupancy forecasting. In ECCV, 2022.
work page 2022
-
[11]
Bevformer: Learning bird's-eye-view representation from multi-camera images via spatiotemporal transformers
Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Yu Qiao, and Jifeng Dai. Bevformer: Learning bird's-eye-view representation from multi-camera images via spatiotemporal transformers. In ECCV, 2022.
work page 2022
-
[12]
Pnpnet: End-to-end perception and prediction with tracking in the loop
Ming Liang, Bin Yang, Wenyuan Zeng, Yun Chen, Rui Hu, Sergio Casas, and Raquel Urtasun. Pnpnet: End-to-end perception and prediction with tracking in the loop. In CVPR, 2020.
-
[13]
Sgdr: Stochastic gradient descent with warm restarts
Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. In ICLR, 2017.
work page 2017
-
[14]
Decoupled weight decay regularization
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.
work page 2019
-
[15]
Fast and furious: Real time end-to-end 3d detection, tracking and motion forecasting with a single convolutional net
Wenjie Luo, Bin Yang, and Raquel Urtasun. Fast and furious: Real time end-to-end 3d detection, tracking and motion forecasting with a single convolutional net. In CVPR, 2018.
work page 2018
-
[16]
Perceive, predict, and plan: Safe motion planning through interpretable semantic representations
Abbas Sadat, Sergio Casas, Mengye Ren, Xinyu Wu, Pranaab Dhawan, and Raquel Urtasun. Perceive, predict, and plan: Safe motion planning through interpretable semantic representations. In ECCV, 2020.
work page 2020
-
[17]
Cape: Camera view position embedding for multi-view 3d object detection
Kaixin Xiong, Shi Gong, Xiaoqing Ye, Xiao Tan, Ji Wan, Errui Ding, Jingdong Wang, and Xiang Bai. Cape: Camera view position embedding for multi-view 3d object detection. In CVPR, 2023.
work page 2023
-
[18]
End-to-end interpretable neural motion planner
Wenyuan Zeng, Wenjie Luo, Simon Suo, Abbas Sadat, Bin Yang, Sergio Casas, and Raquel Urtasun. End-to-end interpretable neural motion planner. In CVPR, 2019.
work page 2019
discussion (0)