NuPlan: A closed-loop ML-based planning benchmark for autonomous vehicles
Pith reviewed 2026-05-15 08:57 UTC · model grok-4.3
The pith
NuPlan establishes the first closed-loop benchmark for machine learning planners in autonomous driving.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose the world's first closed-loop ML-based planning benchmark for autonomous driving. The benchmark includes a large-scale driving dataset with 1500h of human driving data from 4 cities across the US and Asia, a closed-loop simulation framework with reactive agents, and a large set of both general and scenario-specific planning metrics.
What carries the argument
The closed-loop simulator with reactive agents that interact dynamically with the planner being tested, shifting evaluation from static short-term forecasts to interactive long-term planning outcomes.
If this is right
- Planners will be assessed in interactive settings where other agents respond to their actions rather than following fixed trajectories.
- Evaluation will shift from L2-based short-term prediction scores to metrics tailored for long-term planning success and failure modes.
- The multi-city dataset will allow testing of how well planners generalize across different traffic patterns and regions.
- Organized benchmark challenges can standardize comparisons and accelerate development of better ML planning models.
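The difference between the two evaluation styles in the list above can be sketched in a few lines. This is a toy 1-D illustration, not the NuPlan simulator or its API: `average_l2`, `braking_planner`, `min_gap_in_rollout`, and the follower's reactivity rule are all hypothetical names and assumptions introduced here.

```python
import math

def average_l2(planned, logged):
    """Open-loop score: mean distance between planned and logged (x, y) waypoints."""
    return sum(math.dist(p, q) for p, q in zip(planned, logged)) / len(planned)

def braking_planner(step):
    """Hypothetical ego planner: cruise at 10 m/s, then slow to 2 m/s from step 4."""
    return 10.0 if step < 4 else 2.0

def min_gap_in_rollout(planner, reactive, steps=20, dt=0.5):
    """Closed-loop rollout on a 1-D lane; returns the smallest ego-follower gap in metres."""
    ego_x, follower_x, follower_speed = 20.0, 0.0, 10.0
    min_gap = ego_x - follower_x
    for step in range(steps):
        # The ego vehicle moves according to the planner under test.
        ego_x += planner(step) * dt
        if reactive:
            # Toy reactivity rule (an assumption): slow down as the gap closes.
            follower_speed = min(10.0, max(0.0, (ego_x - follower_x) - 5.0))
        # A non-reactive follower keeps its initial (logged) speed.
        follower_x += follower_speed * dt
        min_gap = min(min_gap, ego_x - follower_x)
    return min_gap

# Open-loop: only the deviation from the logged trajectory is measured.
l2 = average_l2([(0.0, 0.0), (1.0, 0.0)], [(0.0, 0.0), (5.0, 0.0)])

# Closed-loop: the same braking plan is judged by what happens around it.
gap_replay = min_gap_in_rollout(braking_planner, reactive=False)
gap_reactive = min_gap_in_rollout(braking_planner, reactive=True)
```

In this toy setup the follower that replays its logged speed ends with a negative gap (it runs into the braking ego vehicle), while the reactive follower keeps a positive gap. The L2 score cannot distinguish the two cases, since it only compares the plan against the log.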
Where Pith is reading between the lines
- This benchmark could reveal that many current ML planners perform worse under interactive conditions than open-loop tests suggest.
- Researchers might extend the reactive agent behaviors using patterns from the collected driving data to increase simulation realism.
- The framework may support hybrid evaluations that combine simulation results with limited real-vehicle validation to improve correlation.
Load-bearing premise
The chosen metrics and reactive-agent simulator will produce planner rankings that correlate with real-world safety and performance once deployed on physical vehicles.
What would settle it
Deploying several benchmark-ranked planners on physical vehicles in matching scenarios and checking whether their real-world safety records and performance match the simulated rankings.
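One way such a check could be quantified, assuming each planner ends up with a single simulated score and a single real-world score, is a rank correlation. The sketch below uses Spearman's rho in its no-ties form; the function names and the planner scores are placeholders invented for illustration, not results from the paper.

```python
def ranks(scores):
    """Map each planner name to its rank (1 = best), assuming no tied scores."""
    order = sorted(scores, key=scores.get, reverse=True)
    return {name: position + 1 for position, name in enumerate(order)}

def spearman_rho(sim_scores, real_scores):
    """Spearman rank correlation between two score dicts with the same keys (n >= 2, no ties)."""
    n = len(sim_scores)
    sim_rank, real_rank = ranks(sim_scores), ranks(real_scores)
    squared_diffs = sum((sim_rank[k] - real_rank[k]) ** 2 for k in sim_scores)
    return 1 - 6 * squared_diffs / (n * (n * n - 1))

# Placeholder scores (made up for this example, higher is better).
simulated = {"planner_a": 0.9, "planner_b": 0.7, "planner_c": 0.5}
real_world = {"planner_a": 0.8, "planner_b": 0.6, "planner_c": 0.4}
rho = spearman_rho(simulated, real_world)
```

A value near 1 would mean the simulated ranking matches the real-world ranking; a value near 0 or below would undercut the load-bearing premise.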
Original abstract
In this work, we propose the world's first closed-loop ML-based planning benchmark for autonomous driving. While there is a growing body of ML-based motion planners, the lack of established datasets and metrics has limited the progress in this area. Existing benchmarks for autonomous vehicle motion prediction have focused on short-term motion forecasting, rather than long-term planning. This has led previous works to use open-loop evaluation with L2-based metrics, which are not suitable for fairly evaluating long-term planning. Our benchmark overcomes these limitations by introducing a large-scale driving dataset, lightweight closed-loop simulator, and motion-planning-specific metrics. We provide a high-quality dataset with 1500h of human driving data from 4 cities across the US and Asia with widely varying traffic patterns (Boston, Pittsburgh, Las Vegas and Singapore). We will provide a closed-loop simulation framework with reactive agents and provide a large set of both general and scenario-specific planning metrics. We plan to release the dataset at NeurIPS 2021 and organize benchmark challenges starting in early 2022.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes NuPlan as the world's first closed-loop ML-based planning benchmark for autonomous driving. It presents a 1500-hour multi-city dataset of human driving data, a lightweight closed-loop simulator with reactive agents, and a collection of general and scenario-specific planning metrics intended to address the shortcomings of open-loop L2 evaluation for long-term motion planning.
Significance. If the simulator and metrics can be shown to produce planner rankings that correlate with real-world safety and efficiency, the benchmark would fill an important gap by enabling standardized, realistic evaluation of ML-based planners beyond short-term forecasting tasks. The scale and geographic diversity of the dataset represent a clear strength.
major comments (2)
- [Abstract] The manuscript contains no closed-loop experiments, ablations of agent reactivity, or comparisons against open-loop L2 baselines and real-vehicle logs. Without such evidence, the claim that the proposed metrics and reactive simulator will produce rankings predictive of real-world performance remains untested and central to the benchmark's value.
- [Metrics and Simulator] The general and scenario-specific metrics are described at a high level but lack explicit definitions, formulas, or pseudocode. This prevents assessment of whether they avoid the known pitfalls of open-loop evaluation and whether the simulator rules are sufficiently specified for reproducibility.
minor comments (1)
- [Abstract] Phrases such as 'we will provide' and 'we plan to release' indicate this is a benchmark proposal paper; the current status of the simulator implementation and metric computation code should be stated explicitly.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the scale and geographic diversity of the dataset. We agree that additional evidence and detail would strengthen the manuscript and will revise accordingly. We address each major comment below.
Point-by-point responses
-
Referee: [Abstract] The manuscript contains no closed-loop experiments, ablations of agent reactivity, or comparisons against open-loop L2 baselines and real-vehicle logs. Without such evidence, the claim that the proposed metrics and reactive simulator will produce rankings predictive of real-world performance remains untested and central to the benchmark's value.
Authors: We acknowledge that the current manuscript is primarily a benchmark definition paper and does not contain closed-loop planner evaluations. In the revision we will add a dedicated experiments section that runs several baseline planners (rule-based and learned) in closed-loop simulation. This will include ablations on agent reactivity levels and side-by-side comparison of closed-loop metric rankings versus open-loop L2 error on the same scenarios from the dataset. These additions will provide concrete evidence of how the benchmark behaves differently from open-loop evaluation. A full statistical correlation with real-world safety outcomes is not possible within this work, as it would require proprietary fleet testing data and deployments beyond the benchmark release. revision: yes
-
Referee: [Metrics and Simulator] The general and scenario-specific metrics are described at a high level but lack explicit definitions, formulas, or pseudocode. This prevents assessment of whether they avoid the known pitfalls of open-loop evaluation and whether the simulator rules are sufficiently specified for reproducibility.
Authors: We agree that the current level of detail is insufficient for reproducibility. The revised manuscript will expand both sections with explicit mathematical definitions and formulas for every metric (e.g., collision rate, route progress, comfort, and scenario-specific scores), together with pseudocode for the closed-loop simulation loop, agent state updates, and reactivity model. This will make clear how the metrics penalize unrealistic behaviors that open-loop L2 evaluation overlooks and will allow independent re-implementation of the simulator. revision: yes
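To make the kind of definitions the referee asks for concrete, the following is a minimal sketch of three of the metric families named in the response (collision rate, route progress, comfort). These formulas are assumptions chosen for illustration; the source does not give the benchmark's actual definitions, and the function names are hypothetical.

```python
def collision_rate(collided_flags):
    """Fraction of simulated scenarios in which the ego vehicle collided."""
    return sum(collided_flags) / len(collided_flags)

def route_progress(distance_travelled, route_length):
    """Share of the route completed, clamped to the range [0, 1]."""
    return min(1.0, max(0.0, distance_travelled / route_length))

def max_abs_jerk(speeds, dt):
    """Comfort proxy: largest absolute jerk from speeds sampled every dt seconds."""
    # First finite difference of speed gives acceleration.
    accels = [(b - a) / dt for a, b in zip(speeds, speeds[1:])]
    # Second finite difference gives jerk.
    jerks = [(b - a) / dt for a, b in zip(accels, accels[1:])]
    return max((abs(j) for j in jerks), default=0.0)

# Example inputs (illustrative values only).
rate = collision_rate([True, False, False, False])
progress = route_progress(50.0, 200.0)
jerk = max_abs_jerk([0.0, 0.0, 1.0, 1.0], dt=1.0)
```

Unlike an L2 error against the logged trajectory, each of these scores depends only on what the planner did in the rollout, so a plan that differs from the human driver but is safe and smooth is not penalized.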
Circularity Check
No circularity: benchmark proposal with independent definitions
Full rationale
The paper introduces a new dataset (1500h from 4 cities), lightweight closed-loop simulator with reactive agents, and planning-specific metrics without any claimed derivations, equations, parameter fittings, or predictions. No load-bearing step reduces by construction to prior inputs or self-citations; the central claim is the proposal of these independent components. This matches the default expectation for non-circular benchmark papers.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The collected driving data and reactive simulator produce rankings that generalize to real-world deployment safety.
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.HierarchyEmergence hierarchy_emergence_forces_phi (unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Existing benchmarks... use open-loop evaluation with L2-based metrics, which are not suitable for fairly evaluating long-term planning
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 21 Pith papers
-
MDrive: Benchmarking Closed-Loop Cooperative Driving for End-to-End Multi-agent Systems
MDrive benchmark shows multi-agent cooperative driving systems generally outperform single-agent ones in closed-loop settings but perception sharing does not always improve planning and negotiation can harm performanc...
-
ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving
ReflectDrive-2 achieves 91.0 PDMS on NAVSIM with camera input by training a discrete diffusion model to self-edit trajectories via RL-aligned AutoEdit.
-
A global dataset of continuous urban dashcam driving
CROWD is a new global dataset of 51,753 continuous urban dashcam segments spanning over 20,000 hours from 238 countries, with manual labels and automated object detections for routine driving analysis.
-
C-TRAIL: A Commonsense World Framework for Trajectory Planning in Autonomous Driving
C-TRAIL combines LLM commonsense with a dual-trust mechanism and Dirichlet-weighted Monte Carlo Tree Search to improve trajectory planning accuracy and safety in autonomous driving.
-
LongTail Driving Scenarios with Reasoning Traces: The KITScenes LongTail Dataset
KITScenes LongTail supplies multimodal driving data and multilingual expert reasoning traces to benchmark models on rare scenarios beyond basic safety metrics.
-
ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving
ReCogDrive unifies VLM scene understanding with a diffusion planner reinforced by DiffGRPO to reach state-of-the-art results on NAVSIM and Bench2Drive benchmarks.
-
CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and t...
-
CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
CoWorld-VLA encodes world information into four expert tokens that condition a diffusion-based planner, yielding competitive collision avoidance and trajectory accuracy on the NAVSIM benchmark.
-
Temporal Sampling Frequency Matters: A Capacity-Aware Study of End-to-End Driving Trajectory Prediction
Smaller end-to-end autonomous driving models achieve optimal 3-second trajectory prediction accuracy at lower or intermediate temporal sampling frequencies, whereas larger VLA-style models perform best at the highest ...
-
DriveFuture: Future-Aware Latent World Models for Autonomous Driving
DriveFuture achieves SOTA results on NAVSIM by conditioning latent world model states on future predictions to directly inform trajectory planning.
-
SceneFactory: GPU-Accelerated Multi-Agent Driving Simulation with Physics-Based Vehicle Dynamics
SceneFactory delivers a batched GPU platform for physics-based multi-agent autonomous driving simulation that achieves 127x higher throughput than non-vectorized PhysX while supporting articulated dynamics and road-co...
-
Response Time Enhances Alignment with Heterogeneous Preferences
Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.
-
ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving
ReflectDrive-2 combines masked discrete diffusion with RL-aligned self-editing to generate and refine driving trajectories, reaching 91.0 PDMS on NAVSIM camera-only and 94.8 in best-of-6.
-
ProDrive: Proactive Planning for Autonomous Driving via Ego-Environment Co-Evolution
ProDrive couples a query-centric planner with a BEV world model for end-to-end ego-environment co-evolution, enabling future-outcome assessment that improves safety and efficiency over reactive baselines on NAVSIM v1.
-
OneDrive: Unified Multi-Paradigm Driving with Vision-Language-Action Models
OneDrive unifies heterogeneous decoding in a single VLM transformer decoder for end-to-end driving, achieving 0.28 L2 error and 0.18 collision rate on nuScenes plus 86.8 PDMS on NAVSIM.
-
Mosaic: An Extensible Framework for Composing Rule-Based and Learned Motion Planners
Mosaic integrates rule-based and learned planners via arbitration graphs to set new state-of-the-art scores on nuPlan and interPlan benchmarks while cutting at-fault collisions by 30%.
-
BridgeSim: Unveiling the OL-CL Gap in End-to-End Autonomous Driving
The primary OL-CL gap in end-to-end autonomous driving arises from objective mismatch creating structural inability to model reactive behaviors, which a test-time adaptation method can mitigate.
-
Evaluation as Evolution: Transforming Adversarial Diffusion into Closed-Loop Curricula for Autonomous Vehicles
E² uses transport-regularized sparse control on learned reverse-time SDEs with topology-driven selection and Topological Anchoring to generate realistic adversarial scenarios, improving collision discovery by 9.01% on...
-
Hydra-MDP: End-to-end Multimodal Planning with Multi-target Hydra-Distillation
Hydra-MDP uses multi-teacher distillation and a multi-head decoder to learn diverse, metric-specific trajectories in an end-to-end autonomous-driving planner, winning the Navsim challenge.
-
Causality-Aware End-to-End Autonomous Driving via Ego-Centric Joint Scene Modeling
CaAD adds ego-centric joint-causal modeling and causality-aware policy alignment to end-to-end driving, reporting Driving Score 87.53 and Success Rate 71.81 on Bench2Drive plus PDMS 91.1 on NAVSIM.
-
Artificial Intelligence for Modeling and Simulation of Mixed Automated and Human Traffic
This survey synthesizes AI techniques for mixed autonomy traffic simulation and introduces a taxonomy spanning agent-level behavior models, environment-level methods, and cognitive/physics-informed approaches.