arXiv: 2106.11810 · v4 · submitted 2021-06-22 · cs.CV

NuPlan: A closed-loop ML-based planning benchmark for autonomous vehicles

Holger Caesar, Juraj Kabzan, Kok Seang Tan, Whye Kit Fong, Eric Wolff, Alex Lang, Luke Fletcher, Oscar Beijbom, Sammy Omari


Pith reviewed 2026-05-15 08:57 UTC · model grok-4.3

keywords: autonomous driving, motion planning, closed-loop evaluation, benchmark, driving dataset, reactive agents, machine learning

The pith

NuPlan establishes the first closed-loop benchmark for machine learning planners in autonomous driving.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that existing open-loop evaluation methods using short-term L2 metrics cannot properly assess long-term planning performance in autonomous vehicles. It introduces NuPlan to address this gap through a large dataset of 1500 hours of real human driving from four cities, a lightweight closed-loop simulator with reactive agents, and planning-specific metrics. A sympathetic reader would care because this setup enables fairer testing of how planners handle dynamic interactions over time, which is essential for advancing safer autonomous systems.

Core claim

We propose the world's first closed-loop ML-based planning benchmark for autonomous driving. The benchmark includes a large-scale driving dataset with 1500h of human driving data from 4 cities across the US and Asia, a closed-loop simulation framework with reactive agents, and a large set of both general and scenario-specific planning metrics.

What carries the argument

The closed-loop simulator with reactive agents that interact dynamically with the planner being tested, shifting evaluation from static short-term forecasts to interactive long-term planning outcomes.
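The difference between reactive agents and fixed (log-replay) trajectories can be illustrated with a toy one-lane rollout. This is a minimal sketch under stated assumptions, not the nuPlan simulator: the `Vehicle` class, the proportional follower policy, its gains, and the 0.1 s step are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Callable

DT = 0.1  # simulation step in seconds (assumed value)


@dataclass
class Vehicle:
    position: float  # metres along a single lane
    speed: float     # metres per second

    def step(self, accel: float) -> None:
        # Forward-Euler update; speed is clamped so vehicles never reverse.
        self.speed = max(0.0, self.speed + accel * DT)
        self.position += self.speed * DT


def follower_accel(follower: Vehicle, leader: Vehicle) -> float:
    """Hypothetical reactive policy: proportional gap and speed tracking."""
    desired_gap = 5.0 + 1.5 * follower.speed  # standstill gap plus time headway
    gap = leader.position - follower.position
    accel = 0.5 * (gap - desired_gap) + 1.0 * (leader.speed - follower.speed)
    return max(-6.0, min(2.0, accel))  # clamp to plausible accel limits


def rollout(planner: Callable[[Vehicle, float], float],
            reactive: bool, horizon_s: float = 10.0) -> float:
    """Run the ego planner in closed loop; return the minimum gap to the car behind."""
    ego = Vehicle(position=20.0, speed=10.0)
    follower = Vehicle(position=0.0, speed=10.0)
    min_gap = ego.position - follower.position
    for i in range(round(horizon_s / DT)):
        ego_accel = planner(ego, i * DT)
        # Log replay: the recorded follower simply kept its 10 m/s speed.
        agent_accel = follower_accel(follower, ego) if reactive else 0.0
        ego.step(ego_accel)
        follower.step(agent_accel)
        min_gap = min(min_gap, ego.position - follower.position)
    return min_gap


def braking_planner(ego: Vehicle, t: float) -> float:
    """Toy planner that deviates from the log by braking after one second."""
    return -3.0 if t >= 1.0 else 0.0


if __name__ == "__main__":
    print("log-replay min gap:", rollout(braking_planner, reactive=False))
    print("reactive min gap:  ", rollout(braking_planner, reactive=True))
```

In this toy setup the log-replay follower drives through the braking ego (negative minimum gap), while the reactive follower slows down and keeps a positive gap. The same planner therefore receives a different verdict depending on whether the surrounding agents respond to it.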

If this is right

Planners will be assessed in interactive settings where other agents respond to their actions rather than following fixed trajectories.
Evaluation will shift from L2-based short-term prediction scores to metrics tailored for long-term planning success and failure modes.
The multi-city dataset will allow testing of how well planners generalize across different traffic patterns and regions.
Organized benchmark challenges can standardize comparisons and accelerate development of better ML planning models.
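For reference, the L2-based open-loop scores mentioned above are typically computed as average and final displacement errors between a predicted trajectory and the logged expert trajectory. The sketch below is an illustrative assumption about how such a score is formed; the function name `l2_errors` and the waypoint format are hypothetical and not taken from the paper.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]  # (x, y) waypoint in metres


def l2_errors(predicted: Sequence[Point], expert: Sequence[Point]) -> Tuple[float, float]:
    """Return (average, final) L2 displacement between two equal-length trajectories."""
    if len(predicted) != len(expert) or not predicted:
        raise ValueError("trajectories must be non-empty and of equal length")
    dists = [math.dist(p, e) for p, e in zip(predicted, expert)]
    return sum(dists) / len(dists), dists[-1]


if __name__ == "__main__":
    expert = [(0.0, 0.0), (5.0, 0.0), (10.0, 0.0)]
    # A plan that stops early scores a large L2 error whether or not stopping was safe.
    stopped = [(0.0, 0.0), (2.0, 0.0), (2.0, 0.0)]
    print(l2_errors(stopped, expert))
```

The score only measures distance to one recorded trajectory, so it cannot distinguish a safe alternative manoeuvre from an unsafe one; that is the limitation the planning-specific metrics are meant to address.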

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This benchmark could reveal that many current ML planners perform worse under interactive conditions than open-loop tests suggest.
Researchers might extend the reactive agent behaviors using patterns from the collected driving data to increase simulation realism.
The framework may support hybrid evaluations that combine simulation results with limited real-vehicle validation to improve correlation.

Load-bearing premise

The chosen metrics and reactive-agent simulator will produce planner rankings that correlate with real-world safety and performance once deployed on physical vehicles.

What would settle it

Deploying several benchmark-ranked planners on physical vehicles in matching scenarios and checking whether their real-world safety records and performance match the simulated rankings.
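One way to quantify such a comparison is a rank correlation between simulated and real-world scores. The sketch below computes Kendall's tau-a in plain Python; the planner names and score values are made up for illustration only, and the choice of tau-a (rather than another correlation measure) is an assumption, not something the paper prescribes.

```python
from itertools import combinations
from typing import Dict


def kendall_tau(sim_scores: Dict[str, float], real_scores: Dict[str, float]) -> float:
    """Kendall's tau-a between two score tables keyed by planner name."""
    if set(sim_scores) != set(real_scores):
        raise ValueError("both tables must cover the same planners")
    planners = sorted(sim_scores)
    if len(planners) < 2:
        raise ValueError("need at least two planners to compare rankings")
    concordant = discordant = 0
    for a, b in combinations(planners, 2):
        # Positive product: both tables order the pair the same way.
        sign = (sim_scores[a] - sim_scores[b]) * (real_scores[a] - real_scores[b])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(planners) * (len(planners) - 1) // 2
    return (concordant - discordant) / n_pairs


if __name__ == "__main__":
    # Hypothetical scores, higher is better; not results from any real system.
    simulated = {"planner_a": 0.9, "planner_b": 0.7, "planner_c": 0.5}
    on_road = {"planner_a": 0.80, "planner_b": 0.85, "planner_c": 0.40}
    print(kendall_tau(simulated, on_road))
```

A value near 1 would indicate that the benchmark orders planners the way road testing does; a value near 0 or below would undercut the load-bearing premise.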

Original abstract

In this work, we propose the world's first closed-loop ML-based planning benchmark for autonomous driving. While there is a growing body of ML-based motion planners, the lack of established datasets and metrics has limited the progress in this area. Existing benchmarks for autonomous vehicle motion prediction have focused on short-term motion forecasting, rather than long-term planning. This has led previous works to use open-loop evaluation with L2-based metrics, which are not suitable for fairly evaluating long-term planning. Our benchmark overcomes these limitations by introducing a large-scale driving dataset, lightweight closed-loop simulator, and motion-planning-specific metrics. We provide a high-quality dataset with 1500h of human driving data from 4 cities across the US and Asia with widely varying traffic patterns (Boston, Pittsburgh, Las Vegas and Singapore). We will provide a closed-loop simulation framework with reactive agents and provide a large set of both general and scenario-specific planning metrics. We plan to release the dataset at NeurIPS 2021 and organize benchmark challenges starting in early 2022.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes NuPlan as the world's first closed-loop ML-based planning benchmark for autonomous driving. It presents a 1500-hour multi-city dataset of human driving data, a lightweight closed-loop simulator with reactive agents, and a collection of general and scenario-specific planning metrics intended to address the shortcomings of open-loop L2 evaluation for long-term motion planning.

Significance. If the simulator and metrics can be shown to produce planner rankings that correlate with real-world safety and efficiency, the benchmark would fill an important gap by enabling standardized, realistic evaluation of ML-based planners beyond short-term forecasting tasks. The scale and geographic diversity of the dataset represent a clear strength.

major comments (2)

[Abstract] The manuscript contains no closed-loop experiments, ablations of agent reactivity, or comparisons against open-loop L2 baselines and real-vehicle logs. Without such evidence, the claim that the proposed metrics and reactive simulator will produce rankings predictive of real-world performance remains untested and central to the benchmark's value.
[Metrics and Simulator] The general and scenario-specific metrics are described at a high level but lack explicit definitions, formulas, or pseudocode. This prevents assessment of whether they avoid the known pitfalls of open-loop evaluation and whether the simulator rules are sufficiently specified for reproducibility.

minor comments (1)

[Abstract] Phrases such as 'we will provide' and 'we plan to release' indicate this is a benchmark proposal paper; the current status of the simulator implementation and metric computation code should be stated explicitly.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the scale and geographic diversity of the dataset. We agree that additional evidence and detail would strengthen the manuscript and will revise accordingly. We address each major comment below.

Referee: [Abstract] The manuscript contains no closed-loop experiments, ablations of agent reactivity, or comparisons against open-loop L2 baselines and real-vehicle logs. Without such evidence, the claim that the proposed metrics and reactive simulator will produce rankings predictive of real-world performance remains untested and central to the benchmark's value.

Authors: We acknowledge that the current manuscript is primarily a benchmark definition paper and does not contain closed-loop planner evaluations. In the revision we will add a dedicated experiments section that runs several baseline planners (rule-based and learned) in closed-loop simulation. This will include ablations on agent reactivity levels and side-by-side comparison of closed-loop metric rankings versus open-loop L2 error on the same scenarios from the dataset. These additions will provide concrete evidence of how the benchmark behaves differently from open-loop evaluation. A full statistical correlation with real-world safety outcomes is not possible within this work, as it would require proprietary fleet testing data and deployments beyond the benchmark release. revision: yes
Referee: [Metrics and Simulator] The general and scenario-specific metrics are described at a high level but lack explicit definitions, formulas, or pseudocode. This prevents assessment of whether they avoid the known pitfalls of open-loop evaluation and whether the simulator rules are sufficiently specified for reproducibility.

Authors: We agree that the current level of detail is insufficient for reproducibility. The revised manuscript will expand both sections with explicit mathematical definitions and formulas for every metric (e.g., collision rate, route progress, comfort, and scenario-specific scores), together with pseudocode for the closed-loop simulation loop, agent state updates, and reactivity model. This will make clear how the metrics penalize unrealistic behaviors that open-loop L2 evaluation overlooks and will allow independent re-implementation of the simulator. revision: yes
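To make the request for explicit definitions concrete, the metric categories named in the response could be written down along the following lines. This is a hedged sketch only: the `ScenarioResult` fields, the clamping of progress to [0, 1], and the use of peak absolute jerk as a comfort proxy are all assumptions for illustration, not the benchmark's actual definitions.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ScenarioResult:
    """Hypothetical per-scenario record produced by a closed-loop run."""
    collided: bool
    progress_m: float            # distance the planner advanced along the route
    route_length_m: float        # distance covered by the logged expert
    accelerations: List[float]   # ego longitudinal acceleration samples, m/s^2


def collision_rate(results: Sequence[ScenarioResult]) -> float:
    """Fraction of scenarios in which the ego vehicle collided."""
    if not results:
        raise ValueError("need at least one scenario")
    return sum(r.collided for r in results) / len(results)


def route_progress(result: ScenarioResult) -> float:
    """Planner progress relative to the expert, clamped to [0, 1]."""
    return min(1.0, max(0.0, result.progress_m / result.route_length_m))


def max_abs_jerk(accelerations: Sequence[float], dt: float) -> float:
    """Comfort proxy: peak absolute jerk from finite differences of acceleration."""
    return max(
        (abs(b - a) / dt for a, b in zip(accelerations, accelerations[1:])),
        default=0.0,
    )


if __name__ == "__main__":
    runs = [
        ScenarioResult(True, 50.0, 100.0, [0.0, 1.0, 1.0]),
        ScenarioResult(False, 120.0, 100.0, [0.0, 0.0, 0.0]),
    ]
    print(collision_rate(runs), [route_progress(r) for r in runs])
    print(max_abs_jerk(runs[0].accelerations, dt=0.5))
```

Writing the metrics at this level of precision is what would let a reader check how each one treats behaviours that an open-loop L2 score ignores, such as collisions or abrupt acceleration changes.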

Circularity Check

0 steps flagged

No circularity: benchmark proposal with independent definitions

The paper introduces a new dataset (1500h from 4 cities), lightweight closed-loop simulator with reactive agents, and planning-specific metrics without any claimed derivations, equations, parameter fittings, or predictions. No load-bearing step reduces by construction to prior inputs or self-citations; the central claim is the proposal of these independent components. This matches the default expectation for non-circular benchmark papers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The benchmark rests on the domain assumption that 1500 hours across four cities sufficiently covers the distribution of real traffic interactions and that closed-loop simulation with reactive agents approximates physical vehicle dynamics well enough for ranking planners.

axioms (1)

Domain assumption: The collected driving data and reactive simulator produce rankings that generalize to real-world deployment safety.
Invoked when claiming the benchmark overcomes limitations of open-loop evaluation.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.HierarchyEmergence · hierarchy_emergence_forces_phi · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "Existing benchmarks... use open-loop evaluation with L2-based metrics, which are not suitable for fairly evaluating long-term planning"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MDrive: Benchmarking Closed-Loop Cooperative Driving for End-to-End Multi-agent Systems
cs.RO 2026-05 unverdicted novelty 7.0

MDrive benchmark shows multi-agent cooperative driving systems generally outperform single-agent ones in closed-loop settings but perception sharing does not always improve planning and negotiation can harm performanc...

ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving
cs.RO 2026-05 unverdicted novelty 7.0

ReflectDrive-2 achieves 91.0 PDMS on NAVSIM with camera input by training a discrete diffusion model to self-edit trajectories via RL-aligned AutoEdit.

A global dataset of continuous urban dashcam driving
cs.CV 2026-04 accept novelty 7.0

CROWD is a new global dataset of 51,753 continuous urban dashcam segments spanning over 20,000 hours from 238 countries, with manual labels and automated object detections for routine driving analysis.

C-TRAIL: A Commonsense World Framework for Trajectory Planning in Autonomous Driving
cs.AI 2026-03 unverdicted novelty 7.0

C-TRAIL combines LLM commonsense with a dual-trust mechanism and Dirichlet-weighted Monte Carlo Tree Search to improve trajectory planning accuracy and safety in autonomous driving.

LongTail Driving Scenarios with Reasoning Traces: The KITScenes LongTail Dataset
cs.CV 2026-03 unverdicted novelty 7.0

KITScenes LongTail supplies multimodal driving data and multilingual expert reasoning traces to benchmark models on rare scenarios beyond basic safety metrics.

ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving
cs.CV 2025-06 unverdicted novelty 7.0

ReCogDrive unifies VLM scene understanding with a diffusion planner reinforced by DiffGRPO to reach state-of-the-art results on NAVSIM and Bench2Drive benchmarks.

CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
cs.CV 2026-05 unverdicted novelty 6.0

CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and t...

CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
cs.CV 2026-05 unverdicted novelty 6.0

CoWorld-VLA encodes world information into four expert tokens that condition a diffusion-based planner, yielding competitive collision avoidance and trajectory accuracy on the NAVSIM benchmark.

Temporal Sampling Frequency Matters: A Capacity-Aware Study of End-to-End Driving Trajectory Prediction
cs.CV 2026-05 unverdicted novelty 6.0

Smaller end-to-end autonomous driving models achieve optimal 3-second trajectory prediction accuracy at lower or intermediate temporal sampling frequencies, whereas larger VLA-style models perform best at the highest ...

DriveFuture: Future-Aware Latent World Models for Autonomous Driving
cs.CV 2026-05 unverdicted novelty 6.0

DriveFuture achieves SOTA results on NAVSIM by conditioning latent world model states on future predictions to directly inform trajectory planning.

SceneFactory: GPU-Accelerated Multi-Agent Driving Simulation with Physics-Based Vehicle Dynamics
cs.MA 2026-05 accept novelty 6.0

SceneFactory delivers a batched GPU platform for physics-based multi-agent autonomous driving simulation that achieves 127x higher throughput than non-vectorized PhysX while supporting articulated dynamics and road-co...

Response Time Enhances Alignment with Heterogeneous Preferences
cs.LG 2026-05 unverdicted novelty 6.0

Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.

ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving
cs.RO 2026-05 unverdicted novelty 6.0

ReflectDrive-2 combines masked discrete diffusion with RL-aligned self-editing to generate and refine driving trajectories, reaching 91.0 PDMS on NAVSIM camera-only and 94.8 in best-of-6.

ProDrive: Proactive Planning for Autonomous Driving via Ego-Environment Co-Evolution
cs.RO 2026-04 unverdicted novelty 6.0

ProDrive couples a query-centric planner with a BEV world model for end-to-end ego-environment co-evolution, enabling future-outcome assessment that improves safety and efficiency over reactive baselines on NAVSIM v1.

OneDrive: Unified Multi-Paradigm Driving with Vision-Language-Action Models
cs.CV 2026-04 unverdicted novelty 6.0

OneDrive unifies heterogeneous decoding in a single VLM transformer decoder for end-to-end driving, achieving 0.28 L2 error and 0.18 collision rate on nuScenes plus 86.8 PDMS on NAVSIM.

Mosaic: An Extensible Framework for Composing Rule-Based and Learned Motion Planners
cs.RO 2026-04 unverdicted novelty 6.0

Mosaic integrates rule-based and learned planners via arbitration graphs to set new state-of-the-art scores on nuPlan and interPlan benchmarks while cutting at-fault collisions by 30%.

BridgeSim: Unveiling the OL-CL Gap in End-to-End Autonomous Driving
cs.RO 2026-04 unverdicted novelty 6.0

The primary OL-CL gap in end-to-end autonomous driving arises from objective mismatch creating structural inability to model reactive behaviors, which a test-time adaptation method can mitigate.

Evaluation as Evolution: Transforming Adversarial Diffusion into Closed-Loop Curricula for Autonomous Vehicles
cs.RO 2026-04 unverdicted novelty 6.0

E² uses transport-regularized sparse control on learned reverse-time SDEs with topology-driven selection and Topological Anchoring to generate realistic adversarial scenarios, improving collision discovery by 9.01% on...

Hydra-MDP: End-to-end Multimodal Planning with Multi-target Hydra-Distillation
cs.CV 2024-06 unverdicted novelty 6.0

Hydra-MDP uses multi-teacher distillation and a multi-head decoder to learn diverse, metric-specific trajectories in an end-to-end autonomous-driving planner, winning the Navsim challenge.

Causality-Aware End-to-End Autonomous Driving via Ego-Centric Joint Scene Modeling
cs.RO 2026-05 unverdicted novelty 5.0

CaAD adds ego-centric joint-causal modeling and causality-aware policy alignment to end-to-end driving, reporting Driving Score 87.53 and Success Rate 71.81 on Bench2Drive plus PDMS 91.1 on NAVSIM.

Artificial Intelligence for Modeling and Simulation of Mixed Automated and Human Traffic
cs.AI 2026-04 unverdicted novelty 5.0

This survey synthesizes AI techniques for mixed autonomy traffic simulation and introduces a taxonomy spanning agent-level behavior models, environment-level methods, and cognitive/physics-informed approaches.
