Do Open-Loop Metrics Predict Closed-Loop Driving? A Cross-Benchmark Correlation Study of NAVSIM and Bench2Drive
Pith reviewed 2026-05-09 20:52 UTC · model grok-4.3
The pith
NAVSIM open-loop PDM Score correlates at ρ=0.90 with closed-loop Bench2Drive Driving Score but shows ranking inversions and can be matched by a simpler three-metric version.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Compiling paired NAVSIM sub-metrics and Bench2Drive scores for eight methods shows the aggregate PDM Score correlates positively with closed-loop Driving Score at Spearman ρ=0.90 but non-monotonically, with ranking inversions. Ego Progress is the strongest single sub-metric predictor, exceeding the safety-critical No Collision metric. A simpler three-metric formula matches the predictive power of the full five-metric PDM Score on the same n=8 sample. The safety-progress trade-off appears differently across the two regimes, with the snowball effect of accumulating open-loop deviations offered as a candidate mechanism for the residual gap.
What carries the argument
Paired dataset of eight methods' NAVSIM open-loop sub-metrics (including Ego Progress and PDM Score) and Bench2Drive closed-loop Driving Scores, analyzed via Spearman rank correlations and ranking comparisons.
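The core computation is a rank correlation over paired benchmark scores. A minimal sketch, using invented numbers rather than the paper's data:

```python
# Minimal sketch of the study's core computation: Spearman rank correlation
# between open-loop PDM Scores and closed-loop Driving Scores for n=8 methods.
# All numbers are invented for illustration; they are NOT the paper's data.

def ranks(values):
    """1-based ranks (assumes no ties, as with distinct benchmark scores)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    """Classic no-ties Spearman rho: 1 - 6*sum(d^2) / (n*(n^2-1))."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

pdm_score     = [82.1, 84.5, 86.0, 87.2, 88.1, 88.9, 90.3, 91.0]  # open-loop
driving_score = [45.1, 40.2, 55.0, 52.3, 60.1, 58.0, 70.2, 64.7]  # closed-loop

rho = spearman(pdm_score, driving_score)
print(f"Spearman rho = {rho:.2f}")  # -> 0.90 for this toy pairing
```

Note how the toy data reproduce the paper's qualitative picture: a strong monotone trend overall, yet several adjacent ranking inversions, so a high ρ does not guarantee that the open-loop leaderboard order carries over.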
If this is right
- A three-metric shortcut can be used in place of the full PDM Score for closed-loop ranking with no loss in accuracy on current methods.
- Planners that maximize safety by minimizing progress in open-loop evaluation tend to underperform in closed-loop due to timeout and slow-driving penalties.
- Ego Progress should receive higher weight in future open-loop metric design to improve alignment with closed-loop outcomes.
- Within present state-of-the-art, TTC and Comfort metrics add little marginal information for predicting closed-loop success.
Where Pith is reading between the lines
- If the pattern holds on larger samples, benchmark designers could simplify NAVSIM-style scoring to fewer metrics focused on progress.
- The snowball effect suggests that open-loop metrics might be augmented with explicit models of deviation accumulation to better anticipate closed-loop failures.
- Extending the pairing exercise to additional closed-loop benchmarks would test whether the observed correlation generalizes beyond Bench2Drive.
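The deviation-accumulation mechanism can be made concrete with a toy recursion: each closed-loop step replans from a slightly wrong state, so per-step error compounds instead of staying a one-step quantity. The growth factor here is a hypothetical assumption for illustration, not something the paper estimates:

```python
# Toy model of the "snowball effect": a small per-step deviation (what
# open-loop metrics measure) compounds over a closed-loop rollout.
# The growth factor g is a hypothetical illustration parameter.

def closed_loop_deviation(per_step_error: float, growth: float, steps: int) -> float:
    """Accumulated deviation after `steps` of closed-loop rollout."""
    deviation = 0.0
    for _ in range(steps):
        # each step inherits (and amplifies) the previous state error
        deviation = deviation * growth + per_step_error
    return deviation

print(closed_loop_deviation(0.05, 1.0, 40))  # no compounding: 40 * 0.05 = 2.0
print(closed_loop_deviation(0.05, 1.1, 40))  # compounding: ~22, an order larger
```

Under this sketch, two planners with identical open-loop (one-step) error can diverge sharply in closed-loop whenever their effective growth factors differ, which is one way to read the residual gap the paper attributes to the snowball effect.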
Load-bearing premise
That the eight methods with complete paired data represent current state-of-the-art planners, and that the published open-loop and closed-loop results are directly comparable, with no hidden differences in simulation setup or evaluation protocol.
What would settle it
A study adding more methods or different benchmark pairs that finds Spearman correlation between PDM Score and Driving Score below 0.7, or where the three-metric formula loses its matching accuracy, would falsify the claimed predictive equivalence.
Original abstract
Open-loop evaluation offers fast, reproducible assessment of autonomous driving planners, but its ability to predict real closed-loop driving performance remains questionable. Prior work has shown that traditional open-loop metrics such as Average Displacement Error (ADE) and Final Displacement Error (FDE) exhibit no reliable correlation with closed-loop Driving Score. In this paper, we ask whether the more recent, safety-aware open-loop metrics introduced by NAVSIM~v2 can bridge this gap. By systematically cross-referencing published results from 15 state-of-the-art methods across NAVSIM (open-loop) and Bench2Drive (closed-loop), we compile a paired dataset of open-loop sub-metrics and closed-loop performance, yielding 8 methods with complete paired data. Our analysis reveals three key findings: (1) the aggregate NAVSIM PDM Score shows a strong positive but non-monotonic correlation with Bench2Drive Driving Score, with clear ranking inversions; (2) among individual NAVSIM sub-metrics, Ego Progress (EP) is the strongest single predictor of closed-loop success, substantially exceeding the safety-critical collision metric NC; (3) the safety-progress trade-off manifests differently in open-loop and closed-loop: methods that maximize safety at the expense of progress rank highly in NAVSIM but underperform in closed-loop due to timeout and slow-driving penalties. We further demonstrate that a much simpler 3-metric formula matches the predictive power of the full 5-metric PDMS at the same Spearman $\rho{=}0.90$ on our paired sample of $n{=}8$ methods, suggesting that within current state-of-the-art methods -- where TTC and Comfort approach saturation -- these two sub-metrics add little marginal information for closed-loop ranking. Additionally, we identify the snowball effect -- where small open-loop deviations compound into closed-loop failures -- as a candidate mechanism for the residual gap.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript conducts a cross-benchmark correlation study between open-loop NAVSIM metrics (including the aggregate PDM Score and its sub-metrics) and closed-loop Bench2Drive Driving Scores. Using published results from 15 SOTA methods, it compiles paired data for 8 methods and reports a strong positive but non-monotonic Spearman correlation (ρ=0.90) between PDM Score and Driving Score with ranking inversions, identifies Ego Progress (EP) as the strongest single open-loop predictor, and shows that a post-hoc 3-metric formula matches the predictive power of the full 5-metric PDMS on the same n=8 sample. It attributes residual gaps to a snowball effect and notes that TTC and Comfort appear saturated within current SOTA.
Significance. If the correlations and simplification hold on larger, independently validated samples, the work would be significant for autonomous driving evaluation by providing concrete evidence that full open-loop suites like PDMS may be overparameterized for closed-loop prediction and by highlighting non-monotonicity and the open-to-closed-loop gap. The transparency in reporting specific Spearman values and explicit ranking inversions is a strength that enables falsifiable follow-up.
major comments (3)
- [Paired dataset construction and results] Paired data compilation and correlation analysis: All headline results (PDM Score vs. Driving Score ρ=0.90, EP as top predictor, non-monotonicity with inversions, and the 3-metric formula) are computed exclusively on the n=8 methods with complete paired data. No bootstrap intervals, statistical significance tests, or sensitivity analysis to outliers are reported, making the claims vulnerable to small-sample artifacts.
- [Simplified metric formula and ablation] 3-metric formula: The selection of which two sub-metrics (TTC, Comfort) to drop and the demonstration that the resulting formula matches full PDMS at ρ=0.90 both occur on the identical n=8 sample, introducing circularity. The claim that these metrics 'approach saturation' is therefore an in-sample observation without hold-out validation or external confirmation.
- [Cross-benchmark methodology] Comparability assumption: The analysis treats published open-loop and closed-loop results as directly comparable, but does not address potential hidden differences in simulation setups, evaluation protocols, or method-specific reporting choices that could affect the observed correlations.
minor comments (1)
- [Abstract] Clarify early (e.g., in the abstract or introduction) that while 15 methods are referenced, all quantitative claims rest on the subset of 8 with complete data.
Simulated Author's Rebuttal
We thank the referee for the constructive and insightful comments, which highlight important limitations in statistical rigor and methodological assumptions. We address each major comment below and will incorporate revisions to strengthen the manuscript.
Point-by-point responses
-
Referee: [Paired dataset construction and results] Paired data compilation and correlation analysis: All headline results (PDM Score vs. Driving Score ρ=0.90, EP as top predictor, non-monotonicity with inversions, and the 3-metric formula) are computed exclusively on the n=8 methods with complete paired data. No bootstrap intervals, statistical significance tests, or sensitivity analysis to outliers are reported, making the claims vulnerable to small-sample artifacts.
Authors: We acknowledge the small n=8 sample as a core limitation due to the scarcity of published paired results across benchmarks. In the revision, we will add bootstrap resampling (1000 iterations) to report 95% confidence intervals for the Spearman ρ values, including for PDM Score vs. Driving Score and the sub-metrics. We will also include leave-one-out sensitivity analysis to evaluate the stability of EP as the top predictor and the observed ranking inversions. These additions will be presented alongside an explicit discussion of small-sample caveats, framing the results as preliminary evidence rather than definitive claims. revision: yes
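The robustness checks promised here (bootstrap CIs plus leave-one-out sensitivity at n=8) are straightforward to implement; a sketch using illustrative placeholder scores, not the paper's published numbers:

```python
# Sketch of the promised robustness checks for a Spearman rho at n=8:
# percentile-bootstrap confidence interval + leave-one-out sensitivity.
# The paired scores are illustrative placeholders, not the paper's data.
import random

def rank_avg(v):
    """Ranks with ties sharing their average rank (bootstrap resamples duplicate values)."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman rho as the Pearson correlation of tie-averaged ranks."""
    rx, ry = rank_avg(x), rank_avg(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

pdm = [82.1, 84.5, 86.0, 87.2, 88.1, 88.9, 90.3, 91.0]  # open-loop (toy)
ds  = [45.1, 40.2, 55.0, 52.3, 60.1, 58.0, 70.2, 64.7]  # closed-loop (toy)

random.seed(0)
boot = []
for _ in range(1000):  # resample methods with replacement
    idx = [random.randrange(8) for _ in range(8)]
    boot.append(spearman([pdm[i] for i in idx], [ds[i] for i in idx]))
boot.sort()
lo, hi = boot[25], boot[-26]  # 95% percentile interval
print(f"bootstrap 95% CI for rho: [{lo:.2f}, {hi:.2f}]")

# Leave-one-out: does the correlation survive dropping any single method?
loo = [spearman([v for j, v in enumerate(pdm) if j != k],
                [v for j, v in enumerate(ds) if j != k]) for k in range(8)]
print(f"leave-one-out rho range: [{min(loo):.2f}, {max(loo):.2f}]")
```

At n=8 the bootstrap interval is typically very wide, which is exactly the referee's point: the headline ρ=0.90 is compatible with substantially weaker underlying correlations.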
-
Referee: [Simplified metric formula and ablation] 3-metric formula: The selection of which two sub-metrics (TTC, Comfort) to drop and the demonstration that the resulting formula matches full PDMS at ρ=0.90 both occur on the identical n=8 sample, introducing circularity. The claim that these metrics 'approach saturation' is therefore an in-sample observation without hold-out validation or external confirmation.
Authors: The referee rightly points out the circularity in both selecting and validating the 3-metric formula on the same sample. We will revise the text to present this formula strictly as a post-hoc exploratory observation derived from saturation patterns visible in the current SOTA data distributions. We will report the full set of individual sub-metric correlations with Driving Score for transparency, allowing readers to evaluate contributions independently. The saturation statement will be qualified as an in-sample observation specific to existing methods, with a call for future hold-out validation on expanded datasets. No claim of generalizability beyond n=8 will be retained. revision: partial
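The in-sample observation under dispute can be sketched directly: when TTC and Comfort are near-saturated for every method, dropping them barely perturbs the induced ranking. The weights below follow the NAVSIM v1-style PDMS shape as an assumption (the exact NAVSIM v2 weighting is not reproduced here), and all scores are invented:

```python
# Sketch of the 5-metric vs 3-metric comparison. Each method has sub-metrics
# (NC, DAC, EP, TTC, Comfort) in [0, 1]; TTC and Comfort are near 1.0
# ("saturated") for every method, as the paper observes for current SOTA.
# Weights follow the NAVSIM v1-style PDMS shape as an ASSUMPTION; scores are invented.
methods = {
    "A": (0.97, 0.92, 0.70, 0.99, 0.99),
    "B": (0.98, 0.94, 0.78, 0.98, 1.00),
    "C": (0.95, 0.96, 0.85, 0.99, 0.98),
    "D": (0.99, 0.90, 0.74, 1.00, 0.99),
}

def pdms5(nc, dac, ep, ttc, comf):
    # multiplicative safety gates x weighted average (v1-style shape)
    return nc * dac * (5 * ep + 5 * ttc + 2 * comf) / 12

def pdms3(nc, dac, ep):
    # reduced formula: drop the near-saturated TTC and Comfort terms
    return nc * dac * ep

rank5 = sorted(methods, key=lambda m: pdms5(*methods[m]), reverse=True)
rank3 = sorted(methods, key=lambda m: pdms3(*methods[m][:3]), reverse=True)
print(rank5, rank3, rank5 == rank3)  # identical rankings on this toy sample
```

Because saturated sub-metrics contribute a near-constant factor, they cannot reorder methods; this is the mechanism behind the matching ρ, and also why the referee's circularity point bites: nothing here guarantees saturation persists in future methods.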
-
Referee: [Cross-benchmark methodology] Comparability assumption: The analysis treats published open-loop and closed-loop results as directly comparable, but does not address potential hidden differences in simulation setups, evaluation protocols, or method-specific reporting choices that could affect the observed correlations.
Authors: We agree that unaddressed differences in simulation setups, scenario coverage, or reporting choices could confound the correlations. In revision, we will add a dedicated limitations subsection discussing these factors, noting that all methods were selected based on their use of official NAVSIM and Bench2Drive evaluation protocols as published. We will emphasize that our conclusions are conditional on these standard benchmarks and highlight the snowball effect as one mechanism for residual discrepancies. Despite potential confounders, the strength of the observed correlation (ρ=0.90) still provides useful evidence, but we will frame it more cautiously. revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper performs an empirical cross-benchmark analysis by compiling published results from external methods on NAVSIM and Bench2Drive, computing Spearman correlations and identifying predictors such as Ego Progress directly from those data. The observation that a 3-metric subset achieves the same ρ=0.90 on the n=8 paired sample is an in-sample comparison rather than a derivation that reduces to its own inputs by construction. No equations are shown to be self-referential, no parameters are fitted and then relabeled as predictions, and no load-bearing claims rest on self-citations or imported uniqueness theorems. The central findings rely on external published benchmarks and are therefore self-contained against those independent sources.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Published open-loop and closed-loop scores from different papers can be directly paired and compared without protocol mismatches
- standard math Spearman rank correlation is a valid measure of predictive power for closed-loop success
Reference graph
Works this paper leans on
- [1] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. CARLA: An open urban driving simulator. In Conference on Robot Learning (CoRL), 2017.
- [2] Daniel Dauner, Marcel Hallgarten, Andreas Geiger, and Kashyap Chitta. Parting with misconceptions about learning-based vehicle motion planning. In Conference on Robot Learning (CoRL), 2023. arXiv:2306.07962.
- [3] Zhiqi Li, Zhiding Yu, Shiyi Lan, et al. Is ego status all you need for open-loop end-to-end autonomous driving? In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
- [4] Daniel Dauner, Marcel Hallgarten, Tianyu Li, Xinshuo Weng, Zhiyu Huang, Zetong Yang, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, et al. NAVSIM: Data-driven non-reactive autonomous vehicle simulation and benchmarking. In Advances in Neural Information Processing Systems (NeurIPS), volume 37, 2024.
- [5] Xiaosong Jia, Zhenjie Yang, Qifeng Li, Zhiyuan Zhang, and Junchi Yan. Bench2Drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving. In Advances in Neural Information Processing Systems (NeurIPS), volume 37, 2024.
- [6] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, et al. nuScenes: A multimodal dataset for autonomous driving. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
- [7] Benjamin Wilson, William Qi, Tanmay Aber, et al. Argoverse 2: Next generation datasets for self-driving perception and forecasting. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
- [8] Napat Karnchanachari et al. Towards learning-based planning: The nuPlan benchmark for real-world autonomous driving. In IEEE International Conference on Robotics and Automation (ICRA), 2024.
- [9] Xiaosong Jia et al. Bench2Drive-VL: Closed-loop VLM evaluation for autonomous driving. arXiv preprint arXiv:2604.01259, 2026.
- [10] Mustafa Baniodeh, Kratarth Goel, Scott Ettinger, Carlos Fuertes, Ari Seff, Tim Shen, Cole Gulino, Chenjie Yang, Ghassen Jerfel, Sergio Casas, et al. Scaling laws of motion forecasting and planning – technical report. arXiv preprint arXiv:2506.08228, 2025.
- [11] Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, et al. Planning-oriented autonomous driving. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
- [12] Zhenxin Li, Kailin Li, Shihao Wang, Shiyi Lan, Zhiding Yu, Yishen Ji, Zhiqi Li, Ziyue Zhu, Jan Kautz, et al. Hydra-MDP: End-to-end multimodal planning with multi-target hydra-distillation. arXiv preprint arXiv:2406.06978, 2024.
- [13] Bencheng Liao, Shaoyu Chen, Haoran Yin, Bo Jiang, et al. DiffusionDrive: Truncated diffusion model for end-to-end autonomous driving. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
- [14] Zhenxin Li, Shihao Wang, Shiyi Lan, Zhiding Yu, Zuxuan Wu, and Jose M Alvarez. Hydra-NeXt: Robust closed-loop driving with open-loop training. arXiv preprint arXiv:2503.12030, 2025.
- [15] Zebin Xing, Xinhao Zhang, Yanjun Hu, Bo Jiang, et al. GoalFlow: Goal-driven flow matching for multimodal trajectories generation in end-to-end autonomous driving. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
- [16] Wenchao Sun, Xuewu Lin, Keyu Chen, Zixiang Pei, Xiang Li, Yining Shi, and Sifa Zheng. SparseDriveV2: Scoring is all you need for end-to-end autonomous driving. arXiv preprint arXiv:2603.29163, 2026.
- [17] Jungho Kim, Jiyong Oh, Seunghoon Yu, Hongjae Shin, Donghyuk Kwak, and Jun Won Choi. SafeDrive: Fine-grained safety reasoning for end-to-end driving in a sparse world. arXiv preprint arXiv:2602.18887, 2026. CVPR 2026.
- [18] Xiaosong Jia, Junqi You, Zhiyuan Zhang, and Junchi Yan. DriveTransformer: Unified transformer for scalable end-to-end autonomous driving. In International Conference on Learning Representations (ICLR), 2025. arXiv:2503.07656.
- [19] Bo Jiang, Shaoyu Chen, Hao Gao, Bencheng Liao, Qian Zhang, Wenyu Liu, and Xinggang Wang. VADv2: End-to-end vectorized autonomous driving via probabilistic planning. In International Conference on Learning Representations (ICLR), 2026. arXiv:2402.13243.
- [20] Yingyan Li, Yuqi Wang, Yang Liu, Jiawei He, Lue Fan, and Zhaoxiang Zhang. End-to-end driving with online trajectory evaluation via BEV world model. arXiv preprint arXiv:2504.01941, 2025.
- [21] Wenhao Yao, Zhenxin Li, Shiyi Lan, Zi Wang, Xinglong Sun, Jose M Alvarez, and Zuxuan Wu. DriveSuprim: Towards precise trajectory selection for end-to-end planning. arXiv preprint arXiv:2506.06659, 2025. AAAI 2026.
- [22] Penghao Wu, Xiaosong Jia, Li Chen, Junchi Yan, Hongyang Li, and Yu Qiao. Trajectory-guided control prediction for end-to-end autonomous driving: A simple yet strong baseline. In Advances in Neural Information Processing Systems (NeurIPS), volume 35, 2022.
- [23] Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. VAD: Vectorized scene representation for efficient autonomous driving. In IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
- [24] Xiaosong Jia, Penghao Wu, Li Chen, Jiangwei Xie, Conghui He, Junchi Yan, and Hongyang Li. Think Twice before Driving: Towards scalable decoders for end-to-end autonomous driving. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
- [25] Xiaosong Jia, Penghao Wu, Li Chen, Yu Liu, Hongyang Li, and Junchi Yan. DriveAdapter: Breaking the coupling barrier of perception and planning in end-to-end autonomous driving. In IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
- [26] Bozhou Zhang, Nan Song, Xin Jin, and Li Zhang. Bridging past and future: End-to-end autonomous driving with historical prediction and planning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. arXiv:2503.14182.
- [27] Shuyao Shang, Yuntao Chen, Yuqi Wang, Yingyan Li, and Zhaoxiang Zhang. DriveDPO: Policy learning via safety DPO for end-to-end autonomous driving. In Advances in Neural Information Processing Systems (NeurIPS), 2025. arXiv:2509.17940.
- [28] Katrin Renz, Long Chen, Elahe Arani, and Oleg Sinavski. SimLingo: Vision-only closed-loop autonomous driving with language-action alignment. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. arXiv:2503.09594.
- [29] Yingqi Tang, Zhuoran Xu, Zhaotie Meng, and Erkang Cheng. HiP-AD: Hierarchical and multi-granularity planning with deformable attention for autonomous driving in a single decoder. In IEEE/CVF International Conference on Computer Vision (ICCV), 2025. arXiv:2503.08612.
- [30] Liuhan Yin, Runkun Ju, Guodong Guo, and Erkang Cheng. DiffRefiner: Coarse to fine trajectory planning via diffusion refinement with semantic interaction for end-to-end autonomous driving. arXiv preprint arXiv:2511.17150, 2025. AAAI 2026.
- [31] Lan Feng, Yang Gao, Eloi Zablocki, Quanyi Li, Wuyang Li, Sichao Liu, Matthieu Cord, and Alexandre Alahi. RAP: 3D rasterization augmented end-to-end planning. arXiv preprint arXiv:2510.04333, 2025.
- [32] Long Nguyen, Micha Fauth, Bernhard Jaeger, Daniel Dauner, Maximilian Igl, Andreas Geiger, and Kashyap Chitta. LEAD: Minimizing learner-expert asymmetry in end-to-end driving. arXiv preprint arXiv:2512.20563, 2025. CVPR 2026.