arxiv: 2604.10856 · v1 · submitted 2026-04-12 · 💻 cs.RO · cs.AI

Recognition: unknown

BridgeSim: Unveiling the OL-CL Gap in End-to-End Autonomous Driving

Seth Z. Zhao , Luobin Wang , Hongwei Ruan , Yuxin Bao , Yilan Chen , Ziyang Leng , Abhijit Ravichandran , Honglin He

show 8 more authors

Zewei Zhou Xu Han Abhishek Peri Zhiyu Huang Pranav Desai Henrik Christensen Jiaqi Ma Bolei Zhou

Authors on Pith no claims yet

Pith reviewed 2026-05-10 14:58 UTC · model grok-4.3

classification 💻 cs.RO cs.AI

keywords autonomous drivingopen-loop policiesclosed-loop deploymenttest-time adaptationOL-CL gapQ-value estimatorreactive behaviorstemporal consistency

0 comments

The pith

Open-loop driving policies fail in closed loop because they learn biased Q-value estimators that ignore reactive behaviors and temporal consistency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes that the open-loop to closed-loop gap in end-to-end autonomous driving arises primarily from objective mismatch rather than mere observational differences. Open-loop training produces policies with a structural inability to model the complex reactive behaviors required during closed-loop simulation. A wide range of such policies develop biased Q-value estimators that overlook both the reactive nature of closed-loop environments and the temporal awareness needed to limit compounding errors. The authors introduce a test-time adaptation framework that addresses these issues by calibrating shifts, reducing biases, and enforcing consistency, leading to better performance and scaling than baselines.

Core claim

OL policies suffer from Observational Domain Shift and Objective Mismatch. While the former is largely recoverable with adaptation, the latter creates a structural inability to model complex reactive behaviors, which forms the primary OL-CL gap. A wide range of OL policies learn a biased Q-value estimator that neglects both the reactive nature of CL simulations and the temporal awareness needed to reduce compounding errors.

What carries the argument

The objective mismatch during open-loop training, which produces biased Q-value estimators unable to capture reactive behaviors or enforce temporal consistency in closed-loop deployment.

If this is right

The test-time adaptation framework calibrates observational shift, reduces state-action biases, and enforces temporal consistency during deployment.
Adapted policies achieve superior scaling dynamics compared to their unadapted open-loop counterparts.
Standard open-loop evaluation protocols contain blind spots that do not reflect the demands of closed-loop deployment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar objective mismatches may limit policy transfer in other sequential decision domains that involve reactive agents.
Test-time calibration techniques could extend to closing simulation-to-real gaps in robotics beyond driving.
Relying solely on open-loop metrics may systematically overstate a policy's viability for interactive real-world tasks.

Load-bearing premise

That the identified objective mismatch is the dominant structural cause of the OL-CL gap and that test-time adaptation can reliably correct it without creating new failure modes in varied driving scenarios.

What would settle it

A closed-loop simulation experiment in which policies adapted with the proposed framework continue to exhibit the same reactive behavior failures and compounding errors as unadapted open-loop baselines.

Figures

Figures reproduced from arXiv: 2604.10856 by Abhijit Ravichandran, Abhishek Peri, Bolei Zhou, Henrik Christensen, Honglin He, Hongwei Ruan, Jiaqi Ma, Luobin Wang, Pranav Desai, Seth Z. Zhao, Xu Han, Yilan Chen, Yuxin Bao, Zewei Zhou, Zhiyu Huang, Ziyang Leng.

**Figure 1.** Figure 1: Motivation. (a): Open-loop (OL) train-test paradigms do not necessarily align with the objectives and evaluation metrics in closed-loop (CL) deployment; (b): This mismatch results in significant performance discrepancies, characterized by decaying correlation and diverging scaling dynamics; (c): We attribute this OL-CL gap to observational domain shift and objective mismatch. In this paper, our goal is not… view at source ↗

**Figure 2.** Figure 2: Empirical Study on Observational Domain Shift. (a): Flow-matching procedure for observational calibration; (b): Calibration effects on OL and CL metrics. See Appendix D.2 for experimental details. Empirical studies. To test this hypothesis, we isolate and quantify ∆obs to examine policy behavior under observational domain shift, as depicted in [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Empirical Study on Objective Mismatch. (a): Compounding execution error of J(π) w.r.t. execution horizon k; (b): J(π) under CL and OL objectives, where the divergence between these curves constitutes the objective mismatch gap (∆obj); (c): OL-CL Pearson correlation under different traffic modes w.r.t. simulation horizon T. See Appendix D.3 for experimental details. Biased Q-value estimation is a byproduct … view at source ↗

**Figure 4.** Figure 4: Test-time Policy Adaptation. (a): Truncated Q-value estimator selects safe planned trajectories during closed-loop execution; (b): Adaptive replan preserves previously verified planned trajectories if the newly proposed trajectory doesn’t obtain higher Q-values; (c): Test-time scaling dynamics between biased (baseline) and truncated Q-value estimators (TTA). 5.2 Test-time Policy Adaptation 5.2.1 Truncated … view at source ↗

**Figure 5.** Figure 5: Improvements of proposed TTA framework. (a): Performance gain under different simulation horizons on BridgeSim-NavHard scenarios; (b): Performance gain under different logreplay domains; (c): Component analysis of TTA framework. Reactive evaluation results. As illustrated in [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 6.** Figure 6: Illustration of BridgeSim Platform. BridgeSim supports unified closed-loop simulation [PITH_FULL_IMAGE:figures/full_fig_p016_6.png] view at source ↗

**Figure 7.** Figure 7: Visualization of observation pairs in source and target domains. [PITH_FULL_IMAGE:figures/full_fig_p021_7.png] view at source ↗

**Figure 8.** Figure 8: Qualitative analysis of simulation artifacts in DriveArena: multi-view incoherence (left) [PITH_FULL_IMAGE:figures/full_fig_p024_8.png] view at source ↗

**Figure 9.** Figure 9: Qualitative analysis of reconstruction artifacts and ghosting effects in HUGSIM. Red [PITH_FULL_IMAGE:figures/full_fig_p025_9.png] view at source ↗

read the original abstract

Open-loop (OL) to closed-loop (CL) gap (OL-CL gap) exists when OL-pretrained policies scoring high in OL evaluations fail to transfer effectively in closed-loop (CL) deployment. In this paper, we unveil the root causes of this systemic failure and propose a practical remedy. Specifically, we demonstrate that OL policies suffer from Observational Domain Shift and Objective Mismatch. We show that while the former is largely recoverable with adaptation techniques, the latter creates a structural inability to model complex reactive behaviors, which forms the primary OL-CL gap. We find that a wide range of OL policies learn a biased Q-value estimator that neglects both the reactive nature of CL simulations and the temporal awareness needed to reduce compounding errors. To this end, we propose a Test-Time Adaptation (TTA) framework that calibrates observational shift, reduces state-action biases, and enforces temporal consistency. Extensive experiments show that TTA effectively mitigates planning biases and yields superior scaling dynamics than its baseline counterparts. Furthermore, our analysis highlights the existence of blind spots in standard OL evaluation protocols that fail to capture the realities of closed-loop deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper pins the main OL-CL gap on objective mismatch from biased Q-value estimates rather than just domain shift, and shows their TTA fix improves scaling in driving policies.

read the letter

The central point is that open-loop policies for end-to-end driving fail in closed loop mainly because they learn biased Q-values that ignore reactivity and temporal structure, not merely from observational differences. The authors separate these cleanly and argue the mismatch is structural and load-bearing. They then test a Test-Time Adaptation method that adjusts for shift, cuts state-action bias, and adds consistency checks, with results showing better planning and scaling than baselines. This framing and the practical remedy are the real additions beyond generic sim-to-real talk. The experiments appear to cover a range of policies and highlight blind spots in standard open-loop metrics, which is useful for anyone training these models. The distinction between recoverable shift and deeper mismatch holds up logically and matches what the abstract and analysis describe. One soft spot is that the bias measurements and TTA gains rest on their simulation setups; without broader tests on varied traffic or sensor conditions, it is not yet clear how dominant this factor stays outside their environments. The evaluation critique is fair but could use more direct comparisons to show which common metrics most mislead. This is for researchers working on imitation or RL policies in autonomous driving simulators who already care about transfer. It is coherent enough on its own terms to merit a serious referee, even if revisions would tighten the generalization claims.

Referee Report

2 major / 3 minor

Summary. The manuscript analyzes the open-loop (OL) to closed-loop (CL) gap in end-to-end autonomous driving. It identifies observational domain shift (largely recoverable via adaptation) and objective mismatch as root causes, with the latter producing biased Q-value estimators that neglect reactivity and temporal consistency in CL simulations. The authors propose a Test-Time Adaptation (TTA) framework to calibrate shifts, reduce state-action biases, and enforce temporal consistency, claiming superior mitigation of planning biases and better scaling than baselines, while highlighting blind spots in standard OL evaluation protocols.

Significance. If the empirical findings hold under rigorous verification, the work offers a structured diagnosis of why OL-pretrained policies fail to transfer and a practical TTA remedy that targets the structural component of the gap. The distinction between recoverable domain shift and non-recoverable objective mismatch, together with the identification of biased Q-value estimation, could inform more robust training pipelines for reactive driving policies. The analysis of evaluation protocol blind spots is a useful contribution to the field.

major comments (2)

[§4] §4 (Method) and Eq. (X) [TTA objective]: the TTA loss formulation for enforcing temporal consistency is described at a high level but lacks an explicit equation showing how the temporal term is combined with the bias-reduction term; without this, it is unclear whether the framework can avoid introducing new compounding errors in long-horizon reactive scenarios.
[Table 3] Table 3 (CL evaluation results): the reported gains in success rate and collision reduction for TTA versus baselines are presented without error bars or statistical tests across random seeds; given that the central claim is that TTA yields superior scaling dynamics, the absence of variance quantification weakens the strength of the empirical support.

minor comments (3)

[§2.1] §2.1 (Related Work): the discussion of prior OL-CL gap papers would benefit from a concise table comparing their identified causes and proposed fixes to the current work.
[Figure 4] Figure 4 (Qualitative rollouts): the caption should explicitly state the number of episodes visualized and whether they are cherry-picked or representative.
[§5.3] §5.3 (Ablation study): the ablation removing the temporal consistency term reports only mean performance; adding per-scenario breakdowns would clarify whether the term is critical for complex reactive behaviors.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and positive assessment of our work on the OL-CL gap. We have addressed each major comment below with targeted revisions to improve clarity and empirical rigor.

read point-by-point responses

Referee: [§4] §4 (Method) and Eq. (X) [TTA objective]: the TTA loss formulation for enforcing temporal consistency is described at a high level but lacks an explicit equation showing how the temporal term is combined with the bias-reduction term; without this, it is unclear whether the framework can avoid introducing new compounding errors in long-horizon reactive scenarios.

Authors: We agree that an explicit combined objective would strengthen the presentation. In the revised manuscript, we have added Equation (X') defining the full TTA loss as L_TTA = L_bias + λ L_temporal, where L_bias is the state-action bias reduction term (KL divergence between adapted and original distributions) and L_temporal is the mean-squared error on predicted versus observed next states over a short adaptation window. We have also expanded the surrounding text to explain that the temporal term is applied only during lightweight test-time updates and does not propagate errors into the base policy, thereby avoiding new compounding issues in long-horizon reactive driving. revision: yes
Referee: [Table 3] Table 3 (CL evaluation results): the reported gains in success rate and collision reduction for TTA versus baselines are presented without error bars or statistical tests across random seeds; given that the central claim is that TTA yields superior scaling dynamics, the absence of variance quantification weakens the strength of the empirical support.

Authors: We acknowledge that variance reporting would better support the scaling claim. The original experiments used a fixed seed for direct comparability across methods and environments. In the revision, we have rerun the primary CL evaluations in Table 3 across five random seeds, added mean ± standard deviation error bars, and included a footnote with paired t-test results confirming statistical significance (p < 0.05) of the reported gains. These additions preserve the original trends while quantifying variability. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents no mathematical derivations, equations, or self-referential definitions that reduce claims to fitted inputs or prior self-citations. All load-bearing assertions (observational domain shift, objective mismatch via biased Q-value estimation, and the TTA framework's effectiveness) are framed as outcomes of empirical experiments and simulation-based evaluations. The abstract and summary contain no derivation chain; claims rest on direct demonstration from policy rollouts and adaptation tests rather than any self-definitional or fitted-input structure. This is the standard case of an empirical robotics paper whose central results are externally falsifiable via replication on the described benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard domain assumptions in reinforcement learning for driving (e.g., closed-loop simulation approximates real deployment dynamics) and the validity of Q-value analysis for detecting bias. No free parameters or invented entities are described in the abstract.

axioms (1)

domain assumption Closed-loop simulations faithfully represent the reactive and compounding-error dynamics of real autonomous driving deployment.
Invoked when claiming that OL policies' biased Q-values create structural inability in CL settings.

pith-pipeline@v0.9.0 · 5556 in / 1230 out tokens · 61156 ms · 2026-05-10T14:58:24.012329+00:00 · methodology

discussion (0)

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MDrive: Benchmarking Closed-Loop Cooperative Driving for End-to-End Multi-agent Systems
cs.RO 2026-05 unverdicted novelty 7.0

MDrive benchmark shows multi-agent cooperative driving systems generally outperform single-agent ones in closed-loop settings but perception sharing does not always improve planning and negotiation can harm performanc...
ConFixGS: Learning to Fix Feedforward 3D Gaussian Splatting with Confidence-Aware Diffusion Priors in Driving Scenes
cs.CV 2026-05 unverdicted novelty 7.0

ConFixGS repairs feedforward 3D Gaussian Splatting with confidence-aware diffusion priors, delivering up to 3.68 dB PSNR gains and halved FID scores on Waymo, nuScenes, and KITTI novel view synthesis tasks.
SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model
cs.CV 2026-04 unverdicted novelty 5.0

SpanVLA reduces action generation latency via flow-matching conditioned on history and improves robustness by training on negative-recovery samples with GRPO and a dedicated reasoning dataset.

Reference graph

Works this paper leans on

85 extracted references · 26 canonical work pages · cited by 3 Pith papers · 4 internal anchors

[1]

Caesar, V

H. Caesar, V . Bankiti, A. H. Lang, S. V ora, V . E. Liong, Q. Xu, A. Krishnan, Y . Pan, G. Baldan, and O. Beijbom. nuscenes: A multimodal dataset for autonomous driving. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020

2020
[2]

W. Cao, M. Hallgarten, T. Li, D. Dauner, X. Gu, C. Wang, Y . Miron, M. Aiello, H. Li, I. Gilitschenski, B. Ivanovic, M. Pavone, A. Geiger, and K. Chitta. Pseudo-simulation for autonomous driving. InConference on Robot Learning (CoRL), 2025

2025
[3]

Dauner, M

D. Dauner, M. Hallgarten, T. Li, X. Weng, Z. Huang, Z. Yang, H. Li, I. Gilitschenski, B. Ivanovic, M. Pavone, A. Geiger, and K. Chitta. Navsim: Data-driven non-reactive au- tonomous vehicle simulation and benchmarking. InAdvances in Neural Information Process- ing Systems (NeurIPS), 2024

2024
[4]

L. Feng, Y . Gao, E. Zablocki, Q. Li, W. Li, S. Liu, M. Cord, and A. Alahi. Rap: 3d rasterization augmented end-to-end planning, 2025. URLhttps://arxiv.org/abs/2510.04333

work page arXiv 2025
[5]

Kirby, A

E. Kirby, A. Boulch, Y . Xu, Y . Yin, G. Puy, E. Zablocki, A. Bursuc, S. Gidaris, R. Marlet, F. Bartoccioni, A.-Q. Cao, N. Samet, T.-H. Vu, and M. Cord. Driving on registers.preprint, 2026

2026
[6]

X. Jia, Z. Yang, Q. Li, Z. Zhang, and J. Yan. Bench2drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving. InNeurIPS 2024 Datasets and Benchmarks Track, 2024

2024
[7]

Z. Li, Z. Yu, S. Lan, J. Li, J. Kautz, T. Lu, and J. M. Alvarez. Is ego status all you need for open-loop end-to-end autonomous driving? InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14864–14873, 2024

2024
[8]

Karkus, M

P. Karkus, M. Igl, Y . Chen, K. Chitta, J. Packer, B. Douillard, R. Tian, A. Naumann, G. Garcia- Cobo, S. Tan, et al. Beyond behavior cloning in autonomous driving: a survey of closed-loop training techniques.Authorea Preprints
[9]

S. Z. Zhao, H. Zhang, Z. Li, J. Peng, A. Chui, Z. Zhou, Z. Meng, H. Xiang, Z. Huang, F. Wang, et al. Quantv2x: A fully quantized multi-agent system for cooperative perception.arXiv preprint arXiv:2509.03704, 2025

work page arXiv 2025
[10]

Zheng, P

Y . Zheng, P. Yang, Z. Xia, Q. Zhang, Y . Zheng, S. Gu, B. Jin, T. Zhang, B. Lu, C. Han, et al. Data scaling laws for imitation learning-based end-to-end autonomous driving.arXiv preprint arXiv:2412.02689, 2024

work page arXiv 2024
[11]

Naumann, X

A. Naumann, X. Gu, T. Dimlioglu, M. Bojarski, A. Degirmenci, A. Popov, D. Bisla, M. Pavone, U. Muller, and B. Ivanovic. Data scaling laws for end-to-end autonomous driving. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 2571–2582, 2025

2025
[12]

arXiv preprint arXiv:2106.11810 (2021) 3, 7

H. Caesar, J. Kabzan, K. S. Tan, W. K. Fong, E. Wolff, A. Lang, L. Fletcher, O. Beijbom, and S. Omari. nuplan: A closed-loop ml-based planning benchmark for autonomous vehicles. arXiv preprint arXiv:2106.11810, 2021

work page arXiv 2021
[13]

Y . Li, S. Z. Zhao, C. Xu, C. Tang, C. Li, M. Ding, M. Tomizuka, and W. Zhan. Pre-training on synthetic driving data for trajectory prediction. In2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5910–5917, 2024. doi:10.1109/IROS58592. 2024.10802492

work page doi:10.1109/iros58592 2024
[14]

Garcia-Cobo, M

G. Garcia-Cobo, M. Igl, P. Karkus, Z. Zhang, M. Watson, Y . Chen, B. Ivanovic, and M. Pavone. Road: Rollouts as demonstrations for closed-loop supervised fine-tuning of autonomous driv- ing policies.arXiv preprint arXiv:2512.01993, 2025. 10

work page arXiv 2025
[15]

Alghodhaifi and S

H. Alghodhaifi and S. Lakshmanan. Autonomous vehicle evaluation: A comprehensive survey on modeling and simulation approaches.Ieee Access, 9:151531–151566, 2021

2021
[16]

Rosique, P

F. Rosique, P. J. Navarro, C. Fern ´andez, and A. Padilla. A systematic review of perception system and simulators for autonomous vehicles research.Sensors, 19(3):648, 2019

2019
[17]

Y . Ye, X. Zhang, and J. Sun. Automated vehicle’s behavior decision making using deep re- inforcement learning and high-fidelity simulation environment.Transportation Research Part C: Emerging Technologies, 107:155–170, 2019

2019
[18]

Osi ´nski, A

B. Osi ´nski, A. Jakubowski, P. Ziecina, P. Miło´s, C. Galias, S. Homoceanu, and H. Michalewski. Simulation-based reinforcement learning for real-world autonomous driving. In2020 IEEE international conference on robotics and automation (ICRA), pages 6411–6418. IEEE, 2020

2020
[19]

J. Wu, C. Huang, H. Huang, C. Lv, Y . Wang, and F.-Y . Wang. Recent advances in reinforcement learning-based autonomous driving behavior planning: A survey.Transportation Research Part C: Emerging Technologies, 164:104654, 2024

2024
[20]

Martinez, C

M. Martinez, C. Sitawarin, K. Finch, L. Meincke, A. Yablonski, and A. Kornhauser. Beyond grand theft auto v for training, testing and enhancing deep learning in self driving cars, 2017. URLhttps://arxiv.org/abs/1712.01397

work page arXiv 2017
[21]

M ¨uller, V

M. M ¨uller, V . Casser, J. Lahoud, N. Smith, and B. Ghanem. Sim4cv: A photo-realistic simu- lator for computer vision applications.IJCV, 2018

2018
[22]

S. Shah, D. Dey, C. Lovett, and A. Kapoor. Airsim: High-fidelity visual and physical simula- tion for autonomous vehicles. InFSR, 2018

2018
[23]

Dosovitskiy, G

A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V . Koltun. Carla: An open urban driving simulator. InConference on robot learning, pages 1–16. PMLR, 2017

2017
[24]

D. Team. Deepdrive: a simulator that allows anyone with a pc to push the state-of-the-art in self-driving.https://github.com/deepdrive/deepdrive
[25]

Kothari, C

P. Kothari, C. Perone, L. Bergamini, A. Alahi, and P. Ondruska. Drivergym: Democratising reinforcement learning for autonomous driving.arXiv preprint arXiv:2111.06889, 2021

work page arXiv 2021
[26]

Q. Li, Z. Peng, L. Feng, Q. Zhang, Z. Xue, and B. Zhou. Metadrive: Composing diverse driving scenarios for generalizable reinforcement learning.TPAMI, 2022

2022
[27]

X. Yang, L. Wen, Y . Ma, J. Mei, X. Li, T. Wei, W. Lei, D. Fu, P. Cai, M. Dou, B. Shi, L. He, Y . Liu, and Y . Qiao. Drivearena: A closed-loop generative simulation platform for autonomous driving.arXiv preprint arXiv:2408.00415, 2024

work page arXiv 2024
[28]

H. Zhou, L. Lin, J. Wang, Y . Lu, D. Bai, B. Liu, Y . Wang, A. Geiger, and Y . Liao. Hugsim: A real-time, photo-realistic and closed-loop simulator for autonomous driving.arXiv preprint arXiv:2412.01718, 2024

work page arXiv 2024
[29]

H. Gao, S. Chen, B. Jiang, B. Liao, Y . Shi, X. Guo, Y . Pu, H. Yin, X. Li, X. Zhang, et al. Rad: Training an end-to-end driving policy via large-scale 3dgs-based reinforcement learning. arXiv preprint arXiv:2502.13144, 2025

work page arXiv 2025
[30]

C. Ni, G. Zhao, X. Wang, Z. Zhu, W. Qin, G. Huang, C. Liu, Y . Chen, Y . Wang, X. Zhang, et al. Recondreamer: Crafting world models for driving scene reconstruction via online restoration. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 1559–1569, 2025

2025
[31]

C. Ni, G. Zhao, X. Wang, Z. Zhu, W. Qin, X. Chen, G. Jia, G. Huang, and W. Mei. Recondreamer-rl: Enhancing reinforcement learning via diffusion-based scene reconstruction. arXiv preprint arXiv:2508.08170, 2025. 11

work page arXiv 2025
[32]

W. Luo, B. Yang, and R. Urtasun. Fast and furious: Real time end-to-end 3d detection, tracking and motion forecasting with a single convolutional net. InProceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 3569–3577, 2018

2018
[33]

Liang, B

M. Liang, B. Yang, W. Zeng, Y . Chen, R. Hu, S. Casas, and R. Urtasun. Pnpnet: End-to- end perception and prediction with tracking in the loop. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11553–11562, 2020

2020
[34]

Sadat, S

A. Sadat, S. Casas, M. Ren, X. Wu, P. Dhawan, and R. Urtasun. Perceive, predict, and plan: Safe motion planning through interpretable semantic representations. InEuropean Conference on Computer Vision, pages 414–430. Springer, 2020

2020
[35]

Y . Li, C. Fan, C. Ge, Z. Zhao, C. Li, C. Xu, H. Yao, M. Tomizuka, B. Zhou, C. Tang, et al. Womd-reasoning: A large-scale dataset for interaction reasoning in driving.arXiv preprint arXiv:2407.04281, 2024

work page arXiv 2024
[36]

Y . Hu, J. Yang, L. Chen, K. Li, C. Sima, X. Zhu, S. Chai, S. Du, T. Lin, W. Wang, L. Lu, X. Jia, Q. Liu, J. Dai, Y . Qiao, and H. Li. Planning-oriented autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

2023
[37]

Jiang, S

B. Jiang, S. Chen, Q. Xu, B. Liao, J. Chen, H. Zhou, Q. Zhang, W. Liu, C. Huang, and X. Wang. Vad: Vectorized scene representation for efficient autonomous driving.ICCV, 2023

2023
[38]

Chitta, A

K. Chitta, A. Prakash, B. Jaeger, Z. Yu, K. Renz, and A. Geiger. Transfuser: Imitation with transformer-based sensor fusion for autonomous driving.Pattern Analysis and Machine Intel- ligence (PAMI), 2023

2023
[39]

B. Liao, S. Chen, H. Yin, B. Jiang, C. Wang, S. Yan, X. Zhang, X. Li, Y . Zhang, Q. Zhang, and X. Wang. Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving. pages 12037–12047, 2025. doi:10.1109/CVPR52734.2025.01124. URLhttps: //openaccess.thecvf.com/content/CVPR2025/html/Liao_DiffusionDrive_ Truncated_Diffusion_Model_for_End-to-End...

work page doi:10.1109/cvpr52734.2025.01124 2025
[40]

J. Zou, S. Chen, B. Liao, Z. Zheng, Y . Song, L. Zhang, Q. Zhang, W. Liu, and X. Wang. Diffusiondrivev2: Reinforcement learning-constrained truncated diffusion modeling in end- to-end autonomous driving.arXiv preprint arXiv:2512.07745, 2025

work page arXiv 2025
[41]

J.-T. Zhai, Z. Feng, J. Du, Y . Mao, J.-J. Liu, Z. Tan, Y . Zhang, X. Ye, and J. Wang. Rethink- ing the open-loop evaluation of end-to-end autonomous driving in nuscenes.arXiv preprint arXiv:2305.10430, 2023

work page arXiv 2023
[42]

Dauner, M

D. Dauner, M. Hallgarten, A. Geiger, and K. Chitta. Parting with misconceptions about learning-based vehicle motion planning. InConference on Robot Learning, pages 1268–1281. PMLR, 2023

2023
[43]

LEAD: Minimizing Learner-Expert Asymmetry in End-to-End Driving

L. Nguyen, M. Fauth, B. Jaeger, D. Dauner, M. Igl, A. Geiger, and K. Chitta. Lead: Minimizing learner-expert asymmetry in end-to-end driving.arXiv preprint arXiv:2512.20563, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[44]

Karkus, M

P. Karkus, M. Igl, Y . Chen, K. Chitta, J. Packer, B. Douillard, R. Tian, A. Naumann, G. Garcia- Cobo, S. Tan, A. De ˘girmenci, A. Popov, N. Smolyanskiy, U. Muller, B. Ivanovic, and M. Pavone. Beyond behavior cloning in autonomous driving: a survey of closed-loop training techniques. URLhttps://api.semanticscholar.org/CorpusID:283728300
[45]

J. Mei, A. Z. Zhu, X. Yan, H. Yan, S. Qiao, L.-C. Chen, and H. Kretzschmar. Waymo open dataset: Panoramic video panoptic segmentation. InEuropean Conference on Computer Vi- sion, pages 53–72. Springer, 2022. 12

2022
[46]

Treiber, A

M. Treiber, A. Hennecke, and D. Helbing. Congested traffic states in empirical observations and microscopic simulations.Physical review E, 62(2):1805, 2000

2000
[47]

Y . Liu, Z. Peng, X. Cui, and B. Zhou. Adv-bmt: Bidirectional motion transformer for safety- critical traffic scenario generation, 2025. URLhttps://arxiv.org/abs/2506.09485

work page arXiv 2025
[48]

H. Lin, Y . Zhang, W. Ding, J. Wu, and D. Zhao. Model-based policy adaptation for closed-loop end-to-end autonomous driving. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URLhttps://openreview.net/forum?id=4OLbpaTKJe

2025
[49]

Flow Matching for Generative Modeling

Y . Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[50]

Q. Liu, X. Yin, A. Yuille, A. Brown, and M. Singh. Flowing from words to pixels: A noise-free framework for cross-modality evolution. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 2755–2765, 2025

2025
[51]

Z. Li, W. Yao, Z. Wang, X. Sun, J. Chen, N. Chang, M. Shen, J. Song, Z. Wu, S. Lan, et al. Ztrs: Zero-imitation end-to-end autonomous driving with trajectory scoring.arXiv preprint arXiv:2510.24108, 2025

work page arXiv 2025
[52]

W. Yao, Z. Li, S. Lan, Z. Wang, X. Sun, J. M. Alvarez, and Z. Wu. Drivesuprim: Towards precise trajectory selection for end-to-end planning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 11910–11918, 2026

2026
[53]

S. Ross, G. Gordon, and J. A. Bagnell. A reduction of imitation learning and structured pre- diction to no-regret online learning. InAISTATS, 2011

2011
[54]

Reinforcement and Imitation Learning via Interactive No-Regret Learning

S. Ross and J. A. Bagnell. Reinforcement and imitation learning via interactive no-regret learning.arXiv preprint arXiv:1406.5979, 2014

work page Pith review arXiv 2014
[55]

Codevilla et al

F. Codevilla et al. Exploring the limitations of behavior cloning for autonomous driving. In ICCV, 2019

2019
[56]

Q. Li, Z. Peng, L. Feng, Z. Liu, C. Duan, W. Mo, and B. Zhou. Scenarionet: Open-source plat- form for large-scale traffic scenario simulation and modeling.Advances in Neural Information Processing Systems, 2023

2023
[57]

Christensen, D

H. Christensen, D. Paz, H. Zhang, D. Meyer, H. Xiang, Y . Han, Y . Liu, A. Liang, Z. Zhong, and S. Tang. Autonomous vehicles for micro-mobility.Autonomous Intelligent Systems, 1(1): 11, 2021

2021
[58]

R. C. Coulter. Implementation of the pure pursuit path tracking algorithm. Technical report, 1992

1992
[59]

K. Li, Z. Li, S. Lan, Y . Xie, Z. Zhang, J. Liu, Z. Wu, Z. Yu, and J. M. Alvarez. Hydra- mdp++: Advancing end-to-end driving via expert-guided hydra-distillation.arXiv preprint arXiv:2503.12820, 2025

work page arXiv 2025
[60]

Y . Wang, W. Luo, J. Bai, Y . Cao, T. Che, K. Chen, Y . Chen, J. Diamond, Y . Ding, W. Ding, et al. Alpamayo-r1: Bridging reasoning and action prediction for generalizable autonomous driving in the long tail.arXiv preprint arXiv:2511.00088, 2025

work page arXiv 2025
[61]

openpilot.https://github.com/commaai/openpilot, 2026

comma.ai. openpilot.https://github.com/commaai/openpilot, 2026

2026
[62]

Scalable Diffusion Models with Transformers

W. Peebles and S. Xie. Scalable diffusion models with transformers.arXiv preprint arXiv:2212.09748, 2022

work page internal anchor Pith review arXiv 2022
[63]

D. P. Kingma and J. Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014. 13

work page internal anchor Pith review Pith/arXiv arXiv 2014
[64]

A. P. Tarko. Surrogate measures of safety. 2018

2018
[65]

H. Guo, K. Xie, and M. Keyvan-Ekbatani. Modeling driver’s evasive behavior during safety– critical lane changes: Two-dimensional time-to-collision and deep reinforcement learning.Ac- cident Analysis & Prevention, 186:107063, 2023

2023
[66]

Laureshyn, ˚A

A. Laureshyn, ˚A. Svensson, and C. Hyd ´en. Evaluation of traffic safety, based on micro-level behavioural data: Theoretical framework and first implementation.Accident Analysis & Pre- vention, 42(6):1637–1646, 2010

2010
[67]

Ozbay, H

K. Ozbay, H. Yang, B. Bartin, and S. Mudigonda. Derivation and validation of new simulation- based surrogate safety measure.Transportation research record, 2083(1):105–113, 2008

2083
[68]

Ettinger, S

S. Ettinger, S. Cheng, B. Caine, C. Liu, H. Zhao, S. Pradhan, Y . Chai, B. Sapp, C. R. Qi, Y . Zhou, Z. Zoeggeler, A. Chouard, P. Sun, J. Ngiam, V . Vasudevan, A. McCauley, J. Pang, and D. Anguelov. Large scale interactive motion forecasting for autonomous driving: The waymo open motion dataset. InProceedings of the IEEE/CVF International Conference on Co...

2021
[69]

Z. Zhou, T. Cai, Y . Zhao, Seth Z.and Zhang, Z. Huang, B. Zhou, and J. Ma. Autovla: A vision-language-action model for end-to-end autonomous driving with adaptive reasoning and reinforcement fine-tuning.Advances in Neural Information Processing Systems (NeurIPS), 2025. 14 Appendix A More Related Works E2E Driving Policy Adaptation.Adaptation of end-to-end...

2025
[70]

Observational Alignment:We standardize sensor configurations, including camera extrinsics and intrinsics, across all evaluated benchmarks. This standardization ensures that performance failures are strictly attributable to the policy’s representational robustness rather than superficial geomet- 15 Waymo Log-replay Argoverse2 CARLA Procedural Generated Rea...
[71]

Kinematic and Control Alignment:We enforce consistency across vehicle dynamics and control execution layers to isolate the sources of performance degradation. By standardizing the transition dynamicsPduring the execution of predicted waypoints, we ensure that the measured OL-CL gap reflects fundamental planning-level failures rather than downstream actuat...
[72]

This geometry is discretized and stored as dense point sequences, each assigned specific type attributes

Map features.BridgeSim ingests diverse real-world datasets, including Waymo [45], nuPlan [12], nuScenes [1], via a standardized protocol to extract map geometry. This geometry is discretized and stored as dense point sequences, each assigned specific type attributes. We subsequently vectorize HD map elements, such as lanes, boundaries, crosswalks, and sto...
[73]

To address the varying sampling rates across different datasets, we implement an interpolation pipeline that up-samples tra- jectories

Object tracks.Dynamic agents are stored as temporal tracks containing state sequences, includ- ingposition,heading,velocity, anddimensionsalongside validity masks. To address the varying sampling rates across different datasets, we implement an interpolation pipeline that up-samples tra- jectories. This process utilizes linear interpolation for position a...
[74]

We map individual traffic signals to specific lane IDs and store their state sequences (e.g.,STOP,WAIT,GO)

Dynamic map states.Traffic light states are decoupled from the static map geometry. We map individual traffic signals to specific lane IDs and store their state sequences (e.g.,STOP,WAIT,GO). This mapping enables the simulator to dynamically update the lane signaling during replay, ensuring that the behavioral constraints of the agents remain in sync with...
[75]

The simulator essentially replays the preset vehicle tracks

Log-replay mode.In this mode, background agents strictly adhere to their recorded trajectories from the source dataset. The simulator essentially replays the preset vehicle tracks. While this pre- serves the exact real-world context, these agents remain non-reactive to the ego-vehicle’s deviations
[76]

To bridge the gap between static datasets and closed-loop interaction, we re- place log-replay policies with an Intelligent Driver Model (IDM)

IDM mode[46]. To bridge the gap between static datasets and closed-loop interaction, we re- place log-replay policies with an Intelligent Driver Model (IDM). While background vehicles are initialized at their ground-truth poses, their subsequent behaviors are governed by heuristic policies. These agents adhere to lane geometry, maintain safety gaps, and a...
[77]

This enables the generation of safety-critical edge cases, challenging ego-policies to react instantaneously to avoid imminent collisions

Adversarial mode.We provide an interface for injecting adversarial agents [47] into log-replay scenarios. This enables the generation of safety-critical edge cases, challenging ego-policies to react instantaneously to avoid imminent collisions. Coordinate Systems.A critical challenge in evaluating policies across different simulators is the misalignment o...
[78]

Orientation angles (yaw, roll) are inverted ac- cordingly to maintain kinematic consistency

Chirality normalization.For datasets using left-handed systems (e.g., CARLA, Bench2Drive), we apply a basis transformation during conversion. Orientation angles (yaw, roll) are inverted ac- cordingly to maintain kinematic consistency
[79]

All map features and trajectories are translated so that the map origin aligns with this anchor, ensuring numerical stability for the planner and simulator physics engine

Spatial anchoring.To mitigate floating-point errors inherent in large-scale global coordinates (common in nuScenes and nuPlan), we normalize all spatial data relative to a scenario anchor, de- fined as the ego-vehicle’s position att= 0. All map features and trajectories are translated so that the map origin aligns with this anchor, ensuring numerical stab...
[80]

The environment dynamics are strictly governed by the ground-truth trajectories from the source dataset

Non-reactive open-loop simulation.In open-loop mode, the simulation decouples the ego- agent’s trajectory planning from the environment’s state transitions. The environment dynamics are strictly governed by the ground-truth trajectories from the source dataset. In this mode, the ego- vehicle’s state transition is forced to match the logged trajectoryT log...

Showing first 80 references.