Causality-Aware End-to-End Autonomous Driving via Ego-Centric Joint Scene Modeling

Jinkyu Kim; Joon Seo; Jungbeom Lee; Minseung Lee; Seokha Moon

arxiv: 2605.13646 · v2 · pith:CLO7UEVMnew · submitted 2026-05-13 · 💻 cs.RO · cs.AI

Causality-Aware End-to-End Autonomous Driving via Ego-Centric Joint Scene Modeling

Seokha Moon , Minseung Lee , Joon Seo , Jinkyu Kim , Jungbeom Lee This is my paper

Pith reviewed 2026-05-20 20:51 UTC · model grok-4.3

classification 💻 cs.RO cs.AI

keywords autonomous drivingend-to-end planningcausal modelingjoint scene representationclosed-loop feedbackmulti-agent interactiontrajectory prediction

0 comments

The pith

Causal modeling of ego-agent interactions improves closed-loop driving performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CaAD to fix a gap in end-to-end autonomous driving: most methods treat the ego vehicle and surrounding agents as independent, which breaks down when decisions must respond to each other. It builds an ego-centric module that learns these reciprocal causal links inside one shared latent scene model. A second stage then aligns the ego policy to closed-loop feedback drawn from traffic and maps. The result is stronger trajectory consistency in dense interaction scenarios. Tests on two standard benchmarks show clear gains in driving score, success rate, and planning metric.

Core claim

CaAD captures causal inter-dependencies between the ego vehicle and surrounding agents within a shared latent scene representation. An ego-centric joint-causal modeling module built on the marginal prediction branch learns dependencies with interaction-relevant agents. A causality-aware policy alignment stage then uses joint-mode embeddings to match the stochastic ego policy against planning-oriented closed-loop feedback computed from surrounding traffic and map context.

What carries the argument

Ego-centric joint-causal modeling module that learns reciprocal causal relations between ego decisions and neighboring agent behaviors inside a shared latent scene representation.

If this is right

Trajectory predictions become more consistent when ego and agent actions are reasoned about jointly rather than separately.
Closed-loop planning benefits from explicit alignment between stochastic policy outputs and traffic-map feedback.
The same latent scene representation supports both marginal prediction and causal dependency capture.
Benchmark scores rise when the model accounts for how ego maneuvers alter surrounding agent behavior.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same joint modeling idea could extend to other multi-agent control problems such as robot swarms or traffic signal coordination.
Removing the need for explicit causal graphs may let the method scale to larger numbers of agents without extra supervision cost.
Real-world deployment would still require checking whether the learned causal relations hold under sensor noise and unseen map layouts.

Load-bearing premise

The joint-causal modeling module will learn and use the back-and-forth causal links between ego choices and other agents without any separate causal discovery step or extra labels.

What would settle it

A controlled ablation that removes only the joint-causal modeling module and measures the resulting drop in closed-loop Driving Score on interaction-heavy test routes.

Figures

Figures reproduced from arXiv: 2605.13646 by Jinkyu Kim, Joon Seo, Jungbeom Lee, Minseung Lee, Seokha Moon.

**Figure 2.** Figure 2: Overview of the proposed CaAD. (a) CaAD builds on a query-based architecture and [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Comparison of joint mode selection strategies. (a) Conventional methods select the joint [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Qualitative examples of interaction-aware planning. We visualize consecutive frames in [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

read the original abstract

End-to-end autonomous driving, which bypasses traditional modular pipelines by directly predicting future trajectories from sensor inputs, has recently achieved substantial progress. However, existing methods often overlook the causal inter-dependencies in ego-vehicle planning, ignoring the reciprocal relations between the ego vehicle and surrounding agents. This causal oversight leads to inconsistent and unreliable trajectory predictions, especially in interaction-critical scenarios where ego decisions and neighboring agent behaviors must be reasoned about jointly. To address this limitation, we propose CaAD, a Causality-aware end-to-end Autonomous Driving framework that captures these dependencies within a shared latent scene representation. First, we propose an ego-centric joint-causal modeling module that builds on the marginal prediction branch, and learns causal dependencies between the ego vehicle and interaction-relevant agents. Second, we employ a causality-aware policy alignment stage implemented with joint-mode embeddings to align the stochastic ego policy with planning-oriented closed-loop feedback computed from surrounding traffic and map context. On the Bench2Drive and NAVSIM benchmarks, CaAD demonstrates strong closed-loop planning performance, achieving a Driving Score of 87.53 and Success Rate of 71.81 on Bench2Drive, and a PDMS of 91.1 on NAVSIM. The project page is available at https://moonseokha.github.io/CaAD/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CaAD adds an ego-centric joint-causal module on the marginal branch plus policy alignment and posts competitive numbers on Bench2Drive and NAVSIM, but the causal claims rest on design description rather than direct tests.

read the letter

The one thing to know is that this paper adds an ego-centric joint-causal modeling module to end-to-end driving and claims better closed-loop performance by capturing reciprocal influences between ego and agents. The numbers look good: 87.53 driving score and 71.81 success on Bench2Drive, 91.1 on NAVSIM. The second thing is that the causal justification is not strongly supported yet. What is new is the concrete architecture that starts from the marginal prediction branch, learns causal dependencies in a shared latent space, and then aligns the stochastic policy using joint-mode embeddings for closed-loop feedback. This pairing hasn't been laid out exactly this way before. The paper does well at keeping the overall pipeline simple and reporting results on two relevant benchmarks without inventing new metrics. The soft spots are in the validation. The abstract gives benchmark numbers but no ablations, no statistical tests, and no details on how the causal dependencies are actually learned or verified. The concern that it might be capturing correlations rather than directed causality is reasonable because there's no mention of do-calculus, interventions, or explicit causal discovery. If the full methods section has those, it would address the issue directly; from what's visible, the central claim rests on the module's design alone. This work is for people already in the end-to-end autonomous driving area who want ideas for handling interactions better. It assumes familiarity with the benchmarks and prior modular vs end-to-end debates. I would bring this to a reading group as a maybe, because the idea is timely but the evidence needs checking. I might cite it if the full paper shows the ablations work out. It deserves peer review because it has real results and a clear motivation, even if the causality angle needs more scrutiny.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces CaAD, a causality-aware end-to-end autonomous driving framework. It proposes an ego-centric joint-causal modeling module built on the marginal prediction branch to capture causal dependencies between the ego vehicle and interaction-relevant agents in a shared latent scene representation, followed by a causality-aware policy alignment stage using joint-mode embeddings to align the stochastic ego policy with planning-oriented closed-loop feedback. The approach is evaluated on Bench2Drive and NAVSIM, reporting a Driving Score of 87.53, Success Rate of 71.81, and PDMS of 91.1.

Significance. If the causal modeling component demonstrably learns and exploits reciprocal relations rather than correlations, the work could meaningfully advance closed-loop planning in interaction-critical scenarios by reducing inconsistent trajectory predictions. The benchmark results suggest practical gains over prior end-to-end methods, and the explicit focus on causality addresses a recognized limitation in the field.

major comments (2)

[ego-centric joint-causal modeling module] Ego-centric joint-causal modeling module (described in the methods): No explicit mechanism is provided for enforcing or validating reciprocal causality, such as do-calculus interventions, counterfactual supervision, or ablation isolating this module from standard joint prediction. This is load-bearing for the central claim, as the subsequent policy alignment stage depends on the latent representation encoding directed causal influence rather than statistical co-occurrence.
[experimental results] Experimental results (Bench2Drive and NAVSIM evaluations): The reported metrics lack ablation studies, error bars, or statistical significance tests that would isolate the contribution of the causality-aware components. Without these, it is unclear whether the gains (e.g., Driving Score 87.53) stem from the proposed causal modeling or from other architectural choices.

minor comments (1)

[abstract and methods] The abstract and methods would benefit from clearer notation distinguishing the marginal prediction branch from the joint-causal extension.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below with clarifications and indicate planned revisions where appropriate.

read point-by-point responses

Referee: [ego-centric joint-causal modeling module] Ego-centric joint-causal modeling module (described in the methods): No explicit mechanism is provided for enforcing or validating reciprocal causality, such as do-calculus interventions, counterfactual supervision, or ablation isolating this module from standard joint prediction. This is load-bearing for the central claim, as the subsequent policy alignment stage depends on the latent representation encoding directed causal influence rather than statistical co-occurrence.

Authors: The ego-centric joint-causal modeling module is designed to capture reciprocal causal dependencies by extending the marginal prediction branch into a shared latent scene representation that conditions predictions on directed influences between the ego vehicle and interaction-relevant agents. While the current implementation does not incorporate explicit do-calculus interventions or counterfactual supervision, it goes beyond statistical co-occurrence through its ego-centric conditioning and joint-mode structure. To strengthen validation of this distinction, we will add an ablation study isolating the module from a standard joint prediction baseline and expand the methods discussion on how the architecture encodes directed causal relations rather than mere correlations. revision: partial
Referee: [experimental results] Experimental results (Bench2Drive and NAVSIM evaluations): The reported metrics lack ablation studies, error bars, or statistical significance tests that would isolate the contribution of the causality-aware components. Without these, it is unclear whether the gains (e.g., Driving Score 87.53) stem from the proposed causal modeling or from other architectural choices.

Authors: We agree that the current experimental section would benefit from additional controls to isolate the causality-aware components. We will incorporate ablation studies that remove the ego-centric joint-causal modeling module and the causality-aware policy alignment stage individually. We will also report error bars from multiple independent runs and include statistical significance tests for the key metrics such as Driving Score on Bench2Drive and PDMS on NAVSIM. These changes will be added to the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain.

full rationale

The paper presents a standard end-to-end modeling pipeline: an ego-centric joint-causal modeling module built on a marginal prediction branch, followed by a causality-aware policy alignment stage using joint-mode embeddings. These components are trained on driving datasets and evaluated on independent benchmarks (Bench2Drive, NAVSIM) with reported metrics (Driving Score 87.53, PDMS 91.1). No equations, self-definitions, or fitted-input-as-prediction patterns appear in the abstract or described structure that would make the final claims equivalent to the inputs by construction. The central claims rest on architectural choices and empirical results rather than reducing to self-citation chains or renamed inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; therefore no specific free parameters, axioms, or invented entities can be extracted. The framework description implies standard neural-network parameters and the usual assumptions of supervised imitation or reinforcement learning on driving datasets.

pith-pipeline@v0.9.0 · 5772 in / 1267 out tokens · 69144 ms · 2026-05-20T20:51:23.358353+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean LogicNat recovery unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

causality-aware policy alignment stage implemented with joint-mode embeddings

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · 13 internal anchors

[1]

nuscenes: A multimodal dataset for autonomous driving

Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020

work page 2020
[2]

NuPlan: A closed-loop ML-based planning benchmark for autonomous vehicles

Holger Caesar, Juraj Kabzan, Kok Seang Tan, Whye Kit Fong, Eric Wolff, Alex Lang, Luke Fletcher, Oscar Beijbom, and Sammy Omari. nuplan: A closed-loop ml-based planning bench- mark for autonomous vehicles.arXiv preprint arXiv:2106.11810, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[3]

Gri: General reinforced imitation and its application to vision-based autonomous driving.Robotics, 12(5):127, 2023

Raphael Chekroun, Marin Toromanoff, Sascha Hornauer, and Fabien Moutarde. Gri: General reinforced imitation and its application to vision-based autonomous driving.Robotics, 12(5):127, 2023

work page 2023
[4]

End-to-end autonomous driving: Challenges and frontiers.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):10164–10183, 2024

Li Chen, Penghao Wu, Kashyap Chitta, Bernhard Jaeger, Andreas Geiger, and Hongyang Li. End-to-end autonomous driving: Challenges and frontiers.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):10164–10183, 2024

work page 2024
[5]

VADv2: End-to-End Vectorized Autonomous Driving via Probabilistic Planning

Shaoyu Chen, Bo Jiang, Hao Gao, Bencheng Liao, Qing Xu, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. Vadv2: End-to-end vectorized autonomous driving via probabilistic planning.arXiv preprint arXiv:2402.13243, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[6]

Neat: Neural attention fields for end- to-end autonomous driving

Kashyap Chitta, Aditya Prakash, and Andreas Geiger. Neat: Neural attention fields for end- to-end autonomous driving. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 15793–15803, 2021

work page 2021
[7]

Transfuser: Imitation with transformer-based sensor fusion for autonomous driving.IEEE transactions on pattern analysis and machine intelligence, 45(11):12878–12895, 2022

Kashyap Chitta, Aditya Prakash, Bernhard Jaeger, Zehao Yu, Katrin Renz, and Andreas Geiger. Transfuser: Imitation with transformer-based sensor fusion for autonomous driving.IEEE transactions on pattern analysis and machine intelligence, 45(11):12878–12895, 2022

work page 2022
[8]

Parting with mis- conceptions about learning-based vehicle motion planning

Daniel Dauner, Marcel Hallgarten, Andreas Geiger, and Kashyap Chitta. Parting with mis- conceptions about learning-based vehicle motion planning. InConference on Robot Learning, pages 1268–1281. PMLR, 2023

work page 2023
[9]

Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking.Advances in Neural Information Processing Systems, 37:28706–28719, 2024

Daniel Dauner, Marcel Hallgarten, Tianyu Li, Xinshuo Weng, Zhiyu Huang, Zetong Yang, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, et al. Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking.Advances in Neural Information Processing Systems, 37:28706–28719, 2024

work page 2024
[10]

D’Souza, Samira Ebrahimi Kahou, Felix Heide, and Christopher Pal

Roger Girgis, Florian Golemo, Felipe Codevilla, Martin Weiss, Justin A. D’Souza, Samira Ebrahimi Kahou, Felix Heide, and Christopher Pal. Latent variable sequential set transformers for joint multi-agent motion prediction. InInternational Conference on Learning Representations, 2022

work page 2022
[11]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

work page 2016
[13]

Planning-oriented autonomous driving

Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17853–17862, 2023

work page 2023
[14]

EMMA: End-to-End Multimodal Model for Autonomous Driving

Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, et al. Emma: End-to-end multimodal model for autonomous driving.arXiv preprint arXiv:2410.23262, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[15]

Bench2drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving.Advances in Neural Information Processing Systems, 37:819–844, 2024

Xiaosong Jia, Zhenjie Yang, Qifeng Li, Zhiyuan Zhang, and Junchi Yan. Bench2drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving.Advances in Neural Information Processing Systems, 37:819–844, 2024. 10

work page 2024
[16]

Drivetransformer: Unified transformer for scalable end-to-end autonomous driving.arXiv preprint arXiv:2503.07656, 2025

Xiaosong Jia, Junqi You, Zhiyuan Zhang, and Junchi Yan. Drivetransformer: Unified transformer for scalable end-to-end autonomous driving.arXiv preprint arXiv:2503.07656, 2025

work page arXiv 2025
[17]

Vad: Vectorized scene representation for efficient autonomous driving

Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. Vad: Vectorized scene representation for efficient autonomous driving. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340–8350, 2023

work page 2023
[18]

AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning

Bo Jiang, Shaoyu Chen, Qian Zhang, Wenyu Liu, and Xinggang Wang. Alphadrive: Unleashing the power of vlms in autonomous driving via reinforcement learning and reasoning.arXiv preprint arXiv:2503.07608, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[19]

Safedrive: Fine-grained safety reasoning for end-to-end driving in a sparse world.arXiv preprint arXiv:2602.18887, 2026

Jungho Kim, Jiyong Oh, Seunghoon Yu, Hongjae Shin, Donghyuk Kwak, and Jun Won Choi. Safedrive: Fine-grained safety reasoning for end-to-end driving in a sparse world.arXiv preprint arXiv:2602.18887, 2026

work page arXiv 2026
[20]

Finetuning generative trajectory model with reinforcement learning from human feedback.arXiv preprint arXiv:2503.10434, 2025

Derun Li, Changye Li, Yue Wang, Jianwei Ren, Xin Wen, Pengxiang Li, Leimeng Xu, Kun Zhan, Peng Jia, Xianpeng Lang, et al. Learning personalized driving styles via reinforcement learning from human feedback.arXiv preprint arXiv:2503.10434, 2025

work page arXiv 2025
[21]

Hydra-mdp++: Advancing end-to-end driving via expert-guided hydra-distillation.arXiv preprint arXiv:2503.12820, 2025

Kailin Li, Zhenxin Li, Shiyi Lan, Yuan Xie, Zhizhong Zhang, Jiayi Liu, Zuxuan Wu, Zhiding Yu, and Jose M Alvarez. Hydra-mdp++: Advancing end-to-end driving via expert-guided hydra-distillation.arXiv preprint arXiv:2503.12820, 2025

work page arXiv 2025
[22]

Think2drive: Efficient reinforcement learning by thinking with latent world model for autonomous driving (in carla-v2)

Qifeng Li, Xiaosong Jia, Shaobo Wang, and Junchi Yan. Think2drive: Efficient reinforcement learning by thinking with latent world model for autonomous driving (in carla-v2). InEuropean Conference on Computer Vision, pages 142–158. Springer, 2024

work page 2024
[23]

Enhancing End-to-End Autonomous Driving with Latent World Model

Yingyan Li, Lue Fan, Jiawei He, Yuqi Wang, Yuntao Chen, Zhaoxiang Zhang, and Tieniu Tan. Enhancing end-to-end autonomous driving with latent world model.arXiv preprint arXiv:2406.08481, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[24]

turn left

Yingyan Li, Yuqi Wang, Yang Liu, Jiawei He, Lue Fan, and Zhaoxiang Zhang. End-to-end driving with online trajectory evaluation via bev world model.arXiv preprint arXiv:2504.01941, 2025

work page arXiv 2025
[25]

ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving

Yongkang Li, Kaixin Xiong, Xiangyu Guo, Fang Li, Sixu Yan, Gangwei Xu, Lijun Zhou, Long Chen, Haiyang Sun, Bing Wang, et al. Recogdrive: A reinforced cognitive framework for end-to-end autonomous driving.arXiv preprint arXiv:2506.08052, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[26]

Hydra-next: Robust closed-loop driving with open-loop training.arXiv preprint arXiv:2503.12030, 2025

Zhenxin Li, Shihao Wang, Shiyi Lan, Zhiding Yu, Zuxuan Wu, and Jose M Alvarez. Hydra-next: Robust closed-loop driving with open-loop training.arXiv preprint arXiv:2503.12030, 2025

work page arXiv 2025
[27]

Bevformer: learning bird’s-eye-view representation from lidar-camera via spatiotemporal transformers.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Qiao Yu, and Jifeng Dai. Bevformer: learning bird’s-eye-view representation from lidar-camera via spatiotemporal transformers.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

work page 2024
[28]

Cirl: Controllable imitative rein- forcement learning for vision-based self-driving

Xiaodan Liang, Tairui Wang, Luona Yang, and Eric Xing. Cirl: Controllable imitative rein- forcement learning for vision-based self-driving. InProceedings of the European conference on computer vision (ECCV), pages 584–599, 2018

work page 2018
[29]

Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving

Bencheng Liao, Shaoyu Chen, Haoran Yin, Bo Jiang, Cheng Wang, Sixu Yan, Xinbang Zhang, Xiangyu Li, Ying Zhang, Qian Zhang, et al. Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 12037–12047, 2025

work page 2025
[30]

Focal loss for dense object detection

Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollar. Focal loss for dense object detection. InProceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017

work page 2017
[31]

Bevfusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation

Zhijian Liu, Haotian Tang, Alexander Amini, Xinyu Yang, Huizi Mao, Daniela L Rus, and Song Han. Bevfusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation. In2023 IEEE international conference on robotics and automation (ICRA), pages 2774–2781. IEEE, 2023. 11

work page 2023
[32]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[33]

Imitation is not enough: Robustifying imitation with reinforcement learning for challenging driving scenarios

Yiren Lu, Justin Fu, George Tucker, Xinlei Pan, Eli Bronstein, Rebecca Roelofs, Benjamin Sapp, Brandyn White, Aleksandra Faust, Shimon Whiteson, et al. Imitation is not enough: Robustifying imitation with reinforcement learning for challenging driving scenarios. In2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7553–...

work page 2023
[34]

JFP: Joint future prediction with interactive multi-agent modeling for autonomous driving

Wenjie Luo, Cheolho Park, Andre Cornman, Benjamin Sapp, and Dragomir Anguelov. JFP: Joint future prediction with interactive multi-agent modeling for autonomous driving. In Proceedings of The 7th Conference on Robot Learning, volume 205 ofProceedings of Machine Learning Research, pages 1457–1467. PMLR, 2023

work page 2023
[35]

Visiontrap: Vision-augmented trajectory prediction guided by textual descriptions

Seokha Moon, Hyun Woo, Hongbeen Park, Haeji Jung, Reza Mahjourian, Hyung-gun Chi, Hyerin Lim, Sangpil Kim, and Jinkyu Kim. Visiontrap: Vision-augmented trajectory prediction guided by textual descriptions. InEuropean Conference on Computer Vision, pages 361–379. Springer, 2024

work page 2024
[36]

Wayformer: Motion forecasting via simple & efﬁcient attention networks,

Nigamaa Nayakanti, Rami Al-Rfou, Aurick Zhou, Kratarth Goel, Khaled S Refaat, and Benjamin Sapp. Wayformer: Motion forecasting via simple & efficient attention networks.arXiv preprint arXiv:2207.05844, 2022

work page arXiv 2022
[37]

Scene transformer: A unified architecture for predicting mul- tiple agent trajectories

Jiquan Ngiam, Benjamin Caine, Vijay Vasudevan, Zhengdong Zhang, Hao-Tien Lewis Chiang, Jeffrey Ling, Rebecca Roelofs, Alex Bewley, Chenxi Liu, Ashish Venugopal, et al. Scene transformer: A unified architecture for predicting multiple agent trajectories.arXiv preprint arXiv:2106.08417, 2021

work page arXiv 2021
[38]

Refaat, Rami Al-Rfou, and Benjamin Sapp

Ari Seff, Brian Cera, Dian Chen, Mason Ng, Aurick Zhou, Nigamaa Nayakanti, Khaled S. Refaat, Rami Al-Rfou, and Benjamin Sapp. MotionLM: Multi-agent motion forecasting as language modeling. InProceedings of the IEEE/CVF International Conference on Computer Vision, 2023

work page 2023
[39]

Drivedpo: Policy learning via safety dpo for end-to-end autonomous driving.arXiv preprint arXiv:2509.17940, 2025

Shuyao Shang, Yuntao Chen, Yuqi Wang, Yingyan Li, and Zhaoxiang Zhang. Drivedpo: Policy learning via safety dpo for end-to-end autonomous driving.arXiv preprint arXiv:2509.17940, 2025

work page arXiv 2025
[40]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[41]

Motion transformer with global intention localization and local movement refinement

Shaoshuai Shi, Li Jiang, Dengxin Dai, and Bernt Schiele. Motion transformer with global intention localization and local movement refinement. InAdvances in Neural Information Processing Systems, 2022

work page 2022
[42]

Don’t shake the wheel: Momentum-aware planning in end-to-end autonomous driving

Ziying Song, Caiyan Jia, Lin Liu, Hongyu Pan, Yongchang Zhang, Junming Wang, Xingyu Zhang, Shaoqing Xu, Lei Yang, and Yadan Luo. Don’t shake the wheel: Momentum-aware planning in end-to-end autonomous driving. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 22432–22441, 2025

work page 2025
[43]

DIVER: Reinforced Diffusion Breaks Imitation Bottlenecks in End-to-End Autonomous Driving

Ziying Song, Lin Liu, Hongyu Pan, Bencheng Liao, Mingzhe Guo, Lei Yang, Yongchang Zhang, Shaoqing Xu, Caiyan Jia, and Yadan Luo. Diver: Reinforced diffusion breaks imitation bottlenecks in end-to-end autonomous driving.arXiv preprint arXiv:2507.04049, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[44]

M2i: From factored marginal to joint trajectory prediction for highly interactive environments

Qiao Sun, Xin Huang, Brian C Williams, Hang Zhao, et al. M2i: From factored marginal to joint trajectory prediction for highly interactive environments. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16116–16126, 2022

work page 2022
[45]

Sparsedrivev2: Scoring is all you need for end-to-end autonomous driving.arXiv preprint arXiv:2603.29163, 2026

Wenchao Sun, Xuewu Lin, Keyu Chen, Zixiang Pei, Xiang Li, Yining Shi, and Sifa Zheng. Sparsedrivev2: Scoring is all you need for end-to-end autonomous driving.arXiv preprint arXiv:2603.29163, 2026. 12

work page arXiv 2026
[46]

Sparsedrive: End-to-end autonomous driving via sparse scene representation

Wenchao Sun, Xuewu Lin, Yining Shi, Chuang Zhang, Haoran Wu, and Sifa Zheng. Sparsedrive: End-to-end autonomous driving via sparse scene representation. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 8795–8801. IEEE, 2025

work page 2025
[47]

Hip-ad: Hierarchical and multi-granularity planning with de- formable attention for autonomous driving in a single decoder

Yingqi Tang, Zhuoran Xu, Zhaotie Meng, and Erkang Cheng. Hip-ad: Hierarchical and multi- granularity planning with deformable attention for autonomous driving in a single decoder. arXiv preprint arXiv:2503.08612, 2025

work page arXiv 2025
[48]

Attention is all you need.Advances in Neural Information Processing Systems, 2017

A Vaswani. Attention is all you need.Advances in Neural Information Processing Systems, 2017

work page 2017
[49]

RetroMotion: Retrocausal Motion Forecasting Models are Instructable

Royden Wagner, Omer Sahin Tas, Felix Hauser, Marlon Steiner, Dominik Strutz, Abhishek Vivekanandan, Carlos Fernandez, and Christoph Stiller. Retromotion: Retrocausal motion forecasting models are instructable.arXiv preprint arXiv:2505.20414, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[50]

Jointmotion: joint self-supervision for joint motion prediction.arXiv preprint arXiv:2403.05489, 2024

Royden Wagner, Omer Sahin Tas, Marvin Klemp, and Carlos Fernandez. Jointmotion: joint self-supervision for joint motion prediction.arXiv preprint arXiv:2403.05489, 2024

work page arXiv 2024
[51]

Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail

Yan Wang, Wenjie Luo, Junjie Bai, Yulong Cao, Tong Che, Ke Chen, Yuxiao Chen, Jenna Dia- mond, Yifan Ding, Wenhao Ding, et al. Alpamayo-r1: Bridging reasoning and action prediction for generalizable autonomous driving in the long tail.arXiv preprint arXiv:2511.00088, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[52]

Para-drive: Par- allelized architecture for real-time autonomous driving

Xinshuo Weng, Boris Ivanovic, Yan Wang, Yue Wang, and Marco Pavone. Para-drive: Par- allelized architecture for real-time autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15449–15458, 2024

work page 2024
[53]

Goalflow: Goal-driven flow matching for multimodal trajectories generation in end- to-end autonomous driving

Zebin Xing, Xingyu Zhang, Yang Hu, Bo Jiang, Tong He, Qian Zhang, Xiaoxiao Long, and Wei Yin. Goalflow: Goal-driven flow matching for multimodal trajectories generation in end- to-end autonomous driving. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 1602–1611, 2025

work page 2025
[54]

Second: Sparsely embedded convolutional detection.Sensors, 18(10), 2018

Yan Yan, Yuxing Mao, and Bo Li. Second: Sparsely embedded convolutional detection.Sensors, 18(10), 2018

work page 2018
[55]

Raw2drive: Reinforcement learning with aligned world models for end-to-end autonomous driving (in carla v2).arXiv preprint arXiv:2505.16394, 2025

Zhenjie Yang, Xiaosong Jia, Qifeng Li, Xue Yang, Maoqing Yao, and Junchi Yan. Raw2drive: Reinforcement learning with aligned world models for end-to-end autonomous driving (in carla v2).arXiv preprint arXiv:2505.16394, 2025

work page arXiv 2025
[56]

Center-based 3d object detection and track- ing

Tianwei Yin, Xingyi Zhou, and Philipp Krahenbuhl. Center-based 3d object detection and track- ing. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11784–11793, 2021

work page 2021
[57]

AutoDrive-R$^2$: Incentivizing Reasoning and Self-Reflection Capacity for VLA Model in Autonomous Driving

Zhenlong Yuan, Chengxuan Qian, Jing Tang, Rui Chen, Zijian Song, Lei Sun, Xiangxiang Chu, Yujun Cai, Dapeng Zhang, and Shuo Li. Autodrive-r2: Incentivizing reasoning and self- reflection capacity for vla model in autonomous driving.arXiv preprint arXiv:2509.01944, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[58]

End-to-end interpretable neural motion planner

Wenyuan Zeng, Wenjie Luo, Simon Suo, Abbas Sadat, Bin Yang, Sergio Casas, and Raquel Urtasun. End-to-end interpretable neural motion planner. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8660–8669, 2019

work page 2019
[59]

Future-aware end-to-end driving: Bidirectional modeling of trajectory planning and scene evolution.arXiv preprint arXiv:2510.11092, 2025

Bozhou Zhang, Nan Song, Jingyu Li, Xiatian Zhu, Jiankang Deng, and Li Zhang. Future-aware end-to-end driving: Bidirectional modeling of trajectory planning and scene evolution.arXiv preprint arXiv:2510.11092, 2025

work page arXiv 2025
[60]

Resworld: Temporal residual world model for end-to-end autonomous driving.arXiv preprint arXiv:2602.10884, 2026

Jinqing Zhang, Zehua Fu, Zelin Xu, Wenying Dai, Qingjie Liu, and Yunhong Wang. Resworld: Temporal residual world model for end-to-end autonomous driving.arXiv preprint arXiv:2602.10884, 2026

work page arXiv 2026
[61]

Genad: Gen- erative end-to-end autonomous driving

Wenzhao Zheng, Ruiqi Song, Xianda Guo, Chenming Zhang, and Long Chen. Genad: Gen- erative end-to-end autonomous driving. InEuropean Conference on Computer Vision, pages 87–104. Springer, 2024. 13

work page 2024
[62]

World4drive: End-to-end autonomous driving via intention-aware physical latent world model

Yupeng Zheng, Pengxuan Yang, Zebin Xing, Qichao Zhang, Yuhang Zheng, Yinfeng Gao, Pengfei Li, Teng Zhang, Zhongpu Xia, Peng Jia, et al. World4drive: End-to-end autonomous driving via intention-aware physical latent world model. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 28632–28642, 2025

work page 2025
[63]

V oxelnet: End-to-end learning for point cloud based 3d object detection

Yin Zhou and Oncel Tuzel. V oxelnet: End-to-end learning for point cloud based 3d object detection. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 4490–4499, 2018

work page 2018
[64]

Query-centric trajectory prediction

Zikang Zhou, Jianping Wang, Yung-Hui Li, and Yu-Kai Huang. Query-centric trajectory prediction. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17863–17873, 2023

work page 2023
[65]

Hidden biases of end-to-end driving datasets

Julian Zimmerlin, Jens Beißwenger, Bernhard Jaeger, Andreas Geiger, and Kashyap Chitta. Hidden biases of end-to-end driving datasets.arXiv preprint arXiv:2412.09602, 2024

work page arXiv 2024
[66]

Diffusiondrivev2: Reinforcement learning-constrained truncated diffusion modeling in end-to-end autonomous driving, 2025

Jialv Zou, Shaoyu Chen, Bencheng Liao, Zhiyu Zheng, Yuehao Song, Lefei Zhang, Qian Zhang, Wenyu Liu, and Xinggang Wang. Diffusiondrivev2: Reinforcement learning-constrained truncated diffusion modeling in end-to-end autonomous driving, 2025. 14 Appendix A Additional Related Work A.1 End-to-End Planning for Autonomous Driving End-to-end autonomous driving ...

work page 2025

[1] [1]

nuscenes: A multimodal dataset for autonomous driving

Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020

work page 2020

[2] [2]

NuPlan: A closed-loop ML-based planning benchmark for autonomous vehicles

Holger Caesar, Juraj Kabzan, Kok Seang Tan, Whye Kit Fong, Eric Wolff, Alex Lang, Luke Fletcher, Oscar Beijbom, and Sammy Omari. nuplan: A closed-loop ml-based planning bench- mark for autonomous vehicles.arXiv preprint arXiv:2106.11810, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[3] [3]

Gri: General reinforced imitation and its application to vision-based autonomous driving.Robotics, 12(5):127, 2023

Raphael Chekroun, Marin Toromanoff, Sascha Hornauer, and Fabien Moutarde. Gri: General reinforced imitation and its application to vision-based autonomous driving.Robotics, 12(5):127, 2023

work page 2023

[4] [4]

End-to-end autonomous driving: Challenges and frontiers.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):10164–10183, 2024

Li Chen, Penghao Wu, Kashyap Chitta, Bernhard Jaeger, Andreas Geiger, and Hongyang Li. End-to-end autonomous driving: Challenges and frontiers.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):10164–10183, 2024

work page 2024

[5] [5]

VADv2: End-to-End Vectorized Autonomous Driving via Probabilistic Planning

Shaoyu Chen, Bo Jiang, Hao Gao, Bencheng Liao, Qing Xu, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. Vadv2: End-to-end vectorized autonomous driving via probabilistic planning.arXiv preprint arXiv:2402.13243, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[6] [6]

Neat: Neural attention fields for end- to-end autonomous driving

Kashyap Chitta, Aditya Prakash, and Andreas Geiger. Neat: Neural attention fields for end- to-end autonomous driving. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 15793–15803, 2021

work page 2021

[7] [7]

Transfuser: Imitation with transformer-based sensor fusion for autonomous driving.IEEE transactions on pattern analysis and machine intelligence, 45(11):12878–12895, 2022

Kashyap Chitta, Aditya Prakash, Bernhard Jaeger, Zehao Yu, Katrin Renz, and Andreas Geiger. Transfuser: Imitation with transformer-based sensor fusion for autonomous driving.IEEE transactions on pattern analysis and machine intelligence, 45(11):12878–12895, 2022

work page 2022

[8] [8]

Parting with mis- conceptions about learning-based vehicle motion planning

Daniel Dauner, Marcel Hallgarten, Andreas Geiger, and Kashyap Chitta. Parting with mis- conceptions about learning-based vehicle motion planning. InConference on Robot Learning, pages 1268–1281. PMLR, 2023

work page 2023

[9] [9]

Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking.Advances in Neural Information Processing Systems, 37:28706–28719, 2024

Daniel Dauner, Marcel Hallgarten, Tianyu Li, Xinshuo Weng, Zhiyu Huang, Zetong Yang, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, et al. Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking.Advances in Neural Information Processing Systems, 37:28706–28719, 2024

work page 2024

[10] [10]

D’Souza, Samira Ebrahimi Kahou, Felix Heide, and Christopher Pal

Roger Girgis, Florian Golemo, Felipe Codevilla, Martin Weiss, Justin A. D’Souza, Samira Ebrahimi Kahou, Felix Heide, and Christopher Pal. Latent variable sequential set transformers for joint multi-agent motion prediction. InInternational Conference on Learning Representations, 2022

work page 2022

[11] [11]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[12] [12]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

work page 2016

[13] [13]

Planning-oriented autonomous driving

Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17853–17862, 2023

work page 2023

[14] [14]

EMMA: End-to-End Multimodal Model for Autonomous Driving

Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, et al. Emma: End-to-end multimodal model for autonomous driving.arXiv preprint arXiv:2410.23262, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[15] [15]

Bench2drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving.Advances in Neural Information Processing Systems, 37:819–844, 2024

Xiaosong Jia, Zhenjie Yang, Qifeng Li, Zhiyuan Zhang, and Junchi Yan. Bench2drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving.Advances in Neural Information Processing Systems, 37:819–844, 2024. 10

work page 2024

[16] [16]

Drivetransformer: Unified transformer for scalable end-to-end autonomous driving.arXiv preprint arXiv:2503.07656, 2025

Xiaosong Jia, Junqi You, Zhiyuan Zhang, and Junchi Yan. Drivetransformer: Unified transformer for scalable end-to-end autonomous driving.arXiv preprint arXiv:2503.07656, 2025

work page arXiv 2025

[17] [17]

Vad: Vectorized scene representation for efficient autonomous driving

Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. Vad: Vectorized scene representation for efficient autonomous driving. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340–8350, 2023

work page 2023

[18] [18]

AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning

Bo Jiang, Shaoyu Chen, Qian Zhang, Wenyu Liu, and Xinggang Wang. Alphadrive: Unleashing the power of vlms in autonomous driving via reinforcement learning and reasoning.arXiv preprint arXiv:2503.07608, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[19] [19]

Safedrive: Fine-grained safety reasoning for end-to-end driving in a sparse world.arXiv preprint arXiv:2602.18887, 2026

Jungho Kim, Jiyong Oh, Seunghoon Yu, Hongjae Shin, Donghyuk Kwak, and Jun Won Choi. Safedrive: Fine-grained safety reasoning for end-to-end driving in a sparse world.arXiv preprint arXiv:2602.18887, 2026

work page arXiv 2026

[20] [20]

Finetuning generative trajectory model with reinforcement learning from human feedback.arXiv preprint arXiv:2503.10434, 2025

Derun Li, Changye Li, Yue Wang, Jianwei Ren, Xin Wen, Pengxiang Li, Leimeng Xu, Kun Zhan, Peng Jia, Xianpeng Lang, et al. Learning personalized driving styles via reinforcement learning from human feedback.arXiv preprint arXiv:2503.10434, 2025

work page arXiv 2025

[21] [21]

Hydra-mdp++: Advancing end-to-end driving via expert-guided hydra-distillation.arXiv preprint arXiv:2503.12820, 2025

Kailin Li, Zhenxin Li, Shiyi Lan, Yuan Xie, Zhizhong Zhang, Jiayi Liu, Zuxuan Wu, Zhiding Yu, and Jose M Alvarez. Hydra-mdp++: Advancing end-to-end driving via expert-guided hydra-distillation.arXiv preprint arXiv:2503.12820, 2025

work page arXiv 2025

[22] [22]

Think2drive: Efficient reinforcement learning by thinking with latent world model for autonomous driving (in carla-v2)

Qifeng Li, Xiaosong Jia, Shaobo Wang, and Junchi Yan. Think2drive: Efficient reinforcement learning by thinking with latent world model for autonomous driving (in carla-v2). InEuropean Conference on Computer Vision, pages 142–158. Springer, 2024

work page 2024

[23] [23]

Enhancing End-to-End Autonomous Driving with Latent World Model

Yingyan Li, Lue Fan, Jiawei He, Yuqi Wang, Yuntao Chen, Zhaoxiang Zhang, and Tieniu Tan. Enhancing end-to-end autonomous driving with latent world model.arXiv preprint arXiv:2406.08481, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[24] [24]

turn left

Yingyan Li, Yuqi Wang, Yang Liu, Jiawei He, Lue Fan, and Zhaoxiang Zhang. End-to-end driving with online trajectory evaluation via bev world model.arXiv preprint arXiv:2504.01941, 2025

work page arXiv 2025

[25] [25]

ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving

Yongkang Li, Kaixin Xiong, Xiangyu Guo, Fang Li, Sixu Yan, Gangwei Xu, Lijun Zhou, Long Chen, Haiyang Sun, Bing Wang, et al. Recogdrive: A reinforced cognitive framework for end-to-end autonomous driving.arXiv preprint arXiv:2506.08052, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[26] [26]

Hydra-next: Robust closed-loop driving with open-loop training.arXiv preprint arXiv:2503.12030, 2025

Zhenxin Li, Shihao Wang, Shiyi Lan, Zhiding Yu, Zuxuan Wu, and Jose M Alvarez. Hydra-next: Robust closed-loop driving with open-loop training.arXiv preprint arXiv:2503.12030, 2025

work page arXiv 2025

[27] [27]

Bevformer: learning bird’s-eye-view representation from lidar-camera via spatiotemporal transformers.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Qiao Yu, and Jifeng Dai. Bevformer: learning bird’s-eye-view representation from lidar-camera via spatiotemporal transformers.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

work page 2024

[28] [28]

Cirl: Controllable imitative rein- forcement learning for vision-based self-driving

Xiaodan Liang, Tairui Wang, Luona Yang, and Eric Xing. Cirl: Controllable imitative rein- forcement learning for vision-based self-driving. InProceedings of the European conference on computer vision (ECCV), pages 584–599, 2018

work page 2018

[29] [29]

Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving

Bencheng Liao, Shaoyu Chen, Haoran Yin, Bo Jiang, Cheng Wang, Sixu Yan, Xinbang Zhang, Xiangyu Li, Ying Zhang, Qian Zhang, et al. Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 12037–12047, 2025

work page 2025

[30] [30]

Focal loss for dense object detection

Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollar. Focal loss for dense object detection. InProceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017

work page 2017

[31] [31]

Bevfusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation

Zhijian Liu, Haotian Tang, Alexander Amini, Xinyu Yang, Huizi Mao, Daniela L Rus, and Song Han. Bevfusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation. In2023 IEEE international conference on robotics and automation (ICRA), pages 2774–2781. IEEE, 2023. 11

work page 2023

[32] [32]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[33] [33]

Imitation is not enough: Robustifying imitation with reinforcement learning for challenging driving scenarios

Yiren Lu, Justin Fu, George Tucker, Xinlei Pan, Eli Bronstein, Rebecca Roelofs, Benjamin Sapp, Brandyn White, Aleksandra Faust, Shimon Whiteson, et al. Imitation is not enough: Robustifying imitation with reinforcement learning for challenging driving scenarios. In2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7553–...

work page 2023

[34] [34]

JFP: Joint future prediction with interactive multi-agent modeling for autonomous driving

Wenjie Luo, Cheolho Park, Andre Cornman, Benjamin Sapp, and Dragomir Anguelov. JFP: Joint future prediction with interactive multi-agent modeling for autonomous driving. In Proceedings of The 7th Conference on Robot Learning, volume 205 ofProceedings of Machine Learning Research, pages 1457–1467. PMLR, 2023

work page 2023

[35] [35]

Visiontrap: Vision-augmented trajectory prediction guided by textual descriptions

Seokha Moon, Hyun Woo, Hongbeen Park, Haeji Jung, Reza Mahjourian, Hyung-gun Chi, Hyerin Lim, Sangpil Kim, and Jinkyu Kim. Visiontrap: Vision-augmented trajectory prediction guided by textual descriptions. InEuropean Conference on Computer Vision, pages 361–379. Springer, 2024

work page 2024

[36] [36]

Wayformer: Motion forecasting via simple & efﬁcient attention networks,

Nigamaa Nayakanti, Rami Al-Rfou, Aurick Zhou, Kratarth Goel, Khaled S Refaat, and Benjamin Sapp. Wayformer: Motion forecasting via simple & efficient attention networks.arXiv preprint arXiv:2207.05844, 2022

work page arXiv 2022

[37] [37]

Scene transformer: A unified architecture for predicting mul- tiple agent trajectories

Jiquan Ngiam, Benjamin Caine, Vijay Vasudevan, Zhengdong Zhang, Hao-Tien Lewis Chiang, Jeffrey Ling, Rebecca Roelofs, Alex Bewley, Chenxi Liu, Ashish Venugopal, et al. Scene transformer: A unified architecture for predicting multiple agent trajectories.arXiv preprint arXiv:2106.08417, 2021

work page arXiv 2021

[38] [38]

Refaat, Rami Al-Rfou, and Benjamin Sapp

Ari Seff, Brian Cera, Dian Chen, Mason Ng, Aurick Zhou, Nigamaa Nayakanti, Khaled S. Refaat, Rami Al-Rfou, and Benjamin Sapp. MotionLM: Multi-agent motion forecasting as language modeling. InProceedings of the IEEE/CVF International Conference on Computer Vision, 2023

work page 2023

[39] [39]

Drivedpo: Policy learning via safety dpo for end-to-end autonomous driving.arXiv preprint arXiv:2509.17940, 2025

Shuyao Shang, Yuntao Chen, Yuqi Wang, Yingyan Li, and Zhaoxiang Zhang. Drivedpo: Policy learning via safety dpo for end-to-end autonomous driving.arXiv preprint arXiv:2509.17940, 2025

work page arXiv 2025

[40] [40]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[41] [41]

Motion transformer with global intention localization and local movement refinement

Shaoshuai Shi, Li Jiang, Dengxin Dai, and Bernt Schiele. Motion transformer with global intention localization and local movement refinement. InAdvances in Neural Information Processing Systems, 2022

work page 2022

[42] [42]

Don’t shake the wheel: Momentum-aware planning in end-to-end autonomous driving

Ziying Song, Caiyan Jia, Lin Liu, Hongyu Pan, Yongchang Zhang, Junming Wang, Xingyu Zhang, Shaoqing Xu, Lei Yang, and Yadan Luo. Don’t shake the wheel: Momentum-aware planning in end-to-end autonomous driving. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 22432–22441, 2025

work page 2025

[43] [43]

DIVER: Reinforced Diffusion Breaks Imitation Bottlenecks in End-to-End Autonomous Driving

Ziying Song, Lin Liu, Hongyu Pan, Bencheng Liao, Mingzhe Guo, Lei Yang, Yongchang Zhang, Shaoqing Xu, Caiyan Jia, and Yadan Luo. Diver: Reinforced diffusion breaks imitation bottlenecks in end-to-end autonomous driving.arXiv preprint arXiv:2507.04049, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[44] [44]

M2i: From factored marginal to joint trajectory prediction for highly interactive environments

Qiao Sun, Xin Huang, Brian C Williams, Hang Zhao, et al. M2i: From factored marginal to joint trajectory prediction for highly interactive environments. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16116–16126, 2022

work page 2022

[45] [45]

Sparsedrivev2: Scoring is all you need for end-to-end autonomous driving.arXiv preprint arXiv:2603.29163, 2026

Wenchao Sun, Xuewu Lin, Keyu Chen, Zixiang Pei, Xiang Li, Yining Shi, and Sifa Zheng. Sparsedrivev2: Scoring is all you need for end-to-end autonomous driving.arXiv preprint arXiv:2603.29163, 2026. 12

work page arXiv 2026

[46] [46]

Sparsedrive: End-to-end autonomous driving via sparse scene representation

Wenchao Sun, Xuewu Lin, Yining Shi, Chuang Zhang, Haoran Wu, and Sifa Zheng. Sparsedrive: End-to-end autonomous driving via sparse scene representation. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 8795–8801. IEEE, 2025

work page 2025

[47] [47]

Hip-ad: Hierarchical and multi-granularity planning with de- formable attention for autonomous driving in a single decoder

Yingqi Tang, Zhuoran Xu, Zhaotie Meng, and Erkang Cheng. Hip-ad: Hierarchical and multi- granularity planning with deformable attention for autonomous driving in a single decoder. arXiv preprint arXiv:2503.08612, 2025

work page arXiv 2025

[48] [48]

Attention is all you need.Advances in Neural Information Processing Systems, 2017

A Vaswani. Attention is all you need.Advances in Neural Information Processing Systems, 2017

work page 2017

[49] [49]

RetroMotion: Retrocausal Motion Forecasting Models are Instructable

Royden Wagner, Omer Sahin Tas, Felix Hauser, Marlon Steiner, Dominik Strutz, Abhishek Vivekanandan, Carlos Fernandez, and Christoph Stiller. Retromotion: Retrocausal motion forecasting models are instructable.arXiv preprint arXiv:2505.20414, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[50] [50]

Jointmotion: joint self-supervision for joint motion prediction.arXiv preprint arXiv:2403.05489, 2024

Royden Wagner, Omer Sahin Tas, Marvin Klemp, and Carlos Fernandez. Jointmotion: joint self-supervision for joint motion prediction.arXiv preprint arXiv:2403.05489, 2024

work page arXiv 2024

[51] [51]

Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail

Yan Wang, Wenjie Luo, Junjie Bai, Yulong Cao, Tong Che, Ke Chen, Yuxiao Chen, Jenna Dia- mond, Yifan Ding, Wenhao Ding, et al. Alpamayo-r1: Bridging reasoning and action prediction for generalizable autonomous driving in the long tail.arXiv preprint arXiv:2511.00088, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[52] [52]

Para-drive: Par- allelized architecture for real-time autonomous driving

Xinshuo Weng, Boris Ivanovic, Yan Wang, Yue Wang, and Marco Pavone. Para-drive: Par- allelized architecture for real-time autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15449–15458, 2024

work page 2024

[53] [53]

Goalflow: Goal-driven flow matching for multimodal trajectories generation in end- to-end autonomous driving

Zebin Xing, Xingyu Zhang, Yang Hu, Bo Jiang, Tong He, Qian Zhang, Xiaoxiao Long, and Wei Yin. Goalflow: Goal-driven flow matching for multimodal trajectories generation in end- to-end autonomous driving. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 1602–1611, 2025

work page 2025

[54] [54]

Second: Sparsely embedded convolutional detection.Sensors, 18(10), 2018

Yan Yan, Yuxing Mao, and Bo Li. Second: Sparsely embedded convolutional detection.Sensors, 18(10), 2018

work page 2018

[55] [55]

Raw2drive: Reinforcement learning with aligned world models for end-to-end autonomous driving (in carla v2).arXiv preprint arXiv:2505.16394, 2025

Zhenjie Yang, Xiaosong Jia, Qifeng Li, Xue Yang, Maoqing Yao, and Junchi Yan. Raw2drive: Reinforcement learning with aligned world models for end-to-end autonomous driving (in carla v2).arXiv preprint arXiv:2505.16394, 2025

work page arXiv 2025

[56] [56]

Center-based 3d object detection and track- ing

Tianwei Yin, Xingyi Zhou, and Philipp Krahenbuhl. Center-based 3d object detection and track- ing. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11784–11793, 2021

work page 2021

[57] [57]

AutoDrive-R$^2$: Incentivizing Reasoning and Self-Reflection Capacity for VLA Model in Autonomous Driving

Zhenlong Yuan, Chengxuan Qian, Jing Tang, Rui Chen, Zijian Song, Lei Sun, Xiangxiang Chu, Yujun Cai, Dapeng Zhang, and Shuo Li. Autodrive-r2: Incentivizing reasoning and self- reflection capacity for vla model in autonomous driving.arXiv preprint arXiv:2509.01944, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[58] [58]

End-to-end interpretable neural motion planner

Wenyuan Zeng, Wenjie Luo, Simon Suo, Abbas Sadat, Bin Yang, Sergio Casas, and Raquel Urtasun. End-to-end interpretable neural motion planner. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8660–8669, 2019

work page 2019

[59] [59]

Future-aware end-to-end driving: Bidirectional modeling of trajectory planning and scene evolution.arXiv preprint arXiv:2510.11092, 2025

Bozhou Zhang, Nan Song, Jingyu Li, Xiatian Zhu, Jiankang Deng, and Li Zhang. Future-aware end-to-end driving: Bidirectional modeling of trajectory planning and scene evolution.arXiv preprint arXiv:2510.11092, 2025

work page arXiv 2025

[60] [60]

Resworld: Temporal residual world model for end-to-end autonomous driving.arXiv preprint arXiv:2602.10884, 2026

Jinqing Zhang, Zehua Fu, Zelin Xu, Wenying Dai, Qingjie Liu, and Yunhong Wang. Resworld: Temporal residual world model for end-to-end autonomous driving.arXiv preprint arXiv:2602.10884, 2026

work page arXiv 2026

[61] [61]

Genad: Gen- erative end-to-end autonomous driving

Wenzhao Zheng, Ruiqi Song, Xianda Guo, Chenming Zhang, and Long Chen. Genad: Gen- erative end-to-end autonomous driving. InEuropean Conference on Computer Vision, pages 87–104. Springer, 2024. 13

work page 2024

[62] [62]

World4drive: End-to-end autonomous driving via intention-aware physical latent world model

Yupeng Zheng, Pengxuan Yang, Zebin Xing, Qichao Zhang, Yuhang Zheng, Yinfeng Gao, Pengfei Li, Teng Zhang, Zhongpu Xia, Peng Jia, et al. World4drive: End-to-end autonomous driving via intention-aware physical latent world model. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 28632–28642, 2025

work page 2025

[63] [63]

V oxelnet: End-to-end learning for point cloud based 3d object detection

Yin Zhou and Oncel Tuzel. V oxelnet: End-to-end learning for point cloud based 3d object detection. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 4490–4499, 2018

work page 2018

[64] [64]

Query-centric trajectory prediction

Zikang Zhou, Jianping Wang, Yung-Hui Li, and Yu-Kai Huang. Query-centric trajectory prediction. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17863–17873, 2023

work page 2023

[65] [65]

Hidden biases of end-to-end driving datasets

Julian Zimmerlin, Jens Beißwenger, Bernhard Jaeger, Andreas Geiger, and Kashyap Chitta. Hidden biases of end-to-end driving datasets.arXiv preprint arXiv:2412.09602, 2024

work page arXiv 2024

[66] [66]

Diffusiondrivev2: Reinforcement learning-constrained truncated diffusion modeling in end-to-end autonomous driving, 2025

Jialv Zou, Shaoyu Chen, Bencheng Liao, Zhiyu Zheng, Yuehao Song, Lefei Zhang, Qian Zhang, Wenyu Liu, and Xinggang Wang. Diffusiondrivev2: Reinforcement learning-constrained truncated diffusion modeling in end-to-end autonomous driving, 2025. 14 Appendix A Additional Related Work A.1 End-to-End Planning for Autonomous Driving End-to-end autonomous driving ...

work page 2025