PRIX: Learning to Plan from Raw Pixels for End-to-End Autonomous Driving

Lianhang Liu; Maciej K. Wozniak; Patric Jensfelt; Yixi Cai

arXiv: 2507.17596 · v3 · submitted 2025-07-23 · cs.CV · cs.AI · cs.LG · cs.RO

Pith reviewed 2026-05-19 02:49 UTC · model grok-4.3

keywords: autonomous driving · end-to-end planning · camera-based navigation · trajectory prediction · raw pixel input · recalibration transformer · NavSim benchmark · nuScenes benchmark

The pith

A camera-only architecture plans safe trajectories directly from raw pixels without LiDAR or bird's-eye views.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PRIX as an end-to-end autonomous driving system that takes raw camera images as input and outputs collision-free trajectories. It avoids the usual requirements for expensive LiDAR sensors and explicit bird's-eye-view feature maps by pairing a visual feature extractor with a generative planning head. A new Context-aware Recalibration Transformer module refines multi-level image features to support this direct pixel-to-plan mapping. Sympathetic readers would care because the approach promises smaller models and faster inference, which could make high-performance driving feasible on ordinary camera-equipped vehicles.

Core claim

PRIX achieves state-of-the-art performance on the NavSim and nuScenes benchmarks by predicting safe trajectories directly from raw pixel inputs using only camera data, without explicit BEV representation and forgoing the need for LiDAR. The architecture leverages a visual feature extractor coupled with a generative planning head, with the Context-aware Recalibration Transformer enhancing multi-level visual features for more robust planning. This matches the capabilities of larger multimodal diffusion planners while remaining significantly more efficient in inference speed and model size.
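To make the "generative planning head" more concrete, here is a toy sketch of refining a noisy anchor trajectory over a few steps, the general pattern that the Figure 2 caption attributes to the conditional diffusion planner. Everything in it is an assumption made for illustration: `add_noise`, `refine`, and `predict_clean` are hypothetical names, the oracle predictor stands in for a learned network conditioned on visual features and ego state, and the linear update is a simplification, not the paper's sampler.

```python
import math
import random


def add_noise(anchor, sigma, rng):
    """Perturb each (x, y) waypoint of an anchor trajectory with Gaussian noise."""
    return [(x + rng.gauss(0.0, sigma), y + rng.gauss(0.0, sigma)) for x, y in anchor]


def refine(trajectory, predict_clean, steps):
    """Move the trajectory toward the predicted clean one over `steps` updates."""
    for remaining in range(steps, 0, -1):
        clean = predict_clean(trajectory)
        # Close 1/remaining of the gap, so the final step lands on the prediction.
        trajectory = [
            (x + (cx - x) / remaining, y + (cy - y) / remaining)
            for (x, y), (cx, cy) in zip(trajectory, clean)
        ]
    return trajectory


def mean_distance(a, b):
    """Average Euclidean distance between matching waypoints."""
    return sum(math.hypot(ax - bx, ay - by) for (ax, ay), (bx, by) in zip(a, b)) / len(a)


# Hypothetical data: a straight-ahead anchor and a gently curving target.
anchor = [(float(i), 0.0) for i in range(1, 5)]
target = [(float(i), 0.1 * i * i) for i in range(1, 5)]


def oracle(trajectory):
    """Stand-in for a learned denoiser: always predicts the target trajectory."""
    return target


rng = random.Random(0)
noisy = add_noise(anchor, sigma=0.5, rng=rng)
planned = refine(noisy, oracle, steps=2)
```

With a learned predictor the clean estimate would change at every step, and the step count is the kind of knob that Figure 4 examines.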

What carries the argument

The Context-aware Recalibration Transformer (CaRT) module, which refines multi-level visual features extracted from raw camera pixels to support direct generative trajectory planning.
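The Figure 2 caption describes CaRT as ResNet features combined with self-attention and skip connections. A minimal sketch of that pattern, in pure Python, might look like the following. It is an assumption-laden illustration rather than the authors' module: it uses a single head with identity query/key/value projections, and the names `self_attention` and `recalibrate` are hypothetical. Learned projections, multiple heads, and the multi-level fusion are all omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product attention with identity projections."""
    dim = len(tokens[0])
    attended = []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        # Each output is a weighted average of all token vectors.
        attended.append(
            [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]
        )
    return attended


def recalibrate(tokens):
    """Skip connection: add the attended context back onto the input features."""
    attended = self_attention(tokens)
    return [[t + a for t, a in zip(tok, att)] for tok, att in zip(tokens, attended)]


# Hypothetical feature tokens (for example, three spatial positions, two channels).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
refined = recalibrate(features)
```

The skip connection keeps the original feature intact and lets attention contribute only a correction, which is one common reason to pair the two.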

If this is right

Mass-market vehicles equipped only with cameras can achieve competitive planning performance without LiDAR hardware.
Inference speed and model size become practical for onboard real-time deployment.
End-to-end driving systems no longer require separate 3D reconstruction stages to reach benchmark-leading results.
Training and evaluation can focus on pixel-to-trajectory mappings rather than intermediate bird's-eye representations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The recalibration approach may generalize to other vision-only planning domains such as robotics navigation or drone path finding.
If visual features prove sufficient here, future work could test whether similar lightweight recalibration replaces multi-sensor fusion in additional perception tasks.
Performance parity with larger models suggests that efficiency gains could be traded for further accuracy improvements by scaling the same architecture.

Load-bearing premise

Raw camera pixels, after processing by the visual feature extractor and CaRT module, contain enough information to generate safe, collision-free trajectories without explicit 3D reconstruction or additional sensor modalities.

What would settle it

A controlled experiment on nuScenes or NavSim showing substantially higher collision or off-road rates for PRIX than for LiDAR-based baselines in low-visibility or dynamic intersection scenarios would falsify the claim.
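One way to judge whether a collision rate is "substantially higher" is a two-proportion z-test on per-scenario collision counts. The sketch below assumes independent scenarios and a pooled-variance normal approximation. The function names and the counts in the usage lines are hypothetical placeholders, not results from the paper or from any benchmark.

```python
import math


def two_proportion_z(collisions_a, total_a, collisions_b, total_b):
    """z statistic for rate_a - rate_b under a pooled-variance normal approximation."""
    rate_a = collisions_a / total_a
    rate_b = collisions_b / total_b
    pooled = (collisions_a + collisions_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / total_a + 1.0 / total_b))
    if std_err == 0.0:
        return 0.0  # no collisions (or only collisions) in both groups
    return (rate_a - rate_b) / std_err


def one_sided_p_value(z):
    """P(Z >= z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Hypothetical counts, for illustration only: planner A is camera-only,
# planner B is a LiDAR-based baseline, each run on 1000 scenarios.
z = two_proportion_z(30, 1000, 15, 1000)
p = one_sided_p_value(z)
```

A small `p` would suggest the camera-only planner collides more often than chance alone explains. The test says nothing about why, so scenario-level inspection would still be needed.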

Figures

Figures reproduced from arXiv: 2507.17596 by Lianhang Liu, Maciej K. Wozniak, Patric Jensfelt, Yixi Cai.

**Figure 2.** PRIX Overview: Visual features from multi-camera images are extracted by ResNet layers (f_i) together with self-attention and skip connections (CaRT, described in Sec. 3.1). Next, visual features are used for auxiliary perception tasks (see Sec. 3.4) and trajectory planning (see Sec. 3.2). A conditional diffusion planner then uses visual features, along with the current ego state and a set of noisy anch…

**Figure 3.** Architecture of our visual feature extractor with …

**Figure 4.** Diffusion steps vs performance on Navsim-v1.

**Figure 5.** Qualitative trajectory predictions from our method. In some cases, like …

Original abstract

While end-to-end autonomous driving models show promising results, their practical deployment is often hindered by large model sizes, a reliance on expensive LiDAR sensors and computationally intensive BEV feature representations. This limits their scalability, especially for mass-market vehicles equipped only with cameras. To address these challenges, we propose PRIX (Plan from Raw Pixels). Our novel and efficient end-to-end driving architecture operates using only camera data, without explicit BEV representation and forgoing the need for LiDAR. PRIX leverages a visual feature extractor coupled with a generative planning head to predict safe trajectories from raw pixel inputs directly. A core component of our architecture is the Context-aware Recalibration Transformer (CaRT), a novel module designed to effectively enhance multi-level visual features for more robust planning. We demonstrate through comprehensive experiments that PRIX achieves state-of-the-art performance on the NavSim and nuScenes benchmarks, matching the capabilities of larger, multimodal diffusion planners while being significantly more efficient in terms of inference speed and model size, making it a practical solution for real-world deployment. Our work is open-source and the code will be at https://maxiuw.github.io/prix.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes PRIX, an end-to-end autonomous driving architecture that predicts safe trajectories directly from raw camera pixels without LiDAR or explicit BEV representations. It combines a visual feature extractor with a generative planning head and introduces the Context-aware Recalibration Transformer (CaRT) module to enhance multi-level visual features. The central claims are state-of-the-art performance on the NavSim and nuScenes benchmarks that matches larger multimodal diffusion planners while offering substantially lower inference latency and model size; the work is presented as open-source.

Significance. If the performance and efficiency claims are substantiated with rigorous controls, the result would be significant for practical camera-only autonomous driving, as it targets reduced sensor cost and computational overhead for mass-market vehicles. The planned code release is a clear strength that supports reproducibility.

major comments (2)

Experimental evaluation: the abstract and results sections assert SOTA performance and efficiency gains, yet no details are provided on training/validation data splits, ablation studies isolating the CaRT module, error bars, or failure-case analysis. These omissions prevent full evaluation of the central performance claim against the cited larger diffusion planners.
Method and evaluation: the claim that raw monocular pixels plus the visual extractor and CaRT suffice for safe, collision-free planning without explicit 3D reconstruction or LiDAR rests on closed-loop benchmark metrics that can be achieved by short-horizon imitation. No targeted stress tests under distribution shift (novel lighting, thin distant obstacles, ego-motion parallax) or direct comparison to depth-augmented baselines are reported, leaving the weakest assumption unexamined.

minor comments (1)

Abstract: the code link https://maxiuw.github.io/prix should be replaced with a permanent repository URL (e.g., GitHub) in the camera-ready version.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications from the manuscript and indicate where revisions will be made to strengthen the presentation.

Referee: Experimental evaluation: the abstract and results sections assert SOTA performance and efficiency gains, yet no details are provided on training/validation data splits, ablation studies isolating the CaRT module, error bars, or failure-case analysis. These omissions prevent full evaluation of the central performance claim against the cited larger diffusion planners.

Authors: We thank the referee for this observation. The manuscript details the training and validation splits in Section 4.1, following the official NavSim and nuScenes protocols. Ablation studies isolating the contribution of the CaRT module appear in Section 4.3 and Table 3. Error bars computed over multiple runs are reported in Tables 1 and 2. We agree that failure-case analysis is not sufficiently developed. We will add a dedicated subsection with qualitative examples and common failure modes in the revised manuscript. revision: yes
Referee: Method and evaluation: the claim that raw monocular pixels plus the visual extractor and CaRT suffice for safe, collision-free planning without explicit 3D reconstruction or LiDAR rests on closed-loop benchmark metrics that can be achieved by short-horizon imitation. No targeted stress tests under distribution shift (novel lighting, thin distant obstacles, ego-motion parallax) or direct comparison to depth-augmented baselines are reported, leaving the weakest assumption unexamined.

Authors: We respectfully note that the generative planning head is trained on multi-step trajectories and the closed-loop evaluation on NavSim and nuScenes measures full-horizon safety rather than pure short-horizon imitation. The CaRT module is specifically designed to recalibrate multi-level features for implicit spatial reasoning. Nevertheless, we agree that targeted robustness checks would strengthen the claims. We will incorporate comparisons against depth-augmented baselines and add discussion of performance under lighting and obstacle variations in the revised experiments section. revision: partial
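The error bars mentioned in the first response are typically a mean and sample standard deviation over repeated training runs. A minimal sketch of that computation follows. The helper name `format_with_error_bar` and the scores in the usage line are hypothetical and are not taken from the paper's tables.

```python
import statistics


def format_with_error_bar(scores, digits=1):
    """Report mean and sample standard deviation across independent runs."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # sample std (n - 1 denominator); needs >= 2 runs
    return f"{mean:.{digits}f} +/- {std:.{digits}f}"


# Hypothetical benchmark scores from three seeds, for illustration only.
hypothetical_scores = [87.5, 88.0, 88.5]
summary = format_with_error_bar(hypothetical_scores)
```

Reporting the number of runs alongside the spread matters, because a standard deviation over two or three seeds is itself a noisy estimate.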

Circularity Check

0 steps flagged

No circularity: empirical benchmark results independent of internal definitions

The paper's core claims rest on training a visual feature extractor plus CaRT module on standard driving datasets and reporting closed-loop metrics on NavSim and nuScenes. These are external, falsifiable evaluations against held-out trajectories and do not reduce any 'prediction' to a quantity defined by the authors' own fitted parameters or prior self-citations. No equations equate outputs to inputs by construction, no uniqueness theorems are imported from the same authors, and the architecture choices are presented as design decisions rather than derived necessities. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The performance claims rest on the domain assumption that camera images suffice for planning and on the effectiveness of the newly introduced CaRT module; no explicit free parameters or invented physical entities are described.

axioms (1)

domain assumption: Raw camera images contain sufficient information for safe trajectory planning in the evaluated scenarios.
Core premise enabling the camera-only design stated in the abstract.

invented entities (1)

Context-aware Recalibration Transformer (CaRT) · no independent evidence
purpose: Enhance multi-level visual features for more robust planning
New module presented as a core component of the architecture.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: PRIX leverages a visual feature extractor coupled with a generative planning head to predict safe trajectories from raw pixel inputs directly. A core component of our architecture is the Context-aware Recalibration Transformer (CaRT)

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: We demonstrate through comprehensive experiments that PRIX achieves state-of-the-art performance on the NavSim and nuScenes benchmarks

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CLOVER: Closed-Loop Value Estimation and Ranking for End-to-End Autonomous Driving Planning
cs.RO · 2026-05 · conditional · novelty 6.0

CLOVER is a closed-loop generator-scorer framework that expands proposal coverage with pseudo-expert trajectories and performs conservative self-distillation to achieve state-of-the-art planning scores on NAVSIM and nuScenes.

Reference graph

Works this paper leans on

67 extracted references · 67 canonical work pages · cited by 1 Pith paper · 3 internal anchors

[1]

Zenseact open dataset: A large-scale and diverse multimodal dataset for autonomous driving

Mina Alibeigi, William Ljungbergh, Adam Tonderski, Georg Hess, Adam Lilja, Carl Lindstr ¨om, Daria Motorniuk, Jun- sheng Fu, Jenny Widahl, and Christoffer Petersson. Zenseact open dataset: A large-scale and diverse multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20178– 20188, 2023. 2

work page 2023
[2]

Navigation world models

Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, and Yann LeCun. Navigation world models. In Proceedings of the Computer Vision and Pattern Recognition Conference , pages 15791–15801, 2025. 2

work page 2025
[3]

nuscenes: A multi- modal dataset for autonomous driving

Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Gi- ancarlo Baldan, and Oscar Beijbom. nuscenes: A multi- modal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11621–11631, 2020. 2, 6

work page 2020
[4]

Pseudo-simulation for autonomous driving.arXiv preprint arXiv:2506.04218,

Wei Cao, Marcel Hallgarten, Tianyu Li, Daniel Dauner, Xunjiang Gu, Caojun Wang, Yakov Miron, Marco Aiello, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, Andreas Geiger, and Kashyap Chitta. Pseudo- simulation for autonomous driving. arXiv, 2506.04218,

work page arXiv
[5]

End-to-end autonomous driving: Challenges and frontiers

Li Chen, Penghao Wu, Kashyap Chitta, Bernhard Jaeger, An- dreas Geiger, and Hongyang Li. End-to-end autonomous driving: Challenges and frontiers. IEEE Transactions on Pat- tern Analysis and Machine Intelligence, 2024. 1

work page 2024
[6]

VADv2: End-to-End Vectorized Autonomous Driving via Probabilistic Planning

Shaoyu Chen, Bo Jiang, Hao Gao, Bencheng Liao, Qing Xu, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. Vadv2: End-to-end vectorized autonomous driving via probabilistic planning. arXiv preprint arXiv:2402.13243,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Solve: Synergy of language-vision and end-to-end networks for autonomous driving

Xuesong Chen, Linjiang Huang, Tao Ma, Rongyao Fang, Shaoshuai Shi, and Hongsheng Li. Solve: Synergy of language-vision and end-to-end networks for autonomous driving. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12068–12077, 2025. 3

work page 2025
[8]

Ppad: Iterative interactions of prediction and planning for end-to-end autonomous driving

Zhili Chen, Maosheng Ye, Shuangjie Xu, Tongyi Cao, and Qifeng Chen. Ppad: Iterative interactions of prediction and planning for end-to-end autonomous driving. In European Conference on Computer Vision , pages 239–256. Springer,

work page
[9]

Dualat: Dual attention transformer for end-to-end au- tonomous driving

Zesong Chen, Ze Yu, Jun Li, Linlin You, and Xiaojun Tan. Dualat: Dual attention transformer for end-to-end au- tonomous driving. In 2024 IEEE International Conference on Robotics and Automation (ICRA) , pages 16353–16359. IEEE, 2024. 2

work page 2024
[10]

Transfuser: Imitation with transformer-based sensor fusion for autonomous driv- ing

Kashyap Chitta, Aditya Prakash, Bernhard Jaeger, Zehao Yu, Katrin Renz, and Andreas Geiger. Transfuser: Imitation with transformer-based sensor fusion for autonomous driv- ing. IEEE transactions on pattern analysis and machine in- telligence, 45(11):12878–12895, 2022. 1, 2, 5, 7

work page 2022
[11]

Formula 1 icons

Darius Dan. Formula 1 icons. In https://www.flaticon.com/free-icons/formula-1. Flaticon. 8

work page
[12]

Navsim: Data-driven non- reactive autonomous vehicle simulation and benchmark- ing

Daniel Dauner, Marcel Hallgarten, Tianyu Li, Xinshuo Weng, Zhiyu Huang, Zetong Yang, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, Andreas Geiger, and Kashyap Chitta. Navsim: Data-driven non- reactive autonomous vehicle simulation and benchmark- ing. In Advances in Neural Information Processing Systems (NeurIPS), 2024. 2, 6

work page 2024
[13]

Artemis: Autoregressive end-to-end trajectory planning with mixture of experts for autonomous driving.arXiv preprint arXiv:2504.19580, 2025

Renju Feng, Ning Xi, Duanfeng Chu, Rukang Wang, Zejian Deng, Anzheng Wang, Liping Lu, Jinxiang Wang, and Yan- jun Huang. Artemis: Autoregressive end-to-end trajectory planning with mixture of experts for autonomous driving. arXiv preprint arXiv:2504.19580, 2025. 2

work page arXiv 2025
[14]

Polarpoint-bev: Bird-eye- view perception in polar points for explainable end-to-end autonomous driving

Yuchao Feng and Yuxiang Sun. Polarpoint-bev: Bird-eye- view perception in polar points for explainable end-to-end autonomous driving. IEEE Transactions on Intelligent Vehi- cles, 2024. 3

work page 2024
[15]

Man truckscenes: A multimodal dataset for autonomous trucking in diverse con- ditions

Felix Fent, Fabian Kuttenreich, Florian Ruch, Farija Rizwin, Stefan Juergens, Lorenz Lechermann, Christian Nissler, An- drea Perl, Ulrich V oll, Min Yan, et al. Man truckscenes: A multimodal dataset for autonomous trucking in diverse con- ditions. Advances in Neural Information Processing Systems, 37:62062–62082, 2024. 2

work page 2024
[16]

Rad: Training an end-to-end driving policy via large-scale 3dgs-based reinforcement learning,

Hao Gao, Shaoyu Chen, Bo Jiang, Bencheng Liao, Yiang Shi, Xiaoyang Guo, Yuechuan Pu, Haoran Yin, Xiangyu Li, Xinbang Zhang, et al. Rad: Training an end-to-end driv- ing policy via large-scale 3dgs-based reinforcement learning. arXiv preprint arXiv:2502.13144, 2025. 2

work page arXiv 2025
[17]

Looking for trouble: In- formative planning for safe trajectories with occlusions

Barry Gilhuly, Armin Sadeghi, Peyman Yedmellat, Kasra Rezaee, and Stephen L Smith. Looking for trouble: In- formative planning for safe trajectories with occlusions. In 2022 International Conference on Robotics and Automation (ICRA), pages 8985–8991. IEEE, 2022. 8

work page 2022
[18]

Cleverdistiller: Simple and spatially consistent cross- modal distillation

Hariprasath Govindarajan, Maciej K Wozniak, Marvin Klingner, Camille Maurice, B Ravi Kiran, and Senthil Yoga- mani. Cleverdistiller: Simple and spatially consistent cross- modal distillation. arXiv preprint arXiv:2503.09878, 2025. 8

work page arXiv 2025
[19]

ipad: Iterative proposal-centric end-to-end autonomous driving.arXiv preprint arXiv:2505.15111, 2025

Ke Guo, Haochen Liu, Xiaojun Wu, Jia Pan, and Chen Lv. ipad: Iterative proposal-centric end-to-end autonomous driv- ing. arXiv preprint arXiv:2505.15111, 2025. 3

work page arXiv 2025
[20]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR,

work page
[21]

Dis- tilling multi-modal large language models for autonomous driving

Deepti Hegde, Rajeev Yasarla, Hong Cai, Shizhong Han, Apratim Bhattacharyya, Shweta Mahajan, Litian Liu, Risheek Garrepalli, Vishal M Patel, and Fatih Porikli. Dis- tilling multi-modal large language models for autonomous driving. Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, 2025. 3 9

work page 2025
[22]

Splatad: Real-time li- dar and camera rendering with 3d gaussian splatting for au- tonomous driving

Georg Hess, Carl Lindstr ¨om, Maryam Fatemi, Christoffer Petersson, and Lennart Svensson. Splatad: Real-time li- dar and camera rendering with 3d gaussian splatting for au- tonomous driving. In Proceedings of the Computer Vision and Pattern Recognition Conference , pages 11982–11992,

work page
[23]

St-p3: End-to-end vision-based au- tonomous driving via spatial-temporal feature learning

Shengchao Hu, Li Chen, Penghao Wu, Hongyang Li, Junchi Yan, and Dacheng Tao. St-p3: End-to-end vision-based au- tonomous driving via spatial-temporal feature learning. In European Conference on Computer Vision (ECCV), 2022. 8

work page 2022
[24]

Planning-oriented autonomous driving

Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. In Pro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17853–17862, 2023. 2, 3, 7, 8

work page 2023
[25]

Efficient multi-camera tokenization with triplanes for end-to-end driving

Boris Ivanovic, Cristiano Saltori, Yurong You, Yan Wang, Wenjie Luo, and Marco Pavone. Efficient multi-camera tokenization with triplanes for end-to-end driving. arXiv preprint arXiv:2506.12251, 2025. 3

work page arXiv 2025
[26]

Diffvla: Vision-language guided diffusion planning for autonomous driving

Anqing Jiang, Yu Gao, Zhigang Sun, Yiru Wang, Jijun Wang, Jinghao Chai, Qian Cao, Yuweng Heng, Hao Jiang, Zongzheng Zhang, et al. Diffvla: Vision-language guided diffusion planning for autonomous driving. arXiv preprint arXiv:2505.19381, 2025. 3

work page arXiv 2025
[27]

Vad: Vectorized scene representa- tion for efficient autonomous driving

Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. Vad: Vectorized scene representa- tion for efficient autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 8340–8350, 2023. 2, 3, 5, 8

work page 2023
[28]

Transdiffuser: End-to-end trajectory gen- eration with decorrelated multi-modal representation for au- tonomous driving

Xuefeng Jiang, Yuan Ma, Pengxiang Li, Leimeng Xu, Xin Wen, Kun Zhan, Zhongpu Xia, Peng Jia, XianPeng Lang, and Sheng Sun. Transdiffuser: End-to-end trajectory gen- eration with decorrelated multi-modal representation for au- tonomous driving. arXiv preprint arXiv:2505.09315, 2025. 1, 2, 3

work page arXiv 2025
[29]

3d gaussian splatting for real-time radiance field rendering

Bernhard Kerbl, Georgios Kopanas, Thomas Leimk ¨uhler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1,

work page
[30]

Hydra- mdp++: Advancing end-to-end driving via hydra-distillation with expert-guided decision analysis

Kailin Li, Zhenxin Li, Shiyi Lan, Jiayi Liu, Yuan Xie, Zuxuan Wu, Zhiding Yu, Jose M Alvarez, et al. Hydra- mdp++: Advancing end-to-end driving via hydra-distillation with expert-guided decision analysis. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (Workshops), 2025. 3, 7

work page 2025
[31]

Hydra-MDP: End-to-end Multimodal Planning with Multi-target Hydra-Distillation

Zhenxin Li, Kailin Li, Shihao Wang, Shiyi Lan, Zhiding Yu, Yishen Ji, Zhiqi Li, Ziyue Zhu, Jan Kautz, Zuxuan Wu, et al. Hydra-mdp: End-to-end multimodal planning with multi- target hydra-distillation. arXiv preprint arXiv:2406.06978 ,

work page internal anchor Pith review Pith/arXiv arXiv
[32]

Generalized trajectory scoring for end-to-end multimodal planning.arXiv preprint arXiv:2506.06664, 2025

Zhenxin Li, Wenhao Yao, Zi Wang, Xinglong Sun, Joshua Chen, Nadine Chang, Maying Shen, Zuxuan Wu, Shiyi Lan, and Jose M Alvarez. Generalized trajectory scor- ing for end-to-end multimodal planning. arXiv preprint arXiv:2506.06664, 2025. 3

work page arXiv 2025
[33]

Pnpnet: End-to-end per- ception and prediction with tracking in the loop

Ming Liang, Bin Yang, Wenyuan Zeng, Yun Chen, Rui Hu, Sergio Casas, and Raquel Urtasun. Pnpnet: End-to-end per- ception and prediction with tracking in the loop. In Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11553–11562, 2020. 1

work page 2020
[34]

Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving

Bencheng Liao, Shaoyu Chen, Haoran Yin, Bo Jiang, Cheng Wang, Sixu Yan, Xinbang Zhang, Xiangyu Li, Ying Zhang, Qian Zhang, et al. Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12037–12047, 2025. 1, 3, 4, 5, 6, 7, 8, 2

work page 2025
[35]

Fully sparse 3d occupancy prediction, 2024

Haisong Liu, Yang Chen, Haiguang Wang, Zetong Yang, Tianyu Li, Jia Zeng, Li Chen, Hongyang Li, and Limin Wang. Fully sparse 3d occupancy prediction, 2024. 8

work page 2024
[36]

Segment any point cloud sequences by distilling vision foundation models

Youquan Liu, Lingdong Kong, Jun Cen, Runnan Chen, Wen- wei Zhang, Liang Pan, Kai Chen, and Ziwei Liu. Segment any point cloud sequences by distilling vision foundation models. Advances in Neural Information Processing Sys- tems, 36, 2024. 8

work page 2024
[37]

Fast and furi- ous: Real time end-to-end 3d detection, tracking and motion forecasting with a single convolutional net

Wenjie Luo, Bin Yang, and Raquel Urtasun. Fast and furi- ous: Real time end-to-end 3d detection, tracking and motion forecasting with a single convolutional net. InProceedings of the IEEE conference on Computer Vision and Pattern Recog- nition, pages 3569–3577, 2018. 1

work page 2018
[38]

VGGT-SLAM: Dense RGB SLAM Optimized on the SL(4) Manifold

Dominic Maggio, Hyungtae Lim, and Luca Carlone. Vggt- slam: Dense rgb slam optimized on the sl (4) manifold.arXiv preprint arXiv:2505.12549, 2025. 2

work page internal anchor Pith review arXiv 2025
[39]

Nerf: Representing scenes as neural radiance fields for view syn- thesis

Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view syn- thesis. Communications of the ACM , 65(1):99–106, 2021. 2

work page 2021
[40]

Risk-aware motion planning for autonomous vehicles with safety specifications

Truls Nyberg, Christian Pek, Laura Dal Col, Christoffer Nor´en, and Jana Tumova. Risk-aware motion planning for autonomous vehicles with safety specifications. In 2021 ieee intelligent vehicles symposium (iv), pages 1016–1023. IEEE,

work page 2021
[41]

Lego-drive: Language- enhanced goal-oriented closed-loop end-to-end autonomous driving

Pranjal Paul, Anant Garg, Tushar Choudhary, Arun Kumar Singh, and K Madhava Krishna. Lego-drive: Language- enhanced goal-oriented closed-loop end-to-end autonomous driving. In 2024 IEEE/RSJ International Conference on In- telligent Robots and Systems (IROS) , pages 10020–10026. IEEE, 2024. 3

work page 2024
[42]

Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unproject- ing to 3d

Jonah Philion and Sanja Fidler. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unproject- ing to 3d. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceed- ings, Part XIV 16, pages 194–210. Springer, 2020. 1

work page 2020
[43]

Lightemma: Lightweight end-to-end multimodal model for autonomous driving

Zhijie Qiao, Haowei Li, Zhong Cao, and Henry X Liu. Lightemma: Lightweight end-to-end multimodal model for autonomous driving. arXiv preprint arXiv:2505.00284 ,

work page arXiv
[44]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 4 10

work page 2022
[45]

Perceive, predict, and plan: Safe motion planning through interpretable seman- tic representations

Abbas Sadat, Sergio Casas, Mengye Ren, Xinyu Wu, Pranaab Dhawan, and Raquel Urtasun. Perceive, predict, and plan: Safe motion planning through interpretable seman- tic representations. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIII 16, pages 414–430. Springer, 2020. 1

work page 2020
[46]

Scalability in perception for autonomous driving: Waymo open dataset

Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2446–2454, 2020. 2

work page 2020
[47]

Sparsedrive: End-to-end au- tonomous driving via sparse scene representation

Wenchao Sun, Xuewu Lin, Yining Shi, Chuang Zhang, Hao- ran Wu, and Sifa Zheng. Sparsedrive: End-to-end au- tonomous driving via sparse scene representation. Proceed- ings of the IEEE International Conference on Robotics and Automation, 2025. 3, 8, 1

work page 2025
[48]

Neurad: Neural rendering for autonomous driving

Adam Tonderski, Carl Lindstr ¨om, Georg Hess, William Ljungbergh, Lennart Svensson, and Christoffer Petersson. Neurad: Neural rendering for autonomous driving. In Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14895–14904, 2024. 2

work page 2024
[49]

Vggt: Vi- sual geometry grounded transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Vi- sual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025. 2

work page 2025
[50] Xinshuo Weng, Boris Ivanovic, Yan Wang, Yue Wang, and Marco Pavone. Para-drive: Parallelized architecture for real-time autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15449–15458, 2024.

[51] Maciej K Wozniak, Hariprasath Govindarajan, Marvin Klingner, Camille Maurice, B Ravi Kiran, and Senthil Yogamani. S3pt: Scene semantics and structure guided clustering to boost self-supervised pre-training for autonomous driving. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1660–1670. IEEE, 2025.

[52] Maciej K Wozniak, Viktor Kårefjärd, Marko Thiel, and Patric Jensfelt. Toward a robust sensor fusion step for 3d object detection on corrupted data. IEEE Robotics and Automation Letters, 8(11):7018–7025, 2023.

[53] Zebin Xing, Xingyu Zhang, Yang Hu, Bo Jiang, Tong He, Qian Zhang, Xiaoxiao Long, and Wei Yin. Goalflow: Goal-driven flow matching for multimodal trajectories generation in end-to-end autonomous driving. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1602–1611, 2025.

[54] Xiang Xu, Lingdong Kong, Hui Shuai, Wenwei Zhang, Liang Pan, Kai Chen, Ziwei Liu, and Qingshan Liu. 4d contrastive superflows are dense 3d representation learners. arXiv preprint arXiv:2407.06190, 2024.

[55] Yihong Xu, Loïck Chambon, Éloi Zablocki, Mickaël Chen, Alexandre Alahi, Matthieu Cord, and Patrick Pérez. Towards motion forecasting with real-world perception inputs: Are end-to-end approaches competitive? In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 18428–18435. IEEE, 2024.

[56] Yunzhi Yan, Haotong Lin, Chenxu Zhou, Weijie Wang, Haiyang Sun, Kun Zhan, Xianpeng Lang, Xiaowei Zhou, and Sida Peng. Street gaussians: Modeling dynamic urban scenes with gaussian splatting. In European Conference on Computer Vision, pages 156–173. Springer, 2024.

[57] Wenhao Yao, Zhenxin Li, Shiyi Lan, Zi Wang, Xinglong Sun, Jose M Alvarez, and Zuxuan Wu. Drivesuprim: Towards precise trajectory selection for end-to-end planning. arXiv preprint arXiv:2506.06659, 2025.

[58] Rajeev Yasarla, Shizhong Han, Hsin-Pai Cheng, Litian Liu, Shweta Mahajan, Apratim Bhattacharyya, Yunxiao Shi, Risheek Garrepalli, Hong Cai, and Fatih Porikli. Roca: Robust cross-domain end-to-end autonomous driving. arXiv preprint arXiv:2506.10145, 2025.

[59] Chengran Yuan, Zhanqi Zhang, Jiawei Sun, Shuo Sun, Zefan Huang, Christina Dao Wen Lee, Dongen Li, Yuhang Han, Anthony Wong, Keng Peng Tee, et al. Drama: An efficient end-to-end motion planner for autonomous driving with mamba. arXiv preprint arXiv:2408.03601, 2024.

[60] Rui Zhao, Yuze Fan, Ziguo Chen, Fei Gao, and Zhenhai Gao. Diffe2e: Rethinking end-to-end driving with a hybrid action diffusion and supervised policy. arXiv preprint arXiv:2505.19516, 2025.

[61] Wenzhao Zheng, Junjie Wu, Yao Zheng, Sicheng Zuo, Zixun Xie, Longchao Yang, Yong Pan, Zhihui Hao, Peng Jia, Xianpeng Lang, et al. Gaussianad: Gaussian-centric end-to-end autonomous driving. arXiv preprint arXiv:2412.10371.

[62] Yinan Zheng, Ruiming Liang, Kexin Zheng, Jinliang Zheng, Liyuan Mao, Jianxiong Li, Weihao Gu, Rui Ai, Shengbo Eben Li, Xianyuan Zhan, and Jingjing Liu. Diffusion-based planning for autonomous driving with flexible guidance. In Proceedings of the International Conference on Learning Representations, 2025.

Supplementary Materials

A. Parameters setup ...

Fused QKV Projection: In the self-attention mechanism, the separate linear layers for query (Q), key (K), and value (V) were replaced with a single, fused linear layer that computes all three projections in one operation. This reduces three separate matrix multiplica...

Footnote 3: https://github.com/hustvl/DiffusionDrive/issues/57 as well as issues/45

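To make the fused projection concrete, here is a minimal PyTorch sketch. It is not the authors' code: the class name `FusedQKVAttention`, the dimensions, and the head count are assumptions chosen for illustration. The only point it demonstrates is that one `nn.Linear` of width `3 * dim` yields the same Q, K and V as three separate layers whose weights are its row blocks.

```python
import torch
import torch.nn as nn


class FusedQKVAttention(nn.Module):
    """Self-attention whose Q, K and V come from a single linear layer (illustrative)."""

    def __init__(self, dim: int, num_heads: int) -> None:
        super().__init__()
        assert dim % num_heads == 0, "dim must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # One dim -> 3*dim projection replaces three separate dim -> dim layers.
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, tokens, dim = x.shape
        # (B, N, 3*D) -> (B, N, 3, H, Dh) -> (3, B, H, N, Dh)
        qkv = self.qkv(x).reshape(batch, tokens, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4).unbind(0)
        attn = (q @ k.transpose(-2, -1)) * self.head_dim ** -0.5
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(batch, tokens, dim)
        return self.proj(out)


torch.manual_seed(0)
layer = FusedQKVAttention(dim=64, num_heads=4)
features = torch.randn(2, 10, 64)
output = layer(features)

# The first 64 output rows of the fused weight act exactly like a separate Q layer.
q_separate = features @ layer.qkv.weight[:64].T + layer.qkv.bias[:64]
q_fused = layer.qkv(features)[..., :64]
q_matches = torch.allclose(q_separate, q_fused, atol=1e-5)
```

The fused layer changes how the work is scheduled (one larger matrix multiplication instead of three), not what is computed.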
Optimized MLP Block: The standard MLP block, which can be inefficient, was replaced by a dedicated MLP module. We also substituted the ReLU activation with GELU, a smoother activation function that is common in modern high-performance transformers and can lead to better convergence.

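The difference between the two activations can be seen numerically. The sketch below uses only the Python standard library and the exact (erf-based) GELU definition, GELU(x) = x · Φ(x), where Φ is the standard normal CDF. The function names and sample inputs are illustrative assumptions, not taken from the paper's code.

```python
import math


def relu(x: float) -> float:
    """ReLU: zero for negative inputs, identity otherwise (kink at 0)."""
    return max(0.0, x)


def gelu(x: float) -> float:
    """Exact GELU: x times the standard normal CDF of x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


samples = [-3.0, -1.0, -0.1, 0.0, 0.1, 1.0, 3.0]
table = [(x, relu(x), gelu(x)) for x in samples]
for x, r, g in table:
    print(f"x={x:+.1f}  relu={r:.4f}  gelu={g:+.4f}")
```

Unlike ReLU, GELU passes a small negative signal for slightly negative inputs and has no kink at zero, which is the "smoother" behavior the text refers to.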
Efficient Tensor Reshaping: Throughout the model, especially in the attention mechanism and the CaRT module's forward pass, tensor reshaping operations like .reshape() are now preceded by .contiguous(). This ensures the tensor is stored in a contiguous block of memory before the view operation, preventing potential performance penalties associated w...

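A small PyTorch sketch of why memory layout matters here. The tensor shapes are arbitrary assumptions. It shows that a permuted tensor is a non-contiguous view, that `.view()` rejects it, and that calling `.contiguous()` first makes the copy explicit so the subsequent reshape is a cheap view.

```python
import torch

x = torch.arange(24, dtype=torch.float32).reshape(2, 3, 4)

# permute() only rearranges strides; the underlying memory order is unchanged.
permuted = x.permute(0, 2, 1)  # shape (2, 4, 3)
was_contiguous = permuted.is_contiguous()

# .view() needs compatible strides, so it fails on the permuted tensor.
try:
    permuted.view(2, 12)
    view_failed = False
except RuntimeError:
    view_failed = True

# .contiguous() materializes the data in order; the reshape is then a plain view.
compact = permuted.contiguous()
flat = compact.view(2, 12)

# .reshape() on the non-contiguous tensor gives the same values via an implicit copy.
same = torch.equal(flat, permuted.reshape(2, 12))
```

Calling `.contiguous()` explicitly does not remove the copy; it makes the point at which it happens visible and guarantees that later view operations on the result do not trigger further hidden copies.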
Gradient Checkpointing: We introduced optional gradient checkpointing within the transformer blocks. During training, this technique trades a small amount of re-computation in the backward pass for a significant reduction in memory usage, allowing for larger batch sizes which can further improve training throughput.

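One way such an optional switch could look in PyTorch is sketched below, using `torch.utils.checkpoint.checkpoint`. The block layout, the `use_checkpoint` flag, and the sizes are assumptions for illustration rather than the paper's implementation. The example also checks that gradients are unchanged by checkpointing.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class CheckpointedBlock(nn.Module):
    """Residual MLP block with optional gradient checkpointing (illustrative)."""

    def __init__(self, dim: int, use_checkpoint: bool = False) -> None:
        super().__init__()
        self.use_checkpoint = use_checkpoint
        self.body = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.use_checkpoint and self.training:
            # Activations inside self.body are not stored; they are recomputed in backward.
            return x + checkpoint(self.body, x, use_reentrant=False)
        return x + self.body(x)


torch.manual_seed(0)
block = CheckpointedBlock(dim=32, use_checkpoint=True)
block.train()
x = torch.randn(4, 16, 32, requires_grad=True)

block(x).sum().backward()
grad_with_checkpoint = x.grad.clone()

block.use_checkpoint = False
x.grad = None
block(x).sum().backward()
grads_match = torch.allclose(grad_with_checkpoint, x.grad, atol=1e-6)
```

The memory saving grows with the number of checkpointed blocks and the activation size; the cost is one extra forward pass through each checkpointed block during the backward pass.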
In-place and Fused Operations: Smaller optimizations were made throughout the backbone, such as using inplace=True for ReLU activations in the FPN and removing biases from convolution and linear layers where they are followed by a normalization layer, which makes them redundant. Together, these structural and operational improvements result in a mor...

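The redundancy of a bias before normalization can be checked with a few lines of standard-library Python. The sketch assumes batch-statistics normalization of a single channel (as in batch normalization during training): a constant bias shifts every value and the mean by the same amount, so it cancels when the mean is subtracted. The helper name and the numbers are illustrative.

```python
from statistics import fmean, pvariance


def batch_normalize(values, eps=1e-5):
    """Normalize one channel with its own batch mean and (population) variance."""
    mean = fmean(values)
    var = pvariance(values, mu=mean)
    return [(v - mean) / (var + eps) ** 0.5 for v in values]


pre_activations = [0.3, -1.2, 2.5, 0.9]  # outputs of a bias-free layer for one channel
bias = 0.75                              # per-channel bias the layer could have added

with_bias = batch_normalize([v + bias for v in pre_activations])
without_bias = batch_normalize(pre_activations)

max_gap = max(abs(a - b) for a, b in zip(with_bias, without_bias))
```

The normalization layer's own learnable shift then plays the role the removed bias would have played, which is why dropping the bias saves parameters and an addition without changing what the layer can represent in this setting.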
