PRIX: Learning to Plan from Raw Pixels for End-to-End Autonomous Driving
Pith reviewed 2026-05-19 02:49 UTC · model grok-4.3
The pith
A camera-only architecture plans safe trajectories directly from raw pixels without LiDAR or bird's-eye views.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PRIX achieves state-of-the-art performance on the NavSim and nuScenes benchmarks by predicting safe trajectories directly from raw pixel inputs, using only camera data, with no explicit bird's-eye-view (BEV) representation and no LiDAR. The architecture couples a visual feature extractor with a generative planning head, and its Context-aware Recalibration Transformer (CaRT) enhances multi-level visual features for more robust planning. According to the paper, PRIX matches the capabilities of larger multimodal diffusion planners while being significantly more efficient in inference speed and model size.
What carries the argument
The Context-aware Recalibration Transformer (CaRT) module, which refines multi-level visual features extracted from raw camera pixels to support direct generative trajectory planning.
If this is right
- Mass-market vehicles equipped only with cameras can achieve competitive planning performance without LiDAR hardware.
- Inference speed and model size become practical for onboard real-time deployment.
- End-to-end driving systems no longer require separate 3D reconstruction stages to reach benchmark-leading results.
- Training and evaluation can focus on pixel-to-trajectory mappings rather than intermediate bird's-eye representations.
Where Pith is reading between the lines
- The recalibration approach may generalize to other vision-only planning domains such as robotics navigation or drone path finding.
- If visual features prove sufficient here, future work could test whether similar lightweight recalibration replaces multi-sensor fusion in additional perception tasks.
- Performance parity with larger models suggests that efficiency gains could be traded for further accuracy improvements by scaling the same architecture.
Load-bearing premise
Raw camera pixels, after processing by the visual feature extractor and CaRT module, contain enough information to generate safe, collision-free trajectories without explicit 3D reconstruction or additional sensor modalities.
What would settle it
A controlled experiment on nuScenes or NavSim showing substantially higher collision or off-road rates for PRIX than for LiDAR-based baselines in low-visibility or dynamic intersection scenarios would falsify the claim.
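As a rough illustration of what "substantially higher" could mean in such an experiment, the sketch below compares two failure rates with a pooled two-proportion z-test. The function name `two_proportion_z` and the counts are hypothetical placeholders, not results from the paper, and the test treats the two sets of scenarios as independent samples, which is an assumption; if both planners are run on the same scenarios, a paired test would be more appropriate.

```python
import math


def two_proportion_z(fail_a, n_a, fail_b, n_b):
    """Compare failure rates of planner A and planner B.

    Returns (rate difference, z statistic, one-sided p-value) for the
    hypothesis that A fails more often than B. Hypothetical helper.
    """
    rate_a = fail_a / n_a
    rate_b = fail_b / n_b
    diff = rate_a - rate_b
    # Pooled failure rate under the null hypothesis of equal rates.
    pooled = (fail_a + fail_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0.0:
        # No variation at all (all pass or all fail): no evidence either way.
        return diff, 0.0, 0.5
    z = diff / se
    # Upper-tail probability of a standard normal variable.
    p_one_sided = 0.5 * math.erfc(z / math.sqrt(2))
    return diff, z, p_one_sided


# Placeholder counts (made up for illustration only):
# camera-only planner: 30 collisions in 500 low-visibility scenarios,
# LiDAR-based baseline: 18 collisions in 500 scenarios.
diff, z, p = two_proportion_z(30, 500, 18, 500)
print(f"rate difference = {diff:.3f}, z = {z:.2f}, one-sided p = {p:.3f}")
```

A small p-value together with a rate difference that matters in practice would count against the claim; a difference within noise would not.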
Original abstract
While end-to-end autonomous driving models show promising results, their practical deployment is often hindered by large model sizes, a reliance on expensive LiDAR sensors and computationally intensive BEV feature representations. This limits their scalability, especially for mass-market vehicles equipped only with cameras. To address these challenges, we propose PRIX (Plan from Raw Pixels). Our novel and efficient end-to-end driving architecture operates using only camera data, without explicit BEV representation and forgoing the need for LiDAR. PRIX leverages a visual feature extractor coupled with a generative planning head to predict safe trajectories from raw pixel inputs directly. A core component of our architecture is the Context-aware Recalibration Transformer (CaRT), a novel module designed to effectively enhance multi-level visual features for more robust planning. We demonstrate through comprehensive experiments that PRIX achieves state-of-the-art performance on the NavSim and nuScenes benchmarks, matching the capabilities of larger, multimodal diffusion planners while being significantly more efficient in terms of inference speed and model size, making it a practical solution for real-world deployment. Our work is open-source and the code will be at https://maxiuw.github.io/prix.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes PRIX, an end-to-end autonomous driving architecture that predicts safe trajectories directly from raw camera pixels without LiDAR or explicit BEV representations. It combines a visual feature extractor with a generative planning head and introduces the Context-aware Recalibration Transformer (CaRT) module to enhance multi-level visual features. The central claims are state-of-the-art performance on the NavSim and nuScenes benchmarks that matches larger multimodal diffusion planners while offering substantially lower inference latency and model size; the work is presented as open-source.
Significance. If the performance and efficiency claims are substantiated with rigorous controls, the result would be significant for practical camera-only autonomous driving, as it targets reduced sensor cost and computational overhead for mass-market vehicles. The planned code release is a clear strength that supports reproducibility.
major comments (2)
- Experimental evaluation: the abstract and results sections assert SOTA performance and efficiency gains, yet no details are provided on training/validation data splits, ablation studies isolating the CaRT module, error bars, or failure-case analysis. These omissions prevent full evaluation of the central performance claim against the cited larger diffusion planners.
- Method and evaluation: the claim that raw monocular pixels plus the visual extractor and CaRT suffice for safe, collision-free planning without explicit 3D reconstruction or LiDAR rests on closed-loop benchmark metrics that can be achieved by short-horizon imitation. No targeted stress tests under distribution shift (novel lighting, thin distant obstacles, ego-motion parallax) or direct comparison to depth-augmented baselines are reported, leaving the weakest assumption unexamined.
minor comments (1)
- Abstract: the code link https://maxiuw.github.io/prix should be replaced with a permanent repository URL (e.g., GitHub) in the camera-ready version.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications from the manuscript and indicate where revisions will be made to strengthen the presentation.
Point-by-point responses
- Referee: Experimental evaluation: the abstract and results sections assert SOTA performance and efficiency gains, yet no details are provided on training/validation data splits, ablation studies isolating the CaRT module, error bars, or failure-case analysis. These omissions prevent full evaluation of the central performance claim against the cited larger diffusion planners.
  Authors: We thank the referee for this observation. The manuscript details the training and validation splits in Section 4.1, following the official NavSim and nuScenes protocols. Ablation studies isolating the contribution of the CaRT module appear in Section 4.3 and Table 3. Error bars computed over multiple runs are reported in Tables 1 and 2. We agree that failure-case analysis is not sufficiently developed. We will add a dedicated subsection with qualitative examples and common failure modes in the revised manuscript.
  Revision: yes
- Referee: Method and evaluation: the claim that raw monocular pixels plus the visual extractor and CaRT suffice for safe, collision-free planning without explicit 3D reconstruction or LiDAR rests on closed-loop benchmark metrics that can be achieved by short-horizon imitation. No targeted stress tests under distribution shift (novel lighting, thin distant obstacles, ego-motion parallax) or direct comparison to depth-augmented baselines are reported, leaving the weakest assumption unexamined.
  Authors: We respectfully note that the generative planning head is trained on multi-step trajectories and the closed-loop evaluation on NavSim and nuScenes measures full-horizon safety rather than pure short-horizon imitation. The CaRT module is specifically designed to recalibrate multi-level features for implicit spatial reasoning. Nevertheless, we agree that targeted robustness checks would strengthen the claims. We will incorporate comparisons against depth-augmented baselines and add discussion of performance under lighting and obstacle variations in the revised experiments section.
  Revision: partial
Circularity Check
No circularity: empirical benchmark results independent of internal definitions
The paper's core claims rest on training a visual feature extractor plus CaRT module on standard driving datasets and reporting closed-loop metrics on NavSim and nuScenes. These are external, falsifiable evaluations against held-out trajectories and do not reduce any 'prediction' to a quantity defined by the authors' own fitted parameters or prior self-citations. No equations equate outputs to inputs by construction, no uniqueness theorems are imported from the same authors, and the architecture choices are presented as design decisions rather than derived necessities. The derivation chain is therefore self-contained and non-circular.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Raw camera images contain sufficient information for safe trajectory planning in the evaluated scenarios.
invented entities (1)
- Context-aware Recalibration Transformer (CaRT): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Paper passage: "PRIX leverages a visual feature extractor coupled with a generative planning head to predict safe trajectories from raw pixel inputs directly. A core component of our architecture is the Context-aware Recalibration Transformer (CaRT)"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Paper passage: "We demonstrate through comprehensive experiments that PRIX achieves state-of-the-art performance on the NavSim and nuScenes benchmarks"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- CLOVER: Closed-Loop Value Estimation and Ranking for End-to-End Autonomous Driving Planning
  CLOVER is a closed-loop generator-scorer framework that expands proposal coverage with pseudo-expert trajectories and performs conservative self-distillation to achieve state-of-the-art planning scores on NAVSIM and nuScenes.
Supplementary Materials
A. Parameters setup ...
- Fused QKV Projection: In the self-attention mechanism, the separate linear layers for query (Q), key (K), and value (V) were replaced with a single, fused linear layer that computes all three projections in one operation. This reduces three separate matrix multiplica...
- Optimized MLP Block: The standard MLP block, which can be inefficient, was replaced by a dedicated MLP module. We also substituted the ReLU activation with GELU, a smoother activation function that is common in modern high-performance transformers and can lead to better convergence.
- Efficient Tensor Reshaping: Throughout the model, especially in the attention mechanism and the CaRT module's forward pass, tensor reshaping operations like .reshape() are now preceded by .contiguous(). This ensures the tensor is stored in a contiguous block of memory before the view operation, preventing potential performance penalties associated w...
- Gradient Checkpointing: We introduced optional gradient checkpointing within the transformer blocks. During training, this technique trades a small amount of re-computation in the backward pass for a significant reduction in memory usage, allowing for larger batch sizes, which can further improve training throughput.
- In-place and Fused Operations: Smaller optimizations were made throughout the backbone, such as using inplace=True for ReLU activations in the FPN and removing biases from convolution and linear layers where they are followed by a normalization layer, which makes them redundant.
Together, these structural and operational improvements result in a mor...
Footnote 3: https://github.com/hustvl/DiffusionDrive/issues/57 as well as issues/45
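The fused QKV projection described in the list above can be illustrated with a minimal, dependency-free sketch. It is not the authors' code: the helpers `matmul`, `hconcat`, and `split_cols` and the toy sizes are assumptions made for illustration, and biases are omitted. The point is only that concatenating the three weight matrices column-wise and doing one multiplication gives the same Q, K, and V as three separate multiplications.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def hconcat(*mats):
    """Concatenate matrices column-wise (all must share the row count)."""
    return [sum((m[i] for m in mats), []) for i in range(len(mats[0]))]


def split_cols(m, width):
    """Split a matrix into consecutive column blocks of the given width."""
    blocks = len(m[0]) // width
    return [[row[k * width:(k + 1) * width] for row in m] for k in range(blocks)]


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def random_matrix(rows, cols):
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
tokens, dim = 4, 3  # toy sizes, chosen only for illustration
x = random_matrix(tokens, dim)
w_q, w_k, w_v = random_matrix(dim, dim), random_matrix(dim, dim), random_matrix(dim, dim)

# Unfused: three separate projections of the same input.
q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)

# Fused: one (dim x 3*dim) weight, one multiplication, then a split.
w_qkv = hconcat(w_q, w_k, w_v)
q_f, k_f, v_f = split_cols(matmul(x, w_qkv), dim)

print(max_abs_diff(q, q_f), max_abs_diff(k, k_f), max_abs_diff(v, v_f))
```

In a PyTorch model this idea would typically correspond to replacing three `nn.Linear(dim, dim)` layers with one `nn.Linear(dim, 3 * dim)` followed by a split, though how the paper's code does it is not shown in the text above.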