arXiv: 2507.17596 · v3 · submitted 2025-07-23 · 💻 cs.CV · cs.AI · cs.LG · cs.RO

PRIX: Learning to Plan from Raw Pixels for End-to-End Autonomous Driving

Pith reviewed 2026-05-19 02:49 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG · cs.RO
keywords autonomous driving · end-to-end planning · camera-based navigation · trajectory prediction · raw pixel input · recalibration transformer · NavSim benchmark · nuScenes benchmark

The pith

A camera-only architecture plans safe trajectories directly from raw pixels without LiDAR or bird's-eye views.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PRIX as an end-to-end autonomous driving system that takes raw camera images as input and outputs collision-free trajectories. It avoids the usual requirements for expensive LiDAR sensors and explicit bird's-eye-view feature maps by pairing a visual feature extractor with a generative planning head. A new Context-aware Recalibration Transformer module refines multi-level image features to support this direct pixel-to-plan mapping. Sympathetic readers would care because the approach promises smaller models and faster inference, which could make high-performance driving feasible on ordinary camera-equipped vehicles.

Core claim

PRIX achieves state-of-the-art performance on the NavSim and nuScenes benchmarks by predicting safe trajectories directly from raw pixel inputs using only camera data, without explicit BEV representation and forgoing the need for LiDAR. The architecture leverages a visual feature extractor coupled with a generative planning head, with the Context-aware Recalibration Transformer enhancing multi-level visual features for more robust planning. This matches the capabilities of larger multimodal diffusion planners while remaining significantly more efficient in inference speed and model size.

What carries the argument

The Context-aware Recalibration Transformer (CaRT) module, which refines multi-level visual features extracted from raw camera pixels to support direct generative trajectory planning.
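
As a rough illustration only: the Figure 2 caption describes CaRT as combining ResNet features with self-attention and skip connections. The sketch below shows that generic pattern in plain Python on a handful of feature tokens. The function names (`self_attention`, `recalibrate`), the single attention head, and the identity query/key/value projections are assumptions made for brevity; they are not taken from the paper, whose actual module is specified in its Sec. 3.1.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections.

    Assumption: a real module would use learned projections and several heads.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    attended = []
    for query in tokens:
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        weights = softmax(scores)
        attended.append(
            [sum(w * value[d] for w, value in zip(weights, tokens)) for d in range(dim)]
        )
    return attended


def recalibrate(tokens):
    """Skip connection: original features plus their attention context."""
    context = self_attention(tokens)
    return [[t + c for t, c in zip(tok, ctx)] for tok, ctx in zip(tokens, context)]


# Hypothetical feature tokens (for example, flattened cells of one feature map).
features = [[1.0, 2.0], [1.0, 2.0]]
refined = recalibrate(features)
```

The skip connection means the refined features never lose the original signal; the attention term only adds context gathered from the other tokens.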

If this is right

  • Mass-market vehicles equipped only with cameras can achieve competitive planning performance without LiDAR hardware.
  • Inference speed and model size become practical for onboard real-time deployment.
  • End-to-end driving systems no longer require separate 3D reconstruction stages to reach benchmark-leading results.
  • Training and evaluation can focus on pixel-to-trajectory mappings rather than intermediate bird's-eye representations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The recalibration approach may generalize to other vision-only planning domains such as robotics navigation or drone path finding.
  • If visual features prove sufficient here, future work could test whether similar lightweight recalibration replaces multi-sensor fusion in additional perception tasks.
  • Performance parity with larger models suggests that efficiency gains could be traded for further accuracy improvements by scaling the same architecture.

Load-bearing premise

Raw camera pixels, after processing by the visual feature extractor and CaRT module, contain enough information to generate safe, collision-free trajectories without explicit 3D reconstruction or additional sensor modalities.

What would settle it

A controlled experiment on nuScenes or NavSim showing substantially higher collision or off-road rates for PRIX than for LiDAR-based baselines in low-visibility or dynamic intersection scenarios would falsify the claim.
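
One way such a comparison could be scored, as a minimal sketch: compute each planner's collision rate over a fixed scenario subset and check whether the gap is larger than sampling noise with a pooled two-proportion z statistic. The helper names and all counts below are hypothetical placeholders, not results from the paper or from either benchmark.

```python
import math


def collision_rate(outcomes):
    """Fraction of scenarios in which a collision occurred (outcomes are booleans)."""
    return sum(1 for hit in outcomes if hit) / len(outcomes)


def two_proportion_z(collisions_a, n_a, collisions_b, n_b):
    """Pooled two-proportion z statistic for rate_a minus rate_b."""
    p_a = collisions_a / n_a
    p_b = collisions_b / n_b
    pooled = (collisions_a + collisions_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if std_err == 0.0:
        return 0.0  # both planners have identical all-or-nothing outcomes
    return (p_a - p_b) / std_err


# Made-up counts on a hypothetical low-visibility subset of 100 scenarios each.
camera_only_collisions, lidar_collisions, n_scenarios = 20, 10, 100
z = two_proportion_z(camera_only_collisions, n_scenarios, lidar_collisions, n_scenarios)
print(f"z statistic: {z:.2f}")
```

A large positive statistic on the stress subset, absent on the full benchmark, would be the kind of evidence the paragraph above describes.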

Figures

Figures reproduced from arXiv: 2507.17596 by Lianhang Liu, Maciej K. Wozniak, Patric Jensfelt, Yixi Cai.

Figure 1: Performance vs. inference speed comparing our camera …
Figure 2: PRIX Overview: Visual features from multi-camera images are extracted by ResNet layers (f_i) together with self-attention and skip connections (CaRT, described in Sec. 3.1). Next, visual features are used for auxiliary perception tasks (see Sec. 3.4) and trajectory planning (see Sec. 3.2). A conditional diffusion planner then uses visual features, along with the current ego state and a set of noisy anch…
Figure 3: Architecture of our visual feature extractor with …
Figure 4: Diffusion steps vs performance on Navsim-v1.
Figure 5: Qualitative trajectory predictions from our method. In some cases, like …
While end-to-end autonomous driving models show promising results, their practical deployment is often hindered by large model sizes, a reliance on expensive LiDAR sensors and computationally intensive BEV feature representations. This limits their scalability, especially for mass-market vehicles equipped only with cameras. To address these challenges, we propose PRIX (Plan from Raw Pixels). Our novel and efficient end-to-end driving architecture operates using only camera data, without explicit BEV representation and forgoing the need for LiDAR. PRIX leverages a visual feature extractor coupled with a generative planning head to predict safe trajectories from raw pixel inputs directly. A core component of our architecture is the Context-aware Recalibration Transformer (CaRT), a novel module designed to effectively enhance multi-level visual features for more robust planning. We demonstrate through comprehensive experiments that PRIX achieves state-of-the-art performance on the NavSim and nuScenes benchmarks, matching the capabilities of larger, multimodal diffusion planners while being significantly more efficient in terms of inference speed and model size, making it a practical solution for real-world deployment. Our work is open-source and the code will be at https://maxiuw.github.io/prix.
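
The "generative planning head" is described in the Figure 2 caption as a conditional diffusion planner that starts from noisy anchors and conditions on visual features and the ego state. The toy loop below sketches only that general idea: perturb an anchor trajectory, then iteratively move it toward a denoiser's estimate. Everything here is an assumption for illustration, including `toy_denoiser` (a stand-in for a learned network), the `condition` dictionary, and the linear update rule; none of it is the paper's actual planner or schedule.

```python
import random


def add_noise(anchor, sigma, rng):
    """Perturb each (x, y) waypoint of an anchor trajectory with Gaussian noise."""
    return [(x + rng.gauss(0.0, sigma), y + rng.gauss(0.0, sigma)) for x, y in anchor]


def toy_denoiser(noisy, condition):
    """Hypothetical stand-in for a learned network.

    Returns a straight-line trajectory at the conditioned speed and ignores
    the noisy input; a real denoiser would use both.
    """
    speed = condition["speed"]
    return [(speed * (i + 1), 0.0) for i in range(len(noisy))]


def denoise(noisy, condition, steps, denoiser=toy_denoiser):
    """Move the trajectory toward the denoiser's estimate over a few steps."""
    traj = list(noisy)
    for remaining in range(steps, 0, -1):
        clean = denoiser(traj, condition)
        # Close 1/remaining of the gap, so the last step lands on the estimate.
        traj = [
            (x + (cx - x) / remaining, y + (cy - y) / remaining)
            for (x, y), (cx, cy) in zip(traj, clean)
        ]
    return traj


rng = random.Random(0)
anchor = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]  # hypothetical anchor waypoints
noisy_anchor = add_noise(anchor, sigma=0.5, rng=rng)
plan = denoise(noisy_anchor, {"speed": 2.0}, steps=2)
```

The `steps` argument is the knob that a study like the one in Figure 4 (diffusion steps vs. performance) would vary.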

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes PRIX, an end-to-end autonomous driving architecture that predicts safe trajectories directly from raw camera pixels without LiDAR or explicit BEV representations. It combines a visual feature extractor with a generative planning head and introduces the Context-aware Recalibration Transformer (CaRT) module to enhance multi-level visual features. The central claims are state-of-the-art performance on the NavSim and nuScenes benchmarks that matches larger multimodal diffusion planners while offering substantially lower inference latency and model size; the work is presented as open-source.

Significance. If the performance and efficiency claims are substantiated with rigorous controls, the result would be significant for practical camera-only autonomous driving, as it targets reduced sensor cost and computational overhead for mass-market vehicles. The planned code release is a clear strength that supports reproducibility.

major comments (2)
  1. Experimental evaluation: the abstract and results sections assert SOTA performance and efficiency gains, yet no details are provided on training/validation data splits, ablation studies isolating the CaRT module, error bars, or failure-case analysis. These omissions prevent full evaluation of the central performance claim against the cited larger diffusion planners.
  2. Method and evaluation: the claim that raw camera pixels plus the visual extractor and CaRT suffice for safe, collision-free planning without explicit 3D reconstruction or LiDAR rests on closed-loop benchmark metrics that can be achieved by short-horizon imitation. No targeted stress tests under distribution shift (novel lighting, thin distant obstacles, ego-motion parallax) or direct comparison to depth-augmented baselines are reported, leaving the weakest assumption unexamined.
minor comments (1)
  1. Abstract: the code link https://maxiuw.github.io/prix should be replaced with a permanent repository URL (e.g., GitHub) in the camera-ready version.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications from the manuscript and indicate where revisions will be made to strengthen the presentation.

  1. Referee: Experimental evaluation: the abstract and results sections assert SOTA performance and efficiency gains, yet no details are provided on training/validation data splits, ablation studies isolating the CaRT module, error bars, or failure-case analysis. These omissions prevent full evaluation of the central performance claim against the cited larger diffusion planners.

    Authors: We thank the referee for this observation. The manuscript details the training and validation splits in Section 4.1, following the official NavSim and nuScenes protocols. Ablation studies isolating the contribution of the CaRT module appear in Section 4.3 and Table 3. Error bars computed over multiple runs are reported in Tables 1 and 2. We agree that failure-case analysis is not sufficiently developed. We will add a dedicated subsection with qualitative examples and common failure modes in the revised manuscript. revision: yes

  2. Referee: Method and evaluation: the claim that raw camera pixels plus the visual extractor and CaRT suffice for safe, collision-free planning without explicit 3D reconstruction or LiDAR rests on closed-loop benchmark metrics that can be achieved by short-horizon imitation. No targeted stress tests under distribution shift (novel lighting, thin distant obstacles, ego-motion parallax) or direct comparison to depth-augmented baselines are reported, leaving the weakest assumption unexamined.

    Authors: We respectfully note that the generative planning head is trained on multi-step trajectories and the closed-loop evaluation on NavSim and nuScenes measures full-horizon safety rather than pure short-horizon imitation. The CaRT module is specifically designed to recalibrate multi-level features for implicit spatial reasoning. Nevertheless, we agree that targeted robustness checks would strengthen the claims. We will incorporate comparisons against depth-augmented baselines and add discussion of performance under lighting and obstacle variations in the revised experiments section. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical benchmark results independent of internal definitions

The paper's core claims rest on training a visual feature extractor plus CaRT module on standard driving datasets and reporting closed-loop metrics on NavSim and nuScenes. These are external, falsifiable evaluations against held-out trajectories and do not reduce any 'prediction' to a quantity defined by the authors' own fitted parameters or prior self-citations. No equations equate outputs to inputs by construction, no uniqueness theorems are imported from the same authors, and the architecture choices are presented as design decisions rather than derived necessities. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The performance claims rest on the domain assumption that camera images suffice for planning and on the effectiveness of the newly introduced CaRT module; no explicit free parameters or invented physical entities are described.

axioms (1)
  • [domain assumption] Raw camera images contain sufficient information for safe trajectory planning in the evaluated scenarios
    Core premise enabling the camera-only design stated in the abstract.
invented entities (1)
  • Context-aware Recalibration Transformer (CaRT) [no independent evidence]
    purpose: Enhance multi-level visual features for more robust planning
    New module presented as a core component of the architecture.

pith-pipeline@v0.9.0 · 5751 in / 1240 out tokens · 52679 ms · 2026-05-19T02:49:53.320155+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CLOVER: Closed-Loop Value Estimation and Ranking for End-to-End Autonomous Driving Planning

    cs.RO · 2026-05 · conditional · novelty 6.0

    CLOVER is a closed-loop generator-scorer framework that expands proposal coverage with pseudo-expert trajectories and performs conservative self-distillation to achieve state-of-the-art planning scores on NAVSIM and nuScenes.

Reference graph

Works this paper leans on

67 extracted references · 67 canonical work pages · cited by 1 Pith paper · 3 internal anchors

  1. [1]

    Zenseact open dataset: A large-scale and diverse multimodal dataset for autonomous driving

    Mina Alibeigi, William Ljungbergh, Adam Tonderski, Georg Hess, Adam Lilja, Carl Lindstr ¨om, Daria Motorniuk, Jun- sheng Fu, Jenny Widahl, and Christoffer Petersson. Zenseact open dataset: A large-scale and diverse multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20178– 20188, 2023. 2

  2. [2]

    Navigation world models

    Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, and Yann LeCun. Navigation world models. In Proceedings of the Computer Vision and Pattern Recognition Conference , pages 15791–15801, 2025. 2

  3. [3]

    nuscenes: A multi- modal dataset for autonomous driving

    Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Gi- ancarlo Baldan, and Oscar Beijbom. nuscenes: A multi- modal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11621–11631, 2020. 2, 6

  4. [4]

    Pseudo-simulation for autonomous driving.arXiv preprint arXiv:2506.04218,

    Wei Cao, Marcel Hallgarten, Tianyu Li, Daniel Dauner, Xunjiang Gu, Caojun Wang, Yakov Miron, Marco Aiello, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, Andreas Geiger, and Kashyap Chitta. Pseudo- simulation for autonomous driving. arXiv, 2506.04218,

  5. [5]

    End-to-end autonomous driving: Challenges and frontiers

    Li Chen, Penghao Wu, Kashyap Chitta, Bernhard Jaeger, An- dreas Geiger, and Hongyang Li. End-to-end autonomous driving: Challenges and frontiers. IEEE Transactions on Pat- tern Analysis and Machine Intelligence, 2024. 1

  6. [6]

    VADv2: End-to-End Vectorized Autonomous Driving via Probabilistic Planning

    Shaoyu Chen, Bo Jiang, Hao Gao, Bencheng Liao, Qing Xu, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. Vadv2: End-to-end vectorized autonomous driving via probabilistic planning. arXiv preprint arXiv:2402.13243,

  7. [7]

    Solve: Synergy of language-vision and end-to-end networks for autonomous driving

    Xuesong Chen, Linjiang Huang, Tao Ma, Rongyao Fang, Shaoshuai Shi, and Hongsheng Li. Solve: Synergy of language-vision and end-to-end networks for autonomous driving. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12068–12077, 2025. 3

  8. [8]

    Ppad: Iterative interactions of prediction and planning for end-to-end autonomous driving

    Zhili Chen, Maosheng Ye, Shuangjie Xu, Tongyi Cao, and Qifeng Chen. Ppad: Iterative interactions of prediction and planning for end-to-end autonomous driving. In European Conference on Computer Vision , pages 239–256. Springer,

  9. [9]

    Dualat: Dual attention transformer for end-to-end au- tonomous driving

    Zesong Chen, Ze Yu, Jun Li, Linlin You, and Xiaojun Tan. Dualat: Dual attention transformer for end-to-end au- tonomous driving. In 2024 IEEE International Conference on Robotics and Automation (ICRA) , pages 16353–16359. IEEE, 2024. 2

  10. [10]

    Transfuser: Imitation with transformer-based sensor fusion for autonomous driv- ing

    Kashyap Chitta, Aditya Prakash, Bernhard Jaeger, Zehao Yu, Katrin Renz, and Andreas Geiger. Transfuser: Imitation with transformer-based sensor fusion for autonomous driv- ing. IEEE transactions on pattern analysis and machine in- telligence, 45(11):12878–12895, 2022. 1, 2, 5, 7

  11. [11]

    Formula 1 icons

    Darius Dan. Formula 1 icons. In https://www.flaticon.com/free-icons/formula-1. Flaticon. 8

  12. [12]

    Navsim: Data-driven non- reactive autonomous vehicle simulation and benchmark- ing

    Daniel Dauner, Marcel Hallgarten, Tianyu Li, Xinshuo Weng, Zhiyu Huang, Zetong Yang, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, Andreas Geiger, and Kashyap Chitta. Navsim: Data-driven non- reactive autonomous vehicle simulation and benchmark- ing. In Advances in Neural Information Processing Systems (NeurIPS), 2024. 2, 6

  13. [13]

    Artemis: Autoregressive end-to-end trajectory planning with mixture of experts for autonomous driving.arXiv preprint arXiv:2504.19580, 2025

    Renju Feng, Ning Xi, Duanfeng Chu, Rukang Wang, Zejian Deng, Anzheng Wang, Liping Lu, Jinxiang Wang, and Yan- jun Huang. Artemis: Autoregressive end-to-end trajectory planning with mixture of experts for autonomous driving. arXiv preprint arXiv:2504.19580, 2025. 2

  14. [14]

    Polarpoint-bev: Bird-eye- view perception in polar points for explainable end-to-end autonomous driving

    Yuchao Feng and Yuxiang Sun. Polarpoint-bev: Bird-eye- view perception in polar points for explainable end-to-end autonomous driving. IEEE Transactions on Intelligent Vehi- cles, 2024. 3

  15. [15]

    Man truckscenes: A multimodal dataset for autonomous trucking in diverse con- ditions

    Felix Fent, Fabian Kuttenreich, Florian Ruch, Farija Rizwin, Stefan Juergens, Lorenz Lechermann, Christian Nissler, An- drea Perl, Ulrich V oll, Min Yan, et al. Man truckscenes: A multimodal dataset for autonomous trucking in diverse con- ditions. Advances in Neural Information Processing Systems, 37:62062–62082, 2024. 2

  16. [16]

    Rad: Training an end-to-end driving policy via large-scale 3dgs-based reinforcement learning,

    Hao Gao, Shaoyu Chen, Bo Jiang, Bencheng Liao, Yiang Shi, Xiaoyang Guo, Yuechuan Pu, Haoran Yin, Xiangyu Li, Xinbang Zhang, et al. Rad: Training an end-to-end driv- ing policy via large-scale 3dgs-based reinforcement learning. arXiv preprint arXiv:2502.13144, 2025. 2

  17. [17]

    Looking for trouble: In- formative planning for safe trajectories with occlusions

    Barry Gilhuly, Armin Sadeghi, Peyman Yedmellat, Kasra Rezaee, and Stephen L Smith. Looking for trouble: In- formative planning for safe trajectories with occlusions. In 2022 International Conference on Robotics and Automation (ICRA), pages 8985–8991. IEEE, 2022. 8

  18. [18]

    Cleverdistiller: Simple and spatially consistent cross- modal distillation

    Hariprasath Govindarajan, Maciej K Wozniak, Marvin Klingner, Camille Maurice, B Ravi Kiran, and Senthil Yoga- mani. Cleverdistiller: Simple and spatially consistent cross- modal distillation. arXiv preprint arXiv:2503.09878, 2025. 8

  19. [19]

    ipad: Iterative proposal-centric end-to-end autonomous driving.arXiv preprint arXiv:2505.15111, 2025

    Ke Guo, Haochen Liu, Xiaojun Wu, Jia Pan, and Chen Lv. ipad: Iterative proposal-centric end-to-end autonomous driv- ing. arXiv preprint arXiv:2505.15111, 2025. 3

  20. [20]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR,

  21. [21]

    Dis- tilling multi-modal large language models for autonomous driving

    Deepti Hegde, Rajeev Yasarla, Hong Cai, Shizhong Han, Apratim Bhattacharyya, Shweta Mahajan, Litian Liu, Risheek Garrepalli, Vishal M Patel, and Fatih Porikli. Dis- tilling multi-modal large language models for autonomous driving. Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, 2025. 3 9

  22. [22]

    Splatad: Real-time li- dar and camera rendering with 3d gaussian splatting for au- tonomous driving

    Georg Hess, Carl Lindstr ¨om, Maryam Fatemi, Christoffer Petersson, and Lennart Svensson. Splatad: Real-time li- dar and camera rendering with 3d gaussian splatting for au- tonomous driving. In Proceedings of the Computer Vision and Pattern Recognition Conference , pages 11982–11992,

  23. [23]

    St-p3: End-to-end vision-based au- tonomous driving via spatial-temporal feature learning

    Shengchao Hu, Li Chen, Penghao Wu, Hongyang Li, Junchi Yan, and Dacheng Tao. St-p3: End-to-end vision-based au- tonomous driving via spatial-temporal feature learning. In European Conference on Computer Vision (ECCV), 2022. 8

  24. [24]

    Planning-oriented autonomous driving

    Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. In Pro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17853–17862, 2023. 2, 3, 7, 8

  25. [25]

    Efficient multi-camera tokenization with triplanes for end-to-end driving

    Boris Ivanovic, Cristiano Saltori, Yurong You, Yan Wang, Wenjie Luo, and Marco Pavone. Efficient multi-camera tokenization with triplanes for end-to-end driving. arXiv preprint arXiv:2506.12251, 2025. 3

  26. [26]

    Diffvla: Vision-language guided diffusion planning for autonomous driving

    Anqing Jiang, Yu Gao, Zhigang Sun, Yiru Wang, Jijun Wang, Jinghao Chai, Qian Cao, Yuweng Heng, Hao Jiang, Zongzheng Zhang, et al. Diffvla: Vision-language guided diffusion planning for autonomous driving. arXiv preprint arXiv:2505.19381, 2025. 3

  27. [27]

    Vad: Vectorized scene representa- tion for efficient autonomous driving

    Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. Vad: Vectorized scene representa- tion for efficient autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 8340–8350, 2023. 2, 3, 5, 8

  28. [28]

    Transdiffuser: End-to-end trajectory gen- eration with decorrelated multi-modal representation for au- tonomous driving

    Xuefeng Jiang, Yuan Ma, Pengxiang Li, Leimeng Xu, Xin Wen, Kun Zhan, Zhongpu Xia, Peng Jia, XianPeng Lang, and Sheng Sun. Transdiffuser: End-to-end trajectory gen- eration with decorrelated multi-modal representation for au- tonomous driving. arXiv preprint arXiv:2505.09315, 2025. 1, 2, 3

  29. [29]

    3d gaussian splatting for real-time radiance field rendering

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimk ¨uhler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1,

  30. [30]

    Hydra- mdp++: Advancing end-to-end driving via hydra-distillation with expert-guided decision analysis

    Kailin Li, Zhenxin Li, Shiyi Lan, Jiayi Liu, Yuan Xie, Zuxuan Wu, Zhiding Yu, Jose M Alvarez, et al. Hydra- mdp++: Advancing end-to-end driving via hydra-distillation with expert-guided decision analysis. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (Workshops), 2025. 3, 7

  31. [31]

    Hydra-MDP: End-to-end Multimodal Planning with Multi-target Hydra-Distillation

    Zhenxin Li, Kailin Li, Shihao Wang, Shiyi Lan, Zhiding Yu, Yishen Ji, Zhiqi Li, Ziyue Zhu, Jan Kautz, Zuxuan Wu, et al. Hydra-mdp: End-to-end multimodal planning with multi- target hydra-distillation. arXiv preprint arXiv:2406.06978 ,

  32. [32]

    Generalized trajectory scoring for end-to-end multimodal planning.arXiv preprint arXiv:2506.06664, 2025

    Zhenxin Li, Wenhao Yao, Zi Wang, Xinglong Sun, Joshua Chen, Nadine Chang, Maying Shen, Zuxuan Wu, Shiyi Lan, and Jose M Alvarez. Generalized trajectory scor- ing for end-to-end multimodal planning. arXiv preprint arXiv:2506.06664, 2025. 3

  33. [33]

    Pnpnet: End-to-end per- ception and prediction with tracking in the loop

    Ming Liang, Bin Yang, Wenyuan Zeng, Yun Chen, Rui Hu, Sergio Casas, and Raquel Urtasun. Pnpnet: End-to-end per- ception and prediction with tracking in the loop. In Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11553–11562, 2020. 1

  34. [34]

    Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving

    Bencheng Liao, Shaoyu Chen, Haoran Yin, Bo Jiang, Cheng Wang, Sixu Yan, Xinbang Zhang, Xiangyu Li, Ying Zhang, Qian Zhang, et al. Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12037–12047, 2025. 1, 3, 4, 5, 6, 7, 8, 2

  35. [35]

    Fully sparse 3d occupancy prediction, 2024

    Haisong Liu, Yang Chen, Haiguang Wang, Zetong Yang, Tianyu Li, Jia Zeng, Li Chen, Hongyang Li, and Limin Wang. Fully sparse 3d occupancy prediction, 2024. 8

  36. [36]

    Segment any point cloud sequences by distilling vision foundation models

    Youquan Liu, Lingdong Kong, Jun Cen, Runnan Chen, Wen- wei Zhang, Liang Pan, Kai Chen, and Ziwei Liu. Segment any point cloud sequences by distilling vision foundation models. Advances in Neural Information Processing Sys- tems, 36, 2024. 8

  37. [37]

    Fast and furi- ous: Real time end-to-end 3d detection, tracking and motion forecasting with a single convolutional net

    Wenjie Luo, Bin Yang, and Raquel Urtasun. Fast and furi- ous: Real time end-to-end 3d detection, tracking and motion forecasting with a single convolutional net. InProceedings of the IEEE conference on Computer Vision and Pattern Recog- nition, pages 3569–3577, 2018. 1

  38. [38]

    VGGT-SLAM: Dense RGB SLAM Optimized on the SL(4) Manifold

    Dominic Maggio, Hyungtae Lim, and Luca Carlone. Vggt- slam: Dense rgb slam optimized on the sl (4) manifold.arXiv preprint arXiv:2505.12549, 2025. 2

  39. [39]

    Nerf: Representing scenes as neural radiance fields for view syn- thesis

    Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view syn- thesis. Communications of the ACM , 65(1):99–106, 2021. 2

  40. [40]

    Risk-aware motion planning for autonomous vehicles with safety specifications

    Truls Nyberg, Christian Pek, Laura Dal Col, Christoffer Nor´en, and Jana Tumova. Risk-aware motion planning for autonomous vehicles with safety specifications. In 2021 ieee intelligent vehicles symposium (iv), pages 1016–1023. IEEE,

  41. [41]

    Lego-drive: Language- enhanced goal-oriented closed-loop end-to-end autonomous driving

    Pranjal Paul, Anant Garg, Tushar Choudhary, Arun Kumar Singh, and K Madhava Krishna. Lego-drive: Language- enhanced goal-oriented closed-loop end-to-end autonomous driving. In 2024 IEEE/RSJ International Conference on In- telligent Robots and Systems (IROS) , pages 10020–10026. IEEE, 2024. 3

  42. [42]

    Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unproject- ing to 3d

    Jonah Philion and Sanja Fidler. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unproject- ing to 3d. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceed- ings, Part XIV 16, pages 194–210. Springer, 2020. 1

  43. [43]

    Lightemma: Lightweight end-to-end multimodal model for autonomous driving

    Zhijie Qiao, Haowei Li, Zhong Cao, and Henry X Liu. Lightemma: Lightweight end-to-end multimodal model for autonomous driving. arXiv preprint arXiv:2505.00284 ,

  44. [44]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 4 10

  45. [45]

    Perceive, predict, and plan: Safe motion planning through interpretable seman- tic representations

    Abbas Sadat, Sergio Casas, Mengye Ren, Xinyu Wu, Pranaab Dhawan, and Raquel Urtasun. Perceive, predict, and plan: Safe motion planning through interpretable seman- tic representations. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIII 16, pages 414–430. Springer, 2020. 1

  46. [46]

    Scalability in perception for autonomous driving: Waymo open dataset

    Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2446–2454, 2020. 2

  47. [47]

    Sparsedrive: End-to-end au- tonomous driving via sparse scene representation

    Wenchao Sun, Xuewu Lin, Yining Shi, Chuang Zhang, Hao- ran Wu, and Sifa Zheng. Sparsedrive: End-to-end au- tonomous driving via sparse scene representation. Proceed- ings of the IEEE International Conference on Robotics and Automation, 2025. 3, 8, 1

  48. [48]

    Neurad: Neural rendering for autonomous driving

    Adam Tonderski, Carl Lindstr ¨om, Georg Hess, William Ljungbergh, Lennart Svensson, and Christoffer Petersson. Neurad: Neural rendering for autonomous driving. In Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14895–14904, 2024. 2

  49. [49]

    Vggt: Vi- sual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Vi- sual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025. 2

  50. [50]

    Para-drive: Parallelized architecture for real- time autonomous driving

    Xinshuo Weng, Boris Ivanovic, Yan Wang, Yue Wang, and Marco Pavone. Para-drive: Parallelized architecture for real- time autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 15449–15458, 2024. 7

  51. [51]

    S3pt: Scene semantics and structure guided clus- tering to boost self-supervised pre-training for autonomous driving

    Maciej K Wozniak, Hariprasath Govindarajan, Marvin Klingner, Camille Maurice, B Ravi Kiran, and Senthil Yo- gamani. S3pt: Scene semantics and structure guided clus- tering to boost self-supervised pre-training for autonomous driving. In 2025 IEEE/CVF Winter Conference on Applica- tions of Computer Vision (WACV), pages 1660–1670. IEEE,

  52. [52]

    Toward a robust sensor fusion step for 3d object detection on corrupted data

    Maciej K Wozniak, Viktor Kårefjärd, Marko Thiel, and Patric Jensfelt. Toward a robust sensor fusion step for 3d object detection on corrupted data. IEEE Robotics and automation letters, 8(11):7018–7025, 2023. 2

  53. [53]

    Goalflow: Goal-driven flow matching for multimodal trajectories generation in end-to-end autonomous driving

    Zebin Xing, Xingyu Zhang, Yang Hu, Bo Jiang, Tong He, Qian Zhang, Xiaoxiao Long, and Wei Yin. Goalflow: Goal-driven flow matching for multimodal trajectories generation in end-to-end autonomous driving. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1602–1611, 2025. 1, 3, 5, 7

  54. [54]

    4d contrastive superflows are dense 3d representation learners

    Xiang Xu, Lingdong Kong, Hui Shuai, Wenwei Zhang, Liang Pan, Kai Chen, Ziwei Liu, and Qingshan Liu. 4d contrastive superflows are dense 3d representation learners. arXiv preprint arXiv:2407.06190, 2024. 8

  55. [55]

    Yihong Xu, Loïck Chambon, Éloi Zablocki, Mickaël Chen, Alexandre Alahi, Matthieu Cord, and Patrick Pérez. Towards motion forecasting with real-world perception inputs: Are end-to-end approaches competitive? In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 18428–18435. IEEE, 2024. 2

  56. [56]

    Street gaussians: Modeling dynamic urban scenes with gaussian splatting

    Yunzhi Yan, Haotong Lin, Chenxu Zhou, Weijie Wang, Haiyang Sun, Kun Zhan, Xianpeng Lang, Xiaowei Zhou, and Sida Peng. Street gaussians: Modeling dynamic urban scenes with gaussian splatting. In European Conference on Computer Vision, pages 156–173. Springer, 2024. 2

  57. [57]

    Drivesuprim: Towards precise trajectory selection for end-to-end planning

    Wenhao Yao, Zhenxin Li, Shiyi Lan, Zi Wang, Xinglong Sun, Jose M Alvarez, and Zuxuan Wu. Drivesuprim: Towards precise trajectory selection for end-to-end planning. arXiv preprint arXiv:2506.06659, 2025. 3, 1

  58. [58]

    Roca: Robust cross-domain end-to-end autonomous driving

    Rajeev Yasarla, Shizhong Han, Hsin-Pai Cheng, Litian Liu, Shweta Mahajan, Apratim Bhattacharyya, Yunxiao Shi, Risheek Garrepalli, Hong Cai, and Fatih Porikli. Roca: Robust cross-domain end-to-end autonomous driving. arXiv preprint arXiv:2506.10145, 2025. 3

  59. [59]

    Drama: An efficient end-to-end motion planner for autonomous driving with mamba

    Chengran Yuan, Zhanqi Zhang, Jiawei Sun, Shuo Sun, Zefan Huang, Christina Dao Wen Lee, Dongen Li, Yuhang Han, Anthony Wong, Keng Peng Tee, et al. Drama: An efficient end-to-end motion planner for autonomous driving with mamba. arXiv preprint arXiv:2408.03601, 2024. 7

  60. [60]

    Diffe2e: Rethinking end-to-end driving with a hybrid action diffusion and supervised policy

    Rui Zhao, Yuze Fan, Ziguo Chen, Fei Gao, and Zhenhai Gao. Diffe2e: Rethinking end-to-end driving with a hybrid action diffusion and supervised policy. arXiv preprint arXiv:2505.19516, 2025. 2

  61. [61]

    Gaussianad: Gaussian-centric end-to-end autonomous driving

    Wenzhao Zheng, Junjie Wu, Yao Zheng, Sicheng Zuo, Zixun Xie, Longchao Yang, Yong Pan, Zhihui Hao, Peng Jia, Xianpeng Lang, et al. Gaussianad: Gaussian-centric end-to-end autonomous driving. arXiv preprint arXiv:2412.10371,

  62. [62]

    Diffusion-based planning for autonomous driving with flexible guidance

    Yinan Zheng, Ruiming Liang, Kexin ZHENG, Jinliang Zheng, Liyuan Mao, Jianxiong Li, Weihao Gu, Rui Ai, Shengbo Eben Li, Xianyuan Zhan, and Jingjing Liu. Diffusion-based planning for autonomous driving with flexible guidance. In Proceedings of the International Conference on Learning Representations, 2025. 3

    Supplementary Materials
    A. Parameters setup ...

  63. [63]

    Fused QKV Projection: In the self-attention mechanism, the separate linear layers for query (Q), key (K), and value (V) were replaced with a single, fused linear layer that computes all three projections in one operation. This reduces three separate matrix multiplica...

    Footnote 3: https://github.com/hustvl/DiffusionDrive/issues/57 as well as issues/45

  64. [64]

    Optimized MLP Block: The standard MLP block, which can be inefficient, was replaced by a dedicated MLP module. We also substituted the ReLU activation with GELU, a smoother activation function that is common in modern high-performance transformers and can lead to better convergence.

  65. [65]

    Efficient Tensor Reshaping: Throughout the model, especially in the attention mechanism and the CaRT module’s forward pass, tensor reshaping operations like .reshape() are now preceded by .contiguous(). This ensures the tensor is stored in a contiguous block of memory before the view operation, preventing potential performance penalties associated w...

  66. [66]

    Gradient Checkpointing: We introduced optional gradient checkpointing within the transformer blocks. During training, this technique trades a small amount of re-computation in the backward pass for a significant reduction in memory usage, allowing for larger batch sizes which can further improve training throughput.

  67. [67]

    In-place and Fused Operations: Smaller optimizations were made throughout the backbone, such as using inplace=True for ReLU activations in the FPN and removing biases from convolution and linear layers where they are followed by a normalization layer, which makes them redundant. Together, these structural and operational improvements result in a mor...
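
As a rough illustration of two of the points listed above (the fused QKV projection, and the redundancy of a bias placed ahead of a normalization layer), the following NumPy sketch may help. It is not the authors' implementation: NumPy stands in for the PyTorch code, the helper names (`separate_qkv`, `fused_qkv`, `normalize`) and the tensor sizes are hypothetical, and the normalization uses per-channel batch statistics (batch-norm style, without an affine term), which is the setting in which a per-channel bias cancels exactly.

```python
import numpy as np


def separate_qkv(x, w_q, w_k, w_v):
    """Baseline: three independent matrix multiplications."""
    return x @ w_q, x @ w_k, x @ w_v


def fused_qkv(x, w_qkv):
    """Fused variant: one matmul, then split the last axis into Q, K, V."""
    qkv = x @ w_qkv                      # shape: (tokens, 3 * dim)
    return np.split(qkv, 3, axis=-1)     # three arrays of shape (tokens, dim)


def normalize(x, eps=1e-5):
    """Per-channel normalization over the batch axis (assumed batch-norm style)."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


rng = np.random.default_rng(0)
tokens, dim = 8, 16                      # hypothetical sizes, for illustration only
x = rng.standard_normal((tokens, dim))
w_q, w_k, w_v = (rng.standard_normal((dim, dim)) for _ in range(3))

# Concatenate the three weight matrices column-wise once, ahead of time.
w_qkv = np.concatenate([w_q, w_k, w_v], axis=1)

q_ref, k_ref, v_ref = separate_qkv(x, w_q, w_k, w_v)
q, k, v = fused_qkv(x, w_qkv)
fused_matches = all(
    np.allclose(a, b) for a, b in [(q, q_ref), (k, k_ref), (v, v_ref)]
)

# A bias added before normalization shifts each channel's mean by the same
# amount, so it cancels when the mean is subtracted.
bias = rng.standard_normal(dim)
bias_is_redundant = np.allclose(normalize(x @ w_q + bias), normalize(x @ w_q))
```

Both flags are expected to hold up to floating-point tolerance. The fused form produces the same projections from a single larger matrix multiplication, and the bias check suggests why dropping the bias changes nothing when the following layer subtracts a per-channel mean. The same argument does not carry over automatically to normalizations that compute statistics across the feature axis.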