SpaceDrive: Infusing Spatial Awareness into VLM-based Autonomous Driving

Andreas Geiger; Andreas Zell; David Holtz; Hang Yu; Peizheng Li; Rui Song; Yutong Yang; Yuzhi Lai; Zhenghao Zhang

arxiv: 2512.10719 · v2 · pith:3IZVEE2Snew · submitted 2025-12-11 · 💻 cs.CV

Peizheng Li, Zhenghao Zhang, David Holtz, Hang Yu, Yutong Yang, Yuzhi Lai, Rui Song, Andreas Geiger, Andreas Zell

Pith reviewed 2026-05-22 12:13 UTC · model grok-4.3

classification 💻 cs.CV

keywords: autonomous driving, vision language models, positional encodings, spatial reasoning, trajectory planning, end-to-end driving, multi-view depth, nuScenes

The pith

Treating 3D coordinates as positional encodings instead of text digits lets VLMs jointly reason over semantics and space for driving plans.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current vision language models for autonomous driving have trouble grasping fine-grained 3D spatial relationships even though they excel at general visual understanding. SpaceDrive fixes this by converting 3D coordinates from depth estimation, ego history, and prompts into positional encodings that augment visual tokens and replace numerical text tokens for both inputs and outputs. The model can then index specific scene elements by their spatial position and regress full trajectories at once rather than token by token. A reader would care because autonomous systems must interact accurately with the physical world, and better spatial handling directly supports safer trajectory planning.

Core claim

SpaceDrive is a spatial-aware VLM-based driving framework that treats spatial information as explicit positional encodings rather than textual digit tokens. A universal positional encoder processes all 3D coordinates obtained from multi-view depth estimation, historical ego-states, and text prompts. These encodings are superimposed on the corresponding 2D visual tokens and simultaneously serve as a task-agnostic coordinate representation that the VLM uses for both input and output, enabling direct regression of trajectory coordinates and improved joint semantic-spatial reasoning.
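
The core claim depends on turning per-pixel depth into 3D coordinates before any encoding happens. A minimal sketch of that projection step follows, assuming an ideal pinhole camera with known intrinsics and a rigid camera-to-ego transform. The function names and parameters (`unproject_pixel`, `camera_to_ego`, `fx`, `fy`, `cx`, `cy`) are illustrative and are not taken from the paper or its code release.

```python
def unproject_pixel(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with metric depth into a camera-frame 3D point.

    Assumption: ideal pinhole model without lens distortion.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def camera_to_ego(point, rotation, translation):
    """Apply p_ego = R * p_cam + t, with R given as three rows of three floats."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


# Hypothetical intrinsics and extrinsics, chosen only for illustration.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
p_cam = unproject_pixel(700.0, 500.0, 10.0, fx=1000.0, fy=1000.0, cx=600.0, cy=400.0)
p_ego = camera_to_ego(p_cam, IDENTITY, (1.0, 2.0, 3.0))
```

In a surround-view setup each camera would carry its own intrinsics and extrinsics, so the same two functions would run once per view before the coordinates reach the encoder.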

What carries the argument

The universal positional encoder that converts 3D coordinates into positional encodings, superimposes them on visual tokens, and replaces digit-wise numerical tokens for VLM input and output.
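
The encoder can be pictured with the 3D sine-cosine formulation that this page quotes from the paper further down, where the channel budget is split as dx = dy = ⌈dim/3⌉ and dz = dim − dx − dy. The sketch below follows that split. The frequency base of 10000, the truncation used for odd channel counts, and the reading of "superimposed" as element-wise addition are assumptions, not details confirmed by the source.

```python
import math


def sincos_1d(value, d, base=10000.0):
    """Encode one scalar into d channels of interleaved sin/cos terms.

    Assumption: the transformer-style frequency schedule with base 10000.
    Odd d is handled by dropping the final cosine term.
    """
    out = []
    for k in range((d + 1) // 2):
        freq = 1.0 / (base ** (2 * k / max(d, 1)))
        out.append(math.sin(value * freq))
        out.append(math.cos(value * freq))
    return out[:d]


def encode_3d(coord, dim):
    """Concatenate per-axis encodings with dx = dy = ceil(dim/3), dz = dim - dx - dy."""
    x, y, z = coord
    dx = dy = math.ceil(dim / 3)
    dz = dim - dx - dy
    if dz < 0:
        raise ValueError("dim is too small for the three-way split")
    return sincos_1d(x, dx) + sincos_1d(y, dy) + sincos_1d(z, dz)


def superimpose(token, pe):
    """Add a positional encoding to a visual token of the same width (assumed additive)."""
    return [t + p for t, p in zip(token, pe)]


pe = encode_3d((0.0, 0.0, 0.0), 6)
augmented = superimpose([1.0] * 6, pe)
```

Because the same `encode_3d` call would serve depth-derived points, ego history, and prompt coordinates, every source lands in one shared embedding space, which is the property the "universal" label points at.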

If this is right

The VLM can index specific visual semantics by their spatial location during reasoning.
Trajectory coordinates are regressed directly instead of being assembled digit by digit.
Planning accuracy improves because the model avoids errors from numerical text parsing.
The same coordinate representation works across different driving tasks without task-specific tuning.
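
The contrast between digit-wise generation and direct regression can be made concrete with a toy example. This is a sketch under stated assumptions: `digit_tokens` uses character-level splitting as a stand-in for a real subword tokenizer, and `regress_waypoints` is a plain linear head, which is not necessarily the decoder the paper uses.

```python
def digit_tokens(waypoints):
    """Character-level stand-in for the text tokens of a trajectory string."""
    text = ", ".join(f"({x:+.2f}, {y:+.2f})" for x, y in waypoints)
    return list(text)


def regress_waypoints(hidden_states, weight, bias):
    """Map each hidden state to one (x, y) waypoint with a single linear head.

    weight holds two rows (one per output coordinate); all values are hypothetical.
    """
    waypoints = []
    for h in hidden_states:
        x = sum(w * v for w, v in zip(weight[0], h)) + bias[0]
        y = sum(w * v for w, v in zip(weight[1], h)) + bias[1]
        waypoints.append((x, y))
    return waypoints


# One waypoint costs many sequential decoding steps when spelled out as text...
n_steps_text = len(digit_tokens([(1.5, -0.25)]))
# ...but a single output position when regressed from a hidden state.
regressed = regress_waypoints([[1.0, 2.0]], [[1.0, 0.0], [0.0, 1.0]], [0.5, -0.5])
```

In the text path an error in any single digit changes the decoded number, whereas the regression path produces each coordinate as one continuous value.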

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same superposition technique could be tested on non-driving VLM tasks that require metric spatial output, such as visual question answering about object distances.
Replacing digit tokens with positional encodings may reduce the model's sensitivity to prompt phrasing that describes numbers.
If depth estimation quality is the main bottleneck, combining the encoder with stronger 3D perception backbones should produce measurable gains in closed-loop metrics.

Load-bearing premise

The multi-view depth estimation must produce 3D coordinates accurate enough that the derived positional encodings support reliable semantic-spatial reasoning without introducing new localization errors that degrade planning.

What would settle it

Hold the VLM and encoder fixed, inject increasing amounts of noise into the depth estimates, and measure the planning error at each level. If the planning error does not track the depth noise, the encodings are not carrying the expected spatial signal.
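
A test of this kind could be organized as a simple noise sweep. The harness below is a sketch, assuming additive Gaussian depth noise and a mean L2 waypoint error. `planner` stands for any callable that maps depths to waypoints, and `toy_planner` is a hypothetical stand-in used only to make the example runnable; it is not the SpaceDrive model.

```python
import math
import random


def perturb_depths(depths, sigma, rng):
    """Add zero-mean Gaussian noise to each depth and clamp at zero."""
    return [max(0.0, d + rng.gauss(0.0, sigma)) for d in depths]


def mean_l2_error(pred, target):
    """Average Euclidean distance between predicted and target waypoints."""
    dists = [math.hypot(px - tx, py - ty) for (px, py), (tx, ty) in zip(pred, target)]
    return sum(dists) / len(dists)


def noise_sweep(depths, target, planner, sigmas, seed=0):
    """Return planning error per noise level, with the planner held fixed."""
    rng = random.Random(seed)
    results = {}
    for sigma in sigmas:
        noisy = perturb_depths(depths, sigma, rng)
        results[sigma] = mean_l2_error(planner(noisy), target)
    return results


def toy_planner(depths):
    """Hypothetical planner: places one waypoint at each depth along the x axis."""
    return [(d, 0.0) for d in depths]


depths = [10.0, 20.0, 30.0]
target = toy_planner(depths)
errors = noise_sweep(depths, target, toy_planner, sigmas=[0.0, 1.0])
```

With a real planner in place of `toy_planner`, a flat error curve across noise levels would suggest the model is ignoring the depth-derived encodings.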

Figures

Figures reproduced from arXiv: 2512.10719 by Andreas Geiger, Andreas Zell, David Holtz, Hang Yu, Peizheng Li, Rui Song, Yutong Yang, Yuzhi Lai, Zhenghao Zhang.

**Figure 1.** Spatial awareness in VLM-based end-to-end autonomous driving. (a) Constrained by insufficient 3D pre-training and discrete token-wise encoding, existing end-to-end planners based on the VLM struggle to precisely ground, associate, and predict 3D spatial positions, limiting their planning capabilities. (b) Our proposed SpaceDrive planner introduces a unified 3D coordinate encoding to replace the original VL…

**Figure 2.** SpaceDrive framework. Beyond the base VLM, a frozen depth estimator predicts dense metric depths from surround-view images, which are projected into 3D coordinates and encoded by a universal PE encoder to augment visual tokens with spatial cues. BEV coordinates in text prompts are encoded by the same PE encoder, replacing the original coordinate tokens and preceded by the PE indicator ⟨IND⟩. At the output …

**Figure 3.** Qualitative results of closed-loop evaluation on Bench2Drive [25]. Green and pink dots represent path and speed waypoints, respectively. Red circles indicate cyclists ahead that the vehicle needs to avoid. Parameters such as speed and steering wheel angle can be found in the figures.
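
The prompt handling described in the Figure 2 caption, where BEV coordinates in text are swapped for encodings preceded by an indicator, can be sketched as a preprocessing pass. The `(x, y)` prompt format, the regular expression, and the ASCII placeholder `<IND>` are assumptions made for illustration; how the resulting encodings are spliced into the token sequence is not shown.

```python
import re

# Matches "(x, y)" pairs of signed decimal numbers; the prompt format is assumed.
COORD = re.compile(r"\(\s*([-+]?\d+(?:\.\d+)?)\s*,\s*([-+]?\d+(?:\.\d+)?)\s*\)")
IND = "<IND>"  # ASCII stand-in for the PE indicator token


def extract_coordinates(prompt):
    """Replace each coordinate pair with the indicator and return the pairs in order."""
    coords = [(float(m.group(1)), float(m.group(2))) for m in COORD.finditer(prompt)]
    return COORD.sub(IND, prompt), coords


text, coords = extract_coordinates("Ego was at (1.5, -2.0) then (3, 4.25).")
```

Each extracted pair would then go through the same positional encoder as the depth-derived points, so the language side and the vision side share one coordinate representation.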

Original abstract

End-to-end autonomous driving methods built on vision language models (VLMs) have undergone rapid development driven by their universal visual understanding and strong reasoning capabilities obtained from the large-scale pretraining. However, we find that current VLMs struggle to understand fine-grained 3D spatial relationships which is a fundamental requirement for systems interacting with the physical world. To address this issue, we propose SpaceDrive, a spatial-aware VLM-based driving framework that treats spatial information as explicit positional encodings (PEs) instead of textual digit tokens, enabling joint reasoning over semantic and spatial representations. SpaceDrive employs a universal positional encoder to all 3D coordinates derived from multi-view depth estimation, historical ego-states, and text prompts. These 3D PEs are first superimposed to augment the corresponding 2D visual tokens. Meanwhile, they serve as a task-agnostic coordinate representation, replacing the digit-wise numerical tokens as both inputs and outputs for the VLM. This mechanism enables the model to better index specific visual semantics in spatial reasoning and directly regress trajectory coordinates rather than generating digit-by-digit, thereby enhancing planning accuracy. Extensive experiments validate that SpaceDrive achieves state-of-the-art open-loop performance on the nuScenes dataset and the second-best Driving Score of 78.02 on the Bench2Drive closed-loop benchmark over existing VLM-based methods. Code is available at: https://github.com/zhenghao2519/SpaceDrive.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SpaceDrive adds 3D positional encodings to VLMs for driving, gets solid benchmark lifts on nuScenes and Bench2Drive, but the depth step needs closer checks.

SpaceDrive's main move is to run 3D coordinates from multi-view depth, ego history, and prompts through a universal positional encoder, then overlay those encodings on visual tokens while swapping out digit tokens for both input and output. This lets the VLM tie semantics to space more directly and regress trajectories without spelling numbers digit by digit. The abstract reports state-of-the-art open-loop numbers on nuScenes and a Driving Score of 78.02 on closed-loop Bench2Drive, second among VLM-based methods. That is measurable progress on the tasks they picked. The design stays modular on top of existing VLM backbones, which makes the change easy to adopt, and the code is released so others can reproduce it. The central claim holds up on the reported metrics without obvious circularity in the abstract.

The soft spot is the depth estimation step. Driving scenes routinely produce scale ambiguity, occlusions, and errors on moving objects, and any bias in those 3D coordinates would shift the positional encodings and could weaken the planning instead of strengthening it. The abstract gives no sensitivity tests, depth-error ablations, or error bars, so it is unclear how much the gains depend on clean coordinates versus the encoding trick.

This paper is aimed at groups building VLM-based planners who want to add geometry without rewriting the whole token pipeline. Readers focused on practical ways to inject spatial structure into pretrained models will get concrete value from the experiments. The idea is clear and the results are positive enough that it deserves a serious referee. I would send it out for peer review.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SpaceDrive, a VLM-based end-to-end autonomous driving framework that addresses limitations in fine-grained 3D spatial reasoning by deriving 3D coordinates via multi-view depth estimation, encoding them with a universal positional encoder, superimposing the resulting positional encodings on visual tokens, and replacing digit-wise numerical tokens with these encodings for both VLM inputs and outputs. This enables joint semantic-spatial reasoning and direct trajectory regression. The paper reports state-of-the-art open-loop performance on nuScenes and a second-best closed-loop Driving Score of 78.02 on Bench2Drive among VLM-based methods, with code released.

Significance. If the central mechanism proves robust, the work offers a concrete architectural route to improve spatial awareness in pretrained VLMs for driving without relying on textual digit representations. The public code release is a positive contribution to reproducibility in the field.

major comments (2)

[Methods (depth estimation and universal positional encoder)] Methods section on multi-view depth estimation and positional encoding: the central claim that superimposed 3D PEs enable reliable joint semantic-spatial reasoning without introducing new localization errors rests on the untested assumption that depth estimates remain sufficiently accurate under occlusions, scale ambiguity, and dynamic objects. No sensitivity analysis, depth-error propagation study, or ablation isolating the effect of depth inaccuracies on planning metrics is reported, which directly bears on whether the reported gains can be attributed to the proposed mechanism rather than other factors.
[Experiments (closed-loop benchmark)] Experiments section (Bench2Drive results): the Driving Score of 78.02 is presented as second-best without error bars, variance across runs, or controls for post-hoc coordinate-handling choices. This weakens the ability to assess whether the improvement is robust or sensitive to implementation details in the positional encoding pipeline.

minor comments (2)

[Abstract and Methods] Notation for the universal positional encoder is introduced in the abstract but would benefit from an explicit equation or diagram in the main text showing how 3D coordinates from depth, ego-states, and prompts are unified.
[Figures and Tables] Figure captions and tables should explicitly state whether reported metrics include standard deviations or are single-run results to aid interpretation of benchmark comparisons.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and detailed comments, which help us improve the clarity and rigor of the manuscript. We address each major comment below and outline the revisions we will make.

Referee: [Methods (depth estimation and universal positional encoder)] Methods section on multi-view depth estimation and positional encoding: the central claim that superimposed 3D PEs enable reliable joint semantic-spatial reasoning without introducing new localization errors rests on the untested assumption that depth estimates remain sufficiently accurate under occlusions, scale ambiguity, and dynamic objects. No sensitivity analysis, depth-error propagation study, or ablation isolating the effect of depth inaccuracies on planning metrics is reported, which directly bears on whether the reported gains can be attributed to the proposed mechanism rather than other factors.

Authors: We agree that a direct analysis of depth estimation errors is valuable for substantiating the robustness of the proposed mechanism. The multi-view depth estimation in SpaceDrive is combined with a universal positional encoder that maps coordinates into a shared embedding space, which is designed to reduce sensitivity to per-view scale and occlusion issues. Existing ablations in the manuscript already isolate the contribution of the 3D positional encodings by comparing against variants without them, showing consistent gains in both semantic and planning metrics. Nevertheless, we acknowledge the absence of an explicit sensitivity study. In the revised manuscript we will add a new experiment that injects controlled noise into the depth estimates at different levels and measures the resulting degradation in open-loop trajectory regression and closed-loop Driving Score.
revision: partial
Referee: [Experiments (closed-loop benchmark)] Experiments section (Bench2Drive results): the Driving Score of 78.02 is presented as second-best without error bars, variance across runs, or controls for post-hoc coordinate-handling choices. This weakens the ability to assess whether the improvement is robust or sensitive to implementation details in the positional encoding pipeline.

Authors: We appreciate the referee's point on statistical robustness. The reported Driving Score follows the single-run protocol used by prior VLM-based methods on Bench2Drive. To strengthen the claim, we will rerun the closed-loop evaluation with multiple random seeds, report mean and standard deviation for the Driving Score and auxiliary metrics, and add a short paragraph clarifying that the coordinate-handling pipeline uses only the deterministic universal positional encoder with no post-hoc adjustments.
revision: yes
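
The multi-seed reporting promised in this response amounts to a small aggregation step. A minimal sketch, assuming one Driving Score per seed and the sample standard deviation; the input values are synthetic placeholders and are not results from the paper.

```python
import statistics


def summarize_runs(scores):
    """Return (mean, sample standard deviation) over per-seed scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return statistics.mean(scores), statistics.stdev(scores)


# Synthetic placeholder scores, used only to show the call.
mean_score, std_score = summarize_runs([1.0, 2.0, 3.0])
```

Reporting the pair as mean ± standard deviation alongside the number of seeds would let readers judge whether a ranking difference on the benchmark exceeds run-to-run variation.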

Circularity Check

0 steps flagged

No significant circularity: architectural proposal validated on external benchmarks

The paper introduces SpaceDrive as a new VLM architecture that derives 3D coordinates via multi-view depth estimation, encodes them as universal positional encodings, superimposes them on visual tokens, and uses them to replace digit tokens for input/output. This is presented as an explicit design choice to enable joint semantic-spatial reasoning. The central claims of SOTA open-loop performance on nuScenes and second-best closed-loop Driving Score on Bench2Drive are supported by empirical results on standard external datasets and benchmarks. No load-bearing step reduces by construction to a fitted parameter, self-defined quantity, or self-citation chain; the derivation remains self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The approach depends on the accuracy of upstream multi-view depth estimation and the assumption that positional encoding superposition preserves semantic information while adding spatial structure.

axioms (1)

domain assumption: Multi-view depth estimation yields sufficiently accurate 3D coordinates for the positional encodings to be useful.
Invoked when deriving 3D PEs from multi-view depth estimation in the framework description.

invented entities (1)

universal positional encoder · no independent evidence
purpose: To convert 3D coordinates into task-agnostic positional encodings usable across visual tokens and VLM inputs/outputs.
New component introduced to handle spatial information explicitly.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking (D=3 forcing) unclear

we opt for a 3D sine-cosine positional encoding extending the standard 1D formulation dimension-wise: $\phi(c_p) = [\phi_x(x^{3D}_p), \phi_y(y^{3D}_p), \phi_z(z^{3D}_p)]$ … $d_x = d_y = \lceil \mathrm{dim}/3 \rceil$, $d_z = \mathrm{dim} - d_x - d_y$
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel (J-cost uniqueness) unclear

These 3D PEs are first superimposed to augment the corresponding 2D visual tokens … replacing the digit-wise numerical tokens as both inputs and outputs

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Distill to Think, Foresee to Act: Cognitive-Physical Reinforcement Learning for Autonomous Driving
cs.CV · 2026-05 · unverdicted · novelty 6.0

CoPhy distills VLM knowledge into a BEV encoder and uses an action-conditioned auto-regressive BEV world model inside GRPO with dual physical-cognitive rewards to reach SOTA on NAVSIM v1/v2 while adding language-based...

ST-Prune: Training-Free Spatio-Temporal Token Pruning for Vision-Language Models in Autonomous Driving
cs.CV · 2026-04 · unverdicted · novelty 6.0

ST-Prune is a training-free spatio-temporal token pruning framework for VLMs in autonomous driving that achieves near-lossless results at 90% token reduction by exploiting motion volatility, temporal recency, and mult...

EponaV2: Driving World Model with Comprehensive Future Reasoning
cs.CV · 2026-05 · unverdicted · novelty 5.0

EponaV2 advances perception-free driving world models by forecasting comprehensive future 3D geometry and semantic representations, achieving SOTA planning performance on NAVSIM benchmarks.

XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments
cs.CV · 2026-04 · unverdicted · novelty 4.0

XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial...

Reference graph

Works this paper leans on

84 extracted references · 84 canonical work pages · cited by 4 Pith papers · 11 internal anchors

[1] Hidehisa Arai, Keita Miwa, Kento Sasaki, Kohei Watanabe, Yu Yamaguchi, Shunsuke Aoki, and Issei Yamamoto. Covla: Comprehensive vision-language-action dataset for autonomous driving. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2025.
[2] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025.
[3] Zhipeng Bao and Qianwen Li. Large language model-assisted autonomous vehicle recovery from immobilization. arXiv preprint arXiv:2510.26023, 2025.
[4] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020.
[5] Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
[6] Shaoyu Chen, Bo Jiang, Hao Gao, Bencheng Liao, Qing Xu, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. Vadv2: End-to-end vectorized autonomous driving via probabilistic planning. arXiv preprint arXiv:2402.13243.
[7] Xuesong Chen, Linjiang Huang, Tao Ma, Rongyao Fang, Shaoshuai Shi, and Hongsheng Li. Solve: Synergy of language-vision and end-to-end networks for autonomous driving. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025.
[8] An-Chieh Cheng, Yang Fu, Yukang Chen, Zhijian Liu, Xiaolong Li, Subhashree Radhakrishnan, Song Han, Yao Lu, Jan Kautz, Pavlo Molchanov, et al. 3d aware region prompted vision language model. arXiv preprint arXiv:2509.13317.
[9] Shuxiao Ding, Yutong Yang, Julian Wiederer, Markus Braun, Peizheng Li, Juergen Gall, and Bin Yang. Tqd-track: Temporal query denoising for 3d multi-object tracking. arXiv preprint arXiv:2504.03258, 2025.
[10] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. Carla: An open urban driving simulator. In Conference on robot learning, 2017.
[11] Xiang Fei, Jinghui Lu, Qi Sun, Hao Feng, Yanjie Wang, Wei Shi, An-Lan Wang, Jingqun Tang, and Can Huang. Advancing sequential numerical prediction in autoregressive models. arXiv preprint arXiv:2505.13077, 2025.
[12] Haoyu Fu, Diankun Zhang, Zongchuang Zhao, Jianfeng Cui, Dingkang Liang, Chong Zhang, Dingyuan Zhang, Hongwei Xie, Bing Wang, and Xiang Bai. Orion: A holistic end-to-end autonomous driving framework by vision-language instructed action generation. arXiv preprint arXiv:2503.19755, 2025.
[13] Rao Fu, Jingyu Liu, Xilun Chen, Yixin Nie, and Wenhan Xiong. Scene-llm: Extending language model for 3d visual understanding and reasoning. arXiv preprint arXiv:2403.11401, 2024.
[14] Mingzhe Guo, Zhipeng Zhang, Yuan He, Ke Wang, Liping Jing, and Haibin Ling. End-to-end autonomous driving without costly modularization and 3d manual annotation. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[15] Ziang Guo and Zufeng Zhang. Vdrive: Leveraging reinforced vla and diffusion policy for end-to-end autonomous driving. arXiv preprint arXiv:2510.15446, 2025.
[16] Shadi Hamdan, Chonghao Sima, Zetong Yang, Hongyang Li, and Fatma Guney. Eta: Efficiency through thinking ahead, a dual approach to self-driving with large models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025.
[17] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 2022.
[18] Shengchao Hu, Li Chen, Penghao Wu, Hongyang Li, Junchi Yan, and Dacheng Tao. St-p3: End-to-end vision-based autonomous driving via spatial-temporal feature learning. In European Conference on Computer Vision, 2022.
[19] Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023.
[20] Yi Huang, Lihui Jiang, Bingbing Liu, Hongbo Zhang, et al. Prioritizing perception-guided self-supervision: A new paradigm for causal modeling in end-to-end autonomous driving. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
[21] Zhijian Huang, Tao Tang, Shaoxiang Chen, Sihao Lin, Zequn Jie, Lin Ma, Guangrun Wang, and Xiaodan Liang. Making large language models better planners with reasoning-decision alignment. In European Conference on Computer Vision, 2024.
[22] Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, et al. Emma: End-to-end multimodal model for autonomous driving. arXiv preprint arXiv:2410.23262.
[23] Xiaosong Jia, Yulu Gao, Li Chen, Junchi Yan, Patrick Langechuan Liu, and Hongyang Li. Driveadapter: Breaking the coupling barrier of perception and planning in end-to-end autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
[24] Xiaosong Jia, Penghao Wu, Li Chen, Jiangwei Xie, Conghui He, Junchi Yan, and Hongyang Li. Think twice before driving: Towards scalable decoders for end-to-end autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
[25] Xiaosong Jia, Zhenjie Yang, Qifeng Li, Zhiyuan Zhang, and Junchi Yan. Bench2drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving. Advances in Neural Information Processing Systems, 2024.
[26] Xiaosong Jia, Junqi You, Zhiyuan Zhang, and Junchi Yan. Drivetransformer: Unified transformer for scalable end-to-end autonomous driving. In International Conference on Learning Representations (ICLR), 2025.
[27] Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. Vad: Vectorized scene representation for efficient autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023.
[28] Bo Jiang, Shaoyu Chen, Bencheng Liao, Xingyu Zhang, Wei Yin, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. Senna: Bridging large vision-language models and end-to-end autonomous driving. arXiv preprint arXiv:2410.22313.
[29] Guolin Ke, Di He, and Tie-Yan Liu. Rethinking positional encoding in language pre-training. In International Conference on Learning Representations, 2021.
[30] Fanjie Kong, Yitong Li, Weihuang Chen, Chen Min, Yizhe Li, Zhiqiang Gao, Haoyang Li, Zhongyu Guo, and Hongbin Sun. Vlr-driver: Large vision-language-reasoning models for embodied autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
[31] Yuzhi Lai, Shenghai Yuan, Peizheng Li, Jun Lou, and Andreas Zell. Seer-var: Semantic egocentric environment reasoner for vehicle augmented reality. arXiv preprint arXiv:2508.17255.
[32] Yuzhi Lai, Shenghai Yuan, Boya Zhang, Benjamin Kiefer, Peizheng Li, Tianchen Deng, and Andreas Zell. Fam-hri: Foundation-model assisted multi-modal human-robot interaction combining gaze and speech. arXiv preprint arXiv:2503.16492, 2025.
[33] Peizheng Li, Shuxiao Ding, Xieyuanli Chen, Niklas Hanselmann, Marius Cordts, and Juergen Gall. Powerbev: A powerful yet lightweight framework for instance prediction in bird’s-eye view. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI-23.
[34] Peizheng Li, Shuxiao Ding, You Zhou, Qingwen Zhang, Onat Inak, Larissa Triess, Niklas Hanselmann, Marius Cordts, and Andreas Zell. Ago: Adaptive grounding for open world 3d occupancy prediction. In Proceedings of the IEEE/CVF international conference on computer vision, 2025.
[35] Yingyan Li, Yuqi Wang, Yang Liu, Jiawei He, Lue Fan, and Zhaoxiang Zhang. End-to-end driving with online trajectory evaluation via bev world model. arXiv preprint arXiv:2504.01941, 2025.
[36] Yun Li, Yiming Zhang, Tao Lin, XiangRui Liu, Wenxiao Cai, Zheng Liu, and Bo Zhao. Sti-bench: Are mllms ready for precise spatial-temporal world understanding? arXiv preprint arXiv:2503.23765, 2025.
[37] Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Qiao Yu, and Jifeng Dai. Bevformer: learning bird’s-eye-view representation from lidar-camera via spatiotemporal transformers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
[38]

Zhiqi Li, Zhiding Yu, Shiyi Lan, Jiahan Li, Jan Kautz, Tong Lu, and Jose M Alvarez. Is ego status all you need for open-loop end-to-end autonomous driving? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
[39]

Zhenxin Li, Shihao Wang, Shiyi Lan, Zhiding Yu, Zuxuan Wu, and Jose M. Alvarez. Hydra-next: Robust closed-loop driving with open-loop training. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025.
[40]

Bencheng Liao, Shaoyu Chen, Haoran Yin, Bo Jiang, Cheng Wang, Sixu Yan, Xinbang Zhang, Xiangyu Li, Ying Zhang, Qian Zhang, et al. Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025.
[41]

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 2023.
[42]

Haochen Liu, Tianyu Li, Haohan Yang, Li Chen, Caojun Wang, Ke Guo, Haochen Tian, Hongchen Li, Hongyang Li, and Chen Lv. Reinforced refinement with self-aware expansion for end-to-end autonomous driving. arXiv preprint arXiv:2506.09800, 2025.
[43]

Shuai Liu, Quanmin Liang, Zefeng Li, Boyang Li, and Kai Huang. Gaussianfusion: Gaussian-based multi-sensor fusion for end-to-end autonomous driving. arXiv preprint arXiv:2506.00034, 2025.
[44]

Wei Liu, Jiyuan Zhang, Binxiong Zheng, Yufeng Hu, Yingzhan Lin, and Zengfeng Zeng. X-driver: Explainable autonomous driving with vision-language models. arXiv preprint arXiv:2505.05098, 2025.
[45]

Yingfei Liu, Tiancai Wang, Xiangyu Zhang, and Jian Sun. Petr: Position embedding transformation for multi-view 3d object detection. In European Conference on Computer Vision. Springer, 2022.
[46]

Yuhang Lu, Jiadong Tu, Yuexin Ma, and Xinge Zhu. Real-ad: Towards human-like reasoning in end-to-end autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025.
[47]

Ming Nie, Renyuan Peng, Chunwei Wang, Xinyue Cai, Jianhua Han, Hang Xu, and Li Zhang. Reason2drive: Towards interpretable and chain-based reasoning for autonomous driving. In European Conference on Computer Vision. Springer,
[48]

Chenbin Pan, Burhaneddin Yaman, Tommaso Nesti, Abhirup Mallik, Alessandro G Allievi, Senem Velipasalar, and Liu Ren. Vlp: Vision language planning for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
[49]

Luigi Piccinelli, Christos Sakaridis, Yung-Hsu Yang, Mattia Segu, Siyuan Li, Wim Abbeloos, and Luc Van Gool. Unidepthv2: Universal monocular metric depth estimation made simpler. arXiv preprint arXiv:2502.20110, 2025.
[50]

Katrin Renz, Long Chen, Elahe Arani, and Oleg Sinavski. Simlingo: Vision-only closed-loop autonomous driving with language-action alignment. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025.
[51]

Shuyao Shang, Yuntao Chen, Yuqi Wang, Yingyan Li, and Zhaoxiang Zhang. Drivedpo: Policy learning via safety dpo for end-to-end autonomous driving. arXiv preprint arXiv:2509.17940, 2025.
[52]

Hao Shao, Yuxuan Hu, Letian Wang, Guanglu Song, Steven L Waslander, Yu Liu, and Hongsheng Li. Lmdrive: Closed-loop end-to-end driving with large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
[53]

Yinzhe Shen, Omer Sahin Tas, Kaiwen Wang, Royden Wagner, and Christoph Stiller. Divide and merge: Motion and semantic learning in end-to-end autonomous driving. arXiv preprint arXiv:2502.07631, 2025.
[54]

Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. Drivelm: Driving with graph visual question answering. In European Conference on Computer Vision, 2024.
[55]

Ziying Song, Caiyan Jia, Lin Liu, Hongyu Pan, Yongchang Zhang, Junming Wang, Xingyu Zhang, Shaoqing Xu, Lei Yang, and Yadan Luo. Don’t shake the wheel: Momentum-aware planning in end-to-end autonomous driving. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025.
[56]

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 2024.
[57]

Wenchao Sun, Xuewu Lin, Yining Shi, Chuang Zhang, Haoran Wu, and Sifa Zheng. Sparsedrive: End-to-end autonomous driving via sparse scene representation. In 2025 IEEE International Conference on Robotics and Automation (ICRA), 2025.
[58]

Yingqi Tang, Zhuoran Xu, Zhaotie Meng, and Erkang Cheng. Hip-ad: Hierarchical and multi-granularity planning with deformable attention for autonomous driving in a single decoder. arXiv preprint arXiv:2503.08612, 2025.
[59]

Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Yang Wang, Zhiyong Zhao, Kun Zhan, Peng Jia, XianPeng Lang, and Hang Zhao. Drivevlm: The convergence of autonomous driving and large vision-language models. In Conference on Robot Learning, 2025.
[60]

Chi Wan, Yixin Cui, Jiatong Du, Shuo Yang, Yulong Bai, Peng Yi, Nan Li, and Yanjun Huang. Geminus: Dual-aware global and scene-adaptive mixture-of-experts for end-to-end autonomous driving. arXiv preprint arXiv:2507.14456, 2025.
[61]

Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, and Jose M Alvarez. Omnidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025.
[62]

Yue Wang, Vitor Campagnolo Guizilini, Tianyuan Zhang, Yilun Wang, Hang Zhao, and Justin Solomon. Detr3d: 3d object detection from multi-view images via 3d-to-2d queries. In Conference on Robot Learning, 2022.
[63]

Yuqi Wang, Jiawei He, Lue Fan, Hongxin Li, Yuntao Chen, and Zhaoxiang Zhang. Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
[64]

Xinshuo Weng, Boris Ivanovic, Yan Wang, Yue Wang, and Marco Pavone. Para-drive: Parallelized architecture for real-time autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
[65]

Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence. arXiv preprint arXiv:2505.23747, 2025.
[66]

Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kwan-Yee K Wong, Zhenguo Li, and Hengshuang Zhao. Drivegpt4: Interpretable end-to-end autonomous driving via large language model. IEEE Robotics and Automation Letters,
[67]

Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025.
[68]

Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything v2. Advances in Neural Information Processing Systems,
[69]

Zhenjie Yang, Yilin Chai, Xiaosong Jia, Qifeng Li, Yuqian Shao, Xuekai Zhu, Haisheng Su, and Junchi Yan. Drivemoe: Mixture-of-experts for vision-language-action model in end-to-end autonomous driving. arXiv preprint arXiv:2505.16278,
[70]

Zhenjie Yang, Xiaosong Jia, Qifeng Li, Xue Yang, Maoqing Yao, and Junchi Yan. Raw2drive: Reinforcement learning with aligned world models for end-to-end autonomous driving (in carla v2). arXiv preprint arXiv:2505.16394, 2025.
[71]

Hang Yu, Julian Jordan, Julian Schmidt, Silvan Lindner, Alessandro Canevaro, and Wilhelm Stork. Hype: Hybrid planning with ego proposal-conditioned predictions. arXiv preprint arXiv:2510.12733, 2025.
[72]

Haibao Yu, Wenxian Yang, Ruiyang Hao, Chuanye Wang, Jiaru Zhong, Ping Luo, and Zaiqing Nie. Drivee2e: Closed-loop benchmark for end-to-end autonomous driving through real-to-simulation. arXiv preprint arXiv:2509.23922, 2025.
[73]

Jianhao Yuan, Shuyang Sun, Daniel Omeiza, Bo Zhao, Paul Newman, Lars Kunze, and Matthew Gadd. Rag-driver: Generalisable driving explanations with retrieval-augmented in-context multi-modal large language model learning. In Robotics: Science and Systems, 2024.
[74]

Jiang-Tian Zhai, Ze Feng, Jinhao Du, Yongqiang Mao, Jiang-Jiang Liu, Zichang Tan, Yifu Zhang, Xiaoqing Ye, and Jingdong Wang. Rethinking the open-loop evaluation of end-to-end autonomous driving in nuscenes. arXiv preprint arXiv:2305.10430, 2023.
[75]

Bozhou Zhang, Nan Song, Xiatian Zhu, Jiankang Deng, Li Zhang, et al. Future-aware end-to-end driving: Bidirectional modeling of trajectory planning and scene evolution. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
[76]

Qingwen Zhang, Yi Yang, Peizheng Li, Olov Andersson, and Patric Jensfelt. Seflow: A self-supervised scene flow method in autonomous driving. In European Conference on Computer Vision. Springer, 2024.
[77]

Wei Zhang, Pengfei Li, Junli Wang, Bingchuan Sun, Qihao Jin, Guangjun Bao, Shibo Rui, Yang Yu, Wenchao Ding, Peng Li, et al. Dual-aeb: Synergizing rule-based and multimodal large language models for effective emergency braking. In 2025 IEEE International Conference on Robotics and Automation (ICRA), 2025.
[78]

Zhiyuan Zhang, Xiaofan Li, Zhihao Xu, Wenjie Peng, Zijian Zhou, Miaojing Shi, and Shuangping Huang. Mpdrive: Improving spatial understanding with marker-based prompt learning for autonomous driving. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025.
[79]

Duo Zheng, Shijia Huang, and Liwei Wang. Video-3d llm: Learning position-aware video representation for 3d scene understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025.
[80]

Wenzhao Zheng, Ruiqi Song, Xianda Guo, Chenming Zhang, and Long Chen. Genad: Generative end-to-end autonomous driving. In European Conference on Computer Vision, 2024.



[31] [31]

Seer-var: Semantic egocentric environment reasoner for vehicle augmented reality.arXiv preprint arXiv:2508.17255,

Yuzhi Lai, Shenghai Yuan, Peizheng Li, Jun Lou, and Andreas Zell. Seer-var: Semantic egocentric environment reasoner for vehicle augmented reality.arXiv preprint arXiv:2508.17255,

work page arXiv

[32] [32]

FAM-HRI: Foundation-Model Assisted Multi-Modal Human-Robot Interaction Combining Gaze and Speech

Yuzhi Lai, Shenghai Yuan, Boya Zhang, Benjamin Kiefer, Peizheng Li, Tianchen Deng, and Andreas Zell. Fam- hri: Foundation-model assisted multi-modal human-robot interaction combining gaze and speech.arXiv preprint arXiv:2503.16492, 2025. 3

work page internal anchor Pith review Pith/arXiv arXiv 2025

[33] [33]

Powerbev: A pow- erful yet lightweight framework for instance prediction in bird’s-eye view

Peizheng Li, Shuxiao Ding, Xieyuanli Chen, Niklas Hansel- mann, Marius Cordts, and Juergen Gall. Powerbev: A pow- erful yet lightweight framework for instance prediction in bird’s-eye view. InProceedings of the Thirty-Second Interna- tional Joint Conference on Artificial Intelligence, IJCAI-23,

work page

[34] [34]

Ago: Adaptive grounding for open world 3d occupancy prediction

Peizheng Li, Shuxiao Ding, You Zhou, Qingwen Zhang, Onat Inak, Larissa Triess, Niklas Hanselmann, Marius Cordts, and Andreas Zell. Ago: Adaptive grounding for open world 3d occupancy prediction. InProceedings of the IEEE/CVF international conference on computer vision, 2025. 2

work page 2025

[35] [35]

turn left

Yingyan Li, Yuqi Wang, Yang Liu, Jiawei He, Lue Fan, and Zhaoxiang Zhang. End-to-end driving with online tra- jectory evaluation via bev world model.arXiv preprint arXiv:2504.01941, 2025. 2, 6, 4

work page arXiv 2025

[36] [36]

Sti-bench: Are mllms ready for precise spatial-temporal world understanding?arXiv preprint arXiv:2503.23765, 2025

Yun Li, Yiming Zhang, Tao Lin, XiangRui Liu, Wenxiao Cai, Zheng Liu, and Bo Zhao. Sti-bench: Are mllms ready for precise spatial-temporal world understanding?arXiv preprint arXiv:2503.23765, 2025. 3

work page arXiv 2025

[37] [37]

Bevformer: learning bird’s-eye-view representation from lidar-camera via spatiotemporal transformers.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chong- hao Sima, Tong Lu, Qiao Yu, and Jifeng Dai. Bevformer: learning bird’s-eye-view representation from lidar-camera via spatiotemporal transformers.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. 2

work page 2024

[38] [38]

Is ego status all you need for open- loop end-to-end autonomous driving? InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

Zhiqi Li, Zhiding Yu, Shiyi Lan, Jiahan Li, Jan Kautz, Tong Lu, and Jose M Alvarez. Is ego status all you need for open- loop end-to-end autonomous driving? InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024. 2, 5, 6, 1

work page 2024

[39] [39]

Zhenxin Li, Shihao Wang, Shiyi Lan, Zhiding Yu, Zuxuan Wu, and Jose M. Alvarez. Hydra-next: Robust closed-loop driving with open-loop training. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025. 2, 4

work page 2025

[40] [40]

Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving

Bencheng Liao, Shaoyu Chen, Haoran Yin, Bo Jiang, Cheng Wang, Sixu Yan, Xinbang Zhang, Xiangyu Li, Ying Zhang, Qian Zhang, et al. Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving. InProceedings of the Computer Vision and Pattern Recognition Conference, 2025. 2, 3, 4

work page 2025

[41] [41]

Visual instruction tuning.Advances in neural information processing systems, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 2023. 4, 1

work page 2023

[42] [42]

Reinforced refinement with self-aware ex- pansion for end-to-end autonomous driving.arXiv preprint arXiv:2506.09800, 2025

Haochen Liu, Tianyu Li, Haohan Yang, Li Chen, Caojun Wang, Ke Guo, Haochen Tian, Hongchen Li, Hongyang Li, and Chen Lv. Reinforced refinement with self-aware ex- pansion for end-to-end autonomous driving.arXiv preprint arXiv:2506.09800, 2025. 4

work page arXiv 2025

[43] [43]

Gaussianfusion: Gaussian-based multi-sensor fu- sion for end-to-end autonomous driving.arXiv preprint arXiv:2506.00034, 2025

Shuai Liu, Quanmin Liang, Zefeng Li, Boyang Li, and Kai Huang. Gaussianfusion: Gaussian-based multi-sensor fu- sion for end-to-end autonomous driving.arXiv preprint arXiv:2506.00034, 2025. 2, 4 10

work page arXiv 2025

[44] [44]

X-driver: Explainable autonomous driving with vision-language models.arXiv preprint arXiv:2505.05098, 2025

Wei Liu, Jiyuan Zhang, Binxiong Zheng, Yufeng Hu, Yingzhan Lin, and Zengfeng Zeng. X-driver: Explainable autonomous driving with vision-language models.arXiv preprint arXiv:2505.05098, 2025. 6, 4

work page arXiv 2025

[45] [45]

Petr: Position embedding transformation for multi-view 3d object detection

Yingfei Liu, Tiancai Wang, Xiangyu Zhang, and Jian Sun. Petr: Position embedding transformation for multi-view 3d object detection. InEuropean conference on computer vision. Springer, 2022. 2

work page 2022

[46] [46]

Real- ad: Towards human-like reasoning in end-to-end autonomous driving

Yuhang Lu, Jiadong Tu, Yuexin Ma, and Xinge Zhu. Real- ad: Towards human-like reasoning in end-to-end autonomous driving. InProceedings of the IEEE/CVF International Con- ference on Computer Vision, 2025. 6, 3, 4

work page 2025

[47] [47]

Reason2drive: Towards interpretable and chain-based reasoning for autonomous driv- ing

Ming Nie, Renyuan Peng, Chunwei Wang, Xinyue Cai, Jian- hua Han, Hang Xu, and Li Zhang. Reason2drive: Towards interpretable and chain-based reasoning for autonomous driv- ing. InEuropean Conference on Computer Vision. Springer,

work page

[48] Chenbin Pan, Burhaneddin Yaman, Tommaso Nesti, Abhirup Mallik, Alessandro G Allievi, Senem Velipasalar, and Liu Ren. Vlp: Vision language planning for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.

[49] Luigi Piccinelli, Christos Sakaridis, Yung-Hsu Yang, Mattia Segu, Siyuan Li, Wim Abbeloos, and Luc Van Gool. Unidepthv2: Universal monocular metric depth estimation made simpler. arXiv preprint arXiv:2502.20110, 2025.

[50] Katrin Renz, Long Chen, Elahe Arani, and Oleg Sinavski. Simlingo: Vision-only closed-loop autonomous driving with language-action alignment. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025.

[51] Shuyao Shang, Yuntao Chen, Yuqi Wang, Yingyan Li, and Zhaoxiang Zhang. Drivedpo: Policy learning via safety dpo for end-to-end autonomous driving. arXiv preprint arXiv:2509.17940, 2025.

[52] Hao Shao, Yuxuan Hu, Letian Wang, Guanglu Song, Steven L Waslander, Yu Liu, and Hongsheng Li. Lmdrive: Closed-loop end-to-end driving with large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.

[53] Yinzhe Shen, Omer Sahin Tas, Kaiwen Wang, Royden Wagner, and Christoph Stiller. Divide and merge: Motion and semantic learning in end-to-end autonomous driving. arXiv preprint arXiv:2502.07631, 2025.

[54] Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. Drivelm: Driving with graph visual question answering. In European Conference on Computer Vision, 2024.

[55] Ziying Song, Caiyan Jia, Lin Liu, Hongyu Pan, Yongchang Zhang, Junming Wang, Xingyu Zhang, Shaoqing Xu, Lei Yang, and Yadan Luo. Don’t shake the wheel: Momentum-aware planning in end-to-end autonomous driving. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025.

[56] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 2024.

[57] Wenchao Sun, Xuewu Lin, Yining Shi, Chuang Zhang, Haoran Wu, and Sifa Zheng. Sparsedrive: End-to-end autonomous driving via sparse scene representation. In 2025 IEEE International Conference on Robotics and Automation (ICRA), 2025.

[58] Yingqi Tang, Zhuoran Xu, Zhaotie Meng, and Erkang Cheng. Hip-ad: Hierarchical and multi-granularity planning with deformable attention for autonomous driving in a single decoder. arXiv preprint arXiv:2503.08612, 2025.

[59] Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Yang Wang, Zhiyong Zhao, Kun Zhan, Peng Jia, XianPeng Lang, and Hang Zhao. Drivevlm: The convergence of autonomous driving and large vision-language models. In Conference on Robot Learning, 2025.

[60] Chi Wan, Yixin Cui, Jiatong Du, Shuo Yang, Yulong Bai, Peng Yi, Nan Li, and Yanjun Huang. Geminus: Dual-aware global and scene-adaptive mixture-of-experts for end-to-end autonomous driving. arXiv preprint arXiv:2507.14456, 2025.

[61] Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, and Jose M Alvarez. Omnidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025.

[62] Yue Wang, Vitor Campagnolo Guizilini, Tianyuan Zhang, Yilun Wang, Hang Zhao, and Justin Solomon. Detr3d: 3d object detection from multi-view images via 3d-to-2d queries. In Conference on Robot Learning, 2022.

[63] Yuqi Wang, Jiawei He, Lue Fan, Hongxin Li, Yuntao Chen, and Zhaoxiang Zhang. Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.

[64] Xinshuo Weng, Boris Ivanovic, Yan Wang, Yue Wang, and Marco Pavone. Para-drive: Parallelized architecture for real-time autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,

[65] Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence. arXiv preprint arXiv:2505.23747, 2025.

[66] Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kwan-Yee K Wong, Zhenguo Li, and Hengshuang Zhao. Drivegpt4: Interpretable end-to-end autonomous driving via large language model. IEEE Robotics and Automation Letters,

[67] Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025.

[68] Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything v2. Advances in Neural Information Processing Systems,

[69] Zhenjie Yang, Yilin Chai, Xiaosong Jia, Qifeng Li, Yuqian Shao, Xuekai Zhu, Haisheng Su, and Junchi Yan. Drivemoe: Mixture-of-experts for vision-language-action model in end-to-end autonomous driving. arXiv preprint arXiv:2505.16278,

[70] Zhenjie Yang, Xiaosong Jia, Qifeng Li, Xue Yang, Maoqing Yao, and Junchi Yan. Raw2drive: Reinforcement learning with aligned world models for end-to-end autonomous driving (in carla v2). arXiv preprint arXiv:2505.16394, 2025.

[71] Hang Yu, Julian Jordan, Julian Schmidt, Silvan Lindner, Alessandro Canevaro, and Wilhelm Stork. Hype: Hybrid planning with ego proposal-conditioned predictions. arXiv preprint arXiv:2510.12733, 2025.

[72] Haibao Yu, Wenxian Yang, Ruiyang Hao, Chuanye Wang, Jiaru Zhong, Ping Luo, and Zaiqing Nie. Drivee2e: Closed-loop benchmark for end-to-end autonomous driving through real-to-simulation. arXiv preprint arXiv:2509.23922, 2025.

[73] Jianhao Yuan, Shuyang Sun, Daniel Omeiza, Bo Zhao, Paul Newman, Lars Kunze, and Matthew Gadd. Rag-driver: Generalisable driving explanations with retrieval-augmented in-context multi-modal large language model learning. In Robotics: Science and Systems, 2024.

[74] Jiang-Tian Zhai, Ze Feng, Jinhao Du, Yongqiang Mao, Jiang-Jiang Liu, Zichang Tan, Yifu Zhang, Xiaoqing Ye, and Jingdong Wang. Rethinking the open-loop evaluation of end-to-end autonomous driving in nuscenes. arXiv preprint arXiv:2305.10430, 2023.

[75] Bozhou Zhang, Nan Song, Xiatian Zhu, Jiankang Deng, Li Zhang, et al. Future-aware end-to-end driving: Bidirectional modeling of trajectory planning and scene evolution. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.

[76] Qingwen Zhang, Yi Yang, Peizheng Li, Olov Andersson, and Patric Jensfelt. Seflow: A self-supervised scene flow method in autonomous driving. In European Conference on Computer Vision. Springer, 2024.

[77] Wei Zhang, Pengfei Li, Junli Wang, Bingchuan Sun, Qihao Jin, Guangjun Bao, Shibo Rui, Yang Yu, Wenchao Ding, Peng Li, et al. Dual-aeb: Synergizing rule-based and multimodal large language models for effective emergency braking. In 2025 IEEE International Conference on Robotics and Automation (ICRA), 2025.

[78] Zhiyuan Zhang, Xiaofan Li, Zhihao Xu, Wenjie Peng, Zijian Zhou, Miaojing Shi, and Shuangping Huang. Mpdrive: Improving spatial understanding with marker-based prompt learning for autonomous driving. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025.

[79] Duo Zheng, Shijia Huang, and Liwei Wang. Video-3d llm: Learning position-aware video representation for 3d scene understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025.

[80] Wenzhao Zheng, Ruiqi Song, Xianda Guo, Chenming Zhang, and Long Chen. Genad: Generative end-to-end autonomous driving. In European Conference on Computer Vision, 2024.