SpaceDrive: Infusing Spatial Awareness into VLM-based Autonomous Driving
Pith reviewed 2026-05-22 12:13 UTC · model grok-4.3
The pith
Treating 3D coordinates as positional encodings instead of text digits lets VLMs jointly reason over semantics and space for driving plans.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SpaceDrive is a spatial-aware VLM-based driving framework that treats spatial information as explicit positional encodings rather than textual digit tokens. A universal positional encoder processes all 3D coordinates obtained from multi-view depth estimation, historical ego-states, and text prompts. These encodings are superimposed on the corresponding 2D visual tokens and simultaneously serve as a task-agnostic coordinate representation that the VLM uses for both input and output, enabling direct regression of trajectory coordinates and improved joint semantic-spatial reasoning.
What carries the argument
The universal positional encoder that converts 3D coordinates into positional encodings, superimposes them on visual tokens, and replaces digit-wise numerical tokens for VLM input and output.
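As a rough illustration of this mechanism, the sketch below encodes one 3D point with a per-axis sine-cosine encoding and adds the result to a visual token. The channel split (dx = dy = ⌈dim/3⌉, dz = dim − dx − dy) follows the paper passage quoted on this page. The frequency base of 10000, the function names, and the reading of "superimposed" as element-wise addition are assumptions made for the example, not details confirmed by the source.

```python
import math

def sincos_1d(value, dim, base=10000.0):
    """Standard 1D sine-cosine encoding of a scalar into `dim` channels."""
    enc = []
    for i in range(dim):
        # Channels 2k and 2k+1 share one frequency (base is an assumption).
        freq = base ** (-(2 * (i // 2)) / dim)
        angle = value * freq
        enc.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return enc

def encode_3d(x, y, z, dim):
    """Concatenate per-axis encodings using the split quoted from the paper."""
    if dim < 3:
        raise ValueError("dim must be at least 3 for a three-axis split")
    dx = dy = math.ceil(dim / 3)
    dz = dim - dx - dy
    return sincos_1d(x, dx) + sincos_1d(y, dy) + sincos_1d(z, dz)

def superimpose(visual_token, pe):
    """Assumed form of superposition: element-wise addition."""
    return [v + p for v, p in zip(visual_token, pe)]

pe = encode_3d(1.0, 2.0, 3.0, dim=8)
token = superimpose([0.5] * 8, pe)
```

With dim = 8 the split is 3 + 3 + 2 channels, so the z axis receives the remainder after x and y are allocated.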
If this is right
- The VLM can index specific visual semantics by their spatial location during reasoning.
- Trajectory coordinates are regressed directly instead of being assembled digit by digit.
- Planning accuracy improves because the model avoids errors from numerical text parsing.
- The same coordinate representation works across different driving tasks without task-specific tuning.
Where Pith is reading between the lines
- The same superposition technique could be tested on non-driving VLM tasks that require metric spatial output, such as visual question answering about object distances.
- Replacing digit tokens with positional encodings may reduce the model's sensitivity to prompt phrasing that describes numbers.
- If depth estimation quality is the main bottleneck, combining the encoder with stronger 3D perception backbones should produce measurable gains in closed-loop metrics.
Load-bearing premise
The multi-view depth estimation must produce 3D coordinates accurate enough that the derived positional encodings support reliable semantic-spatial reasoning without introducing new localization errors that degrade planning.
What would settle it
Measure whether planning errors increase in direct proportion to added noise in the depth estimates while holding the VLM and encoder fixed; a clear mismatch would indicate the encodings are not carrying the expected spatial signal.
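A minimal sketch of such a test harness follows, assuming a planner that can be called on a list of depth values and returns (x, y) waypoints. The names `sensitivity_curve` and `toy_planner` are hypothetical, and the toy planner is only a stand-in so the example runs; it is not the paper's model. The same random draws are reused at every noise level so that differences in error come from the noise scale alone.

```python
import math
import random

def mean_l2_error(pred, target):
    """Average Euclidean distance between matching (x, y) waypoints."""
    return sum(math.hypot(px - tx, py - ty)
               for (px, py), (tx, ty) in zip(pred, target)) / len(target)

def sensitivity_curve(depths, plan_fn, target, sigmas, seed=0):
    """Planning error for each depth-noise level, with the planner held fixed."""
    errors = []
    for sigma in sigmas:
        rng = random.Random(seed)  # identical noise draws at every level
        noisy = [d + sigma * rng.gauss(0.0, 1.0) for d in depths]
        errors.append(mean_l2_error(plan_fn(noisy), target))
    return errors

def toy_planner(depths):
    # Hypothetical stand-in: waypoints sit at the given depths along x.
    return [(d, 0.0) for d in depths]

depths = [5.0, 10.0, 15.0, 20.0]
target = toy_planner(depths)
errors = sensitivity_curve(depths, toy_planner, target, [0.0, 0.5, 1.0])
```

For the toy planner the error grows linearly with the noise scale by construction; with a real model, the shape of this curve is the quantity of interest.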
Original abstract
End-to-end autonomous driving methods built on vision language models (VLMs) have undergone rapid development driven by their universal visual understanding and strong reasoning capabilities obtained from large-scale pretraining. However, we find that current VLMs struggle to understand fine-grained 3D spatial relationships, which is a fundamental requirement for systems interacting with the physical world. To address this issue, we propose SpaceDrive, a spatial-aware VLM-based driving framework that treats spatial information as explicit positional encodings (PEs) instead of textual digit tokens, enabling joint reasoning over semantic and spatial representations. SpaceDrive applies a universal positional encoder to all 3D coordinates derived from multi-view depth estimation, historical ego-states, and text prompts. These 3D PEs are first superimposed on the corresponding 2D visual tokens to augment them. Meanwhile, they serve as a task-agnostic coordinate representation, replacing the digit-wise numerical tokens as both inputs and outputs for the VLM. This mechanism enables the model to better index specific visual semantics in spatial reasoning and to regress trajectory coordinates directly rather than generating them digit by digit, thereby enhancing planning accuracy. Extensive experiments validate that SpaceDrive achieves state-of-the-art open-loop performance on the nuScenes dataset and the second-best Driving Score of 78.02 on the Bench2Drive closed-loop benchmark among existing VLM-based methods. Code is available at: https://github.com/zhenghao2519/SpaceDrive.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SpaceDrive, a VLM-based end-to-end autonomous driving framework that addresses limitations in fine-grained 3D spatial reasoning by deriving 3D coordinates via multi-view depth estimation, encoding them with a universal positional encoder, superimposing the resulting positional encodings on visual tokens, and replacing digit-wise numerical tokens with these encodings for both VLM inputs and outputs. This enables joint semantic-spatial reasoning and direct trajectory regression. The paper reports state-of-the-art open-loop performance on nuScenes and a second-best closed-loop Driving Score of 78.02 on Bench2Drive among VLM-based methods, with code released.
Significance. If the central mechanism proves robust, the work offers a concrete architectural route to improve spatial awareness in pretrained VLMs for driving without relying on textual digit representations. The public code release is a positive contribution to reproducibility in the field.
major comments (2)
- [Methods (depth estimation and universal positional encoder)] Methods section on multi-view depth estimation and positional encoding: the central claim that superimposed 3D PEs enable reliable joint semantic-spatial reasoning without introducing new localization errors rests on the untested assumption that depth estimates remain sufficiently accurate under occlusions, scale ambiguity, and dynamic objects. No sensitivity analysis, depth-error propagation study, or ablation isolating the effect of depth inaccuracies on planning metrics is reported, which directly bears on whether the reported gains can be attributed to the proposed mechanism rather than other factors.
- [Experiments (closed-loop benchmark)] Experiments section (Bench2Drive results): the Driving Score of 78.02 is presented as second-best without error bars, variance across runs, or controls for post-hoc coordinate-handling choices. This weakens the ability to assess whether the improvement is robust or sensitive to implementation details in the positional encoding pipeline.
minor comments (2)
- [Abstract and Methods] Notation for the universal positional encoder is introduced in the abstract but would benefit from an explicit equation or diagram in the main text showing how 3D coordinates from depth, ego-states, and prompts are unified.
- [Figures and Tables] Figure captions and tables should explicitly state whether reported metrics include standard deviations or are single-run results to aid interpretation of benchmark comparisons.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and detailed comments, which help us improve the clarity and rigor of the manuscript. We address each major comment below and outline the revisions we will make.
-
Referee: [Methods (depth estimation and universal positional encoder)] Methods section on multi-view depth estimation and positional encoding: the central claim that superimposed 3D PEs enable reliable joint semantic-spatial reasoning without introducing new localization errors rests on the untested assumption that depth estimates remain sufficiently accurate under occlusions, scale ambiguity, and dynamic objects. No sensitivity analysis, depth-error propagation study, or ablation isolating the effect of depth inaccuracies on planning metrics is reported, which directly bears on whether the reported gains can be attributed to the proposed mechanism rather than other factors.
Authors: We agree that a direct analysis of depth estimation errors is valuable for substantiating the robustness of the proposed mechanism. The multi-view depth estimation in SpaceDrive is combined with a universal positional encoder that maps coordinates into a shared embedding space, which is designed to reduce sensitivity to per-view scale and occlusion issues. Existing ablations in the manuscript already isolate the contribution of the 3D positional encodings by comparing against variants without them, showing consistent gains in both semantic and planning metrics. Nevertheless, we acknowledge the absence of an explicit sensitivity study. In the revised manuscript we will add a new experiment that injects controlled noise into the depth estimates at different levels and measures the resulting degradation in open-loop trajectory regression and closed-loop Driving Score.
revision: partial
-
Referee: [Experiments (closed-loop benchmark)] Experiments section (Bench2Drive results): the Driving Score of 78.02 is presented as second-best without error bars, variance across runs, or controls for post-hoc coordinate-handling choices. This weakens the ability to assess whether the improvement is robust or sensitive to implementation details in the positional encoding pipeline.
Authors: We appreciate the referee's point on statistical robustness. The reported Driving Score follows the single-run protocol used by prior VLM-based methods on Bench2Drive. To strengthen the claim, we will rerun the closed-loop evaluation with multiple random seeds, report mean and standard deviation for the Driving Score and auxiliary metrics, and add a short paragraph clarifying that the coordinate-handling pipeline uses only the deterministic universal positional encoder with no post-hoc adjustments.
revision: yes
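The multi-seed reporting promised in the second response could be aggregated along the following lines. This is only a sketch: `summarize_runs` and `fake_evaluate` are hypothetical names, and the scores produced by `fake_evaluate` are made-up placeholders so the example runs, not benchmark results.

```python
import statistics

def summarize_runs(evaluate, seeds):
    """Run `evaluate(seed)` once per seed; return (mean, sample std)."""
    scores = [evaluate(seed) for seed in seeds]
    mean = statistics.mean(scores)
    # The sample standard deviation needs at least two runs.
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std

def fake_evaluate(seed):
    # Placeholder, not a real benchmark call: returns a made-up score.
    return 70.0 + seed

mean, std = summarize_runs(fake_evaluate, seeds=[0, 1, 2])
print(f"mean ± std over seeds: {mean:.2f} ± {std:.2f}")
```

Reporting the sample standard deviation rather than the population one is the usual choice when the seeds are treated as a small sample of possible runs.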
Circularity Check
No significant circularity: architectural proposal validated on external benchmarks
The paper introduces SpaceDrive as a new VLM architecture that derives 3D coordinates via multi-view depth estimation, encodes them as universal positional encodings, superimposes them on visual tokens, and uses them to replace digit tokens for input/output. This is presented as an explicit design choice to enable joint semantic-spatial reasoning. The central claims of SOTA open-loop performance on nuScenes and second-best closed-loop Driving Score on Bench2Drive are supported by empirical results on standard external datasets and benchmarks. No load-bearing step reduces by construction to a fitted parameter, self-defined quantity, or self-citation chain; the derivation remains self-contained against external evaluation.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Multi-view depth estimation yields sufficiently accurate 3D coordinates for the positional encodings to be useful.
invented entities (1)
- universal positional encoder: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking (D=3 forcing)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: we opt for a 3D sine-cosine positional encoding extending the standard 1D formulation dimension-wise: $\phi(c_p) = [\phi_x(x_p^{3D}), \phi_y(y_p^{3D}), \phi_z(z_p^{3D})]$ … $d_x = d_y = \lceil \mathrm{dim}/3 \rceil$, $d_z = \mathrm{dim} - d_x - d_y$
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (J-cost uniqueness)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: These 3D PEs are first superimposed to augment the corresponding 2D visual tokens … replacing the digit-wise numerical tokens as both inputs and outputs
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 4 Pith papers
- Distill to Think, Foresee to Act: Cognitive-Physical Reinforcement Learning for Autonomous Driving
  CoPhy distills VLM knowledge into a BEV encoder and uses an action-conditioned auto-regressive BEV world model inside GRPO with dual physical-cognitive rewards to reach SOTA on NAVSIM v1/v2 while adding language-based...
- ST-Prune: Training-Free Spatio-Temporal Token Pruning for Vision-Language Models in Autonomous Driving
  ST-Prune is a training-free spatio-temporal token pruning framework for VLMs in autonomous driving that achieves near-lossless results at 90% token reduction by exploiting motion volatility, temporal recency, and mult...
- EponaV2: Driving World Model with Comprehensive Future Reasoning
  EponaV2 advances perception-free driving world models by forecasting comprehensive future 3D geometry and semantic representations, achieving SOTA planning performance on NAVSIM benchmarks.
- XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments
  XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial...
Reference graph
Works this paper leans on
-
[1]
Covla: Comprehensive vision-language-action dataset for au- tonomous driving
Hidehisa Arai, Keita Miwa, Kento Sasaki, Kohei Watan- abe, Yu Yamaguchi, Shunsuke Aoki, and Issei Yamamoto. Covla: Comprehensive vision-language-action dataset for au- tonomous driving. In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2025. 3
work page 2025
-
[2]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 4, 5, 1
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[3]
Zhipeng Bao and Qianwen Li. Large language model-assisted autonomous vehicle recovery from immobilization.arXiv preprint arXiv:2510.26023, 2025. 6, 4
-
[4]
nuscenes: A multi- modal dataset for autonomous driving
Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Gi- ancarlo Baldan, and Oscar Beijbom. nuscenes: A multi- modal dataset for autonomous driving. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020. 2, 5, 6, 3
work page 2020
-
[5]
Spatialvlm: Endowing vision-language models with spatial reasoning capabilities
Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024. 2, 3
work page 2024
-
[6]
VADv2: End-to-End Vectorized Autonomous Driving via Probabilistic Planning
Shaoyu Chen, Bo Jiang, Hao Gao, Bencheng Liao, Qing Xu, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. Vadv2: End-to-end vectorized autonomous driving via probabilistic planning.arXiv preprint arXiv:2402.13243,
work page internal anchor Pith review Pith/arXiv arXiv
-
[7]
Solve: Synergy of language-vision and end-to-end networks for autonomous driving
Xuesong Chen, Linjiang Huang, Tao Ma, Rongyao Fang, Shaoshuai Shi, and Hongsheng Li. Solve: Synergy of language-vision and end-to-end networks for autonomous driving. InProceedings of the Computer Vision and Pattern Recognition Conference, 2025. 6, 3
work page 2025
-
[8]
3d aware region prompted vision language model.arXiv preprint arXiv:2509.13317,
An-Chieh Cheng, Yang Fu, Yukang Chen, Zhijian Liu, Xiao- long Li, Subhashree Radhakrishnan, Song Han, Yao Lu, Jan Kautz, Pavlo Molchanov, et al. 3d aware region prompted vision language model.arXiv preprint arXiv:2509.13317,
-
[9]
Shuxiao Ding, Yutong Yang, Julian Wiederer, Markus Braun, Peizheng Li, Juergen Gall, and Bin Yang. Tqd-track: Tem- poral query denoising for 3d multi-object tracking.arXiv preprint arXiv:2504.03258, 2025. 2
-
[10]
Carla: An open urban driving simulator
Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. Carla: An open urban driving simulator. InConference on robot learning, 2017. 5
work page 2017
-
[11]
Advanc- ing sequential numerical prediction in autoregressive models
Xiang Fei, Jinghui Lu, Qi Sun, Hao Feng, Yanjie Wang, Wei Shi, An-Lan Wang, Jingqun Tang, and Can Huang. Advanc- ing sequential numerical prediction in autoregressive models. arXiv preprint arXiv:2505.13077, 2025. 2, 4
-
[12]
Haoyu Fu, Diankun Zhang, Zongchuang Zhao, Jianfeng Cui, Dingkang Liang, Chong Zhang, Dingyuan Zhang, Hongwei Xie, Bing Wang, and Xiang Bai. Orion: A holistic end-to-end autonomous driving framework by vision-language instructed action generation.arXiv preprint arXiv:2503.19755, 2025. 1, 2, 3, 4, 6, 8
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[13]
Rao Fu, Jingyu Liu, Xilun Chen, Yixin Nie, and Wen- han Xiong. Scene-llm: Extending language model for 3d visual understanding and reasoning.arXiv preprint arXiv:2403.11401, 2024. 3
-
[14]
Mingzhe Guo, Zhipeng Zhang, Yuan He, Ke Wang, Liping Jing, and Haibin Ling. End-to-end autonomous driving with- out costly modularization and 3d manual annotation.IEEE Transactions on Pattern Analysis and Machine Intelligence,
-
[15]
Vdrive: Leveraging reinforced vla and diffusion policy for end-to-end autonomous driving
Ziang Guo and Zufeng Zhang. Vdrive: Leveraging reinforced vla and diffusion policy for end-to-end autonomous driving. arXiv preprint arXiv:2510.15446, 2025. 6, 4
-
[16]
Eta: Efficiency through thinking ahead, a dual approach to self-driving with large models
Shadi Hamdan, Chonghao Sima, Zetong Yang, Hongyang Li, and Fatma Guney. Eta: Efficiency through thinking ahead, a dual approach to self-driving with large models. InProceed- ings of the IEEE/CVF International Conference on Computer Vision, 2025. 6, 4
work page 2025
-
[17]
Lora: Low-rank adaptation of large language models.ICLR, 2022
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 2022. 5, 2
work page 2022
-
[18]
St-p3: End-to-end vision-based au- tonomous driving via spatial-temporal feature learning
Shengchao Hu, Li Chen, Penghao Wu, Hongyang Li, Junchi Yan, and Dacheng Tao. St-p3: End-to-end vision-based au- tonomous driving via spatial-temporal feature learning. In European Conference on Computer Vision, 2022. 2, 6, 3
work page 2022
-
[19]
Planning-oriented autonomous driving
Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023. 2, 6, 3, 4 9
work page 2023
-
[20]
Yi Huang, Lihui Jiang, Bingbing Liu, Hongbo Zhang, et al. Prioritizing perception-guided self-supervision: A new paradigm for causal modeling in end-to-end autonomous driv- ing. InThe Thirty-ninth Annual Conference on Neural Infor- mation Processing Systems, 2025. 2, 4
work page 2025
-
[21]
Mak- ing large language models better planners with reasoning- decision alignment
Zhijian Huang, Tao Tang, Shaoxiang Chen, Sihao Lin, Zequn Jie, Lin Ma, Guangrun Wang, and Xiaodan Liang. Mak- ing large language models better planners with reasoning- decision alignment. InEuropean Conference on Computer Vision, 2024. 6, 3
work page 2024
-
[22]
EMMA: End-to-End Multimodal Model for Autonomous Driving
Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, et al. Emma: End-to-end multimodal model for autonomous driving.arXiv preprint arXiv:2410.23262,
work page internal anchor Pith review Pith/arXiv arXiv
-
[23]
Xiaosong Jia, Yulu Gao, Li Chen, Junchi Yan, Patrick Langechuan Liu, and Hongyang Li. Driveadapter: Breaking the coupling barrier of perception and planning in end-to-end autonomous driving. InProceedings of the IEEE/CVF International Conference on Computer Vision,
-
[24]
Think twice before driving: Towards scalable decoders for end-to-end autonomous driving
Xiaosong Jia, Penghao Wu, Li Chen, Jiangwei Xie, Conghui He, Junchi Yan, and Hongyang Li. Think twice before driving: Towards scalable decoders for end-to-end autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023. 2, 6, 4
work page 2023
-
[25]
Xiaosong Jia, Zhenjie Yang, Qifeng Li, Zhiyuan Zhang, and Junchi Yan. Bench2drive: Towards multi-ability benchmark- ing of closed-loop end-to-end autonomous driving.Advances in Neural Information Processing Systems, 2024. 2, 5, 6, 8, 4
work page 2024
-
[26]
Drivetransformer: Unified transformer for scalable end-to- end autonomous driving
Xiaosong Jia, Junqi You, Zhiyuan Zhang, and Junchi Yan. Drivetransformer: Unified transformer for scalable end-to- end autonomous driving. InInternational Conference on Learning Representations (ICLR), 2025. 6, 4
work page 2025
-
[27]
Vad: Vectorized scene representation for ef- ficient autonomous driving
Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. Vad: Vectorized scene representation for ef- ficient autonomous driving. InProceedings of the IEEE/CVF International Conference on Computer Vision, 2023. 2, 6, 3, 4
work page 2023
-
[28]
Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving
Bo Jiang, Shaoyu Chen, Bencheng Liao, Xingyu Zhang, Wei Yin, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. Senna: Bridging large vision-language models and end- to-end autonomous driving.arXiv preprint arXiv:2410.22313,
work page internal anchor Pith review Pith/arXiv arXiv
-
[29]
Rethinking positional en- coding in language pre-training
Guolin Ke, Di He, and Tie-Yan Liu. Rethinking positional en- coding in language pre-training. InInternational Conference on Learning Representations, 2021. 2
work page 2021
-
[30]
Vlr-driver: Large vision-language-reasoning models for embodied autonomous driving
Fanjie Kong, Yitong Li, Weihuang Chen, Chen Min, Yizhe Li, Zhiqiang Gao, Haoyang Li, Zhongyu Guo, and Hongbin Sun. Vlr-driver: Large vision-language-reasoning models for embodied autonomous driving. InProceedings of the IEEE/CVF International Conference on Computer Vision,
-
[31]
Yuzhi Lai, Shenghai Yuan, Peizheng Li, Jun Lou, and Andreas Zell. Seer-var: Semantic egocentric environment reasoner for vehicle augmented reality.arXiv preprint arXiv:2508.17255,
-
[32]
FAM-HRI: Foundation-Model Assisted Multi-Modal Human-Robot Interaction Combining Gaze and Speech
Yuzhi Lai, Shenghai Yuan, Boya Zhang, Benjamin Kiefer, Peizheng Li, Tianchen Deng, and Andreas Zell. Fam- hri: Foundation-model assisted multi-modal human-robot interaction combining gaze and speech.arXiv preprint arXiv:2503.16492, 2025. 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[33]
Powerbev: A pow- erful yet lightweight framework for instance prediction in bird’s-eye view
Peizheng Li, Shuxiao Ding, Xieyuanli Chen, Niklas Hansel- mann, Marius Cordts, and Juergen Gall. Powerbev: A pow- erful yet lightweight framework for instance prediction in bird’s-eye view. InProceedings of the Thirty-Second Interna- tional Joint Conference on Artificial Intelligence, IJCAI-23,
-
[34]
Ago: Adaptive grounding for open world 3d occupancy prediction
Peizheng Li, Shuxiao Ding, You Zhou, Qingwen Zhang, Onat Inak, Larissa Triess, Niklas Hanselmann, Marius Cordts, and Andreas Zell. Ago: Adaptive grounding for open world 3d occupancy prediction. InProceedings of the IEEE/CVF international conference on computer vision, 2025. 2
work page 2025
- [35]
-
[36]
Yun Li, Yiming Zhang, Tao Lin, XiangRui Liu, Wenxiao Cai, Zheng Liu, and Bo Zhao. Sti-bench: Are mllms ready for precise spatial-temporal world understanding?arXiv preprint arXiv:2503.23765, 2025. 3
-
[37]
Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chong- hao Sima, Tong Lu, Qiao Yu, and Jifeng Dai. Bevformer: learning bird’s-eye-view representation from lidar-camera via spatiotemporal transformers.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. 2
work page 2024
-
[38]
Zhiqi Li, Zhiding Yu, Shiyi Lan, Jiahan Li, Jan Kautz, Tong Lu, and Jose M Alvarez. Is ego status all you need for open- loop end-to-end autonomous driving? InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024. 2, 5, 6, 1
work page 2024
-
[39]
Zhenxin Li, Shihao Wang, Shiyi Lan, Zhiding Yu, Zuxuan Wu, and Jose M. Alvarez. Hydra-next: Robust closed-loop driving with open-loop training. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025. 2, 4
work page 2025
-
[40]
Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving
Bencheng Liao, Shaoyu Chen, Haoran Yin, Bo Jiang, Cheng Wang, Sixu Yan, Xinbang Zhang, Xiangyu Li, Ying Zhang, Qian Zhang, et al. Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving. InProceedings of the Computer Vision and Pattern Recognition Conference, 2025. 2, 3, 4
work page 2025
-
[41]
Visual instruction tuning.Advances in neural information processing systems, 2023
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 2023. 4, 1
work page 2023
-
[42]
Haochen Liu, Tianyu Li, Haohan Yang, Li Chen, Caojun Wang, Ke Guo, Haochen Tian, Hongchen Li, Hongyang Li, and Chen Lv. Reinforced refinement with self-aware ex- pansion for end-to-end autonomous driving.arXiv preprint arXiv:2506.09800, 2025. 4
-
[43]
Shuai Liu, Quanmin Liang, Zefeng Li, Boyang Li, and Kai Huang. Gaussianfusion: Gaussian-based multi-sensor fu- sion for end-to-end autonomous driving.arXiv preprint arXiv:2506.00034, 2025. 2, 4 10
-
[44]
Wei Liu, Jiyuan Zhang, Binxiong Zheng, Yufeng Hu, Yingzhan Lin, and Zengfeng Zeng. X-driver: Explainable autonomous driving with vision-language models.arXiv preprint arXiv:2505.05098, 2025. 6, 4
-
[45]
Petr: Position embedding transformation for multi-view 3d object detection
Yingfei Liu, Tiancai Wang, Xiangyu Zhang, and Jian Sun. Petr: Position embedding transformation for multi-view 3d object detection. InEuropean conference on computer vision. Springer, 2022. 2
work page 2022
-
[46]
Real- ad: Towards human-like reasoning in end-to-end autonomous driving
Yuhang Lu, Jiadong Tu, Yuexin Ma, and Xinge Zhu. Real- ad: Towards human-like reasoning in end-to-end autonomous driving. InProceedings of the IEEE/CVF International Con- ference on Computer Vision, 2025. 6, 3, 4
work page 2025
-
[47]
Reason2drive: Towards interpretable and chain-based reasoning for autonomous driv- ing
Ming Nie, Renyuan Peng, Chunwei Wang, Xinyue Cai, Jian- hua Han, Hang Xu, and Li Zhang. Reason2drive: Towards interpretable and chain-based reasoning for autonomous driv- ing. InEuropean Conference on Computer Vision. Springer,
-
[48]
Vlp: Vision language planning for autonomous driving
Chenbin Pan, Burhaneddin Yaman, Tommaso Nesti, Abhirup Mallik, Alessandro G Allievi, Senem Velipasalar, and Liu Ren. Vlp: Vision language planning for autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024. 1, 3, 6
work page 2024
-
[49]
UniDepthV2: Universal Monocular Metric Depth Estimation Made Simpler
Luigi Piccinelli, Christos Sakaridis, Yung-Hsu Yang, Mat- tia Segu, Siyuan Li, Wim Abbeloos, and Luc Van Gool. Unidepthv2: Universal monocular metric depth estimation made simpler.arXiv preprint arXiv:2502.20110, 2025. 5, 2
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[50]
Simlingo: Vision-only closed-loop autonomous driving with language-action alignment
Katrin Renz, Long Chen, Elahe Arani, and Oleg Sinavski. Simlingo: Vision-only closed-loop autonomous driving with language-action alignment. InProceedings of the Computer Vision and Pattern Recognition Conference, 2025. 2, 4, 6, 7, 1
work page 2025
-
[51]
Shuyao Shang, Yuntao Chen, Yuqi Wang, Yingyan Li, and Zhaoxiang Zhang. Drivedpo: Policy learning via safety dpo for end-to-end autonomous driving.arXiv preprint arXiv:2509.17940, 2025. 2, 4
-
[52]
Lmdrive: Closed-loop end-to-end driving with large language models
Hao Shao, Yuxuan Hu, Letian Wang, Guanglu Song, Steven L Waslander, Yu Liu, and Hongsheng Li. Lmdrive: Closed-loop end-to-end driving with large language models. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024. 3
work page 2024
-
[53]
Yinzhe Shen, Omer Sahin Tas, Kaiwen Wang, Royden Wag- ner, and Christoph Stiller. Divide and merge: Motion and semantic learning in end-to-end autonomous driving.arXiv preprint arXiv:2502.07631, 2025. 2
-
[54]
Drivelm: Driving with graph visual question answering
Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. Drivelm: Driving with graph visual question answering. InEuropean conference on computer vision, 2024. 1, 2, 3
work page 2024
-
[55]
Don’t shake the wheel: Momentum- aware planning in end-to-end autonomous driving
Ziying Song, Caiyan Jia, Lin Liu, Hongyu Pan, Yongchang Zhang, Junming Wang, Xingyu Zhang, Shaoqing Xu, Lei Yang, and Yadan Luo. Don’t shake the wheel: Momentum- aware planning in end-to-end autonomous driving. InPro- ceedings of the Computer Vision and Pattern Recognition Conference, 2025. 6, 3, 4
work page 2025
-
[56]
Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 2024
Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 2024. 4
work page 2024
-
[57]
Sparsedrive: End-to-end au- tonomous driving via sparse scene representation
Wenchao Sun, Xuewu Lin, Yining Shi, Chuang Zhang, Hao- ran Wu, and Sifa Zheng. Sparsedrive: End-to-end au- tonomous driving via sparse scene representation. In2025 IEEE International Conference on Robotics and Automation (ICRA), 2025. 6, 3, 4
work page 2025
-
[58]
Yingqi Tang, Zhuoran Xu, Zhaotie Meng, and Erkang Cheng. Hip-ad: Hierarchical and multi-granularity planning with de- formable attention for autonomous driving in a single decoder. arXiv preprint arXiv:2503.08612, 2025. 6, 4
-
[59]
Drivevlm: The convergence of autonomous driving and large vision-language models
Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Yang Wang, Zhiyong Zhao, Kun Zhan, Peng Jia, XianPeng Lang, and Hang Zhao. Drivevlm: The convergence of autonomous driving and large vision-language models. InConference on Robot Learning, 2025. 1, 3, 6
work page 2025
-
[60]
Chi Wan, Yixin Cui, Jiatong Du, Shuo Yang, Yulong Bai, Peng Yi, Nan Li, and Yanjun Huang. Geminus: Dual-aware global and scene-adaptive mixture-of-experts for end-to-end autonomous driving.arXiv preprint arXiv:2507.14456, 2025. 6, 4
-
[61]
Om- nidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning
Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, and Jose M Alvarez. Om- nidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning. InProceedings of the Computer Vision and Pattern Recognition Conference, 2025. 1, 2, 3, 4, 5, 6
work page 2025
-
[62]
Detr3d: 3d object detection from multi-view images via 3d-to-2d queries
Yue Wang, Vitor Campagnolo Guizilini, Tianyuan Zhang, Yilun Wang, Hang Zhao, and Justin Solomon. Detr3d: 3d object detection from multi-view images via 3d-to-2d queries. InConference on robot learning, 2022. 2
work page 2022
-
[63]
Yuqi Wang, Jiawei He, Lue Fan, Hongxin Li, Yuntao Chen, and Zhaoxiang Zhang. Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024. 3
-
[64]
Xinshuo Weng, Boris Ivanovic, Yan Wang, Yue Wang, and Marco Pavone. Para-drive: Parallelized architecture for real-time autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
-
[65]
Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. Spatial-MLLM: Boosting MLLM capabilities in visual-based spatial intelligence. arXiv preprint arXiv:2505.23747, 2025. 2, 3
-
[66]
Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kwan-Yee K Wong, Zhenguo Li, and Hengshuang Zhao. Drivegpt4: Interpretable end-to-end autonomous driving via large language model. IEEE Robotics and Automation Letters,
-
[67]
Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025. 2, 3
-
[68]
Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything v2. Advances in Neural Information Processing Systems,
-
[69]
Zhenjie Yang, Yilin Chai, Xiaosong Jia, Qifeng Li, Yuqian Shao, Xuekai Zhu, Haisheng Su, and Junchi Yan. DriveMoE: Mixture-of-experts for vision-language-action model in end-to-end autonomous driving. arXiv preprint arXiv:2505.16278,
-
[70]
Zhenjie Yang, Xiaosong Jia, Qifeng Li, Xue Yang, Maoqing Yao, and Junchi Yan. Raw2drive: Reinforcement learning with aligned world models for end-to-end autonomous driving (in carla v2). arXiv preprint arXiv:2505.16394, 2025. 4
-
[71]
Hang Yu, Julian Jordan, Julian Schmidt, Silvan Lindner, Alessandro Canevaro, and Wilhelm Stork. Hype: Hybrid planning with ego proposal-conditioned predictions. arXiv preprint arXiv:2510.12733, 2025. 2
-
[72]
Haibao Yu, Wenxian Yang, Ruiyang Hao, Chuanye Wang, Jiaru Zhong, Ping Luo, and Zaiqing Nie. Drivee2e: Closed-loop benchmark for end-to-end autonomous driving through real-to-simulation. arXiv preprint arXiv:2509.23922, 2025. 2
-
[73]
Jianhao Yuan, Shuyang Sun, Daniel Omeiza, Bo Zhao, Paul Newman, Lars Kunze, and Matthew Gadd. Rag-driver: Generalisable driving explanations with retrieval-augmented in-context multi-modal large language model learning. In Robotics: Science and Systems, 2024. 3
-
[74]
Jiang-Tian Zhai, Ze Feng, Jinhao Du, Yongqiang Mao, Jiang-Jiang Liu, Zichang Tan, Yifu Zhang, Xiaoqing Ye, and Jingdong Wang. Rethinking the open-loop evaluation of end-to-end autonomous driving in nuScenes. arXiv preprint arXiv:2305.10430, 2023. 2, 5, 6, 1, 4
-
[75]
Bozhou Zhang, Nan Song, Xiatian Zhu, Jiankang Deng, Li Zhang, et al. Future-aware end-to-end driving: Bidirectional modeling of trajectory planning and scene evolution. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. 2, 4
-
[76]
Qingwen Zhang, Yi Yang, Peizheng Li, Olov Andersson, and Patric Jensfelt. Seflow: A self-supervised scene flow method in autonomous driving. In European Conference on Computer Vision. Springer, 2024. 2
-
[77]
Wei Zhang, Pengfei Li, Junli Wang, Bingchuan Sun, Qihao Jin, Guangjun Bao, Shibo Rui, Yang Yu, Wenchao Ding, Peng Li, et al. Dual-aeb: Synergizing rule-based and multimodal large language models for effective emergency braking. In 2025 IEEE International Conference on Robotics and Automation (ICRA), 2025. 6, 4
-
[78]
Zhiyuan Zhang, Xiaofan Li, Zhihao Xu, Wenjie Peng, Zijian Zhou, Miaojing Shi, and Shuangping Huang. Mpdrive: Improving spatial understanding with marker-based prompt learning for autonomous driving. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025. 2
-
[79]
Duo Zheng, Shijia Huang, and Liwei Wang. Video-3d llm: Learning position-aware video representation for 3d scene understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025. 3
-
[80]
Wenzhao Zheng, Ruiqi Song, Xianda Guo, Chenming Zhang, and Long Chen. Genad: Generative end-to-end autonomous driving. In European Conference on Computer Vision, 2024. 2, 6, 3, 4