TPS-Drive: Task-Guided Representation Purification for VLM-based Autonomous Driving

Jiaxiang Li; Ke Ma; Yumao Liu

arxiv: 2605.27038 · v1 · pith:O3Q3HRJQnew · submitted 2026-05-26 · 💻 cs.RO

TPS-Drive: Task-Guided Representation Purification for VLM-based Autonomous Driving

Jiaxiang Li , Yumao Liu , Ke Ma This is my paper

Pith reviewed 2026-06-29 17:40 UTC · model grok-4.3

classification 💻 cs.RO

keywords Vision-Language ModelsAutonomous DrivingVector QuantizationRepresentation PurificationSpatial ForecastingTask-Guided LearningnuScenesNAVSIM

0 comments

The pith

An agent-centric tokenizer uses frozen 3D detection to reallocate codebook capacity from backgrounds to dynamic agents in VLM-based driving.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that VLMs for autonomous driving suffer either from flattened spatial states that cause hallucinations or from dense visuals that overwhelm tokenizers with background noise. TPS-Drive counters this with a purification step that applies task-guided vector quantization, supervised by a frozen 3D detector, to concentrate the limited codebook on moving agents. A decoupled pipeline then handles scene understanding, forecasting, and planning in sequence, trained progressively with reward refinement. This yields measurable gains in spatial accuracy and safety on standard benchmarks. The approach treats representation purification as the direct fix for bridging semantic reasoning with precise 3D forecasting.

Core claim

TPS-Drive shows that an Agent-Centric Tokenizer, built on task-guided vector quantization supervised by a frozen 3D detection head, reallocates codebook capacity from pervasive static backgrounds to critical dynamic agents, isolating spatial redundancy and supplying a purified spatial vocabulary that supports a decoupled reasoning pipeline of scene understanding, future forecasting, and action generation, with optimization via three-stage training that ends in reward-driven refinement.

What carries the argument

Agent-Centric Tokenizer: a task-guided vector quantization mechanism supervised by a frozen 3D detection head that reallocates limited codebook capacity to dynamic agents.

If this is right

Accurate agent spatial state forecasting is achieved in open-loop nuScenes evaluations.
Collision rates are reduced in those open-loop evaluations.
New safety records are established on the closed-loop NAVSIMv1 and NAVSIMv2 benchmarks.
The three-stage training with reward-driven refinement surpasses pure imitation learning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same purification step could be tested on other VLM tasks that require isolating sparse but critical spatial signals amid heavy background texture.
If the frozen detector works as supervisor, replacing it with a weaker or noisier head would directly test whether the capacity reallocation depends on high-quality 3D labels.
The decoupled pipeline separates understanding from forecasting; removing the decoupling and forcing joint prediction would show whether the separation itself contributes to the observed gains.

Load-bearing premise

The frozen 3D detection head can reliably supervise vector quantization to reallocate codebook capacity to dynamic agents without losing critical spatial details or introducing supervision biases that affect downstream planning.

What would settle it

An experiment that applies the same VLM backbone and training stages but replaces the agent-centric tokenizer with standard dense visual tokenization and shows no improvement or a decline in closed-loop safety metrics on NAVSIMv2 would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.27038 by Jiaxiang Li, Ke Ma, Yumao Liu.

**Figure 1.** Figure 1: Comparison of representation strategies. (A) Text and numerical formats induce spatial hallucinations, as symbolic compression destroys geometric proximity. (B) Dense grid representations cause representation interference, where redundant backgrounds overwhelm semantic reasoning. (C) Our TPS-Drive achieves representation purification via task-guided VQ, enabling the strict encoding of task-critical dynamic… view at source ↗

**Figure 2.** Figure 2: Overview of TPS-Drive. The decoupled VLM explicitly separates scene understanding, future forecasting, and action generation. The bottom panel illustrates the progressive three-stage training paradigm: tokenizer pretraining, supervised fine-tuning, and reward-driven refinement. where D, H, and W denote the feature dimension and spatial resolutions. Alongside this visual context, the system receives a high-… view at source ↗

**Figure 3.** Figure 3: Qualitative comparison in safety-critical scenarios. Rows denote visual Observation, internal Thinking, and trajectory Planning. In Planning, green and red indicate predicted and groundtruth trajectories, respectively. Left: Facing a pedestrian, the baseline wrongly plans "Forward", while TPS-Drive explicitly predicts 3D bounding boxes to safely "Stop". Right: In a cluttered corridor, the baseline makes a… view at source ↗

read the original abstract

Vision-Language Models (VLMs) provide a promising foundation for autonomous driving planning, yet bridging semantic reasoning and precise 3D spatial forecasting remains a critical challenge. Existing representation strategies generally follow two paths: text-aligned methods flatten continuous spatial states into symbols, which compromises geometric structure and induces "spatial hallucinations"; dense visual methods preserve spatial topology but overwhelm standard tokenizers with redundant background textures, leading to "representation interference". To address these limitations, we introduce TPS-Drive, a novel framework centered on Task-Guided Representation Purification that empowers VLMs to Think in Purified Space. At its core, an Agent-Centric Tokenizer utilizes a task-guided vector quantization mechanism supervised by a frozen 3D detection head, which explicitly reallocates limited codebook capacity from pervasive static backgrounds to critical dynamic agents and effectively isolates spatial redundancy. Leveraging this purified spatial vocabulary, TPS-Drive employs a decoupled reasoning pipeline that sequentially performs scene understanding, future forecasting, and action generation. The framework is optimized via a progressive three-stage training paradigm, culminating in reward-driven refinement that surpasses pure imitation learning. Extensive experiments validate our approach: TPS-Drive achieves accurate agent spatial state forecasting and reduces collision rates in open-loop nuScenes evaluations, while establishing new safety records on the rigorous closed-loop NAVSIMv1 and NAVSIMv2 benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

TPS-Drive introduces an agent-centric tokenizer that uses a frozen 3D detector to guide vector quantization toward dynamic agents, but the abstract supplies no results or ablations to check whether this actually works.

read the letter

The core idea is a task-guided vector quantization step inside an agent-centric tokenizer, supervised by a frozen 3D detection head, meant to shift codebook capacity away from static background and toward moving agents. The rest of the pipeline is a standard decoupled sequence of understanding, forecasting, and planning, trained in three progressive stages with a final reward step.

What stands out is the explicit diagnosis of the two prior failure modes—text-aligned flattening that loses geometry and dense visual tokens that drown the model in background—and the concrete mechanism offered to fix the second one. That framing is clear and directly tied to the driving task.

The obvious weakness is that none of the performance claims (lower collisions on nuScenes, new safety records on NAVSIM) are accompanied by numbers, error bars, or ablations in the abstract. The central assumption—that the frozen detector supplies an unbiased, planning-relevant signal for codebook allocation—receives no supporting evidence here, so the stress-test concern about supervision bias or error propagation cannot be dismissed. Without the methods section or tables it is impossible to tell whether the reported gains are real or artifacts of the supervision choice.

This is for researchers already working on VLM planners for robotics who need a specific proposal to test against their own baselines. A reader looking for a worked-out alternative to dense or text-only tokenization would find the direction worth following once the full paper and code appear. It is worth sending to referees because the problem is well-posed and the proposed fix is specific enough to evaluate, even though the current version is too thin on evidence to accept as is.

Referee Report

2 major / 1 minor

Summary. The paper proposes TPS-Drive, a VLM-based autonomous driving framework centered on Task-Guided Representation Purification. Its core is an Agent-Centric Tokenizer that applies task-guided vector quantization supervised by a frozen 3D detection head to reallocate codebook capacity toward dynamic agents, creating a 'purified spatial vocabulary.' This feeds a decoupled reasoning pipeline (scene understanding, future forecasting, action generation) trained via a three-stage progressive paradigm ending in reward-driven refinement. The abstract claims this yields accurate agent spatial state forecasting, reduced collision rates in open-loop nuScenes evaluations, and new safety records on closed-loop NAVSIMv1/v2 benchmarks.

Significance. If the central claims hold, the approach could meaningfully advance VLM integration into autonomous driving by mitigating spatial hallucinations and representation interference through explicit purification of the spatial token space. The three-stage training with reward refinement is a constructive element that moves beyond pure imitation learning.

major comments (2)

[Abstract] Abstract: The performance claims (accurate forecasting and new safety records on nuScenes/NAVSIM) rest on the assumption that supervision from the frozen 3D detection head produces an unbiased, planning-aligned signal for VQ codebook reallocation. No analysis is provided showing that detector errors (missed agents, inaccurate boxes) do not propagate into the discrete tokens or that the detection objective aligns with downstream planning topology preservation; this is load-bearing for the 'purified spatial vocabulary' claim.
[Abstract] Abstract: The abstract states that the tokenizer 'explicitly reallocates limited codebook capacity from pervasive static backgrounds to critical dynamic agents' but supplies no quantitative evidence (e.g., codebook utilization statistics, ablation on supervision source, or topology preservation metrics) that the mechanism avoids distribution shift between detection and planning objectives.

minor comments (1)

The abstract refers to 'extensive experiments' and 'new safety records' without reporting any numerical results, error bars, baseline comparisons, or ablation studies, which hinders assessment of effect sizes.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review of our manuscript on TPS-Drive. The comments correctly identify areas where the abstract claims would benefit from stronger supporting analysis. We address each major comment below and commit to revisions that directly incorporate the requested evidence and clarifications.

read point-by-point responses

Referee: [Abstract] Abstract: The performance claims (accurate forecasting and new safety records on nuScenes/NAVSIM) rest on the assumption that supervision from the frozen 3D detection head produces an unbiased, planning-aligned signal for VQ codebook reallocation. No analysis is provided showing that detector errors (missed agents, inaccurate boxes) do not propagate into the discrete tokens or that the detection objective aligns with downstream planning topology preservation; this is load-bearing for the 'purified spatial vocabulary' claim.

Authors: We agree that the abstract does not contain explicit analysis of detector error propagation or alignment between the detection objective and planning topology. The manuscript reports overall performance gains from the task-guided tokenizer but does not isolate the effect of detection inaccuracies on token assignments. In the revised version we will add a new analysis subsection that quantifies token sensitivity to common detector failure modes and introduces topology preservation metrics (e.g., agent-centric IoU between original and tokenized representations) to substantiate the purification claim. revision: yes
Referee: [Abstract] Abstract: The abstract states that the tokenizer 'explicitly reallocates limited codebook capacity from pervasive static backgrounds to critical dynamic agents' but supplies no quantitative evidence (e.g., codebook utilization statistics, ablation on supervision source, or topology preservation metrics) that the mechanism avoids distribution shift between detection and planning objectives.

Authors: The experimental section already contains codebook utilization statistics and an ablation on the supervision source. However, these results are not referenced in the abstract, and explicit topology preservation metrics addressing potential distribution shift are absent. We will revise the abstract to cite the existing quantitative results and add topology preservation metrics in the revision to more rigorously demonstrate that the reallocation remains beneficial for downstream planning. revision: partial

Circularity Check

0 steps flagged

No circularity; abstract provides descriptive framework without equations or self-referential reductions

full rationale

The provided abstract and description contain no equations, derivations, fitted parameters presented as predictions, or self-citations. The framework is introduced at a high level with claims of empirical validation on nuScenes and NAVSIM benchmarks. No load-bearing step reduces by construction to its own inputs, and the derivation chain is not observable in the given text, making the result self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities are detailed beyond the high-level description of the tokenizer and purified vocabulary.

invented entities (1)

Purified spatial vocabulary no independent evidence
purpose: Isolates spatial redundancy by reallocating codebook capacity to dynamic agents
Core output of the agent-centric tokenizer described in the abstract; no independent evidence provided.

pith-pipeline@v0.9.1-grok · 5766 in / 1209 out tokens · 35485 ms · 2026-06-29T17:40:18.476224+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

94 extracted references · 36 canonical work pages · 17 internal anchors

[1]

A survey on vision-language-action models for autonomous driving

Sicong Jiang, Zilin Huang, Kangan Qian, Ziang Luo, Tianze Zhu, Yang Zhong, Yihong Tang, Menglin Kong, Yunlong Wang, Siwen Jiao, et al. A survey on vision-language-action models for autonomous driving. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4524–4536, 2025

2025
[2]

Vision language models in autonomous driving: A survey and outlook.IEEE Transactions on Intelligent Vehicles, 2024

Xingcheng Zhou, Mingyu Liu, Ekim Yurtsever, Bare Luka Zagar, Walter Zimmer, Hu Cao, and Alois C Knoll. Vision language models in autonomous driving: A survey and outlook.IEEE Transactions on Intelligent Vehicles, 2024

2024
[3]

Large (vision) language models for autonomous vehicles: Current trends and future directions.IEEE Transactions on Intelligent Transportation Systems, 27(1):187–210, 2025

Hanlin Tian, Kethan Reddy, Yuxiang Feng, Mohammed Quddus, Yiannis Demiris, and Pana- giotis Angeloudis. Large (vision) language models for autonomous vehicles: Current trends and future directions.IEEE Transactions on Intelligent Transportation Systems, 27(1):187–210, 2025

2025
[4]

Vggdrive: Empowering vision-language models with cross-view geometric grounding for autonomous driving.arXiv preprint arXiv:2602.20794, 2026

Jie Wang, Guang Li, Zhijian Huang, Chenxu Dang, Hangjun Ye, Yahong Han, and Long Chen. Vggdrive: Empowering vision-language models with cross-view geometric grounding for autonomous driving.arXiv preprint arXiv:2602.20794, 2026

work page arXiv 2026
[5]

Vlm4d: Towards spatiotem- poral awareness in vision language models

Shijie Zhou, Alexander Vilesov, Xuehai He, Ziyu Wan, Shuwang Zhang, Aditya Nagachandra, Di Chang, Dongdong Chen, Xin Eric Wang, and Achuta Kadambi. Vlm4d: Towards spatiotem- poral awareness in vision language models. InProceedings of the IEEE/CVF international conference on computer vision, pages 8600–8612, 2025

2025
[6]

Surds: Benchmarking spatial understanding and reasoning in driving scenarios with vision language models.arXiv preprint arXiv:2411.13112, 2024

Xianda Guo, Ruijun Zhang, Yiqun Duan, Yuhang He, Dujun Nie, Wenke Huang, Chenming Zhang, Shuai Liu, Hao Zhao, and Long Chen. Surds: Benchmarking spatial understanding and reasoning in driving scenarios with vision language models.arXiv preprint arXiv:2411.13112, 2024

work page arXiv 2024
[7]

Language prompt for autonomous driving

Dongming Wu, Wencheng Han, Yingfei Liu, Tiancai Wang, Cheng-zhong Xu, Xiangyu Zhang, and Jianbing Shen. Language prompt for autonomous driving. InProceedings of the AAAI conference on artificial intelligence, volume 39, pages 8359–8367, 2025

2025
[8]

Integrating object detection modality into visual language model for enhanced autonomous driving agent.arXiv preprint arXiv:2411.05898, 2024

Linfeng He, Yiming Sun, Sihao Wu, Jiaxu Liu, and Xiaowei Huang. Integrating object detection modality into visual language model for enhanced autonomous driving agent.arXiv preprint arXiv:2411.05898, 2024

work page arXiv 2024
[9]

SpaceDrive: Infusing Spatial Awareness into VLM-based Autonomous Driving

Peizheng Li, Zhenghao Zhang, David Holtz, Hang Yu, Yutong Yang, Yuzhi Lai, Rui Song, Andreas Geiger, and Andreas Zell. Spacedrive: Infusing spatial awareness into vlm-based autonomous driving.arXiv preprint arXiv:2512.10719, 2, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

Last-vla: Thinking in latent spatio- temporal space for vision-language-action in autonomous driving.arXiv preprint arXiv:2603.01928, 2026

Yuechen Luo, Fang Li, Shaoqing Xu, Yang Ji, Zehan Zhang, Bing Wang, Yuannan Shen, Jianwei Cui, Long Chen, Guang Chen, et al. Last-vla: Thinking in latent spatio-temporal space for vision-language-action in autonomous driving.arXiv preprint arXiv:2603.01928, 2026

work page arXiv 2026
[11]

FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving

Shuang Zeng, Xinyuan Chang, Mengwei Xie, Xinran Liu, Yifan Bai, Zheng Pan, Mu Xu, Xing Wei, and Ning Guo. Futuresightdrive: Thinking visually with spatio-temporal cot for autonomous driving.arXiv preprint arXiv:2505.17685, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

Occworld: Learning a 3d occupancy world model for autonomous driving

Wenzhao Zheng, Weiliang Chen, Yuanhui Huang, Borui Zhang, Yueqi Duan, and Jiwen Lu. Occworld: Learning a 3d occupancy world model for autonomous driving. InEuropean conference on computer vision, pages 55–72. Springer, 2024

2024
[13]

Fastdrivevla: Efficient end-to-end driving via plug-and-play reconstruction-based token pruning

Jiajun Cao, Qizhe Zhang, Peidong Jia, Xuhui Zhao, Bo Lan, Xiaoan Zhang, Xiaobao Wei, Sixiang Chen, Liyun Li, Xianming Liu, et al. Fastdrivevla: Efficient end-to-end driving via plug-and-play reconstruction-based token pruning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 2571–2579, 2026

2026
[14]

Self-supervised multi- modal learning: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(7):5299–5318, 2024

Yongshuo Zong, Oisin Mac Aodha, and Timothy M Hospedales. Self-supervised multi- modal learning: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(7):5299–5318, 2024. 10

2024
[15]

Scaling laws for native multimodal models

Mustafa Shukor, Enrico Fini, Victor Guilherme Turrisi da Costa, Matthieu Cord, Joshua Susskind, and Alaaeldin El-Nouby. Scaling laws for native multimodal models. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 12–23, 2025

2025
[16]

Llm4drive: A survey of large language models for autonomous driving.arXiv preprint arXiv:2311.01043, 2023

Zhenjie Yang, Xiaosong Jia, Hongyang Li, and Junchi Yan. Llm4drive: A survey of large language models for autonomous driving.arXiv preprint arXiv:2311.01043, 2023

work page arXiv 2023
[17]

Drivelm: Driving with graph visual question answering

Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. Drivelm: Driving with graph visual question answering. InEuropean conference on computer vision, pages 256–274. Springer, 2024

2024
[18]

arXiv preprint arXiv:2506.06218 (2025) 3, 9

Christian Fruhwirth-Reisinger, Dušan Mali´c, Wei Lin, David Schinagl, Samuel Schulter, and Horst Possegger. Stsbench: A spatio-temporal scenario benchmark for multi-modal large language models in autonomous driving.arXiv preprint arXiv:2506.06218, 2025

work page arXiv 2025
[19]

How Well Do Vision-Language Models Understand Sequential Driving Scenes? A Sensitivity Study

Roberto Brusnicki, Mattia Piccinini, and Johannes Betz. How well do vision-language models understand sequential driving scenes? a sensitivity study.arXiv preprint arXiv:2604.06750, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[20]

Are vision llms road-ready? a comprehensive benchmark for safety-critical driving video understanding

Tong Zeng, Longfeng Wu, Liang Shi, Dawei Zhou, and Feng Guo. Are vision llms road-ready? a comprehensive benchmark for safety-critical driving video understanding. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 2, pages 5972–5983, 2025

2025
[21]

Drivegpt4: Interpretable end-to-end autonomous driving via large language model.IEEE Robotics and Automation Letters, 9(10):8186–8193, 2024

Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kwan-Yee K Wong, Zhenguo Li, and Hengshuang Zhao. Drivegpt4: Interpretable end-to-end autonomous driving via large language model.IEEE Robotics and Automation Letters, 9(10):8186–8193, 2024

2024
[22]

Lmdrive: Closed-loop end-to-end driving with large language models

Hao Shao, Yuxuan Hu, Letian Wang, Guanglu Song, Steven L Waslander, Yu Liu, and Hong- sheng Li. Lmdrive: Closed-loop end-to-end driving with large language models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15120–15130, 2024

2024
[23]

Bench2drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving.Advances in Neural Information Processing Systems, 37:819–844, 2024

Xiaosong Jia, Zhenjie Yang, Qifeng Li, Zhiyuan Zhang, and Junchi Yan. Bench2drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving.Advances in Neural Information Processing Systems, 37:819–844, 2024

2024
[24]

Bench2drive-vl: Benchmarks for closed-loop autonomous driving with vision-language models

Xiaosong Jia, Yuqian Shao, Zhenjie Yang, Qifeng Li, Zhiyuan Zhang, and Junchi Yan. Bench2drive-vl: Benchmarks for closed-loop autonomous driving with vision-language models. arXiv preprint arXiv:2604.01259, 2026

work page arXiv 2026
[25]

Carla: An open urban driving simulator

Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. Carla: An open urban driving simulator. InConference on robot learning, pages 1–16. PMLR, 2017

2017
[26]

Drivemlm: aligning multi-modal large language models with behavioral planning states for autonomous driving.Visual Intelligence, 3(1):22, 2025

Erfei Cui, Wenhai Wang, Zhiqi Li, Jiangwei Xie, Haoming Zou, Hanming Deng, Gen Luo, Lewei Lu, Xizhou Zhu, and Jifeng Dai. Drivemlm: aligning multi-modal large language models with behavioral planning states for autonomous driving.Visual Intelligence, 3(1):22, 2025

2025
[27]

Omnidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning

Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, and Jose M Alvarez. Omnidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning. InProceedings of the computer vision and pattern recognition conference, pages 22442–22452, 2025

2025
[28]

Bevdriver: Leveraging bev maps in llms for robust closed-loop driving

Katharina Winter, Mark Azer, and Fabian B Flohr. Bevdriver: Leveraging bev maps in llms for robust closed-loop driving. In2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 20379–20385. IEEE, 2025

2025
[29]

Occvla: Vision-language-action model with implicit 3d occupancy supervision.arXiv preprint arXiv:2509.05578, 2025

Ruixun Liu, Lingyu Kong, Derun Li, and Hang Zhao. Occvla: Vision-language-action model with implicit 3d occupancy supervision.arXiv preprint arXiv:2509.05578, 2025

work page arXiv 2025
[30]

Perceptio: Perception enhanced vision language models via spatial token generation.arXiv preprint arXiv:2603.18795, 2026

Yuchen Li, Amanmeet Garg, Shalini Chaudhuri, Rui Zhao, and Garin Kessler. Perceptio: Perception enhanced vision language models via spatial token generation.arXiv preprint arXiv:2603.18795, 2026. 11

work page arXiv 2026
[31]

3D-VCD: Hallucination Mitigation in 3D-LLM Embodied Agents through Visual Contrastive Decoding

Makanjuola Ogunleye, Eman Abdelrahman, and Ismini Lourentzou. 3d-vcd: Hallucination mitigation in 3d-llm embodied agents through visual contrastive decoding.arXiv preprint arXiv:2604.08645, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[32]

Hallucination elimi- nation and semantic enhancement framework for vision-language models in traffic scenarios

Jiaqi Fan, Jianhua Wu, Hongqing Chu, Quanbo Ge, and Bingzhao Gao. Hallucination elimi- nation and semantic enhancement framework for vision-language models in traffic scenarios. arXiv preprint arXiv:2412.07518, 2024

work page arXiv 2024
[33]

Task interference in vlms for autonomous driving: When better perception hurts planning

Sai Bhargav Rongali, Aadya Pipersenia, and Kenji Okuma. Task interference in vlms for autonomous driving: When better perception hurts planning. InAAAI 2026 Workshop on Assessing and Improving Reliability of Foundation Models in the Real World

2026
[34]

DynFlowDrive: Flow-Based Dynamic World Modeling for Autonomous Driving

Xiaolu Liu, Yicong Li, Song Wang, Junbo Chen, Angela Yao, and Jianke Zhu. Dynflowdrive: Flow-based dynamic world modeling for autonomous driving.arXiv preprint arXiv:2603.19675, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[35]

The role of world models in shaping autonomous driving: A comprehensive survey.arXiv preprint arXiv:2502.10498, 2025

Sifan Tu, Xin Zhou, Dingkang Liang, Xingyu Jiang, Yumeng Zhang, Xiaofan Li, and Xiang Bai. The role of world models in shaping autonomous driving: A comprehensive survey.arXiv preprint arXiv:2502.10498, 2025

work page arXiv 2025
[36]

World4drive: End-to-end autonomous driving via intention-aware physical latent world model

Yupeng Zheng, Pengxuan Yang, Zebin Xing, Qichao Zhang, Yuhang Zheng, Yinfeng Gao, Pengfei Li, Teng Zhang, Zhongpu Xia, Peng Jia, et al. World4drive: End-to-end autonomous driving via intention-aware physical latent world model. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 28632–28642, 2025

2025
[37]

A survey of world models for autonomous driving

Tuo Feng, Wenguan Wang, and Yi Yang. A survey of world models for autonomous driving. arXiv preprint arXiv:2501.11260, 2025

work page arXiv 2025
[38]

Dio: Decomposable implicit 4d occupancy-flow world model

Christopher Diehl, Quinlan Sykora, Ben Agro, Thomas Gilles, Sergio Casas, and Raquel Urtasun. Dio: Decomposable implicit 4d occupancy-flow world model. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 27456–27466, 2025

2025
[39]

ProDrive: Proactive Planning for Autonomous Driving via Ego-Environment Co-Evolution

Chuyao Fu, Shengzhe Gan, Zhuoli Ouyang, Yuhan Rui, Xiaowei Chi, Sirui Han, Jiankun Wang, and Hong Zhang. Prodrive: Proactive planning for autonomous driving via ego-environment co-evolution.arXiv preprint arXiv:2604.25329, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[40]

Visual point cloud forecasting enables scalable autonomous driving

Zetong Yang, Li Chen, Yanan Sun, and Hongyang Li. Visual point cloud forecasting enables scalable autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14673–14684, 2024

2024
[41]

Uno: Unsu- pervised occupancy fields for perception and forecasting

Ben Agro, Quinlan Sykora, Sergio Casas, Thomas Gilles, and Raquel Urtasun. Uno: Unsu- pervised occupancy fields for perception and forecasting. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14487–14496, 2024

2024
[42]

Dome: Tam- ing diffusion model into high-fidelity controllable occupancy world model.arXiv preprint arXiv:2410.10429, 2024

Songen Gu, Wei Yin, Bu Jin, Xiaoyang Guo, Junming Wang, Haodong Li, Qian Zhang, and Xiaoxiao Long. Dome: Taming diffusion model into high-fidelity controllable occupancy world model.arXiv preprint arXiv:2410.10429, 2024

work page arXiv 2024
[43]

Driveworld: 4d pre-trained scene understanding via world models for autonomous driving

Chen Min, Dawei Zhao, Liang Xiao, Jian Zhao, Xinli Xu, Zheng Zhu, Lei Jin, Jianshu Li, Yulan Guo, Junliang Xing, et al. Driveworld: 4d pre-trained scene understanding via world models for autonomous driving. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15522–15533, 2024

2024
[44]

Smart: Scalable multi-agent real-time motion generation via next-token prediction.Advances in Neural Information Processing Systems, 37:114048–114071, 2024

Wei Wu, Xiaoxin Feng, Ziyan Gao, and Yuheng Kan. Smart: Scalable multi-agent real-time motion generation via next-token prediction.Advances in Neural Information Processing Systems, 37:114048–114071, 2024

2024
[45]

Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

1901
[46]

GAIA-1: A Generative World Model for Autonomous Driving

Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. Gaia-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080, 2023. 12

work page internal anchor Pith review Pith/arXiv arXiv 2023
[47]

Model-based imitation learning for urban driving.Advances in Neural Information Processing Systems, 35:20703–20716, 2022

Anthony Hu, Gianluca Corrado, Nicolas Griffiths, Zachary Murez, Corina Gurau, Hudson Yeo, Alex Kendall, Roberto Cipolla, and Jamie Shotton. Model-based imitation learning for urban driving.Advances in Neural Information Processing Systems, 35:20703–20716, 2022

2022
[48]

Genie: Generative interactive environments

Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. InForty-first International Conference on Machine Learning, 2024

2024
[49]

Mastering Diverse Domains through World Models

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models.arXiv preprint arXiv:2301.04104, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[50]

Taming transformers for high-resolution image synthesis

Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021

2021
[51]

Unitok: A unified tokenizer for visual generation and understanding.arXiv preprint arXiv:2502.20321, 2025

Chuofan Ma, Yi Jiang, Junfeng Wu, Jihan Yang, Xin Yu, Zehuan Yuan, Bingyue Peng, and Xiaojuan Qi. Unitok: A unified tokenizer for visual generation and understanding.arXiv preprint arXiv:2502.20321, 2025

work page arXiv 2025
[52]

Spatialvlm: Endowing vision-language models with spatial reasoning capabilities

Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14455–14465, 2024

2024
[53]

Unleashing Spatial Reasoning in Multimodal Large Language Models via Textual Representation Guided Reasoning

Jiacheng Hua, Yishu Yin, Yuhang Wu, Tai Wang, Yifei Huang, and Miao Liu. Unleashing spatial reasoning in multimodal large language models via textual representation guided reasoning. arXiv preprint arXiv:2603.23404, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[54]

Boosting MLLM Spatial Reasoning with Geometrically Referenced 3D Scene Representations

Jiangye Yuan, Gowri Kumar, and Baoyuan Wang. Boosting mllm spatial reasoning with geometrically referenced 3d scene representations.arXiv preprint arXiv:2603.08592, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[55]

Gaussianvlm: Scene-centric 3d vision-language models using language-aligned gaussian splats for embodied reasoning and beyond.IEEE Robotics and Automation Letters, 2025

Anna-Maria Halacheva, Jan-Nico Zaech, Xi Wang, Danda Pani Paudel, and Luc Van Gool. Gaussianvlm: Scene-centric 3d vision-language models using language-aligned gaussian splats for embodied reasoning and beyond.IEEE Robotics and Automation Letters, 2025

2025
[56]

Spatialrgpt: Grounded spatial reasoning in vision-language models

An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. Spatialrgpt: Grounded spatial reasoning in vision-language models. Advances in Neural Information Processing Systems, 37:135062–135093, 2024

2024
[57]

Neural discrete representation learning.Advances in neural information processing systems, 30, 2017

Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning.Advances in neural information processing systems, 30, 2017

2017
[58]

Occllama: An occupancy-language-action generative world model for au- tonomous driving.arXiv preprint arXiv:2409.03272, 2024

Julong Wei, Shanshuai Yuan, Pengfei Li, Qingda Hu, Zhongxue Gan, and Wenchao Ding. Occllama: An occupancy-language-action generative world model for autonomous driving. arXiv preprint arXiv:2409.03272, 2024

work page arXiv 2024
[59]

Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Qiao Yu, and Jifeng Dai. Bevformer: learning bird’s-eye-view representation from lidar-camera via spatiotemporal transformers.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(3):2020– 2036, 2024

2020
[60]

Center-based 3d object detection and track- ing

Tianwei Yin, Xingyi Zhou, and Philipp Krahenbuhl. Center-based 3d object detection and track- ing. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11784–11793, 2021

2021
[61]

V oxelnet: End-to-end learning for point cloud based 3d object detection

Yin Zhou and Oncel Tuzel. V oxelnet: End-to-end learning for point cloud based 3d object detection. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 4490–4499, 2018

2018
[62]

Qwen Team. Qwen3. 5-omni technical report.arXiv preprint arXiv:2604.15804, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[63]

Traffic scene perception via multimodal large language model with data augmentation and efficient training strategy.Applied Soft Computing, 177:113210, 2025

Shuo Liu, Lei Shi, Yucheng Shi, Yufei Gao, and Xiaole Sun. Traffic scene perception via multimodal large language model with data augmentation and efficient training strategy.Applied Soft Computing, 177:113210, 2025. 13

2025
[64]

Implementation of image generative models using imagegpt

Payel Dutta and Kaustuv Kunal. Implementation of image generative models using imagegpt. In 2023 International Conference on New Frontiers in Communication, Automation, Management and Security (ICCAMS), volume 1, pages 1–6. IEEE, 2023

2023
[65]

Adding conditional control to text-to-image diffusion models

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. InProceedings of the IEEE/CVF international conference on computer vision, pages 3836–3847, 2023

2023
[66]

Efficient diffusion policies for offline reinforcement learning.Advances in Neural Information Processing Systems, 36:67195–67212, 2023

Bingyi Kang, Xiao Ma, Chao Du, Tianyu Pang, and Shuicheng Yan. Efficient diffusion policies for offline reinforcement learning.Advances in Neural Information Processing Systems, 36:67195–67212, 2023

2023
[67]

Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

2020
[68]

St-p3: End-to-end vision-based autonomous driving via spatial-temporal feature learning

Shengchao Hu, Li Chen, Penghao Wu, Hongyang Li, Junchi Yan, and Dacheng Tao. St-p3: End-to-end vision-based autonomous driving via spatial-temporal feature learning. InEuropean Conference on Computer Vision, pages 533–549. Springer, 2022

2022
[69]

Planning-oriented autonomous driving

Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17853–17862, 2023

2023
[70]

Is ego status all you need for open-loop end-to-end autonomous driving? InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14864–14873, 2024

Zhiqi Li, Zhiding Yu, Shiyi Lan, Jiahan Li, Jan Kautz, Tong Lu, and Jose M Alvarez. Is ego status all you need for open-loop end-to-end autonomous driving? InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14864–14873, 2024

2024
[71]

Don’t shake the wheel: Momentum-aware planning in end-to-end autonomous driving

Ziying Song, Caiyan Jia, Lin Liu, Hongyu Pan, Yongchang Zhang, Junming Wang, Xingyu Zhang, Shaoqing Xu, Lei Yang, and Yadan Luo. Don’t shake the wheel: Momentum-aware planning in end-to-end autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22432–22441, 2025

2025
[72]

He-drive: Human-like end-to-end driving with vision language models

Junming Wang, Xingyu Zhang, Zebin Xing, Xiaoyang Guo, Yang Hu, Ziying Song, Qian Zhang, Xiaoxiao Long, Wei Yin, et al. He-drive: Human-like end-to-end driving with vision language models. 2024

2024
[73]

Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving

Bencheng Liao, Shaoyu Chen, Haoran Yin, Bo Jiang, Cheng Wang, Sixu Yan, Xinbang Zhang, Xiangyu Li, Ying Zhang, Qian Zhang, et al. Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 12037–12047, 2025

2025
[74]

Prix: learning to plan from raw pixels for end-to-end autonomous driving.IEEE Robotics and Automation Letters, 11(5):6400– 6407, 2026

Maciej Wozniak, Lianhang Liu, Yixi Cai, and Patric Jensfelt. Prix: learning to plan from raw pixels for end-to-end autonomous driving.IEEE Robotics and Automation Letters, 11(5):6400– 6407, 2026

2026
[75]

Making large language models better planners with reasoning-decision alignment

Zhijian Huang, Tao Tang, Shaoxiang Chen, Sihao Lin, Zequn Jie, Lin Ma, Guangrun Wang, and Xiaodan Liang. Making large language models better planners with reasoning-decision alignment. InEuropean Conference on Computer Vision, pages 73–90. Springer, 2024

2024
[76]

Doe-1: Closed-loop autonomous driving with large world model.arXiv preprint arXiv:2412.09627, 2024

Wenzhao Zheng, Zetian Xia, Yuanhui Huang, Sicheng Zuo, Jie Zhou, and Jiwen Lu. Doe-1: Closed-loop autonomous driving with large world model.arXiv preprint arXiv:2412.09627, 2024

work page arXiv 2024
[77]

EMMA: End-to-End Multimodal Model for Autonomous Driving

Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, et al. Emma: End-to-end multimodal model for autonomous driving.arXiv preprint arXiv:2410.23262, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[78]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024. 14

work page internal anchor Pith review Pith/arXiv arXiv 2024
[79]

Drivinggpt: Unifying driving world modeling and planning with multi-modal autoregressive transformers

Yuntao Chen, Yuqi Wang, and Zhaoxiang Zhang. Drivinggpt: Unifying driving world modeling and planning with multi-modal autoregressive transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 26890–26900, 2025

2025
[80]

End-to-end driving with online trajectory evaluation via bev world model

Yingyan Li, Yuqi Wang, Yang Liu, Jiawei He, Lue Fan, and Zhaoxiang Zhang. End-to-end driving with online trajectory evaluation via bev world model. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 27137–27146, 2025

2025

Showing first 80 references.

[1] [1]

A survey on vision-language-action models for autonomous driving

Sicong Jiang, Zilin Huang, Kangan Qian, Ziang Luo, Tianze Zhu, Yang Zhong, Yihong Tang, Menglin Kong, Yunlong Wang, Siwen Jiao, et al. A survey on vision-language-action models for autonomous driving. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4524–4536, 2025

2025

[2] [2]

Vision language models in autonomous driving: A survey and outlook.IEEE Transactions on Intelligent Vehicles, 2024

Xingcheng Zhou, Mingyu Liu, Ekim Yurtsever, Bare Luka Zagar, Walter Zimmer, Hu Cao, and Alois C Knoll. Vision language models in autonomous driving: A survey and outlook.IEEE Transactions on Intelligent Vehicles, 2024

2024

[3] [3]

Large (vision) language models for autonomous vehicles: Current trends and future directions.IEEE Transactions on Intelligent Transportation Systems, 27(1):187–210, 2025

Hanlin Tian, Kethan Reddy, Yuxiang Feng, Mohammed Quddus, Yiannis Demiris, and Pana- giotis Angeloudis. Large (vision) language models for autonomous vehicles: Current trends and future directions.IEEE Transactions on Intelligent Transportation Systems, 27(1):187–210, 2025

2025

[4] [4]

Vggdrive: Empowering vision-language models with cross-view geometric grounding for autonomous driving.arXiv preprint arXiv:2602.20794, 2026

Jie Wang, Guang Li, Zhijian Huang, Chenxu Dang, Hangjun Ye, Yahong Han, and Long Chen. Vggdrive: Empowering vision-language models with cross-view geometric grounding for autonomous driving.arXiv preprint arXiv:2602.20794, 2026

work page arXiv 2026

[5] [5]

Vlm4d: Towards spatiotem- poral awareness in vision language models

Shijie Zhou, Alexander Vilesov, Xuehai He, Ziyu Wan, Shuwang Zhang, Aditya Nagachandra, Di Chang, Dongdong Chen, Xin Eric Wang, and Achuta Kadambi. Vlm4d: Towards spatiotem- poral awareness in vision language models. InProceedings of the IEEE/CVF international conference on computer vision, pages 8600–8612, 2025

2025

[6] [6]

Surds: Benchmarking spatial understanding and reasoning in driving scenarios with vision language models.arXiv preprint arXiv:2411.13112, 2024

Xianda Guo, Ruijun Zhang, Yiqun Duan, Yuhang He, Dujun Nie, Wenke Huang, Chenming Zhang, Shuai Liu, Hao Zhao, and Long Chen. Surds: Benchmarking spatial understanding and reasoning in driving scenarios with vision language models.arXiv preprint arXiv:2411.13112, 2024

work page arXiv 2024

[7] [7]

Language prompt for autonomous driving

Dongming Wu, Wencheng Han, Yingfei Liu, Tiancai Wang, Cheng-zhong Xu, Xiangyu Zhang, and Jianbing Shen. Language prompt for autonomous driving. InProceedings of the AAAI conference on artificial intelligence, volume 39, pages 8359–8367, 2025

2025

[8] [8]

Integrating object detection modality into visual language model for enhanced autonomous driving agent.arXiv preprint arXiv:2411.05898, 2024

Linfeng He, Yiming Sun, Sihao Wu, Jiaxu Liu, and Xiaowei Huang. Integrating object detection modality into visual language model for enhanced autonomous driving agent.arXiv preprint arXiv:2411.05898, 2024

work page arXiv 2024

[9] [9]

SpaceDrive: Infusing Spatial Awareness into VLM-based Autonomous Driving

Peizheng Li, Zhenghao Zhang, David Holtz, Hang Yu, Yutong Yang, Yuzhi Lai, Rui Song, Andreas Geiger, and Andreas Zell. Spacedrive: Infusing spatial awareness into vlm-based autonomous driving.arXiv preprint arXiv:2512.10719, 2, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[10] [10]

Last-vla: Thinking in latent spatio- temporal space for vision-language-action in autonomous driving.arXiv preprint arXiv:2603.01928, 2026

Yuechen Luo, Fang Li, Shaoqing Xu, Yang Ji, Zehan Zhang, Bing Wang, Yuannan Shen, Jianwei Cui, Long Chen, Guang Chen, et al. Last-vla: Thinking in latent spatio-temporal space for vision-language-action in autonomous driving.arXiv preprint arXiv:2603.01928, 2026

work page arXiv 2026

[11] [11]

FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving

Shuang Zeng, Xinyuan Chang, Mengwei Xie, Xinran Liu, Yifan Bai, Zheng Pan, Mu Xu, Xing Wei, and Ning Guo. Futuresightdrive: Thinking visually with spatio-temporal cot for autonomous driving.arXiv preprint arXiv:2505.17685, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[12] [12]

Occworld: Learning a 3d occupancy world model for autonomous driving

Wenzhao Zheng, Weiliang Chen, Yuanhui Huang, Borui Zhang, Yueqi Duan, and Jiwen Lu. Occworld: Learning a 3d occupancy world model for autonomous driving. InEuropean conference on computer vision, pages 55–72. Springer, 2024

2024

[13] [13]

Fastdrivevla: Efficient end-to-end driving via plug-and-play reconstruction-based token pruning

Jiajun Cao, Qizhe Zhang, Peidong Jia, Xuhui Zhao, Bo Lan, Xiaoan Zhang, Xiaobao Wei, Sixiang Chen, Liyun Li, Xianming Liu, et al. Fastdrivevla: Efficient end-to-end driving via plug-and-play reconstruction-based token pruning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 2571–2579, 2026

2026

[14] [14]

Self-supervised multi- modal learning: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(7):5299–5318, 2024

Yongshuo Zong, Oisin Mac Aodha, and Timothy M Hospedales. Self-supervised multi- modal learning: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(7):5299–5318, 2024. 10

2024

[15] [15]

Scaling laws for native multimodal models

Mustafa Shukor, Enrico Fini, Victor Guilherme Turrisi da Costa, Matthieu Cord, Joshua Susskind, and Alaaeldin El-Nouby. Scaling laws for native multimodal models. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 12–23, 2025

2025

[16] [16]

Llm4drive: A survey of large language models for autonomous driving.arXiv preprint arXiv:2311.01043, 2023

Zhenjie Yang, Xiaosong Jia, Hongyang Li, and Junchi Yan. Llm4drive: A survey of large language models for autonomous driving.arXiv preprint arXiv:2311.01043, 2023

work page arXiv 2023

[17] [17]

Drivelm: Driving with graph visual question answering

Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. Drivelm: Driving with graph visual question answering. InEuropean conference on computer vision, pages 256–274. Springer, 2024

2024

[18] [18]

arXiv preprint arXiv:2506.06218 (2025) 3, 9

Christian Fruhwirth-Reisinger, Dušan Mali´c, Wei Lin, David Schinagl, Samuel Schulter, and Horst Possegger. Stsbench: A spatio-temporal scenario benchmark for multi-modal large language models in autonomous driving.arXiv preprint arXiv:2506.06218, 2025

work page arXiv 2025

[19] [19]

How Well Do Vision-Language Models Understand Sequential Driving Scenes? A Sensitivity Study

Roberto Brusnicki, Mattia Piccinini, and Johannes Betz. How well do vision-language models understand sequential driving scenes? a sensitivity study.arXiv preprint arXiv:2604.06750, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[20] [20]

Are vision llms road-ready? a comprehensive benchmark for safety-critical driving video understanding

Tong Zeng, Longfeng Wu, Liang Shi, Dawei Zhou, and Feng Guo. Are vision llms road-ready? a comprehensive benchmark for safety-critical driving video understanding. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 2, pages 5972–5983, 2025

2025

[21] [21]

Drivegpt4: Interpretable end-to-end autonomous driving via large language model.IEEE Robotics and Automation Letters, 9(10):8186–8193, 2024

Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kwan-Yee K Wong, Zhenguo Li, and Hengshuang Zhao. Drivegpt4: Interpretable end-to-end autonomous driving via large language model.IEEE Robotics and Automation Letters, 9(10):8186–8193, 2024

2024

[22] [22]

Lmdrive: Closed-loop end-to-end driving with large language models

Hao Shao, Yuxuan Hu, Letian Wang, Guanglu Song, Steven L Waslander, Yu Liu, and Hong- sheng Li. Lmdrive: Closed-loop end-to-end driving with large language models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15120–15130, 2024

2024

[23] [23]

Bench2drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving.Advances in Neural Information Processing Systems, 37:819–844, 2024

Xiaosong Jia, Zhenjie Yang, Qifeng Li, Zhiyuan Zhang, and Junchi Yan. Bench2drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving.Advances in Neural Information Processing Systems, 37:819–844, 2024

2024

[24] [24]

Bench2drive-vl: Benchmarks for closed-loop autonomous driving with vision-language models

Xiaosong Jia, Yuqian Shao, Zhenjie Yang, Qifeng Li, Zhiyuan Zhang, and Junchi Yan. Bench2drive-vl: Benchmarks for closed-loop autonomous driving with vision-language models. arXiv preprint arXiv:2604.01259, 2026

work page arXiv 2026

[25] [25]

Carla: An open urban driving simulator

Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. Carla: An open urban driving simulator. InConference on robot learning, pages 1–16. PMLR, 2017

2017

[26] [26]

Drivemlm: aligning multi-modal large language models with behavioral planning states for autonomous driving.Visual Intelligence, 3(1):22, 2025

Erfei Cui, Wenhai Wang, Zhiqi Li, Jiangwei Xie, Haoming Zou, Hanming Deng, Gen Luo, Lewei Lu, Xizhou Zhu, and Jifeng Dai. Drivemlm: aligning multi-modal large language models with behavioral planning states for autonomous driving.Visual Intelligence, 3(1):22, 2025

2025

[27] [27]

Omnidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning

Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, and Jose M Alvarez. Omnidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning. InProceedings of the computer vision and pattern recognition conference, pages 22442–22452, 2025

2025

[28] [28]

Bevdriver: Leveraging bev maps in llms for robust closed-loop driving

Katharina Winter, Mark Azer, and Fabian B Flohr. Bevdriver: Leveraging bev maps in llms for robust closed-loop driving. In2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 20379–20385. IEEE, 2025

2025

[29] [29]

Occvla: Vision-language-action model with implicit 3d occupancy supervision.arXiv preprint arXiv:2509.05578, 2025

Ruixun Liu, Lingyu Kong, Derun Li, and Hang Zhao. Occvla: Vision-language-action model with implicit 3d occupancy supervision.arXiv preprint arXiv:2509.05578, 2025

work page arXiv 2025

[30] [30]

Perceptio: Perception enhanced vision language models via spatial token generation.arXiv preprint arXiv:2603.18795, 2026

Yuchen Li, Amanmeet Garg, Shalini Chaudhuri, Rui Zhao, and Garin Kessler. Perceptio: Perception enhanced vision language models via spatial token generation.arXiv preprint arXiv:2603.18795, 2026. 11

work page arXiv 2026

[31] [31]

3D-VCD: Hallucination Mitigation in 3D-LLM Embodied Agents through Visual Contrastive Decoding

Makanjuola Ogunleye, Eman Abdelrahman, and Ismini Lourentzou. 3d-vcd: Hallucination mitigation in 3d-llm embodied agents through visual contrastive decoding.arXiv preprint arXiv:2604.08645, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[32] [32]

Hallucination elimi- nation and semantic enhancement framework for vision-language models in traffic scenarios

Jiaqi Fan, Jianhua Wu, Hongqing Chu, Quanbo Ge, and Bingzhao Gao. Hallucination elimi- nation and semantic enhancement framework for vision-language models in traffic scenarios. arXiv preprint arXiv:2412.07518, 2024

work page arXiv 2024

[33] [33]

Task interference in vlms for autonomous driving: When better perception hurts planning

Sai Bhargav Rongali, Aadya Pipersenia, and Kenji Okuma. Task interference in vlms for autonomous driving: When better perception hurts planning. InAAAI 2026 Workshop on Assessing and Improving Reliability of Foundation Models in the Real World

2026

[34] [34]

DynFlowDrive: Flow-Based Dynamic World Modeling for Autonomous Driving

Xiaolu Liu, Yicong Li, Song Wang, Junbo Chen, Angela Yao, and Jianke Zhu. Dynflowdrive: Flow-based dynamic world modeling for autonomous driving.arXiv preprint arXiv:2603.19675, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[35] [35]

The role of world models in shaping autonomous driving: A comprehensive survey.arXiv preprint arXiv:2502.10498, 2025

Sifan Tu, Xin Zhou, Dingkang Liang, Xingyu Jiang, Yumeng Zhang, Xiaofan Li, and Xiang Bai. The role of world models in shaping autonomous driving: A comprehensive survey.arXiv preprint arXiv:2502.10498, 2025

work page arXiv 2025

[36] [36]

World4drive: End-to-end autonomous driving via intention-aware physical latent world model

Yupeng Zheng, Pengxuan Yang, Zebin Xing, Qichao Zhang, Yuhang Zheng, Yinfeng Gao, Pengfei Li, Teng Zhang, Zhongpu Xia, Peng Jia, et al. World4drive: End-to-end autonomous driving via intention-aware physical latent world model. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 28632–28642, 2025

2025

[37] [37]

A survey of world models for autonomous driving

Tuo Feng, Wenguan Wang, and Yi Yang. A survey of world models for autonomous driving. arXiv preprint arXiv:2501.11260, 2025

work page arXiv 2025

[38] [38]

Dio: Decomposable implicit 4d occupancy-flow world model

Christopher Diehl, Quinlan Sykora, Ben Agro, Thomas Gilles, Sergio Casas, and Raquel Urtasun. Dio: Decomposable implicit 4d occupancy-flow world model. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 27456–27466, 2025

2025

[39] [39]

ProDrive: Proactive Planning for Autonomous Driving via Ego-Environment Co-Evolution

Chuyao Fu, Shengzhe Gan, Zhuoli Ouyang, Yuhan Rui, Xiaowei Chi, Sirui Han, Jiankun Wang, and Hong Zhang. Prodrive: Proactive planning for autonomous driving via ego-environment co-evolution.arXiv preprint arXiv:2604.25329, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[40] [40]

Visual point cloud forecasting enables scalable autonomous driving

Zetong Yang, Li Chen, Yanan Sun, and Hongyang Li. Visual point cloud forecasting enables scalable autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14673–14684, 2024

2024

[41] [41]

Uno: Unsu- pervised occupancy fields for perception and forecasting

Ben Agro, Quinlan Sykora, Sergio Casas, Thomas Gilles, and Raquel Urtasun. Uno: Unsu- pervised occupancy fields for perception and forecasting. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14487–14496, 2024

2024

[42] [42]

Dome: Tam- ing diffusion model into high-fidelity controllable occupancy world model.arXiv preprint arXiv:2410.10429, 2024

Songen Gu, Wei Yin, Bu Jin, Xiaoyang Guo, Junming Wang, Haodong Li, Qian Zhang, and Xiaoxiao Long. Dome: Taming diffusion model into high-fidelity controllable occupancy world model.arXiv preprint arXiv:2410.10429, 2024

work page arXiv 2024

[43] [43]

Driveworld: 4d pre-trained scene understanding via world models for autonomous driving

Chen Min, Dawei Zhao, Liang Xiao, Jian Zhao, Xinli Xu, Zheng Zhu, Lei Jin, Jianshu Li, Yulan Guo, Junliang Xing, et al. Driveworld: 4d pre-trained scene understanding via world models for autonomous driving. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15522–15533, 2024

2024

[44] [44]

Smart: Scalable multi-agent real-time motion generation via next-token prediction.Advances in Neural Information Processing Systems, 37:114048–114071, 2024

Wei Wu, Xiaoxin Feng, Ziyan Gao, and Yuheng Kan. Smart: Scalable multi-agent real-time motion generation via next-token prediction.Advances in Neural Information Processing Systems, 37:114048–114071, 2024

2024

[45] [45]

Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

1901

[46] [46]

GAIA-1: A Generative World Model for Autonomous Driving

Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. Gaia-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080, 2023. 12

work page internal anchor Pith review Pith/arXiv arXiv 2023

[47] [47]

Model-based imitation learning for urban driving.Advances in Neural Information Processing Systems, 35:20703–20716, 2022

Anthony Hu, Gianluca Corrado, Nicolas Griffiths, Zachary Murez, Corina Gurau, Hudson Yeo, Alex Kendall, Roberto Cipolla, and Jamie Shotton. Model-based imitation learning for urban driving.Advances in Neural Information Processing Systems, 35:20703–20716, 2022

2022

[48] [48]

Genie: Generative interactive environments

Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. InForty-first International Conference on Machine Learning, 2024

2024

[49] [49]

Mastering Diverse Domains through World Models

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models.arXiv preprint arXiv:2301.04104, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[50] [50]

Taming transformers for high-resolution image synthesis

Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021

2021

[51] [51]

Unitok: A unified tokenizer for visual generation and understanding.arXiv preprint arXiv:2502.20321, 2025

Chuofan Ma, Yi Jiang, Junfeng Wu, Jihan Yang, Xin Yu, Zehuan Yuan, Bingyue Peng, and Xiaojuan Qi. Unitok: A unified tokenizer for visual generation and understanding.arXiv preprint arXiv:2502.20321, 2025

work page arXiv 2025

[52] [52]

Spatialvlm: Endowing vision-language models with spatial reasoning capabilities

Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14455–14465, 2024

2024

[53] [53]

Unleashing Spatial Reasoning in Multimodal Large Language Models via Textual Representation Guided Reasoning

Jiacheng Hua, Yishu Yin, Yuhang Wu, Tai Wang, Yifei Huang, and Miao Liu. Unleashing spatial reasoning in multimodal large language models via textual representation guided reasoning. arXiv preprint arXiv:2603.23404, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[54] [54]

Boosting MLLM Spatial Reasoning with Geometrically Referenced 3D Scene Representations

Jiangye Yuan, Gowri Kumar, and Baoyuan Wang. Boosting mllm spatial reasoning with geometrically referenced 3d scene representations.arXiv preprint arXiv:2603.08592, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[55] [55]

Gaussianvlm: Scene-centric 3d vision-language models using language-aligned gaussian splats for embodied reasoning and beyond.IEEE Robotics and Automation Letters, 2025

Anna-Maria Halacheva, Jan-Nico Zaech, Xi Wang, Danda Pani Paudel, and Luc Van Gool. Gaussianvlm: Scene-centric 3d vision-language models using language-aligned gaussian splats for embodied reasoning and beyond.IEEE Robotics and Automation Letters, 2025

2025

[56] [56]

Spatialrgpt: Grounded spatial reasoning in vision-language models

An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. Spatialrgpt: Grounded spatial reasoning in vision-language models. Advances in Neural Information Processing Systems, 37:135062–135093, 2024

2024

[57] [57]

Neural discrete representation learning.Advances in neural information processing systems, 30, 2017

Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning.Advances in neural information processing systems, 30, 2017

2017

[58] [58]

Occllama: An occupancy-language-action generative world model for au- tonomous driving.arXiv preprint arXiv:2409.03272, 2024

Julong Wei, Shanshuai Yuan, Pengfei Li, Qingda Hu, Zhongxue Gan, and Wenchao Ding. Occllama: An occupancy-language-action generative world model for autonomous driving. arXiv preprint arXiv:2409.03272, 2024

work page arXiv 2024

[59] [59]

Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Qiao Yu, and Jifeng Dai. Bevformer: learning bird’s-eye-view representation from lidar-camera via spatiotemporal transformers.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(3):2020– 2036, 2024

2020

[60] [60]

Center-based 3d object detection and track- ing

Tianwei Yin, Xingyi Zhou, and Philipp Krahenbuhl. Center-based 3d object detection and track- ing. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11784–11793, 2021

2021

[61] [61]

V oxelnet: End-to-end learning for point cloud based 3d object detection

Yin Zhou and Oncel Tuzel. V oxelnet: End-to-end learning for point cloud based 3d object detection. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 4490–4499, 2018

2018

[62] [62]

Qwen Team. Qwen3. 5-omni technical report.arXiv preprint arXiv:2604.15804, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[63] [63]

Traffic scene perception via multimodal large language model with data augmentation and efficient training strategy.Applied Soft Computing, 177:113210, 2025

Shuo Liu, Lei Shi, Yucheng Shi, Yufei Gao, and Xiaole Sun. Traffic scene perception via multimodal large language model with data augmentation and efficient training strategy.Applied Soft Computing, 177:113210, 2025. 13

2025

[64] [64]

Implementation of image generative models using imagegpt

Payel Dutta and Kaustuv Kunal. Implementation of image generative models using imagegpt. In 2023 International Conference on New Frontiers in Communication, Automation, Management and Security (ICCAMS), volume 1, pages 1–6. IEEE, 2023

2023

[65] [65]

Adding conditional control to text-to-image diffusion models

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. InProceedings of the IEEE/CVF international conference on computer vision, pages 3836–3847, 2023

2023

[66] [66]

Efficient diffusion policies for offline reinforcement learning.Advances in Neural Information Processing Systems, 36:67195–67212, 2023

Bingyi Kang, Xiao Ma, Chao Du, Tianyu Pang, and Shuicheng Yan. Efficient diffusion policies for offline reinforcement learning.Advances in Neural Information Processing Systems, 36:67195–67212, 2023

2023

[67] [67]

Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

2020

[68] [68]

St-p3: End-to-end vision-based autonomous driving via spatial-temporal feature learning

Shengchao Hu, Li Chen, Penghao Wu, Hongyang Li, Junchi Yan, and Dacheng Tao. St-p3: End-to-end vision-based autonomous driving via spatial-temporal feature learning. InEuropean Conference on Computer Vision, pages 533–549. Springer, 2022

2022

[69] [69]

Planning-oriented autonomous driving

Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17853–17862, 2023

2023

[70] [70]

Is ego status all you need for open-loop end-to-end autonomous driving? InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14864–14873, 2024

Zhiqi Li, Zhiding Yu, Shiyi Lan, Jiahan Li, Jan Kautz, Tong Lu, and Jose M Alvarez. Is ego status all you need for open-loop end-to-end autonomous driving? InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14864–14873, 2024

2024

[71] [71]

Don’t shake the wheel: Momentum-aware planning in end-to-end autonomous driving

Ziying Song, Caiyan Jia, Lin Liu, Hongyu Pan, Yongchang Zhang, Junming Wang, Xingyu Zhang, Shaoqing Xu, Lei Yang, and Yadan Luo. Don’t shake the wheel: Momentum-aware planning in end-to-end autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22432–22441, 2025

2025

[72] [72]

He-drive: Human-like end-to-end driving with vision language models

Junming Wang, Xingyu Zhang, Zebin Xing, Xiaoyang Guo, Yang Hu, Ziying Song, Qian Zhang, Xiaoxiao Long, Wei Yin, et al. He-drive: Human-like end-to-end driving with vision language models. 2024

2024

[73] [73]

Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving

Bencheng Liao, Shaoyu Chen, Haoran Yin, Bo Jiang, Cheng Wang, Sixu Yan, Xinbang Zhang, Xiangyu Li, Ying Zhang, Qian Zhang, et al. Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 12037–12047, 2025

2025

[74] [74]

Prix: learning to plan from raw pixels for end-to-end autonomous driving.IEEE Robotics and Automation Letters, 11(5):6400– 6407, 2026

Maciej Wozniak, Lianhang Liu, Yixi Cai, and Patric Jensfelt. Prix: learning to plan from raw pixels for end-to-end autonomous driving.IEEE Robotics and Automation Letters, 11(5):6400– 6407, 2026

2026

[75] [75]

Making large language models better planners with reasoning-decision alignment

Zhijian Huang, Tao Tang, Shaoxiang Chen, Sihao Lin, Zequn Jie, Lin Ma, Guangrun Wang, and Xiaodan Liang. Making large language models better planners with reasoning-decision alignment. InEuropean Conference on Computer Vision, pages 73–90. Springer, 2024

2024

[76] [76]

Doe-1: Closed-loop autonomous driving with large world model.arXiv preprint arXiv:2412.09627, 2024

Wenzhao Zheng, Zetian Xia, Yuanhui Huang, Sicheng Zuo, Jie Zhou, and Jiwen Lu. Doe-1: Closed-loop autonomous driving with large world model.arXiv preprint arXiv:2412.09627, 2024

work page arXiv 2024

[77] [77]

EMMA: End-to-End Multimodal Model for Autonomous Driving

Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, et al. Emma: End-to-end multimodal model for autonomous driving.arXiv preprint arXiv:2410.23262, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[78] [78]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024. 14

work page internal anchor Pith review Pith/arXiv arXiv 2024

[79] [79]

Drivinggpt: Unifying driving world modeling and planning with multi-modal autoregressive transformers

Yuntao Chen, Yuqi Wang, and Zhaoxiang Zhang. Drivinggpt: Unifying driving world modeling and planning with multi-modal autoregressive transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 26890–26900, 2025

2025

[80] [80]

End-to-end driving with online trajectory evaluation via bev world model

Yingyan Li, Yuqi Wang, Yang Liu, Jiawei He, Lue Fan, and Zhaoxiang Zhang. End-to-end driving with online trajectory evaluation via bev world model. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 27137–27146, 2025

2025