pith. machine review for the scientific record.

arxiv: 2604.19145 · v1 · submitted 2026-04-21 · 💻 cs.CV · cs.AI

Recognition: unknown

ST-Prune: Training-Free Spatio-Temporal Token Pruning for Vision-Language Models in Autonomous Driving

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 02:56 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords token pruning · vision-language models · autonomous driving · spatio-temporal pruning · multi-view video · training-free methods · motion-aware selection

The pith

ST-Prune removes 90 percent of visual tokens in driving vision-language models while maintaining near-full performance, using motion-aware temporal and ring-view spatial pruning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models for autonomous driving face severe slowdowns from processing multiple camera views across video frames. The paper establishes that a training-free method can safely drop most tokens by first favoring those that show high motion or come from recent frames, then removing duplicates that appear across the circular camera layout. If this holds, the models could run efficiently on vehicle hardware while retaining the accuracy needed for perception, prediction, and planning. The approach works by treating the full spatio-temporal input together rather than pruning views or frames in isolation.
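
To see why per-frame pruning alone doesn't scale, a back-of-the-envelope count helps; the rig and patch-grid numbers below are illustrative assumptions, not figures from the paper.

```python
# Hypothetical token budget for a ring-camera rig (illustrative numbers):
# 6 surround cameras, 8 frames of history, 576 visual tokens per image
# (a common ViT patch grid: 336x336 input with 14x14 patches).
views, frames, tokens_per_image = 6, 8, 576
total = views * frames * tokens_per_image      # 27,648 visual tokens
kept = int(total * (1 - 0.90))                 # 90% reduction leaves 2,764
print(f"{total} tokens in, {kept} kept after 90% pruning")
```

At that scale the language model's attention cost, quadratic in sequence length, is dominated by visual tokens, which is exactly the bottleneck the abstract describes.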

Core claim

The paper claims that Motion-aware Temporal Pruning encodes motion volatility and temporal recency as soft constraints within a diversity selection objective to prioritize dynamic trajectories and current frames, while Ring-view Spatial Pruning uses camera geometry to penalize bilateral cross-view similarities and suppress duplicate projections. Together these modules form a complete pruning process that preserves key scene information under high compression, achieving new state-of-the-art results among training-free methods and near-lossless performance even at 90 percent token reduction across perception, prediction, and planning benchmarks.
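
The pruning objective itself is not reproduced in this summary, so the following NumPy sketch shows one way the two modules could plug into a greedy diversity selection; the function name, the scoring form, and the weights `alpha`, `beta`, `gamma` are assumptions for illustration, not the paper's definitions.

```python
import numpy as np

def st_prune_sketch(tokens, motion, recency, view_id, keep,
                    alpha=0.5, beta=0.5, gamma=0.5):
    """Hedged sketch of MTP/RSP-style selection, not the authors' code.

    tokens  : (N, D) visual token embeddings across all views and frames
    motion  : (N,)   per-token motion-volatility score in [0, 1] (assumed given)
    recency : (N,)   per-token recency score in [0, 1], 1 = current frame
    view_id : (N,)   index of the ring camera each token came from
    keep    : number of tokens to retain
    """
    x = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    sim = x @ x.T                                  # (N, N) cosine similarity
    # Seed with the most volatile / most recent token.
    selected = [int(np.argmax(alpha * motion + beta * recency))]
    for _ in range(keep - 1):
        sel = np.array(selected)
        # Diversity term: distance to the nearest already-selected token.
        div = 1.0 - sim[:, sel].max(axis=1)
        # RSP-like penalty: highest similarity to a selected token that
        # came from a *different* camera on the ring (duplicate projection).
        other_view = view_id[:, None] != view_id[sel][None, :]
        cross = np.where(other_view, sim[:, sel], 0.0).max(axis=1)
        # MTP-like soft constraints enter additively, not as hard filters.
        score = div + alpha * motion + beta * recency - gamma * cross
        score[sel] = -np.inf                       # never re-pick a token
        selected.append(int(np.argmax(score)))
    return np.array(selected)

# Toy usage with random stand-ins for real features:
rng = np.random.default_rng(0)
n = 2_000
idx = st_prune_sketch(
    rng.normal(size=(n, 64)), motion=rng.random(n), recency=rng.random(n),
    view_id=rng.integers(0, 6, size=n), keep=n // 10,   # 90% reduction
)
```

The additive form matters: treating motion and recency as soft bonuses inside the diversity objective, rather than as hard thresholds, is what would let a static but distinctive token survive when nothing already selected resembles it.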

What carries the argument

Motion-aware Temporal Pruning combined with Ring-view Spatial Pruning, which together apply motion and geometry penalties to token diversity selection to remove spatio-temporal redundancies.

If this is right

  • Vision-language models can process longer sequences or additional views at the same computational budget.
  • Real-time operation on vehicle hardware becomes feasible without task-specific retraining.
  • Certain accuracy metrics can exceed the unpruned baseline by focusing attention on relevant dynamic content.
  • Inference speeds remain comparable to prior pruning techniques while retaining higher task performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The geometry-based pruning step could adapt to other multi-camera video settings such as robotic navigation with known sensor layouts.
  • Pairing this pruning with quantization or distillation might compound efficiency gains for onboard deployment.
  • Scenes with subtle or uniform motion patterns could expose cases where the volatility metric removes necessary context.
  • Dynamic adjustment of pruning ratios based on detected scene complexity could further balance speed and safety.

Load-bearing premise

Motion volatility, temporal recency, and ring-view cross-similarity penalties can reliably identify redundant tokens without discarding critical scene information needed for perception, prediction, and planning.

What would settle it

A test sequence containing a critical low-motion object in a region the method would prune due to low volatility or high cross-view similarity, resulting in measurable drops in planning or detection accuracy.

Figures

Figures reproduced from arXiv: 2604.19145 by Cong Zhang, Haiyun Guo, Jinqiao Wang, Lin Sha, Min Huang, Qinghai Miao, Tao Wang.

Figure 1: Illustration of the failure modes of traditional pruning methods on spatio-temporal driving inputs … view at source ↗
Figure 2: The ST-Prune pipeline for spatio-temporal token reduction. The framework processes multi-view, multi-frame … view at source ↗
Figure 3: Sensitivity analysis of the balancing coefficients … view at source ↗
Figure 4: Qualitative visualization of ST-Prune on a challenging grounding scenario from NuInstruct. view at source ↗
Original abstract

Vision-Language Models (VLMs) have become central to autonomous driving systems, yet their deployment is severely bottlenecked by the massive computational overhead of multi-view camera and multi-frame video input. Existing token pruning methods, primarily designed for single-image inputs, treat each frame or view in isolation and thus fail to exploit the inherent spatio-temporal redundancies in driving scenarios. To bridge this gap, we propose ST-Prune, a training-free, plug-and-play framework comprising two complementary modules: Motion-aware Temporal Pruning (MTP) and Ring-view Spatial Pruning (RSP). MTP addresses temporal redundancy by encoding motion volatility and temporal recency as soft constraints within the diversity selection objective, prioritizing dynamic trajectories and current-frame content over static historical background. RSP further resolves spatial redundancy by exploiting the ring-view camera geometry to penalize bilateral cross-view similarity, eliminating duplicate projections and residual background that temporal pruning alone cannot suppress. These two modules together constitute a complete spatio-temporal pruning process, preserving key scene information under strict compression. Validated across four benchmarks spanning perception, prediction, and planning, ST-Prune establishes new state-of-the-art for training-free token pruning. Notably, even at 90% token reduction, ST-Prune achieves near-lossless performance with certain metrics surpassing the full-model baseline, while maintaining inference speeds comparable to existing pruning approaches.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, simulated authors' rebuttal, circularity audit, and axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces ST-Prune, a training-free, plug-and-play framework for spatio-temporal token pruning in vision-language models for autonomous driving. It comprises Motion-aware Temporal Pruning (MTP), which encodes motion volatility and temporal recency as soft constraints in a diversity selection objective, and Ring-view Spatial Pruning (RSP), which penalizes bilateral cross-view similarity using ring-view camera geometry. The method is validated on four benchmarks spanning perception, prediction, and planning, claiming new state-of-the-art results for training-free pruning with near-lossless performance at 90% token reduction and some metrics surpassing the full-model baseline while preserving inference speed.

Significance. If the central claims hold under rigorous controls, the work offers a practical advance for deploying VLMs in autonomous driving by directly exploiting spatio-temporal redundancies in multi-view, multi-frame inputs rather than relying on task-specific fine-tuning. The training-free design and explicit use of motion/geometry signals are clear strengths that could generalize beyond the evaluated benchmarks.

major comments (2)
  1. [Abstract] Abstract: the claim of 'certain metrics surpassing the full-model baseline' at 90% pruning is load-bearing for the near-lossless assertion but lacks visible verification of experimental controls, variance, or whether the improvement is statistically significant versus the unpruned model.
  2. [Method] MTP/RSP modules: the soft constraints on motion volatility, temporal recency, and ring-view cross-similarity rest on the assumption that these heuristics reliably discard only redundant tokens; this is undermined in low-motion (e.g., distant slow vehicles) or high-density scenes where volatility scores may under-weight static but safety-critical elements, and the paper provides no targeted ablation or failure-case analysis on such regimes.
minor comments (2)
  1. [Abstract] Abstract: specify the exact four benchmarks, the precise metrics that exceed the full baseline, and the pruning ratios at which this occurs.
  2. [Method] Notation: define 'motion volatility' and 'ring-view cross-similarity' more formally with equations before describing their use in the diversity objective.
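
To illustrate what the requested formalization might look like, here is one plausible reading consistent with the abstract; these equations are an editorial reconstruction, not the paper's.

```latex
% Illustrative reconstruction, not the paper's equations.
\begin{align*}
  v_i &= \frac{1}{T-1} \sum_{t=2}^{T} \bigl\| x_i^{(t)} - x_i^{(t-1)} \bigr\|_2
      && \text{motion volatility of position } i \text{ over } T \text{ frames} \\
  r_i &= \exp\bigl(-(T - t_i)/\tau\bigr)
      && \text{temporal recency; } t_i \text{ is the source frame} \\
  c_i &= \max_{j \,:\, \mathrm{view}(j) \in \mathcal{N}(\mathrm{view}(i))}
         \cos(x_i, x_j)
      && \text{ring-view cross-similarity to neighbouring cameras} \\
  s_i(S) &= \min_{k \in S} \bigl(1 - \cos(x_i, x_k)\bigr)
          + \alpha\, v_i + \beta\, r_i - \gamma\, c_i
      && \text{soft-constrained score inside the diversity objective}
\end{align*}
```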

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review of our manuscript. We address each major comment below and outline the revisions we will make to strengthen the presentation and analysis.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim of 'certain metrics surpassing the full-model baseline' at 90% pruning is load-bearing for the near-lossless assertion but lacks visible verification of experimental controls, variance, or whether the improvement is statistically significant versus the unpruned model.

    Authors: We acknowledge that the claim requires stronger supporting evidence for rigor. While our experiments demonstrated consistent outperformance in select metrics at 90% pruning across the evaluated benchmarks, we agree that variance, controls, and significance testing should be explicitly reported. In the revised manuscript, we will add standard deviations from multiple random seeds, clarify the experimental setup and controls in the results section, and include statistical significance tests (e.g., paired t-tests; see the sketch after these responses) against the full-model baseline. The abstract claim will be qualified or supported accordingly based on these results. revision: yes

  2. Referee: [Method] MTP/RSP modules: the soft constraints on motion volatility, temporal recency, and ring-view cross-similarity rest on the assumption that these heuristics reliably discard only redundant tokens; this is undermined in low-motion (e.g., distant slow vehicles) or high-density scenes where volatility scores may under-weight static but safety-critical elements, and the paper provides no targeted ablation or failure-case analysis on such regimes.

    Authors: We agree that the effectiveness of the soft constraints in MTP and RSP should be validated in challenging regimes such as low-motion and high-density scenes, where static safety-critical elements could be at risk. Our current benchmarks span diverse driving conditions, but we did not include dedicated targeted ablations or failure-case studies for these specific scenarios. In the revision, we will add a new subsection with quantitative ablations on low-motion (e.g., distant slow vehicles) and high-density urban scenes, along with qualitative analysis of failure cases and discussion of the heuristics' limitations. This will be accompanied by an expanded limitations section. revision: yes
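
On the paired t-tests proposed in response 1, here is a minimal sketch of the kind of test meant there, assuming per-seed metric arrays; the numbers are placeholders, not results from the paper.

```python
# Paired significance test of pruned vs. full model across random seeds.
# The metric values below are placeholders, NOT results from the paper.
from scipy.stats import ttest_rel

full_model  = [71.2, 70.8, 71.5, 70.9, 71.1]   # one metric, 5 seeds
st_prune_90 = [71.6, 71.1, 71.4, 71.3, 71.0]   # same metric, 90% pruning

t_stat, p_value = ttest_rel(st_prune_90, full_model)
print(f"paired t = {t_stat:.3f}, p = {p_value:.3f}")
```

A paired test is the right shape here because each seed yields one matched (full, pruned) pair; an unpaired test would discard that pairing and lose power.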

Circularity Check

0 steps flagged

No circularity: pruning heuristics rely on explicit external signals

full rationale

The paper's core derivation consists of two heuristic modules (MTP and RSP) that encode motion volatility, temporal recency, and ring-view geometry as soft constraints inside a diversity selection objective. These are defined directly from camera geometry and frame differences rather than from the downstream task loss or from any fitted parameters that are later renamed as predictions. No equations reduce by construction to their own inputs, no self-citation chain is invoked to justify uniqueness or an ansatz, and the method is explicitly training-free. Validation on four benchmarks therefore constitutes an independent test rather than a tautology. The derivation chain is grounded in external signals rather than in its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based on the abstract only; no explicit free parameters or invented entities are detailed. The approach rests on a single domain assumption: that driving data contains inherent, exploitable spatio-temporal redundancies.

axioms (1)
  • domain assumption: Driving scenarios contain exploitable spatio-temporal redundancies that can be pruned without loss of key information.
    Invoked to justify both MTP and RSP modules.

pith-pipeline@v0.9.0 · 5559 in / 1210 out tokens · 45953 ms · 2026-05-10T02:56:35.659677+00:00 · methodology

