pith. machine review for the scientific record.

arxiv: 2604.19145 · v1 · submitted 2026-04-21 · 💻 cs.CV · cs.AI

Recognition: unknown

ST-Prune: Training-Free Spatio-Temporal Token Pruning for Vision-Language Models in Autonomous Driving

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 02:56 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords token pruning · vision-language models · autonomous driving · spatio-temporal pruning · multi-view video · training-free methods · motion-aware selection

The pith

ST-Prune removes 90 percent of visual tokens in driving vision-language models while maintaining near-full performance, using motion-aware temporal and ring-view spatial pruning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models for autonomous driving face severe slowdowns from processing multiple camera views across video frames. The paper establishes that a training-free method can safely drop most tokens by first favoring those that show high motion or come from recent frames, then removing duplicates that appear across the circular camera layout. If this holds, the models could run efficiently on vehicle hardware while retaining the accuracy needed for perception, prediction, and planning. The approach works by treating the full spatio-temporal input together rather than pruning views or frames in isolation.
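
To see why per-frame pruning alone doesn't scale, a back-of-the-envelope count helps; the rig and patch-grid numbers below are illustrative assumptions, not figures from the paper.

```python
# Hypothetical token budget for a ring-camera rig (illustrative numbers):
# 6 surround cameras, 8 frames of history, 576 visual tokens per image
# (a common ViT patch grid: 336x336 input with 14x14 patches).
views, frames, tokens_per_image = 6, 8, 576
total = views * frames * tokens_per_image      # 27,648 visual tokens
kept = int(total * (1 - 0.90))                 # 90% reduction leaves 2,764
print(f"{total} tokens in, {kept} kept after 90% pruning")
```

At that scale the language model's attention cost, quadratic in sequence length, is dominated by visual tokens, which is exactly the bottleneck the abstract describes.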

Core claim

The paper claims that Motion-aware Temporal Pruning encodes motion volatility and temporal recency as soft constraints within a diversity selection objective to prioritize dynamic trajectories and current frames, while Ring-view Spatial Pruning uses camera geometry to penalize bilateral cross-view similarities and suppress duplicate projections. Together these modules form a complete pruning process that preserves key scene information under high compression, achieving new state-of-the-art results among training-free methods and near-lossless performance even at 90 percent token reduction across perception, prediction, and planning benchmarks.
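
The pruning objective itself is not reproduced in this summary, so the following NumPy sketch shows one way the two modules could plug into a greedy diversity selection; the function name, the scoring form, and the weights `alpha`, `beta`, `gamma` are assumptions for illustration, not the paper's definitions.

```python
import numpy as np

def st_prune_sketch(tokens, motion, recency, view_id, keep,
                    alpha=0.5, beta=0.5, gamma=0.5):
    """Hedged sketch of MTP/RSP-style selection, not the authors' code.

    tokens  : (N, D) visual token embeddings across all views and frames
    motion  : (N,)   per-token motion-volatility score in [0, 1] (assumed given)
    recency : (N,)   per-token recency score in [0, 1], 1 = current frame
    view_id : (N,)   index of the ring camera each token came from
    keep    : number of tokens to retain
    """
    x = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    sim = x @ x.T                                  # (N, N) cosine similarity
    # Seed with the most volatile / most recent token.
    selected = [int(np.argmax(alpha * motion + beta * recency))]
    for _ in range(keep - 1):
        sel = np.array(selected)
        # Diversity term: distance to the nearest already-selected token.
        div = 1.0 - sim[:, sel].max(axis=1)
        # RSP-like penalty: highest similarity to a selected token that
        # came from a *different* camera on the ring (duplicate projection).
        other_view = view_id[:, None] != view_id[sel][None, :]
        cross = np.where(other_view, sim[:, sel], 0.0).max(axis=1)
        # MTP-like soft constraints enter additively, not as hard filters.
        score = div + alpha * motion + beta * recency - gamma * cross
        score[sel] = -np.inf                       # never re-pick a token
        selected.append(int(np.argmax(score)))
    return np.array(selected)

# Toy usage with random stand-ins for real features:
rng = np.random.default_rng(0)
n = 2_000
idx = st_prune_sketch(
    rng.normal(size=(n, 64)), motion=rng.random(n), recency=rng.random(n),
    view_id=rng.integers(0, 6, size=n), keep=n // 10,   # 90% reduction
)
```

The additive form matters: treating motion and recency as soft bonuses inside the diversity objective, rather than as hard thresholds, is what would let a static but distinctive token survive when nothing already selected resembles it.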

What carries the argument

Motion-aware Temporal Pruning combined with Ring-view Spatial Pruning, which together apply motion and geometry penalties to token diversity selection to remove spatio-temporal redundancies.

If this is right

  • Vision-language models can process longer sequences or additional views at the same computational budget.
  • Real-time operation on vehicle hardware becomes feasible without task-specific retraining.
  • Certain accuracy metrics can exceed the unpruned baseline by focusing attention on relevant dynamic content.
  • Inference speeds remain comparable to prior pruning techniques while retaining higher task performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The geometry-based pruning step could adapt to other multi-camera video settings such as robotic navigation with known sensor layouts.
  • Pairing this pruning with quantization or distillation might compound efficiency gains for onboard deployment.
  • Scenes with subtle or uniform motion patterns could expose cases where the volatility metric removes necessary context.
  • Dynamic adjustment of pruning ratios based on detected scene complexity could further balance speed and safety.

Load-bearing premise

Motion volatility, temporal recency, and ring-view cross-similarity penalties can reliably identify redundant tokens without discarding critical scene information needed for perception, prediction, and planning.

What would settle it

A test sequence containing a critical low-motion object in a region the method would prune due to low volatility or high cross-view similarity, resulting in measurable drops in planning or detection accuracy.

Figures

Figures reproduced from arXiv: 2604.19145 by Cong Zhang, Haiyun Guo, Jinqiao Wang, Lin Sha, Min Huang, Qinghai Miao, Tao Wang.

Figure 1: Illustration of the failure modes of traditional pruning methods on spatio-temporal driving inputs … view at source ↗
Figure 2: The ST-Prune pipeline for spatio-temporal token reduction. The framework processes multi-view, multi-frame … view at source ↗
Figure 3: Sensitivity analysis of the balancing coefficients … view at source ↗
Figure 4: Qualitative visualization of ST-Prune on a challenging grounding scenario from NuInstruct. view at source ↗
Original abstract

Vision-Language Models (VLMs) have become central to autonomous driving systems, yet their deployment is severely bottlenecked by the massive computational overhead of multi-view camera and multi-frame video input. Existing token pruning methods, primarily designed for single-image inputs, treat each frame or view in isolation and thus fail to exploit the inherent spatio-temporal redundancies in driving scenarios. To bridge this gap, we propose ST-Prune, a training-free, plug-and-play framework comprising two complementary modules: Motion-aware Temporal Pruning (MTP) and Ring-view Spatial Pruning (RSP). MTP addresses temporal redundancy by encoding motion volatility and temporal recency as soft constraints within the diversity selection objective, prioritizing dynamic trajectories and current-frame content over static historical background. RSP further resolves spatial redundancy by exploiting the ring-view camera geometry to penalize bilateral cross-view similarity, eliminating duplicate projections and residual background that temporal pruning alone cannot suppress. These two modules together constitute a complete spatio-temporal pruning process, preserving key scene information under strict compression. Validated across four benchmarks spanning perception, prediction, and planning, ST-Prune establishes new state-of-the-art for training-free token pruning. Notably, even at 90% token reduction, ST-Prune achieves near-lossless performance with certain metrics surpassing the full-model baseline, while maintaining inference speeds comparable to existing pruning approaches.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, simulated authors' rebuttal, circularity audit, and axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces ST-Prune, a training-free, plug-and-play framework for spatio-temporal token pruning in vision-language models for autonomous driving. It comprises Motion-aware Temporal Pruning (MTP), which encodes motion volatility and temporal recency as soft constraints in a diversity selection objective, and Ring-view Spatial Pruning (RSP), which penalizes bilateral cross-view similarity using ring-view camera geometry. The method is validated on four benchmarks spanning perception, prediction, and planning, claiming new state-of-the-art results for training-free pruning with near-lossless performance at 90% token reduction and some metrics surpassing the full-model baseline while preserving inference speed.

Significance. If the central claims hold under rigorous controls, the work offers a practical advance for deploying VLMs in autonomous driving by directly exploiting spatio-temporal redundancies in multi-view, multi-frame inputs rather than relying on task-specific fine-tuning. The training-free design and explicit use of motion/geometry signals are clear strengths that could generalize beyond the evaluated benchmarks.

major comments (2)
  1. [Abstract] Abstract: the claim of 'certain metrics surpassing the full-model baseline' at 90% pruning is load-bearing for the near-lossless assertion but lacks visible verification of experimental controls, variance, or whether the improvement is statistically significant versus the unpruned model.
  2. [Method] MTP/RSP modules: the soft constraints on motion volatility, temporal recency, and ring-view cross-similarity rest on the assumption that these heuristics reliably discard only redundant tokens; this is undermined in low-motion (e.g., distant slow vehicles) or high-density scenes where volatility scores may under-weight static but safety-critical elements, and the paper provides no targeted ablation or failure-case analysis on such regimes.
minor comments (2)
  1. [Abstract] Abstract: specify the exact four benchmarks, the precise metrics that exceed the full baseline, and the pruning ratios at which this occurs.
  2. [Method] Notation: define 'motion volatility' and 'ring-view cross-similarity' more formally with equations before describing their use in the diversity objective.
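
To illustrate what the requested formalization might look like, here is one plausible reading consistent with the abstract; these equations are an editorial reconstruction, not the paper's.

```latex
% Illustrative reconstruction, not the paper's equations.
\begin{align*}
  v_i &= \frac{1}{T-1} \sum_{t=2}^{T} \bigl\| x_i^{(t)} - x_i^{(t-1)} \bigr\|_2
      && \text{motion volatility of position } i \text{ over } T \text{ frames} \\
  r_i &= \exp\bigl(-(T - t_i)/\tau\bigr)
      && \text{temporal recency; } t_i \text{ is the source frame} \\
  c_i &= \max_{j \,:\, \mathrm{view}(j) \in \mathcal{N}(\mathrm{view}(i))}
         \cos(x_i, x_j)
      && \text{ring-view cross-similarity to neighbouring cameras} \\
  s_i(S) &= \min_{k \in S} \bigl(1 - \cos(x_i, x_k)\bigr)
          + \alpha\, v_i + \beta\, r_i - \gamma\, c_i
      && \text{soft-constrained score inside the diversity objective}
\end{align*}
```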

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review of our manuscript. We address each major comment below and outline the revisions we will make to strengthen the presentation and analysis.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim of 'certain metrics surpassing the full-model baseline' at 90% pruning is load-bearing for the near-lossless assertion but lacks visible verification of experimental controls, variance, or whether the improvement is statistically significant versus the unpruned model.

    Authors: We acknowledge that the claim requires stronger supporting evidence for rigor. While our experiments demonstrated consistent outperformance in select metrics at 90% pruning across the evaluated benchmarks, we agree that variance, controls, and significance testing should be explicitly reported. In the revised manuscript, we will add standard deviations from multiple random seeds, clarify the experimental setup and controls in the results section, and include statistical significance tests (e.g., paired t-tests; see the sketch after these responses) against the full-model baseline. The abstract claim will be qualified or supported accordingly based on these results. revision: yes

  2. Referee: [Method] MTP/RSP modules: the soft constraints on motion volatility, temporal recency, and ring-view cross-similarity rest on the assumption that these heuristics reliably discard only redundant tokens; this is undermined in low-motion (e.g., distant slow vehicles) or high-density scenes where volatility scores may under-weight static but safety-critical elements, and the paper provides no targeted ablation or failure-case analysis on such regimes.

    Authors: We agree that the effectiveness of the soft constraints in MTP and RSP should be validated in challenging regimes such as low-motion and high-density scenes, where static safety-critical elements could be at risk. Our current benchmarks span diverse driving conditions, but we did not include dedicated targeted ablations or failure-case studies for these specific scenarios. In the revision, we will add a new subsection with quantitative ablations on low-motion (e.g., distant slow vehicles) and high-density urban scenes, along with qualitative analysis of failure cases and discussion of the heuristics' limitations. This will be accompanied by an expanded limitations section. revision: yes
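
On the paired t-tests proposed in response 1, here is a minimal sketch of the kind of test meant there, assuming per-seed metric arrays; the numbers are placeholders, not results from the paper.

```python
# Paired significance test of pruned vs. full model across random seeds.
# The metric values below are placeholders, NOT results from the paper.
from scipy.stats import ttest_rel

full_model  = [71.2, 70.8, 71.5, 70.9, 71.1]   # one metric, 5 seeds
st_prune_90 = [71.6, 71.1, 71.4, 71.3, 71.0]   # same metric, 90% pruning

t_stat, p_value = ttest_rel(st_prune_90, full_model)
print(f"paired t = {t_stat:.3f}, p = {p_value:.3f}")
```

A paired test is the right shape here because each seed yields one matched (full, pruned) pair; an unpaired test would discard that pairing and lose power.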

Circularity Check

0 steps flagged

No circularity: pruning heuristics rely on explicit external signals

full rationale

The paper's core derivation consists of two heuristic modules (MTP and RSP) that encode motion volatility, temporal recency, and ring-view geometry as soft constraints inside a diversity selection objective. These are defined directly from camera geometry and frame differences rather than from the downstream task loss or from any fitted parameters that are later renamed as predictions. No equations reduce by construction to their own inputs, no self-citation chain is invoked to justify uniqueness or an ansatz, and the method is explicitly training-free. Validation on four benchmarks therefore constitutes an independent test rather than a tautology. The derivation chain is grounded in external signals rather than in its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based on the abstract only; no explicit free parameters or invented entities are detailed. The approach rests on a single domain assumption: that driving data contains inherent, exploitable spatio-temporal redundancies.

axioms (1)
  • domain assumption: Driving scenarios contain exploitable spatio-temporal redundancies that can be pruned without loss of key information.
    Invoked to justify both MTP and RSP modules.

pith-pipeline@v0.9.0 · 5559 in / 1210 out tokens · 45953 ms · 2026-05-10T02:56:35.659677+00:00 · methodology

