ST-Prune: Training-Free Spatio-Temporal Token Pruning for Vision-Language Models in Autonomous Driving
Pith reviewed 2026-05-10 02:56 UTC · model grok-4.3
The pith
ST-Prune reduces visual tokens by 90 percent in driving vision-language models while maintaining near-full performance through motion-aware temporal and ring-view spatial pruning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that Motion-aware Temporal Pruning encodes motion volatility and temporal recency as soft constraints within a diversity-selection objective, prioritizing dynamic trajectories and current frames, while Ring-view Spatial Pruning uses camera geometry to penalize bilateral cross-view similarity and suppress duplicate projections. Together, these modules form a complete pruning process that preserves key scene information under high compression, achieving state-of-the-art results among training-free methods and near-lossless performance even at 90 percent token reduction across perception, prediction, and planning benchmarks.
What carries the argument
Motion-aware Temporal Pruning combined with Ring-view Spatial Pruning, which together apply motion and geometry penalties to token diversity selection to remove spatio-temporal redundancies.
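The selection step described above can be sketched as a greedy max-min diversity objective whose per-token utility is boosted by motion and recency weights. Everything below (the volatility definition, the `alpha`/`beta` weighting form, the greedy seed choice) is an illustrative assumption, not the paper's exact formulation.

```python
# Illustrative sketch of diversity selection with motion/recency soft
# penalties, in the spirit of MTP. Weighting details are assumptions.
import numpy as np

def motion_volatility(frames):
    """Per-token volatility: std-dev of feature change across frames.

    frames: (T, N, D) token features over T frames, N tokens, D dims.
    Returns (N,) scores normalized to [0, 1].
    """
    diffs = np.linalg.norm(np.diff(frames, axis=0), axis=-1)  # (T-1, N)
    v = diffs.std(axis=0)
    return v / (v.max() + 1e-8)

def select_tokens(feats, volatility, recency, k, alpha=0.5, beta=0.3):
    """Greedy max-min diversity selection with soft bonus weights.

    feats: (N, D) token features; volatility, recency: (N,) in [0, 1].
    Keeps k tokens: a token's utility is its distance to the already
    selected set, scaled up by motion volatility and temporal recency
    (hypothetical weighting, not the paper's objective).
    """
    bonus = 1.0 + alpha * volatility + beta * recency
    selected = [int(np.argmax(bonus))]  # seed with most dynamic token
    dists = np.linalg.norm(feats - feats[selected[0]], axis=1)
    for _ in range(k - 1):
        util = dists * bonus
        util[selected] = -np.inf        # never re-select a kept token
        nxt = int(np.argmax(util))
        selected.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(feats - feats[nxt], axis=1))
    return sorted(selected)
```

In this sketch, static background tokens cluster together in feature space and carry low bonus weights, so the greedy pass naturally spends its budget on diverse, dynamic, recent content.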
If this is right
- Vision-language models can process longer sequences or additional views at the same computational budget.
- Real-time operation on vehicle hardware becomes feasible without task-specific retraining.
- Certain accuracy metrics can exceed the unpruned baseline by focusing attention on relevant dynamic content.
- Inference speeds remain comparable to prior pruning techniques while retaining higher task performance.
Where Pith is reading between the lines
- The geometry-based pruning step could adapt to other multi-camera video settings such as robotic navigation with known sensor layouts.
- Pairing this pruning with quantization or distillation might compound efficiency gains for onboard deployment.
- Scenes with subtle or uniform motion patterns could expose cases where the volatility metric removes necessary context.
- Dynamic adjustment of pruning ratios based on detected scene complexity could further balance speed and safety.
Load-bearing premise
Motion volatility, temporal recency, and ring-view cross-similarity penalties can reliably identify redundant tokens without discarding critical scene information needed for perception, prediction, and planning.
What would settle it
A test sequence containing a critical low-motion object in a region the method would prune due to low volatility or high cross-view similarity, resulting in measurable drops in planning or detection accuracy.
Original abstract
Vision-Language Models (VLMs) have become central to autonomous driving systems, yet their deployment is severely bottlenecked by the massive computational overhead of multi-view camera and multi-frame video input. Existing token pruning methods, primarily designed for single-image inputs, treat each frame or view in isolation and thus fail to exploit the inherent spatio-temporal redundancies in driving scenarios. To bridge this gap, we propose ST-Prune, a training-free, plug-and-play framework comprising two complementary modules: Motion-aware Temporal Pruning (MTP) and Ring-view Spatial Pruning (RSP). MTP addresses temporal redundancy by encoding motion volatility and temporal recency as soft constraints within the diversity selection objective, prioritizing dynamic trajectories and current-frame content over static historical background. RSP further resolves spatial redundancy by exploiting the ring-view camera geometry to penalize bilateral cross-view similarity, eliminating duplicate projections and residual background that temporal pruning alone cannot suppress. These two modules together constitute a complete spatio-temporal pruning process, preserving key scene information under strict compression. Validated across four benchmarks spanning perception, prediction, and planning, ST-Prune establishes new state-of-the-art for training-free token pruning. Notably, even at 90% token reduction, ST-Prune achieves near-lossless performance with certain metrics surpassing the full-model baseline, while maintaining inference speeds comparable to existing pruning approaches.
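The bilateral cross-view suppression described for RSP can be sketched as a sweep over adjacent camera pairs that drops near-identical tokens. The adjacency scheme, cosine threshold, and one-directional sweep below are assumptions for illustration, not the paper's penalty formulation.

```python
# Illustrative sketch of ring-view duplicate suppression, in the
# spirit of RSP. Threshold and sweep order are assumptions.
import numpy as np

def cosine_sim(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + 1e-8)
    return a @ b.T

def ring_view_keep_mask(views, threshold=0.9):
    """Flag tokens that duplicate content seen in the adjacent camera.

    views: list of (N_c, D) token arrays, ordered around the ring so
    that views[c] overlaps views[(c + 1) % len(views)].
    Returns one boolean keep-mask per camera.
    """
    keep = [np.ones(v.shape[0], dtype=bool) for v in views]
    n_cam = len(views)
    for c in range(n_cam):
        nxt = (c + 1) % n_cam
        sim = cosine_sim(views[nxt], views[c])  # (N_next, N_c)
        # A token in the next view is a duplicate if it is nearly
        # identical to some still-kept token in the current view.
        dup = (sim[:, keep[c]] > threshold).any(axis=1)
        keep[nxt] &= ~dup
    return keep
```

The point of such a pass is that overlapping camera frustums project the same scene content twice; temporal pruning alone cannot remove these duplicates because each copy may be individually dynamic.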
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ST-Prune, a training-free, plug-and-play framework for spatio-temporal token pruning in vision-language models for autonomous driving. It comprises Motion-aware Temporal Pruning (MTP), which encodes motion volatility and temporal recency as soft constraints in a diversity selection objective, and Ring-view Spatial Pruning (RSP), which penalizes bilateral cross-view similarity using ring-view camera geometry. The method is validated on four benchmarks spanning perception, prediction, and planning, claiming new state-of-the-art results for training-free pruning with near-lossless performance at 90% token reduction and some metrics surpassing the full-model baseline while preserving inference speed.
Significance. If the central claims hold under rigorous controls, the work offers a practical advance for deploying VLMs in autonomous driving by directly exploiting spatio-temporal redundancies in multi-view, multi-frame inputs rather than relying on task-specific fine-tuning. The training-free design and explicit use of motion/geometry signals are clear strengths that could generalize beyond the evaluated benchmarks.
major comments (2)
- [Abstract] Abstract: the claim of 'certain metrics surpassing the full-model baseline' at 90% pruning is load-bearing for the near-lossless assertion but lacks visible verification of experimental controls, variance, or whether the improvement is statistically significant versus the unpruned model.
- [Method] MTP/RSP modules: the soft constraints on motion volatility, temporal recency, and ring-view cross-similarity rest on the assumption that these heuristics reliably discard only redundant tokens; this is undermined in low-motion (e.g., distant slow vehicles) or high-density scenes where volatility scores may under-weight static but safety-critical elements, and the paper provides no targeted ablation or failure-case analysis on such regimes.
minor comments (2)
- [Abstract] Abstract: specify the exact four benchmarks, the precise metrics that exceed the full baseline, and the pruning ratios at which this occurs.
- [Method] Notation: define 'motion volatility' and 'ring-view cross-similarity' more formally with equations before describing their use in the diversity objective.
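As one plausible formalization of the quantities named in that comment (illustrative forms only, not taken from the manuscript), with f_i^t the feature of token i at frame t out of T frames and lambda an assumed decay rate:

```latex
% Hypothetical definitions, for illustration only.
% Motion volatility of token i across frames:
v_i = \operatorname{Var}_{t}\!\left( \lVert f_i^{t} - f_i^{t-1} \rVert_2 \right)

% Temporal recency weight of frame t:
r_t = \exp\!\big( -\lambda \, (T - t) \big)

% Ring-view cross-similarity between token i in camera c and
% token j in the adjacent camera c+1:
s_{ij}^{(c)} = \frac{\langle f_i^{(c)},\, f_j^{(c+1)} \rangle}
                    {\lVert f_i^{(c)} \rVert \, \lVert f_j^{(c+1)} \rVert}
```

Definitions of this shape would let the diversity objective subtract geometry- and motion-based penalties from otherwise redundant tokens, which is the role the abstract assigns to them.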
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review of our manuscript. We address each major comment below and outline the revisions we will make to strengthen the presentation and analysis.
Point-by-point responses
Referee: [Abstract] Abstract: the claim of 'certain metrics surpassing the full-model baseline' at 90% pruning is load-bearing for the near-lossless assertion but lacks visible verification of experimental controls, variance, or whether the improvement is statistically significant versus the unpruned model.
Authors: We acknowledge that the claim requires stronger supporting evidence for rigor. While our experiments demonstrated consistent outperformance in select metrics at 90% pruning across the evaluated benchmarks, we agree that variance, controls, and significance testing should be explicitly reported. In the revised manuscript, we will add standard deviations from multiple random seeds, clarify the experimental setup and controls in the results section, and include statistical significance tests (e.g., paired t-tests) against the full-model baseline. The abstract claim will be qualified or supported accordingly based on these results. revision: yes
Referee: [Method] MTP/RSP modules: the soft constraints on motion volatility, temporal recency, and ring-view cross-similarity rest on the assumption that these heuristics reliably discard only redundant tokens; this is undermined in low-motion (e.g., distant slow vehicles) or high-density scenes where volatility scores may under-weight static but safety-critical elements, and the paper provides no targeted ablation or failure-case analysis on such regimes.
Authors: We agree that the effectiveness of the soft constraints in MTP and RSP should be validated in challenging regimes such as low-motion and high-density scenes, where static safety-critical elements could be at risk. Our current benchmarks span diverse driving conditions, but we did not include dedicated targeted ablations or failure-case studies for these specific scenarios. In the revision, we will add a new subsection with quantitative ablations on low-motion (e.g., distant slow vehicles) and high-density urban scenes, along with qualitative analysis of failure cases and discussion of the heuristics' limitations. This will be accompanied by an expanded limitations section. revision: yes
Circularity Check
No circularity: pruning heuristics rely on explicit external signals
Full rationale
The paper's core derivation consists of two heuristic modules (MTP and RSP) that encode motion volatility, temporal recency, and ring-view geometry as soft constraints inside a diversity selection objective. These are defined directly from camera geometry and frame differences rather than from the downstream task loss or from any fitted parameters that are later renamed as predictions. No equations reduce by construction to their own inputs, no self-citation chain is invoked to justify uniqueness or an ansatz, and the method is explicitly training-free. Validation on four benchmarks therefore constitutes an independent test rather than a tautology. The derivation chain is grounded in external signals rather than in its own outputs.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: driving scenarios contain exploitable spatio-temporal redundancies that can be pruned without loss of key scene information.