
arxiv: 2606.02774 · v1 · pith:4W7WAU6Nnew · submitted 2026-06-01 · 💻 cs.CV

GeoDrive-Bench: Benchmarking Region-Specific Multimodal Reasoning in Autonomous Driving

Pith reviewed 2026-06-28 14:43 UTC · model grok-4.3

classification 💻 cs.CV
keywords vision-language models, autonomous driving, benchmark, region-specific reasoning, multimodal reasoning, traffic conventions, geo-cultural differences, driving tasks

The pith

Vision-language models for autonomous driving perform inconsistently across regions because they lack robust awareness of local traffic conventions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates GeoDrive-Bench, a set of 5,053 multiple-choice questions drawn from six countries, to test whether vision-language models can choose correct driving actions using only visual scenes plus unspoken local rules. Questions cover perception, prediction, planning, and region reasoning and deliberately omit any country names so models must infer the right behavior from the image alone. The authors also present a distillation method that transfers region-specific traffic knowledge into a model's internal representations. When nine existing models are tested, their accuracy fluctuates sharply from one driving culture to another, while the authors' adapted baselines improve across the board. A sympathetic reader would care because any model deployed worldwide must handle these hidden regional differences or risk unsafe decisions.

Core claim

GeoDrive-Bench supplies 5,053 human-validated questions across six countries that each require a model to combine visual evidence with implicit local traffic conventions to select the correct action among perception, prediction, planning, and region-reasoning options; no country label is provided. A distillation algorithm is introduced that injects region-specific traffic-rule knowledge directly into the model's representations. Experiments on nine state-of-the-art VLMs reveal large accuracy gaps between geo-driving cultures on every task, while the authors' baseline models show measurable gains in cross-region performance, indicating that present VLMs do not yet possess reliable region-awar

What carries the argument

GeoDrive-Bench, a curated collection of 5,053 QA pairs that force inference from visual scenes plus unspoken local traffic conventions, paired with a distillation algorithm that embeds region-specific rule knowledge into VLM representations.

If this is right

  • Existing VLMs display large performance differences across the six countries on perception, prediction, planning, and region-reasoning tasks.
  • The proposed distillation method produces baseline models that improve geo-cultural reasoning uniformly across regions.
  • Current VLMs still lack the region-aware intelligence required for safe deployment in varied global driving environments.
  • GeoDrive-Bench functions as both a diagnostic test and a training resource for building more deployable autonomous-driving foundation models.
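The cross-region gaps described above can be illustrated with a minimal sketch of how per-country and per-task accuracy might be computed from model predictions. The record layout (`country`, `task`, `pred`, `gold`), the two-letter country labels, and the helper names are assumptions made for illustration; they are not taken from the paper or its released code.

```python
from collections import defaultdict

def accuracy_by_group(records, key):
    """Return {group: accuracy} for records grouped by the field `key`."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r[key]] += 1
        correct[r[key]] += int(r["pred"] == r["gold"])
    return {g: correct[g] / total[g] for g in total}

def max_gap(acc):
    """Largest accuracy difference between any two groups."""
    return max(acc.values()) - min(acc.values())

# Hypothetical toy records; a real run would hold one record per benchmark question.
records = [
    {"country": "JP", "task": "planning", "pred": "A", "gold": "A"},
    {"country": "JP", "task": "perception", "pred": "B", "gold": "C"},
    {"country": "UK", "task": "planning", "pred": "D", "gold": "D"},
    {"country": "UK", "task": "perception", "pred": "A", "gold": "A"},
]

per_country = accuracy_by_group(records, "country")
per_task = accuracy_by_group(records, "task")
print(per_country, max_gap(per_country))
```

Reporting the gap alongside the mean keeps a high aggregate score from hiding a weak country, which is the failure pattern the summary attributes to current VLMs.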

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the performance gaps persist, any worldwide rollout of driving VLMs would need systematic region-by-region adaptation rather than a single global model.
  • Models might begin to treat subtle visual markers such as sign styles or vehicle types as implicit location signals, which could be measured in follow-up experiments.
  • The same curation approach could be applied to other multimodal tasks where unspoken local conventions matter, such as region-specific legal or medical image reasoning.

Load-bearing premise

The benchmark questions can be answered correctly only by combining visual evidence with implicit local traffic conventions rather than by surface-level image features or any explicit country information.

What would settle it

A single VLM that reaches near-ceiling accuracy with no statistically significant difference across all six countries on the full set of 5,053 questions, or a controlled test showing that the same questions can be solved at high accuracy using only generic visual features without any region-specific knowledge.
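One way to make "no statistically significant difference across all six countries" concrete is a chi-square test of homogeneity on per-country correct/incorrect counts. The sketch below is an assumption about how such a check could be run, not a procedure described in the paper; the counts are invented, and the critical value 11.070 is the standard chi-square threshold for 5 degrees of freedom at the 0.05 level.

```python
def chi_square_homogeneity(counts):
    """counts maps country -> (num_correct, num_total).

    Returns (statistic, degrees_of_freedom) for the null hypothesis that
    every country shares the same underlying accuracy. Assumes the pooled
    accuracy is strictly between 0 and 1.
    """
    total_correct = sum(c for c, _ in counts.values())
    total_n = sum(n for _, n in counts.values())
    pooled = total_correct / total_n
    stat = 0.0
    for c, n in counts.values():
        expected_correct = n * pooled
        expected_wrong = n * (1 - pooled)
        stat += (c - expected_correct) ** 2 / expected_correct
        stat += ((n - c) - expected_wrong) ** 2 / expected_wrong
    return stat, len(counts) - 1

# Chi-square critical value for df = 5 at alpha = 0.05.
CRITICAL_DF5_ALPHA05 = 11.070

# Hypothetical counts for six placeholder countries.
skewed = {f"country_{i}": (90, 100) for i in range(3)}
skewed.update({f"country_{i}": (60, 100) for i in range(3, 6)})

stat, dof = chi_square_homogeneity(skewed)
print(stat, dof, stat > CRITICAL_DF5_ALPHA05)
```

A non-significant result alone would not settle the question: the criterion above also requires near-ceiling accuracy, and with thousands of questions even small gaps can reach significance, so the effect size should be read alongside the test.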

Figures

Figures reproduced from arXiv: 2606.02774 by Chaowei Xiao, Ming Jiang, Yingzi Ma.

Figure 1: Overview of GEODRIVE-BENCH. Left: radar visualization of per-country accuracy across representative VLMs, where each polygon corresponds to one country and each axis denotes a model. The results show that current VLMs exhibit highly imbalanced performance across country-specific scenarios, even when evaluated on the same driving tasks. Right: region-specific visual cues (signs, license plates, signals, vehi…
Figure 2: Overview of our scenario collection pipeline. To make “cultural relevance” an operational criterion rather than an intuitive judgment, we manually define 13 categories of culture-specific traffic situations, drawing on crowdsourced traffic regulations from Wikipedia and prior studies on cross-country driving behavior [21]. A category is retained only when national traffic codes diverge along at least on…
Figure 3: Distribution of GEODRIVE-BENCH across datasets, task categories, countries, scenarios, and region-specific topics. 3.2 Culture-relevant Driving Question-Answer Generation. Our goal is to identify VLM backbones suitable for VLA systems that operate across countries, so we focus on high-level driving-related VQA. Following prior driving benchmarks [33, 30, 42], we adopt the standard Perception / Prediction / …
Figure 4: Ablation on traffic rule injection across the four task categories. Rule-Given provides the correct rule, Wrong-Rule injects a mismatched rule, and Full-Handbook provides the full multi-country handbook. … closer to common pretraining priors and degrade sharply in culturally distinctive ones, so high aggregate accuracy does not imply robust region-aware reasoning. Effect of prompting settings. Comparing Dire…
Figure 5: Country-wise distribution of error types for InternVL3 and our DRIVEOPD‡ under the reasoning setting. Each pie chart shows the proportion of four major error categories within a country: Visual Misperception, Geographic Misclassification, Cultural Rule Gap, and Reasoning Error. … behave more stably: a smaller gap between Rule-Given and Wrong-Rule together with competitive Full-Handbook performance suggests…
Figure 6: A region reasoning case study of InternVL3 on a school-warning sign question across four countries. The model identifies each country from salient cues (Japanese text, UK street layout, Indian auto-rickshaws, Chinese license plates) yet defaults to a generic yellow-diamond template, failing in the UK, India, and China. This decoupling of recognition from rule grounding motivates explicit internalizatio…
Figure 7: Country-wise distribution of error types for Qwen2.5-VL-7B and DRIVEOPD† under the Reasoning setting. Each pie chart shows the proportion of four error categories within a country: Visual Misperception, Geographic Misclassification, Cultural Rule Gap, and Reasoning Error. … toward Visual Misperception and Reasoning Error. This shift is itself informative: once regional rule knowledge is internalized, fine-g…
Figure 8: Web-based annotation tool used for human review.
Original abstract

Vision-language models (VLMs) for autonomous driving have shown promising performance, but their ability to handle region-specific traffic rules remains underexplored, raising uncertainties about their deployment across diverse global settings. We therefore introduce GeoDrive-Bench, a novel benchmark that enables the systematic investigation of VLMs' geo-culturally grounded driving reasoning. We curated 5,053 human-validated multiple-choice QA pairs across six countries covering diverse driving cultures. Specifically, we emphasize four driving tasks: perception, prediction, planning, and region reasoning. Each question requires models to infer the correct driving behavior from visual evidence and local traffic conventions without explicit country labels. Beyond evaluation, we further design a distillation algorithm that injects region-specific traffic-rule knowledge into the internal representations of VLMs, enabling models to better align visual scene understanding with local driving policies. Experiments on nine state-of-the-art VLMs show substantial performance variations across geo-driving cultures for each task, while our proposed baseline models exhibit improved geo-cultural reasoning across regions. These results suggest that current VLMs still lack robust region-aware driving intelligence and highlight GeoDrive-Bench as a diagnostic and training-oriented testbed for deployable autonomous driving foundation models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces GeoDrive-Bench, a benchmark of 5,053 human-validated multiple-choice QA pairs spanning six countries and four driving tasks (perception, prediction, planning, region reasoning). Questions are asserted to require joint visual and implicit geo-cultural inference without explicit country labels. The authors additionally propose a distillation algorithm to inject region-specific traffic knowledge into VLMs. Experiments on nine state-of-the-art VLMs report substantial cross-region performance gaps, while the authors' baseline models show improved geo-cultural reasoning; the work concludes that current VLMs lack robust region-aware driving intelligence.

Significance. If the benchmark questions demonstrably require geo-cultural inference beyond surface-level visual or textual cues, the dataset and distillation method would provide a valuable diagnostic and training resource for assessing and improving VLMs in globally deployable autonomous driving systems, where regional traffic conventions vary substantially.

major comments (2)
  1. [Abstract] Abstract: The central claim that performance variations demonstrate missing region-aware intelligence rests on the premise that the 5,053 QA pairs force inference from visual evidence plus implicit local traffic conventions rather than surface cues (sign text, vehicle models, road markings, language) or inferable labels. The abstract asserts human validation and absence of explicit country labels but supplies no quantitative check (e.g., inter-annotator agreement on cue independence, ablation of region-specific elements, or solvability after masking geo-cues) that would be required to secure this premise; without such evidence the observed gaps could arise from training-data biases or general VLM weaknesses instead.
  2. [Abstract] Abstract (curation paragraph): No details are provided on the question-construction or validation process (e.g., how annotators were instructed to avoid explicit or inferable country signals, what fraction of questions were rejected during validation, or any pilot study measuring answerability from non-driving features). This information is load-bearing for interpreting the reported cross-country variations as evidence of missing geo-cultural reasoning.
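The referee's first comment asks for inter-annotator agreement on cue independence. A minimal sketch of one such statistic, Cohen's kappa for two annotators, is given below. The label names (`rule_dependent`, `surface_cue`) and the example annotations are hypothetical and chosen only for illustration; the paper does not report this analysis.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical judgments: does the question need a local rule, or is it
# answerable from a surface cue alone?
annotator_1 = ["rule_dependent", "rule_dependent", "surface_cue", "surface_cue"]
annotator_2 = ["rule_dependent", "surface_cue", "surface_cue", "surface_cue"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(kappa)
```

Kappa corrects raw agreement for chance, so it is more informative than percent agreement when one label dominates, as it may if most questions are judged rule-dependent.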

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger justification of the benchmark's design in the abstract. We address each point below and commit to revisions that improve clarity without overstating the current evidence.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that performance variations demonstrate missing region-aware intelligence rests on the premise that the 5,053 QA pairs force inference from visual evidence plus implicit local traffic conventions rather than surface cues (sign text, vehicle models, road markings, language) or inferable labels. The abstract asserts human validation and absence of explicit country labels but supplies no quantitative check (e.g., inter-annotator agreement on cue independence, ablation of region-specific elements, or solvability after masking geo-cues) that would be required to secure this premise; without such evidence the observed gaps could arise from training-data biases or general VLM weaknesses instead.

    Authors: We agree that quantitative checks would further secure the premise. Section 3 of the full manuscript details the human validation protocol, where country-specific annotators were explicitly instructed to create questions requiring local traffic conventions beyond visible cues, and all questions were reviewed for absence of explicit country labels. While we did not conduct the specific ablations or cue-masking experiments suggested, the consistent cross-region gaps across nine diverse VLMs provide supporting evidence. We will revise the abstract to reference these validation steps and add a limitations paragraph discussing potential surface cues. revision: partial

  2. Referee: [Abstract] Abstract (curation paragraph): No details are provided on the question-construction or validation process (e.g., how annotators were instructed to avoid explicit or inferable country signals, what fraction of questions were rejected during validation, or any pilot study measuring answerability from non-driving features). This information is load-bearing for interpreting the reported cross-country variations as evidence of missing geo-cultural reasoning.

    Authors: The full manuscript (Sections 3.1–3.2) describes the construction process, including annotator instructions to avoid inferable signals, the use of a pilot study to confirm answerability from driving features, and rejection criteria during validation. We will expand the abstract with a concise summary of these elements, including key statistics on the validation process. revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark curation and empirical evaluation are independent of fitted inputs or self-citation chains

Full rationale

The paper presents a new dataset of 5,053 human-validated QA pairs across six countries and reports direct empirical results on nine VLMs plus a proposed distillation baseline. No equations, parameter fits, or derivations appear in the provided text. The central claim (performance variations indicate lack of region-aware intelligence) rests on the benchmark's construction and measured accuracies rather than reducing to any self-defined quantity, fitted subset renamed as prediction, or load-bearing self-citation. Curation is described as external human validation without explicit country labels, and the distillation step is presented as an added contribution rather than a circular justification of the benchmark itself. This is a standard empirical benchmark paper with no detectable circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no free parameters, axioms, or invented entities are described.



Reference graph

Works this paper leans on

60 extracted references · 21 canonical work pages · 12 internal anchors

  1. [1]

    Claude sonnet 4.6

    Anthropic. Claude sonnet 4.6. https://www.anthropic.com/claude, 2025. Large language model

  2. [2]

    Covla: Comprehensive vision-language-action dataset for autonomous driving

    Hidehisa Arai, Keita Miwa, Kento Sasaki, Kohei Watanabe, Yu Yamaguchi, Shunsuke Aoki, and Issei Yamamoto. Covla: Comprehensive vision-language-action dataset for autonomous driving. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1933–1943. IEEE, 2025

  3. [3]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025

  4. [4]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025

  5. [5]

    nuscenes: A multimodal dataset for autonomous driving

    Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020

  6. [6]

    Automated evaluation of large vision-language models on self-driving corner cases

    Kai Chen, Yanze Li, Wenhua Zhang, Yanxin Liu, Pengxiang Li, Ruiyuan Gao, Lanqing Hong, Meng Tian, Xinhai Zhao, Zhenguo Li, et al. Automated evaluation of large vision-language models on self-driving corner cases. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 7817–7826. IEEE, 2025

  7. [7]

    Impromptu vla: Open weights and open data for driving vision-language-action models

    Haohan Chi, Huan-ang Gao, Ziming Liu, Jianing Liu, Chenyu Liu, Jinwei Li, Kaisen Yang, Yangcheng Yu, Zeda Wang, Wenyi Li, et al. Impromptu vla: Open weights and open data for driving vision-language-action models. arXiv preprint arXiv:2505.23757, 2025

  8. [8]

    Holistic autonomous driving understanding by bird’s-eye-view injected multi-modal large models

    Xinpeng Ding, Jianhua Han, Hang Xu, Xiaodan Liang, Wei Zhang, and Xiaomeng Li. Holistic autonomous driving understanding by bird’s-eye-view injected multi-modal large models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13668–13677, 2024

  9. [9]

    Towards understanding worldwide cross-cultural differences in implicit driving cues: Review, comparative analysis, and research roadmap

    Yongqi Dong, Chang Liu, Yiyun Wang, and Zhe Fu. Towards understanding worldwide cross-cultural differences in implicit driving cues: Review, comparative analysis, and research roadmap. In 2024 IEEE 27th International Conference on Intelligent Transportation Systems (ITSC), pages 1569–1575. IEEE, 2024

  10. [10]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  11. [11]

    Surds: Benchmarking spatial understanding and reasoning in driving scenarios with vision language models

    Xianda Guo, Ruijun Zhang, Yiqun Duan, Yuhang He, Dujun Nie, Wenke Huang, Chenming Zhang, Shuai Liu, Hao Zhao, and Long Chen. Surds: Benchmarking spatial understanding and reasoning in driving scenarios with vision language models. arXiv preprint arXiv:2411.13112, 2024

  12. [12]

    Driveaction: A benchmark for exploring human-like driving decisions in vla models

    Yuhan Hao, Zhengning Li, Lei Sun, Weilong Wang, Naixin Yi, Sheng Song, Caihong Qin, Mofan Zhou, Yifei Zhan, and Xianpeng Lang. Driveaction: A benchmark for exploring human-like driving decisions in vla models. arXiv preprint arXiv:2506.05667, 2025

  13. [13]

    Carscenes: Semantic vlm dataset for safe autonomous driving

    Yuankai He and Weisong Shi. Carscenes: Semantic vlm dataset for safe autonomous driving. arXiv preprint arXiv:2511.10701, 2025

  14. [14]

    Benchmarking neural network robustness to common corruptions and perturbations

    Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. Proceedings of the International Conference on Learning Representations, 2019

  15. [15]

    Planning-oriented autonomous driving

    Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17853–17862, 2023

  16. [16]

    Nuscenes-mqa: Integrated evaluation of captions and qa for autonomous driving datasets using markup annotations

    Yuichi Inoue, Yuki Yada, Kotaro Tanahashi, and Yu Yamaguchi. Nuscenes-mqa: Integrated evaluation of captions and qa for autonomous driving datasets using markup annotations. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 930–938, 2024

  17. [17]

    Drivelmm-o1: A step-by-step reasoning dataset and large multimodal model for driving scenario understanding

    Ayesha Ishaq, Jean Lahoud, Ketan More, Omkar Thawakar, Ritesh Thawkar, Dinura Dissanayake, Noor Ahsan, Yuhao Li, Fahad Shahbaz Khan, Hisham Cholakkal, et al. Drivelmm-o1: A step-by-step reasoning dataset and large multimodal model for driving scenario understanding. In 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), page...

  18. [18]

    Vad: Vectorized scene representation for efficient autonomous driving

    Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. Vad: Vectorized scene representation for efficient autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340–8350, 2023

  19. [19]

    Sdpo: Segment-level direct preference optimization for social agents

    Aobo Kong, Wentao Ma, Shiwan Zhao, Yongbin Li, Yuchuan Wu, Ke Wang, Xiaoqian Liu, Qicheng Li, Yong Qin, and Fei Huang. Sdpo: Segment-level direct preference optimization for social agents. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12409–12423, 2025

  20. [20]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th symposium on operating systems principles, pages 611–626, 2023

  21. [21]

    Driving everywhere with large language model policy adaptation

    Boyi Li, Yue Wang, Jiageng Mao, Boris Ivanovic, Sushant Veer, Karen Leung, and Marco Pavone. Driving everywhere with large language model policy adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14948–14957, 2024

  22. [22]

    Llava-next: Improved reasoning, ocr, and world knowledge, January 2024

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024. URL https://llava-vl.github.io/blog/2024-01-30-llava-next/

  23. [23]

    Grounding dino: Marrying dino with grounded pre-training for open-set object detection

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In European conference on computer vision, pages 38–55. Springer, 2024

  24. [24]

    Adathinkdrive: Adaptive thinking via reinforcement learning for autonomous driving

    Yuechen Luo, Fang Li, Shaoqing Xu, Zhiyi Lai, Lei Yang, Qimao Chen, Ziang Luo, Zixun Xie, Shengyin Jiang, Jiaxin Liu, et al. Adathinkdrive: Adaptive thinking via reinforcement learning for autonomous driving. arXiv preprint arXiv:2509.13769, 2025

  25. [25]

    Dolphins: Multimodal language model for driving

    Yingzi Ma, Yulong Cao, Jiachen Sun, Marco Pavone, and Chaowei Xiao. Dolphins: Multimodal language model for driving. In European Conference on Computer Vision, pages 403–420. Springer, 2024

  26. [26]

    One million scenes for autonomous driving: Once dataset

    Jiageng Mao, Minzhe Niu, Chenhan Jiang, Xiaodan Liang, Yamin Li, Chaoqiang Ye, Wei Zhang, Zhenguo Li, Jie Yu, Chunjing Xu, et al. One million scenes for autonomous driving: Once dataset. 2021

  27. [27]

    Lingoqa: Visual question answering for autonomous driving

    Ana-Maria Marcu, Long Chen, Jan Hünermann, Alice Karnsund, Benoit Hanotte, Prajwal Chidananda, Saurabh Nair, Vijay Badrinarayanan, Alex Kendall, Jamie Shotton, and Oleg Sinavski. Lingoqa: Visual question answering for autonomous driving. arXiv preprint arXiv:2312.14115, 2023

  28. [28]

    Is your vlm for autonomous driving safety-ready? a comprehensive benchmark for evaluating external and in-cabin risks

    Xianhui Meng, Yuchen Zhang, Zhijian Huang, Zheng Lu, Ziling Ji, Yaoyao Yin, Hongyuan Zhang, Guangfeng Jiang, Yandan Lin, Long Chen, et al. Is your vlm for autonomous driving safety-ready? a comprehensive benchmark for evaluating external and in-cabin risks. arXiv preprint arXiv:2511.14592, 2025

  29. [29]

    Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail

    NVIDIA, Yan Wang, Wenjie Luo, Junjie Bai, Yulong Cao, Tong Che, Ke Chen, Yuxiao Chen, Jenna Diamond, Yifan Ding, Wenhao Ding, Liang Feng, Greg Heinrich, Jack Huang, Peter Karkus, Boyi Li, Pinyi Li, Tsung-Yi Lin, Dongran Liu, Ming-Yu Liu, Langechuan Liu, Zhijian Liu, Jason Lu, Yunxiang Mao, Pavlo Molchanov, Lindsey Pavao, Zhenghao Peng, Mike Ranzinger...

  30. [30]

    Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario

    Tianwen Qian, Jingjing Chen, Linhai Zhuo, Yang Jiao, and Yu-Gang Jiang. Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 4542–4550, 2024

  31. [31]

    Lmdrive: Closed-loop end-to-end driving with large language models

    Hao Shao, Yuxuan Hu, Letian Wang, Guanglu Song, Steven L Waslander, Yu Liu, and Hongsheng Li. Lmdrive: Closed-loop end-to-end driving with large language models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15120–15130, 2024

  32. [32]

    Self-Distillation Enables Continual Learning

    Idan Shenfeld, Mehul Damani, Jonas Hübotter, and Pulkit Agrawal. Self-distillation enables continual learning. arXiv preprint arXiv:2601.19897, 2026

  33. [33]

    Drivelm: Driving with graph visual question answering

    Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. Drivelm: Driving with graph visual question answering. In European conference on computer vision, pages 256–274. Springer, 2024.

  34. [34]

    Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, Gaël Liu, Francesco Visin, Kathleen Kenealy, Lucas Bey...

  35. [35]

    Qwen3 Technical Report

    Qwen Team. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388

  36. [36]

    Nuscenes-spatialqa: A spatial understanding and reasoning benchmark for vision-language models in autonomous driving

    Kexin Tian, Jingrui Mao, Yunlong Zhang, Jiwan Jiang, Yang Zhou, and Zhengzhong Tu. Nuscenes-spatialqa: A spatial understanding and reasoning benchmark for vision-language models in autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4567–4576, 2025

  37. [37]

    DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

    Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Yang Wang, Zhiyong Zhao, Kun Zhan, Peng Jia, Xianpeng Lang, and Hang Zhao. Drivevlm: The convergence of autonomous driving and large vision-language models. arXiv preprint arXiv:2402.12289, 2024

  38. [38]

    Idd: A dataset for exploring problems of autonomous navigation in unconstrained environments

    Girish Varma, Anbumani Subramanian, Anoop Namboodiri, Manmohan Chandraker, and CV Jawahar. Idd: A dataset for exploring problems of autonomous navigation in unconstrained environments. In 2019 IEEE winter conference on applications of computer vision (WACV), pages 1743–1751. IEEE, 2019

  39. [39]

    Omnidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning

    Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, and Jose M Alvarez. Omnidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning. In Proceedings of the computer vision and pattern recognition conference, pages 22442–22452, 2025

  40. [40]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025

  41. [41]

    Impact of regional driving behavior differences on traffic flow

    Yuting Wang, Zhaocheng He, Wangyong Xing, and Chengchuang Lin. Impact of regional driving behavior differences on traffic flow. Scientific Reports, 15(1):9027, 2025

  42. [42]

    Are vlms ready for autonomous driving? an empirical study from the reliability, data and metric perspectives

    Shaoyuan Xie, Lingdong Kong, Yuhao Dong, Chonghao Sima, Wenwei Zhang, Qi Alfred Chen, Ziwei Liu, and Liang Pan. Are vlms ready for autonomous driving? an empirical study from the reliability, data and metric perspectives. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6585–6597, 2025.

  43. [43]

    WOD-E2E: Waymo open dataset for end-to-end driving in challenging long-tail scenarios

    Runsheng Xu, Hubert Lin, Wonseok Jeon, Hao Feng, Yuliang Zou, Liting Sun, John Gorman, Kate Tolstaya, Sarah Tang, Brandyn White, et al. Wod-e2e: Waymo open dataset for end-to-end driving in challenging long-tail scenarios. arXiv preprint arXiv:2510.26125, 2025

  44. [44]

    Drivegpt4: Interpretable end-to-end autonomous driving via large language model

    Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kwan-Yee K Wong, Zhenguo Li, and Hengshuang Zhao. Drivegpt4: Interpretable end-to-end autonomous driving via large language model. IEEE Robotics and Automation Letters, 9(10):8186–8193, 2024

  45. [45]

    FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving

    Shuang Zeng, Xinyuan Chang, Mengwei Xie, Xinran Liu, Yifan Bai, Zheng Pan, Mu Xu, and Xing Wei. Futuresightdrive: Thinking visually with spatio-temporal cot for autonomous driving. arXiv preprint arXiv:2505.17685, 2025

  46. [46]

    Opendrivevla: Towards end-to-end autonomous driving with large vision language action model

    Xingcheng Zhou, Xuyuan Han, Feng Yang, Yunpu Ma, and Alois C Knoll. Opendrivevla: Towards end-to-end autonomous driving with large vision language action model. arXiv preprint arXiv:2503.23463, 2025

  47. [47]

    AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning

    Zewei Zhou, Tianhui Cai, Seth Z Zhao, Yun Zhang, Zhiyu Huang, Bolei Zhou, and Jiaqi Ma. Autovla: A vision-language-action model for end-to-end autonomous driving with adaptive reasoning and reinforcement fine-tuning. arXiv preprint arXiv:2506.13757, 2025

  48. [48]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025.

    A Overview. Our appendix includes the following sections:

  49. [49]

    Prompt templates for every evaluation setting and the full training recipe of DRIVEOPD (algorithm, data, hyperparameters, compute)

    Section B: Additional Implementation Details. Prompt templates for every evaluation setting and the full training recipe of DRIVEOPD (algorithm, data, hyperparameters, compute)

  50. [50]

    Full image-perturbation table, per-category rule-context ablation, and error-type analysis on the Qwen2.5-VL family

    Section C: Additional Results. Full image-perturbation table, per-category rule-context ablation, and error-type analysis on the Qwen2.5-VL family

  51. [51]

    The 13 culture-specific traffic categories, the 20-section per-country traffic-rule handbook, the counterfactual verification protocol, and the annotation tool used by human reviewers

    Section D: Benchmark Construction Details. The 13 culture-specific traffic categories, the 20-section per-country traffic-rule handbook, the counterfactual verification protocol, and the annotation tool used by human reviewers

  52. [52]

    Additional qualitative comparisons between base VLMs and DRIVEOPD across countries

    Section E: Extended Case Studies. Additional qualitative comparisons between base VLMs and DRIVEOPD across countries

  53. [53]

    this country

    Section F: Broader Impact. Discussion of the broader implications of GEODRIVE-BENCH. B Additional Implementation Details. B.1 DRIVEOPD Training Details. We instantiate DRIVEOPD on top of two open-source VLM backbones, Qwen2.5-VL-7B [4] and InternVL3-8B [48], yielding the two checkpoints denoted as DRIVEOPD† and DRIVEOPD‡ in the main paper. B...

  54. [54]

    Read scene_state to confirm what is actually visible

  55. [55]

    Apply candidate_rule (NOT origin_rule) to the scene

  56. [56]

    Pick the option that becomes correct under candidate_rule

  57. [57]

    answer_under_candidate

    Compare against origin_gt. Output STRICT JSON, no commentary: {"answer_under_candidate": "A|B|C|D", "differs_from_origin": true|false, "reason": "<one-sentence rationale grounded in candidate_rule>"} Decision: a candidate QA pair is RETAINE...

  58. [58]

    Look at the image carefully

  59. [59]

    Decide whether the provided GT_answer is supported by the image under the country-specific traffic context

  60. [60]

    verdict

    Output a JSON record with the verdict, your confidence, a one-paragraph rationale, and (for INCORRECT verdicts) the option you believe is actually correct. Inputs: scene_image: the candidate driving frame country: the country governing the rules for this item question: the multiple-choice q ...
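The counterfactual verification prompt quoted above asks for strict JSON with the fields `answer_under_candidate`, `differs_from_origin`, and `reason`. A minimal sketch of how such a reply might be validated is given below. The function names and the consistency check against `origin_gt` are assumptions made for illustration; the retention rule itself is truncated in the source and is deliberately not implemented here.

```python
import json

ALLOWED_ANSWERS = {"A", "B", "C", "D"}

def parse_verdict(raw):
    """Parse and validate one strict-JSON verifier reply; raise ValueError if malformed."""
    obj = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(obj, dict):
        raise ValueError("verdict must be a JSON object")
    answer = obj.get("answer_under_candidate")
    differs = obj.get("differs_from_origin")
    reason = obj.get("reason")
    if not isinstance(answer, str) or answer not in ALLOWED_ANSWERS:
        raise ValueError("answer_under_candidate must be one of A, B, C, D")
    if not isinstance(differs, bool):
        raise ValueError("differs_from_origin must be true or false")
    if not isinstance(reason, str) or not reason.strip():
        raise ValueError("reason must be a non-empty string")
    return {
        "answer_under_candidate": answer,
        "differs_from_origin": differs,
        "reason": reason,
    }

def is_consistent(verdict, origin_gt):
    """Hypothetical sanity check: the flag should match the actual comparison with origin_gt."""
    return verdict["differs_from_origin"] == (verdict["answer_under_candidate"] != origin_gt)

# Invented example reply, not taken from the paper.
reply = (
    '{"answer_under_candidate": "B", "differs_from_origin": true, '
    '"reason": "Under the candidate rule, vehicles keep left."}'
)
verdict = parse_verdict(reply)
print(verdict["answer_under_candidate"], is_consistent(verdict, "A"))
```

Rejecting malformed replies before any retention decision keeps parsing failures from being silently counted as either agreement or disagreement.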