pith. machine review for the scientific record.

arxiv: 2512.03454 · v4 · submitted 2025-12-03 · 💻 cs.CV · cs.AI

Think Before You Drive: World Model-Inspired Multimodal Grounding for Autonomous Vehicles

Pith reviewed 2026-05-17 03:06 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords visual grounding · autonomous driving · world models · multimodal learning · spatial reasoning · natural language commands · future prediction · hypergraph fusion

The pith

A world model that simulates future spatial states sharpens natural language object localization for autonomous vehicles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current visual grounding systems for self-driving cars often misinterpret commands that depend on context, 3D relations, or how the scene will change. This paper builds a Spatial-Aware World Model that first condenses the present view into a command-conditioned latent state and then generates a short rollout of future latent states. Those forward-looking representations feed a hypergraph decoder that fuses them with image and language features to select the target object. The result is higher accuracy on ambiguous, multi-agent, and long-text instructions, plus retained performance when only half the training data is available. If the forward simulation supplies reliable cues, passenger commands become easier to execute safely without constant clarification.

Core claim

ThinkDeeper reasons about future spatial states before grounding by distilling the current scene into a command-aware latent state inside a Spatial-Aware World Model and rolling out a sequence of future latent states; these states are then hierarchically fused with multimodal inputs in a hypergraph-guided decoder to localize referred objects more robustly than methods that operate only on the present frame.

What carries the argument

Spatial-Aware World Model (SA-WM) that distills the current scene into a command-aware latent state and rolls out future latent states to supply disambiguating cues for the grounding decoder.
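The distill-then-roll-out idea can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's SA-WM: the function names `distill_state` and `rollout`, the softmax-weighted pooling, and the fixed `transition` matrix are all hypothetical stand-ins for components the paper learns end to end.

```python
import math

def distill_state(scene_feats, command_feat):
    """Pool scene features into one latent state, weighting each region
    by the softmax of its dot product with the command embedding."""
    scores = [sum(s * c for s, c in zip(feat, command_feat)) for feat in scene_feats]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(command_feat)
    return [sum(w / total * feat[d] for w, feat in zip(weights, scene_feats))
            for d in range(dim)]

def rollout(state, transition, horizon):
    """Apply a fixed transition matrix `horizon` times with tanh squashing
    and return the future latent states (the current state is excluded)."""
    futures = []
    for _ in range(horizon):
        state = [math.tanh(sum(w * x for w, x in zip(row, state))) for row in transition]
        futures.append(state)
    return futures

# Toy inputs: three scene regions with 2-d features and a 2-d command embedding.
scene = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
command = [2.0, 0.0]
transition = [[0.9, 0.1], [-0.1, 0.9]]  # hypothetical stand-in for learned dynamics

z0 = distill_state(scene, command)
future_states = rollout(z0, transition, horizon=3)
```

Because the command embedding aligns with the first region, `z0` leans toward that region's features; the three entries of `future_states` are what a downstream decoder would consume alongside the image and language features.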

If this is right

  • Achieves first place on the Talk2Car leaderboard for language-based object localization in driving scenes.
  • Surpasses prior methods on the DrivePilot dataset and on MoCAD plus RefCOCO/+/g benchmarks.
  • Preserves high accuracy in long-text, multi-agent, and ambiguous command cases.
  • Delivers superior results even when trained on only 50 percent of the available data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same future-state reasoning could transfer to other language-guided robotic tasks in changing environments.
  • Direct coupling of the world model outputs to downstream motion planning might reduce separate perception-planning handoffs.
  • Real-world deployment would require checking whether prediction errors grow under rare but safety-critical events not seen in training.

Load-bearing premise

Simulated future latent states will supply reliable disambiguating information instead of noise or compounding errors inside the localization decoder.

What would settle it

Measure accuracy on a test set of scenes containing sudden unpredictable events that break the world model's rollout assumptions; a drop below baseline performance would show the future states are not helping.
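A minimal sketch of that check, assuming predictions and targets are axis-aligned `(x1, y1, x2, y2)` boxes and that a prediction counts as correct at an IoU of 0.5 or more. The threshold is a common convention for referring-expression benchmarks, not a value taken from the paper, and the helper names and toy boxes are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def grounding_accuracy(predictions, targets, threshold=0.5):
    """Fraction of predictions whose IoU with the target meets the threshold."""
    hits = sum(iou(p, t) >= threshold for p, t in zip(predictions, targets))
    return hits / len(targets)

def rollout_helps(full_preds, baseline_preds, targets):
    """True only if the full model beats the current-state-only baseline."""
    return grounding_accuracy(full_preds, targets) > grounding_accuracy(baseline_preds, targets)

# Toy "unpredictable event" subset: the full model misses the second object.
targets = [(0.0, 0.0, 10.0, 10.0), (20.0, 20.0, 30.0, 30.0)]
full_preds = [(1.0, 1.0, 10.0, 10.0), (0.0, 0.0, 5.0, 5.0)]
baseline_preds = [(0.0, 0.0, 10.0, 10.0), (21.0, 21.0, 30.0, 30.0)]

helps = rollout_helps(full_preds, baseline_preds, targets)
```

In this toy subset the full model scores below the baseline, which is the outcome that would count against the future-state rollouts.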

Figures

Figures reproduced from arXiv: 2512.03454 by Bonan Wang, Chengyue Wang, Chengzhong Xu, Dingyi Zhuang, Haicheng Liao, Hai Yang, Huanming Shen, Kehua Chen, Yihong Tang, Yongkang Li, Zhenning Li.

Figure 2: Illustration of depth-based spatial priors. (a) Real-world ...
Figure 3: Overview of the proposed DrivePilot. (a) An example of the multi-source representation for a real-world scene, including RAG ...
Figure 4: Illustration of the CoT prompting used in DrivePilot to generate semantic annotations for a given traffic scene. This step-by-step ...
Figure 5: Overview of ThinkDeeper. Multimodal Backbones first encode the image ...
Figure 6: Qualitative results of depth map, current/future latent states, and model performance on the DrivePilot dataset.
Figure 7: Ablation results of ThinkDeeper’s hyperparameter.
original abstract

Interpreting natural-language commands to localize target objects is critical for autonomous driving (AD). Existing visual grounding (VG) methods for autonomous vehicles (AVs) typically struggle with ambiguous, context-dependent instructions, as they lack reasoning over 3D spatial relations and anticipated scene evolution. Grounded in the principles of world models, we propose ThinkDeeper, a framework that reasons about future spatial states before making grounding decisions. At its core is a Spatial-Aware World Model (SA-WM) that learns to reason ahead by distilling the current scene into a command-aware latent state and rolling out a sequence of future latent states, providing forward-looking cues for disambiguation. Complementing this, a hypergraph-guided decoder then hierarchically fuses these states with the multimodal input, capturing higher-order spatial dependencies for robust localization. In addition, we present DrivePilot, a multi-source VG dataset in AD, featuring semantic annotations generated by a Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT)-prompted LLM pipeline. In extensive evaluations on six benchmarks, ThinkDeeper ranks #1 on the Talk2Car leaderboard and surpasses state-of-the-art baselines on DrivePilot, MoCAD, and RefCOCO/+/g benchmarks. Notably, it shows strong robustness and efficiency in challenging scenes (long-text, multi-agent, ambiguity) and retains superior performance even when trained on 50% of the data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes ThinkDeeper, a world-model-inspired framework for natural-language visual grounding in autonomous driving. Its core is a Spatial-Aware World Model (SA-WM) that distills a command-aware latent state from the current scene and rolls out a sequence of future latent states; these are fused by a hypergraph-guided decoder to capture higher-order spatial relations. The authors also release DrivePilot, a new multi-source VG dataset whose annotations were generated by a RAG+CoT LLM pipeline. Experiments report that ThinkDeeper ranks first on the Talk2Car leaderboard, outperforms prior methods on DrivePilot, MoCAD and RefCOCO/+/g, and remains robust under long-text, multi-agent and ambiguous conditions even when trained on only 50% of the data.

Significance. If the performance gains are shown to arise from the future-state rollouts rather than from decoder architecture or dataset differences, the work would provide a concrete demonstration that explicit forward simulation improves disambiguation in dynamic, context-dependent grounding tasks. The data-efficiency result and the new DrivePilot benchmark would be useful community resources for AV perception research.

major comments (3)
  1. [§3.2] §3.2 (SA-WM rollout): the central claim that future latent states supply net disambiguating signal is not supported by any ablation that compares the full model against a current-state-only baseline. Without this comparison it remains possible that observed gains on Talk2Car, DrivePilot and the 50 %-data regime are driven by the hypergraph decoder or dataset construction rather than by the world-model component.
  2. [§4.2] §4.2 (benchmark tables): no per-scene error breakdown or correlation between rollout prediction accuracy and grounding accuracy is reported for the long-text, multi-agent and ambiguous subsets highlighted in the abstract. This omission prevents verification that the rollout step improves rather than degrades performance under the conditions where it is most needed.
  3. [§4.1] §4.1 (DrivePilot construction): the RAG+CoT LLM annotation pipeline is described without quantitative validation (human agreement rates, inter-annotator agreement, or error analysis on the generated referring expressions). Because results on the new dataset are used to support the method’s superiority, the lack of such checks is load-bearing for the reliability of those claims.
minor comments (2)
  1. [§3.2] The latent-state dimensionality and rollout horizon are listed as free parameters in the method but their concrete values and sensitivity analysis are not provided in the experimental section.
  2. [Figure 2] Figure 2 (architecture diagram) would benefit from explicit arrows and labels distinguishing the command-aware distillation step from the subsequent rollout steps.
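The correlation analysis requested in major comment 2 could take roughly the following shape. This is a sketch under assumptions: each evaluation record is taken to be a `(subset, rollout_error, grounding_iou)` triple, Pearson's r is one reasonable choice of statistic among several, and the records shown are made-up toy values rather than results from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; NaN if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)

def correlation_by_subset(records):
    """Group (subset, rollout_error, grounding_iou) records by subset
    and correlate rollout error with grounding IoU inside each group."""
    grouped = {}
    for subset, rollout_error, grounding_iou in records:
        errors, ious = grouped.setdefault(subset, ([], []))
        errors.append(rollout_error)
        ious.append(grounding_iou)
    return {name: pearson(errors, ious) for name, (errors, ious) in grouped.items()}

# Toy records (hypothetical values, for illustration only).
records = [
    ("ambiguous", 0.1, 0.9), ("ambiguous", 0.2, 0.8), ("ambiguous", 0.4, 0.6),
    ("long-text", 0.1, 0.5), ("long-text", 0.2, 0.6), ("long-text", 0.3, 0.7),
]
per_subset = correlation_by_subset(records)
```

A strongly negative coefficient in a subset would be consistent with grounding quality depending on rollout quality there; a coefficient near zero or positive would suggest the rollout is not what drives accuracy in that subset.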

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.

point-by-point responses
  1. Referee: [§3.2] §3.2 (SA-WM rollout): the central claim that future latent states supply net disambiguating signal is not supported by any ablation that compares the full model against a current-state-only baseline. Without this comparison it remains possible that observed gains on Talk2Car, DrivePilot and the 50 %-data regime are driven by the hypergraph decoder or dataset construction rather than by the world-model component.

    Authors: We agree that an explicit ablation isolating the contribution of the future-state rollouts versus a current-state-only baseline is necessary to substantiate the central claim. In the revised manuscript we will add this comparison, training and evaluating a variant of ThinkDeeper that omits the SA-WM rollout and uses only the current latent state. This will clarify whether the reported gains are attributable to the world-model component rather than the decoder architecture or dataset. revision: yes

  2. Referee: [§4.2] §4.2 (benchmark tables): no per-scene error breakdown or correlation between rollout prediction accuracy and grounding accuracy is reported for the long-text, multi-agent and ambiguous subsets highlighted in the abstract. This omission prevents verification that the rollout step improves rather than degrades performance under the conditions where it is most needed.

    Authors: We acknowledge the value of granular analysis for the challenging subsets. We will add a per-scene error breakdown for the long-text, multi-agent, and ambiguous cases, together with a correlation analysis between rollout prediction accuracy and final grounding accuracy, in the updated experimental section of the revised manuscript. revision: yes

  3. Referee: [§4.1] §4.1 (DrivePilot construction): the RAG+CoT LLM annotation pipeline is described without quantitative validation (human agreement rates, inter-annotator agreement, or error analysis on the generated referring expressions). Because results on the new dataset are used to support the method’s superiority, the lack of such checks is load-bearing for the reliability of those claims.

    Authors: We recognize that quantitative validation of the RAG+CoT annotation pipeline is important for establishing the reliability of DrivePilot. We will conduct and report human agreement rates, inter-annotator agreement, and error analysis on a sampled subset of the generated referring expressions in the revised manuscript. revision: yes
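For the annotation check promised in response 3, one standard agreement statistic is Cohen's kappa between two raters. The sketch below assumes two human annotators each label a sample of generated referring expressions as `"ok"` or `"bad"`; the label set, the sample, and the choice of kappa are illustrative assumptions, not details from the paper or the rebuttal.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)

# Toy sample of six generated expressions judged by two annotators.
rater_a = ["ok", "ok", "bad", "ok", "bad", "ok"]
rater_b = ["ok", "bad", "bad", "ok", "bad", "ok"]
kappa = cohens_kappa(rater_a, rater_b)
```

Here the raters agree on five of six items, chance agreement is 0.5, and kappa comes out at two thirds. Raw agreement alone would overstate reliability when one label dominates, which is why a chance-corrected statistic is usually reported.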

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces architectural components (SA-WM for command-aware latent distillation and future rollout, plus hypergraph decoder) and a new dataset (DrivePilot via RAG/CoT LLM), then reports empirical rankings and robustness on external benchmarks including Talk2Car, DrivePilot, MoCAD, and RefCOCO variants. No equations or steps in the provided description reduce a claimed prediction or result to a fitted input by construction, nor does any load-bearing premise collapse to a self-citation or self-defined ansatz. The central claims rest on verifiable performance deltas against baselines rather than internal redefinitions or forced statistical equivalence.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 3 invented entities

The central claim rests on the unproven effectiveness of distilling scenes into command-aware latent states and rolling them forward; specific neural-network hyperparameters, loss weights, and the fidelity of the world-model rollout are not detailed in the abstract.

free parameters (1)
  • latent state dimensionality and rollout horizon
    Typical world-model hyperparameters that must be chosen or tuned; not specified in abstract.
axioms (1)
  • domain assumption: Future latent states generated by the SA-WM provide useful disambiguation signals for the current grounding task.
    Invoked as the core motivation for the world-model component.
invented entities (3)
  • Spatial-Aware World Model (SA-WM) · no independent evidence
    purpose: Distills current scene into command-aware latent state and rolls out future states for grounding cues.
    New component introduced in the paper.
  • Hypergraph-guided decoder · no independent evidence
    purpose: Hierarchically fuses multimodal inputs and future states to capture higher-order spatial dependencies.
    New architectural element proposed for the decoder.
  • DrivePilot dataset · no independent evidence
    purpose: Multi-source visual grounding dataset for autonomous driving with RAG+CoT LLM annotations.
    New dataset contributed by the paper.
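To make the hypergraph-guided decoder entry above more concrete, here is a toy round of hypergraph message passing. It is a generic mean-aggregation scheme (nodes to hyperedges, then hyperedges back to nodes), swapped in for illustration; the paper's actual decoder, its hyperedge construction, and its learned weights are not reproduced here, and the node roles in the example are hypothetical.

```python
def hypergraph_fuse(node_feats, hyperedges):
    """One round of mean aggregation: nodes -> hyperedges -> nodes.

    `hyperedges` is a list of node-index lists; unlike an ordinary graph
    edge, a hyperedge may join any number of nodes at once.
    """
    dim = len(node_feats[0])
    # Each hyperedge feature is the mean of its member nodes.
    edge_feats = [
        [sum(node_feats[i][d] for i in members) / len(members) for d in range(dim)]
        for members in hyperedges
    ]
    fused = []
    for i, feat in enumerate(node_feats):
        incident = [e for e, members in zip(edge_feats, hyperedges) if i in members]
        if not incident:
            fused.append(list(feat))  # isolated node keeps its own feature
            continue
        # Each node feature becomes the mean of its incident hyperedges.
        fused.append([sum(e[d] for e in incident) / len(incident) for d in range(dim)])
    return fused

# Hypothetical nodes: 0 = image region, 1 = command token,
# 2 = current latent state, 3 = future latent state.
nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [3.0, 3.0]]
hyperedges = [[0, 1, 2], [2, 3]]  # one multimodal group, one temporal group
fused = hypergraph_fuse(nodes, hyperedges)
```

The current latent state (node 2) sits in both hyperedges, so after one round it mixes information from the image, the command, and the future state, which is the kind of higher-order coupling a pairwise graph would need several hops to express.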

pith-pipeline@v0.9.0 · 5594 in / 1488 out tokens · 33362 ms · 2026-05-17T03:06:31.787453+00:00 · methodology


Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · 10 internal anchors

  1. [1]

    ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth

    Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, and Matthias M ¨uller. Zoedepth: Zero-shot trans- fer by combining relative and metric depth.arXiv preprint arXiv:2302.12288, 2023. 4

  2. [2]

    nuscenes: A multi- modal dataset for autonomous driving

    Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Gi- ancarlo Baldan, and Oscar Beijbom. nuscenes: A multi- modal dataset for autonomous driving. InProceedings of the IEEE/CVF CVPR, pages 11621–11631, 2020. 3, 5

  3. [3]

    Ground- ing commands for autonomous vehicles via layer fusion with region-specific dynamic layer attention

    Hou Pong Chan, Mingxi Guo, and Cheng-Zhong Xu. Ground- ing commands for autonomous vehicles via layer fusion with region-specific dynamic layer attention. In2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 12464–12470. IEEE, 2022. 7

  4. [4]

    Mpcct: Multimodal vision-language learning paradigm with context- based compact transformer.Pattern Recognition, 147:110084,

    Chongqing Chen, Dezhi Han, and Chin-Chen Chang. Mpcct: Multimodal vision-language learning paradigm with context- based compact transformer.Pattern Recognition, 147:110084,

  5. [5]

    MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning

    Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechu Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. Minigpt-v2: large language model as a unified interface for vision-language multi-task learning.arXiv preprint arXiv:2310.09478, 2023. 1, 3, 7

  6. [6]

    Ref-nms: Breaking proposal bottlenecks in two-stage referring expression grounding

    Long Chen, Wenbo Ma, Jun Xiao, Hanwang Zhang, and Shih-Fu Chang. Ref-nms: Breaking proposal bottlenecks in two-stage referring expression grounding. InProceedings of the AAAI conference on artificial intelligence, pages 1036– 1044, 2021. 3

  7. [7]

    Drivinggpt: Unifying driving world modeling and planning with multi- modal autoregressive transformers

    Yuntao Chen, Yuqi Wang, and Zhaoxiang Zhang. Drivinggpt: Unifying driving world modeling and planning with multi- modal autoregressive transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 26890–26900, 2025. 3

  8. [8]

    Uniter: Universal image-text representation learning

    Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. Uniter: Universal image-text representation learning. InEuropean conference on computer vision, pages 104–120, 2020. 7

  9. [9]

    Com- mands for autonomous vehicles by progressively stacking visual-linguistic representations

    Hang Dai, Shujie Luo, Yong Ding, and Ling Shao. Com- mands for autonomous vehicles by progressively stacking visual-linguistic representations. InComputer Vision– ECCV Workshops, pages 27–32, 2020. 7

  10. [10]

    Simvg: A simple framework for visual ground- ing with decoupled multi-modal fusion.Advances in neural information processing systems, 37:121670–121698, 2024

    Ming Dai, Lingfeng Yang, Yihao Xu, Zhenhua Feng, and Wankou Yang. Simvg: A simple framework for visual ground- ing with decoupled multi-modal fusion.Advances in neural information processing systems, 37:121670–121698, 2024. 1

  11. [11]

    Transvg: End-to-end visual ground- ing with transformers

    Jiajun Deng, Zhengyuan Yang, Tianlang Chen, Wengang Zhou, and Houqiang Li. Transvg: End-to-end visual ground- ing with transformers. InProceedings of the IEEE/CVF ICCV, pages 1769–1779, 2021. 7

  12. [12]

    Talk2car: Taking control of your self-driving car

    Thierry Deruyttere, Simon Vandenhende, Dusan Grujicic, Luc Van Gool, and Marie-Francine Moens. Talk2car: Taking con- trol of your self-driving car.arXiv preprint arXiv:1909.10838,

  13. [13]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding.arXiv preprint arXiv:1810.04805, 2018. 5

  14. [14]

    Understanding world or predict- ing future? a comprehensive survey of world models.ACM Computing Surveys, 58(3):1–38, 2025

    Jingtao Ding, Yunke Zhang, Yu Shang, Yuheng Zhang, Ze- fang Zong, Jie Feng, Yuan Yuan, Hongyuan Su, Nian Li, Nicholas Sukiennik, et al. Understanding world or predict- ing future? a comprehensive survey of world models.ACM Computing Surveys, 58(3):1–38, 2025. 3

  15. [15]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, et al. An image is worth 16x16 words: Trans- formers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020. 4

  16. [16]

    Gpt-3.5, gpt-4, or bard? evaluating llms reasoning ability in zero-shot setting and performance boosting through prompts.Natural Language Processing Journal, 5:100032,

    Jessica L ´opez Espejel, El Hassane Ettifouri, Mahaman Sanoussi Yahaya Alassan, El Mehdi Chouham, and Walid Dahhane. Gpt-3.5, gpt-4, or bard? evaluating llms reasoning ability in zero-shot setting and performance boosting through prompts.Natural Language Processing Journal, 5:100032,

  17. [17]

    Human decisions in moral dilem- mas are largely described by utilitarianism: Virtual car driving study provides guidelines for autonomous driving vehicles

    Anja K Faulhaber, Anke Dittmer, Felix Blind, Maximilian A W¨achter, Silja Timm, Leon R S¨utfeld, Achim Stephan, Gor- don Pipa, and Peter K¨onig. Human decisions in moral dilem- mas are largely described by utilitarianism: Virtual car driving study provides guidelines for autonomous driving vehicles. Science and engineering ethics, 25:399–418, 2019. 1

  18. [18]

    Large-scale adversarial training for vision- and-language representation learning.Nips, pages 6616–6628,

    Zhe Gan, Yen-Chun Chen, Linjie Li, Chen Zhu, Yu Cheng, and Jingjing Liu. Large-scale adversarial training for vision- and-language representation learning.Nips, pages 6616–6628,

  19. [19]

    Vista: A generalizable driving world model with high fidelity and versatile controllability.Advances in Neural Information Processing Systems, 37:91560–91596, 2024

    Shenyuan Gao, Jiazhi Yang, Li Chen, Kashyap Chitta, Yihang Qiu, Andreas Geiger, Jun Zhang, and Hongyang Li. Vista: A generalizable driving world model with high fidelity and versatile controllability.Advances in Neural Information Processing Systems, 37:91560–91596, 2024. 3

  20. [20]

    Zeyu Gao, Yao Mu, Chen Chen, Jingliang Duan, Ping Luo, Yanfeng Lu, and Shengbo Eben Li. Enhance sample efficiency and robustness of end-to-end urban autonomous driving via semantic masked world model.IEEE Transactions on Intel- ligent Transportation Systems, 25(10):13067–13079, 2024. 3

  21. [21]

    Pseudo-ev: Enhancing 3d visual grounding with pseudo em- bodied viewpoint.IEEE Transactions on Circuits and Systems for Video Technology, 2025

    Liang Geng, Jianqin Yin, Gang Chen, and Qingxuan Jia. Pseudo-ev: Enhancing 3d visual grounding with pseudo em- bodied viewpoint.IEEE Transactions on Circuits and Systems for Video Technology, 2025. 1

  22. [22]

    Fast r-cnn

    Ross Girshick. Fast r-cnn. InProceedings of the IEEE inter- national conference on computer vision, pages 1440–1448,

  23. [23]

    Understanding the dif- ficulty of training deep feedforward neural networks

    Xavier Glorot and Yoshua Bengio. Understanding the dif- ficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256. JMLR Workshop and Conference Proceedings, 2010. 6

  24. [24]

    World models for autonomous driving: An initial survey.IEEE Transactions on Intelligent Vehicles, 2024

    Yanchen Guan, Haicheng Liao, Zhenning Li, Jia Hu, Runze Yuan, Yunjian Li, Guohui Zhang, and Chengzhong Xu. World models for autonomous driving: An initial survey.IEEE Transactions on Intelligent Vehicles, 2024. 3

  25. [25]

    World model-based end-to-end scene generation for accident anticipation in au- tonomous driving.Communications Engineering, 4(1):144,

    Yanchen Guan, Haicheng Liao, Chengyue Wang, Xingcheng Liu, Jiaxun Zhang, and Zhenning Li. World model-based end-to-end scene generation for accident anticipation in au- tonomous driving.Communications Engineering, 4(1):144,

  26. [26]

    Recurrent world models facilitate policy evolution.Advances in neural information processing systems, 31, 2018

    David Ha and J¨urgen Schmidhuber. Recurrent world models facilitate policy evolution.Advances in neural information processing systems, 31, 2018. 3

  27. [27]

    Learning to compose and reason with lan- guage tree structures for visual grounding.IEEE TPAMI, pages 684–696, 2019

    Richang Hong, Daqing Liu, Xiaoyu Mo, Xiangnan He, and Hanwang Zhang. Learning to compose and reason with lan- guage tree structures for visual grounding.IEEE TPAMI, pages 684–696, 2019. 7

  28. [28]

    Pseudo-q: Generating pseudo language queries for visual grounding

    Haojun Jiang, Yuanze Lin, Dongchen Han, Shiji Song, and Gao Huang. Pseudo-q: Generating pseudo language queries for visual grounding. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 15513–15523, 2022. 1

  29. [29]

    Mdetr-modulated detection for end-to-end multi-modal understanding

    Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Syn- naeve, Ishan Misra, and Nicolas Carion. Mdetr-modulated detection for end-to-end multi-modal understanding. InPro- ceedings of the IEEE/CVF ICCV, pages 1780–1790, 2021. 7

  30. [30]

    Referitgame: Referring to objects in pho- tographs of natural scenes

    Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in pho- tographs of natural scenes. InProceedings of the 2014 con- ference on empirical methods in natural language processing (EMNLP), pages 787–798, 2014. 2

  31. [31]

    Autonomous vehicles and moral judgments under risk.Transportation research part A: policy and practice, 155:1–10, 2022

    Sebastian Kr¨ugel and Matthias Uhl. Autonomous vehicles and moral judgments under risk.Transportation research part A: policy and practice, 155:1–10, 2022. 1

  32. [32]

    Blip: Bootstrapping language-image pre-training for unified vision- language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision- language understanding and generation. InInternational Con- ference on Machine Learning, pages 12888–12900. PMLR,

  33. [33]

    Enhancing End-to-End Autonomous Driving with Latent World Model

    Yingyan Li, Lue Fan, Jiawei He, Yuqi Wang, Yuntao Chen, Zhaoxiang Zhang, and Tieniu Tan. Enhancing end-to-end autonomous driving with latent world model.arXiv preprint arXiv:2406.08481, 2024. 3

  34. [34]

    Steering the future: Redefining intelligent transportation systems with foundation models.CHAIN, 1(1): 46–53, 2024

    Zhenning Li et al. Steering the future: Redefining intelligent transportation systems with foundation models.CHAIN, 1(1): 46–53, 2024. 4, 3

  35. [35]

    When, where, and what? a benchmark for accident anticipation and localization with large language models

    Haicheng Liao, Yongkang Li, Chengyue Wang, Yanchen Guan, Kahou Tam, Chunlin Tian, Li Li, Chengzhong Xu, and Zhenning Li. When, where, and what? a benchmark for accident anticipation and localization with large language models. InACM International Conference on Multimedia (ACM MM), Oral Presentation, pages 8–17, 2024. 1

  36. [36]

    Gpt- 4 enhanced multimodal grounding for autonomous driving: Leveraging cross-modal attention with large language models

    Haicheng Liao, Huanming Shen, Zhenning Li, Chengyue Wang, Guofa Li, Yiming Bie, and Chengzhong Xu. Gpt- 4 enhanced multimodal grounding for autonomous driving: Leveraging cross-modal attention with large language models. Communications in Transportation Research, 4:100116, 2024. 1, 6, 7

  37. [37]

    Cot-drive: Efficient motion forecasting for autonomous driving with llms and chain-of-thought prompting.IEEE Transactions on Artificial Intelligence, 2025

    Haicheng Liao, Hanlin Kong, Bonan Wang, Chengyue Wang, Wang Ye, Zhengbing He, Chengzhong Xu, and Zhenning Li. Cot-drive: Efficient motion forecasting for autonomous driving with llms and chain-of-thought prompting.IEEE Transactions on Artificial Intelligence, 2025. 2, 1

  38. [38]

    A real-time cross-modality correlation filtering method for referring expression comprehension

    Yue Liao, Si Liu, Guanbin Li, Fei Wang, Yanjie Chen, Chen Qian, and Bo Li. A real-time cross-modality correlation filtering method for referring expression comprehension. In Proceedings of the IEEE/CVF CVPR, pages 10880–10889,

  39. [39]

    Focal loss for dense object detection

    Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Doll´ar. Focal loss for dense object detection. InPro- ceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017. 5

  40. [40]

    Learning to assemble neural module tree networks for visual grounding

    Daqing Liu, Hanwang Zhang, Feng Wu, and Zheng-Jun Zha. Learning to assemble neural module tree networks for visual grounding. InProceedings of the IEEE/CVF ICCV, pages 4673–4682, 2019. 7

  41. [41]

    Referring image segmentation using text supervision

    Fang Liu, Yuhao Liu, Yuqiu Kong, Ke Xu, Lihe Zhang, Bao- cai Yin, Gerhard Hancke, and Rynson Lau. Referring image segmentation using text supervision. InProceedings of the IEEE/CVF ICCV, pages 22124–22134, 2023. 2

  42. [42]

    Llava-next: Improved reason- ing, ocr, and world knowledge, 2024

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reason- ing, ocr, and world knowledge, 2024. 2, 7

  43. [43]

    Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection.arXiv preprint arXiv:2303.05499, 2023. 1, 7

  44. [44]

    Infinicube: Unbounded and controllable dynamic 3d driving scene generation with world-guided video models

    Yifan Lu, Xuanchi Ren, Jiawei Yang, Tianchang Shen, Zhangjie Wu, Jun Gao, Yue Wang, Siheng Chen, Mike Chen, Sanja Fidler, et al. Infinicube: Unbounded and controllable dynamic 3d driving scene generation with world-guided video models. InProceedings of the IEEE/CVF International Con- ference on Computer Vision, pages 27272–27283, 2025. 3

  45. [45]

    C4av: learning cross-modal representations from transformers

    Shujie Luo, Hang Dai, Ling Shao, and Yong Ding. C4av: learning cross-modal representations from transformers. In Computer Vision–ECCV 2020, pages 33–38, 2020. 7

  46. [46]

    Position: Prospective of au- tonomous driving—multimodal LLMs world models embod- ied intelligence AI alignment and mamba

    Yunsheng Ma, Wenqian Ye, Can Cui, Haiming Zhang, Shuo Xing, Fucai Ke, Jinhong Wang, Chenglin Miao, Jintai Chen, Hamid Rezatofighi, et al. Position: Prospective of au- tonomous driving—multimodal LLMs world models embod- ied intelligence AI alignment and mamba. InProceedings of the Winter Conference on Applications of Computer Vision (WACV), pages 1010–102...

  47. [47]

    Enhanc- ing clip with gpt-4: Harnessing visual descriptions as prompts

    Mayug Maniparambil, Chris V orster, Derek Molloy, Noel Murphy, Kevin McGuinness, and Noel E O’Connor. Enhanc- ing clip with gpt-4: Harnessing visual descriptions as prompts. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 262–271, 2023. 1

  48. [48]

    Attngrounder: Talking to cars with attention

    Vivek Mittal. Attngrounder: Talking to cars with attention. In Computer Vision– ECCV Workshops, pages 62–73, 2020. 7

  49. [49]

    Recondreamer: Crafting world models for driving scene reconstruction via online restora- tion

    Chaojun Ni, Guosheng Zhao, Xiaofeng Wang, Zheng Zhu, Wenkang Qin, Guan Huang, Chen Liu, Yuyin Chen, Yida Wang, Xueyang Zhang, et al. Recondreamer: Crafting world models for driving scene reconstruction via online restora- tion. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 1559–1569, 2025. 3

  50. [50]

    Cosine meets softmax: A tough-to-beat baseline for visual grounding

    Nivedita Rufus, Unni Krishnan R Nair, K Madhava Krishna, and Vineet Gandhi. Cosine meets softmax: A tough-to-beat baseline for visual grounding. InComputer Vision– ECCV Workshops, pages 39–50, 2020. 7

  51. [51]

    Tversky loss function for image segmentation using 3d fully convolutional deep networks

    Seyed Sadegh Mohseni Salehi, Deniz Erdogmus, and Ali Gholipour. Tversky loss function for image segmentation using 3d fully convolutional deep networks. InInternational workshop on machine learning in medical imaging, pages 379–387. Springer, 2017. 5

  52. [52]

    Lxmert: Learning cross-modality encoder representations from transformers

    Hao Tan and Mohit Bansal. Lxmert: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490, 2019. 3

  53. [53]

    Context disentangling and prototype inheriting for robust visual grounding

    Wei Tang, Liang Li, Xuejing Liu, Lu Jin, Jinhui Tang, and Zechao Li. Context disentangling and prototype inheriting for robust visual grounding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(5):3213–3229, 2024. 1, 7

  54. [54]

    Chengyue Wang, Haicheng Liao, Zhenning Li, and Chengzhong Xu. Wake: Towards robust and physically feasible trajectory prediction for autonomous vehicles with wavelet and kinematics synergy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025. 1

  55. [55]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024. 2, 3, 7

  56. [56]

    Drivedreamer: Towards real-world-drive world models for autonomous driving

    Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, Jiagang Zhu, and Jiwen Lu. Drivedreamer: Towards real-world-drive world models for autonomous driving. In European conference on computer vision, pages 55–72. Springer, 2024. 3

  57. [57]

    Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving

    Yuqi Wang, Jiawei He, Lue Fan, Hongxin Li, Yuntao Chen, and Zhaoxiang Zhang. Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14749–14759, 2024. 3

  58. [58]

    Universal instance perception as object discovery and retrieval

    Bin Yan, Yi Jiang, Jiannan Wu, Dong Wang, Ping Luo, Zehuan Yuan, and Huchuan Lu. Universal instance perception as object discovery and retrieval. In Proceedings of the IEEE/CVF CVPR, pages 15325–15336, 2023. 7

  59. [59]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025. 7

  60. [60]

    Improving visual grounding with visual-linguistic verification and iterative reasoning

    Li Yang, Yan Xu, Chunfeng Yuan, Wei Liu, Bing Li, and Weiming Hu. Improving visual grounding with visual-linguistic verification and iterative reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9499–9508, 2022. 2, 3, 6, 7

  61. [61]

    A fast and accurate one-stage approach to visual grounding

    Zhengyuan Yang, Boqing Gong, Liwei Wang, Wenbing Huang, Dong Yu, and Jiebo Luo. A fast and accurate one-stage approach to visual grounding. In Proceedings of the IEEE/CVF ICCV, pages 4683–4693, 2019. 7

  62. [62]

    Improving one-stage visual grounding by recursive sub-query construction

    Zhengyuan Yang, Tianlang Chen, Liwei Wang, and Jiebo Luo. Improving one-stage visual grounding by recursive sub-query construction. In Computer Vision–ECCV 2020, pages 387–404, 2020. 3, 7

  63. [63]

    Rag-driver: Generalisable driving explanations with retrieval-augmented in-context learning in multi-modal large language model

    Jianhao Yuan, Shuyang Sun, Daniel Omeiza, Bo Zhao, Paul Newman, Lars Kunze, and Matthew Gadd. Rag-driver: Generalisable driving explanations with retrieval-augmented in-context learning in multi-modal large language model. arXiv preprint arXiv:2402.10828, 2024. 3

  64. [64]

    FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving

    Shuang Zeng, Xinyuan Chang, Mengwei Xie, Xinran Liu, Yifan Bai, Zheng Pan, Mu Xu, and Xing Wei. Futuresightdrive: Thinking visually with spatio-temporal cot for autonomous driving. arXiv preprint arXiv:2505.17685, 2025. 3

  65. [65]

    Mono3dvg: 3d visual grounding in monocular images

    Yang Zhan, Yuan Yuan, and Zhitong Xiong. Mono3dvg: 3d visual grounding in monocular images. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 6988–6996,

  66. [66]

    Drivedreamer4d: World models are effective data machines for 4d driving scene representation

    Guosheng Zhao, Chaojun Ni, Xiaofeng Wang, Zheng Zhu, Xueyang Zhang, Yida Wang, Guan Huang, Xinze Chen, Boyuan Wang, Youyi Zhang, et al. Drivedreamer4d: World models are effective data machines for 4d driving scene representation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12015–12026, 2025. 3

  67. [67]

    Drivedreamer-2: Llm-enhanced world models for diverse driving video generation

    Guosheng Zhao, Xiaofeng Wang, Zheng Zhu, Xinze Chen, Guan Huang, Xiaoyi Bao, and Xingang Wang. Drivedreamer-2: Llm-enhanced world models for diverse driving video generation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 10412–10420, 2025. 3

  68. [68]

    Occworld: Learning a 3d occupancy world model for autonomous driving

    Wenzhao Zheng, Weiliang Chen, Yuanhui Huang, Borui Zhang, Yueqi Duan, and Jiwen Lu. Occworld: Learning a 3d occupancy world model for autonomous driving. In European conference on computer vision, pages 55–72. Springer,

  69. [69]

    World4drive: End-to-end autonomous driving via intention-aware physical latent world model

    Yupeng Zheng, Pengxuan Yang, Zebin Xing, Qichao Zhang, Yuhang Zheng, Yinfeng Gao, Pengfei Li, Teng Zhang, Zhongpu Xia, Peng Jia, et al. World4drive: End-to-end autonomous driving via intention-aware physical latent world model. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 28632–28642, 2025. 3

  70. [70]

    Objects as Points

    Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points. arXiv preprint arXiv:1904.07850, 2019. 4

  71. [71]

    Gaussianworld: Gaussian world model for streaming 3d occupancy prediction

    Sicheng Zuo, Wenzhao Zheng, Yuanhui Huang, Jie Zhou, and Jiwen Lu. Gaussianworld: Gaussian world model for streaming 3d occupancy prediction. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 6772–6781, 2025. 3 Appendix A. DrivePilot Dataset A.1. Step-1: In-Context RAG Annotation To enhance LLM reasoning with real-world drivin...