Think Before You Drive: World Model-Inspired Multimodal Grounding for Autonomous Vehicles
Pith reviewed 2026-05-17 03:06 UTC · model grok-4.3
The pith
A world model that simulates future spatial states sharpens natural language object localization for autonomous vehicles.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ThinkDeeper reasons about future spatial states before grounding by distilling the current scene into a command-aware latent state inside a Spatial-Aware World Model and rolling out a sequence of future latent states; these states are then hierarchically fused with multimodal inputs in a hypergraph-guided decoder to localize referred objects more robustly than methods that operate only on the present frame.
What carries the argument
Spatial-Aware World Model (SA-WM) that distills the current scene into a command-aware latent state and rolls out future latent states to supply disambiguating cues for the grounding decoder.
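To make the distill-then-roll-out idea concrete, here is a minimal sketch. It is not the paper's SA-WM: the linear maps with `tanh`, the random weights standing in for learned parameters, and every name and dimension (`distill`, `rollout`, `HORIZON`, and so on) are illustrative assumptions.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(matrix: Matrix, vec: Vector) -> Vector:
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def distill(scene: Vector, command: Vector, w_scene: Matrix, w_cmd: Matrix) -> Vector:
    """Fuse scene and command features into one command-aware latent state."""
    projected_scene = matvec(w_scene, scene)
    projected_cmd = matvec(w_cmd, command)
    return [math.tanh(s + c) for s, c in zip(projected_scene, projected_cmd)]


def rollout(z0: Vector, w_dyn: Matrix, horizon: int) -> List[Vector]:
    """Apply the same transition `horizon` times, collecting each future state."""
    states: List[Vector] = []
    z = z0
    for _ in range(horizon):
        z = [math.tanh(v) for v in matvec(w_dyn, z)]
        states.append(z)
    return states


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Random weights; a stand-in for parameters a real model would learn."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
SCENE_DIM, CMD_DIM, LATENT_DIM, HORIZON = 6, 3, 4, 3  # hypothetical sizes
w_scene = random_matrix(LATENT_DIM, SCENE_DIM, rng)
w_cmd = random_matrix(LATENT_DIM, CMD_DIM, rng)
w_dyn = random_matrix(LATENT_DIM, LATENT_DIM, rng)

scene = [rng.random() for _ in range(SCENE_DIM)]    # placeholder visual features
command = [rng.random() for _ in range(CMD_DIM)]    # placeholder text features

z0 = distill(scene, command, w_scene, w_cmd)
future_states = rollout(z0, w_dyn, HORIZON)

# What a grounding decoder would receive in each configuration.
decoder_inputs = [z0] + future_states
current_only_inputs = [z0] + rollout(z0, w_dyn, 0)  # horizon 0: present frame only
```

Setting the horizon to zero yields the present-frame-only configuration, which is the kind of comparison needed to tell whether the rolled-out states add signal or noise.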
If this is right
- Achieves first place on the Talk2Car leaderboard for language-based object localization in driving scenes.
- Surpasses prior methods on the DrivePilot dataset and on MoCAD plus RefCOCO/+/g benchmarks.
- Preserves high accuracy in long-text, multi-agent, and ambiguous command cases.
- Delivers superior results even when trained on only 50 percent of the available data.
Where Pith is reading between the lines
- The same future-state reasoning could transfer to other language-guided robotic tasks in changing environments.
- Direct coupling of the world model outputs to downstream motion planning might reduce separate perception-planning handoffs.
- Real-world deployment would require checking whether prediction errors grow under rare but safety-critical events not seen in training.
Load-bearing premise
Simulated future latent states will supply reliable disambiguating information instead of noise or compounding errors inside the localization decoder.
What would settle it
Measure accuracy on a test set of scenes containing sudden unpredictable events that break the world model's rollout assumptions; a drop below baseline performance would show the future states are not helping.
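The test described above can be sketched as a comparison of grounding accuracy between the full model and a baseline on the same subset. The sketch assumes a prediction counts as correct when its IoU with the ground-truth box exceeds 0.5; the boxes are made-up toy values, and `grounding_accuracy` is a hypothetical helper, not code from the paper.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(preds: Sequence[Box], targets: Sequence[Box],
                       threshold: float = 0.5) -> float:
    """Fraction of predictions whose IoU with the target exceeds `threshold`."""
    if len(preds) != len(targets) or not preds:
        raise ValueError("need equally sized, non-empty prediction and target lists")
    hits = sum(iou(p, t) > threshold for p, t in zip(preds, targets))
    return hits / len(preds)


# Toy "sudden event" subset: invented boxes, for illustration only.
targets = [(0, 0, 10, 10), (20, 20, 30, 30)]
full_model_preds = [(0, 0, 10, 10), (0, 0, 5, 5)]
baseline_preds = [(1, 1, 10, 10), (21, 21, 30, 30)]

acc_full = grounding_accuracy(full_model_preds, targets)
acc_base = grounding_accuracy(baseline_preds, targets)
rollout_hurts = acc_full < acc_base  # the outcome that would undercut the claim
```

On a real evaluation the two prediction lists would come from the full model and from a baseline run on an identical set of scenes, so that the only difference is the presence of the rollout.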
Original abstract
Interpreting natural-language commands to localize target objects is critical for autonomous driving (AD). Existing visual grounding (VG) methods for autonomous vehicles (AVs) typically struggle with ambiguous, context-dependent instructions, as they lack reasoning over 3D spatial relations and anticipated scene evolution. Grounded in the principles of world models, we propose ThinkDeeper, a framework that reasons about future spatial states before making grounding decisions. At its core is a Spatial-Aware World Model (SA-WM) that learns to reason ahead by distilling the current scene into a command-aware latent state and rolling out a sequence of future latent states, providing forward-looking cues for disambiguation. Complementing this, a hypergraph-guided decoder then hierarchically fuses these states with the multimodal input, capturing higher-order spatial dependencies for robust localization. In addition, we present DrivePilot, a multi-source VG dataset in AD, featuring semantic annotations generated by a Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT)-prompted LLM pipeline. In extensive evaluations on six benchmarks, ThinkDeeper ranks #1 on the Talk2Car leaderboard and surpasses state-of-the-art baselines on DrivePilot, MoCAD, and RefCOCO/+/g benchmarks. Notably, it shows strong robustness and efficiency in challenging scenes (long-text, multi-agent, ambiguity) and retains superior performance even when trained on 50% of the data.
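The abstract does not spell out how the hypergraph-guided decoder works. As a generic illustration of why hyperedges capture higher-order relations, the sketch below runs one round of mean aggregation over a hypergraph: each hyperedge pools all of its member nodes at once, then each node pools the hyperedges it belongs to. The function name and the averaging rule are assumptions chosen for simplicity, not the paper's design.

```python
from typing import List


def hypergraph_pass(node_feats: List[List[float]],
                    hyperedges: List[List[int]]) -> List[List[float]]:
    """One node -> hyperedge -> node round of mean aggregation."""
    dim = len(node_feats[0])

    # Step 1: each hyperedge averages the features of all its member nodes,
    # so a single edge can relate three or more objects at once.
    edge_feats = [
        [sum(node_feats[n][d] for n in members) / len(members) for d in range(dim)]
        for members in hyperedges
    ]

    # Step 2: each node averages the features of the hyperedges it belongs to.
    updated: List[List[float]] = []
    for idx, feats in enumerate(node_feats):
        incident = [edge_feats[e] for e, members in enumerate(hyperedges) if idx in members]
        if not incident:
            updated.append(list(feats))  # isolated nodes keep their own features
            continue
        updated.append([sum(e[d] for e in incident) / len(incident) for d in range(dim)])
    return updated


# Four toy nodes with 1-D features; one hyperedge groups the first three.
nodes = [[0.0], [3.0], [6.0], [9.0]]
edges = [[0, 1, 2]]
result = hypergraph_pass(nodes, edges)
```

An ordinary graph edge would relate only two nodes at a time; the single hyperedge here mixes three, which is the property the abstract refers to as higher-order spatial dependencies.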
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes ThinkDeeper, a world-model-inspired framework for natural-language visual grounding in autonomous driving. Its core is a Spatial-Aware World Model (SA-WM) that distills a command-aware latent state from the current scene and rolls out a sequence of future latent states; these are fused by a hypergraph-guided decoder to capture higher-order spatial relations. The authors also release DrivePilot, a new multi-source VG dataset whose annotations were generated by a RAG+CoT LLM pipeline. Experiments report that ThinkDeeper ranks first on the Talk2Car leaderboard, outperforms prior methods on DrivePilot, MoCAD and RefCOCO/+/g, and remains robust under long-text, multi-agent and ambiguous conditions even when trained on only 50% of the data.
Significance. If the performance gains are shown to arise from the future-state rollouts rather than from decoder architecture or dataset differences, the work would provide a concrete demonstration that explicit forward simulation improves disambiguation in dynamic, context-dependent grounding tasks. The data-efficiency result and the new DrivePilot benchmark would be useful community resources for AV perception research.
major comments (3)
- [§3.2] §3.2 (SA-WM rollout): the central claim that future latent states supply net disambiguating signal is not supported by any ablation that compares the full model against a current-state-only baseline. Without this comparison it remains possible that observed gains on Talk2Car, DrivePilot and the 50%-data regime are driven by the hypergraph decoder or dataset construction rather than by the world-model component.
- [§4.2] §4.2 (benchmark tables): no per-scene error breakdown or correlation between rollout prediction accuracy and grounding accuracy is reported for the long-text, multi-agent and ambiguous subsets highlighted in the abstract. This omission prevents verification that the rollout step improves rather than degrades performance under the conditions where it is most needed.
- [§4.1] §4.1 (DrivePilot construction): the RAG+CoT LLM annotation pipeline is described without quantitative validation (human agreement rates, inter-annotator agreement, or error analysis on the generated referring expressions). Because results on the new dataset are used to support the method’s superiority, the lack of such checks is load-bearing for the reliability of those claims.
minor comments (2)
- [§3.2] The latent-state dimensionality and rollout horizon are listed as free parameters in the method but their concrete values and sensitivity analysis are not provided in the experimental section.
- [Figure 2] Figure 2 (architecture diagram) would benefit from explicit arrows and labels distinguishing the command-aware distillation step from the subsequent rollout steps.
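The correlation analysis requested in the second major comment could take a form like the sketch below: a Pearson coefficient between a per-scene rollout error and the per-scene grounding IoU. How rollout error would actually be measured is not stated in the report, so both lists here are invented placeholder values and `pearson` is a hand-written helper.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length, at least 2 points each")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sample is constant")
    return cov / math.sqrt(var_x * var_y)


# Placeholder per-scene values (not results from the paper).
rollout_error = [0.10, 0.25, 0.40, 0.80]
grounding_iou = [0.90, 0.70, 0.55, 0.20]

r = pearson(rollout_error, grounding_iou)
```

A strongly negative coefficient on the long-text, multi-agent and ambiguous subsets would be consistent with grounding quality depending on rollout quality; a coefficient near zero would suggest the gains come from elsewhere.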
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.
- Referee: [§3.2] §3.2 (SA-WM rollout): the central claim that future latent states supply net disambiguating signal is not supported by any ablation that compares the full model against a current-state-only baseline. Without this comparison it remains possible that observed gains on Talk2Car, DrivePilot and the 50%-data regime are driven by the hypergraph decoder or dataset construction rather than by the world-model component.
  Authors: We agree that an explicit ablation isolating the contribution of the future-state rollouts versus a current-state-only baseline is necessary to substantiate the central claim. In the revised manuscript we will add this comparison, training and evaluating a variant of ThinkDeeper that omits the SA-WM rollout and uses only the current latent state. This will clarify whether the reported gains are attributable to the world-model component rather than the decoder architecture or dataset.
  Revision: yes
- Referee: [§4.2] §4.2 (benchmark tables): no per-scene error breakdown or correlation between rollout prediction accuracy and grounding accuracy is reported for the long-text, multi-agent and ambiguous subsets highlighted in the abstract. This omission prevents verification that the rollout step improves rather than degrades performance under the conditions where it is most needed.
  Authors: We acknowledge the value of granular analysis for the challenging subsets. We will add a per-scene error breakdown for the long-text, multi-agent, and ambiguous cases, together with a correlation analysis between rollout prediction accuracy and final grounding accuracy, in the updated experimental section of the revised manuscript.
  Revision: yes
- Referee: [§4.1] §4.1 (DrivePilot construction): the RAG+CoT LLM annotation pipeline is described without quantitative validation (human agreement rates, inter-annotator agreement, or error analysis on the generated referring expressions). Because results on the new dataset are used to support the method’s superiority, the lack of such checks is load-bearing for the reliability of those claims.
  Authors: We recognize that quantitative validation of the RAG+CoT annotation pipeline is important for establishing the reliability of DrivePilot. We will conduct and report human agreement rates, inter-annotator agreement, and error analysis on a sampled subset of the generated referring expressions in the revised manuscript.
  Revision: yes
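For the annotation validation promised in the last response, one common way to report inter-annotator agreement is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below is an assumption about how such a check might be run, not the authors' protocol; the labels are invented, with 1 meaning an annotator judged a generated referring expression correct and 0 meaning incorrect.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one identical label")
    return (observed - expected) / (1.0 - expected)


# Invented judgments from two hypothetical annotators on four expressions.
annotator_1 = [1, 1, 0, 0]
annotator_2 = [1, 0, 0, 0]

kappa = cohens_kappa(annotator_1, annotator_2)
```

Here the annotators agree on three of four items, yet kappa is 0.5, because half of that agreement would be expected by chance given how often each annotator uses each label.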
Circularity Check
No significant circularity in derivation chain
The paper introduces architectural components (SA-WM for command-aware latent distillation and future rollout, plus hypergraph decoder) and a new dataset (DrivePilot via RAG/CoT LLM), then reports empirical rankings and robustness on external benchmarks including Talk2Car, DrivePilot, MoCAD, and RefCOCO variants. No equations or steps in the provided description reduce a claimed prediction or result to a fitted input by construction, nor does any load-bearing premise collapse to a self-citation or self-defined ansatz. The central claims rest on verifiable performance deltas against baselines rather than internal redefinitions or forced statistical equivalence.
Axiom & Free-Parameter Ledger
free parameters (1)
- latent state dimensionality and rollout horizon
axioms (1)
- domain assumption: Future latent states generated by the SA-WM provide useful disambiguation signals for the current grounding task.
invented entities (3)
- Spatial-Aware World Model (SA-WM): no independent evidence
- Hypergraph-guided decoder: no independent evidence
- DrivePilot dataset: no independent evidence