Learning Vision-Language-Action World Models for Autonomous Driving
Pith reviewed 2026-05-10 17:02 UTC · model grok-4.3
The pith
VLA-World unifies predictive imagination and reflective reasoning to enhance autonomous driving foresight and safety.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VLA-World first uses an action-derived feasible trajectory to guide generation of the next-frame image, capturing rich spatial and temporal cues. It then reasons over this self-generated imagined frame to refine the predicted trajectory. Supported by the nuScenes-GR-20K dataset and a three-stage training strategy, this unification of imagination and reflection yields higher performance and better interpretability in autonomous driving tasks.
What carries the argument
The action-guided future-frame generation followed by reflective reasoning over the imagined scene, which refines the trajectory prediction.
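The imagine-then-reflect loop can be sketched in a few lines. Every function name below (predict_initial_trajectory, imagine_next_frame, reflect_and_refine) and the toy lateral correction are illustrative placeholders, not the paper's actual interfaces:

```python
# Hypothetical sketch of VLA-World's imagine-then-reflect loop.
# All function bodies are toy stand-ins for the paper's learned modules.

def predict_initial_trajectory(observation):
    # Stage 1: the VLA policy proposes an action-derived feasible
    # trajectory as (x, y) waypoints in the ego frame.
    return [(0.0, 0.0), (1.0, 0.2), (2.0, 0.5)]

def imagine_next_frame(observation, trajectory):
    # Stage 2: the world model renders the next frame conditioned on
    # the trajectory; a dict stands in for an image tensor here.
    return {"frame": "imagined", "conditioned_on": trajectory}

def reflect_and_refine(trajectory, imagined_frame):
    # Stage 3: the model reasons over its own imagined future and
    # adjusts the waypoints; a fixed lateral shrink stands in for the
    # learned refinement.
    return [(x, y * 0.9) for x, y in trajectory]

def vla_world_step(observation):
    tau0 = predict_initial_trajectory(observation)
    frame = imagine_next_frame(observation, tau0)
    tau1 = reflect_and_refine(tau0, frame)
    return tau0, frame, tau1
```

The structural point the sketch makes explicit: the refined trajectory depends on a frame that itself depends on the initial trajectory, which is exactly where generation errors can feed back into planning.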
If this is right
- Improved trajectory prediction accuracy by incorporating future scene reasoning.
- Better interpretability of the model's driving decisions through explicit future simulation.
- Enhanced performance on both planning and scene generation benchmarks over prior VLA and world model approaches.
- More robust handling of temporal dynamics and global world consistency in driving scenarios.
Where Pith is reading between the lines
- This method suggests that closing the loop between action prediction and visual imagination could generalize to other robotics domains like manipulation or navigation.
- If generation errors are mitigated, the reflective step might enable safer long-horizon planning without manual rule-based safety checks.
- Future work could test whether this self-refinement reduces the need for extensive human-labeled trajectory data.
Load-bearing premise
That reasoning over the model's own generated future frames will consistently correct trajectory errors rather than amplify inaccuracies from imperfect image synthesis.
What would settle it
A controlled ablation showing that disabling the reasoning-over-imagined-frame step causes no performance drop, or even an improvement, on the planning benchmarks would falsify the claimed benefit of the reflective component.
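A minimal harness for that ablation might look like the following sketch, where synthetic waypoints stand in for nuScenes trajectories and the "refined" prediction is invented for illustration:

```python
# Toy ablation harness: average L2 waypoint error with the reflective
# step disabled (initial prediction) vs enabled (refined prediction).
# All trajectories below are synthetic examples, not paper data.

def l2_error(pred, gt):
    # Mean Euclidean distance between predicted and ground-truth waypoints.
    return sum(((px - gx) ** 2 + (py - gy) ** 2) ** 0.5
               for (px, py), (gx, gy) in zip(pred, gt)) / len(gt)

def run_ablation(initial, refined, gt):
    return {
        "no_reflection": l2_error(initial, gt),
        "with_reflection": l2_error(refined, gt),
    }

gt = [(0.0, 0.0), (1.0, 0.1), (2.0, 0.3)]
initial = [(0.0, 0.0), (1.0, 0.3), (2.0, 0.7)]   # before reasoning
refined = [(0.0, 0.0), (1.0, 0.15), (2.0, 0.4)]  # after reasoning
scores = run_ablation(initial, refined, gt)
```

If `with_reflection` were not consistently lower than `no_reflection` across real scenes, the claimed benefit of the reflective component would be falsified.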
Original abstract
Vision-Language-Action (VLA) models have recently achieved notable progress in end-to-end autonomous driving by integrating perception, reasoning, and control within a unified multimodal framework. However, they often lack explicit modeling of temporal dynamics and global world consistency, which limits their foresight and safety. In contrast, world models can simulate plausible future scenes but generally struggle to reason about or evaluate the imagined future they generate. In this work, we present VLA-World, a simple yet effective VLA world model that unifies predictive imagination with reflective reasoning to improve driving foresight. VLA-World first uses an action-derived feasible trajectory to guide the generation of the next-frame image, capturing rich spatial and temporal cues that describe how the surrounding environment evolves. The model then reasons over this self-generated future imagined frame to refine the predicted trajectory, achieving higher performance and better interpretability. To support this pipeline, we curate nuScenes-GR-20K, a generative reasoning dataset derived from nuScenes, and employ a three-stage training strategy that includes pretraining, supervised fine-tuning, and reinforcement learning. Extensive experiments demonstrate that VLA-World consistently surpasses state-of-the-art VLA and world-model baselines on both planning and future-generation benchmarks. Project page: https://vlaworld.github.io
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces VLA-World, a vision-language-action world model for autonomous driving that unifies predictive imagination with reflective reasoning. The pipeline first conditions next-frame generation on an action-derived initial trajectory to capture spatial-temporal evolution, then reasons over the self-generated imagined frame to refine the trajectory. It curates the nuScenes-GR-20K generative reasoning dataset from nuScenes and employs a three-stage training strategy (pretraining, supervised fine-tuning, and reinforcement learning). The central claim is that VLA-World consistently surpasses state-of-the-art VLA and world-model baselines on both planning and future-generation benchmarks.
Significance. If the results hold under rigorous validation, this work could meaningfully advance end-to-end autonomous driving by bridging world models' predictive capabilities with VLA-style reasoning, potentially improving foresight and interpretability. The dataset curation and staged training approach are constructive contributions that may enable follow-on research. The significance is limited by the absence of targeted validation for the core refinement loop.
major comments (3)
- §4 (Experiments): No ablation is reported that removes only the reflective reasoning module while keeping the initial trajectory-guided generation fixed. This is load-bearing for the headline claim: the superiority over baselines is attributed to the reasoning step refining trajectories, and without such an ablation it is impossible to determine whether the loop provides net benefit or amplifies generation artifacts.
- §3 (Method) and §4 (Experiments): The manuscript provides no per-scene or per-metric correlation between future-frame generation quality (e.g., FID, PSNR, or LPIPS on nuScenes-GR-20K) and the change in planning metrics (e.g., collision rate or trajectory error) before versus after reasoning. This leaves the assumption that reasoning reliably refines rather than compounds errors untested.
- §4 (Experiments): Failure-case analysis is absent; there is no examination of scenes where low-quality imagined frames lead to final trajectories worse than the initial action-derived prediction, which would directly address the risk of error propagation inherent in the pipeline design.
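The correlation analysis requested in the second comment amounts, per scene, to something like the following; the FID values and error deltas are synthetic placeholders chosen only to show the computation:

```python
# Pearson correlation between per-scene generation quality (FID, lower
# is better) and planning-error reduction after reasoning. All numbers
# are invented for illustration.

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

fid = [12.0, 25.0, 40.0, 18.0, 55.0]           # per-scene FID (synthetic)
error_delta = [0.30, 0.18, 0.05, 0.25, -0.02]  # trajectory error before minus after reasoning

r = pearson_r(fid, error_delta)
# A strongly negative r here would indicate that worse imagined frames
# yield smaller (or negative) refinement gains, i.e. that reasoning can
# compound generation errors on low-quality frames.
```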
minor comments (2)
- Abstract: The abstract would be strengthened by including one or two key quantitative results (e.g., planning metric improvements) to substantiate the superiority claim rather than stating it qualitatively.
- §3 (Method): A diagram or pseudocode for the three-stage training pipeline and the exact conditioning of generation on the initial trajectory would improve the method's clarity.
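As a starting point for the pseudocode requested in the second minor comment, the three-stage schedule named in the abstract could be outlined as follows; the per-stage objectives are assumptions inferred from the abstract, not confirmed details of the method:

```python
# Sketch of the three-stage training schedule (pretraining -> SFT -> RL).
# Stage objectives are plausible guesses, not the paper's specification.

def three_stage_schedule():
    return [
        ("pretrain", "large-scale multimodal pretraining on image, text, and action data"),
        ("sft", "supervised fine-tuning on nuScenes-GR-20K trajectory and reasoning labels"),
        ("rl", "reinforcement learning with rewards on refined-trajectory planning metrics"),
    ]

def train(model=None):
    log = []
    for stage, objective in three_stage_schedule():
        # A real implementation would dispatch to a stage-specific
        # optimizer and dataset here; we only record the plan.
        log.append(f"{stage}: {objective}")
    return log
```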
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of validating the reflective reasoning component in our pipeline. We have addressed each point by adding the requested analyses to the revised manuscript. Our responses are provided below.
Point-by-point responses
Referee: §4 (Experiments): No ablation is reported that removes only the reflective reasoning module while keeping the initial trajectory-guided generation fixed. This is load-bearing for the headline claim, as the superiority over baselines is attributed to the reasoning step refining trajectories; without it, it is impossible to determine whether the loop provides net benefit or amplifies generation artifacts.
Authors: We agree that isolating the contribution of the reflective reasoning module is necessary to support the central claim. In the revised manuscript, we have added an ablation study in Section 4 that compares the full VLA-World model to a variant using only the action-guided generation step without reflective reasoning. The results show that including the reasoning module yields consistent gains in planning metrics such as lower collision rates and reduced trajectory error, confirming a net benefit. The updated experiments and table are now included. revision: yes
Referee: §3 (Method) and §4 (Experiments): The manuscript provides no per-scene or per-metric correlation between future-frame generation quality (e.g., FID, PSNR, or LPIPS on nuScenes-GR-20K) and the delta in planning metrics (e.g., collision rate or trajectory error) before versus after reasoning. This leaves the assumption that reasoning reliably refines rather than compounds errors untested.
Authors: We acknowledge the value of correlating generation quality with planning improvements to test the refinement assumption. We have added this analysis to the revised Section 4, reporting per-metric correlations across the test set (e.g., between FID and trajectory error delta). A positive correlation is observed, indicating that better generation quality is associated with larger planning gains after reasoning. Per-scene breakdowns are provided for representative examples due to high scene variability; aggregate statistics and discussion of limitations are included. revision: yes
Referee: §4 (Experiments): Failure-case analysis is absent; there is no examination of scenes where low-quality imagined frames lead to worse final trajectories than the initial action-derived prediction, which would directly address the risk of error propagation highlighted in the pipeline design.
Authors: We recognize that failure-case analysis is important for addressing potential error propagation. In the revised manuscript, we have added a dedicated failure-case subsection in Section 4. This examines scenes where low-quality imagined frames (high FID/LPIPS) lead to final trajectories worse than the initial prediction. We provide quantitative frequency statistics and qualitative examples, noting that such cases are infrequent and typically arise in complex dynamic scenes. Mitigation approaches are discussed. revision: yes
Circularity Check
No significant circularity; pipeline is data-driven and externally benchmarked
full rationale
The VLA-World architecture is presented as a trained multimodal model that generates next-frame images conditioned on action-derived trajectories and then applies reasoning to refine those trajectories. This is supported by a curated external dataset (nuScenes-GR-20K) and a three-stage training procedure (pretraining, SFT, RL) evaluated on standard planning and generation benchmarks. No equations, uniqueness theorems, or central claims reduce by construction to fitted parameters or self-citations; the derivation chain relies on independent data and external baselines rather than self-referential definitions or imported ansatzes.
Forward citations
Cited by 2 Pith papers
- Driving Intents Amplify Planning-Oriented Reinforcement Learning: DIAL uses intent-conditioned CFG and multi-intent GRPO to expand and preserve diverse modes in continuous-action preference RL, lifting RFS to 9.14 and surpassing both prior best (8.5) and human demonstration (8.13).
- Driving Intents Amplify Planning-Oriented Reinforcement Learning: DIAL expands continuous-action driving policies via intent-conditioned flow matching and multi-intent GRPO, lifting best-of-N preference scores above human demonstrations for the first time on WOD-E2E.