pith. machine review for the scientific record. sign in

arxiv: 2604.08719 · v1 · submitted 2026-04-09 · 💻 cs.CV · cs.AI· cs.RO

Recognition: unknown

LMGenDrive: Bridging Multimodal Understanding and Generative World Modeling for End-to-End Driving

Authors on Pith no claims yet

Pith reviewed 2026-05-10 17:30 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.RO
keywords autonomous drivinggenerative world modelslarge language modelsend-to-end controlvideo predictionclosed-loop planningmultimodal understandinginstruction following
0
0 comments X

The pith

LMGenDrive unifies LLM understanding with generative video prediction for closed-loop driving.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LMGenDrive to tackle poor generalization in autonomous driving by merging large language model strengths in semantic reasoning with generative models that forecast scene evolution. The system takes multi-view camera images and natural language instructions to produce both predicted future driving videos and the matching vehicle control signals. A three-stage training sequence builds stability by moving from basic vision pretraining through to full multi-step driving tasks. Closed-loop benchmark results show gains over prior separate approaches, with particular lifts in instruction adherence and handling of uncommon events. This indicates that models able to both comprehend and imagine futures can yield more adaptable embodied control.

Core claim

LMGenDrive is the first framework that combines LLM-based multimodal understanding with generative world models for end-to-end closed-loop driving. Given multi-view camera inputs and natural-language instructions, LMGenDrive generates both future driving videos and control signals. This design provides complementary benefits: video prediction improves spatio-temporal scene modeling, while the LLM contributes strong semantic priors and instruction grounding from large-scale pretraining. A progressive three-stage training strategy, from vision pretraining to multi-step long-horizon driving, improves stability and performance. LMGenDrive supports both low-latency online planning and autoregress

What carries the argument

The LMGenDrive joint generator that outputs future video sequences together with control signals, using LLM components for semantic and instruction grounding alongside world-model components for scene dynamics.

If this is right

  • The model can run in low-latency online mode for real-time planning or in autoregressive mode for offline video simulation.
  • Complementary semantic priors and spatio-temporal prediction improve robustness on rare and long-tail driving cases.
  • Instruction grounding becomes more reliable because language understanding is trained alongside visual future prediction.
  • The three-stage progressive training reduces instability when scaling to longer driving horizons.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This style of unification could transfer to other embodied tasks such as robotic manipulation where both language commands and visual future modeling matter.
  • If the joint training proves stable, similar architectures might reduce the need for separate modules in perception-planning stacks.
  • Real-world deployment would benefit from testing whether the generated videos remain consistent with actual sensor data over extended sequences.

Load-bearing premise

That fusing video prediction with LLM semantic priors will produce additive gains without creating new prediction inconsistencies or training instabilities.

What would settle it

A controlled test showing LMGenDrive performs no better than a pure LLM planner or a standalone generative world model on the same closed-loop benchmarks, or produces video predictions that lead to unsafe controls.

Figures

Figures reproduced from arXiv: 2604.08719 by Hao Shao, Hongsheng Li, Letian Wang, Steven L. Waslander, Wei Zhan, Yang Zhou, Yuxuan Hu, Zhuofan Zong.

Figure 1
Figure 1. Figure 1: Comparison between existing works and ours. Prior works either leverage LLMs/VLMs for multimodal understanding and [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of our unified understanding and generation architecture. We start by encoding the language instruction, multi-view [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Architecture of our world generator. We begin by fusing the world embedding obtained from the LLM with multi-view RGB [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Visualization of multi-view future scenes generated by LMGenDrive, showing consistent left-front-right views aligned with [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
read the original abstract

Recent years have seen remarkable progress in autonomous driving, yet generalization to long-tail and open-world scenarios remains a major bottleneck for large-scale deployment. To address this challenge, some works use LLMs and VLMs for vision-language understanding and reasoning, enabling vehicles to interpret rare and safety-critical situations when generating actions. Others study generative world models to capture the spatio-temporal evolution of driving scenes, allowing agents to imagine possible futures before acting. Inspired by human intelligence, which unifies understanding and imagination, we explore a unified model for autonomous driving. We present LMGenDrive, the first framework that combines LLM-based multimodal understanding with generative world models for end-to-end closed-loop driving. Given multi-view camera inputs and natural-language instructions, LMGenDrive generates both future driving videos and control signals. This design provides complementary benefits: video prediction improves spatio-temporal scene modeling, while the LLM contributes strong semantic priors and instruction grounding from large-scale pretraining. We further propose a progressive three-stage training strategy, from vision pretraining to multi-step long-horizon driving, to improve stability and performance. LMGenDrive supports both low-latency online planning and autoregressive offline video generation. Experiments show that it significantly outperforms prior methods on challenging closed-loop benchmarks, with clear gains in instruction following, spatio-temporal understanding, and robustness to rare scenarios. These results suggest that unifying multimodal understanding and generation is a promising direction for more generalizable and robust embodied decision-making systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript introduces LMGenDrive, the first unified framework that combines LLM-based multimodal understanding with generative world models for end-to-end closed-loop driving. Given multi-view camera inputs and natural-language instructions, the model generates both future driving videos and control signals. It employs a progressive three-stage training strategy (vision pretraining to multi-step long-horizon driving) to improve stability. Experiments demonstrate significant outperformance over prior methods on challenging closed-loop benchmarks, with gains in instruction following, spatio-temporal understanding, and robustness to rare scenarios.

Significance. If the empirical results hold under scrutiny, this work could meaningfully advance autonomous driving and embodied AI research. Unifying semantic priors from large-scale LLM pretraining with spatio-temporal modeling from generative video prediction directly targets the long-tail generalization bottleneck. The three-stage training strategy offers a practical recipe for stable hybrid model optimization, and the dual support for low-latency online planning and autoregressive offline generation increases applicability. Credit is due for the closed-loop evaluation focus and the explicit attempt to demonstrate complementary benefits between understanding and generation modules.

minor comments (3)
  1. [Abstract] Abstract: The statement that LMGenDrive 'significantly outperforms prior methods on challenging closed-loop benchmarks' would be strengthened by naming at least one benchmark (e.g., CARLA closed-loop) and reporting a key quantitative metric (e.g., success rate or collision reduction) rather than leaving the claim entirely qualitative.
  2. [Section 3] Section 3 (Method): The precise mechanism by which the LLM semantic priors condition the generative world model (and vice versa) during joint video-and-control prediction is described at a high level; adding a short equation or pseudocode snippet for the cross-modal conditioning would improve reproducibility.
  3. [Figure 2] Figure 2 / Section 4.2: The architecture diagram and training-stage illustrations are clear, but the caption for the three-stage progression could explicitly state the loss terms active in each stage to make the stability claim easier to verify.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of LMGenDrive and the recommendation for minor revision. We appreciate the recognition of the unified framework's potential to address long-tail generalization in autonomous driving through the integration of LLM priors and generative world modeling.

Circularity Check

0 steps flagged

No significant circularity; empirical framework with independent experimental validation

full rationale

The paper describes an empirical system (LMGenDrive) that integrates pretrained LLM/VLM components with a generative video model, trained progressively in three stages and evaluated on closed-loop driving benchmarks. No mathematical derivations, equations, or first-principles predictions are presented that could reduce to fitted parameters or self-citations by construction. Performance gains are reported via external benchmarks and ablations rather than internal self-referential loops. Self-citations, if present, are not load-bearing for any claimed uniqueness theorem or ansatz. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 1 invented entities

Based on abstract only; the central claim rests on the assumption that neural generative models and LLMs can be jointly trained to produce coherent video predictions and controls, plus standard deep-learning training assumptions. No specific free parameters or invented entities are detailed in the provided text.

free parameters (1)
  • model architecture hyperparameters
    Neural network sizes, learning rates, and stage-specific training schedules are necessarily present but not enumerated in the abstract.
axioms (2)
  • domain assumption Generative video models can capture useful spatio-temporal dynamics of driving scenes
    Invoked when stating that video prediction improves scene modeling.
  • domain assumption LLM pretraining provides strong semantic priors and instruction grounding transferable to driving
    Invoked when claiming complementary benefits from the LLM component.
invented entities (1)
  • LMGenDrive unified framework no independent evidence
    purpose: Single model performing both multimodal understanding and generative world modeling for driving
    New proposed system whose benefits are asserted but not independently verified outside the paper's experiments.

pith-pipeline@v0.9.0 · 5588 in / 1472 out tokens · 71814 ms · 2026-05-10T17:30:26.437234+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

69 extracted references · 37 canonical work pages · 13 internal anchors

  1. [1]

    Advances in neural infor- mation processing systems35, 23716–23736 (2022) 2

    Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y ., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al.: Flamingo: a visual language model for few-shot learning. Advances in neural infor- mation processing systems35, 23716–23736 (2022) 2

  2. [2]

    Qwen2.5-VL Technical Report

    Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2. 5-vl technical report. arXiv preprint arXiv:2502.13923 (2025) 2, 3

  3. [3]

    Ball, P.J., Bauer, J., Belletti, F., et al.: Genie 3: A new frontier for world models (2025) 2

  4. [4]

    CARLA Team: Carla autonomous driving leader- board.https://leaderboard.carla.org/ (2020), accessed: 2021-02-11 7

  5. [5]

    In: Conference on Robot Learning

    Chen, D., Zhou, B., Koltun, V ., Kr¨ahenb¨uhl, P.: Learn- ing by cheating. In: Conference on Robot Learning. pp. 66–75. PMLR (2020) 5

  6. [6]

    BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

    Chen, J., Xu, Z., Pan, X., Hu, Y ., Qin, C., Gold- stein, T., Huang, L., Zhou, T., Xie, S., Savarese, S., et al.: Blip3-o: A family of fully open unified multi- modal models-architecture, training and dataset. arXiv preprint arXiv:2505.09568 (2025) 2

  7. [7]

    arXiv preprint arXiv:2310.01957 (2023) 3

    Chen, L., Sinavski, O., H ¨unermann, J., Karnsund, A., Willmott, A.J., Birch, D., Maund, D., Shotton, J.: Driving with llms: Fusing object-level vector modal- ity for explainable autonomous driving. arXiv preprint arXiv:2310.01957 (2023) 3

  8. [8]

    See https://vicuna

    Chiang, W.L., Li, Z., Lin, Z., Sheng, Y ., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y ., Gon- zalez, J.E., et al.: Vicuna: An open-source chat- bot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna. lmsys. org (accessed 14 April 2023) 2(3), 6 (2023) 7

  9. [9]

    Emerging Properties in Unified Multimodal Pretraining

    Deng, C., Zhu, D., Li, K., Gou, C., Li, F., Wang, Z., Zhong, S., Yu, W., Nie, X., Song, Z., et al.: Emerg- 9 ing properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683 (2025) 2

  10. [10]

    In: Conference on robot learning

    Dosovitskiy, A., Ros, G., Codevilla, F., Lopez, A., Koltun, V .: Carla: An open urban driving simulator. In: Conference on robot learning. pp. 1–16. PMLR (2017) 7

  11. [11]

    In: Pro- ceedings of the IEEE/CVF International Conference on Computer Vision

    Fu, H., Zhang, D., Zhao, Z., Cui, J., Liang, D., Zhang, C., Zhang, D., Xie, H., Wang, B., Bai, X.: Orion: A holistic end-to-end autonomous driving framework by vision-language instructed action generation. In: Pro- ceedings of the IEEE/CVF International Conference on Computer Vision. pp. 24823–24834 (2025) 3

  12. [12]

    Magicdrive: Street view generation with diverse 3d geometry control.arXiv preprint arXiv:2310.02601, 2023

    Gao, R., Chen, K., Xie, E., Hong, L., Li, Z., Ye- ung, D.Y ., Xu, Q.: Magicdrive: Street view genera- tion with diverse 3d geometry control. arXiv preprint arXiv:2310.02601 (2023) 2, 3

  13. [13]

    Vista: A generalizable driving world model with high fidelity and versatile controllability

    Gao, S., Yang, J., Chen, L., Chitta, K., Qiu, Y ., Geiger, A., Zhang, J., Li, H.: Vista: A generalizable driving world model with high fidelity and versatile controlla- bility. arXiv preprint arXiv:2405.17398 (2024) 2, 3

  14. [14]

    Blog post, April 1(2023) 4

    Geng, X., Gudibande, A., Liu, H., Wallace, E., Abbeel, P., Levine, S., Song, D.: Koala: A dia- logue model for academic research. Blog post, April 1(2023) 4

  15. [15]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al.: Deepseek- r1: Incentivizing reasoning capability in llms via rein- forcement learning. arXiv preprint arXiv:2501.12948 (2025) 1, 3

  16. [16]

    AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

    Guo, Y ., Yang, C., Rao, A., Liang, Z., Wang, Y ., Qiao, Y ., Agrawala, M., Lin, D., Dai, B.: Animate- diff: Animate your personalized text-to-image diffu- sion models without specific tuning. arXiv preprint arXiv:2307.04725 (2023) 5, 7

  17. [17]

    World Models

    Ha, D., Schmidhuber, J.: World models. arXiv preprint arXiv:1803.10122 (2018) 2

  18. [18]

    In: Proceedings of the IEEE conference on computer vision and pattern recognition

    He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016) 4

  19. [19]

    Advances in neural information pro- cessing systems33, 6840–6851 (2020) 5

    Ho, J., Jain, A., Abbeel, P.: Denoising diffusion prob- abilistic models. Advances in neural information pro- cessing systems33, 6840–6851 (2020) 5

  20. [20]

    GAIA-1: A Generative World Model for Autonomous Driving

    Hu, A., Russell, L., Yeo, H., Murez, Z., Fedoseev, G., Kendall, A., Shotton, J., Corrado, G.: Gaia-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080 (2023) 2, 3

  21. [21]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition

    Hu, Y ., Yang, J., Chen, L., Li, K., Sima, C., Zhu, X., Chai, S., Du, S., Lin, T., Wang, W., et al.: Planning- oriented autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition. pp. 17853–17862 (2023) 3

  22. [22]

    arXiv preprint arXiv:2306.07957 (2023) 3, 4

    Jaeger, B., Chitta, K., Geiger, A.: Hidden bi- ases of end-to-end driving models. arXiv preprint arXiv:2306.07957 (2023) 3, 4

  23. [23]

    arXiv preprint arXiv:2503.22231 (2025) 2, 3

    Ji, Y ., Zhu, Z., Zhu, Z., Xiong, K., Lu, M., Li, Z., Zhou, L., Sun, H., Wang, B., Lu, T.: Cogen: 3d con- sistent video generation via adaptive conditioning for autonomous driving. arXiv preprint arXiv:2503.22231 (2025) 2, 3

  24. [24]

    In: Proceedings of the IEEE/CVF Interna- tional Conference on Computer Vision

    Jia, X., Gao, Y ., Chen, L., Yan, J., Liu, P.L., Li, H.: Driveadapter: Breaking the coupling barrier of perception and planning in end-to-end autonomous driving. In: Proceedings of the IEEE/CVF Interna- tional Conference on Computer Vision. pp. 7953– 7963 (2023) 3

  25. [25]

    In: Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition

    Jia, X., Wu, P., Chen, L., Xie, J., He, C., Yan, J., Li, H.: Think twice before driving: Towards scal- able decoders for end-to-end autonomous driving. In: Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition. pp. 21983– 21994 (2023) 3

  26. [26]

    Drivetransformer: Unified transformer for scalable end-to-end autonomous driving.arXiv preprint arXiv:2503.07656,

    Jia, X., You, J., Zhang, Z., Yan, J.: Drivetrans- former: Unified transformer for scalable end-to-end autonomous driving. arXiv preprint arXiv:2503.07656 (2025) 3

  27. [27]

    In: Interna- tional conference on machine learning

    Li, J., Li, D., Xiong, C., Hoi, S.: Blip: Bootstrap- ping language-image pre-training for unified vision- language understanding and generation. In: Interna- tional conference on machine learning. pp. 12888– 12900. PMLR (2022) 2

  28. [28]

    Enhancing end-to-end autonomous driving with latent world model.arXiv preprint arXiv:2406.08481, 2024

    Li, Y ., Fan, L., He, J., Wang, Y ., Chen, Y ., Zhang, Z., Tan, T.: Enhancing end-to-end au- tonomous driving with latent world model. arXiv preprint arXiv:2406.08481 (2024) 4

  29. [29]

    Mogao: An omni foundation model for interleaved multi-modal generation.arXiv preprint arXiv:2505.05472, 2025

    Liao, C., Liu, L., Wang, X., Luo, Z., Zhang, X., Zhao, W., Wu, J., Li, L., Tian, Z., Huang, W.: Mogao: An omni foundation model for interleaved multi-modal generation. arXiv preprint arXiv:2505.05472 (2025) 2

  30. [30]

    arXiv preprint arXiv:2402.05935 , year=

    Liu, D., Zhang, R., Qiu, L., Huang, S., Lin, W., Zhao, S., Geng, S., Lin, Z., Jin, P., Zhang, K., et al.: Sphinx-x: Scaling data and parameters for a family of multi-modal large language models. arXiv preprint arXiv:2402.05935 (2024) 3

  31. [31]

    Advances in neural information process- ing systems36, 34892–34916 (2023) 2, 3, 4

    Liu, H., Li, C., Wu, Q., Lee, Y .J.: Visual instruc- tion tuning. Advances in neural information process- ing systems36, 34892–34916 (2023) 2, 3, 4

  32. [32]

    In: International Conference on Learn- ing Representations (2018) 6

    Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: International Conference on Learn- ing Representations (2018) 6

  33. [33]

    Gpt-driver: Learning to drive with gpt.arXiv preprint arXiv:2310.01415, 2023a

    Mao, J., Qian, Y ., Ye, J., Zhao, H., Wang, Y .: Gpt- driver: Learning to drive with gpt. arXiv preprint arXiv:2310.01415 (2023) 2, 3

  34. [34]

    A language agent for au- tonomous driving

    Mao, J., Ye, J., Qian, Y ., Pavone, M., Wang, Y .: A 10 language agent for autonomous driving. arXiv preprint arXiv:2311.10813 (2023) 3

  35. [35]

    Advances in Neural Information Processing Systems34, 5416– 5429 (2021) 3

    Qian, S., Shao, H., Zhu, Y ., Li, M., Jia, J.: Blend- ing anti-aliasing into vision transformer. Advances in Neural Information Processing Systems34, 5416– 5429 (2021) 3

  36. [36]

    In: Proceedings of the Computer Vision and Pattern Recognition Confer- ence

    Renz, K., Chen, L., Arani, E., Sinavski, O.: Sim- lingo: Vision-only closed-loop autonomous driving with language-action alignment. In: Proceedings of the Computer Vision and Pattern Recognition Confer- ence. pp. 11993–12003 (2025) 3

  37. [37]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022) 7

  38. [38]

    In: Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international con- ference, Munich, Germany, October 5-9, 2015, pro- ceedings, part III 18

    Ronneberger, O., Fischer, P., Brox, T.: U-net: Convo- lutional networks for biomedical image segmentation. In: Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international con- ference, Munich, Germany, October 5-9, 2015, pro- ceedings, part III 18. pp. 234–241. Springer (2015) 5

  39. [39]

    Gaia-2: A controllable multi-view generative world model for autonomous driving.arXiv preprint arXiv:2503.20523, 2025

    Russell, L., Hu, A., Bertoni, L., Fedoseev, G., Shot- ton, J., Arani, E., Corrado, G.: Gaia-2: A controllable multi-view generative world model for autonomous driving. arXiv preprint arXiv:2503.20523 (2025) 2, 3

  40. [40]

    Languagempc: Large language models as deci- sion makers for autonomous driving,

    Sha, H., Mu, Y ., Jiang, Y ., Chen, L., Xu, C., Luo, P., Li, S.E., Tomizuka, M., Zhan, W., Ding, M.: Languagempc: Large language models as deci- sion makers for autonomous driving. arXiv preprint arXiv:2310.03026 (2023) 3

  41. [41]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition

    Shao, H., Hu, Y ., Wang, L., Song, G., Waslander, S.L., Liu, Y ., Li, H.: Lmdrive: Closed-loop end-to-end driv- ing with large language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition. pp. 15120–15130 (2024) 2, 3, 4, 7

  42. [42]

    Advances in Neural Information Processing Sys- tems37, 8612–8642 (2024) 2, 3

    Shao, H., Qian, S., Xiao, H., Song, G., Zong, Z., Wang, L., Liu, Y ., Li, H.: Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reason- ing. Advances in Neural Information Processing Sys- tems37, 8612–8642 (2024) 2, 3

  43. [43]

    In: Conference on Robot Learning

    Shao, H., Wang, L., Chen, R., Li, H., Liu, Y .: Safety- enhanced autonomous driving using interpretable sen- sor fusion transformer. In: Conference on Robot Learning. pp. 726–737. PMLR (2023) 3, 6

  44. [44]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition

    Shao, H., Wang, L., Chen, R., Waslander, S.L., Li, H., Liu, Y .: Reasonnet: End-to-end driving with tem- poral and global reasoning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition. pp. 13723–13733 (2023) 3, 4

  45. [45]

    Llamafusion: Adapting pretrained language models for multimodal generation.arXiv preprint arXiv:2412.15188,

    Shi, W., Han, X., Zhou, C., Liang, W., Lin, X.V ., Zettlemoyer, L., Yu, L.: Lmfusion: Adapting pre- trained language models for multimodal generation. arXiv preprint arXiv:2412.15188 (2024) 2

  46. [46]

    In: European Conference on Computer Vision

    Sima, C., Renz, K., Chitta, K., Chen, L., Zhang, H., Xie, C., Beißwenger, J., Luo, P., Geiger, A., Li, H.: Drivelm: Driving with graph visual question answer- ing. In: European Conference on Computer Vision. pp. 256–274. Springer (2024) 3

  47. [47]

    DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

    Tian, X., Gu, J., Li, B., Liu, Y ., Wang, Y ., Zhao, Z., Zhan, K., Jia, P., Lang, X., Zhao, H.: Drivevlm: The convergence of autonomous driving and large vision- language models. arXiv preprint arXiv:2402.12289 (2024) 3

  48. [48]

    LLaMA: Open and Efficient Foundation Language Models

    Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozi `ere, B., Goyal, N., Hambro, E., Azhar, F., et al.: Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) 3, 4

  49. [49]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Touvron, H., Martin, L., Stone, K., Albert, P., Alma- hairi, A., Babaei, Y ., Bashlykov, N., Batra, S., Bhar- gava, P., Bhosale, S., et al.: Llama 2: Open foun- dation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023) 3

  50. [50]

    arXiv preprint arXiv:2503.12170 (2025) 3

    Wang, T., Zhang, C., Qu, X., Li, K., Liu, W., Huang, C.: Diffad: A unified diffusion modeling approach for autonomous driving. arXiv preprint arXiv:2503.12170 (2025) 3

  51. [51]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., et al.: Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265 (2025) 2, 3

  52. [52]

    arXiv preprint arXiv:2505.18650 (2025) 2, 3

    Wang, X., Peng, P.: Prophetdwm: A driving world model for rolling out future actions and videos. arXiv preprint arXiv:2505.18650 (2025) 2, 3

  53. [53]

    In: European Confer- ence on Computer Vision

    Wang, X., Zhu, Z., Huang, G., Chen, X., Zhu, J., Lu, J.: Drivedreamer: Towards real-world-drive world models for autonomous driving. In: European Confer- ence on Computer Vision. pp. 55–72. Springer (2024) 2, 3

  54. [54]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Wang, Y ., He, J., Fan, L., Li, H., Chen, Y ., Zhang, Z.: Driving into the future: Multiview visual forecasting and planning with world model for autonomous driv- ing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14749– 14759 (2024) 2, 3, 5

  55. [55]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition

    Weng, X., Ivanovic, B., Wang, Y ., Wang, Y ., Pavone, M.: Para-drive: Parallelized architecture for real- time autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition. pp. 15449–15458 (2024) 3 11

  56. [56]

    Bevdriver: Leveraging bev maps in llms for robust closed-loop driving.arXiv preprint arXiv:2503.03074, 2025

    Winter, K., Azer, M., Flohr, F.B.: Bevdriver: Lever- aging bev maps in llms for robust closed-loop driving. arXiv preprint arXiv:2503.03074 (2025) 3, 7

  57. [57]

    Xing, and Zhiting Hu

    Xiang, J., Liu, G., Gu, Y ., Gao, Q., Ning, Y ., Zha, Y ., Feng, Z., Tao, T., Hao, S., Shi, Y ., et al.: Pandora: Towards general world model with natu- ral language actions and video states. arXiv preprint arXiv:2406.09455 (2024) 2, 5

  58. [58]

    Wong, Zhenguo Li, and Hengshuang Zhao

    Xu, Z., Zhang, Y ., Xie, E., Zhao, Z., Guo, Y ., Wong, K.K., Li, Z., Zhao, H.: Drivegpt4: Interpretable end- to-end autonomous driving via large language model. arXiv preprint arXiv:2310.01412 (2023) 3

  59. [59]

    Qwen3 Technical Report

    Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al.: Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025) 3

  60. [60]

    In: Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition

    Yang, J., Gao, S., Qiu, Y ., Chen, L., Li, T., Dai, B., Chitta, K., Wu, P., Zeng, J., Luo, P., et al.: Gener- alized predictive model for autonomous driving. In: Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition. pp. 14662– 14672 (2024) 2, 3

  61. [61]

    DriveMoE: Mixture-of-experts for vision-language-action model in end-to-end autonomous driving

    Yang, Z., Chai, Y ., Jia, X., Li, Q., Shao, Y ., Zhu, X., Su, H., Yan, J.: Drivemoe: Mixture-of-experts for vision-language-action model in end-to-end au- tonomous driving. arXiv preprint arXiv:2505.16278 (2025) 3

  62. [62]

    OPT: Open Pre-trained Transformer Language Models

    Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin, X.V ., et al.: Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068 (2022) 7

  63. [63]

    arXiv preprint arXiv:2406.03474 (2024) 3, 7

    Zhang, Z., Tang, S., Zhang, Y ., Fu, T., Wang, Y ., Liu, Y ., Wang, D., Shao, J., Wang, L., Lu, H.: Ad-h: Autonomous driving with hierarchical agents. arXiv preprint arXiv:2406.03474 (2024) 3, 7

  64. [64]

    Zheng, L., Chiang, W.L., Sheng, Y ., Zhuang, S., Wu, Z., Zhuang, Y ., Lin, Z., Li, Z., Li, D., Xing, E.P., Zhang, H., Gonzalez, J.E., Stoica, I.: Judging llm-as- a-judge with mt-bench and chatbot arena (2023) 4

  65. [65]

    Diffusion-based planning for autonomous driving with flexible guidance.arXiv preprint arXiv:2501.15564,

    Zheng, Y ., Liang, R., Zheng, K., Zheng, J., Mao, L., Li, J., Gu, W., Ai, R., Li, S.E., Zhan, X., et al.: Diffusion-based planning for autonomous driving with flexible guidance. arXiv preprint arXiv:2501.15564 (2025) 3

  66. [66]

    In: Pro- ceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition

    Zhou, Y ., Shao, H., Wang, L., Waslander, S.L., Li, H., Liu, Y .: Smartrefine: A scenario-adaptive refinement framework for efficient motion prediction. In: Pro- ceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition. pp. 15281–15290 (2024) 3

  67. [67]

    Drivinggen: A compre- hensive benchmark for generative video world models in au- tonomous driving.arXiv preprint arXiv:2601.01528, 2026

    Zhou, Y ., Shao, H., Wang, L., Zong, Z., Li, H., Waslander, S.L.: Drivinggen: A comprehensive benchmark for generative video world models in au- tonomous driving. arXiv preprint arXiv:2601.01528 (2026) 2

  68. [68]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    Zhu, D., Chen, J., Shen, X., Li, X., Elhoseiny, M.: Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592 (2023) 3, 4

  69. [69]

    Advances in Neural Information Processing Systems37, 103305–103333 (2024) 3 12

    Zong, Z., Ma, B., Shen, D., Song, G., Shao, H., Jiang, D., Li, H., Liu, Y .: Mova: Adapting mixture of vision experts to multimodal context. Advances in Neural Information Processing Systems37, 103305–103333 (2024) 3 12