pith. machine review for the scientific record. sign in

arxiv: 2503.19755 · v1 · submitted 2025-03-25 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation

Authors on Pith no claims yet

Pith reviewed 2026-05-17 08:04 UTC · model grok-4.3

classification 💻 cs.CV
keywords drivingreasoningactionmethodsorionspaceautonomousclosed-loop
0
0 comments X

The pith

ORION reports 77.74 Driving Score and 54.62% Success Rate on Bench2Drive, outperforming prior end-to-end methods by 14.28 DS and 19.61% SR through unified VQA and planning optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes a system for self-driving cars that processes camera images and uses language models to understand the driving scene. It adds a module to summarize past observations over time, lets a large language model reason about what the car should do next, and then generates a smooth path for the vehicle to follow. The key step is training everything together so the language reasoning directly influences the trajectory numbers instead of treating them as separate tasks. On a standard benchmark for closed-loop driving, the system scores higher than previous methods in both overall driving quality and the percentage of successful trips without crashes or rule violations. The authors argue this closes a gap where language models understand scenes well but struggle to output precise control actions.

Core claim

Our method achieves an impressive closed-loop performance of 77.74 Driving Score (DS) and 54.62% Success Rate (SR) on the challenge Bench2Drive datasets, which outperforms state-of-the-art (SOTA) methods by a large margin of 14.28 DS and 19.61% SR.

Load-bearing premise

The assumption that aligning the reasoning space of the LLM with the numerical action space through unified E2E optimization will reliably improve closed-loop causal reasoning and trajectory quality without introducing new failure modes.

read the original abstract

End-to-end (E2E) autonomous driving methods still struggle to make correct decisions in interactive closed-loop evaluation due to limited causal reasoning capability. Current methods attempt to leverage the powerful understanding and reasoning abilities of Vision-Language Models (VLMs) to resolve this dilemma. However, the problem is still open that few VLMs for E2E methods perform well in the closed-loop evaluation due to the gap between the semantic reasoning space and the purely numerical trajectory output in the action space. To tackle this issue, we propose ORION, a holistic E2E autonomous driving framework by vision-language instructed action generation. ORION uniquely combines a QT-Former to aggregate long-term history context, a Large Language Model (LLM) for driving scenario reasoning, and a generative planner for precision trajectory prediction. ORION further aligns the reasoning space and the action space to implement a unified E2E optimization for both visual question-answering (VQA) and planning tasks. Our method achieves an impressive closed-loop performance of 77.74 Driving Score (DS) and 54.62% Success Rate (SR) on the challenge Bench2Drive datasets, which outperforms state-of-the-art (SOTA) methods by a large margin of 14.28 DS and 19.61% SR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes ORION, a holistic end-to-end autonomous driving framework that combines a QT-Former to aggregate long-term history context, an LLM for driving scenario reasoning, and a generative planner for precision trajectory prediction. It aligns the LLM reasoning space with the numerical action space via unified E2E optimization of VQA and planning tasks. On the Bench2Drive benchmark, the method reports a closed-loop Driving Score of 77.74 and Success Rate of 54.62%, outperforming prior SOTA by 14.28 DS and 19.61% SR.

Significance. If the central performance claims hold after addressing the noted gaps, this work could meaningfully advance VLM integration in autonomous driving by demonstrating a practical way to bridge semantic reasoning and numerical trajectory generation. The focus on closed-loop evaluation and the scale of the reported margins are notable strengths, as is the explicit attempt at unified optimization across perception-reasoning and control.

major comments (2)
  1. [§3] §3 (Framework and Unified Optimization): The manuscript describes the QT-Former + LLM + generative planner architecture and the unified loss for VQA and planning, but provides no ablation that isolates the contribution of the reasoning-action space alignment (e.g., joint E2E training versus separate VQA-only or planning-only optimization). This ablation is load-bearing for the claim that alignment improves causal reasoning and closes the semantic-numerical gap, as the 14.28 DS / 19.61% SR gains could otherwise be explained by dataset tuning, history aggregation, or planner design alone.
  2. [§5] §5 (Experiments and Results): The reported benchmark numbers on Bench2Drive lack error bars, standard deviations across runs, or explicit data exclusion criteria. Without these, it is difficult to assess whether the large margins over SOTA are statistically reliable or sensitive to particular evaluation conditions.
minor comments (2)
  1. [§3] The notation for the unified loss function and the QT-Former aggregation could be made more explicit with an equation or pseudocode block to improve reproducibility.
  2. [Figure 1] Figure 1 (overall architecture) would benefit from clearer arrows or labels distinguishing the VQA branch from the trajectory generation branch.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We have carefully reviewed the major comments and provide point-by-point responses below, along with our plans for revision.

read point-by-point responses
  1. Referee: [§3] §3 (Framework and Unified Optimization): The manuscript describes the QT-Former + LLM + generative planner architecture and the unified loss for VQA and planning, but provides no ablation that isolates the contribution of the reasoning-action space alignment (e.g., joint E2E training versus separate VQA-only or planning-only optimization). This ablation is load-bearing for the claim that alignment improves causal reasoning and closes the semantic-numerical gap, as the 14.28 DS / 19.61% SR gains could otherwise be explained by dataset tuning, history aggregation, or planner design alone.

    Authors: We agree that an explicit ablation isolating the contribution of the unified E2E optimization (joint VQA and planning training) versus separate optimization would provide stronger support for the benefits of reasoning-action space alignment. In the revised manuscript, we will add this ablation study, including comparisons of the full ORION model against variants trained separately on VQA-only and planning-only objectives. This will help clarify that the observed gains stem from the alignment mechanism rather than other architectural components. revision: yes

  2. Referee: [§5] §5 (Experiments and Results): The reported benchmark numbers on Bench2Drive lack error bars, standard deviations across runs, or explicit data exclusion criteria. Without these, it is difficult to assess whether the large margins over SOTA are statistically reliable or sensitive to particular evaluation conditions.

    Authors: We acknowledge that the current results presentation would benefit from greater statistical transparency. In the revised manuscript, we will report standard deviations and error bars obtained from multiple evaluation runs using different random seeds. We will also explicitly document the data exclusion criteria applied during the Bench2Drive closed-loop evaluations to facilitate assessment of result reliability. revision: yes

Circularity Check

0 steps flagged

No significant circularity; results tied to external benchmark evaluation

full rationale

The paper describes an architectural framework (QT-Former + LLM + generative planner) with unified E2E optimization for VQA and planning, then reports closed-loop metrics on the external Bench2Drive benchmark. No equations, fitted parameters, or self-citations are presented that reduce the claimed DS/SR gains to a self-referential quantity or to a prior result by the same authors. The performance numbers are obtained by running the trained model on a held-out challenge dataset, which constitutes independent empirical evidence rather than a derivation that collapses to its own inputs by construction. The central claim therefore remains non-circular under the stated criteria.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient technical detail to enumerate free parameters, axioms, or invented entities; no equations or training procedures are shown.

pith-pipeline@v0.9.0 · 5563 in / 1123 out tokens · 37739 ms · 2026-05-17T08:04:53.553705+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Fail2Drive: Benchmarking Closed-Loop Driving Generalization

    cs.RO 2026-04 conditional novelty 7.0

    Fail2Drive is the first paired-route benchmark for closed-loop generalization in CARLA, showing an average 22.8% success-rate drop on shifted scenarios and revealing failure modes such as ignoring visible LiDAR objects.

  2. Fine-tuning is Not Enough: A Parallel Framework for Collaborative Imitation and Reinforcement Learning in End-to-end Autonomous Driving

    cs.RO 2026-03 unverdicted novelty 7.0

    PaIR-Drive runs IL and RL in parallel branches with a tree-structured sampler to reach 91.2 PDMS and 87.9 EPDMS on NAVSIM benchmarks while outperforming sequential RL fine-tuning and correcting some human errors.

  3. Latent Chain-of-Thought World Modeling for End-to-End Driving

    cs.CV 2025-12 unverdicted novelty 7.0

    LCDrive unifies chain-of-thought reasoning and action selection for end-to-end driving by interleaving action-proposal tokens and latent world-model tokens that predict action outcomes, yielding faster inference and b...

  4. ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving

    cs.CV 2025-06 unverdicted novelty 7.0

    ReCogDrive unifies VLM scene understanding with a diffusion planner reinforced by DiffGRPO to reach state-of-the-art results on NAVSIM and Bench2Drive benchmarks.

  5. MAPLE: Latent Multi-Agent Play for End-to-End Autonomous Driving

    cs.RO 2026-05 unverdicted novelty 6.0

    MAPLE performs closed-loop multi-agent training of VLA driving models entirely in latent space using supervised fine-tuning followed by RL with safety, progress, and diversity rewards, reaching SOTA on Bench2Drive.

  6. GuardAD: Safeguarding Autonomous Driving MLLMs via Markovian Safety Logic

    cs.AI 2026-05 unverdicted novelty 6.0

    GuardAD reduces accident rates by 32% in autonomous driving MLLMs by using n-th order Markovian logic to infer latent hazards and revise actions.

  7. Unified Map Prior Encoder for Mapping and Planning

    cs.CV 2026-05 unverdicted novelty 6.0

    UMPE fuses any subset of HD/SD vector maps, raster SD maps, and satellite imagery into BEV features via alignment-aware vector and raster branches, raising mapping mAP by 5.3-5.9 points and cutting planning L2 error b...

  8. VLADriver-RAG: Retrieval-Augmented Vision-Language-Action Models for Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 6.0

    VLADriver-RAG reaches a new state-of-the-art Driving Score of 89.12 on Bench2Drive by retrieving structure-aware historical knowledge through spatiotemporal semantic graphs and Graph-DTW alignment.

  9. OneDrive: Unified Multi-Paradigm Driving with Vision-Language-Action Models

    cs.CV 2026-04 unverdicted novelty 6.0

    OneDrive unifies heterogeneous decoding in a single VLM transformer decoder for end-to-end driving, achieving 0.28 L2 error and 0.18 collision rate on nuScenes plus 86.8 PDMS on NAVSIM.

  10. Orion-Lite: Distilling LLM Reasoning into Efficient Vision-Only Driving Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Orion-Lite uses latent feature distillation and trajectory supervision to create a vision-only model that surpasses its LLM-based teacher on closed-loop Bench2Drive evaluation, achieving a new SOTA driving score of 80.6.

  11. DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale

    cs.CV 2026-04 unverdicted novelty 6.0

    DVGT-2 is a streaming vision-geometry-action model that jointly reconstructs dense 3D geometry and plans trajectories online, achieving better reconstruction than prior batch methods while transferring directly to pla...

  12. DriveVLA-W0: World Models Amplify Data Scaling Law in Autonomous Driving

    cs.CV 2025-10 unverdicted novelty 6.0

    DriveVLA-W0 adds world modeling to predict future images in VLA models, overcoming sparse action supervision and amplifying data scaling laws on NAVSIM benchmarks and a large in-house dataset.

  13. AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning

    cs.CV 2025-06 unverdicted novelty 6.0

    AutoVLA unifies semantic reasoning and trajectory planning in one autoregressive VLA model for end-to-end autonomous driving by tokenizing trajectories into discrete actions and using GRPO reinforcement fine-tuning to...

  14. Latency Analysis and Optimization of Alpamayo 1 via Efficient Trajectory Generation

    cs.AI 2026-05 unverdicted novelty 5.0

    Redesigning Alpamayo 1 to single-reasoning and optimizing diffusion action generation cuts inference latency by 69.23% while preserving trajectory diversity and prediction quality.

  15. VLADriver-RAG: Retrieval-Augmented Vision-Language-Action Models for Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 5.0

    VLADriver-RAG achieves state-of-the-art performance on Bench2Drive by grounding VLA planning in structure-aware retrieved priors via spatiotemporal semantic graphs and Graph-DTW alignment.

  16. SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model

    cs.CV 2026-04 unverdicted novelty 5.0

    SpanVLA reduces action generation latency via flow-matching conditioned on history and improves robustness by training on negative-recovery samples with GRPO and a dedicated reasoning dataset.

  17. EvoDriveVLA: Evolving Driving VLA Models via Collaborative Perception-Planning Distillation

    cs.CV 2026-03 unverdicted novelty 5.0

    EvoDriveVLA uses collaborative perception-planning distillation with self-anchor and future-aware teachers to fix perception degradation and long-term instability in driving VLA models, reaching SOTA on nuScenes and NAVSIM.

  18. DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 4.0

    DeepSight uses parallel latent feature prediction in BEV for long-horizon world modeling and adaptive text reasoning to reach state-of-the-art closed-loop performance on the Bench2drive benchmark.

Reference graph

Works this paper leans on

103 extracted references · 103 canonical work pages · cited by 17 Pith papers · 13 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ah- mad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 ,

  2. [2]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. In Proc. of Ad- vances in Neural Information Processing Systems, 2022. 3

  3. [3]

    Gemini: A Family of Highly Capable Multimodal Models

    Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean- Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 1, 2023. 2, 3

  4. [4]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023. 3

  5. [5]

    Improving image generation with better captions

    James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science., page 8, 2023. 4

  6. [6]

    nuscenes: A multi- modal dataset for autonomous driving

    Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Gi- ancarlo Baldan, and Oscar Beijbom. nuscenes: A multi- modal dataset for autonomous driving. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition , pages 11621–11631, 2020. 1, 5, 2

  7. [7]

    Multipath: Multiple probabilistic anchor tra- jectory hypotheses for behavior prediction

    Yuning Chai, Benjamin Sapp, Mayank Bansal, and Dragomir Anguelov. Multipath: Multiple probabilistic anchor tra- jectory hypotheses for behavior prediction. arXiv preprint arXiv:1910.05449, 2019. 1

  8. [8]

    VADv2: End-to-End Vectorized Autonomous Driving via Probabilistic Planning

    Shaoyu Chen, Bo Jiang, Hao Gao, Bencheng Liao, Qing Xu, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. Vadv2: End-to-end vectorized autonomous driving via probabilistic planning. arXiv preprint arXiv:2402.13243,

  9. [9]

    How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites

    Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhang- wei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. Science China Information Sciences , page 220101,

  10. [10]

    Internvl: Scaling up vision foundation mod- els and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation mod- els and aligning for generic visual-linguistic tasks. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recog- nition, pages 24185–24198, 2024. 1, 2

  11. [11]

    Carla: An open urban driving simulator

    Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. Carla: An open urban driving simulator. In Conf. on Robot Learning, pages 1–16, 2017. 2, 5

  12. [12]

    Eva-02: A visual representa- 9 tion for neon genesis

    Yuxin Fang, Quan Sun, Xinggang Wang, Tiejun Huang, Xin- long Wang, and Yue Cao. Eva-02: A visual representa- 9 tion for neon genesis. Image and Vision Computing , page 105171, 2024. 5

  13. [13]

    Mathematical capabilities of chatgpt

    Simon Frieder, Luca Pinchetti, Ryan-Rhys Griffiths, Tom- maso Salvatori, Thomas Lukasiewicz, Philipp Petersen, and Julius Berner. Mathematical capabilities of chatgpt. Proc. of Advances in Neural Information Processing Systems , 36,

  14. [14]

    Generative adversarial networks

    Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Commu- nications of the ACM, pages 139–144, 2020. 4

  15. [15]

    Vip3d: End-to-end visual trajectory prediction via 3d agent queries

    Junru Gu, Chenxu Hu, Tianyuan Zhang, Xuanyao Chen, Yilun Wang, Yue Wang, and Hang Zhao. Vip3d: End-to-end visual trajectory prediction via 3d agent queries. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recogni- tion, pages 5496–5506, 2023. 1

  16. [16]

    Lora: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In Proc. of Intl. Conf. on Learning Representations , 2022. 5, 2

  17. [17]

    St-p3: End-to-end vision-based au- tonomous driving via spatial-temporal feature learning

    Shengchao Hu, Li Chen, Penghao Wu, Hongyang Li, Junchi Yan, and Dacheng Tao. St-p3: End-to-end vision-based au- tonomous driving via spatial-temporal feature learning. In Proc. of European Conference on Computer Vision , pages 533–549, 2022. 1

  18. [18]

    Planning-oriented autonomous driving

    Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. InProc. of IEEE Intl. Conf. on Computer Vision and Pattern Recog- nition, pages 17853–17862, 2023. 1, 2, 5, 6

  19. [19]

    EMMA: End-to-End Multimodal Model for Autonomous Driving

    Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, et al. Emma: End-to-end multimodal model for autonomous driving. arXiv preprint arXiv:2410.23262,

  20. [20]

    Hid- den biases of end-to-end driving models

    Bernhard Jaeger, Kashyap Chitta, and Andreas Geiger. Hid- den biases of end-to-end driving models. In Porc. of IEEE Intl. Conf. on Computer Vision, pages 8240–8249, 2023. 8

  21. [21]

    Driveadapter: Breaking the coupling barrier of perception and planning in end-to-end autonomous driving

    Xiaosong Jia, Yulu Gao, Li Chen, Junchi Yan, Patrick Langechuan Liu, and Hongyang Li. Driveadapter: Breaking the coupling barrier of perception and planning in end-to-end autonomous driving. In Porc. of IEEE Intl. Conf. on Computer Vision, 2023. 2, 6, 8

  22. [22]

    Think twice before driv- ing: Towards scalable decoders for end-to-end autonomous driving

    Xiaosong Jia, Penghao Wu, Li Chen, Jiangwei Xie, Conghui He, Junchi Yan, and Hongyang Li. Think twice before driv- ing: Towards scalable decoders for end-to-end autonomous driving. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, 2023. 2, 6, 8

  23. [23]

    Bench2drive: Towards multi-ability benchmark- ing of closed-loop end-to-end autonomous driving

    Xiaosong Jia, Zhenjie Yang, Qifeng Li, Zhiyuan Zhang, and Junchi Yan. Bench2drive: Towards multi-ability benchmark- ing of closed-loop end-to-end autonomous driving. Proc. of Advances in Neural Information Processing Systems , 2024. 1, 5, 9, 2

  24. [24]

    Drivetransformer: Unified transformer for scalable end-to- end autonomous driving

    Xiaosong Jia, Junqi You, Zhiyuan Zhang, and Junchi Yan. Drivetransformer: Unified transformer for scalable end-to- end autonomous driving. In Proc. of Intl. Conf. on Learning Representations, 2025. 2, 6, 8

  25. [25]

    Vad: Vectorized scene representation for efficient autonomous driving

    Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. Vad: Vectorized scene representation for efficient autonomous driving. InProc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition , pages 8340– 8350, 2023. 1, 2, 3, 5, 6, 7

  26. [26]

    Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving

    Bo Jiang, Shaoyu Chen, Bencheng Liao, Xingyu Zhang, Wei Yin, Qian Zhang, Chang Huang, Wenyu Liu, and Xing- gang Wang. Senna: Bridging large vision-language mod- els and end-to-end autonomous driving. arXiv preprint arXiv:2410.22313, 2024. 2, 3, 7

  27. [27]

    Far3d: Expanding the horizon for surround-view 3d object detec- tion

    Xiaohui Jiang, Shuailin Li, Yingfei Liu, Shihao Wang, Fan Jia, Tiancai Wang, Lijin Han, and Xiangyu Zhang. Far3d: Expanding the horizon for surround-view 3d object detec- tion. In Proc. of the AAAI Conf. on Artificial Intelligence , pages 2561–2569, 2024. 1

  28. [28]

    Auto-Encoding Variational Bayes

    Diederik P Kingma. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013. 2, 4, 5

  29. [29]

    Think2drive: Efficient reinforcement learning by thinking with latent world model for autonomous driving (in carla- v2)

    Qifeng Li, Xiaosong Jia, Shaobo Wang, and Junchi Yan. Think2drive: Efficient reinforcement learning by thinking with latent world model for autonomous driving (in carla- v2). In Proc. of European Conference on Computer Vision, pages 142–158, 2024. 6

  30. [30]

    Mon- key: Image resolution and text label are important things for large multi-modal models

    Zhang Li, Biao Yang, Qiang Liu, Zhiyin Ma, Shuo Zhang, Jingxu Yang, Yabo Sun, Yuliang Liu, and Xiang Bai. Mon- key: Image resolution and text label are important things for large multi-modal models. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, pages 26763– 26773, 2024. 2

  31. [31]

    Is ego status all you need for open- loop end-to-end autonomous driving? In Proc

    Zhiqi Li, Zhiding Yu, Shiyi Lan, Jiahan Li, Jan Kautz, Tong Lu, and Jose M Alvarez. Is ego status all you need for open- loop end-to-end autonomous driving? In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition , pages 14864–14873, 2024. 2

  32. [32]

    Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving

    Bencheng Liao, Shaoyu Chen, Haoran Yin, Bo Jiang, Cheng Wang, Sixu Yan, Xinbang Zhang, Xiangyu Li, Ying Zhang, Qian Zhang, et al. Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, 2025. 2

  33. [33]

    Rouge: A package for automatic evalua- tion of summaries

    Chin-Yew Lin. Rouge: A package for automatic evalua- tion of summaries. In Proc. Annual Meeting of the Associa- tion for Computational Linguistics Workshop, pages 74–81,

  34. [34]

    Focal loss for dense object detection

    Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Doll´ar. Focal loss for dense object detection. In Porc. of IEEE Intl. Conf. on Computer Vision , pages 2980–2988,

  35. [35]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In Proc. of Advances in Neural Information Processing Systems, 2023. 2

  36. [36]

    Llava-next: Im- proved reasoning, ocr, and world knowledge, 2024

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Im- proved reasoning, ocr, and world knowledge, 2024. 2

  37. [37]

    Petr: Position embedding transformation for multi-view 3d object detection

    Yingfei Liu, Tiancai Wang, Xiangyu Zhang, and Jian Sun. Petr: Position embedding transformation for multi-view 3d object detection. In Proc. of European Conference on Com- puter Vision, pages 531–548, 2022. 3 10

  38. [38]

    Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

    Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue Huang, Hanchi Sun, Jian- feng Gao, et al. Sora: A review on background, technology, limitations, and opportunities of large vision models. arXiv preprint arXiv:2402.17177, 2024. 2, 4

  39. [39]

    DeepSeek-VL: Towards Real-World Vision-Language Understanding

    Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, et al. Deepseek-vl: towards real-world vision- language understanding. arXiv preprint arXiv:2403.05525,

  40. [40]

    Continuously learning, adapting, and im- proving: A dual-process approach to autonomous driving

    Jianbiao Mei, Yukai Ma, Xuemeng Yang, Licheng Wen, Xinyu Cai, Xin Li, Daocheng Fu, Bo Zhang, Pinlong Cai, Min Dou, et al. Continuously learning, adapting, and im- proving: A dual-process approach to autonomous driving. In Proc. of Advances in Neural Information Processing Sys- tems, 2024. 2

  41. [41]

    Bleu: a method for automatic evaluation of machine translation

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proc. Annual Meeting of the Association for Computational Linguistics, pages 311–318, 2002. 5

  42. [42]

    Math- bert: A pre-trained model for mathematical formula under- standing

    Shuai Peng, Ke Yuan, Liangcai Gao, and Zhi Tang. Math- bert: A pre-trained model for mathematical formula under- standing. arXiv preprint arXiv:2105.00377, 2021. 2, 7

  43. [43]

    Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d

    Jonah Philion and Sanja Fidler. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. In Proc. of European Conference on Computer Vision, pages 194–210, 2020. 1

  44. [44]

    Multi- modal fusion transformer for end-to-end autonomous driv- ing

    Aditya Prakash, Kashyap Chitta, and Andreas Geiger. Multi- modal fusion transformer for end-to-end autonomous driv- ing. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, pages 7077–7087, 2021. 1

  45. [45]

    Learn- ing transferable visual models from natural language super- vision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language super- vision. In Proc. of Intl. Conf. on Machine Learning , pages 8748–8763, 2021. 2

  46. [46]

    Carllava: Vision language models for camera-only closed-loop driving

    Katrin Renz, Long Chen, Ana-Maria Marcu, Jan H¨unermann, Benoit Hanotte, Alice Karnsund, Jamie Shotton, Elahe Arani, and Oleg Sinavski. Carllava: Vision language models for camera-only closed-loop driving. arXiv preprint arXiv:2406.10165, 2024. 7

  47. [47]

    High-resolution image syn- thesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj¨orn Ommer. High-resolution image syn- thesis with latent diffusion models. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition , pages 10684–10695, 2022. 2, 4

  48. [48]

    U- net: Convolutional networks for biomedical image segmen- tation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U- net: Convolutional networks for biomedical image segmen- tation. In Proc. of Intl. Conf. on Medical Image Computing and Computer Assisted Intervention , pages 234–241, 2015. 2, 4

  49. [49]

    Lmdrive: Closed-loop end-to-end driving with large language models

    Hao Shao, Yuxuan Hu, Letian Wang, Guanglu Song, Steven L Waslander, Yu Liu, and Hongsheng Li. Lmdrive: Closed-loop end-to-end driving with large language models. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, pages 15120–15130, 2024. 1, 3

  50. [50]

    Motion transformer with global intention localization and lo- cal movement refinement

    Shaoshuai Shi, Li Jiang, Dengxin Dai, and Bernt Schiele. Motion transformer with global intention localization and lo- cal movement refinement. Proc. of Advances in Neural In- formation Processing Systems, 35:6531–6543, 2022. 1

  51. [51]

    Moviechat: From dense token to sparse memory for long video understanding

    Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, et al. Moviechat: From dense token to sparse memory for long video understanding. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recogni- tion, pages 18221–18232, 2024. 4

  52. [52]

    Don’t shake the wheel: Momentum- aware planning in end-to-end autonomous driving

    Ziying Song, Caiyan Jia, Lin Liu, Hongyu Pan, Yongchang Zhang, Junming Wang, Xingyu Zhang, Shaoqing Xu, Lei Yang, and Yadan Luo. Don’t shake the wheel: Momentum- aware planning in end-to-end autonomous driving. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recog- nition, 2025. 6

  53. [53]

    Scalability in percep- tion for autonomous driving: Waymo open dataset

    Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in percep- tion for autonomous driving: Waymo open dataset. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recog- nition, pages 2446–2454, 2020. 1

  54. [54]

    DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

    Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Yang Wang, Zhiyong Zhao, Kun Zhan, Peng Jia, Xianpeng Lang, and Hang Zhao. Drivevlm: The convergence of autonomous driving and large vision-language models. arXiv preprint arXiv:2402.12289, 2024. 3, 2

  55. [55]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023. 2

  56. [56]

    Cider: Consensus-based image description evalua- tion

    Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evalua- tion. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, pages 4566–4575, 2015. 5

  57. [57]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024. 1, 2, 3, 4

  58. [58]

    Exploring object-centric temporal modeling for efficient multi-view 3d object detection

    Shihao Wang, Yingfei Liu, Tiancai Wang, Ying Li, and Xi- angyu Zhang. Exploring object-centric temporal modeling for efficient multi-view 3d object detection. InPorc. of IEEE Intl. Conf. on Computer Vision, pages 3621–3631, 2023. 3, 4

  59. [59]

    Omnidrive: A holistic llm-agent framework for autonomous driving with 3d perception, reasoning and planning

    Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, and Jose M Alvarez. Omnidrive: A holistic llm-agent framework for autonomous driving with 3d perception, reasoning and planning. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recog- nition, 2024. 1, 2, 3, 5, 7

  60. [60]

    Drivemlm: Aligning multi-modal large language models with behavioral planning states for au- tonomous driving

    Wenhai Wang, Jiangwei Xie, ChuanYang Hu, Haoming Zou, Jianan Fan, Wenwen Tong, Yang Wen, Silei Wu, Hanming Deng, Zhiqi Li, et al. Drivemlm: Aligning multi-modal large language models with behavioral planning states for au- tonomous driving. arXiv preprint arXiv:2312.09245, 2023. 1, 3 11

  61. [61]

    Trajectory-guided control prediction for end-to-end autonomous driving: A simple yet strong base- line

    Penghao Wu, Xiaosong Jia, Li Chen, Junchi Yan, Hongyang Li, and Yu Qiao. Trajectory-guided control prediction for end-to-end autonomous driving: A simple yet strong base- line. In Proc. of Advances in Neural Information Processing Systems, 2022. 2, 6, 8

  62. [62]

    Openemma: Open-source multimodal model for end-to-end autonomous driving

    Shuo Xing, Chengyuan Qian, Yuping Wang, Hongyuan Hua, Kexin Tian, Yang Zhou, and Zhengzhong Tu. Openemma: Open-source multimodal model for end-to-end autonomous driving. In Proc. of IEEE Winter Conf. on Applications of Computer Vision, pages 1001–1009, 2025. 2, 3

  63. [63]

    Drivegpt4: Interpretable end-to-end autonomous driving via large language model

    Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kwan-Yee K Wong, Zhenguo Li, and Hengshuang Zhao. Drivegpt4: Interpretable end-to-end autonomous driving via large language model. IEEE Robotics and Automation Let- ters, 2024. 1

  64. [64]

    Rethinking the Open-Loop Evaluation of End-to-End Autonomous Driving in nuScenes

    Jiang-Tian Zhai, Ze Feng, Jinhao Du, Yongqiang Mao, Jiang-Jiang Liu, Zichang Tan, Yifu Zhang, Xiaoqing Ye, and Jingdong Wang. Rethinking the open-loop evaluation of end-to-end autonomous driving in nuscenes. arXiv preprint arXiv:2305.10430, 2023. 2, 6

  65. [65]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Porc. of IEEE Intl. Conf. on Computer Vision , pages 11975–11986, 2023. 2

  66. [66]

    Fully sparse transformer 3-d detector for lidar point cloud

    Diankun Zhang, Zhijie Zheng, Haoyu Niu, Xueqing Wang, and Xiaojun Liu. Fully sparse transformer 3-d detector for lidar point cloud. IEEE Transactions on Geoscience and Re- mote Sensing, 61:1–12, 2023. 1

  67. [67]

    Sparsead: Sparse query-centric paradigm for efficient end-to-end autonomous driving

    Diankun Zhang, Guoan Wang, Runwen Zhu, Jianbo Zhao, Xiwu Chen, Siyu Zhang, Jiahao Gong, Qibin Zhou, Wenyuan Zhang, Ningzi Wang, et al. Sparsead: Sparse query-centric paradigm for efficient end-to-end autonomous driving. arXiv preprint arXiv:2404.06892, 2024. 1

  68. [68]

    End-to-end urban driving by imitating a reinforcement learning coach

    Zhejun Zhang, Alexander Liniger, Dengxin Dai, Fisher Yu, and Luc Van Gool. End-to-end urban driving by imitating a reinforcement learning coach. In Porc. of IEEE Intl. Conf. on Computer Vision, 2021. 2

  69. [69]

    Judging llm-as-a-judge with mt-bench and chatbot arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Proc. of Advances in Neural In- formation Processing Systems, 36:46595–46623, 2023. 5

  70. [70]

    Genad: Generative end-to-end au- tonomous driving

    Wenzhao Zheng, Ruiqi Song, Xianda Guo, Chenming Zhang, and Long Chen. Genad: Generative end-to-end au- tonomous driving. InProc. of European Conference on Com- puter Vision, pages 87–104, 2024. 1, 2, 5, 6

  71. [71]

    Hidden biases of end- to-end driving datasets

    Julian Zimmerlin, Jens Beißwenger, Bernhard Jaeger, An- dreas Geiger, and Kashyap Chitta. Hidden biases of end- to-end driving datasets. arXiv preprint arXiv:2412.09602 ,

  72. [72]

    description

    8 12 ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation Supplementary Material We provide supplementary material to complement the main paper, arranged as follows: • Appendix A: Details on the Chat-B2D dataset. • Appendix B: Traning Details. • Appendix C: More results. A. Details on the Chat-B2D datas...

  73. [73]

    What can you tell about the current driving conditions from the images?

  74. [74]

    What can be observed in the panoramic images provided?

  75. [75]

    Can you provide a summary of the current driving scenario based on the input images?

  76. [76]

    What can you observe from the provided images regarding the driving conditions?

  77. [77]

    Please describe the current driving conditions based on the images provided

  78. [78]

    Can you describe the current weather conditions and the general environment depicted in the images?

  79. [79]

    Please describe the current driving conditions based on the input images

  80. [80]

    Could you summarize the current driving conditions based on the input images?

Showing first 80 references.