pith. machine review for the scientific record.

arxiv: 2410.22313 · v1 · submitted 2024-10-29 · 💻 cs.CV · cs.RO

Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving

Pith reviewed 2026-05-15 15:19 UTC · model grok-4.3

classification 💻 cs.CV · cs.RO
keywords autonomous driving · vision-language models · end-to-end driving · trajectory prediction · planning decisions · nuScenes · pre-training

The pith

Senna uses a large vision-language model for natural language driving plans that an end-to-end model converts into precise trajectories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper seeks to merge the commonsense reasoning of large vision-language models with the data-driven planning of end-to-end autonomous driving systems. It proposes decoupling high-level decisions, expressed in language, from the generation of precise numerical trajectories, to address failures in complex and rare scenarios. A sympathetic reader would care because current approaches either lack reasoning or struggle with precise control, and a successful integration would be a step toward fully autonomous vehicles. The work shows concrete gains from pre-training on a large-scale driving dataset (DriveX) followed by fine-tuning on nuScenes.

Core claim

Senna decouples high-level planning from low-level trajectory prediction: its vision-language component generates natural-language planning decisions using multi-image encoding and multi-view prompts, while the end-to-end component predicts precise trajectories, supported by planning-oriented question-answering data and a three-stage training strategy. This yields state-of-the-art performance, including a 27.12% reduction in average planning error and a 33.33% reduction in collision rate after pre-training on DriveX and fine-tuning on nuScenes.

What carries the argument

The hybrid architecture that separates natural language planning decisions from trajectory prediction, with the LVLM handling scene understanding and reasoning.
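
To make the division of labor concrete, here is a minimal sketch of the decoupled interface in Python. The function names, the meta-action vocabulary, and the dummy roll-out are illustrative assumptions, not the paper's API; the point is only that commonsense reasoning lives on the language side while numerical precision lives on the trajectory side.

```python
from dataclasses import dataclass

# Assumed meta-action vocabularies; Senna's actual decision space may differ.
LATERAL = ("LEFT", "RIGHT", "STRAIGHT")
SPEED = ("ACCELERATE", "DECELERATE", "KEEP", "STOP")

@dataclass
class PlanDecision:
    lateral: str
    speed: str

def vlm_plan(images, prompt) -> str:
    """Stand-in for Senna-VLM: emits a natural-language planning decision."""
    # A real LVLM call over multi-view images would go here.
    return "Go STRAIGHT and DECELERATE for the pedestrian crossing ahead."

def parse_decision(text: str) -> PlanDecision:
    """Project the free-form plan onto the discrete decision space."""
    up = text.upper()
    lat = next((w for w in LATERAL if w in up), "STRAIGHT")
    spd = next((w for w in SPEED if w in up), "KEEP")
    return PlanDecision(lat, spd)

def e2e_trajectory(images, decision: PlanDecision, horizon: int = 6):
    """Stand-in for Senna-E2E: waypoints conditioned on the high-level decision."""
    v, decel = 5.0, 1.0 if decision.speed == "DECELERATE" else 0.0
    waypoints, y = [], 0.0
    for _ in range(horizon):          # simple straight-ahead roll-out
        v = max(v - decel, 0.0)
        y += v
        waypoints.append((0.0, y))    # (x, y) in ego frame
    return waypoints

decision = parse_decision(vlm_plan(images=None, prompt="What should the ego car do?"))
print(decision, e2e_trajectory(images=None, decision=decision))
```

In the paper's terms, `vlm_plan` plays Senna-VLM and `e2e_trajectory` plays Senna-E2E; only the low-dimensional decision crosses the boundary between them.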

If this is right

  • Senna achieves state-of-the-art planning performance on standard benchmarks.
  • Pre-training on large-scale data like DriveX followed by fine-tuning significantly improves results over no pre-training.
  • The system demonstrates improved handling of complex and rare scenarios through commonsense from the language model.
  • Cross-scenario generalization and transferability support progress toward fully autonomous driving.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • This bridging strategy might apply to other domains needing both high-level reasoning and low-level precision, such as robotics manipulation.
  • Further work could test whether the language-to-trajectory translation holds in real-time, safety-critical driving without simulation artifacts.
  • Exploring richer prompt engineering or additional sensor inputs could enhance the multi-view scene understanding.

Load-bearing premise

Natural language planning outputs from the LVLM can be translated into low-level trajectories by the E2E model without introducing critical errors or losing necessary detail.

What would settle it

A high collision rate in rare scenarios where the generated language plan is accurate but the resulting trajectory deviates would show that the translation step fails.
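
As a worked version of that test (the record schema below is an assumed format; the paper defines no such log), partition rare-scenario cases by plan accuracy and trajectory deviation and see where collisions concentrate:

```python
# Toy falsification check for the load-bearing premise. If collisions
# concentrate in cases where the plan was accurate but the trajectory
# drifted, the language-to-trajectory interface, not the LVLM's
# reasoning, is the weak link.
cases = [
    {"rare": True,  "plan_ok": True,  "traj_dev": True,  "collided": True},
    {"rare": True,  "plan_ok": True,  "traj_dev": False, "collided": False},
    {"rare": True,  "plan_ok": False, "traj_dev": True,  "collided": True},
    {"rare": False, "plan_ok": True,  "traj_dev": False, "collided": False},
]

suspect = [c for c in cases if c["rare"] and c["plan_ok"] and c["traj_dev"]]
rate = sum(c["collided"] for c in suspect) / max(len(suspect), 1)
print(f"collision rate | accurate plan, deviating trajectory: {rate:.0%}")
```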

Original abstract

End-to-end autonomous driving demonstrates strong planning capabilities with large-scale data but still struggles in complex, rare scenarios due to limited commonsense. In contrast, Large Vision-Language Models (LVLMs) excel in scene understanding and reasoning. The path forward lies in merging the strengths of both approaches. Previous methods using LVLMs to predict trajectories or control signals yield suboptimal results, as LVLMs are not well-suited for precise numerical predictions. This paper presents Senna, an autonomous driving system combining an LVLM (Senna-VLM) with an end-to-end model (Senna-E2E). Senna decouples high-level planning from low-level trajectory prediction. Senna-VLM generates planning decisions in natural language, while Senna-E2E predicts precise trajectories. Senna-VLM utilizes a multi-image encoding approach and multi-view prompts for efficient scene understanding. Besides, we introduce planning-oriented QAs alongside a three-stage training strategy, which enhances Senna-VLM's planning performance while preserving commonsense. Extensive experiments on two datasets show that Senna achieves state-of-the-art planning performance. Notably, with pre-training on a large-scale dataset DriveX and fine-tuning on nuScenes, Senna significantly reduces average planning error by 27.12% and collision rate by 33.33% over model without pre-training. We believe Senna's cross-scenario generalization and transferability are essential for achieving fully autonomous driving. Code and models will be released at https://github.com/hustvl/Senna.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity check, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Senna, which decouples high-level planning from low-level control in autonomous driving by pairing an LVLM (Senna-VLM) that outputs natural-language planning decisions with an E2E model (Senna-E2E) that predicts precise trajectories. Senna-VLM uses multi-image encoding, multi-view prompts, and planning-oriented QAs trained in three stages; experiments with DriveX pre-training followed by nuScenes fine-tuning report SOTA planning results, including a 27.12% drop in average planning error and a 33.33% reduction in collision rate relative to the no-pretrain baseline.

Significance. If the empirical gains hold under rigorous controls, the work offers a practical route to injecting commonsense reasoning into E2E driving pipelines, addressing their documented weakness in rare or complex scenes. The promised release of code and models is a concrete strength that supports reproducibility and follow-on research.

major comments (2)
  1. [Abstract, §4 Experiments] The reported 27.12% planning-error and 33.33% collision-rate reductions are presented without error bars, without an ablation isolating the language-to-trajectory translation step, and without a full table of competing baselines; these omissions make it impossible to attribute the gains specifically to the LVLM component rather than to scale or training schedule.
  2. [§3.2, Senna-VLM to Senna-E2E interface] The central architectural claim rests on the assumption that natural-language plans can be mapped to E2E trajectory inputs without critical loss of spatial or temporal detail; no quantitative metric (e.g., conversion accuracy on held-out ambiguous prompts or rare-scenario subsets) is provided to validate this interface, which directly undermines the attribution of performance improvements to the LVLM.
minor comments (2)
  1. [Abstract] The abstract states that LVLMs are ill-suited for numerical outputs yet does not quantify how the chosen conditioning mechanism (tokenization of planning QAs into E2E inputs) mitigates this limitation.
  2. [§4] Figure captions and §4 should explicitly list the exact metrics (e.g., L2 error, collision rate) and the precise definition of the “model without pre-training” baseline for each reported number; a candidate rendering of these metrics is sketched after this list.
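
For concreteness, here is one plausible rendering of those two metrics under the common open-loop nuScenes convention. The exact averaging protocol differs across papers, which is part of the referee's complaint, so treat this as a candidate definition rather than Senna's:

```python
import numpy as np

def avg_l2(pred, gt, horizons=(2, 4, 6)):
    """Mean L2 distance (m) between predicted and ground-truth waypoints at
    fixed horizon indices, e.g. 1s/2s/3s for waypoints sampled at 2 Hz.
    One common convention; not necessarily the paper's exact protocol."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    return {f"{h / 2:.0f}s": float(np.linalg.norm(pred[h - 1] - gt[h - 1]))
            for h in horizons}

def collision_rate(collision_flags):
    """Fraction of predicted waypoints flagged as overlapping another agent."""
    return float(np.mean(np.asarray(collision_flags, dtype=bool)))

# Illustrative 3-second, 2 Hz trajectories in the ego frame.
pred = [(0, 2), (0, 4), (0, 6), (0, 8), (0, 10), (0, 12)]
gt   = [(0, 2.1), (0.2, 4.0), (0.1, 6.3), (0.0, 8.1), (0.3, 9.8), (0, 12.4)]
print(avg_l2(pred, gt), collision_rate([False] * 6))
```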

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects for strengthening the experimental validation and interface analysis. We address each point below and commit to revisions that improve clarity and rigor without altering the core contributions.

Point-by-point responses
  1. Referee: [Abstract, §4 Experiments] The reported 27.12% planning-error and 33.33% collision-rate reductions are presented without error bars, without an ablation isolating the language-to-trajectory translation step, and without a full table of competing baselines; these omissions make it impossible to attribute the gains specifically to the LVLM component rather than to scale or training schedule.

    Authors: We agree that error bars and expanded baselines would improve attribution. In the revised version we will add standard deviations computed over three independent runs for the reported metrics on nuScenes. We will also expand the main results table to include all relevant competing methods with their full metrics. For isolating the language-to-trajectory step, we will insert a new controlled ablation that feeds ground-truth parsed plans versus Senna-VLM-generated plans into the identical Senna-E2E backbone; this directly quantifies the contribution of the LVLM component beyond scale or schedule. These additions will appear in the updated §4 and supplementary material. revision: yes

  2. Referee: [§3.2, Senna-VLM to Senna-E2E interface] The central architectural claim rests on the assumption that natural-language plans can be mapped to E2E trajectory inputs without critical loss of spatial or temporal detail; no quantitative metric (e.g., conversion accuracy on held-out ambiguous prompts or rare-scenario subsets) is provided to validate this interface, which directly undermines the attribution of performance improvements to the LVLM.

    Authors: The interface uses a lightweight deterministic parser that extracts structured fields (target velocity, lane-change flag, stop/go decision, and approximate waypoint offsets) from the generated natural-language sentence; these fields are tokenized and concatenated to the E2E visual features. While the current manuscript provides qualitative examples, we acknowledge the absence of a quantitative fidelity metric. In revision we will add a parser-accuracy evaluation on a held-out set of 500 Senna-VLM outputs (including 100 rare-scenario cases), reporting exact-match accuracy for each extracted field against human annotations. This metric will be reported in an expanded §3.2 and will support the claim that critical spatial-temporal information is preserved. revision: yes
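
The rebuttal is simulated, so the parser it describes is hypothetical; the toy extractor below, together with the promised exact-match evaluation, only illustrates the kind of deterministic language-to-structure interface being proposed:

```python
import re

# Toy version of the deterministic plan parser described in the (simulated)
# rebuttal. The field set and phrasing rules are illustrative assumptions,
# not Senna's code.
def parse_plan(text: str) -> dict:
    t = text.lower()
    vel = re.search(r"(\d+(?:\.\d+)?)\s*(?:m/s|km/h)", t)
    return {
        "target_velocity": float(vel.group(1)) if vel else None,
        "lane_change": any(k in t for k in ("change lane", "lane change", "merge")),
        "stop": "stop" in t or "halt" in t,
    }

def exact_match(preds, labels, field):
    """Per-field exact-match accuracy against human annotations."""
    return sum(p[field] == l[field] for p, l in zip(preds, labels)) / len(labels)

out = parse_plan("Slow to 5 m/s and stop before the crosswalk.")
print(out)  # {'target_velocity': 5.0, 'lane_change': False, 'stop': True}
```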

Circularity Check

0 steps flagged

Empirical benchmark results with no self-referential derivations

Full rationale

The paper reports measured planning error and collision reductions (27.12% and 33.33%) from pre-training on DriveX followed by fine-tuning on nuScenes, compared against an internal no-pretrain baseline. These are direct empirical outcomes on external datasets rather than any mathematical derivation, fitted parameter, or prediction that reduces to a quantity defined inside the paper. No equations, uniqueness theorems, or ansatzes are invoked in a load-bearing way; the architecture (multi-image encoding, planning-oriented QAs, three-stage training) is described independently of the final metrics, which remain falsifiable on the cited benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that LVLMs can produce reliable high-level plans in language and that these plans integrate cleanly with an E2E trajectory predictor. No new physical entities or ad-hoc constants are introduced beyond standard neural-network training.

axioms (1)
  • domain assumption LVLMs excel in scene understanding and reasoning while end-to-end models struggle with commonsense in rare scenarios
    Stated directly in the opening of the abstract as the motivation for the hybrid design.

pith-pipeline@v0.9.0 · 5600 in / 1215 out tokens · 59824 ms · 2026-05-15T15:19:50.878304+00:00 · methodology

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving

    cs.RO 2026-05 unverdicted novelty 7.0

    MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning metrics.

  2. VECTOR-Drive: Tightly Coupled Vision-Language and Trajectory Expert Routing for End-to-End Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 7.0

    VECTOR-DRIVE couples vision-language reasoning and trajectory planning in one Transformer via semantic expert routing and flow-matching, reaching 88.91 driving score on Bench2Drive.

  3. ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving

    cs.CV 2025-06 unverdicted novelty 7.0

    ReCogDrive unifies VLM scene understanding with a diffusion planner reinforced by DiffGRPO to reach state-of-the-art results on NAVSIM and Bench2Drive benchmarks.

  4. MAPLE: Latent Multi-Agent Play for End-to-End Autonomous Driving

    cs.RO 2026-05 unverdicted novelty 6.0

    MAPLE performs closed-loop multi-agent training of VLA driving models entirely in latent space using supervised fine-tuning followed by RL with safety, progress, and diversity rewards, reaching SOTA on Bench2Drive.

  5. MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving

    cs.RO 2026-05 unverdicted novelty 6.0

    MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.

  6. CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 6.0

    CoWorld-VLA encodes world information into four expert tokens that condition a diffusion-based planner, yielding competitive collision avoidance and trajectory accuracy on the NAVSIM benchmark.

  7. CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 6.0

    CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and t...

  8. GuardAD: Safeguarding Autonomous Driving MLLMs via Markovian Safety Logic

    cs.AI 2026-05 unverdicted novelty 6.0

    GuardAD reduces accident rates by 32% in autonomous driving MLLMs by using n-th order Markovian logic to infer latent hazards and revise actions.

  9. DriveFuture: Future-Aware Latent World Models for Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 6.0

    DriveFuture achieves SOTA results on NAVSIM by conditioning latent world model states on future predictions to directly inform trajectory planning.

  10. VLADriver-RAG: Retrieval-Augmented Vision-Language-Action Models for Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 6.0

    VLADriver-RAG reaches a new state-of-the-art Driving Score of 89.12 on Bench2Drive by retrieving structure-aware historical knowledge through spatiotemporal semantic graphs and Graph-DTW alignment.

  11. ST-Prune: Training-Free Spatio-Temporal Token Pruning for Vision-Language Models in Autonomous Driving

    cs.CV 2026-04 unverdicted novelty 6.0

    ST-Prune is a training-free spatio-temporal token pruning framework for VLMs in autonomous driving that achieves near-lossless results at 90% token reduction by exploiting motion volatility, temporal recency, and mult...

  12. OneDrive: Unified Multi-Paradigm Driving with Vision-Language-Action Models

    cs.CV 2026-04 unverdicted novelty 6.0

    OneDrive unifies heterogeneous decoding in a single VLM transformer decoder for end-to-end driving, achieving 0.28 L2 error and 0.18 collision rate on nuScenes plus 86.8 PDMS on NAVSIM.

  13. FeaXDrive: Feasibility-aware Trajectory-Centric Diffusion Planning for End-to-End Autonomous Driving

    cs.RO 2026-04 unverdicted novelty 6.0

    FeaXDrive improves end-to-end autonomous driving by shifting diffusion planning to a trajectory-centric formulation with curvature-constrained training, drivable-area guidance, and GRPO post-training, yielding stronge...

  14. DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale

    cs.CV 2026-04 unverdicted novelty 6.0

    DVGT-2 is a streaming vision-geometry-action model that jointly reconstructs dense 3D geometry and plans trajectories online, achieving better reconstruction than prior batch methods while transferring directly to pla...

  15. CausalVAD: De-confounding End-to-End Autonomous Driving via Causal Intervention

    cs.CV 2026-03 unverdicted novelty 6.0

    CausalVAD applies sparse causal intervention to remove spurious correlations from end-to-end autonomous driving models, reporting state-of-the-art planning accuracy and robustness on nuScenes.

  16. AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning

    cs.CV 2025-06 unverdicted novelty 6.0

    AutoVLA unifies semantic reasoning and trajectory planning in one autoregressive VLA model for end-to-end autonomous driving by tokenizing trajectories into discrete actions and using GRPO reinforcement fine-tuning to...

  17. C-CoT: Counterfactual Chain-of-Thought with Vision-Language Models for Safe Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 5.0

    C-CoT applies VLMs to autonomous driving via five-stage reasoning with a meta-action tree for counterfactuals, yielding 81.9% risk recall, 3.52% collision rate, and 1.98 m L2 error on a new dataset.

  18. Latency Analysis and Optimization of Alpamayo 1 via Efficient Trajectory Generation

    cs.AI 2026-05 unverdicted novelty 5.0

    Redesigning Alpamayo 1 to single-reasoning and optimizing diffusion action generation cuts inference latency by 69.23% while preserving trajectory diversity and prediction quality.

  19. VLADriver-RAG: Retrieval-Augmented Vision-Language-Action Models for Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 5.0

    VLADriver-RAG achieves state-of-the-art performance on Bench2Drive by grounding VLA planning in structure-aware retrieved priors via spatiotemporal semantic graphs and Graph-DTW alignment.

  20. SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model

    cs.CV 2026-04 unverdicted novelty 5.0

    SpanVLA reduces action generation latency via flow-matching conditioned on history and improves robustness by training on negative-recovery samples with GRPO and a dedicated reasoning dataset.

  21. RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework

    cs.CV 2026-04 unverdicted novelty 5.0

    RAD-2 uses a diffusion generator and RL discriminator to cut collision rates by 56% in closed-loop autonomous driving planning.

  22. DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 4.0

    DeepSight uses parallel latent feature prediction in BEV for long-horizon world modeling and adaptive text reasoning to reach state-of-the-art closed-loop performance on the Bench2drive benchmark.
