arxiv: 2503.19755 · v1 · submitted 2025-03-25 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation

Haoyu Fu , Diankun Zhang , Zongchuang Zhao , Jianfeng Cui , Dingkang Liang , Chong Zhang , Dingyuan Zhang , Hongwei Xie

show 2 more authors

Bing Wang Xiang Bai

Authors on Pith no claims yet

Pith reviewed 2026-05-17 08:04 UTC · model grok-4.3

classification 💻 cs.CV

keywords drivingreasoningactionmethodsorionspaceautonomousclosed-loop

0 comments

The pith

ORION reports 77.74 Driving Score and 54.62% Success Rate on Bench2Drive, outperforming prior end-to-end methods by 14.28 DS and 19.61% SR through unified VQA and planning optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes a system for self-driving cars that processes camera images and uses language models to understand the driving scene. It adds a module to summarize past observations over time, lets a large language model reason about what the car should do next, and then generates a smooth path for the vehicle to follow. The key step is training everything together so the language reasoning directly influences the trajectory numbers instead of treating them as separate tasks. On a standard benchmark for closed-loop driving, the system scores higher than previous methods in both overall driving quality and the percentage of successful trips without crashes or rule violations. The authors argue this closes a gap where language models understand scenes well but struggle to output precise control actions.

Core claim

Our method achieves an impressive closed-loop performance of 77.74 Driving Score (DS) and 54.62% Success Rate (SR) on the challenge Bench2Drive datasets, which outperforms state-of-the-art (SOTA) methods by a large margin of 14.28 DS and 19.61% SR.

Load-bearing premise

The assumption that aligning the reasoning space of the LLM with the numerical action space through unified E2E optimization will reliably improve closed-loop causal reasoning and trajectory quality without introducing new failure modes.

read the original abstract

End-to-end (E2E) autonomous driving methods still struggle to make correct decisions in interactive closed-loop evaluation due to limited causal reasoning capability. Current methods attempt to leverage the powerful understanding and reasoning abilities of Vision-Language Models (VLMs) to resolve this dilemma. However, the problem is still open that few VLMs for E2E methods perform well in the closed-loop evaluation due to the gap between the semantic reasoning space and the purely numerical trajectory output in the action space. To tackle this issue, we propose ORION, a holistic E2E autonomous driving framework by vision-language instructed action generation. ORION uniquely combines a QT-Former to aggregate long-term history context, a Large Language Model (LLM) for driving scenario reasoning, and a generative planner for precision trajectory prediction. ORION further aligns the reasoning space and the action space to implement a unified E2E optimization for both visual question-answering (VQA) and planning tasks. Our method achieves an impressive closed-loop performance of 77.74 Driving Score (DS) and 54.62% Success Rate (SR) on the challenge Bench2Drive datasets, which outperforms state-of-the-art (SOTA) methods by a large margin of 14.28 DS and 19.61% SR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes ORION, a holistic end-to-end autonomous driving framework that combines a QT-Former to aggregate long-term history context, an LLM for driving scenario reasoning, and a generative planner for precision trajectory prediction. It aligns the LLM reasoning space with the numerical action space via unified E2E optimization of VQA and planning tasks. On the Bench2Drive benchmark, the method reports a closed-loop Driving Score of 77.74 and Success Rate of 54.62%, outperforming prior SOTA by 14.28 DS and 19.61% SR.

Significance. If the central performance claims hold after addressing the noted gaps, this work could meaningfully advance VLM integration in autonomous driving by demonstrating a practical way to bridge semantic reasoning and numerical trajectory generation. The focus on closed-loop evaluation and the scale of the reported margins are notable strengths, as is the explicit attempt at unified optimization across perception-reasoning and control.

major comments (2)

[§3] §3 (Framework and Unified Optimization): The manuscript describes the QT-Former + LLM + generative planner architecture and the unified loss for VQA and planning, but provides no ablation that isolates the contribution of the reasoning-action space alignment (e.g., joint E2E training versus separate VQA-only or planning-only optimization). This ablation is load-bearing for the claim that alignment improves causal reasoning and closes the semantic-numerical gap, as the 14.28 DS / 19.61% SR gains could otherwise be explained by dataset tuning, history aggregation, or planner design alone.
[§5] §5 (Experiments and Results): The reported benchmark numbers on Bench2Drive lack error bars, standard deviations across runs, or explicit data exclusion criteria. Without these, it is difficult to assess whether the large margins over SOTA are statistically reliable or sensitive to particular evaluation conditions.

minor comments (2)

[§3] The notation for the unified loss function and the QT-Former aggregation could be made more explicit with an equation or pseudocode block to improve reproducibility.
[Figure 1] Figure 1 (overall architecture) would benefit from clearer arrows or labels distinguishing the VQA branch from the trajectory generation branch.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We have carefully reviewed the major comments and provide point-by-point responses below, along with our plans for revision.

read point-by-point responses

Referee: [§3] §3 (Framework and Unified Optimization): The manuscript describes the QT-Former + LLM + generative planner architecture and the unified loss for VQA and planning, but provides no ablation that isolates the contribution of the reasoning-action space alignment (e.g., joint E2E training versus separate VQA-only or planning-only optimization). This ablation is load-bearing for the claim that alignment improves causal reasoning and closes the semantic-numerical gap, as the 14.28 DS / 19.61% SR gains could otherwise be explained by dataset tuning, history aggregation, or planner design alone.

Authors: We agree that an explicit ablation isolating the contribution of the unified E2E optimization (joint VQA and planning training) versus separate optimization would provide stronger support for the benefits of reasoning-action space alignment. In the revised manuscript, we will add this ablation study, including comparisons of the full ORION model against variants trained separately on VQA-only and planning-only objectives. This will help clarify that the observed gains stem from the alignment mechanism rather than other architectural components. revision: yes
Referee: [§5] §5 (Experiments and Results): The reported benchmark numbers on Bench2Drive lack error bars, standard deviations across runs, or explicit data exclusion criteria. Without these, it is difficult to assess whether the large margins over SOTA are statistically reliable or sensitive to particular evaluation conditions.

Authors: We acknowledge that the current results presentation would benefit from greater statistical transparency. In the revised manuscript, we will report standard deviations and error bars obtained from multiple evaluation runs using different random seeds. We will also explicitly document the data exclusion criteria applied during the Bench2Drive closed-loop evaluations to facilitate assessment of result reliability. revision: yes

Circularity Check

0 steps flagged

No significant circularity; results tied to external benchmark evaluation

full rationale

The paper describes an architectural framework (QT-Former + LLM + generative planner) with unified E2E optimization for VQA and planning, then reports closed-loop metrics on the external Bench2Drive benchmark. No equations, fitted parameters, or self-citations are presented that reduce the claimed DS/SR gains to a self-referential quantity or to a prior result by the same authors. The performance numbers are obtained by running the trained model on a held-out challenge dataset, which constitutes independent empirical evidence rather than a derivation that collapses to its own inputs by construction. The central claim therefore remains non-circular under the stated criteria.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient technical detail to enumerate free parameters, axioms, or invented entities; no equations or training procedures are shown.

pith-pipeline@v0.9.0 · 5563 in / 1123 out tokens · 37739 ms · 2026-05-17T08:04:53.553705+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.DAlembert.Inevitability bilinear_family_forced unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

ORION further aligns the reasoning space and the action space to implement a unified E2E optimization for both visual question-answering (VQA) and planning tasks. Our method achieves an impressive closed-loop performance of 77.74 Driving Score (DS) and 54.62% Success Rate (SR)
IndisputableMonolith.Cost.FunctionalEquation washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

we introduce a generative planner to establish a unified latent representation for aligning the two spaces

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Fail2Drive: Benchmarking Closed-Loop Driving Generalization
cs.RO 2026-04 conditional novelty 7.0

Fail2Drive is the first paired-route benchmark for closed-loop generalization in CARLA, showing an average 22.8% success-rate drop on shifted scenarios and revealing failure modes such as ignoring visible LiDAR objects.
Fine-tuning is Not Enough: A Parallel Framework for Collaborative Imitation and Reinforcement Learning in End-to-end Autonomous Driving
cs.RO 2026-03 unverdicted novelty 7.0

PaIR-Drive runs IL and RL in parallel branches with a tree-structured sampler to reach 91.2 PDMS and 87.9 EPDMS on NAVSIM benchmarks while outperforming sequential RL fine-tuning and correcting some human errors.
Latent Chain-of-Thought World Modeling for End-to-End Driving
cs.CV 2025-12 unverdicted novelty 7.0

LCDrive unifies chain-of-thought reasoning and action selection for end-to-end driving by interleaving action-proposal tokens and latent world-model tokens that predict action outcomes, yielding faster inference and b...
ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving
cs.CV 2025-06 unverdicted novelty 7.0

ReCogDrive unifies VLM scene understanding with a diffusion planner reinforced by DiffGRPO to reach state-of-the-art results on NAVSIM and Bench2Drive benchmarks.
MAPLE: Latent Multi-Agent Play for End-to-End Autonomous Driving
cs.RO 2026-05 unverdicted novelty 6.0

MAPLE performs closed-loop multi-agent training of VLA driving models entirely in latent space using supervised fine-tuning followed by RL with safety, progress, and diversity rewards, reaching SOTA on Bench2Drive.
GuardAD: Safeguarding Autonomous Driving MLLMs via Markovian Safety Logic
cs.AI 2026-05 unverdicted novelty 6.0

GuardAD reduces accident rates by 32% in autonomous driving MLLMs by using n-th order Markovian logic to infer latent hazards and revise actions.
Unified Map Prior Encoder for Mapping and Planning
cs.CV 2026-05 unverdicted novelty 6.0

UMPE fuses any subset of HD/SD vector maps, raster SD maps, and satellite imagery into BEV features via alignment-aware vector and raster branches, raising mapping mAP by 5.3-5.9 points and cutting planning L2 error b...
VLADriver-RAG: Retrieval-Augmented Vision-Language-Action Models for Autonomous Driving
cs.CV 2026-05 unverdicted novelty 6.0

VLADriver-RAG reaches a new state-of-the-art Driving Score of 89.12 on Bench2Drive by retrieving structure-aware historical knowledge through spatiotemporal semantic graphs and Graph-DTW alignment.
OneDrive: Unified Multi-Paradigm Driving with Vision-Language-Action Models
cs.CV 2026-04 unverdicted novelty 6.0

OneDrive unifies heterogeneous decoding in a single VLM transformer decoder for end-to-end driving, achieving 0.28 L2 error and 0.18 collision rate on nuScenes plus 86.8 PDMS on NAVSIM.
Orion-Lite: Distilling LLM Reasoning into Efficient Vision-Only Driving Models
cs.CV 2026-04 unverdicted novelty 6.0

Orion-Lite uses latent feature distillation and trajectory supervision to create a vision-only model that surpasses its LLM-based teacher on closed-loop Bench2Drive evaluation, achieving a new SOTA driving score of 80.6.
DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale
cs.CV 2026-04 unverdicted novelty 6.0

DVGT-2 is a streaming vision-geometry-action model that jointly reconstructs dense 3D geometry and plans trajectories online, achieving better reconstruction than prior batch methods while transferring directly to pla...
DriveVLA-W0: World Models Amplify Data Scaling Law in Autonomous Driving
cs.CV 2025-10 unverdicted novelty 6.0

DriveVLA-W0 adds world modeling to predict future images in VLA models, overcoming sparse action supervision and amplifying data scaling laws on NAVSIM benchmarks and a large in-house dataset.
AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning
cs.CV 2025-06 unverdicted novelty 6.0

AutoVLA unifies semantic reasoning and trajectory planning in one autoregressive VLA model for end-to-end autonomous driving by tokenizing trajectories into discrete actions and using GRPO reinforcement fine-tuning to...
Latency Analysis and Optimization of Alpamayo 1 via Efficient Trajectory Generation
cs.AI 2026-05 unverdicted novelty 5.0

Redesigning Alpamayo 1 to single-reasoning and optimizing diffusion action generation cuts inference latency by 69.23% while preserving trajectory diversity and prediction quality.
VLADriver-RAG: Retrieval-Augmented Vision-Language-Action Models for Autonomous Driving
cs.CV 2026-05 unverdicted novelty 5.0

VLADriver-RAG achieves state-of-the-art performance on Bench2Drive by grounding VLA planning in structure-aware retrieved priors via spatiotemporal semantic graphs and Graph-DTW alignment.
SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model
cs.CV 2026-04 unverdicted novelty 5.0

SpanVLA reduces action generation latency via flow-matching conditioned on history and improves robustness by training on negative-recovery samples with GRPO and a dedicated reasoning dataset.
EvoDriveVLA: Evolving Driving VLA Models via Collaborative Perception-Planning Distillation
cs.CV 2026-03 unverdicted novelty 5.0

EvoDriveVLA uses collaborative perception-planning distillation with self-anchor and future-aware teachers to fix perception degradation and long-term instability in driving VLA models, reaching SOTA on nuScenes and NAVSIM.
DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving
cs.CV 2026-05 unverdicted novelty 4.0

DeepSight uses parallel latent feature prediction in BEV for long-horizon world modeling and adaptive text reasoning to reach state-of-the-art closed-loop performance on the Bench2drive benchmark.

Reference graph

Works this paper leans on

103 extracted references · 103 canonical work pages · cited by 17 Pith papers · 13 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ah- mad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 ,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. In Proc. of Ad- vances in Neural Information Processing Systems, 2022. 3

work page 2022
[3]

Gemini: A Family of Highly Capable Multimodal Models

Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean- Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 1, 2023. 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023. 3

work page internal anchor Pith review Pith/arXiv arXiv 2023
[5]

Improving image generation with better captions

James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science., page 8, 2023. 4

work page 2023
[6]

nuscenes: A multi- modal dataset for autonomous driving

Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Gi- ancarlo Baldan, and Oscar Beijbom. nuscenes: A multi- modal dataset for autonomous driving. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition , pages 11621–11631, 2020. 1, 5, 2

work page 2020
[7]

Multipath: Multiple probabilistic anchor tra- jectory hypotheses for behavior prediction

Yuning Chai, Benjamin Sapp, Mayank Bansal, and Dragomir Anguelov. Multipath: Multiple probabilistic anchor tra- jectory hypotheses for behavior prediction. arXiv preprint arXiv:1910.05449, 2019. 1

work page arXiv 1910
[8]

VADv2: End-to-End Vectorized Autonomous Driving via Probabilistic Planning

Shaoyu Chen, Bo Jiang, Hao Gao, Bencheng Liao, Qing Xu, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. Vadv2: End-to-end vectorized autonomous driving via probabilistic planning. arXiv preprint arXiv:2402.13243,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites

Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhang- wei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. Science China Information Sciences , page 220101,

work page
[10]

Internvl: Scaling up vision foundation mod- els and aligning for generic visual-linguistic tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation mod- els and aligning for generic visual-linguistic tasks. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recog- nition, pages 24185–24198, 2024. 1, 2

work page 2024
[11]

Carla: An open urban driving simulator

Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. Carla: An open urban driving simulator. In Conf. on Robot Learning, pages 1–16, 2017. 2, 5

work page 2017
[12]

Eva-02: A visual representa- 9 tion for neon genesis

Yuxin Fang, Quan Sun, Xinggang Wang, Tiejun Huang, Xin- long Wang, and Yue Cao. Eva-02: A visual representa- 9 tion for neon genesis. Image and Vision Computing , page 105171, 2024. 5

work page 2024
[13]

Mathematical capabilities of chatgpt

Simon Frieder, Luca Pinchetti, Ryan-Rhys Griffiths, Tom- maso Salvatori, Thomas Lukasiewicz, Philipp Petersen, and Julius Berner. Mathematical capabilities of chatgpt. Proc. of Advances in Neural Information Processing Systems , 36,

work page
[14]

Generative adversarial networks

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Commu- nications of the ACM, pages 139–144, 2020. 4

work page 2020
[15]

Vip3d: End-to-end visual trajectory prediction via 3d agent queries

Junru Gu, Chenxu Hu, Tianyuan Zhang, Xuanyao Chen, Yilun Wang, Yue Wang, and Hang Zhao. Vip3d: End-to-end visual trajectory prediction via 3d agent queries. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recogni- tion, pages 5496–5506, 2023. 1

work page 2023
[16]

Lora: Low-rank adaptation of large language models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In Proc. of Intl. Conf. on Learning Representations , 2022. 5, 2

work page 2022
[17]

St-p3: End-to-end vision-based au- tonomous driving via spatial-temporal feature learning

Shengchao Hu, Li Chen, Penghao Wu, Hongyang Li, Junchi Yan, and Dacheng Tao. St-p3: End-to-end vision-based au- tonomous driving via spatial-temporal feature learning. In Proc. of European Conference on Computer Vision , pages 533–549, 2022. 1

work page 2022
[18]

Planning-oriented autonomous driving

Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. InProc. of IEEE Intl. Conf. on Computer Vision and Pattern Recog- nition, pages 17853–17862, 2023. 1, 2, 5, 6

work page 2023
[19]

EMMA: End-to-End Multimodal Model for Autonomous Driving

Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, et al. Emma: End-to-end multimodal model for autonomous driving. arXiv preprint arXiv:2410.23262,

work page internal anchor Pith review Pith/arXiv arXiv
[20]

Hid- den biases of end-to-end driving models

Bernhard Jaeger, Kashyap Chitta, and Andreas Geiger. Hid- den biases of end-to-end driving models. In Porc. of IEEE Intl. Conf. on Computer Vision, pages 8240–8249, 2023. 8

work page 2023
[21]

Driveadapter: Breaking the coupling barrier of perception and planning in end-to-end autonomous driving

Xiaosong Jia, Yulu Gao, Li Chen, Junchi Yan, Patrick Langechuan Liu, and Hongyang Li. Driveadapter: Breaking the coupling barrier of perception and planning in end-to-end autonomous driving. In Porc. of IEEE Intl. Conf. on Computer Vision, 2023. 2, 6, 8

work page 2023
[22]

Think twice before driv- ing: Towards scalable decoders for end-to-end autonomous driving

Xiaosong Jia, Penghao Wu, Li Chen, Jiangwei Xie, Conghui He, Junchi Yan, and Hongyang Li. Think twice before driv- ing: Towards scalable decoders for end-to-end autonomous driving. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, 2023. 2, 6, 8

work page 2023
[23]

Bench2drive: Towards multi-ability benchmark- ing of closed-loop end-to-end autonomous driving

Xiaosong Jia, Zhenjie Yang, Qifeng Li, Zhiyuan Zhang, and Junchi Yan. Bench2drive: Towards multi-ability benchmark- ing of closed-loop end-to-end autonomous driving. Proc. of Advances in Neural Information Processing Systems , 2024. 1, 5, 9, 2

work page 2024
[24]

Drivetransformer: Unified transformer for scalable end-to- end autonomous driving

Xiaosong Jia, Junqi You, Zhiyuan Zhang, and Junchi Yan. Drivetransformer: Unified transformer for scalable end-to- end autonomous driving. In Proc. of Intl. Conf. on Learning Representations, 2025. 2, 6, 8

work page 2025
[25]

Vad: Vectorized scene representation for efficient autonomous driving

Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. Vad: Vectorized scene representation for efficient autonomous driving. InProc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition , pages 8340– 8350, 2023. 1, 2, 3, 5, 6, 7

work page 2023
[26]

Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving

Bo Jiang, Shaoyu Chen, Bencheng Liao, Xingyu Zhang, Wei Yin, Qian Zhang, Chang Huang, Wenyu Liu, and Xing- gang Wang. Senna: Bridging large vision-language mod- els and end-to-end autonomous driving. arXiv preprint arXiv:2410.22313, 2024. 2, 3, 7

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

Far3d: Expanding the horizon for surround-view 3d object detec- tion

Xiaohui Jiang, Shuailin Li, Yingfei Liu, Shihao Wang, Fan Jia, Tiancai Wang, Lijin Han, and Xiangyu Zhang. Far3d: Expanding the horizon for surround-view 3d object detec- tion. In Proc. of the AAAI Conf. on Artificial Intelligence , pages 2561–2569, 2024. 1

work page 2024
[28]

Auto-Encoding Variational Bayes

Diederik P Kingma. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013. 2, 4, 5

work page internal anchor Pith review Pith/arXiv arXiv 2013
[29]

Think2drive: Efficient reinforcement learning by thinking with latent world model for autonomous driving (in carla- v2)

Qifeng Li, Xiaosong Jia, Shaobo Wang, and Junchi Yan. Think2drive: Efficient reinforcement learning by thinking with latent world model for autonomous driving (in carla- v2). In Proc. of European Conference on Computer Vision, pages 142–158, 2024. 6

work page 2024
[30]

Mon- key: Image resolution and text label are important things for large multi-modal models

Zhang Li, Biao Yang, Qiang Liu, Zhiyin Ma, Shuo Zhang, Jingxu Yang, Yabo Sun, Yuliang Liu, and Xiang Bai. Mon- key: Image resolution and text label are important things for large multi-modal models. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, pages 26763– 26773, 2024. 2

work page 2024
[31]

Is ego status all you need for open- loop end-to-end autonomous driving? In Proc

Zhiqi Li, Zhiding Yu, Shiyi Lan, Jiahan Li, Jan Kautz, Tong Lu, and Jose M Alvarez. Is ego status all you need for open- loop end-to-end autonomous driving? In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition , pages 14864–14873, 2024. 2

work page 2024
[32]

Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving

Bencheng Liao, Shaoyu Chen, Haoran Yin, Bo Jiang, Cheng Wang, Sixu Yan, Xinbang Zhang, Xiangyu Li, Ying Zhang, Qian Zhang, et al. Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, 2025. 2

work page 2025
[33]

Rouge: A package for automatic evalua- tion of summaries

Chin-Yew Lin. Rouge: A package for automatic evalua- tion of summaries. In Proc. Annual Meeting of the Associa- tion for Computational Linguistics Workshop, pages 74–81,

work page
[34]

Focal loss for dense object detection

Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Doll´ar. Focal loss for dense object detection. In Porc. of IEEE Intl. Conf. on Computer Vision , pages 2980–2988,

work page
[35]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In Proc. of Advances in Neural Information Processing Systems, 2023. 2

work page 2023
[36]

Llava-next: Im- proved reasoning, ocr, and world knowledge, 2024

Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Im- proved reasoning, ocr, and world knowledge, 2024. 2

work page 2024
[37]

Petr: Position embedding transformation for multi-view 3d object detection

Yingfei Liu, Tiancai Wang, Xiangyu Zhang, and Jian Sun. Petr: Position embedding transformation for multi-view 3d object detection. In Proc. of European Conference on Com- puter Vision, pages 531–548, 2022. 3 10

work page 2022
[38]

Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue Huang, Hanchi Sun, Jian- feng Gao, et al. Sora: A review on background, technology, limitations, and opportunities of large vision models. arXiv preprint arXiv:2402.17177, 2024. 2, 4

work page internal anchor Pith review Pith/arXiv arXiv 2024
[39]

DeepSeek-VL: Towards Real-World Vision-Language Understanding

Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, et al. Deepseek-vl: towards real-world vision- language understanding. arXiv preprint arXiv:2403.05525,

work page internal anchor Pith review Pith/arXiv arXiv
[40]

Continuously learning, adapting, and im- proving: A dual-process approach to autonomous driving

Jianbiao Mei, Yukai Ma, Xuemeng Yang, Licheng Wen, Xinyu Cai, Xin Li, Daocheng Fu, Bo Zhang, Pinlong Cai, Min Dou, et al. Continuously learning, adapting, and im- proving: A dual-process approach to autonomous driving. In Proc. of Advances in Neural Information Processing Sys- tems, 2024. 2

work page 2024
[41]

Bleu: a method for automatic evaluation of machine translation

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proc. Annual Meeting of the Association for Computational Linguistics, pages 311–318, 2002. 5

work page 2002
[42]

Math- bert: A pre-trained model for mathematical formula under- standing

Shuai Peng, Ke Yuan, Liangcai Gao, and Zhi Tang. Math- bert: A pre-trained model for mathematical formula under- standing. arXiv preprint arXiv:2105.00377, 2021. 2, 7

work page arXiv 2021
[43]

Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d

Jonah Philion and Sanja Fidler. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. In Proc. of European Conference on Computer Vision, pages 194–210, 2020. 1

work page 2020
[44]

Multi- modal fusion transformer for end-to-end autonomous driv- ing

Aditya Prakash, Kashyap Chitta, and Andreas Geiger. Multi- modal fusion transformer for end-to-end autonomous driv- ing. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, pages 7077–7087, 2021. 1

work page 2021
[45]

Learn- ing transferable visual models from natural language super- vision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language super- vision. In Proc. of Intl. Conf. on Machine Learning , pages 8748–8763, 2021. 2

work page 2021
[46]

Carllava: Vision language models for camera-only closed-loop driving

Katrin Renz, Long Chen, Ana-Maria Marcu, Jan H¨unermann, Benoit Hanotte, Alice Karnsund, Jamie Shotton, Elahe Arani, and Oleg Sinavski. Carllava: Vision language models for camera-only closed-loop driving. arXiv preprint arXiv:2406.10165, 2024. 7

work page arXiv 2024
[47]

High-resolution image syn- thesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj¨orn Ommer. High-resolution image syn- thesis with latent diffusion models. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition , pages 10684–10695, 2022. 2, 4

work page 2022
[48]

U- net: Convolutional networks for biomedical image segmen- tation

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U- net: Convolutional networks for biomedical image segmen- tation. In Proc. of Intl. Conf. on Medical Image Computing and Computer Assisted Intervention , pages 234–241, 2015. 2, 4

work page 2015
[49]

Lmdrive: Closed-loop end-to-end driving with large language models

Hao Shao, Yuxuan Hu, Letian Wang, Guanglu Song, Steven L Waslander, Yu Liu, and Hongsheng Li. Lmdrive: Closed-loop end-to-end driving with large language models. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, pages 15120–15130, 2024. 1, 3

work page 2024
[50]

Motion transformer with global intention localization and lo- cal movement refinement

Shaoshuai Shi, Li Jiang, Dengxin Dai, and Bernt Schiele. Motion transformer with global intention localization and lo- cal movement refinement. Proc. of Advances in Neural In- formation Processing Systems, 35:6531–6543, 2022. 1

work page 2022
[51]

Moviechat: From dense token to sparse memory for long video understanding

Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, et al. Moviechat: From dense token to sparse memory for long video understanding. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recogni- tion, pages 18221–18232, 2024. 4

work page 2024
[52]

Don’t shake the wheel: Momentum- aware planning in end-to-end autonomous driving

Ziying Song, Caiyan Jia, Lin Liu, Hongyu Pan, Yongchang Zhang, Junming Wang, Xingyu Zhang, Shaoqing Xu, Lei Yang, and Yadan Luo. Don’t shake the wheel: Momentum- aware planning in end-to-end autonomous driving. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recog- nition, 2025. 6

work page 2025
[53]

Scalability in percep- tion for autonomous driving: Waymo open dataset

Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in percep- tion for autonomous driving: Waymo open dataset. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recog- nition, pages 2446–2454, 2020. 1

work page 2020
[54]

DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Yang Wang, Zhiyong Zhao, Kun Zhan, Peng Jia, Xianpeng Lang, and Hang Zhao. Drivevlm: The convergence of autonomous driving and large vision-language models. arXiv preprint arXiv:2402.12289, 2024. 3, 2

work page internal anchor Pith review Pith/arXiv arXiv 2024
[55]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[56]

Cider: Consensus-based image description evalua- tion

Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evalua- tion. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, pages 4566–4575, 2015. 5

work page 2015
[57]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024. 1, 2, 3, 4

work page internal anchor Pith review Pith/arXiv arXiv 2024
[58]

Exploring object-centric temporal modeling for efficient multi-view 3d object detection

Shihao Wang, Yingfei Liu, Tiancai Wang, Ying Li, and Xi- angyu Zhang. Exploring object-centric temporal modeling for efficient multi-view 3d object detection. InPorc. of IEEE Intl. Conf. on Computer Vision, pages 3621–3631, 2023. 3, 4

work page 2023
[59]

Omnidrive: A holistic llm-agent framework for autonomous driving with 3d perception, reasoning and planning

Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, and Jose M Alvarez. Omnidrive: A holistic llm-agent framework for autonomous driving with 3d perception, reasoning and planning. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recog- nition, 2024. 1, 2, 3, 5, 7

work page 2024
[60]

Drivemlm: Aligning multi-modal large language models with behavioral planning states for au- tonomous driving

Wenhai Wang, Jiangwei Xie, ChuanYang Hu, Haoming Zou, Jianan Fan, Wenwen Tong, Yang Wen, Silei Wu, Hanming Deng, Zhiqi Li, et al. Drivemlm: Aligning multi-modal large language models with behavioral planning states for au- tonomous driving. arXiv preprint arXiv:2312.09245, 2023. 1, 3 11

work page arXiv 2023
[61]

Trajectory-guided control prediction for end-to-end autonomous driving: A simple yet strong base- line

Penghao Wu, Xiaosong Jia, Li Chen, Junchi Yan, Hongyang Li, and Yu Qiao. Trajectory-guided control prediction for end-to-end autonomous driving: A simple yet strong base- line. In Proc. of Advances in Neural Information Processing Systems, 2022. 2, 6, 8

work page 2022
[62]

Openemma: Open-source multimodal model for end-to-end autonomous driving

Shuo Xing, Chengyuan Qian, Yuping Wang, Hongyuan Hua, Kexin Tian, Yang Zhou, and Zhengzhong Tu. Openemma: Open-source multimodal model for end-to-end autonomous driving. In Proc. of IEEE Winter Conf. on Applications of Computer Vision, pages 1001–1009, 2025. 2, 3

work page 2025
[63]

Drivegpt4: Interpretable end-to-end autonomous driving via large language model

Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kwan-Yee K Wong, Zhenguo Li, and Hengshuang Zhao. Drivegpt4: Interpretable end-to-end autonomous driving via large language model. IEEE Robotics and Automation Let- ters, 2024. 1

work page 2024
[64]

Rethinking the Open-Loop Evaluation of End-to-End Autonomous Driving in nuScenes

Jiang-Tian Zhai, Ze Feng, Jinhao Du, Yongqiang Mao, Jiang-Jiang Liu, Zichang Tan, Yifu Zhang, Xiaoqing Ye, and Jingdong Wang. Rethinking the open-loop evaluation of end-to-end autonomous driving in nuscenes. arXiv preprint arXiv:2305.10430, 2023. 2, 6

work page internal anchor Pith review arXiv 2023
[65]

Sigmoid loss for language image pre-training

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Porc. of IEEE Intl. Conf. on Computer Vision , pages 11975–11986, 2023. 2

work page 2023
[66]

Fully sparse transformer 3-d detector for lidar point cloud

Diankun Zhang, Zhijie Zheng, Haoyu Niu, Xueqing Wang, and Xiaojun Liu. Fully sparse transformer 3-d detector for lidar point cloud. IEEE Transactions on Geoscience and Re- mote Sensing, 61:1–12, 2023. 1

work page 2023
[67]

Sparsead: Sparse query-centric paradigm for efficient end-to-end autonomous driving

Diankun Zhang, Guoan Wang, Runwen Zhu, Jianbo Zhao, Xiwu Chen, Siyu Zhang, Jiahao Gong, Qibin Zhou, Wenyuan Zhang, Ningzi Wang, et al. Sparsead: Sparse query-centric paradigm for efficient end-to-end autonomous driving. arXiv preprint arXiv:2404.06892, 2024. 1

work page arXiv 2024
[68]

End-to-end urban driving by imitating a reinforcement learning coach

Zhejun Zhang, Alexander Liniger, Dengxin Dai, Fisher Yu, and Luc Van Gool. End-to-end urban driving by imitating a reinforcement learning coach. In Porc. of IEEE Intl. Conf. on Computer Vision, 2021. 2

work page 2021
[69]

Judging llm-as-a-judge with mt-bench and chatbot arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Proc. of Advances in Neural In- formation Processing Systems, 36:46595–46623, 2023. 5

work page 2023
[70]

Genad: Generative end-to-end au- tonomous driving

Wenzhao Zheng, Ruiqi Song, Xianda Guo, Chenming Zhang, and Long Chen. Genad: Generative end-to-end au- tonomous driving. InProc. of European Conference on Com- puter Vision, pages 87–104, 2024. 1, 2, 5, 6

work page 2024
[71]

Hidden biases of end- to-end driving datasets

Julian Zimmerlin, Jens Beißwenger, Bernhard Jaeger, An- dreas Geiger, and Kashyap Chitta. Hidden biases of end- to-end driving datasets. arXiv preprint arXiv:2412.09602 ,

work page arXiv
[72]

description

8 12 ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation Supplementary Material We provide supplementary material to complement the main paper, arranged as follows: • Appendix A: Details on the Chat-B2D dataset. • Appendix B: Traning Details. • Appendix C: More results. A. Details on the Chat-B2D datas...

work page
[73]

What can you tell about the current driving conditions from the images?

work page
[74]

What can be observed in the panoramic images provided?

work page
[75]

Can you provide a summary of the current driving scenario based on the input images?

work page
[76]

What can you observe from the provided images regarding the driving conditions?

work page
[77]

Please describe the current driving conditions based on the images provided

work page
[78]

Can you describe the current weather conditions and the general environment depicted in the images?

work page
[79]

Please describe the current driving conditions based on the input images

work page
[80]

Could you summarize the current driving conditions based on the input images?

work page

Showing first 80 references.