pith. machine review for the scientific record.

arxiv: 2604.08266 · v1 · submitted 2026-04-09 · 💻 cs.CV

Recognition: unknown

Orion-Lite: Distilling LLM Reasoning into Efficient Vision-Only Driving Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:58 UTC · model grok-4.3

classification 💻 cs.CV
keywords LLM distillation · vision-only driving · autonomous driving · closed-loop evaluation · VLA models · reactive planning · Bench2Drive

The pith

An efficient vision-only model distilled from a large VLA teacher surpasses the teacher on complex driving tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that distilling reasoning from a massive vision-language-action model into a compact vision-only student model enables the smaller model to outperform its teacher in interactive, closed-loop driving scenarios. This occurs through latent feature distillation paired with direct supervision on ground-truth trajectories. If accurate, the result indicates that advanced planning capabilities can be retained in lightweight architectures suitable for real-time deployment. The approach sets a new high on the Bench2Drive benchmark with a Driving Score of 80.6 and points to untapped capacity in vision-only systems for handling rare situations.

Core claim

Through a combination of latent feature distillation and ground-truth trajectory supervision, the efficient vision-only student model Orion-Lite surpasses the performance of its massive VLA teacher ORION on the Bench2Drive benchmark, reaching a Driving Score of 80.6. This outcome reveals that vision-only architectures still possess significant, untapped potential for high-performance reactive planning in autonomous driving.

What carries the argument

Latent feature distillation from the teacher's internal representations together with ground-truth trajectory supervision, which transfers interactive reasoning capabilities to vision-only inputs.
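A minimal sketch of what such a joint objective might look like, written in PyTorch purely as an illustration: the `proj` head, the L1 waypoint term, the MSE alignment term, and the `distill_weight` coefficient are all assumptions, since the paper (as the referee report below notes) does not spell out its actual recipe.

```python
import torch
import torch.nn.functional as F

def joint_distillation_loss(student_feats, teacher_feats, pred_traj, gt_traj,
                            proj, distill_weight=1.0):
    """Hypothetical combined objective: trajectory imitation + latent feature distillation.

    student_feats: (B, N, D_s) intermediate features from the vision-only student
    teacher_feats: (B, N, D_t) frozen latents from the VLA teacher (no gradient)
    pred_traj, gt_traj: (B, T, 2) predicted / ground-truth future waypoints
    proj: small learned head mapping student features into the teacher's latent space
    """
    # Direct supervision on ground-truth trajectories (expert imitation).
    traj_loss = F.l1_loss(pred_traj, gt_traj)

    # Latent feature distillation: align projected student features with teacher latents.
    # Cosine or L2 alignment are both plausible; L2 (MSE) is used here for illustration.
    distill_loss = F.mse_loss(proj(student_feats), teacher_feats.detach())

    # Setting distill_weight to 0 recovers the trajectory-imitation-only baseline
    # that the referee report asks for as an ablation.
    return traj_loss + distill_weight * distill_loss
```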

If this is right

  • Vision-only models become viable for state-of-the-art performance in closed-loop autonomous driving without language processing at inference.
  • Computational demands for advanced driving systems drop while retaining the ability to manage rare and complex cases.
  • Closed-loop evaluations expose capabilities in distilled models that simpler open-loop tests miss.
  • Knowledge transfer techniques can scale high-level reasoning into deployable, energy-efficient planners.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same distillation pattern could extend to other planning domains where large models supply reasoning but runtime efficiency matters.
  • Explicit language inputs may prove unnecessary at deployment time once reasoning patterns are internalized through vision features.
  • Further compression of the student model or alternative supervision signals could push performance gains even higher.

Load-bearing premise

That the distillation process successfully moves complex interactive reasoning from language-augmented inputs to pure vision inputs without critical loss in closed-loop performance.

What would settle it

Orion-Lite underperforming its teacher ORION when tested on a new collection of interactive scenarios absent from the Bench2Drive set.

Figures

Figures reproduced from arXiv: 2604.08266 by Gijs Dubbelman, Jing Gu, Niccolò Cavagnero.

Figure 1: Overview of the proposed distillation framework. A joint distillation and trajectory supervision strategy (top) yields a student model, Orion-Lite, that is 3× faster than its teacher, establishing a new state-of-the-art on the closed-loop Bench2Drive benchmark (bottom).
Figure 2: Latency and Driving Score comparison. Our distilled framework demonstrates a massive reduction in inference latency compared to the teacher model while improving the overall Driving Score. Latency is measured as the averaged inference step-time on CARLA, evaluated on an A6000 GPU.
Figure 3: Qualitative comparison in interactive scenarios. Top rows: rollouts from the ORION teacher model. Bottom rows: rollouts from our student model. Sequences are visualized at 5-frame intervals, with overlaid points indicating predicted future trajectories. Green checks (✓) denote successful, intervention-free maneuvers, while red crosses (×) indicate task failures.
Figure 4: Impact of decoder depth. Driving Score and mean Multi-ability Score across varying numbers of transformer decoder layers.
read the original abstract

Leveraging the general world knowledge of Large Language Models (LLMs) holds significant promise for improving the ability of autonomous driving systems to handle rare and complex scenarios. While integrating LLMs into Vision-Language-Action (VLA) models has yielded state-of-the-art performance, their massive parameter counts pose severe challenges for latency-sensitive and energy-efficient deployment. Distilling LLM knowledge into a compact driving model offers a compelling solution to retain these reasoning capabilities while maintaining a manageable computational footprint. Although previous works have demonstrated the efficacy of distillation, these efforts have primarily focused on relatively simple scenarios and open-loop evaluations. Therefore, in this work, we investigate LLM distillation in more complex, interactive scenarios under closed-loop evaluation. We demonstrate that through a combination of latent feature distillation and ground-truth trajectory supervision, an efficient vision-only student model, Orion-Lite, can even surpass the performance of its massive VLA teacher, ORION, setting a new state-of-the-art on the rigorous Bench2Drive benchmark with a Driving Score of 80.6. Ultimately, this reveals that vision-only architectures still possess significant, untapped potential for high-performance reactive planning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Orion-Lite, a compact vision-only driving model distilled from a large VLA teacher (ORION) via latent feature distillation combined with ground-truth trajectory supervision. It claims that this recipe enables the student to surpass the teacher's closed-loop performance on the Bench2Drive benchmark, achieving a new state-of-the-art Driving Score of 80.6 and demonstrating that vision-only architectures retain significant untapped potential for complex interactive planning.

Significance. If the central result holds after addressing the noted gaps, the work would be significant for autonomous driving and model compression: it provides empirical evidence that LLM-derived reasoning can be effectively transferred to efficient vision-only students in closed-loop settings, potentially enabling high-performance deployment without VLA-scale compute. The focus on rigorous closed-loop evaluation on Bench2Drive is a strength relative to prior open-loop distillation studies.

major comments (2)
  1. [Experiments / Results] Experiments / Results section (and abstract): The headline claim that Orion-Lite (DS=80.6) surpasses ORION is presented as evidence of successful transfer of complex interactive reasoning via latent distillation. However, training combines two signals—latent feature distillation and direct ground-truth trajectory supervision—with no ablation isolating the distillation component (e.g., a vision-only baseline trained only on trajectory imitation). This is load-bearing for the central claim, as the gain could arise from more faithful expert imitation rather than acquired world-knowledge reasoning.
  2. [Method] Method section: The description of the latent feature distillation objective lacks explicit equations or loss formulations (e.g., no definition of the feature alignment term or weighting between distillation and trajectory losses). Without these, it is difficult to assess how the transfer of LLM reasoning is implemented or to reproduce the exact recipe.
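To make the gap concrete, a generic combined objective of the kind this comment asks for, written purely as an illustration and not as the paper's actual loss (the alignment metric, the projection head g_phi, and the weight lambda are all placeholder assumptions), might read:

```latex
% Illustrative only: one plausible form of a joint trajectory + latent distillation objective.
\mathcal{L}_{\mathrm{total}}
  = \underbrace{\textstyle\sum_{t=1}^{T} \lVert \hat{\tau}_t - \tau_t \rVert_1}_{\text{trajectory supervision}}
  \;+\; \lambda \, \underbrace{\lVert g_{\phi}(F_{\mathrm{student}}) - F_{\mathrm{teacher}} \rVert_2^2}_{\text{latent feature distillation}}
```

Here \hat{\tau}_t and \tau_t denote predicted and ground-truth waypoints, F_student and F_teacher are intermediate features, g_phi is a learned projection head, and \lambda is the weighting coefficient whose value the paper does not report; a cosine-similarity alignment term could equally replace the squared L2 term.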
minor comments (2)
  1. [Abstract] Abstract and introduction: The manuscript states the outcome and high-level method but supplies no equations, ablation details, error bars, or dataset statistics, making the central claim hard to verify from the provided text alone.
  2. [Results] The paper would benefit from reporting standard deviations or multiple random seeds for the Driving Score to establish statistical reliability of the 80.6 result.
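As a small illustration of the reporting the second minor comment asks for, Driving Scores from repeated closed-loop runs with different seeds could be summarized as a mean and sample standard deviation; the values below are placeholders, not results from the paper.

```python
import statistics

# Placeholder Driving Scores from hypothetical repeated runs with different random seeds.
driving_scores = [80.2, 79.7, 81.0, 80.5]  # illustrative values only, not reported data

mean_ds = statistics.mean(driving_scores)
std_ds = statistics.stdev(driving_scores)  # sample standard deviation across seeds

print(f"Driving Score: {mean_ds:.1f} +/- {std_ds:.1f} over {len(driving_scores)} seeds")
```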

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of experimental rigor and methodological clarity that we will address in the revision. Below we respond point-by-point to the major comments.

read point-by-point responses
  1. Referee: [Experiments / Results] Experiments / Results section (and abstract): The headline claim that Orion-Lite (DS=80.6) surpasses ORION is presented as evidence of successful transfer of complex interactive reasoning via latent distillation. However, training combines two signals—latent feature distillation and direct ground-truth trajectory supervision—with no ablation isolating the distillation component (e.g., a vision-only baseline trained only on trajectory imitation). This is load-bearing for the central claim, as the gain could arise from more faithful expert imitation rather than acquired world-knowledge reasoning.

    Authors: We agree that an ablation isolating the distillation component is necessary to strengthen the central claim. In the revised manuscript we will add a vision-only baseline trained exclusively on ground-truth trajectory imitation (without latent feature distillation) and report its closed-loop performance on Bench2Drive. This will allow direct comparison to Orion-Lite and clarify the incremental benefit attributable to the distillation objective. revision: yes

  2. Referee: [Method] Method section: The description of the latent feature distillation objective lacks explicit equations or loss formulations (e.g., no definition of the feature alignment term or weighting between distillation and trajectory losses). Without these, it is difficult to assess how the transfer of LLM reasoning is implemented or to reproduce the exact recipe.

    Authors: We acknowledge the need for explicit formulations to ensure reproducibility. In the revised Method section we will provide the full mathematical definition of the latent feature distillation loss, including the specific feature alignment metric (e.g., L2 or cosine similarity on selected layers), the combined objective with the trajectory supervision term, and the weighting coefficients used during training. revision: yes

Circularity Check

0 steps flagged

No derivational circularity; purely empirical claims

full rationale

The paper reports an empirical distillation experiment on Bench2Drive without any equations, derivations, fitted parameters presented as predictions, or self-citation chains that reduce the central result to its inputs by construction. Performance numbers (DS=80.6) are direct benchmark outputs, not outputs of a self-referential formula. Absence of ablations on distillation vs. ground-truth supervision is a validity concern, not a circularity issue under the defined patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations or explicit assumptions detailed in abstract; no free parameters, axioms, or invented entities can be extracted.

pith-pipeline@v0.9.0 · 5510 in / 1031 out tokens · 43977 ms · 2026-05-10T16:58:20.536167+00:00 · methodology

