Orion-Lite: Distilling LLM Reasoning into Efficient Vision-Only Driving Models
Pith reviewed 2026-05-10 16:58 UTC · model grok-4.3
The pith
An efficient vision-only model distilled from a large VLA teacher surpasses the teacher on complex driving tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Through a combination of latent feature distillation and ground-truth trajectory supervision, the efficient vision-only student model Orion-Lite surpasses the performance of its massive VLA teacher ORION on the Bench2Drive benchmark, reaching a Driving Score of 80.6. This outcome reveals that vision-only architectures still possess significant, untapped potential for high-performance reactive planning in autonomous driving.
What carries the argument
Latent feature distillation from the teacher's internal representations together with ground-truth trajectory supervision, which transfers interactive reasoning capabilities to vision-only inputs.
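In loss terms, such a recipe is commonly a weighted sum of the two signals. A minimal sketch, assuming a mean-squared-error alignment metric and a single weighting coefficient `lam` (neither detail is specified in this review):

```python
def distillation_objective(student_feat, teacher_feat, pred_traj, gt_traj, lam=1.0):
    """Combined training signal: latent feature alignment plus ground-truth
    trajectory supervision.

    The MSE alignment metric and the weight `lam` are illustrative
    assumptions; the paper's exact formulation is not given here.
    """
    def mse(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

    # Pull the student's latent features toward the frozen teacher's features.
    feat_loss = mse(student_feat, teacher_feat)
    # Imitate the ground-truth (expert) trajectory directly.
    traj_loss = mse(pred_traj, gt_traj)
    return lam * feat_loss + traj_loss
```

In this form the teacher is needed only at training time; at deployment the student runs on vision inputs alone, which is the efficiency argument the paper makes.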
If this is right
- Vision-only models become viable for state-of-the-art performance in closed-loop autonomous driving without language processing at inference.
- Computational demands for advanced driving systems drop while retaining the ability to manage rare and complex cases.
- Closed-loop evaluations expose capabilities in distilled models that simpler open-loop tests miss.
- Knowledge transfer techniques can scale high-level reasoning into deployable, energy-efficient planners.
Where Pith is reading between the lines
- The same distillation pattern could extend to other planning domains where large models supply reasoning but runtime efficiency matters.
- Explicit language inputs may prove unnecessary at deployment time once reasoning patterns are internalized through vision features.
- Further compression of the student model or alternative supervision signals could push performance gains even higher.
Load-bearing premise
That the distillation process successfully moves complex interactive reasoning from language-augmented inputs to pure vision inputs without critical loss in closed-loop performance.
What would settle it
Orion-Lite underperforming its teacher ORION when tested on a new collection of interactive scenarios absent from the Bench2Drive set.
The original abstract
Leveraging the general world knowledge of Large Language Models (LLMs) holds significant promise for improving the ability of autonomous driving systems to handle rare and complex scenarios. While integrating LLMs into Vision-Language-Action (VLA) models has yielded state-of-the-art performance, their massive parameter counts pose severe challenges for latency-sensitive and energy-efficient deployment. Distilling LLM knowledge into a compact driving model offers a compelling solution to retain these reasoning capabilities while maintaining a manageable computational footprint. Although previous works have demonstrated the efficacy of distillation, these efforts have primarily focused on relatively simple scenarios and open-loop evaluations. Therefore, in this work, we investigate LLM distillation in more complex, interactive scenarios under closed-loop evaluation. We demonstrate that through a combination of latent feature distillation and ground-truth trajectory supervision, an efficient vision-only student model Orion-Lite can even surpass the performance of its massive VLA teacher, ORION, setting a new state-of-the-art on the rigorous Bench2Drive benchmark with a Driving Score of 80.6. Ultimately, this reveals that vision-only architectures still possess significant, untapped potential for high-performance reactive planning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Orion-Lite, a compact vision-only driving model distilled from a large VLA teacher (ORION) via latent feature distillation combined with ground-truth trajectory supervision. It claims that this recipe enables the student to surpass the teacher's closed-loop performance on the Bench2Drive benchmark, achieving a new state-of-the-art Driving Score of 80.6 and demonstrating that vision-only architectures retain significant untapped potential for complex interactive planning.
Significance. If the central result holds after addressing the noted gaps, the work would be significant for autonomous driving and model compression: it provides empirical evidence that LLM-derived reasoning can be effectively transferred to efficient vision-only students in closed-loop settings, potentially enabling high-performance deployment without VLA-scale compute. The focus on rigorous closed-loop evaluation on Bench2Drive is a strength relative to prior open-loop distillation studies.
major comments (2)
- [Experiments / Results] Experiments / Results section (and abstract): The headline claim that Orion-Lite (DS=80.6) surpasses ORION is presented as evidence of successful transfer of complex interactive reasoning via latent distillation. However, training combines two signals—latent feature distillation and direct ground-truth trajectory supervision—with no ablation isolating the distillation component (e.g., a vision-only baseline trained only on trajectory imitation). This is load-bearing for the central claim, as the gain could arise from more faithful expert imitation rather than acquired world-knowledge reasoning.
- [Method] Method section: The description of the latent feature distillation objective lacks explicit equations or loss formulations (e.g., no definition of the feature alignment term or weighting between distillation and trajectory losses). Without these, it is difficult to assess how the transfer of LLM reasoning is implemented or to reproduce the exact recipe.
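The requested ablation can be stated concretely: if the training objective is a weighted sum of the two signals (an assumed form; the paper's actual formulation is not provided), the imitation-only baseline is simply the zero-weight setting.

```python
def training_loss(student_feat, teacher_feat, pred_traj, gt_traj, lam):
    """Weighted sum of a feature-distillation term and trajectory imitation.

    `lam` is a hypothetical weighting coefficient: setting lam = 0 recovers
    the vision-only, imitation-only baseline whose absence these comments flag.
    """
    def mse(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    return lam * mse(student_feat, teacher_feat) + mse(pred_traj, gt_traj)

# The ablation: train one student per setting, then compare their
# closed-loop Driving Scores on Bench2Drive.
def full_recipe(*batch):
    return training_loss(*batch, lam=1.0)   # distillation + imitation

def baseline(*batch):
    return training_loss(*batch, lam=0.0)   # imitation only
```

If the baseline matches the full recipe in closed loop, the gain is attributable to better expert imitation rather than transferred reasoning; a clear gap would support the central claim.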
minor comments (2)
- [Abstract] Abstract and introduction: The manuscript states the outcome and high-level method but supplies no equations, ablation details, error bars, or dataset statistics, making the central claim hard to verify from the provided text alone.
- [Results] The paper would benefit from reporting standard deviations or multiple random seeds for the Driving Score to establish statistical reliability of the 80.6 result.
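Seed-level reporting of this kind is inexpensive to compute; a sketch with purely illustrative scores (not results from the paper):

```python
import statistics

# Hypothetical Driving Scores from repeated closed-loop evaluations with
# different random seeds (illustrative numbers, not results from the paper).
scores = [80.6, 79.8, 81.1, 80.2, 80.9]

mean = statistics.mean(scores)
std = statistics.stdev(scores)  # sample standard deviation
print(f"Driving Score: {mean:.2f} ± {std:.2f} over {len(scores)} seeds")
```

Even a handful of seeds would indicate whether the margin over the teacher exceeds run-to-run variance.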
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of experimental rigor and methodological clarity that we will address in the revision. Below we respond point-by-point to the major comments.
Point-by-point responses
-
Referee: [Experiments / Results] Experiments / Results section (and abstract): The headline claim that Orion-Lite (DS=80.6) surpasses ORION is presented as evidence of successful transfer of complex interactive reasoning via latent distillation. However, training combines two signals—latent feature distillation and direct ground-truth trajectory supervision—with no ablation isolating the distillation component (e.g., a vision-only baseline trained only on trajectory imitation). This is load-bearing for the central claim, as the gain could arise from more faithful expert imitation rather than acquired world-knowledge reasoning.
Authors: We agree that an ablation isolating the distillation component is necessary to strengthen the central claim. In the revised manuscript we will add a vision-only baseline trained exclusively on ground-truth trajectory imitation (without latent feature distillation) and report its closed-loop performance on Bench2Drive. This will allow direct comparison to Orion-Lite and clarify the incremental benefit attributable to the distillation objective. revision: yes
-
Referee: [Method] Method section: The description of the latent feature distillation objective lacks explicit equations or loss formulations (e.g., no definition of the feature alignment term or weighting between distillation and trajectory losses). Without these, it is difficult to assess how the transfer of LLM reasoning is implemented or to reproduce the exact recipe.
Authors: We acknowledge the need for explicit formulations to ensure reproducibility. In the revised Method section we will provide the full mathematical definition of the latent feature distillation loss, including the specific feature alignment metric (e.g., L2 or cosine similarity on selected layers), the combined objective with the trajectory supervision term, and the weighting coefficients used during training. revision: yes
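Taking the cosine option from the rebuttal's own examples, one plausible shape for the promised alignment loss is sketched below; the layer selection and the uniform per-layer averaging are assumptions, not the authors' actual recipe.

```python
import math

def cosine_alignment_loss(student_feats, teacher_feats, eps=1e-8):
    """1 - cosine similarity, averaged over a list of selected layers.

    The cosine metric follows the rebuttal's example ("L2 or cosine
    similarity on selected layers"); which layers are selected and how
    they are weighted are unspecified assumptions here.
    """
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb + eps)

    losses = [1.0 - cos(s, t) for s, t in zip(student_feats, teacher_feats)]
    return sum(losses) / len(losses)
```

Publishing the exact metric, layer set, and loss weights, as the authors promise, would make the recipe reproducible.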
Circularity Check
No derivational circularity; purely empirical claims
Full rationale
The paper reports an empirical distillation experiment on Bench2Drive without any equations, derivations, fitted parameters presented as predictions, or self-citation chains that reduce the central result to its inputs by construction. Performance numbers (DS=80.6) are direct benchmark outputs, not outputs of a self-referential formula. Absence of ablations on distillation vs. ground-truth supervision is a validity concern, not a circularity issue under the defined patterns.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025.
- [2] Lucas Beyer, Xiaohua Zhai, Amélie Royer, Larisa Markeeva, Rohan Anil, and Alexander Kolesnikov. Knowledge distillation: A good teacher is patient and consistent. In IEEE Conf. Comput. Vis. Pattern Recog., 2022.
- [3] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. In IEEE Conf. Comput. Vis. Pattern Recog., 2020.
- [4] Shaoyu Chen, Bo Jiang, Hao Gao, Bencheng Liao, Qing Xu, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. VADv2: End-to-end vectorized autonomous driving via probabilistic planning. arXiv preprint arXiv:2402.13243, 2024.
- [5] Kashyap Chitta, Aditya Prakash, Bernhard Jaeger, Zehao Yu, Katrin Renz, and Andreas Geiger. TransFuser: Imitation with transformer-based sensor fusion for autonomous driving. IEEE Trans. Pattern Anal. Mach. Intell., 45(11):12878–12895, 2022.
- [6] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. CARLA: An open urban driving simulator. In CoRL, 2017.
- [7] Yuxin Fang, Quan Sun, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. EVA-02: A visual representation for neon genesis. Image and Vision Computing, 149:105171.
- [8] Bowen Feng, Zhiting Mei, Baiang Li, Julian Ost, Filippo Ghilotti, Roger Girgis, Anirudha Majumdar, and Felix Heide. VERDI: VLM-embedded reasoning for autonomous driving. arXiv preprint arXiv:2505.15925, 2025.
- [9] Haoyu Fu, Diankun Zhang, Zongchuang Zhao, Jianfeng Cui, Dingkang Liang, Chong Zhang, Dingyuan Zhang, Hongwei Xie, Bing Wang, and Xiang Bai. ORION: A holistic end-to-end autonomous driving framework by vision-language instructed action generation. arXiv preprint arXiv:2503.19755, 2025.
- [10] Haoyu Fu, Diankun Zhang, Zongchuang Zhao, Jianfeng Cui, Hongwei Xie, Bing Wang, Guang Chen, Dingkang Liang, and Xiang Bai. MindDrive: A vision-language-action model for autonomous driving via online reinforcement learning. arXiv preprint arXiv:2512.13636, 2025.
- [11] Deepti Hegde, Rajeev Yasarla, Hong Cai, Shizhong Han, Apratim Bhattacharyya, Shweta Mahajan, Litian Liu, Risheek Garrepalli, Vishal M Patel, and Fatih Porikli. Distilling multi-modal large language models for autonomous driving. In IEEE Conf. Comput. Vis. Pattern Recog., 2025.
- [12] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
- [13] Shengchao Hu, Li Chen, Penghao Wu, Hongyang Li, Junchi Yan, and Dacheng Tao. ST-P3: End-to-end vision-based autonomous driving via spatial-temporal feature learning. In Eur. Conf. Comput. Vis., 2022.
- [14] Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. In IEEE Conf. Comput. Vis. Pattern Recog., 2023.
- [15] Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, et al. EMMA: End-to-end multimodal model for autonomous driving. arXiv preprint arXiv:2410.23262, 2024.
- [16] Xiaosong Jia, Yulu Gao, Li Chen, Junchi Yan, Patrick Langechuan Liu, and Hongyang Li. DriveAdapter: Breaking the coupling barrier of perception and planning in end-to-end autonomous driving. In Int. Conf. Comput. Vis., 2023.
- [17] Xiaosong Jia, Penghao Wu, Li Chen, Jiangwei Xie, Conghui He, Junchi Yan, and Hongyang Li. Think twice before driving: Towards scalable decoders for end-to-end autonomous driving. In IEEE Conf. Comput. Vis. Pattern Recog., 2023.
- [18] Xiaosong Jia, Zhenjie Yang, Qifeng Li, Zhiyuan Zhang, and Junchi Yan. Bench2Drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving. In Adv. Neural Inform. Process. Syst., 2024.
- [19] Xiaosong Jia, Junqi You, Zhiyuan Zhang, and Junchi Yan. DriveTransformer: Unified transformer for scalable end-to-end autonomous driving. arXiv preprint arXiv:2503.07656, 2025.
- [20] Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. VAD: Vectorized scene representation for efficient autonomous driving. In Int. Conf. Comput. Vis., 2023.
- [21] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
- [22] Kailin Li, Zhenxin Li, Shiyi Lan, Yuan Xie, Zhizhong Zhang, Jiayi Liu, Zuxuan Wu, Zhiding Yu, and Jose M Alvarez. Hydra-MDP++: Advancing end-to-end driving via expert-guided hydra-distillation. arXiv preprint arXiv:2503.12820, 2025.
- [23] Yingyan Li, Lue Fan, Jiawei He, Yuqi Wang, Yuntao Chen, Zhaoxiang Zhang, and Tieniu Tan. Enhancing end-to-end autonomous driving with latent world model. arXiv preprint arXiv:2406.08481, 2024.
- [24] Yingyan Li, Yuqi Wang, Yang Liu, Jiawei He, Lue Fan, and Zhaoxiang Zhang. End-to-end driving with online trajectory evaluation via BEV world model. arXiv preprint arXiv:2504.01941, 2025.
- [25] Zhenxin Li, Kailin Li, Shihao Wang, Shiyi Lan, Zhiding Yu, Yishen Ji, Zhiqi Li, Ziyue Zhu, Jan Kautz, Zuxuan Wu, et al. Hydra-MDP: End-to-end multimodal planning with multi-target hydra-distillation. arXiv preprint arXiv:2406.06978, 2024.
- [26] Bencheng Liao, Shaoyu Chen, Haoran Yin, Bo Jiang, Cheng Wang, Sixu Yan, Xinbang Zhang, Xiangyu Li, Ying Zhang, Qian Zhang, et al. DiffusionDrive: Truncated diffusion model for end-to-end autonomous driving. In IEEE Conf. Comput. Vis. Pattern Recog., 2025.
- [27] Arsalan Mousavian, Dragomir Anguelov, John Flynn, and Jana Kosecka. 3D bounding box estimation using deep learning and geometry. In IEEE Conf. Comput. Vis. Pattern Recog., 2017.
- [28] Aditya Prakash, Kashyap Chitta, and Andreas Geiger. Multi-modal fusion transformer for end-to-end autonomous driving. In IEEE Conf. Comput. Vis. Pattern Recog., 2021.
- [29] Katrin Renz, Long Chen, Elahe Arani, and Oleg Sinavski. SimLingo: Vision-only closed-loop autonomous driving with language-action alignment. In IEEE Conf. Comput. Vis. Pattern Recog., 2025.
- [30] Hao Shao, Yuxuan Hu, Letian Wang, Guanglu Song, Steven L Waslander, Yu Liu, and Hongsheng Li. LMDrive: Closed-loop end-to-end driving with large language models. In IEEE Conf. Comput. Vis. Pattern Recog., 2024.
- [31] Ziying Song, Caiyan Jia, Lin Liu, Hongyu Pan, Yongchang Zhang, Junming Wang, Xingyu Zhang, Shaoqing Xu, Lei Yang, and Yadan Luo. Don't shake the wheel: Momentum-aware planning in end-to-end autonomous driving. In IEEE Conf. Comput. Vis. Pattern Recog., 2025.
- [32] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
- [33] Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Yang Wang, Zhiyong Zhao, Kun Zhan, Peng Jia, Xianpeng Lang, and Hang Zhao. DriveVLM: The convergence of autonomous driving and large vision-language models. arXiv preprint arXiv:2402.12289, 2024.
- [34] Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, and Jose M Alvarez. OmniDrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning. In IEEE Conf. Comput. Vis. Pattern Recog., 2025.
- [35] Yan Wang, Wenjie Luo, Junjie Bai, Yulong Cao, Tong Che, Ke Chen, Yuxiao Chen, Jenna Diamond, Yifan Ding, Wenhao Ding, et al. Alpamayo-R1: Bridging reasoning and action prediction for generalizable autonomous driving in the long tail. arXiv preprint arXiv:2511.00088, 2025.
- [36] Penghao Wu, Xiaosong Jia, Li Chen, Junchi Yan, Hongyang Li, and Yu Qiao. Trajectory-guided control prediction for end-to-end autonomous driving: A simple yet strong baseline. In Adv. Neural Inform. Process. Syst., 2022.
- [37] Shuo Xing, Chengyuan Qian, Yuping Wang, Hongyuan Hua, Kexin Tian, Yang Zhou, and Zhengzhong Tu. OpenEMMA: Open-source multimodal model for end-to-end autonomous driving. In WACV, 2025.
- [38] Zhexiao Xiong, Xin Ye, Burhan Yaman, Sheng Cheng, Yiren Lu, Jingru Luo, Nathan Jacobs, and Liu Ren. UniDrive-WM: Unified understanding, planning and generation world model for autonomous driving. arXiv preprint arXiv:2601.04453, 2026.
- [39] Yi Xu, Yuxin Hu, Zaiwei Zhang, Gregory P Meyer, Siva Karthik Mustikovela, Siddhartha Srinivasa, Eric M Wolff, and Xin Huang. VLM-AD: End-to-end autonomous driving through vision-language model supervision. arXiv preprint arXiv:2412.14446, 2024.
- [40] Jiang-Tian Zhai, Ze Feng, Jinhao Du, Yongqiang Mao, Jiang-Jiang Liu, Zichang Tan, Yifu Zhang, Xiaoqing Ye, and Jingdong Wang. Rethinking the open-loop evaluation of end-to-end autonomous driving in nuScenes. arXiv preprint arXiv:2305.10430, 2023.
- [41] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Adv. Neural Inform. Process. Syst., 2023.
- [42] Wenzhao Zheng, Ruiqi Song, Xianda Guo, Chenming Zhang, and Long Chen. GenAD: Generative end-to-end autonomous driving. In Eur. Conf. Comput. Vis., 2024.
- [43] Hongyu Zhou, Longzhong Lin, Jiabao Wang, Yichong Lu, Dongfeng Bai, Bingbing Liu, Yue Wang, Andreas Geiger, and Yiyi Liao. Hugsim: A real-time, photo-realistic and closed-loop simulator for autonomous driving. IEEE Trans. Pattern Anal. Mach. Intell., 2025.