Teaching Vision-Language-Action Models What to See and Where to Look

Baochang Zhang; Bo Zhang; Canyu Chen; Chunyang Liu; Juan Zhang; Kehua Sheng; Linlin Yang; Xianbin Cao; Yan Wang; Yizhi Wang

arxiv: 2607.01658 · v1 · pith:WL3J6PMRnew · submitted 2026-07-02 · 💻 cs.CV

Teaching Vision-Language-Action Models What to See and Where to Look

Yuguang Yang , Canyu Chen , Zhewen Tan , Yizhi Wang , Zichao Feng , Chunyang Liu , Kehua Sheng , Juan Zhang

show 5 more authors

Linlin Yang Baochang Zhang Yan Wang Bo Zhang Xianbin Cao

This is my paper

Pith reviewed 2026-07-03 16:40 UTC · model grok-4.3

classification 💻 cs.CV

keywords Vision-Language-Action modelsautonomous drivingtrajectory predictionvision distillationspatial promptsend-to-end drivingNAVSIMnuScenes

0 comments

The pith

DriveTeach-VLA adds driving-specific vision distillation and trajectory prompts to vision-language-action models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to fix a gap in Vision-Language-Action models for autonomous driving, where text-heavy training leaves the models short on spatial understanding needed to predict safe trajectories. It introduces Driving-aware Vision Distillation to embed driving-relevant perceptual knowledge directly into the vision encoder and 2D Trajectory-Guided Prompts to steer attention toward feasible paths. These components create a staged pipeline that first teaches perception, then spatial focus, then action, and the resulting models set new performance records on the NAVSIM and nuScenes benchmarks.

Core claim

DriveTeach-VLA explicitly teaches VLAs what to see and where to look via Driving-aware Vision Distillation that injects driving-specific perceptual priors into the vision encoder together with 2D Trajectory-Guided Prompts that supply spatial conditioning aligned with feasible driving trajectories, forming the pipeline of DVD pretraining followed by TGP-guided supervised fine-tuning and TGP-guided GRPO.

What carries the argument

Driving-aware Vision Distillation (DVD) and 2D Trajectory-Guided Prompts (2D-TGP) that together supply driving priors and trajectory-aligned spatial conditioning to the VLA training process.

If this is right

The vision encoder receives driving-specific perceptual priors before any action learning occurs.
Spatial conditioning is aligned directly with feasible driving trajectories during fine-tuning and reinforcement stages.
The three-stage pipeline separates perception teaching from action learning.
Trajectory prediction reliability improves because the model learns both what to see and where to look.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same separation of perceptual priors from action learning may apply to other embodied tasks that combine vision and control.
Text-centric pretraining alone may prove insufficient for any VLA that must output physical actions rather than language.
Extending the 2D prompts to incorporate depth or multi-camera geometry could further tighten the spatial alignment.

Load-bearing premise

Existing VLAs trained on text-centric data capture semantic knowledge but miss the spatial dependencies required for reliable trajectory prediction.

What would settle it

A VLA trained without DVD pretraining or 2D-TGP guidance that matches or exceeds DriveTeach-VLA performance on NAVSIM and nuScenes would show the added components are not required.

Figures

Figures reproduced from arXiv: 2607.01658 by Baochang Zhang, Bo Zhang, Canyu Chen, Chunyang Liu, Juan Zhang, Kehua Sheng, Linlin Yang, Xianbin Cao, Yan Wang, Yizhi Wang, Yuguang Yang, Zhewen Tan, Zichao Feng.

**Figure 2.** Figure 2: DriveTeach-VLA Architecture. Left: DriveTeach-VLA operates in a dualmodel manner consisting of a TGP-Prompter and a TGP-Planner. The TGP-Prompter first predicts the 2D-TGP, which is then used to condition the TGP-Planner for trajectory generation. Right-Up: The TGP-Prompter is trained via DVD, supervised by critical-object bounding-box–augmented images’ features and ground-truth 2D-TGP. Right-Down: The TG… view at source ↗

**Figure 3.** Figure 3: DriveTeach-VLA schemes. Left: Traffic critical object priors are injected via bbox-augmented image self-distillation. Right: The visualized 2D-TGP is highly related to the driving behavior (turn left), and the 2D-TGP conditions TGP-Planner in text form of a sequence of 2D coordinates, which is interpretable for MLLM. ViT encoders pretrained on natural images, which lack the domain knowledge about what shou… view at source ↗

**Figure 4.** Figure 4: Visualization of the predicted 2D-TGP trajectory, critical objects detected by Grounding Dino of DriveTeach-VLA. feature map into a grid of rows × columns, e.g., 2 × 4 yields K=8 blocks) and compute block-wise alignment loss Ldistill: \bar {v}_k^{t} = \frac {1}{|\mathcal {B}_k|}\sum _{i \in \mathcal {B}_k} v_i^{t}, \quad \bar {v}_k^{s} = \frac {1}{|\mathcal {B}_k|}\sum _{i \in \mathcal {B}_k} v_i^{s}, \qua… view at source ↗

**Figure 5.** Figure 5: PDMS by predicted-vs-GT 2D-TGP L2 error. [PITH_FULL_IMAGE:figures/full_fig_p014_5.png] view at source ↗

**Figure 6.** Figure 6: The critical object to detect using Grounding Dino [PITH_FULL_IMAGE:figures/full_fig_p023_6.png] view at source ↗

**Figure 7.** Figure 7: Prompt for Qwen2.5-VL-72B to pseudo-label CoT reasoning stesps. [PITH_FULL_IMAGE:figures/full_fig_p024_7.png] view at source ↗

**Figure 8.** Figure 8: Language instruction for TGP-Prompter to regress 2D-TGP trajectory [PITH_FULL_IMAGE:figures/full_fig_p024_8.png] view at source ↗

**Figure 9.** Figure 9: Language instruction for TGP-Planner under the guidance of 2D-TGP prompt [PITH_FULL_IMAGE:figures/full_fig_p025_9.png] view at source ↗

read the original abstract

Vision-Language-Action (VLA) models have emerged as a promising paradigm for end-to-end autonomous driving. However, existing VLAs' training relies heavily on text-centric visual question answering and chain-of-thought reasoning data, which emphasizes linguistic reasoning rather than action-grounded planning. As a result, the learned representations capture semantic knowledge but lack spatial dependencies crucial for reliable trajectory prediction. We propose DriveTeach-VLA, a framework that explicitly teaches VLAs what to see and where to look. Driving-aware Vision Distillation (DVD) injects driving-specific perceptual priors into the vision encoder, while 2D Trajectory-Guided Prompts (2D-TGP) provide spatial conditioning aligned with feasible driving trajectories. Together, they form a vision-guided learning pipeline: what to see (DVD pretraining) - where to look (TGP-guided SFT) - how to act (TGP-guided GRPO). DriveTeach-VLA achieves the state-of-the-art performance on NAVSIM and nuScenes. Our code is available at: https://github.com/ShivaTeam/DriveTeach-VLA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DriveTeach-VLA adds DVD pretraining and 2D-TGP spatial prompts to VLA training for driving and claims SOTA on NAVSIM and nuScenes, with code released.

read the letter

DriveTeach-VLA adds DVD pretraining and 2D-TGP spatial prompts to VLA training for driving and claims SOTA on NAVSIM and nuScenes, with code released.

The new element is the three-stage pipeline that first distills driving-specific priors into the vision encoder, then uses trajectory-aligned prompts during supervised fine-tuning, and finally applies the same prompts in GRPO. This directly targets the gap where text-heavy VLA data leaves models short on spatial structure for trajectory prediction. The named components and the explicit separation of perception, conditioning, and action stages are the concrete additions over prior VLA driving work.

The paper does a reasonable job stating the motivation and laying out how the stages fit together. Code availability lowers the barrier for anyone who wants to test the claims or adapt the conditioning approach.

The main soft spot is that the abstract supplies no numbers, baselines, or ablation breakdowns, so the size of the gains and whether each piece actually contributes remain unclear until the tables are examined. The premise that existing VLAs lack spatial dependencies is treated as background rather than measured head-to-head. If the full results include fair comparisons and component ablations, the central claim holds; otherwise the evidence stays thin.

This is for researchers already working on end-to-end VLA models for autonomous driving. Readers looking for practical ways to inject spatial guidance into these models could pick up usable ideas. It has enough structure, benchmark results, and public code to deserve a serious referee rather than a desk reject.

Referee Report

0 major / 3 minor

Summary. The manuscript introduces DriveTeach-VLA, a framework for Vision-Language-Action (VLA) models in autonomous driving. It proposes Driving-aware Vision Distillation (DVD) to inject driving-specific perceptual priors into the vision encoder during pretraining, and 2D Trajectory-Guided Prompts (2D-TGP) to supply spatial conditioning aligned with feasible trajectories during supervised fine-tuning (SFT) and GRPO stages. The pipeline is framed as teaching the model what to see (DVD), where to look (TGP-guided SFT), and how to act (TGP-guided GRPO). The central claim is that this approach reaches state-of-the-art performance on the NAVSIM and nuScenes benchmarks, with code released at the provided GitHub link.

Significance. If the reported performance gains hold under rigorous evaluation, the work could meaningfully advance end-to-end driving models by improving spatial structure in VLA representations. The explicit separation of perceptual pretraining from trajectory-guided fine-tuning stages offers a clear, modular recipe that other researchers could adapt. Open-sourcing the code is a concrete strength that lowers the barrier to verification and extension.

minor comments (3)

[Abstract] Abstract: the SOTA claim on NAVSIM and nuScenes is stated without any numerical metrics, baseline names, or ablation summaries; adding one or two key numbers (e.g., success rate or collision rate deltas) would make the abstract self-contained while remaining within length limits.
[Introduction] The motivation paragraph asserts that existing VLAs lack spatial dependencies, yet no diagnostic experiment or citation to a quantitative study of spatial awareness in prior VLAs is referenced; a short supporting sentence or reference would clarify the premise without altering the central contribution.
[Method] Terminology for the three-stage pipeline (DVD pretraining, TGP-guided SFT, TGP-guided GRPO) is introduced in the abstract but should be cross-referenced with consistent subsection headings in the method section to aid readers.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the constructive and positive review, including the recognition of our modular pipeline, open-sourced code, and potential impact on end-to-end driving models. The recommendation for minor revision is appreciated. No major comments were provided in the report, so we have no specific points requiring rebuttal or revision at this stage.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper introduces DriveTeach-VLA via two proposed components (DVD pretraining and 2D-TGP conditioning) and reports empirical SOTA results on NAVSIM and nuScenes. No equations, parameter-fitting steps, or derivation chains appear in the abstract or description. The central claims rest on the training pipeline stages rather than any self-referential definition, fitted-input prediction, or load-bearing self-citation. The background statement that prior VLAs lack spatial structure is presented as motivation, not a result derived from the method. This is a standard empirical proposal with no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are described or quantified.

pith-pipeline@v0.9.1-grok · 5764 in / 968 out tokens · 16923 ms · 2026-07-03T16:40:04.482555+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

56 extracted references · 29 canonical work pages · 12 internal anchors

[1]

Ang, S., Chen, Y., Haiyan, L., Mao, X., Bao, J., Xuliang, Sun, B., Wang, Y.: Asscg: Just-right gating over chattering for fast-slow llm planning in autonomous driving (2026),https://arxiv.org/abs/2606.25509

work page internal anchor Pith review Pith/arXiv arXiv 2026
[2]

Qwen2.5-VL Technical Report

Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2. 5-vl technical report. arXiv preprint arXiv:2502.13923 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Transactions on Computational and Scientific Methods5(4) (2025)

Broekman, N.: Toward safe and scalable autonomy: A comprehensive review of technologies, deployments, and challenges in autonomous driving. Transactions on Computational and Scientific Methods5(4) (2025)

2025
[4]

In: Proceedings of the IEEE/CVF international conference on computer vision

Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 9650–9660 (2021)

2021
[5]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Chen, C., Yang, Y., Tan, Z., Wang, Y., Zhan, R., Liu, H., Mao, X., Bao, J., Tang, X., Yang, L., et al.: Devil is in narrow policy: Unleashing exploration in driving vla models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1062–1072 (2026)

2026
[6]

In: European Conference on Computer Vision

Chen, Y., Ding, Z.h., Wang, Z., Wang, Y., Zhang, L., Liu, S.: Asynchronous large language model enhanced planner for autonomous driving. In: European Conference on Computer Vision. pp. 22–38. Springer (2024)

2024
[7]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al.: Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 24185–24198 (2024)

2024
[8]

Impromptu VLA: Open weights and open data for driving vision-language-action models

Chi, H., Gao, H.a., Liu, Z., Liu, J., Liu, C., Li, J., Yang, K., Yu, Y., Wang, Z., Li, W., et al.: Impromptu vla: Open weights and open data for driving vision-language- action models. arXiv preprint arXiv:2505.23757 (2025)

work page arXiv 2025
[9]

IEEE transactions on pattern analysis and machine intelligence45(11), 12878–12895 (2022)

Chitta, K., Prakash, A., Jaeger, B., Yu, Z., Renz, K., Geiger, A.: Transfuser: Imitation with transformer-based sensor fusion for autonomous driving. IEEE transactions on pattern analysis and machine intelligence45(11), 12878–12895 (2022)

2022
[10]

Advances in Neural Information Processing Systems37, 28706–28719 (2024)

Dauner, D., Hallgarten, M., Li, T., Weng, X., Huang, Z., Yang, Z., Li, H., Gilitschen- ski, I., Ivanovic, B., Pavone, M., et al.: Navsim: Data-driven non-reactive au- tonomous vehicle simulation and benchmarking. Advances in Neural Information Processing Systems37, 28706–28719 (2024)

2024
[11]

arXiv preprint arXiv:2504.19580 (2025)

Feng, R., Xi, N., Chu, D., Wang, R., Deng, Z., Wang, A., Lu, L., Wang, J., Huang, Y.: Artemis: Autoregressive end-to-end trajectory planning with mixture of experts for autonomous driving. arXiv preprint arXiv:2504.19580 (2025)

work page arXiv 2025
[12]

ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation

Fu, H., Zhang, D., Zhao, Z., Cui, J., Liang, D., Zhang, C., Zhang, D., Xie, H., Wang, B., Bai, X.: Orion: A holistic end-to-end autonomous driving framework by vision-language instructed action generation. arXiv preprint arXiv:2503.19755 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

Nature645(8081), 633–638 (2025)

Guo, D., Yang, D., Zhang, H., Song, J., Wang, P., Zhu, Q., Xu, R., Zhang, R., Ma, S., Bi, X., et al.: Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature645(8081), 633–638 (2025)

2025
[14]

Emerging topics in computer vision3, 45–108 (2005)

Heyden, A., Pollefeys, M.: Multiple view geometry. Emerging topics in computer vision3, 45–108 (2005)

2005
[15]

In: European Conference on Computer Vision

Hu, S., Chen, L., Wu, P., Li, H., Yan, J., Tao, D.: St-p3: End-to-end vision-based autonomous driving via spatial-temporal feature learning. In: European Conference on Computer Vision. pp. 533–549. Springer (2022) DriveTeach-VLA ECCV paper 17

2022
[16]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Hu, Y., Yang, J., Chen, L., Li, K., Sima, C., Zhu, X., Chai, S., Du, S., Lin, T., Wang, W., et al.: Planning-oriented autonomous driving. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 17853–17862 (2023)

2023
[17]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Huang, Z., Liu, H., Lv, C.: Gameformer: Game-theoretic modeling and learning of transformer-based interactive prediction and planning for autonomous driving. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3903–3913 (2023)

2023
[18]

IEEE transactions on neural networks and learning systems (2023)

Huang, Z., Liu, H., Wu, J., Lv, C.: Differentiable integrated motion prediction and planning with learnable cost function for autonomous driving. IEEE transactions on neural networks and learning systems (2023)

2023
[19]

arXiv preprint arXiv:2410.05582 (2024)

Huang, Z., Weng, X., Igl, M., Chen, Y., Cao, Y., Ivanovic, B., Pavone, M., Lv, C.: Gen-drive: Enhancing diffusion generative driving policies with reward modeling and reinforcement learning fine-tuning. arXiv preprint arXiv:2410.05582 (2024)

work page arXiv 2024
[20]

GPT-4o System Card

Hurst, A., Lerer, A., Goucher, A.P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A., Welihinda, A., Hayes, A., Radford, A., et al.: Gpt-4o system card. arXiv preprint arXiv:2410.21276 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[21]

EMMA: End-to-End Multimodal Model for Autonomous Driving

Hwang, J.J., Xu, R., Lin, H., Hung, W.C., Ji, J., Choi, K., Huang, D., He, T., Cov- ington, P., Sapp, B., et al.: Emma: End-to-end multimodal model for autonomous driving. arXiv preprint arXiv:2410.23262 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[22]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Jiang, B., Chen, S., Xu, Q., Liao, B., Chen, J., Zhou, H., Zhang, Q., Liu, W., Huang, C., Wang, X.: Vad: Vectorized scene representation for efficient autonomous driving. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 8340–8350 (2023)

2023
[23]

arXiv preprint arXiv:2508.09158 (2025)

Jiao, S., Qian, K., Ye, H., Zhong, Y., Luo, Z., Jiang, S., Huang, Z., Fang, Y., Miao, J., Fu, Z., et al.: Evadrive: Evolutionary adversarial policy optimization for end-to-end autonomous driving. arXiv preprint arXiv:2508.09158 (2025)

work page arXiv 2025
[24]

Kirby, A

Kirby, E., Boulch, A., Xu, Y., Yin, Y., Puy, G., Zablocki, É., Bursuc, A., Gi- daris, S., Marlet, R., Bartoccioni, F., et al.: Driving on registers. arXiv preprint arXiv:2601.05083 (2026)

work page arXiv 2026
[25]

arXiv preprint arXiv:2508.11428 (2025)

Li, J., Zhang, B., Jin, X., Deng, J., Zhu, X., Zhang, L.: Imagidrive: A uni- fied imagination-and-planning framework for autonomous driving. arXiv preprint arXiv:2508.11428 (2025)

work page arXiv 2025
[26]

arXiv preprint arXiv:2503.12820 (2025)

Li, K., Li, Z., Lan, S., Xie, Y., Zhang, Z., Liu, J., Wu, Z., Yu, Z., Alvarez, J.M.: Hydra-mdp++: Advancing end-to-end driving via expert-guided hydra-distillation. arXiv preprint arXiv:2503.12820 (2025)

work page arXiv 2025
[27]

DriveVLA-W0: World Models Amplify Data Scaling Law in Autonomous Driving

Li, Y., Shang, S., Liu, W., Zhan, B., Wang, H., Wang, Y., Chen, Y., Wang, X., An, Y., Tang, C., et al.: Drivevla-w0: World models amplify data scaling law in autonomous driving. arXiv preprint arXiv:2510.12796 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[28]

arXiv preprint arXiv:2504.01941 (2025)

Li, Y., Wang, Y., Liu, Y., He, J., Fan, L., Zhang, Z.: End-to-end driving with online trajectory evaluation via bev world model. arXiv preprint arXiv:2504.01941 (2025)

work page arXiv 2025
[29]

ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving

Li, Y., Xiong, K., Guo, X., Li, F., Yan, S., Xu, G., Zhou, L., Chen, L., Sun, H., Wang, B., et al.: Recogdrive: A reinforced cognitive framework for end-to-end autonomous driving. arXiv preprint arXiv:2506.08052 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[30]

Hydra-MDP: End-to-end Multimodal Planning with Multi-target Hydra-Distillation

Li, Z., Li, K., Wang, S., Lan, S., Yu, Z., Ji, Y., Li, Z., Zhu, Z., Kautz, J., Wu, Z., et al.: Hydra-mdp: End-to-end multimodal planning with multi-target hydra-distillation. arXiv preprint arXiv:2406.06978 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[31]

IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) 18 Y

Li, Z., Wang, W., Li, H., Xie, E., Sima, C., Lu, T., Yu, Q., Dai, J.: Bevformer: learning bird’s-eye-view representation from lidar-camera via spatiotemporal trans- formers. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) 18 Y. Yang et al

2024
[32]

Li, Z., Yu, Z., Lan, S., Li, J., Kautz, J., Lu, T., Alvarez, J.M.: Is ego status all you need for open-loop end-to-end autonomous driving? In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14864–14873 (2024)

2024
[33]

Advances in Neural Information Processing Systems35, 10421–10434 (2022)

Liang, T., Xie, H., Yu, K., Xia, Z., Lin, Z., Wang, Y., Tang, T., Wang, B., Tang, Z.: Bevfusion: A simple and robust lidar-camera fusion framework. Advances in Neural Information Processing Systems35, 10421–10434 (2022)

2022
[34]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Liao, B., Chen, S., Yin, H., Jiang, B., Wang, C., Yan, S., Zhang, X., Li, X., Zhang, Y., Zhang, Q., et al.: Diffusiondrive: Truncated diffusion model for end- to-end autonomous driving. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 12037–12047 (2025)

2025
[35]

In: 2021 IEEE Symposium on Computers and Communications (ISCC)

Liu, C., Yu, S., Yu, M., Wei, B., Li, B., Li, G., Huang, W.: Adaptive smooth l1 loss: A better way to regress scene texts with extreme aspect ratios. In: 2021 IEEE Symposium on Computers and Communications (ISCC). pp. 1–7. IEEE (2021)

2021
[36]

IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)

Liu, H., Huang, Z., Huang, W., Yang, H., Mo, X., Lv, C.: Hybrid-prediction integrated planning for autonomous driving. IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)

2025
[37]

Advances in neural information processing systems36, 34892–34916 (2023)

Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural information processing systems36, 34892–34916 (2023)

2023
[38]

arXiv preprint arXiv:2509.05578 (2025)

Liu, R., Kong, L., Li, D., Zhao, H.: Occvla: Vision-language-action model with implicit 3d occupancy supervision. arXiv preprint arXiv:2509.05578 (2025)

work page arXiv 2025
[39]

In: European conference on computer vision

Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Jiang, Q., Li, C., Yang, J., Su, H., et al.: Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In: European conference on computer vision. pp. 38–55. Springer (2024)

2024
[40]

arXiv preprint arXiv:2509.13769 (2025)

Luo, Y., Li, F., Xu, S., Lai, Z., Yang, L., Chen, Q., Luo, Z., Xie, Z., Jiang, S., Liu, J., et al.: Adathinkdrive: Adaptive thinking via reinforcement learning for autonomous driving. arXiv preprint arXiv:2509.13769 (2025)

work page arXiv 2025
[41]

In: Proceed- ings of the AAAI Conference on Artificial Intelligence

Qian, T., Chen, J., Zhuo, L., Jiao, Y., Jiang, Y.G.: Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario. In: Proceed- ings of the AAAI Conference on Artificial Intelligence. pp. 4542–4550 (2024)

2024
[42]

arXiv preprint arXiv:2505.00284 (2025)

Qiao, Z., Li, H., Cao, Z., Liu, H.X.: Lightemma: Lightweight end-to-end multimodal model for autonomous driving. arXiv preprint arXiv:2505.00284 (2025)

work page arXiv 2025
[43]

arXiv preprint arXiv:2503.09594 (2025)

Renz, K., Chen, L., Arani, E., Sinavski, O.: Simlingo: Vision-only closed-loop au- tonomous driving with language-action alignment. arXiv preprint arXiv:2503.09594 (2025)

work page arXiv 2025
[44]

arXiv preprint arXiv:2406.10165 (2024)

Renz, K., Chen, L., Marcu, A.M., Hünermann, J., Hanotte, B., Karnsund, A., Shotton, J., Arani, E., Sinavski, O.: Carllava: Vision language models for camera- only closed-loop driving. arXiv preprint arXiv:2406.10165 (2024)

work page arXiv 2024
[45]

arXiv preprint arXiv:2506.11234 (2025)

Rowe, L., de Schaetzen, R., Girgis, R., Pal, C., Paull, L.: Poutine: Vision-language- trajectory pre-training and reinforcement learning post-training enable robust end-to-end autonomous driving. arXiv preprint arXiv:2506.11234 (2025)

work page arXiv 2025
[46]

IEEE Transactions on Pattern Analysis and Machine Intelligence46(5), 3955–3971 (2024)

Shi, S., Jiang, L., Dai, D., Schiele, B.: Mtr++: Multi-agent motion prediction with symmetric scene modeling and guided intention querying. IEEE Transactions on Pattern Analysis and Machine Intelligence46(5), 3955–3971 (2024)

2024
[47]

In: European conference on computer vision

Sima, C., Renz, K., Chitta, K., Chen, L., Zhang, H., Xie, C., Beißwenger, J., Luo, P., Geiger, A., Li, H.: Drivelm: Driving with graph visual question answering. In: European conference on computer vision. pp. 256–274. Springer (2024)

2024
[48]

DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

Tian, X., Gu, J., Li, B., Liu, Y., Wang, Y., Zhao, Z., Zhan, K., Jia, P., Lang, X., Zhao, H.: Drivevlm: The convergence of autonomous driving and large vision-language models. arXiv preprint arXiv:2402.12289 (2024) DriveTeach-VLA ECCV paper 19

work page internal anchor Pith review Pith/arXiv arXiv 2024
[49]

arXiv preprint arXiv:2405.01533 (2024)

Wang, S., Yu, Z., Jiang, X., Lan, S., Shi, M., Chang, N., Kautz, J., Li, Y., Alvarez, J.M.: Omnidrive: A holistic llm-agent framework for autonomous driving with 3d perception, reasoning and planning. arXiv preprint arXiv:2405.01533 (2024)

work page arXiv 2024
[50]

In: Conference on Robot Learning

Wang, Y., Guizilini, V.C., Zhang, T., Wang, Y., Zhao, H., Solomon, J.: Detr3d: 3d object detection from multi-view images via 3d-to-2d queries. In: Conference on Robot Learning. pp. 180–191. PMLR (2022)

2022
[51]

DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

Wu, Z., Chen, X., Pan, Z., Liu, X., Liu, W., Dai, D., Gao, H., Ma, Y., Wu, C., Wang, B., Xie, Z., Wu, Y., Hu, K., Wang, J., Sun, Y., Li, Y., Piao, Y., Guan, K., Liu, A., Xie, X., You, Y., Dong, K., Yu, X., Zhang, H., Zhao, L., Wang, Y., Ruan, C.: Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding. https://arxiv.o...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[52]

In: Proceedings of the Winter Conference on Applications of Computer Vision

Xing, S., Qian, C., Wang, Y., Hua, H., Tian, K., Zhou, Y., Tu, Z.: Openemma: Open-source multimodal model for end-to-end autonomous driving. In: Proceedings of the Winter Conference on Applications of Computer Vision. pp. 1001–1009 (2025)

2025
[53]

DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving

Yang, Z., Chai, Y., Jia, X., Li, Q., Shao, Y., Zhu, X., Su, H., Yan, J.: Drivemoe: Mixture-of-experts for vision-language-action model in end-to-end autonomous driving. arXiv preprint arXiv:2505.16278 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[54]

arXiv preprint arXiv:2503.23463 (2025)

Zhou, X., Han, X., Yang, F., Ma, Y., Knoll, A.C.: Opendrivevla: Towards end-to- end autonomous driving with large vision language action model. arXiv preprint arXiv:2503.23463 (2025)

work page arXiv 2025
[55]

AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning

Zhou, Z., Cai, T., Zhao, S.Z., Zhang, Y., Huang, Z., Zhou, B., Ma, J.: Autovla: A vision-language-action model for end-to-end autonomous driving with adaptive reasoning and reinforcement fine-tuning. arXiv preprint arXiv:2506.13757 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[56]

yes" or

Zhou, Z., Wen, Z., Wang, J., Li, Y.H., Huang, Y.K.: Qcnext: A next- generation framework for joint multi-agent trajectory prediction. arXiv preprint arXiv:2306.10508 (2023) DriveTeach-VLA ECCV paper 1 A NA VSIM Best-of-N PDMS & EPDMS Autoregressive model often exhibit strong exploration capability. In Table 11, we compare the best-of-N performance. For ex...

work page arXiv 2023

[1] [1]

Ang, S., Chen, Y., Haiyan, L., Mao, X., Bao, J., Xuliang, Sun, B., Wang, Y.: Asscg: Just-right gating over chattering for fast-slow llm planning in autonomous driving (2026),https://arxiv.org/abs/2606.25509

work page internal anchor Pith review Pith/arXiv arXiv 2026

[2] [2]

Qwen2.5-VL Technical Report

Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2. 5-vl technical report. arXiv preprint arXiv:2502.13923 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

Transactions on Computational and Scientific Methods5(4) (2025)

Broekman, N.: Toward safe and scalable autonomy: A comprehensive review of technologies, deployments, and challenges in autonomous driving. Transactions on Computational and Scientific Methods5(4) (2025)

2025

[4] [4]

In: Proceedings of the IEEE/CVF international conference on computer vision

Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 9650–9660 (2021)

2021

[5] [5]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Chen, C., Yang, Y., Tan, Z., Wang, Y., Zhan, R., Liu, H., Mao, X., Bao, J., Tang, X., Yang, L., et al.: Devil is in narrow policy: Unleashing exploration in driving vla models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1062–1072 (2026)

2026

[6] [6]

In: European Conference on Computer Vision

Chen, Y., Ding, Z.h., Wang, Z., Wang, Y., Zhang, L., Liu, S.: Asynchronous large language model enhanced planner for autonomous driving. In: European Conference on Computer Vision. pp. 22–38. Springer (2024)

2024

[7] [7]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al.: Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 24185–24198 (2024)

2024

[8] [8]

Impromptu VLA: Open weights and open data for driving vision-language-action models

Chi, H., Gao, H.a., Liu, Z., Liu, J., Liu, C., Li, J., Yang, K., Yu, Y., Wang, Z., Li, W., et al.: Impromptu vla: Open weights and open data for driving vision-language- action models. arXiv preprint arXiv:2505.23757 (2025)

work page arXiv 2025

[9] [9]

IEEE transactions on pattern analysis and machine intelligence45(11), 12878–12895 (2022)

Chitta, K., Prakash, A., Jaeger, B., Yu, Z., Renz, K., Geiger, A.: Transfuser: Imitation with transformer-based sensor fusion for autonomous driving. IEEE transactions on pattern analysis and machine intelligence45(11), 12878–12895 (2022)

2022

[10] [10]

Advances in Neural Information Processing Systems37, 28706–28719 (2024)

Dauner, D., Hallgarten, M., Li, T., Weng, X., Huang, Z., Yang, Z., Li, H., Gilitschen- ski, I., Ivanovic, B., Pavone, M., et al.: Navsim: Data-driven non-reactive au- tonomous vehicle simulation and benchmarking. Advances in Neural Information Processing Systems37, 28706–28719 (2024)

2024

[11] [11]

arXiv preprint arXiv:2504.19580 (2025)

Feng, R., Xi, N., Chu, D., Wang, R., Deng, Z., Wang, A., Lu, L., Wang, J., Huang, Y.: Artemis: Autoregressive end-to-end trajectory planning with mixture of experts for autonomous driving. arXiv preprint arXiv:2504.19580 (2025)

work page arXiv 2025

[12] [12]

ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation

Fu, H., Zhang, D., Zhao, Z., Cui, J., Liang, D., Zhang, C., Zhang, D., Xie, H., Wang, B., Bai, X.: Orion: A holistic end-to-end autonomous driving framework by vision-language instructed action generation. arXiv preprint arXiv:2503.19755 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[13] [13]

Nature645(8081), 633–638 (2025)

Guo, D., Yang, D., Zhang, H., Song, J., Wang, P., Zhu, Q., Xu, R., Zhang, R., Ma, S., Bi, X., et al.: Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature645(8081), 633–638 (2025)

2025

[14] [14]

Emerging topics in computer vision3, 45–108 (2005)

Heyden, A., Pollefeys, M.: Multiple view geometry. Emerging topics in computer vision3, 45–108 (2005)

2005

[15] [15]

In: European Conference on Computer Vision

Hu, S., Chen, L., Wu, P., Li, H., Yan, J., Tao, D.: St-p3: End-to-end vision-based autonomous driving via spatial-temporal feature learning. In: European Conference on Computer Vision. pp. 533–549. Springer (2022) DriveTeach-VLA ECCV paper 17

2022

[16] [16]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Hu, Y., Yang, J., Chen, L., Li, K., Sima, C., Zhu, X., Chai, S., Du, S., Lin, T., Wang, W., et al.: Planning-oriented autonomous driving. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 17853–17862 (2023)

2023

[17] [17]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Huang, Z., Liu, H., Lv, C.: Gameformer: Game-theoretic modeling and learning of transformer-based interactive prediction and planning for autonomous driving. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3903–3913 (2023)

2023

[18] [18]

IEEE transactions on neural networks and learning systems (2023)

Huang, Z., Liu, H., Wu, J., Lv, C.: Differentiable integrated motion prediction and planning with learnable cost function for autonomous driving. IEEE transactions on neural networks and learning systems (2023)

2023

[19] [19]

arXiv preprint arXiv:2410.05582 (2024)

Huang, Z., Weng, X., Igl, M., Chen, Y., Cao, Y., Ivanovic, B., Pavone, M., Lv, C.: Gen-drive: Enhancing diffusion generative driving policies with reward modeling and reinforcement learning fine-tuning. arXiv preprint arXiv:2410.05582 (2024)

work page arXiv 2024

[20] [20]

GPT-4o System Card

Hurst, A., Lerer, A., Goucher, A.P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A., Welihinda, A., Hayes, A., Radford, A., et al.: Gpt-4o system card. arXiv preprint arXiv:2410.21276 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[21] [21]

EMMA: End-to-End Multimodal Model for Autonomous Driving

Hwang, J.J., Xu, R., Lin, H., Hung, W.C., Ji, J., Choi, K., Huang, D., He, T., Cov- ington, P., Sapp, B., et al.: Emma: End-to-end multimodal model for autonomous driving. arXiv preprint arXiv:2410.23262 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[22] [22]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Jiang, B., Chen, S., Xu, Q., Liao, B., Chen, J., Zhou, H., Zhang, Q., Liu, W., Huang, C., Wang, X.: Vad: Vectorized scene representation for efficient autonomous driving. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 8340–8350 (2023)

2023

[23] [23]

arXiv preprint arXiv:2508.09158 (2025)

Jiao, S., Qian, K., Ye, H., Zhong, Y., Luo, Z., Jiang, S., Huang, Z., Fang, Y., Miao, J., Fu, Z., et al.: Evadrive: Evolutionary adversarial policy optimization for end-to-end autonomous driving. arXiv preprint arXiv:2508.09158 (2025)

work page arXiv 2025

[24] [24]

Kirby, A

Kirby, E., Boulch, A., Xu, Y., Yin, Y., Puy, G., Zablocki, É., Bursuc, A., Gi- daris, S., Marlet, R., Bartoccioni, F., et al.: Driving on registers. arXiv preprint arXiv:2601.05083 (2026)

work page arXiv 2026

[25] [25]

arXiv preprint arXiv:2508.11428 (2025)

Li, J., Zhang, B., Jin, X., Deng, J., Zhu, X., Zhang, L.: Imagidrive: A uni- fied imagination-and-planning framework for autonomous driving. arXiv preprint arXiv:2508.11428 (2025)

work page arXiv 2025

[26] [26]

arXiv preprint arXiv:2503.12820 (2025)

Li, K., Li, Z., Lan, S., Xie, Y., Zhang, Z., Liu, J., Wu, Z., Yu, Z., Alvarez, J.M.: Hydra-mdp++: Advancing end-to-end driving via expert-guided hydra-distillation. arXiv preprint arXiv:2503.12820 (2025)

work page arXiv 2025

[27] [27]

DriveVLA-W0: World Models Amplify Data Scaling Law in Autonomous Driving

Li, Y., Shang, S., Liu, W., Zhan, B., Wang, H., Wang, Y., Chen, Y., Wang, X., An, Y., Tang, C., et al.: Drivevla-w0: World models amplify data scaling law in autonomous driving. arXiv preprint arXiv:2510.12796 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[28] [28]

arXiv preprint arXiv:2504.01941 (2025)

Li, Y., Wang, Y., Liu, Y., He, J., Fan, L., Zhang, Z.: End-to-end driving with online trajectory evaluation via bev world model. arXiv preprint arXiv:2504.01941 (2025)

work page arXiv 2025

[29] [29]

ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving

Li, Y., Xiong, K., Guo, X., Li, F., Yan, S., Xu, G., Zhou, L., Chen, L., Sun, H., Wang, B., et al.: Recogdrive: A reinforced cognitive framework for end-to-end autonomous driving. arXiv preprint arXiv:2506.08052 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[30] [30]

Hydra-MDP: End-to-end Multimodal Planning with Multi-target Hydra-Distillation

Li, Z., Li, K., Wang, S., Lan, S., Yu, Z., Ji, Y., Li, Z., Zhu, Z., Kautz, J., Wu, Z., et al.: Hydra-mdp: End-to-end multimodal planning with multi-target hydra-distillation. arXiv preprint arXiv:2406.06978 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[31] [31]

IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) 18 Y

Li, Z., Wang, W., Li, H., Xie, E., Sima, C., Lu, T., Yu, Q., Dai, J.: Bevformer: learning bird’s-eye-view representation from lidar-camera via spatiotemporal trans- formers. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) 18 Y. Yang et al

2024

[32] [32]

Li, Z., Yu, Z., Lan, S., Li, J., Kautz, J., Lu, T., Alvarez, J.M.: Is ego status all you need for open-loop end-to-end autonomous driving? In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14864–14873 (2024)

2024

[33] [33]

Advances in Neural Information Processing Systems35, 10421–10434 (2022)

Liang, T., Xie, H., Yu, K., Xia, Z., Lin, Z., Wang, Y., Tang, T., Wang, B., Tang, Z.: Bevfusion: A simple and robust lidar-camera fusion framework. Advances in Neural Information Processing Systems35, 10421–10434 (2022)

2022

[34] [34]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Liao, B., Chen, S., Yin, H., Jiang, B., Wang, C., Yan, S., Zhang, X., Li, X., Zhang, Y., Zhang, Q., et al.: Diffusiondrive: Truncated diffusion model for end- to-end autonomous driving. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 12037–12047 (2025)

2025

[35] [35]

In: 2021 IEEE Symposium on Computers and Communications (ISCC)

Liu, C., Yu, S., Yu, M., Wei, B., Li, B., Li, G., Huang, W.: Adaptive smooth l1 loss: A better way to regress scene texts with extreme aspect ratios. In: 2021 IEEE Symposium on Computers and Communications (ISCC). pp. 1–7. IEEE (2021)

2021

[36] [36]

IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)

Liu, H., Huang, Z., Huang, W., Yang, H., Mo, X., Lv, C.: Hybrid-prediction integrated planning for autonomous driving. IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)

2025

[37] [37]

Advances in neural information processing systems36, 34892–34916 (2023)

Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural information processing systems36, 34892–34916 (2023)

2023

[38] [38]

arXiv preprint arXiv:2509.05578 (2025)

Liu, R., Kong, L., Li, D., Zhao, H.: Occvla: Vision-language-action model with implicit 3d occupancy supervision. arXiv preprint arXiv:2509.05578 (2025)

work page arXiv 2025

[39] [39]

In: European conference on computer vision

Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Jiang, Q., Li, C., Yang, J., Su, H., et al.: Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In: European conference on computer vision. pp. 38–55. Springer (2024)

2024

[40] [40]

arXiv preprint arXiv:2509.13769 (2025)

Luo, Y., Li, F., Xu, S., Lai, Z., Yang, L., Chen, Q., Luo, Z., Xie, Z., Jiang, S., Liu, J., et al.: Adathinkdrive: Adaptive thinking via reinforcement learning for autonomous driving. arXiv preprint arXiv:2509.13769 (2025)

work page arXiv 2025

[41] [41]

In: Proceed- ings of the AAAI Conference on Artificial Intelligence

Qian, T., Chen, J., Zhuo, L., Jiao, Y., Jiang, Y.G.: Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario. In: Proceed- ings of the AAAI Conference on Artificial Intelligence. pp. 4542–4550 (2024)

2024

[42] [42]

arXiv preprint arXiv:2505.00284 (2025)

Qiao, Z., Li, H., Cao, Z., Liu, H.X.: Lightemma: Lightweight end-to-end multimodal model for autonomous driving. arXiv preprint arXiv:2505.00284 (2025)

work page arXiv 2025

[43] [43]

arXiv preprint arXiv:2503.09594 (2025)

Renz, K., Chen, L., Arani, E., Sinavski, O.: Simlingo: Vision-only closed-loop au- tonomous driving with language-action alignment. arXiv preprint arXiv:2503.09594 (2025)

work page arXiv 2025

[44] [44]

arXiv preprint arXiv:2406.10165 (2024)

Renz, K., Chen, L., Marcu, A.M., Hünermann, J., Hanotte, B., Karnsund, A., Shotton, J., Arani, E., Sinavski, O.: Carllava: Vision language models for camera- only closed-loop driving. arXiv preprint arXiv:2406.10165 (2024)

work page arXiv 2024

[45] [45]

arXiv preprint arXiv:2506.11234 (2025)

Rowe, L., de Schaetzen, R., Girgis, R., Pal, C., Paull, L.: Poutine: Vision-language- trajectory pre-training and reinforcement learning post-training enable robust end-to-end autonomous driving. arXiv preprint arXiv:2506.11234 (2025)

work page arXiv 2025

[46] [46]

IEEE Transactions on Pattern Analysis and Machine Intelligence46(5), 3955–3971 (2024)

Shi, S., Jiang, L., Dai, D., Schiele, B.: Mtr++: Multi-agent motion prediction with symmetric scene modeling and guided intention querying. IEEE Transactions on Pattern Analysis and Machine Intelligence46(5), 3955–3971 (2024)

2024

[47] [47]

In: European conference on computer vision

Sima, C., Renz, K., Chitta, K., Chen, L., Zhang, H., Xie, C., Beißwenger, J., Luo, P., Geiger, A., Li, H.: Drivelm: Driving with graph visual question answering. In: European conference on computer vision. pp. 256–274. Springer (2024)

2024

[48] [48]

DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

Tian, X., Gu, J., Li, B., Liu, Y., Wang, Y., Zhao, Z., Zhan, K., Jia, P., Lang, X., Zhao, H.: Drivevlm: The convergence of autonomous driving and large vision-language models. arXiv preprint arXiv:2402.12289 (2024) DriveTeach-VLA ECCV paper 19

work page internal anchor Pith review Pith/arXiv arXiv 2024

[49] [49]

arXiv preprint arXiv:2405.01533 (2024)

Wang, S., Yu, Z., Jiang, X., Lan, S., Shi, M., Chang, N., Kautz, J., Li, Y., Alvarez, J.M.: Omnidrive: A holistic llm-agent framework for autonomous driving with 3d perception, reasoning and planning. arXiv preprint arXiv:2405.01533 (2024)

work page arXiv 2024

[50] [50]

In: Conference on Robot Learning

Wang, Y., Guizilini, V.C., Zhang, T., Wang, Y., Zhao, H., Solomon, J.: Detr3d: 3d object detection from multi-view images via 3d-to-2d queries. In: Conference on Robot Learning. pp. 180–191. PMLR (2022)

2022

[51] [51]

DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

Wu, Z., Chen, X., Pan, Z., Liu, X., Liu, W., Dai, D., Gao, H., Ma, Y., Wu, C., Wang, B., Xie, Z., Wu, Y., Hu, K., Wang, J., Sun, Y., Li, Y., Piao, Y., Guan, K., Liu, A., Xie, X., You, Y., Dong, K., Yu, X., Zhang, H., Zhao, L., Wang, Y., Ruan, C.: Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding. https://arxiv.o...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[52] [52]

In: Proceedings of the Winter Conference on Applications of Computer Vision

Xing, S., Qian, C., Wang, Y., Hua, H., Tian, K., Zhou, Y., Tu, Z.: Openemma: Open-source multimodal model for end-to-end autonomous driving. In: Proceedings of the Winter Conference on Applications of Computer Vision. pp. 1001–1009 (2025)

2025

[53] [53]

DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving

Yang, Z., Chai, Y., Jia, X., Li, Q., Shao, Y., Zhu, X., Su, H., Yan, J.: Drivemoe: Mixture-of-experts for vision-language-action model in end-to-end autonomous driving. arXiv preprint arXiv:2505.16278 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[54] [54]

arXiv preprint arXiv:2503.23463 (2025)

Zhou, X., Han, X., Yang, F., Ma, Y., Knoll, A.C.: Opendrivevla: Towards end-to- end autonomous driving with large vision language action model. arXiv preprint arXiv:2503.23463 (2025)

work page arXiv 2025

[55] [55]

AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning

Zhou, Z., Cai, T., Zhao, S.Z., Zhang, Y., Huang, Z., Zhou, B., Ma, J.: Autovla: A vision-language-action model for end-to-end autonomous driving with adaptive reasoning and reinforcement fine-tuning. arXiv preprint arXiv:2506.13757 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[56] [56]

yes" or

Zhou, Z., Wen, Z., Wang, J., Li, Y.H., Huang, Y.K.: Qcnext: A next- generation framework for joint multi-agent trajectory prediction. arXiv preprint arXiv:2306.10508 (2023) DriveTeach-VLA ECCV paper 1 A NA VSIM Best-of-N PDMS & EPDMS Autoregressive model often exhibit strong exploration capability. In Table 11, we compare the best-of-N performance. For ex...

work page arXiv 2023