VECTOR-Drive: Tightly Coupled Vision-Language and Trajectory Expert Routing for End-to-End Autonomous Driving

Fei Gao; Jianlin Yu; Jiaqiao Liu; Rui Zhao; Zhenhai Gao

arxiv: 2605.08830 · v2 · pith:GYE2EZG5new · submitted 2026-05-09 · 💻 cs.CV · cs.AI · cs.RO

Pith reviewed 2026-05-20 23:09 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.RO

keywords end-to-end autonomous driving · vision-language-action models · expert routing · multimodal transformer · flow matching · semantic-motion coupling · trajectory planning

The pith

Routing feed-forward computation to separate vision-language and trajectory experts inside a shared-attention Transformer couples semantic reasoning with motion planning for end-to-end driving.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to resolve a core tension in vision-language-action models for autonomous driving: fully shared networks let semantics and motion interact but risk task entanglement, while separate pipelines avoid conflict at the cost of weaker coupling. VECTOR-DRIVE keeps every token in the same self-attention layers so vision, language, and trajectory information can still exchange information, yet it directs feed-forward work to two different experts according to token type. Vision and language tokens stay with a Vision-Language Expert that retains pretrained semantic knowledge, while target-point, ego-state, and action tokens move to a Trajectory Expert that focuses on motion computation; a flow-matching step then turns noisy actions into concrete waypoints and speeds. If this separation works as intended, driving systems could gain both the scene-understanding strengths of large vision-language models and the precise planning needed for safe control, all inside one network rather than stitched-together modules.

Core claim

We propose VECTOR-DRIVE, a tightly coupled VLA framework built on Qwen2.5-VL-3B. VECTOR-DRIVE keeps all tokens coupled through shared self attention and routes feed-forward computation according to token semantics. Vision and language tokens are processed by a Vision-Language Expert to preserve semantic priors, while target-point, ego-state, and noisy action tokens are routed to a Trajectory Expert for motion-specific computation. On the action-token pathway, a flow-matching planner refines noisy action tokens into future waypoints and speed profiles. This design couples semantic reasoning and motion planning within a single multimodal Transformer while separating task-specific FFN computation.
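To make the flow-matching step concrete, here is a minimal sketch of how a sampler of this kind typically turns a noisy action vector into a plan: it integrates a velocity field from t = 0 (noise) to t = 1 (action) with Euler steps. This is a generic illustration, not the paper's planner. The names `euler_flow_sample` and `make_toy_velocity` are hypothetical, the flat action layout is an assumption, and the toy velocity field stands in for the learned network, which in the paper would be conditioned on the multimodal context.

```python
def euler_flow_sample(velocity_fn, noisy_actions, num_steps=10):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (action) with Euler steps."""
    x = list(noisy_actions)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so the toy field never divides by zero
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target):
    """Hypothetical stand-in for the learned velocity network.

    Returns the exact velocity field of the straight-line path
    x_t = (1 - t) * x_0 + t * target, for which v = (target - x_t) / (1 - t).
    """
    def velocity_fn(x, t):
        return [(gi - xi) / (1.0 - t) for gi, xi in zip(target, x)]
    return velocity_fn


# Usage: a flat action vector, assumed here to hold two waypoint coordinates and a speed.
target_plan = [1.0, 2.0, 3.0]
noise = [0.5, -0.3, 0.0]
plan = euler_flow_sample(make_toy_velocity(target_plan), noise, num_steps=10)
```

With the exact straight-line field the Euler steps land on the target up to floating-point error; a learned field would only approximate this, and the number of integration steps becomes a speed versus accuracy choice.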

What carries the argument

Semantic-aware expert routing inside a multimodal Transformer: shared self-attention layers process vision, language, and trajectory tokens together, after which feed-forward layers are specialized into a Vision-Language Expert and a Trajectory Expert according to token type.
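The mechanism described above can be sketched in a few lines. The block below is a toy illustration under stated assumptions, not the paper's implementation: attention uses a single head with identity projections, layer normalization is omitted, and the names `TokenType`, `route_ffn`, and `block` are hypothetical. It shows the two properties the design relies on: every token attends to every other token, and the feed-forward expert is chosen by a fixed rule on token type.

```python
import math
from enum import Enum


class TokenType(Enum):
    VISION = "vision"
    LANGUAGE = "language"
    TARGET_POINT = "target_point"
    EGO_STATE = "ego_state"
    ACTION = "action"


# Vision and language tokens go to the Vision-Language Expert; all others to the Trajectory Expert.
VL_TYPES = {TokenType.VISION, TokenType.LANGUAGE}


def shared_self_attention(hidden):
    """Toy single-head softmax attention with identity Q/K/V projections."""
    d = len(hidden[0])
    out = []
    for q in hidden:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in hidden]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append([sum(w / z * v[i] for w, v in zip(weights, hidden)) for i in range(d)])
    return out


def route_ffn(hidden, token_types, vl_expert, traj_expert):
    """Hard, non-learned routing: the token type alone selects the expert."""
    return [
        (vl_expert if ty in VL_TYPES else traj_expert)(h)
        for h, ty in zip(hidden, token_types)
    ]


def block(hidden, token_types, vl_expert, traj_expert):
    """One layer: shared attention over all tokens, then type-routed FFN, with residuals."""
    attn = shared_self_attention(hidden)
    h = [[a + b for a, b in zip(x, y)] for x, y in zip(hidden, attn)]
    ffn = route_ffn(h, token_types, vl_expert, traj_expert)
    return [[a + b for a, b in zip(x, y)] for x, y in zip(h, ffn)]
```

Because attention runs before routing, changing a vision token changes the output of an action token even though the two never share an FFN. Passing the same function for both experts recovers a fully shared feed-forward layer, the comparison the ablations are described as testing.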

If this is right

Shared attention plus semantic routing produces higher driving scores than either fully shared or fully decoupled baselines.
Ablations show that removing the expert split or the shared attention each hurts performance.
Progressive training and flow-matching action decoding each add measurable gains on the same benchmark.
The approach keeps pretrained semantic knowledge usable for planning without requiring separate reasoning and control stages.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same token-type routing pattern could be tested on other integrated perception-planning tasks such as robotic manipulation.
If the experts can be made even more specialized, the model might handle longer-horizon or more crowded scenes with less interference.
Replacing the fixed routing rule with a learned router that decides per token on the fly would be a direct next experiment.
Scaling the same architecture to larger base vision-language models might preserve the coupling benefit while increasing overall capacity.
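The learned-router extension mentioned in the list above could look roughly like the following sketch. It is an editorial illustration only, since the paper uses a fixed routing rule: a linear gate scores each token, the highest-probability expert is selected (top-1), and its output is scaled by the gate probability. The names `learned_route` and `gate_weights` are hypothetical, and load-balancing losses that such routers usually need are left out.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def learned_route(hidden, gate_weights, experts):
    """Top-1 learned routing.

    hidden:       list of token vectors
    gate_weights: one weight row per expert (hypothetical learned parameters)
    experts:      list of callables, one per expert
    Returns the routed outputs and the chosen expert index for each token.
    """
    outputs, choices = [], []
    for h in hidden:
        logits = [sum(w * x for w, x in zip(row, h)) for row in gate_weights]
        probs = softmax(logits)
        idx = max(range(len(probs)), key=probs.__getitem__)
        # Scaling by the gate probability keeps the gate in the gradient path during training.
        outputs.append([probs[idx] * y for y in experts[idx](h)])
        choices.append(idx)
    return outputs, choices
```

Comparing the `choices` a trained gate produces against the fixed token-type rule would show whether the model rediscovers the semantic split on its own.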

Load-bearing premise

That sending different token types to separate feed-forward experts reduces conflict between language reasoning and trajectory prediction while the shared self-attention layers still maintain useful interaction between semantics and motion.

What would settle it

An ablation on Bench2Drive that replaces the two specialized experts with a single shared feed-forward network across all tokens and measures whether driving score falls below the reported result.

Figures

Figures reproduced from arXiv: 2605.08830 by Fei Gao, Jianlin Yu, Jiaqiao Liu, Rui Zhao, Zhenhai Gao.

**Figure 1.** Three VLA design paradigms. Left: a shared VLM predicts actions with a single trajectory head. Middle: reasoning and chunk-level motion generation are separated. Right: our shared-attention and expert-routed design preserves multimodal interaction while routing motion-related computation to a dedicated Trajectory Expert.

**Figure 2.** Overall architecture of VECTOR-DRIVE. Visual observations, navigation conditions, language commands, ego states, and noisy action states are organized as an interleaved multimodal token sequence. Shared self-attention preserves cross-modal interaction, while semantic-aware FFN routing separates vision-language and trajectory-oriented computation.

**Figure 3.** [PITH_FULL_IMAGE:figures/full_fig_p004_3.png]

**Figure 4.** Qualitative closed-loop visualization. Two scenarios are shown with time stamps, speed, and language-guided responses. Top: In nighttime wet-road car following, the model maintains speed, decelerates for dense traffic and a nearby right-side vehicle, and accelerates after the interaction resolves. Bottom: At a stop-controlled right turn, it stops, creeps forward to check cross traffic, accelerates after a …

**Figure 5.** CoT and instruction visualization. The examples show scene-aware reasoning and concise driving instructions generated by VECTOR-DRIVE under different traffic conditions.

… to 87.67 DS and 70.45% SR. The proposed flow-matching planner achieves 88.91 DS and 71.82% SR, demonstrating that continuous vector-field decoding is more effective for refining noisy action tokens into executable trajectories under multim…

Original abstract

End-to-end autonomous driving requires models to understand traffic scenes, infer driving intent, and generate executable motion plans. Recent vision-language-action (VLA) models inherit semantic priors from large-scale vision-language pretraining, yet still face a coupling trade-off: fully shared backbones preserve multimodal interaction but may entangle language reasoning and trajectory prediction, whereas decou pled reasoning-action pipelines reduce task conflict but weaken semantic-motion coupling. We propose VECTOR-DRIVE, a tightly coupled VLA framework built on Qwen2.5-VL-3B. VECTOR-DRIVE keeps all tokens coupled through shared self attention and routes feed-forward computation according to token semantics. Vision and language tokens are processed by a Vision-Language Expert to preserve semantic priors, while target-point, ego-state, and noisy action tokens are routed to a Trajectory Expert for motion-specific computation. On the action-token pathway, a flow-matching planner refines noisy action tokens into future waypoints and speed profiles. This design couples semantic reasoning and motion planning within a single multimodal Transformer while separating task-specific FFN computation. On Bench2Drive, VECTOR-DRIVE achieves 88.91 Driving Score and outperforms representative end-to end and VLA-based baselines. Qualitative results and ablations further validate the benefits of shared attention, semantic-aware expert routing, progressive training, and flow-based action decoding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

VECTOR-DRIVE routes FFN layers to separate vision-language and trajectory experts inside a shared-attention transformer and adds flow-matching for actions, but the coupling may weaken once the experts diverge.

The main point is that this paper keeps a single multimodal transformer for driving but splits the feed-forward computation: vision and language tokens go through one expert to hold onto semantic priors, while target points, ego state, and action tokens go through a separate trajectory expert. Shared self-attention still runs across everything, and they decode actions with flow matching instead of direct regression. That combination is the concrete change from earlier decoupled VLA setups in the driving literature. They report 88.91 on Bench2Drive and say it beats the baselines they list, with ablations that touch the routing, the shared attention, and the training schedule. The flow-matching step is a practical addition that should produce smoother speed and waypoint outputs. Those pieces are clear and reproducible enough to check.

The soft spot is the coupling claim. Once the FFN experts are fully separate, the representations that enter the next attention layer come from different weights. Nothing in the setup forces those spaces to stay aligned, so the shared attention could end up mixing features that have drifted apart. The ablations probably compare full routing against no routing or against fully shared layers, but that still leaves open whether the cross-modal transfer actually happens or whether the gains come from other factors like the progressive training or the base model size. If the paper only shows end-to-end scores without measuring representation similarity or breaking down semantic versus motion errors, the tight-coupling story remains partly an assumption.

This is for groups already working on end-to-end driving stacks that want to keep language priors without full pipeline separation. Readers who care about mixture-of-experts adaptations in multimodal transformers will also see a direct example.

It is worth sending to referees because the architectural choice is explicit, the benchmark numbers are given, and the trade-off it targets is real even if the mechanism needs tighter validation.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes VECTOR-DRIVE, a tightly coupled vision-language-action (VLA) framework for end-to-end autonomous driving built on Qwen2.5-VL-3B. It maintains shared self-attention across all tokens while routing vision and language tokens through a dedicated Vision-Language Expert and target-point, ego-state, and noisy action tokens through a separate Trajectory Expert; a flow-matching planner then refines the action tokens into waypoints and speed profiles. The central empirical claim is that this architecture achieves an 88.91 Driving Score on Bench2Drive, outperforming representative end-to-end and VLA baselines, with supporting qualitative results and ablations on shared attention, expert routing, progressive training, and flow-based decoding.

Significance. If the reported gains are robust and the shared-attention-plus-expert-routing design demonstrably preserves semantic-motion coupling without representational drift, the work would meaningfully advance VLA-based autonomous driving by offering a concrete resolution to the entanglement-versus-decoupling trade-off. The 88.91 score on Bench2Drive and the use of flow matching for action generation constitute tangible strengths; the approach could influence subsequent multimodal driving models if the coupling mechanism is shown to be effective rather than assumed.

major comments (2)

[Abstract] The claim that 'shared self attention' still yields 'effective semantic-motion coupling' after routing vision-language tokens exclusively to one expert and trajectory tokens to another is load-bearing for the 'tightly coupled' framing. Once FFN parameters diverge, the hidden states fed into the shared key/query/value projections of the next attention layer are produced by distinct parameter sets; without explicit analysis (e.g., representation similarity metrics or cross-expert attention maps in the methods or experiments sections), it remains possible that shared attention operates over incompatible features, undermining the central architectural premise.
[Results and Experiments] The headline 88.91 Driving Score and outperformance over baselines are presented without error bars, exact baseline re-implementation details, or data-split information. These omissions make it difficult to determine whether the observed gains are attributable to the expert-routing design or to training variations, directly affecting the reliability of the empirical support for the coupling hypothesis.
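The representation-similarity check requested in the first major comment can be illustrated with a deliberately simple measure: the cosine similarity between the mean output vector of each expert's tokens. This is a rough proxy sketched under assumptions, not an analysis from the paper. The function names are hypothetical, and a stricter study would more likely use an established metric such as centered kernel alignment over many samples.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_vector(rows):
    """Element-wise mean over a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def expert_output_similarity(vl_outputs, traj_outputs):
    """Compare the average direction of the two experts' output representations.

    vl_outputs / traj_outputs: hidden vectors collected after the Vision-Language
    Expert and the Trajectory Expert at the same layer (hypothetical inputs).
    """
    return cosine_similarity(mean_vector(vl_outputs), mean_vector(traj_outputs))
```

Tracking this value layer by layer over training would indicate whether the two experts' outputs drift toward unrelated directions, which is the failure mode the comment raises.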

minor comments (2)

[Abstract] Abstract contains minor typographical issues: 'decou pled' should read 'decoupled' and 'end-to end' should read 'end-to-end'.
[Method] The description of the flow-matching planner on the action-token pathway would benefit from a brief equation or pseudocode reference to clarify how noisy actions are refined into waypoints and speed profiles.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the detailed and constructive review of our manuscript. We have carefully addressed each major comment below. Where revisions strengthen the presentation of our claims or improve reproducibility, we have incorporated them into the revised version.

Referee: [Abstract] The claim that 'shared self attention' still yields 'effective semantic-motion coupling' after routing vision-language tokens exclusively to one expert and trajectory tokens to another is load-bearing for the 'tightly coupled' framing. Once FFN parameters diverge, the hidden states fed into the shared key/query/value projections of the next attention layer are produced by distinct parameter sets; without explicit analysis (e.g., representation similarity metrics or cross-expert attention maps in the methods or experiments sections), it remains possible that shared attention operates over incompatible features, undermining the central architectural premise.

Authors: We thank the referee for this precise observation on the coupling mechanism. The shared self-attention layers allow every token to attend to all others regardless of expert routing, which is the primary means of maintaining semantic-motion interaction; the expert-specific FFNs then apply task-specialized transformations after the attention step. To directly address the concern about feature compatibility, the revised manuscript now includes (i) cosine similarity measurements between token representations immediately before and after expert routing and (ii) qualitative cross-expert attention maps in the Experiments section. These additions empirically support that the shared attention continues to operate over compatible features.
Revision: yes

Referee: [Results and Experiments] The headline 88.91 Driving Score and outperformance over baselines are presented without error bars, exact baseline re-implementation details, or data-split information. These omissions make it difficult to determine whether the observed gains are attributable to the expert-routing design or to training variations, directly affecting the reliability of the empirical support for the coupling hypothesis.

Authors: We agree that error bars, precise re-implementation details, and data-split information are necessary for assessing result reliability. In the revised manuscript we now report the main Bench2Drive score together with standard deviation across five independent runs using different random seeds, provide a dedicated subsection detailing baseline re-implementations (including exact hyper-parameters, training schedules, and any code-level adaptations), and explicitly state the Bench2Drive train/validation/test splits employed. These changes allow readers to better evaluate whether performance differences arise from the proposed architecture.
Revision: yes

Circularity Check

0 steps flagged

No circularity; empirical benchmark result independent of architecture description

full rationale

The paper proposes an architecture with shared self-attention and semantic-aware expert routing for FFN layers in a multimodal Transformer, then validates it via training and evaluation on the external Bench2Drive benchmark to obtain a reported Driving Score of 88.91. No equations, uniqueness theorems, fitted parameters renamed as predictions, or self-citations are presented in the provided text that would reduce the central performance claim to a definitional or constructional tautology. The result is obtained through standard empirical measurement rather than any self-referential derivation, rendering the chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The framework inherits semantic priors from the existing Qwen2.5-VL-3B model and introduces new routing experts and a flow-matching planner; no free parameters are explicitly fitted in the abstract description.

axioms (1)

domain assumption: Shared self-attention across all tokens preserves useful multimodal interactions between semantics and motion.
Invoked to justify keeping tokens coupled while routing only the FFN layers.

invented entities (2)

Vision-Language Expert · no independent evidence
purpose: Process vision and language tokens to preserve semantic priors from pretraining
New component introduced for task-specific FFN computation on vision/language tokens.
Trajectory Expert · no independent evidence
purpose: Process target-point, ego-state, and noisy action tokens for motion-specific computation
New component introduced for task-specific FFN computation on planning-related tokens.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear
Relation between the paper passage and the cited Recognition theorem.

VECTOR-DRIVE keeps all tokens coupled through shared self-attention and routes feed-forward computation according to token semantics. Vision and language tokens are processed by a Vision-Language Expert ... while target-point, ego-state, and noisy action tokens are routed to a Trajectory Expert
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear
Relation between the paper passage and the cited Recognition theorem.

shared self-attention ... semantic-aware expert routing ... flow-matching planner

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · 10 internal anchors

[1] A. Prakash, K. Chitta, and A. Geiger, “Multi-modal fusion transformer for end-to-end autonomous driving,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 7077–7087
[2] Y. Hu, J. Yang, L. Chen, K. Li, C. Sima, X. Zhu, S. Chai, S. Du, T. Lin, W. Wang et al., “Planning-oriented autonomous driving,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 17853–17862
[3] X. Weng, B. Ivanovic, Y. Wang, Y. Wang, and M. Pavone, “Para-drive: Parallelized architecture for real-time autonomous driving,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 15449–15458
[4] B. Liao, S. Chen, H. Yin, B. Jiang, C. Wang, S. Yan, X. Zhang, X. Li, Y. Zhang, Q. Zhang et al., “Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 12037–12047
[5] Y. Li, L. Fan, J. He, Y. Wang, Y. Chen, Z. Zhang, and T. Tan, “Enhancing end-to-end autonomous driving with latent world model,” arXiv preprint arXiv:2406.08481, 2024
[6] Z. Li, K. Li, S. Wang, S. Lan, Z. Yu, Y. Ji, Z. Li, Z. Zhu, J. Kautz, Z. Wu et al., “Hydra-mdp: End-to-end multimodal planning with multi-target hydra-distillation,” arXiv preprint arXiv:2406.06978, 2024
[7] P. Wu, X. Jia, L. Chen, J. Yan, H. Li, and Y. Qiao, “Trajectory-guided control prediction for end-to-end autonomous driving: A simple yet strong baseline,” in Advances in Neural Information Processing Systems, vol. 35, 2022
[8] X. Jia, P. Wu, L. Chen, J. Xie, C. He, J. Yan, and H. Li, “Think twice before driving: Towards scalable decoders for end-to-end autonomous driving,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023
[9] X. Jia, Y. Gao, L. Chen, J. Yan, P. L. Liu, and H. Li, “Driveadapter: Breaking the coupling barrier of perception and planning in end-to-end autonomous driving,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023
[10] B. Jiang, S. Chen, Q. Xu, B. Liao, J. Chen, H. Zhou, Q. Zhang, W. Liu, C. Huang, and X. Wang, “Vad: Vectorized scene representation for efficient autonomous driving,” ICCV, 2023
[11] X. Jia, J. You, Z. Zhang, and J. Yan, “Drivetransformer: Unified transformer for scalable end-to-end autonomous driving,” in International Conference on Learning Representations, 2025
[12] X. Jia, Z. Yang, Q. Li, Z. Zhang, and J. Yan, “Bench2drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving,” Advances in Neural Information Processing Systems, vol. 37, pp. 819–844, 2024
[13] C. Sima, K. Renz, K. Chitta, L. Chen, H. Zhang, C. Xie, J. Beißwenger, P. Luo, A. Geiger, and H. Li, “Drivelm: Driving with graph visual question answering,” in European conference on computer vision. Springer, 2024, pp. 256–274
[14] B. Jiang, S. Chen, B. Liao, X. Zhang, W. Yin, Q. Zhang, C. Huang, W. Liu, and X. Wang, “Senna: Bridging large vision-language models and end-to-end autonomous driving,” arXiv preprint arXiv:2410.22313, 2024
[15] B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid et al., “Rt-2: Vision-language-action models transfer web knowledge to robotic control,” in Conference on Robot Learning. PMLR, 2023, pp. 2165–2183
[16] X. Tian, J. Gu, B. Li, Y. Liu, Y. Wang, Z. Zhao, K. Zhan, P. Jia, X. Lang, and H. Zhao, “Drivevlm: The convergence of autonomous driving and large vision-language models,” arXiv preprint arXiv:2402.12289, 2024
[17] K. Renz, L. Chen, E. Arani, and O. Sinavski, “Simlingo: Vision-only closed-loop autonomous driving with language-action alignment,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 11993–12003
[18] H. Fu, D. Zhang, Z. Zhao, J. Cui, D. Liang, C. Zhang, D. Zhang, H. Xie, B. Wang, and X. Bai, “Orion: A holistic end-to-end autonomous driving framework by vision-language instructed action generation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 24823–24834
[19] Z. Zhou, T. Cai, S. Z. Zhao, Y. Zhang, Z. Huang, B. Zhou, and J. Ma, “Autovla: A vision-language-action model for end-to-end autonomous driving with adaptive reasoning and reinforcement fine-tuning,” arXiv preprint arXiv:2506.13757, 2025
[20] T. Wang, E. Xie, R. Chu, Z. Li, and P. Luo, “Drivecot: Integrating chain-of-thought reasoning with end-to-end driving,” arXiv preprint arXiv:2403.16996, 2024
[21] R. Zhao, Q. Yuan, J. Li, H. Hu, Y. Li, Z. Gao, and F. Gao, “Sce2drivex: A generalized mllm framework for scene-to-drive learning,” IEEE Robotics and Automation Letters, 2025
[22] T. Yu, S. Kumar, A. Gupta, S. Levine, K. Hausman, and C. Finn, “Gradient surgery for multi-task learning,” Advances in neural information processing systems, vol. 33, pp. 5824–5836, 2020
[23] C. Ding, Z. Lu, S. Wang, R. Cheng, and V. N. Boddeti, “Mitigating task interference in multi-task learning via explicit task routing with non-learnable primitives,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 7756–7765
[24] K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter et al., “$\pi_0$: A vision-language-action flow model for general robot control,” arXiv preprint arXiv:2410.24164, 2024
[25] Y. Li, S. Shang, W. Liu, B. Zhan, H. Wang, Y. Wang, Y. Chen, X. Wang, Y. An, C. Tang, L. Hou, L. Fan, and Z. Zhang, “Drivevla-w0: World models amplify data scaling law in autonomous driving,” arXiv preprint arXiv:2510.12796, 2025
[26] Z. Yang, Y. Chai, X. Jia, Q. Li, Y. Shao, X. Zhu, H. Su, and J. Yan, “Drivemoe: Mixture-of-experts for vision-language-action model in end-to-end autonomous driving,” arXiv preprint arXiv:2505.16278, 2025
[27] Y. Li, M. Tian, D. Zhu, J. Zhu, Z. Lin, Z. Xiong, and X. Zhao, “Drive-r1: Bridging reasoning and planning in vlms for autonomous driving with reinforcement learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 8, 2026, pp. 6708–6716
[28] C. Xie, C. Sima, T. Li, B. Sun, J. Wu, Z. Hao, and H. Li, “Flare: Learning future-aware latent representations from vision-language models for autonomous driving,” arXiv preprint arXiv:2601.05611, 2026
[29] Y. Li, K. Xiong, X. Guo, F. Li, S. Yan, G. Xu, L. Zhou, L. Chen, H. Sun, B. Wang, K. Ma, G. Chen, H. Ye, W. Liu, and X. Wang, “Recogdrive: A reinforced cognitive framework for end-to-end autonomous driving,” arXiv preprint arXiv:2506.08052, 2025
[30] J.-T. Zhai, Z. Feng, J. Du, Y. Mao, J.-J. Liu, Z. Tan, Y. Zhang, X. Ye, and J. Wang, “Rethinking the open-loop evaluation of end-to-end autonomous driving in nuscenes,” arXiv preprint arXiv:2305.10430, 2023

[1] [1]

Multi-modal fusion transformer for end-to-end autonomous driving,

A. Prakash, K. Chitta, and A. Geiger, “Multi-modal fusion transformer for end-to-end autonomous driving,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 7077– 7087

work page 2021

[2] [2]

Planning-oriented autonomous driving,

Y . Hu, J. Yang, L. Chen, K. Li, C. Sima, X. Zhu, S. Chai, S. Du, T. Lin, W. Wanget al., “Planning-oriented autonomous driving,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 17 853–17 862

work page 2023

[3] [3]

Para- drive: Parallelized architecture for real-time autonomous driving,

X. Weng, B. Ivanovic, Y . Wang, Y . Wang, and M. Pavone, “Para- drive: Parallelized architecture for real-time autonomous driving,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 15 449–15 458

work page 2024

[4] [4]

Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving,

B. Liao, S. Chen, H. Yin, B. Jiang, C. Wang, S. Yan, X. Zhang, X. Li, Y . Zhang, Q. Zhanget al., “Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 12 037–12 047

work page 2025

[5] Y. Li, L. Fan, J. He, Y. Wang, Y. Chen, Z. Zhang, and T. Tan, “Enhancing end-to-end autonomous driving with latent world model,” arXiv preprint arXiv:2406.08481, 2024

[6] Z. Li, K. Li, S. Wang, S. Lan, Z. Yu, Y. Ji, Z. Li, Z. Zhu, J. Kautz, Z. Wu et al., “Hydra-mdp: End-to-end multimodal planning with multi-target hydra-distillation,” arXiv preprint arXiv:2406.06978, 2024

[7] P. Wu, X. Jia, L. Chen, J. Yan, H. Li, and Y. Qiao, “Trajectory-guided control prediction for end-to-end autonomous driving: A simple yet strong baseline,” in Advances in Neural Information Processing Systems, vol. 35, 2022

[8] X. Jia, P. Wu, L. Chen, J. Xie, C. He, J. Yan, and H. Li, “Think twice before driving: Towards scalable decoders for end-to-end autonomous driving,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

[9] X. Jia, Y. Gao, L. Chen, J. Yan, P. L. Liu, and H. Li, “Driveadapter: Breaking the coupling barrier of perception and planning in end-to-end autonomous driving,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

[10] B. Jiang, S. Chen, Q. Xu, B. Liao, J. Chen, H. Zhou, Q. Zhang, W. Liu, C. Huang, and X. Wang, “Vad: Vectorized scene representation for efficient autonomous driving,” ICCV, 2023

[11] X. Jia, J. You, Z. Zhang, and J. Yan, “Drivetransformer: Unified transformer for scalable end-to-end autonomous driving,” in International Conference on Learning Representations, 2025

[12] X. Jia, Z. Yang, Q. Li, Z. Zhang, and J. Yan, “Bench2drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving,” Advances in Neural Information Processing Systems, vol. 37, pp. 819–844, 2024

[13] C. Sima, K. Renz, K. Chitta, L. Chen, H. Zhang, C. Xie, J. Beißwenger, P. Luo, A. Geiger, and H. Li, “Drivelm: Driving with graph visual question answering,” in European conference on computer vision. Springer, 2024, pp. 256–274

[14] B. Jiang, S. Chen, B. Liao, X. Zhang, W. Yin, Q. Zhang, C. Huang, W. Liu, and X. Wang, “Senna: Bridging large vision-language models and end-to-end autonomous driving,” arXiv preprint arXiv:2410.22313, 2024

[15] B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid et al., “Rt-2: Vision-language-action models transfer web knowledge to robotic control,” in Conference on Robot Learning. PMLR, 2023, pp. 2165–2183

[16] X. Tian, J. Gu, B. Li, Y. Liu, Y. Wang, Z. Zhao, K. Zhan, P. Jia, X. Lang, and H. Zhao, “Drivevlm: The convergence of autonomous driving and large vision-language models,” arXiv preprint arXiv:2402.12289, 2024

[17] K. Renz, L. Chen, E. Arani, and O. Sinavski, “Simlingo: Vision-only closed-loop autonomous driving with language-action alignment,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 11 993–12 003

[18] H. Fu, D. Zhang, Z. Zhao, J. Cui, D. Liang, C. Zhang, D. Zhang, H. Xie, B. Wang, and X. Bai, “Orion: A holistic end-to-end autonomous driving framework by vision-language instructed action generation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 24 823–24 834

[19] Z. Zhou, T. Cai, S. Z. Zhao, Y. Zhang, Z. Huang, B. Zhou, and J. Ma, “Autovla: A vision-language-action model for end-to-end autonomous driving with adaptive reasoning and reinforcement fine-tuning,” arXiv preprint arXiv:2506.13757, 2025

[20] T. Wang, E. Xie, R. Chu, Z. Li, and P. Luo, “Drivecot: Integrating chain-of-thought reasoning with end-to-end driving,” arXiv preprint arXiv:2403.16996, 2024

[21] R. Zhao, Q. Yuan, J. Li, H. Hu, Y. Li, Z. Gao, and F. Gao, “Sce2drivex: A generalized mllm framework for scene-to-drive learning,” IEEE Robotics and Automation Letters, 2025

[22] T. Yu, S. Kumar, A. Gupta, S. Levine, K. Hausman, and C. Finn, “Gradient surgery for multi-task learning,” Advances in neural information processing systems, vol. 33, pp. 5824–5836, 2020

[23] C. Ding, Z. Lu, S. Wang, R. Cheng, and V. N. Boddeti, “Mitigating task interference in multi-task learning via explicit task routing with non-learnable primitives,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 7756–7765

[24] K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter et al., “π0: A vision-language-action flow model for general robot control,” arXiv preprint arXiv:2410.24164, 2024

[25] Y. Li, S. Shang, W. Liu, B. Zhan, H. Wang, Y. Wang, Y. Chen, X. Wang, Y. An, C. Tang, L. Hou, L. Fan, and Z. Zhang, “Drivevla-w0: World models amplify data scaling law in autonomous driving,” arXiv preprint arXiv:2510.12796, 2025

[26] Z. Yang, Y. Chai, X. Jia, Q. Li, Y. Shao, X. Zhu, H. Su, and J. Yan, “Drivemoe: Mixture-of-experts for vision-language-action model in end-to-end autonomous driving,” arXiv preprint arXiv:2505.16278, 2025

[27] Y. Li, M. Tian, D. Zhu, J. Zhu, Z. Lin, Z. Xiong, and X. Zhao, “Drive-r1: Bridging reasoning and planning in vlms for autonomous driving with reinforcement learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 8, 2026, pp. 6708–6716

[28] C. Xie, C. Sima, T. Li, B. Sun, J. Wu, Z. Hao, and H. Li, “Flare: Learning future-aware latent representations from vision-language models for autonomous driving,” arXiv preprint arXiv:2601.05611, 2026

[29] Y. Li, K. Xiong, X. Guo, F. Li, S. Yan, G. Xu, L. Zhou, L. Chen, H. Sun, B. Wang, K. Ma, G. Chen, H. Ye, W. Liu, and X. Wang, “Recogdrive: A reinforced cognitive framework for end-to-end autonomous driving,” arXiv preprint arXiv:2506.08052, 2025

[30] J.-T. Zhai, Z. Feng, J. Du, Y. Mao, J.-J. Liu, Z. Tan, Y. Zhang, X. Ye, and J. Wang, “Rethinking the open-loop evaluation of end-to-end autonomous driving in nuscenes,” arXiv preprint arXiv:2305.10430, 2023