VECTOR-Drive: Tightly Coupled Vision-Language and Trajectory Expert Routing for End-to-End Autonomous Driving
Pith reviewed 2026-05-20 23:09 UTC · model grok-4.3
The pith
Routing feed-forward computation to separate vision-language and trajectory experts inside a shared-attention Transformer couples semantic reasoning with motion planning for end-to-end driving.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose VECTOR-DRIVE, a tightly coupled VLA framework built on Qwen2.5-VL-3B. VECTOR-DRIVE keeps all tokens coupled through shared self-attention and routes feed-forward computation according to token semantics. Vision and language tokens are processed by a Vision-Language Expert to preserve semantic priors, while target-point, ego-state, and noisy action tokens are routed to a Trajectory Expert for motion-specific computation. On the action-token pathway, a flow-matching planner refines noisy action tokens into future waypoints and speed profiles. This design couples semantic reasoning and motion planning within a single multimodal Transformer while separating task-specific FFN computation.
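The flow-matching step in the claim above can be illustrated with a minimal sketch. It assumes the common linear-path convention (noise at t = 0, clean action at t = 1) and a fixed-step Euler solver; the paper's actual convention, solver, step count, and conditioning are not given on this page. `oracle_velocity` is a hypothetical stand-in for the learned velocity prediction, and the flat list of floats is an assumed encoding of waypoints and speeds.

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to clean action x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Regression target for a velocity network under the straight-path assumption."""
    return [b - a for a, b in zip(x0, x1)]

def euler_sample(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so the oracle never divides by zero
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

def oracle_velocity(target):
    """Hypothetical stand-in for a trained network: exact field for one known target."""
    def velocity_fn(x, t):
        return [(b - xi) / (1.0 - t) for xi, b in zip(x, target)]
    return velocity_fn

def sample_noise(dim, seed=0):
    """Gaussian noise that plays the role of the initial noisy action tokens."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]
```

With the oracle field, `euler_sample(oracle_velocity(target), sample_noise(len(target)))` walks the noise along a straight line and lands on `target` up to floating-point error. In a real planner the oracle would be replaced by a network conditioned on the scene tokens, which this sketch does not model.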
What carries the argument
Semantic-aware expert routing inside a multimodal Transformer: shared self-attention layers process vision, language, and trajectory tokens together, after which feed-forward layers are specialized into a Vision-Language Expert and a Trajectory Expert according to token type.
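The routing pattern described above can be sketched in a few lines of plain Python. This is an illustrative toy, not the authors' implementation: it uses a single attention head, omits layer normalization, masking, and positional encoding, and the token-type names, `make_params`, and `routed_block` are hypothetical. What it does show is the split the text describes: one set of attention projections for every token, then a per-token choice between two feed-forward experts based on a fixed token-type rule.

```python
import math
import random

# Assumed token-type labels; the paper's actual tokenization is not shown on this page.
VL_TYPES = {"vision", "language"}
TRAJ_TYPES = {"target_point", "ego_state", "action"}

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

def make_params(dim, hidden, seed=0):
    """Random toy weights: shared Q/K/V plus two separate two-layer FFN experts."""
    rng = random.Random(seed)
    def expert():
        return (rand_matrix(hidden, dim, rng), rand_matrix(dim, hidden, rng))
    return {
        "Wq": rand_matrix(dim, dim, rng),
        "Wk": rand_matrix(dim, dim, rng),
        "Wv": rand_matrix(dim, dim, rng),
        "vl_expert": expert(),
        "traj_expert": expert(),
    }

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def shared_self_attention(xs, params):
    """Single-head attention; every token attends to every token, whatever its type."""
    qs = [matvec(params["Wq"], x) for x in xs]
    ks = [matvec(params["Wk"], x) for x in xs]
    vs = [matvec(params["Wv"], x) for x in xs]
    scale = 1.0 / math.sqrt(len(xs[0]))
    out = []
    for q in qs:
        w = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in ks])
        out.append([sum(wi * v[d] for wi, v in zip(w, vs)) for d in range(len(vs[0]))])
    return out

def ffn(x, expert):
    W1, W2 = expert
    hidden = [max(0.0, h) for h in matvec(W1, x)]  # ReLU as a simple stand-in
    return matvec(W2, hidden)

def routed_block(xs, token_types, params):
    """Shared attention with a residual, then a token-type-routed FFN with a residual."""
    attn = shared_self_attention(xs, params)
    hs = [[a + b for a, b in zip(x, o)] for x, o in zip(xs, attn)]
    out = []
    for h, t in zip(hs, token_types):
        if t in VL_TYPES:
            expert = params["vl_expert"]
        elif t in TRAJ_TYPES:
            expert = params["traj_expert"]
        else:
            raise ValueError(f"unknown token type: {t}")
        out.append([a + b for a, b in zip(h, ffn(h, expert))])
    return out
```

Within one such block, swapping the Trajectory Expert weights changes only the trajectory-token outputs, while perturbing a trajectory token's input still shifts the vision and language outputs through the shared attention. Replacing both experts with a single shared FFN gives the fully shared baseline that the ablation discussion refers to.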
If this is right
- Shared attention plus semantic routing produces higher driving scores than either fully shared or fully decoupled baselines.
- Ablations show that removing either the expert split or the shared attention hurts performance.
- Progressive training and flow-matching action decoding each add measurable gains on the same benchmark.
- The approach keeps pretrained semantic knowledge usable for planning without requiring separate reasoning and control stages.
Where Pith is reading between the lines
- The same token-type routing pattern could be tested on other integrated perception-planning tasks such as robotic manipulation.
- If the experts can be made even more specialized, the model might handle longer-horizon or more crowded scenes with less interference.
- Replacing the fixed routing rule with a learned router that decides per token on the fly would be a direct next experiment.
- Scaling the same architecture to larger base vision-language models might preserve the coupling benefit while increasing overall capacity.
Load-bearing premise
That sending different token types to separate feed-forward experts reduces conflict between language reasoning and trajectory prediction while the shared self-attention layers still maintain useful interaction between semantics and motion.
What would settle it
An ablation on Bench2Drive that replaces the two specialized experts with a single shared feed-forward network across all tokens and measures whether driving score falls below the reported result.
Original abstract
End-to-end autonomous driving requires models to understand traffic scenes, infer driving intent, and generate executable motion plans. Recent vision-language-action (VLA) models inherit semantic priors from large-scale vision-language pretraining, yet still face a coupling trade-off: fully shared backbones preserve multimodal interaction but may entangle language reasoning and trajectory prediction, whereas decou pled reasoning-action pipelines reduce task conflict but weaken semantic-motion coupling. We propose VECTOR-DRIVE, a tightly coupled VLA framework built on Qwen2.5-VL-3B. VECTOR-DRIVE keeps all tokens coupled through shared self attention and routes feed-forward computation according to token semantics. Vision and language tokens are processed by a Vision-Language Expert to preserve semantic priors, while target-point, ego-state, and noisy action tokens are routed to a Trajectory Expert for motion-specific computation. On the action-token pathway, a flow-matching planner refines noisy action tokens into future waypoints and speed profiles. This design couples semantic reasoning and motion planning within a single multimodal Transformer while separating task-specific FFN computation. On Bench2Drive, VECTOR-DRIVE achieves 88.91 Driving Score and outperforms representative end-to end and VLA-based baselines. Qualitative results and ablations further validate the benefits of shared attention, semantic-aware expert routing, progressive training, and flow-based action de coding.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes VECTOR-DRIVE, a tightly coupled vision-language-action (VLA) framework for end-to-end autonomous driving built on Qwen2.5-VL-3B. It maintains shared self-attention across all tokens while routing vision and language tokens through a dedicated Vision-Language Expert and target-point, ego-state, and noisy action tokens through a separate Trajectory Expert; a flow-matching planner then refines the action tokens into waypoints and speed profiles. The central empirical claim is that this architecture achieves an 88.91 Driving Score on Bench2Drive, outperforming representative end-to-end and VLA baselines, with supporting qualitative results and ablations on shared attention, expert routing, progressive training, and flow-based decoding.
Significance. If the reported gains are robust and the shared-attention-plus-expert-routing design demonstrably preserves semantic-motion coupling without representational drift, the work would meaningfully advance VLA-based autonomous driving by offering a concrete resolution to the entanglement-versus-decoupling trade-off. The 88.91 score on Bench2Drive and the use of flow matching for action generation constitute tangible strengths; the approach could influence subsequent multimodal driving models if the coupling mechanism is shown to be effective rather than assumed.
major comments (2)
- [Abstract] Abstract: The claim that 'shared self attention' still yields 'effective semantic-motion coupling' after routing vision-language tokens exclusively to one expert and trajectory tokens to another is load-bearing for the 'tightly coupled' framing. Once FFN parameters diverge, the key/query/value projections entering each attention head originate from distinct parameter sets; without explicit analysis (e.g., representation similarity metrics or cross-expert attention maps in the methods or experiments sections), it remains possible that shared attention operates over incompatible features, undermining the central architectural premise.
- [Results and Experiments] Results and Experiments: The headline 88.91 Driving Score and outperformance over baselines are presented without error bars, exact baseline re-implementation details, or data-split information. These omissions make it difficult to determine whether the observed gains are attributable to the expert-routing design or to training variations, directly affecting the reliability of the empirical support for the coupling hypothesis.
minor comments (2)
- [Abstract] Abstract contains minor typographical issues: 'decou pled' should read 'decoupled' and 'end-to end' should read 'end-to-end'.
- [Method] The description of the flow-matching planner on the action-token pathway would benefit from a brief equation or pseudocode reference to clarify how noisy actions are refined into waypoints and speed profiles.
Simulated Author's Rebuttal
Thank you for the detailed and constructive review of our manuscript. We have carefully addressed each major comment below. Where revisions strengthen the presentation of our claims or improve reproducibility, we have incorporated them into the revised version.
- Referee: [Abstract] Abstract: The claim that 'shared self attention' still yields 'effective semantic-motion coupling' after routing vision-language tokens exclusively to one expert and trajectory tokens to another is load-bearing for the 'tightly coupled' framing. Once FFN parameters diverge, the key/query/value projections entering each attention head originate from distinct parameter sets; without explicit analysis (e.g., representation similarity metrics or cross-expert attention maps in the methods or experiments sections), it remains possible that shared attention operates over incompatible features, undermining the central architectural premise.
  Authors: We thank the referee for this precise observation on the coupling mechanism. The shared self-attention layers allow every token to attend to all others regardless of expert routing, which is the primary means of maintaining semantic-motion interaction; the expert-specific FFNs then apply task-specialized transformations after the attention step. To directly address the concern about feature compatibility, the revised manuscript now includes (i) cosine similarity measurements between token representations immediately before and after expert routing and (ii) qualitative cross-expert attention maps in the Experiments section. These additions empirically support that the shared attention continues to operate over compatible features.
  Revision: yes
- Referee: [Results and Experiments] Results and Experiments: The headline 88.91 Driving Score and outperformance over baselines are presented without error bars, exact baseline re-implementation details, or data-split information. These omissions make it difficult to determine whether the observed gains are attributable to the expert-routing design or to training variations, directly affecting the reliability of the empirical support for the coupling hypothesis.
  Authors: We agree that error bars, precise re-implementation details, and data-split information are necessary for assessing result reliability. In the revised manuscript we now report the main Bench2Drive score together with standard deviation across five independent runs using different random seeds, provide a dedicated subsection detailing baseline re-implementations (including exact hyper-parameters, training schedules, and any code-level adaptations), and explicitly state the Bench2Drive train/validation/test splits employed. These changes allow readers to better evaluate whether performance differences arise from the proposed architecture.
  Revision: yes
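The two measurements promised in the responses above, cosine similarity between token representations and a standard deviation over repeated runs, can be sketched with the Python standard library. The function names are hypothetical, and the rebuttal does not say which layers are compared or how the similarities are aggregated, so the per-token averaging here is an assumption.

```python
import math
import statistics

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def mean_token_similarity(before, after):
    """Average per-token cosine similarity between two aligned lists of token vectors."""
    if len(before) != len(after):
        raise ValueError("token lists must be aligned")
    return statistics.fmean([cosine_similarity(b, a) for b, a in zip(before, after)])

def summarize_runs(scores):
    """Mean and sample standard deviation (n - 1 denominator) over independent runs."""
    return statistics.fmean(scores), statistics.stdev(scores)
```

`mean_token_similarity` would take the hidden states of the same tokens at two points in the network, and `summarize_runs` would take one driving score per random seed. A high cosine similarity alone does not prove that two experts share a compatible feature space, so it is best read alongside the attention maps the authors mention.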
Circularity Check
No circularity; empirical benchmark result independent of architecture description
The paper proposes an architecture with shared self-attention and semantic-aware expert routing for FFN layers in a multimodal Transformer, then validates it via training and evaluation on the external Bench2Drive benchmark to obtain a reported Driving Score of 88.91. No equations, uniqueness theorems, fitted parameters renamed as predictions, or self-citations are presented in the provided text that would reduce the central performance claim to a definitional or constructional tautology. The result is obtained through standard empirical measurement rather than any self-referential derivation, rendering the chain self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Shared self-attention across all tokens preserves useful multimodal interactions between semantics and motion.
invented entities (2)
- Vision-Language Expert: no independent evidence
- Trajectory Expert: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: VECTOR-DRIVE keeps all tokens coupled through shared self-attention and routes feed-forward computation according to token semantics. Vision and language tokens are processed by a Vision-Language Expert ... while target-point, ego-state, and noisy action tokens are routed to a Trajectory Expert
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: shared self-attention ... semantic-aware expert routing ... flow-matching planner
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.