arxiv: 2505.16278 · v2 · pith:OCUAVJ6Cnew · submitted 2025-05-22 · 💻 cs.CV · cs.AI · cs.RO

DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving

Pith reviewed 2026-05-22 14:27 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.RO
keywords autonomous driving, mixture of experts, end-to-end learning, vision-language-action, multi-view sensing, behavior specialization, closed-loop evaluation

The pith

Mixture-of-experts routers for vision and action let end-to-end driving models handle rare maneuvers without averaging behaviors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proposes DriveMoE, which adds two mixture-of-experts modules to a vision-language-action baseline for autonomous driving. A vision MoE uses a router to pick the most relevant cameras for the current scene, much like a driver focuses on key visual cues instead of processing every view at once. An action MoE uses a second router to activate specialized expert networks for distinct driving skills such as sharp turns. The design aims to avoid the mode-averaging problem that occurs when a single model tries to master every possible behavior at once. Closed-loop tests on the Bench2Drive benchmark show the combined system reaching state-of-the-art results.

Core claim

DriveMoE integrates two modules into the Drive-π₀ vision-language-action baseline: a Scene-Specialized Vision MoE, whose router selects relevant cameras according to driving context, and a Skill-Specialized Action MoE, whose router activates behavior-specific expert modules. The paper claims that this explicit specialization enables robust handling of diverse and complex scenarios, including rare aggressive maneuvers, and produces state-of-the-art closed-loop performance on Bench2Drive.

What carries the argument

Dual-router mixture-of-experts system: one router dynamically chooses which cameras to attend to, the other activates driving-behavior experts.
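The camera-selection half of this mechanism can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the camera names, the logits, and the choice to keep the front view always on are all assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def select_cameras(router_logits, k, always_on=(0,)):
    """Pick the k highest-probability camera views on top of an always-on set.

    router_logits: one score per camera, as a hypothetical router might emit.
    always_on: indices kept regardless of score (assumed here: the front view).
    Returns (sorted selected indices, router probabilities).
    """
    probs = softmax(router_logits)
    candidates = [i for i in range(len(probs)) if i not in always_on]
    ranked = sorted(candidates, key=lambda i: probs[i], reverse=True)
    selected = sorted(set(always_on) | set(ranked[:k]))
    return selected, probs

# Hypothetical six-camera rig and made-up router scores.
CAMERAS = ["front", "front_left", "front_right", "back", "back_left", "back_right"]
logits = [2.0, 0.1, 1.5, -1.0, -0.5, 0.3]
selected, probs = select_cameras(logits, k=1)
print([CAMERAS[i] for i in selected])  # ['front', 'front_right']
```

Only the selected views would be passed on to the vision encoder, which is where the claimed token savings would come from.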

If this is right

  • Dynamic camera selection reduces the need to process every sensor view at every moment.
  • Behavior-specific action experts prevent dilution of rare but safety-critical maneuvers.
  • The two MoE layers together produce higher closed-loop success rates than a single shared network.
  • Explicit specialization supports scaling to wider ranges of driving contexts without retraining the entire model.
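The mode-averaging concern behind the second bullet can be seen in a toy example. This is a sketch with assumed numbers and hypothetical function names, not the paper's action head: a single mean-seeking regressor blends two opposite steering modes into a command that matches neither, while a routed policy commits to one expert.

```python
def average_policy(demonstrations):
    """A single head trained with squared error tends toward the mean target."""
    return sum(demonstrations) / len(demonstrations)

def routed_policy(expert_outputs, router_weights, top_k=1):
    """Keep the top_k experts by router weight, renormalize, and blend their outputs."""
    order = sorted(range(len(router_weights)),
                   key=lambda i: router_weights[i], reverse=True)
    kept = order[:top_k]
    norm = sum(router_weights[i] for i in kept)
    return sum(expert_outputs[i] * (router_weights[i] / norm) for i in kept)

# Toy steering targets in [-1, 1]: half the demonstrations swerve left, half right.
demos = [-0.8, -0.8, 0.8, 0.8]
print(average_policy(demos))  # 0.0: drives straight, matching neither mode

# Two hypothetical experts, one per mode; the router favors the left-swerve expert.
experts = [-0.8, 0.8]
weights = [0.9, 0.1]
print(routed_policy(experts, weights))
```

With `top_k=1` the output follows the favored expert alone; raising `top_k` to the number of experts with equal weights reproduces the averaged behavior.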

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The router-based selection pattern could be reused in other multi-camera robotic control settings.
  • Online fine-tuning of the routers might be needed to maintain performance as real-world conditions drift.
  • Pairing this architecture with larger language models could add higher-level planning on top of the specialized low-level experts.

Load-bearing premise

Routers trained on the training distribution will correctly identify relevant cameras and driving behaviors when faced with diverse or unseen conditions.

What would settle it

A closed-loop test in which the model repeatedly selects unhelpful cameras or activates the wrong action experts on novel scenarios such as night driving or sudden weather changes, resulting in collisions or off-road events, would falsify the claim.

Figures

Figures reproduced from arXiv: 2505.16278 by Haisheng Su, Junchi Yan, Qifeng Li, Xiaosong Jia, Xuekai Zhu, Yilin Chai, Yuqian Shao, Zhenjie Yang.

Figure 1
Figure 1: Comparison of Different Vision and Action Modeling Strategies in VLA-based End-to-End Driving. (a.1) Vanilla visual token encoding [14] processes all surround-view images through a vision tower, leading to token redundancy and increased computational cost. (a.2) Query-based token extraction [20] (e.g., Q-former [21]) selects a subset of visual tokens from each image, but loses spatial structure and require…
Figure 2
Figure 2: Framework of DriveMoE. Our proposed framework comprises two main Mixture-of-Experts (MoE) modules tailored for end-to-end autonomous driving. The Scene-Specialized Vision MoE dynamically selects relevant camera views based on real-time driving contexts, efficiently reducing visual redundancy. Subsequently, selected views are fused into a unified representation by projector layers. The Skill-Specialized Act…
Figure 3
Figure 3: The Scene-Specialized Vision Mixture-of-Experts. [Diagram labels recovered from the figure: shared experts E_share 1 to M, non-shared experts E_non-share 1 to N, attention, normalize, camera router, Top-K selection, Skill-Specialized Action MoE, transformer decoder layer, large language model, and skill supervision for merging, give way, and overtaking.]
Figure 4
Figure 4: The Skill-Specialized Action Mixture-of-Experts.
[Excerpt from the surrounding paper text:] … be integrated effectively. This dynamic attention strategy significantly reduces the number of visual tokens processed per timestep, greatly improving computational efficiency and decision accuracy. Formally, we define the image from camera view $v$ at timestep $t$ as $I_t^v$, where $v \in \{1, 2, \ldots, N\}$ for $N$ available camera views. In particular, the front-view…
Original abstract

End-to-end autonomous driving (E2E-AD) demands effective processing of multi-view sensory data and robust handling of diverse and complex driving scenarios, particularly rare maneuvers such as aggressive turns. Recent success of Mixture-of-Experts (MoE) architecture in Large Language Models (LLMs) demonstrates that specialization of parameters enables strong scalability. In this work, we propose DriveMoE, a novel MoE-based E2E-AD framework, with a Scene-Specialized Vision MoE and a Skill-Specialized Action MoE. DriveMoE is built upon our $\pi_0$ Vision-Language-Action (VLA) baseline (originally from the embodied AI field), called Drive-$\pi_0$. Specifically, we add Vision MoE to Drive-$\pi_0$ by training a router to select relevant cameras according to the driving context dynamically. This design mirrors human driving cognition, where drivers selectively attend to crucial visual cues rather than exhaustively processing all visual information. In addition, we add Action MoE by training another router to activate specialized expert modules for different driving behaviors. Through explicit behavioral specialization, DriveMoE is able to handle diverse scenarios without suffering from modes averaging like existing models. In Bench2Drive closed-loop evaluation experiments, DriveMoE achieves state-of-the-art (SOTA) performance, demonstrating the effectiveness of combining vision and action MoE in autonomous driving tasks. We will release our code and models of DriveMoE and Drive-$\pi_0$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes DriveMoE, a Mixture-of-Experts extension to the π₀ Vision-Language-Action baseline (Drive-π₀) for end-to-end autonomous driving. It introduces a Scene-Specialized Vision MoE that trains a router to dynamically select relevant camera views and a Skill-Specialized Action MoE that activates behavior-specific expert modules. The central empirical claim is that this combination yields state-of-the-art closed-loop performance on the Bench2Drive benchmark by enabling specialization without mode averaging, while mirroring human selective attention and behavioral handling of rare maneuvers.

Significance. If the empirical results hold, the work would provide evidence that MoE architectures can improve scalability and robustness in multi-view E2E-AD by avoiding parameter averaging across diverse scenarios. The planned release of code and models would further strengthen reproducibility for the community.

major comments (2)
  1. [Abstract and Experiments/Results section] The abstract and results description assert SOTA performance on Bench2Drive closed-loop evaluation but supply no quantitative metrics (e.g., success rate, collision rate, or route completion), baseline comparisons (including against Drive-π₀), or ablation results isolating the Vision MoE and Action MoE contributions. This absence prevents verification of the central claim that the routers drive the gains rather than the base VLA model.
  2. [Experiments and Discussion] No analysis is provided on router generalization or stability when visual inputs or required behaviors fall outside the training distribution (e.g., rare aggressive turns or unseen environments). Without such tests, it remains unclear whether the reported improvements stem from true specialization or from overfitting to the Bench2Drive training support.
minor comments (2)
  1. [Introduction and Method] The notation Drive-π₀ versus π₀ should be clarified consistently throughout to avoid confusion with the original embodied AI model.
  2. [Method] Figure captions and router diagrams would benefit from explicit labels indicating which router controls camera selection versus expert activation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below and commit to revisions that strengthen the empirical presentation without altering the core claims.

Point-by-point responses
  1. Referee: [Abstract and Experiments/Results section] The abstract and results description assert SOTA performance on Bench2Drive closed-loop evaluation but supply no quantitative metrics (e.g., success rate, collision rate, or route completion), baseline comparisons (including against Drive-π₀), or ablation results isolating the Vision MoE and Action MoE contributions. This absence prevents verification of the central claim that the routers drive the gains rather than the base VLA model.

    Authors: We agree that the current manuscript version presents the SOTA claim in the abstract and results narrative without accompanying numerical values or ablations. This limits the reader's ability to verify the contribution of the routers. In the revised manuscript we will add a results table reporting closed-loop metrics (success rate, collision rate, route completion) for DriveMoE, the Drive-π₀ baseline, and prior methods, together with ablation tables that isolate the Vision MoE and Action MoE components. These additions will make explicit that the observed gains arise from the MoE routers rather than the base VLA architecture alone.
    Revision: yes

  2. Referee: [Experiments and Discussion] No analysis is provided on router generalization or stability when visual inputs or required behaviors fall outside the training distribution (e.g., rare aggressive turns or unseen environments). Without such tests, it remains unclear whether the reported improvements stem from true specialization or from overfitting to the Bench2Drive training support.

    Authors: We acknowledge the absence of explicit out-of-distribution analysis for the routers. Bench2Drive already contains a range of challenging and infrequent maneuvers, yet we did not quantify router stability or selection patterns on held-out environments. In the revision we will add a dedicated subsection with qualitative router activation visualizations for rare aggressive turns and quantitative metrics (e.g., router entropy and performance drop) on a small set of unseen scenarios to demonstrate that the specialization generalizes beyond the training support.
    Revision: yes
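The router entropy mentioned in the second response could be computed along the following lines. This is a sketch under assumptions: it uses the standard Shannon entropy of a routing distribution, and neither the function names nor the normalization come from the paper or the rebuttal.

```python
import math

def router_entropy(probs):
    """Shannon entropy (in nats) of one routing distribution.

    Terms with zero probability contribute nothing to the sum.
    """
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def normalized_entropy(probs):
    """Entropy divided by its maximum, log(n), so values lie in [0, 1]."""
    n = len(probs)
    if n <= 1:
        return 0.0
    return router_entropy(probs) / math.log(n)

# Made-up routing distributions over four experts.
confident = [0.97, 0.01, 0.01, 0.01]   # the router commits to one expert
undecided = [0.25, 0.25, 0.25, 0.25]   # the router spreads mass evenly
print(normalized_entropy(confident))
print(normalized_entropy(undecided))
```

A rise in normalized entropy on unseen scenarios relative to in-distribution ones would suggest the router is less certain there, although low entropy on its own does not show that the chosen expert is the right one.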

Circularity Check

0 steps flagged

No significant circularity: empirical SOTA claim on external benchmark

full rationale

The paper proposes DriveMoE as an architectural extension (Vision MoE for camera routing + Action MoE for behavior specialization) atop the Drive-π₀ baseline and reports closed-loop SOTA results on the external Bench2Drive benchmark. No derivation chain exists that reduces a claimed prediction or first-principles result to its own inputs by construction. The routers are trained on the training distribution and their generalization is an empirical question tested via benchmark metrics; the performance numbers are not forced by any self-definition, fitted-input renaming, or load-bearing self-citation of a uniqueness theorem. The central claim remains falsifiable against an independent benchmark and does not rely on internal re-labeling of fitted quantities as predictions.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work rests on the assumption that MoE specialization from LLMs transfers directly to vision and action components in driving without introducing new failure modes, and on the effectiveness of the Drive-π₀ baseline.

free parameters (1)
  • Router training hyperparameters and number of experts
    The routers and expert count are learned components whose specific values are not detailed in the abstract but are required for the specialization mechanism.
axioms (1)
  • domain assumption: Mixture-of-Experts enables specialization that avoids mode averaging on diverse tasks
    Invoked when the paper states that explicit behavioral specialization prevents the mode averaging seen in existing models.

pith-pipeline@v0.9.0 · 5837 in / 1325 out tokens · 71118 ms · 2026-05-22T14:27:05.567767+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Bench2Drive-Robust: Benchmarking Closed-Loop Autonomous Driving under Deployment Perturbations

    cs.RO 2026-05 unverdicted novelty 7.0

    Bench2Drive-Robust is a new closed-loop benchmark that evaluates end-to-end autonomous driving models under deployment perturbations from camera failures, ego-state errors, and compute delays, showing substantial perf...

  2. VECTOR-Drive: Tightly Coupled Vision-Language and Trajectory Expert Routing for End-to-End Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 7.0

    VECTOR-DRIVE couples vision-language reasoning and trajectory planning in one Transformer via semantic expert routing and flow-matching, reaching 88.91 driving score on Bench2Drive.

  3. Learning Vision-Language-Action World Models for Autonomous Driving

    cs.CV 2026-04 unverdicted novelty 7.0

    VLA-World improves autonomous driving by using action-guided future image generation followed by reflective reasoning over the imagined scene to refine trajectories.

  4. LACO: Adaptive Latent Communication for Collaborative Driving

    cs.AI 2026-05 unverdicted novelty 6.0

    LACO introduces Iterative Latent Deliberation, Cross-Horizon Saliency Attribution, and Structured Semantic Knowledge Distillation to enable low-latency latent communication in collaborative driving while preserving pe...

  5. One Model to Translate Them All: Universal Any-to-Any Translation for Heterogeneous Collaborative Perception

    cs.CV 2026-05 conditional novelty 6.0

    UniTrans pretrains a bank of translator experts and learns combination coefficients from modality mappings in a scene-invariant latent space to enable zero-shot any-to-any feature translation for heterogeneous collabo...

  6. GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization

    cs.RO 2026-05 unverdicted novelty 6.0

    GuidedVLA improves VLA success rates by manually supervising separate attention heads in the action decoder with auxiliary signals for task-relevant factors.

  7. CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 6.0

    CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and t...

  8. CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 6.0

    CoWorld-VLA encodes world information into four expert tokens that condition a diffusion-based planner, yielding competitive collision avoidance and trajectory accuracy on the NAVSIM benchmark.

  9. VECTOR-Drive: Tightly Coupled Vision-Language and Trajectory Expert Routing for End-to-End Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 6.0

    VECTOR-DRIVE uses shared self-attention with semantic-aware expert routing of tokens to VL and trajectory experts plus flow-matching action decoding to reach 88.91 driving score on Bench2Drive.

  10. SceneSelect: Selective Learning for Trajectory Scene Classification and Expert Scheduling

    cs.LG 2026-04 unverdicted novelty 6.0

    SceneSelect discovers a latent scene taxonomy through clustering, trains a decoupled classifier to assign inputs, and uses a scheduling policy to dispatch to optimal expert trajectory predictors, reporting 10.5% avera...

  11. ST-Prune: Training-Free Spatio-Temporal Token Pruning for Vision-Language Models in Autonomous Driving

    cs.CV 2026-04 unverdicted novelty 6.0

    ST-Prune is a training-free spatio-temporal token pruning framework for VLMs in autonomous driving that achieves near-lossless results at 90% token reduction by exploiting motion volatility, temporal recency, and mult...

  12. LMGenDrive: Bridging Multimodal Understanding and Generative World Modeling for End-to-End Driving

    cs.CV 2026-04 unverdicted novelty 6.0

    LMGenDrive unifies LLM-based multimodal understanding with generative world models to output both future driving videos and control signals for end-to-end closed-loop autonomous driving.

  13. DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale

    cs.CV 2026-04 unverdicted novelty 6.0

    DVGT-2 is a streaming vision-geometry-action model that jointly reconstructs dense 3D geometry and plans trajectories online, achieving better reconstruction than prior batch methods while transferring directly to pla...

  14. CausalVAD: De-confounding End-to-End Autonomous Driving via Causal Intervention

    cs.CV 2026-03 unverdicted novelty 6.0

    CausalVAD applies sparse causal intervention to remove spurious correlations from end-to-end autonomous driving models, reporting state-of-the-art planning accuracy and robustness on nuScenes.

  15. PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation

    cs.RO 2026-01 unverdicted novelty 6.0

    PALM improves long-horizon robotic manipulation success by distilling affordance representations for object interaction and predicting within-subtask progress in a VLA model.

  16. SpaceDrive: Infusing Spatial Awareness into VLM-based Autonomous Driving

    cs.CV 2025-12 conditional novelty 6.0

    SpaceDrive integrates 3D positional encodings derived from depth and ego-states into VLMs, replacing digit tokens to improve spatial reasoning and trajectory regression in autonomous driving.

  17. Continually Evolving Skill Knowledge in Vision Language Action Model

    cs.RO 2025-11 unverdicted novelty 6.0

    Stellar VLA achieves continual learning in VLA models by maintaining a growing knowledge space and routing tasks to specialized experts conditioned on semantic relations, delivering strong LIBERO benchmark results wit...

  18. DriveVLA-W0: World Models Amplify Data Scaling Law in Autonomous Driving

    cs.CV 2025-10 unverdicted novelty 6.0

    DriveVLA-W0 adds world modeling to predict future images in VLA models, overcoming sparse action supervision and amplifying data scaling laws on NAVSIM benchmarks and a large in-house dataset.

  19. ReSim: Reliable World Simulation for Autonomous Driving

    cs.CV 2025-06 unverdicted novelty 6.0

    ReSim is a controllable video world model trained on heterogeneous real and simulated driving data that achieves higher fidelity and controllability for both expert and non-expert actions, plus a Video2Reward module f...

  20. LVDrive: Latent Visual Representation Enhanced Vision-Language-Action Autonomous Driving Model

    cs.CV 2026-05 unverdicted novelty 5.0

    LVDrive improves closed-loop driving on Bench2Drive by adding latent future scene prediction to VLA models via unified embedding space processing and two-stage trajectory decoding.

  21. EponaV2: Driving World Model with Comprehensive Future Reasoning

    cs.CV 2026-05 unverdicted novelty 5.0

    EponaV2 advances perception-free driving world models by forecasting comprehensive future 3D geometry and semantic representations, achieving SOTA planning performance on NAVSIM benchmarks.

  22. SceneSelect: Selective Learning for Trajectory Scene Classification and Expert Scheduling

    cs.LG 2026-04 unverdicted novelty 5.0

    SceneSelect discovers latent scene categories via clustering, trains a classifier to assign inputs, and dispatches to expert trajectory predictors, reporting 10.5% average gains over single-model and ensemble baseline...

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · cited by 19 Pith papers · 9 internal anchors

  1. [1]

    Transfuser: Imitation with transformer-based sensor fusion for autonomous driving. TPAMI, 2023

    Kashyap Chitta, Aditya Prakash, Bernhard Jaeger, Zehao Yu, Katrin Renz, and Andreas Geiger. Transfuser: Imitation with transformer-based sensor fusion for autonomous driving. TPAMI, 2023

  2. [2]

    Planning-oriented autonomous driving

    Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. In CVPR, pages 17853–17862, 2023

  3. [3]

    Vad: Vectorized scene representation for efficient autonomous driving

    Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. Vad: Vectorized scene representation for efficient autonomous driving. ICCV, 2023

  4. [4]

    Policy pre-training for autonomous driving via self-supervised geometric modeling, 2023

    Penghao Wu, Li Chen, Hongyang Li, Xiaosong Jia, Junchi Yan, and Yu Qiao. Policy pre-training for autonomous driving via self-supervised geometric modeling, 2023

  5. [5]

    Flatfusion: Delving into details of sparse transformer-based camera-lidar fusion for autonomous driving. arXiv preprint arXiv:2408.06832, 2024

    Yutao Zhu, Xiaosong Jia, Xinyu Yang, and Junchi Yan. Flatfusion: Delving into details of sparse transformer-based camera-lidar fusion for autonomous driving. arXiv preprint arXiv:2408.06832, 2024

  6. [6]

    Delving into the devils of bird’s-eye-view perception: A review, evaluation and recipe. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(4):2151–2170, 2023

    Hongyang Li, Chonghao Sima, Jifeng Dai, Wenhai Wang, Lewei Lu, Huijie Wang, Jia Zeng, Zhiqi Li, Jiazhi Yang, Hanming Deng, et al. Delving into the devils of bird’s-eye-view perception: A review, evaluation and recipe. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(4):2151–2170, 2023

  7. [7]

    Trajectory-guided control prediction for end-to-end autonomous driving: A simple yet strong baseline

    Penghao Wu, Xiaosong Jia, Li Chen, Junchi Yan, Hongyang Li, and Yu Qiao. Trajectory-guided control prediction for end-to-end autonomous driving: A simple yet strong baseline. In NeurIPS, 2022

  8. [8]

    Hidden biases of end-to-end driving models

    Bernhard Jaeger, Kashyap Chitta, and Andreas Geiger. Hidden biases of end-to-end driving models. In Proc. of the IEEE International Conf. on Computer Vision (ICCV), 2023

  9. [9]

    Don’t shake the wheel: Momentum-aware planning in end-to-end autonomous driving. arXiv preprint arXiv:2503.03125, 2025

    Ziying Song, Caiyan Jia, Lin Liu, Hongyu Pan, Yongchang Zhang, Junming Wang, Xingyu Zhang, Shaoqing Xu, Lei Yang, and Yadan Luo. Don’t shake the wheel: Momentum-aware planning in end-to-end autonomous driving. arXiv preprint arXiv:2503.03125, 2025

  10. [10]

    Diffad: A unified diffusion modeling approach for autonomous driving. arXiv preprint arXiv:2503.12170, 2025

    Tao Wang, Cong Zhang, Xingguang Qu, Kun Li, Weiwei Liu, and Chang Huang. Diffad: A unified diffusion modeling approach for autonomous driving. arXiv preprint arXiv:2503.12170, 2025

  11. [11]

    Amp: Autoregressive motion prediction revisited with next token prediction for autonomous driving. arXiv preprint arXiv:2403.13331, 2024

    Xiaosong Jia, Shaoshuai Shi, Zijun Chen, Li Jiang, Wenlong Liao, Tao He, and Junchi Yan. Amp: Autoregressive motion prediction revisited with next token prediction for autonomous driving. arXiv preprint arXiv:2403.13331, 2024

  12. [12]

    Bench2drive-r: Turning real world data into reactive closed-loop autonomous driving benchmark by generative model. arXiv preprint arXiv:2412.09647, 2024

    Junqi You, Xiaosong Jia, Zhiyuan Zhang, Yutao Zhu, and Junchi Yan. Bench2drive-r: Turning real world data into reactive closed-loop autonomous driving benchmark by generative model. arXiv preprint arXiv:2412.09647, 2024

  13. [13]

    Lmdrive: Closed-loop end-to-end driving with large language models, 2023

    Hao Shao, Yuxuan Hu, Letian Wang, Steven L. Waslander, Yu Liu, and Hongsheng Li. Lmdrive: Closed-loop end-to-end driving with large language models, 2023

  14. [14]

    Drivelm: Driving with graph visual question answering

    Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Ping Luo, Andreas Geiger, and Hongyang Li. Drivelm: Driving with graph visual question answering. arXiv preprint arXiv:2312.14150, 2023

  15. [15]

    Asynchronous large language model enhanced planner for autonomous driving, 2024

    Yuan Chen, Zi han Ding, Ziqin Wang, Yan Wang, Lijun Zhang, and Si Liu. Asynchronous large language model enhanced planner for autonomous driving, 2024

  16. [17]

    Genad: Generative end-to-end autonomous driving. arXiv preprint arXiv:2402.11502, 2024

    Wenzhao Zheng, Ruiqi Song, Xianda Guo, Chenming Zhang, and Long Chen. Genad: Generative end-to-end autonomous driving. arXiv preprint arXiv:2402.11502, 2024

  17. [18]

    Ide-net: Interactive driving event and pattern extraction from human data. IEEE Robotics and Automation Letters, 6(2):3065–3072, 2021

    Xiaosong Jia, Liting Sun, Masayoshi Tomizuka, and Wei Zhan. Ide-net: Interactive driving event and pattern extraction from human data. IEEE Robotics and Automation Letters, 6(2):3065–3072, 2021

  18. [19]

    Activead: Planning-oriented active learning for end-to-end autonomous driving, 2024

    Han Lu, Xiaosong Jia, Yichen Xie, Wenlong Liao, Xiaokang Yang, and Junchi Yan. Activead: Planning-oriented active learning for end-to-end autonomous driving, 2024

  19. [20]

    Drivemlm: Aligning multi-modal large language models with behavioral planning states for autonomous driving

    Wenhai Wang, Jiangwei Xie, ChuanYang Hu, Haoming Zou, Jianan Fan, Wenwen Tong, Yang Wen, Silei Wu, Hanming Deng, Zhiqi Li, et al. Drivemlm: Aligning multi-modal large language models with behavioral planning states for autonomous driving. arXiv preprint arXiv:2312.09245, 2023

  20. [21]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, pages 19730–19742. PMLR, 2023

  21. [22]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. $\pi_0$: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

  22. [23]

    $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. $\pi_{0.5}$: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025

  23. [24]

    Llm4drive: A survey of large language models for autonomous driving. ArXiv, abs/2311.01043, 2023

    Zhenjie Yang, Xiaosong Jia, Hongyang Li, and Junchi Yan. Llm4drive: A survey of large language models for autonomous driving. ArXiv, abs/2311.01043, 2023

  24. [25]

    Carllava: Vision language models for camera-only closed-loop driving, 2024

    Katrin Renz, Long Chen, Ana-Maria Marcu, Jan Hünermann, Benoit Hanotte, Alice Karnsund, Jamie Shotton, Elahe Arani, and Oleg Sinavski. Carllava: Vision language models for camera-only closed-loop driving, 2024

  25. [26]

    Gpt4point: A unified framework for point-language understanding and generation

    Zhangyang Qi, Ye Fang, Zeyi Sun, Xiaoyang Wu, Tong Wu, Jiaqi Wang, Dahua Lin, and Hengshuang Zhao. Gpt4point: A unified framework for point-language understanding and generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26417–26427, 2024

  26. [27]

    Multi-agent trajectory prediction by combining egocentric and allocentric views

    Xiaosong Jia, Liting Sun, Hang Zhao, Masayoshi Tomizuka, and Wei Zhan. Multi-agent trajectory prediction by combining egocentric and allocentric views. In Conference on Robot Learning, pages 1434–1443. PMLR, 2022

  27. [28]

    Towards capturing the temporal dynamics for trajectory prediction: a coarse-to-fine approach

    Xiaosong Jia, Li Chen, Penghao Wu, Jia Zeng, Junchi Yan, Hongyang Li, and Yu Qiao. Towards capturing the temporal dynamics for trajectory prediction: a coarse-to-fine approach. In CoRL, pages 910–920. PMLR, 2023

  28. [29]

    Hdgt: Heterogeneous driving graph transformer for multi-agent trajectory prediction via scene encoding

    Xiaosong Jia, Penghao Wu, Li Chen, Yu Liu, Hongyang Li, and Junchi Yan. Hdgt: Heterogeneous driving graph transformer for multi-agent trajectory prediction via scene encoding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(11):13860–13875, 2023

  29. [30]

    A survey on mixture of experts in large language models. IEEE Transactions on Knowledge and Data Engineering, 2025

    Weilin Cai, Juyong Jiang, Fan Wang, Jing Tang, Sunghun Kim, and Jiayi Huang. A survey on mixture of experts in large language models. IEEE Transactions on Knowledge and Data Engineering, 2025

  30. [31]

    Efficient large language models: A survey. arXiv preprint arXiv:2312.03863, 1, 2023

    Zhongwei Wan, Xin Wang, Che Liu, Samiul Alam, Yu Zheng, et al. Efficient large language models: A survey. arXiv preprint arXiv:2312.03863, 1, 2023

  31. [32]

    Llama-moe: Building mixture-of-experts from llama with continual pre-training

    Tong Zhu, Xiaoye Qu, Daize Dong, Jiacheng Ruan, Jingqi Tong, Conghui He, and Yu Cheng. Llama-moe: Building mixture-of-experts from llama with continual pre-training. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 15913–15923, 2024

  32. [33]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023

  33. [34]

    OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner...

  34. [35]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022

  35. [36]

    Trajectory-llm: A language-based data generator for trajectory prediction in autonomous driving

    Kairui Yang, Zihao Guo, Gengjie Lin, Haotian Dong, Zhao Huang, Yipeng Wu, Die Zuo, Jibin Peng, Ziyuan Zhong, Xin Wang, et al. Trajectory-llm: A language-based data generator for trajectory prediction in autonomous driving. In The Thirteenth International Conference on Learning Representations, 2025

  36. [37]

    Interleave-vla: Enhancing robot manipulation with interleaved image-text instructions

    Cunxin Fan, Xiaosong Jia, Yihang Sun, Yixiao Wang, Jianglan Wei, Ziyang Gong, Xiangyu Zhao, Masayoshi Tomizuka, Xue Yang, Junchi Yan, et al. Interleave-vla: Enhancing robot manipulation with interleaved image-text instructions. arXiv preprint arXiv:2505.02152, 2025

  37. [38]

    Drivegpt4: Interpretable end-to-end autonomous driving via large language model

    Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kwan-Yee K Wong, Zhenguo Li, and Hengshuang Zhao. Drivegpt4: Interpretable end-to-end autonomous driving via large language model. IEEE Robotics and Automation Letters, 2024

  38. [39]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024

  39. [40]

    Diffusion policy: Visuomotor policy learning via action diffusion

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, page 02783649241273668, 2023

  40. [41]

    DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

    Damai Dai, Chengqi Deng, Chenggang Zhao, RX Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Yu Wu, et al. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. arXiv preprint arXiv:2401.06066, 2024

  41. [42]

    Mentor: Mixture-of-experts network with task-oriented perturbation for visual reinforcement learning

    Suning Huang, Zheyu Zhang, Tianhai Liang, Yihan Xu, Zhehao Kou, Chenhao Lu, Guowei Xu, Zhengrong Xue, and Huazhe Xu. Mentor: Mixture-of-experts network with task-oriented perturbation for visual reinforcement learning. arXiv preprint arXiv:2410.14972, 2024

  42. [43]

    PaliGemma: A versatile 3B VLM for transfer

    Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer. arXiv preprint arXiv:2407.07726, 2024

  43. [44]

    Mixtral of Experts

    Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024

  44. [45]

    Deepseek-v3 technical report

    DeepSeek-AI. Deepseek-v3 technical report, 2024

  45. [46]

    Carla: An open urban driving simulator

    Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. Carla: An open urban driving simulator. In Conference on robot learning, pages 1–16. PMLR, 2017

  46. [47]

    Bench2drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving

    Xiaosong Jia, Zhenjie Yang, Qifeng Li, Zhiyuan Zhang, and Junchi Yan. Bench2drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving. In NeurIPS 2024 Datasets and Benchmarks Track, 2024

  47. [48]

    Rethinking the Open-Loop Evaluation of End-to-End Autonomous Driving in nuScenes

    Jiang-Tian Zhai, Ze Feng, Jinhao Du, Yongqiang Mao, Jiang-Jiang Liu, Zichang Tan, Yifu Zhang, Xiaoqing Ye, and Jingdong Wang. Rethinking the open-loop evaluation of end-to-end autonomous driving in nuscenes. arXiv preprint arXiv:2305.10430, 2023

  48. [49]

    Think twice before driving: Towards scalable decoders for end-to-end autonomous driving

    Xiaosong Jia, Penghao Wu, Li Chen, Jiangwei Xie, Conghui He, Junchi Yan, and Hongyang Li. Think twice before driving: Towards scalable decoders for end-to-end autonomous driving. In CVPR, 2023

  49. [50]

    Driveadapter: Breaking the coupling barrier of perception and planning in end-to-end autonomous driving

    Xiaosong Jia, Yulu Gao, Li Chen, Junchi Yan, Patrick Langechuan Liu, and Hongyang Li. Driveadapter: Breaking the coupling barrier of perception and planning in end-to-end autonomous driving. In ICCV, 2023

  50. [51]

    Drivetransformer: Unified transformer for scalable end-to-end autonomous driving

    Xiaosong Jia, Junqi You, Zhiyuan Zhang, and Junchi Yan. Drivetransformer: Unified transformer for scalable end-to-end autonomous driving. In The Thirteenth International Conference on Learning Representations, 2025

  51. [52]

    End-to-end driving with online trajectory evaluation via bev world model

    Yingyan Li, Yuqi Wang, Yang Liu, Jiawei He, Lue Fan, and Zhaoxiang Zhang. End-to-end driving with online trajectory evaluation via bev world model. arXiv preprint arXiv:2504.01941, 2025

A Annotation for Router

Vision Router: We developed a set of heuristic rules based on annotation information from the Bench2Drive dataset to identify special driving sce...