DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving

Haisheng Su; Junchi Yan; Qifeng Li; Xiaosong Jia; Xuekai Zhu; Yilin Chai; Yuqian Shao; Zhenjie Yang

arXiv: 2505.16278 · v2 · pith:OCUAVJ6Cnew · submitted 2025-05-22 · 💻 cs.CV · cs.AI · cs.RO

Zhenjie Yang, Yilin Chai, Xiaosong Jia, Qifeng Li, Yuqian Shao, Xuekai Zhu, Haisheng Su, Junchi Yan

Pith reviewed 2026-05-22 14:27 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.RO

keywords autonomous driving · mixture of experts · end-to-end learning · vision-language-action · multi-view sensing · behavior specialization · closed-loop evaluation

The pith

Mixture-of-experts routers for vision and action let end-to-end driving models handle rare maneuvers without averaging behaviors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proposes DriveMoE, which adds two mixture-of-experts modules to a vision-language-action baseline for autonomous driving. A vision MoE uses a router to pick the most relevant cameras for the current scene, much like a driver focuses on key visual cues instead of processing every view at once. An action MoE uses a second router to activate specialized expert networks for distinct driving skills such as sharp turns. The design aims to avoid the mode-averaging problem that occurs when a single model tries to master every possible behavior at once. Closed-loop tests on the Bench2Drive benchmark show the combined system reaching state-of-the-art results.

Core claim

DriveMoE integrates a Scene-Specialized Vision MoE, whose router selects relevant cameras according to driving context, and a Skill-Specialized Action MoE, whose router activates behavior-specific expert modules, into the Drive-π0 vision-language-action baseline; this explicit specialization enables robust handling of diverse and complex scenarios, including rare aggressive maneuvers, and produces state-of-the-art closed-loop performance on Bench2Drive.

What carries the argument

Dual-router mixture-of-experts system: one router dynamically chooses which cameras to attend to, the other activates driving-behavior experts.
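To make the camera-routing idea concrete, here is a minimal Python sketch of top-k view selection. It is not the authors' implementation: the names `softmax` and `select_cameras`, the example logits, and the choice of six views with k = 2 are assumptions made for illustration. The paper's router is a trained network whose details are not given on this page.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def select_cameras(router_logits, k):
    """Pick the k highest-scoring camera views (hypothetical helper).

    router_logits: one score per camera view, standing in for the output
                   of a learned router.
    Returns the chosen view indices and their renormalized weights.
    """
    probs = softmax(router_logits)
    # Rank views by router probability, highest first, and keep the top k.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = sorted(ranked[:k])
    # Renormalize so the weights of the kept views sum to 1.
    mass = sum(probs[i] for i in chosen)
    weights = {i: probs[i] / mass for i in chosen}
    return chosen, weights

# Six surround views with made-up router scores.
logits = [2.0, 0.1, 1.5, -0.3, 0.0, -1.0]
chosen, weights = select_cameras(logits, k=2)
print(chosen, weights)
```

Only the tokens of the chosen views would then be passed downstream, which is where the claimed reduction in visual tokens would come from under this reading.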

If this is right

Dynamic camera selection reduces the need to process every sensor view at every moment.
Behavior-specific action experts prevent dilution of rare but safety-critical maneuvers.
The two MoE layers together produce higher closed-loop success rates than a single shared network.
Explicit specialization supports scaling to wider ranges of driving contexts without retraining the entire model.
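The mode-averaging concern behind these points can be illustrated with a toy calculation. The sketch below is an assumption-laden example, not taken from the paper: the steering values, the two skill labels, and the helper names are invented, and the router is assumed to know the correct skill, which a learned router would only approximate.

```python
# Two equally likely expert demonstrations at a fork: steer left or steer right.
# Labels and steering values are illustrative only.
demos = [("left", -0.5), ("right", 0.5)]

def mse(predictions, targets):
    """Mean squared error between two equal-length lists."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

targets = [steer for _, steer in demos]

# A single deterministic output that minimizes MSE is the mean of the targets,
# which here means driving straight into the fork.
single_output = sum(targets) / len(targets)
single_error = mse([single_output] * len(targets), targets)

# Two specialized experts, one per skill, selected by an (assumed perfect) router.
experts = {"left": -0.5, "right": 0.5}
routed_error = mse([experts[label] for label, _ in demos], targets)

print(single_output, single_error, routed_error)
```

The single shared output lands between the two valid maneuvers, while the routed experts reproduce each one; any real gain depends on how often the learned router picks the right expert.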

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The router-based selection pattern could be reused in other multi-camera robotic control settings.
Online fine-tuning of the routers might be needed to maintain performance as real-world conditions drift.
Pairing this architecture with larger language models could add higher-level planning on top of the specialized low-level experts.

Load-bearing premise

Routers trained on the training distribution will correctly identify relevant cameras and driving behaviors when faced with diverse or unseen conditions.

What would settle it

A closed-loop test in which the model repeatedly selects unhelpful cameras or activates the wrong action experts on novel scenarios such as night driving or sudden weather changes, resulting in collisions or off-road events, would falsify the claim.

Figures

Figures reproduced from arXiv: 2505.16278 by Haisheng Su, Junchi Yan, Qifeng Li, Xiaosong Jia, Xuekai Zhu, Yilin Chai, Yuqian Shao, Zhenjie Yang.

**Figure 1.** Comparison of Different Vision and Action Modeling Strategies in VLA-based End-to-End Driving. (a.1) Vanilla visual token encoding [14] processes all surround-view images through a vision tower, leading to token redundancy and increased computational cost. (a.2) Query-based token extraction [20] (e.g., Q-former [21]) selects a subset of visual tokens from each image, but loses spatial structure and require…

**Figure 2.** Framework of DriveMoE. Our proposed framework comprises two main Mixture-of-Experts (MoE) modules tailored for end-to-end autonomous driving. The Scene-Specialized Vision MoE dynamically selects relevant camera views based on real-time driving contexts, efficiently reducing visual redundancy. Subsequently, selected views are fused into a unified representation by projector layers. The Skill-Specialized Act…

**Figure 3.** The Scene-Specialized Vision Mixture-of-Experts. [Diagram labels recovered from the figure: shared experts E_share 1, 2, …, M; non-shared experts E_non-share 1, 2, …, N; Attention; Normalize; Camera Router; Top-K; Skill-Specialized Action MoE; Transformer Decoder Layer; Large Language Model; supervision labels Merging, Give Way, Overtaking.]

**Figure 4.** The Skill-Specialized Action Mixture-of-Experts. … be integrated effectively. This dynamic attention strategy significantly reduces the number of visual tokens processed per timestep, greatly improving computational efficiency and decision accuracy. Formally, we define the image from camera view $v$ at timestep $t$ as $I_t^v$, where $v \in \{1, 2, \dots, N\}$ for $N$ available camera views. In particular, the front-view…

Original abstract

End-to-end autonomous driving (E2E-AD) demands effective processing of multi-view sensory data and robust handling of diverse and complex driving scenarios, particularly rare maneuvers such as aggressive turns. Recent success of Mixture-of-Experts (MoE) architecture in Large Language Models (LLMs) demonstrates that specialization of parameters enables strong scalability. In this work, we propose DriveMoE, a novel MoE-based E2E-AD framework, with a Scene-Specialized Vision MoE and a Skill-Specialized Action MoE. DriveMoE is built upon our $\pi_0$ Vision-Language-Action (VLA) baseline (originally from the embodied AI field), called Drive-$\pi_0$. Specifically, we add Vision MoE to Drive-$\pi_0$ by training a router to select relevant cameras according to the driving context dynamically. This design mirrors human driving cognition, where drivers selectively attend to crucial visual cues rather than exhaustively processing all visual information. In addition, we add Action MoE by training another router to activate specialized expert modules for different driving behaviors. Through explicit behavioral specialization, DriveMoE is able to handle diverse scenarios without suffering from modes averaging like existing models. In Bench2Drive closed-loop evaluation experiments, DriveMoE achieves state-of-the-art (SOTA) performance, demonstrating the effectiveness of combining vision and action MoE in autonomous driving tasks. We will release our code and models of DriveMoE and Drive-$\pi_0$.
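As a rough illustration of how a skill router could combine experts, the sketch below mixes always-active shared experts with the top-k routed experts, using renormalized router weights. This layout is an assumption: the figure labels on this page mention shared and non-shared experts, Top-K, and a Normalize step, but the exact combination rule is not stated here. The function names and the toy linear "experts" are hypothetical stand-ins for the paper's network modules.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def make_linear_expert(scale, bias):
    """Toy expert: an elementwise affine map standing in for a network."""
    def expert(x):
        return [scale * v + bias for v in x]
    return expert

def moe_forward(x, router_logits, shared, routed, k):
    """Sum all shared experts plus the top-k routed experts.

    Routed experts are weighted by router probabilities renormalized
    over the selected set (an assumed design, not confirmed by the source).
    """
    assert len(router_logits) == len(routed)
    probs = softmax(router_logits)
    top = sorted(range(len(routed)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    out = [0.0] * len(x)
    for expert in shared:  # shared experts are always active
        out = [o + v for o, v in zip(out, expert(x))]
    for i in top:  # only the selected skill experts contribute
        w = probs[i] / mass
        out = [o + w * v for o, v in zip(out, routed[i](x))]
    return out, top

shared = [make_linear_expert(1.0, 0.0)]
routed = [make_linear_expert(2.0, 0.0),   # e.g. a "merging" skill (illustrative)
          make_linear_expert(-1.0, 0.0),  # e.g. a "give way" skill
          make_linear_expert(0.5, 0.0)]   # e.g. an "overtaking" skill
out, top = moe_forward([1.0, 2.0], [0.0, 3.0, 0.0], shared, routed, k=1)
print(out, top)
```

With k = 1 the router commits to a single skill expert per step; a larger k would blend several, which trades specialization against robustness to routing mistakes.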

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

DriveMoE layers a camera router and a behavior router onto a VLA baseline to specialize vision and action in driving, but the SOTA claim rests on metrics and router-stability evidence that the abstract does not show.

The core move here is adding two routers to the π0 VLA model: one that dynamically picks which camera feeds matter for the current scene, and another that routes to behavior-specific action experts. This is meant to let the system handle rare maneuvers without the usual averaging across modes that happens in standard end-to-end driving models. The vision part is framed as mimicking selective human attention, which is a reasonable intuition, and the action part directly targets behavioral diversity. That combination applied to autonomous driving is not something already in the cited prior work, so the architecture itself is the clearest new piece. They also plan to release code and models, which helps reproducibility.

The experiments are said to reach SOTA on Bench2Drive closed-loop, which would be useful if the numbers and controls hold up. The main soft spot is that the abstract gives no concrete metrics, baseline deltas, or ablation tables, so it is hard to tell how much the MoE layers actually move the needle versus the base model. The routers are trained on the training distribution, and nothing in the provided summary shows tests on out-of-distribution scenes or rare events, which is exactly where specialization would need to prove itself.

This is aimed at people working on scalable E2E driving systems who already follow VLA and MoE ideas. A reader looking for a concrete recipe to try on their own driving stack could extract value from the router designs even before the full results are verified. I would send it to peer review because the idea is specific enough and the target problem is real; referees can push for the missing numbers and generalization checks without starting from zero.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes DriveMoE, a Mixture-of-Experts extension to the π₀ Vision-Language-Action baseline (Drive-π₀) for end-to-end autonomous driving. It introduces a Scene-Specialized Vision MoE that trains a router to dynamically select relevant camera views and a Skill-Specialized Action MoE that activates behavior-specific expert modules. The central empirical claim is that this combination yields state-of-the-art closed-loop performance on the Bench2Drive benchmark by enabling specialization without mode averaging, while mirroring human selective attention and behavioral handling of rare maneuvers.

Significance. If the empirical results hold, the work would provide evidence that MoE architectures can improve scalability and robustness in multi-view E2E-AD by avoiding parameter averaging across diverse scenarios. The planned release of code and models would further strengthen reproducibility for the community.

major comments (2)

[Abstract and Experiments/Results section] The abstract and results description assert SOTA performance on Bench2Drive closed-loop evaluation but supply no quantitative metrics (e.g., success rate, collision rate, or route completion), baseline comparisons (including against Drive-π₀), or ablation results isolating the Vision MoE and Action MoE contributions. This absence prevents verification of the central claim that the routers drive the gains rather than the base VLA model.
[Experiments and Discussion] No analysis is provided on router generalization or stability when visual inputs or required behaviors fall outside the training distribution (e.g., rare aggressive turns or unseen environments). Without such tests, it remains unclear whether the reported improvements stem from true specialization or from overfitting to the Bench2Drive training support.

minor comments (2)

[Introduction and Method] The notation Drive-π₀ versus π₀ should be clarified consistently throughout to avoid confusion with the original embodied AI model.
[Method] Figure captions and router diagrams would benefit from explicit labels indicating which router controls camera selection versus expert activation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below and commit to revisions that strengthen the empirical presentation without altering the core claims.

Referee: [Abstract and Experiments/Results section] The abstract and results description assert SOTA performance on Bench2Drive closed-loop evaluation but supply no quantitative metrics (e.g., success rate, collision rate, or route completion), baseline comparisons (including against Drive-π₀), or ablation results isolating the Vision MoE and Action MoE contributions. This absence prevents verification of the central claim that the routers drive the gains rather than the base VLA model.

Authors: We agree that the current manuscript version presents the SOTA claim in the abstract and results narrative without accompanying numerical values or ablations. This limits the reader's ability to verify the contribution of the routers. In the revised manuscript we will add a results table reporting closed-loop metrics (success rate, collision rate, route completion) for DriveMoE, the Drive-π₀ baseline, and prior methods, together with ablation tables that isolate the Vision MoE and Action MoE components. These additions will make explicit that the observed gains arise from the MoE routers rather than the base VLA architecture alone.
revision: yes

Referee: [Experiments and Discussion] No analysis is provided on router generalization or stability when visual inputs or required behaviors fall outside the training distribution (e.g., rare aggressive turns or unseen environments). Without such tests, it remains unclear whether the reported improvements stem from true specialization or from overfitting to the Bench2Drive training support.

Authors: We acknowledge the absence of explicit out-of-distribution analysis for the routers. Bench2Drive already contains a range of challenging and infrequent maneuvers, yet we did not quantify router stability or selection patterns on held-out environments. In the revision we will add a dedicated subsection with qualitative router activation visualizations for rare aggressive turns and quantitative metrics (e.g., router entropy and performance drop) on a small set of unseen scenarios to demonstrate that the specialization generalizes beyond the training support.
revision: yes
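The router entropy mentioned in this response could be computed along the following lines. This is a minimal sketch under assumptions: the function names are hypothetical, entropy is the standard Shannon entropy of one routing distribution, and dividing by log n to get a value in [0, 1] is a common convention rather than something the rebuttal specifies.

```python
import math

def router_entropy(probs):
    """Shannon entropy (in nats) of a single routing distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def normalized_entropy(probs):
    """Entropy divided by its maximum, log(n); 0 = one-hot, 1 = uniform."""
    n = len(probs)
    return router_entropy(probs) / math.log(n) if n > 1 else 0.0

confident = [0.97, 0.01, 0.01, 0.01]   # router commits to one expert
uncertain = [0.25, 0.25, 0.25, 0.25]   # router spreads mass evenly
print(normalized_entropy(confident), normalized_entropy(uncertain))
```

A rise in average entropy on unseen scenarios would suggest the router is less sure which camera or expert applies, though low entropy alone does not show the choice is correct.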

Circularity Check

0 steps flagged

No significant circularity: empirical SOTA claim on external benchmark

The paper proposes DriveMoE as an architectural extension (Vision MoE for camera routing + Action MoE for behavior specialization) atop the Drive-π₀ baseline and reports closed-loop SOTA results on the external Bench2Drive benchmark. No derivation chain exists that reduces a claimed prediction or first-principles result to its own inputs by construction. The routers are trained on the training distribution and their generalization is an empirical question tested via benchmark metrics; the performance numbers are not forced by any self-definition, fitted-input renaming, or load-bearing self-citation of a uniqueness theorem. The central claim remains falsifiable against an independent benchmark and does not rely on internal re-labeling of fitted quantities as predictions.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work rests on the assumption that MoE specialization from LLMs transfers directly to vision and action components in driving without introducing new failure modes, plus reliance on the effectiveness of the Drive-π0 baseline.

free parameters (1)

Router training hyperparameters and number of experts
The routers and expert count are learned components whose specific values are not detailed in the abstract but are required for the specialization mechanism.

axioms (1)

domain assumption: Mixture-of-Experts enables specialization that avoids mode averaging on diverse tasks
Invoked when the paper states that explicit behavioral specialization prevents the mode averaging seen in existing models.

pith-pipeline@v0.9.0 · 5837 in / 1325 out tokens · 71118 ms · 2026-05-22T14:27:05.567767+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

Relation between the paper passage and the cited Recognition theorem.

Scene-Specialized Vision MoE... router... Top-K... Skill-Specialized Action MoE... flow-matching planner... two-stage training

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Bench2Drive-Robust: Benchmarking Closed-Loop Autonomous Driving under Deployment Perturbations
cs.RO 2026-05 unverdicted novelty 7.0

Bench2Drive-Robust is a new closed-loop benchmark that evaluates end-to-end autonomous driving models under deployment perturbations from camera failures, ego-state errors, and compute delays, showing substantial perf...
VECTOR-Drive: Tightly Coupled Vision-Language and Trajectory Expert Routing for End-to-End Autonomous Driving
cs.CV 2026-05 unverdicted novelty 7.0

VECTOR-DRIVE couples vision-language reasoning and trajectory planning in one Transformer via semantic expert routing and flow-matching, reaching 88.91 driving score on Bench2Drive.
Learning Vision-Language-Action World Models for Autonomous Driving
cs.CV 2026-04 unverdicted novelty 7.0

VLA-World improves autonomous driving by using action-guided future image generation followed by reflective reasoning over the imagined scene to refine trajectories.
LACO: Adaptive Latent Communication for Collaborative Driving
cs.AI 2026-05 unverdicted novelty 6.0

LACO introduces Iterative Latent Deliberation, Cross-Horizon Saliency Attribution, and Structured Semantic Knowledge Distillation to enable low-latency latent communication in collaborative driving while preserving pe...
One Model to Translate Them All: Universal Any-to-Any Translation for Heterogeneous Collaborative Perception
cs.CV 2026-05 conditional novelty 6.0

UniTrans pretrains a bank of translator experts and learns combination coefficients from modality mappings in a scene-invariant latent space to enable zero-shot any-to-any feature translation for heterogeneous collabo...
GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization
cs.RO 2026-05 unverdicted novelty 6.0

GuidedVLA improves VLA success rates by manually supervising separate attention heads in the action decoder with auxiliary signals for task-relevant factors.
CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
cs.CV 2026-05 unverdicted novelty 6.0

CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and t...
CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
cs.CV 2026-05 unverdicted novelty 6.0

CoWorld-VLA encodes world information into four expert tokens that condition a diffusion-based planner, yielding competitive collision avoidance and trajectory accuracy on the NAVSIM benchmark.
VECTOR-Drive: Tightly Coupled Vision-Language and Trajectory Expert Routing for End-to-End Autonomous Driving
cs.CV 2026-05 unverdicted novelty 6.0

VECTOR-DRIVE uses shared self-attention with semantic-aware expert routing of tokens to VL and trajectory experts plus flow-matching action decoding to reach 88.91 driving score on Bench2Drive.
SceneSelect: Selective Learning for Trajectory Scene Classification and Expert Scheduling
cs.LG 2026-04 unverdicted novelty 6.0

SceneSelect discovers a latent scene taxonomy through clustering, trains a decoupled classifier to assign inputs, and uses a scheduling policy to dispatch to optimal expert trajectory predictors, reporting 10.5% avera...
ST-Prune: Training-Free Spatio-Temporal Token Pruning for Vision-Language Models in Autonomous Driving
cs.CV 2026-04 unverdicted novelty 6.0

ST-Prune is a training-free spatio-temporal token pruning framework for VLMs in autonomous driving that achieves near-lossless results at 90% token reduction by exploiting motion volatility, temporal recency, and mult...
LMGenDrive: Bridging Multimodal Understanding and Generative World Modeling for End-to-End Driving
cs.CV 2026-04 unverdicted novelty 6.0

LMGenDrive unifies LLM-based multimodal understanding with generative world models to output both future driving videos and control signals for end-to-end closed-loop autonomous driving.
DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale
cs.CV 2026-04 unverdicted novelty 6.0

DVGT-2 is a streaming vision-geometry-action model that jointly reconstructs dense 3D geometry and plans trajectories online, achieving better reconstruction than prior batch methods while transferring directly to pla...
CausalVAD: De-confounding End-to-End Autonomous Driving via Causal Intervention
cs.CV 2026-03 unverdicted novelty 6.0

CausalVAD applies sparse causal intervention to remove spurious correlations from end-to-end autonomous driving models, reporting state-of-the-art planning accuracy and robustness on nuScenes.
PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation
cs.RO 2026-01 unverdicted novelty 6.0

PALM improves long-horizon robotic manipulation success by distilling affordance representations for object interaction and predicting within-subtask progress in a VLA model.
SpaceDrive: Infusing Spatial Awareness into VLM-based Autonomous Driving
cs.CV 2025-12 conditional novelty 6.0

SpaceDrive integrates 3D positional encodings derived from depth and ego-states into VLMs, replacing digit tokens to improve spatial reasoning and trajectory regression in autonomous driving.
Continually Evolving Skill Knowledge in Vision Language Action Model
cs.RO 2025-11 unverdicted novelty 6.0

Stellar VLA achieves continual learning in VLA models by maintaining a growing knowledge space and routing tasks to specialized experts conditioned on semantic relations, delivering strong LIBERO benchmark results wit...
DriveVLA-W0: World Models Amplify Data Scaling Law in Autonomous Driving
cs.CV 2025-10 unverdicted novelty 6.0

DriveVLA-W0 adds world modeling to predict future images in VLA models, overcoming sparse action supervision and amplifying data scaling laws on NAVSIM benchmarks and a large in-house dataset.
ReSim: Reliable World Simulation for Autonomous Driving
cs.CV 2025-06 unverdicted novelty 6.0

ReSim is a controllable video world model trained on heterogeneous real and simulated driving data that achieves higher fidelity and controllability for both expert and non-expert actions, plus a Video2Reward module f...
LVDrive: Latent Visual Representation Enhanced Vision-Language-Action Autonomous Driving Model
cs.CV 2026-05 unverdicted novelty 5.0

LVDrive improves closed-loop driving on Bench2Drive by adding latent future scene prediction to VLA models via unified embedding space processing and two-stage trajectory decoding.
EponaV2: Driving World Model with Comprehensive Future Reasoning
cs.CV 2026-05 unverdicted novelty 5.0

EponaV2 advances perception-free driving world models by forecasting comprehensive future 3D geometry and semantic representations, achieving SOTA planning performance on NAVSIM benchmarks.
SceneSelect: Selective Learning for Trajectory Scene Classification and Expert Scheduling
cs.LG 2026-04 unverdicted novelty 5.0

SceneSelect discovers latent scene categories via clustering, trains a classifier to assign inputs, and dispatches to expert trajectory predictors, reporting 10.5% average gains over single-model and ensemble baseline...

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · cited by 19 Pith papers · 9 internal anchors

[1]

Transfuser: Imitation with transformer-based sensor fusion for autonomous driving.TPAMI, 2023

Kashyap Chitta, Aditya Prakash, Bernhard Jaeger, Zehao Yu, Katrin Renz, and Andreas Geiger. Transfuser: Imitation with transformer-based sensor fusion for autonomous driving.TPAMI, 2023

work page 2023
[2]

Planning-oriented autonomous driving

Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. InCVPR, pages 17853–17862, 2023

work page 2023
[3]

Vad: Vectorized scene representation for efficient autonomous driving

Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. Vad: Vectorized scene representation for efficient autonomous driving. ICCV, 2023

work page 2023
[4]

Policy pre-training for autonomous driving via self-supervised geometric modeling, 2023

Penghao Wu, Li Chen, Hongyang Li, Xiaosong Jia, Junchi Yan, and Yu Qiao. Policy pre-training for autonomous driving via self-supervised geometric modeling, 2023

work page 2023
[5]

Flatfusion: Delving into details of sparse transformer-based camera-lidar fusion for autonomous driving.arXiv preprint arXiv:2408.06832, 2024

Yutao Zhu, Xiaosong Jia, Xinyu Yang, and Junchi Yan. Flatfusion: Delving into details of sparse transformer-based camera-lidar fusion for autonomous driving.arXiv preprint arXiv:2408.06832, 2024

work page arXiv 2024
[6]

Delving into the devils of bird’s-eye-view perception: A review, evaluation and recipe.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(4):2151–2170, 2023

Hongyang Li, Chonghao Sima, Jifeng Dai, Wenhai Wang, Lewei Lu, Huijie Wang, Jia Zeng, Zhiqi Li, Jiazhi Yang, Hanming Deng, et al. Delving into the devils of bird’s-eye-view perception: A review, evaluation and recipe.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(4):2151–2170, 2023

work page 2023
[7]

Trajectory-guided control prediction for end-to-end autonomous driving: A simple yet strong baseline

Penghao Wu, Xiaosong Jia, Li Chen, Junchi Yan, Hongyang Li, and Yu Qiao. Trajectory-guided control prediction for end-to-end autonomous driving: A simple yet strong baseline. InNeurIPS, 2022

work page 2022
[8]

Hidden biases of end-to-end driving models

Bernhard Jaeger, Kashyap Chitta, and Andreas Geiger. Hidden biases of end-to-end driving models. In Proc. of the IEEE International Conf. on Computer Vision (ICCV), 2023

work page 2023
[9]

Don’t shake the wheel: Momentum-aware planning in end-to-end autonomous driving.arXiv preprint arXiv:2503.03125, 2025

Ziying Song, Caiyan Jia, Lin Liu, Hongyu Pan, Yongchang Zhang, Junming Wang, Xingyu Zhang, Shaoqing Xu, Lei Yang, and Yadan Luo. Don’t shake the wheel: Momentum-aware planning in end-to-end autonomous driving.arXiv preprint arXiv:2503.03125, 2025

work page arXiv 2025
[10]

Diffad: A unified diffusion modeling approach for autonomous driving.arXiv preprint arXiv:2503.12170, 2025

Tao Wang, Cong Zhang, Xingguang Qu, Kun Li, Weiwei Liu, and Chang Huang. Diffad: A unified diffusion modeling approach for autonomous driving.arXiv preprint arXiv:2503.12170, 2025

work page arXiv 2025
[11]

Amp: Autoregressive motion prediction revisited with next token prediction for autonomous driving.arXiv preprint arXiv:2403.13331, 2024

Xiaosong Jia, Shaoshuai Shi, Zijun Chen, Li Jiang, Wenlong Liao, Tao He, and Junchi Yan. Amp: Autoregressive motion prediction revisited with next token prediction for autonomous driving.arXiv preprint arXiv:2403.13331, 2024

work page arXiv 2024
[12]

Bench2drive-r: Turning real world data into reactive closed-loop autonomous driving benchmark by generative model.arXiv preprint arXiv:2412.09647, 2024

Junqi You, Xiaosong Jia, Zhiyuan Zhang, Yutao Zhu, and Junchi Yan. Bench2drive-r: Turning real world data into reactive closed-loop autonomous driving benchmark by generative model.arXiv preprint arXiv:2412.09647, 2024

work page arXiv 2024
[13]

Waslander, Yu Liu, and Hongsheng Li

Hao Shao, Yuxuan Hu, Letian Wang, Steven L. Waslander, Yu Liu, and Hongsheng Li. Lmdrive: Closed- loop end-to-end driving with large language models, 2023

work page 2023
[14]

Drivelm: Driving with graph visual ques- tion answering

Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Ping Luo, Andreas Geiger, and Hongyang Li. Drivelm: Driving with graph visual question answering.arXiv preprint arXiv:2312.14150, 2023

work page arXiv 2023
[15]

Asynchronous large language model enhanced planner for autonomous driving, 2024

Yuan Chen, Zi han Ding, Ziqin Wang, Yan Wang, Lijun Zhang, and Si Liu. Asynchronous large language model enhanced planner for autonomous driving, 2024

work page 2024
[17]

Genad: Generative end-to-end autonomous driving.arXiv preprint arXiv: 2402.11502, 2024

Wenzhao Zheng, Ruiqi Song, Xianda Guo, Chenming Zhang, and Long Chen. Genad: Generative end-to-end autonomous driving.arXiv preprint arXiv: 2402.11502, 2024

work page arXiv 2024
[18]

Ide-net: Interactive driving event and pattern extraction from human data.IEEE robotics and automation letters, 6(2):3065–3072, 2021

Xiaosong Jia, Liting Sun, Masayoshi Tomizuka, and Wei Zhan. Ide-net: Interactive driving event and pattern extraction from human data.IEEE robotics and automation letters, 6(2):3065–3072, 2021

work page 2021
[19]

Activead: Planning- oriented active learning for end-to-end autonomous driving, 2024

Han Lu, Xiaosong Jia, Yichen Xie, Wenlong Liao, Xiaokang Yang, and Junchi Yan. Activead: Planning- oriented active learning for end-to-end autonomous driving, 2024

work page 2024
[20]

Drivemlm: Aligning multi-modal large language models with behavioral planning states for au- tonomous driving

Wenhai Wang, Jiangwei Xie, ChuanYang Hu, Haoming Zou, Jianan Fan, Wenwen Tong, Yang Wen, Silei Wu, Hanming Deng, Zhiqi Li, et al. Drivemlm: Aligning multi-modal large language models with behavioral planning states for autonomous driving.arXiv preprint arXiv:2312.09245, 2023. 11

work page arXiv 2023
[21]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational conference on machine learning, pages 19730–19742. PMLR, 2023

work page 2023
[22]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. pi_0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[23]

$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. pi_0.5: a vision-language-action model with open-world generalization.arXiv preprint arXiv:2504.16054, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[24]

Llm4drive: A survey of large language models for autonomous driving.ArXiv, abs/2311.01043, 2023

Zhenjie Yang, Xiaosong Jia, Hongyang Li, and Junchi Yan. Llm4drive: A survey of large language models for autonomous driving.ArXiv, abs/2311.01043, 2023

work page arXiv 2023
[25] Katrin Renz, Long Chen, Ana-Maria Marcu, Jan Hünermann, Benoit Hanotte, Alice Karnsund, Jamie Shotton, Elahe Arani, and Oleg Sinavski. Carllava: Vision language models for camera-only closed-loop driving, 2024.
[26] Zhangyang Qi, Ye Fang, Zeyi Sun, Xiaoyang Wu, Tong Wu, Jiaqi Wang, Dahua Lin, and Hengshuang Zhao. Gpt4point: A unified framework for point-language understanding and generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26417–26427, 2024.
[27] Xiaosong Jia, Liting Sun, Hang Zhao, Masayoshi Tomizuka, and Wei Zhan. Multi-agent trajectory prediction by combining egocentric and allocentric views. In Conference on Robot Learning, pages 1434–1443. PMLR, 2022.
[28] Xiaosong Jia, Li Chen, Penghao Wu, Jia Zeng, Junchi Yan, Hongyang Li, and Yu Qiao. Towards capturing the temporal dynamics for trajectory prediction: a coarse-to-fine approach. In CoRL, pages 910–920. PMLR, 2023.
[29] Xiaosong Jia, Penghao Wu, Li Chen, Yu Liu, Hongyang Li, and Junchi Yan. Hdgt: Heterogeneous driving graph transformer for multi-agent trajectory prediction via scene encoding. IEEE transactions on pattern analysis and machine intelligence, 45(11):13860–13875, 2023.
[30] Weilin Cai, Juyong Jiang, Fan Wang, Jing Tang, Sunghun Kim, and Jiayi Huang. A survey on mixture of experts in large language models. IEEE Transactions on Knowledge and Data Engineering, 2025.
[31] Zhongwei Wan, Xin Wang, Che Liu, Samiul Alam, Yu Zheng, et al. Efficient large language models: A survey. arXiv preprint arXiv:2312.03863, 1, 2023.
[32] Tong Zhu, Xiaoye Qu, Daize Dong, Jiacheng Ruan, Jingqi Tong, Conghui He, and Yu Cheng. Llama-moe: Building mixture-of-experts from llama with continual pre-training. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 15913–15923, 2024.
[33] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
[34] OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner...
[35] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.
[36] Kairui Yang, Zihao Guo, Gengjie Lin, Haotian Dong, Zhao Huang, Yipeng Wu, Die Zuo, Jibin Peng, Ziyuan Zhong, Xin Wang, et al. Trajectory-llm: A language-based data generator for trajectory prediction in autonomous driving. In The Thirteenth International Conference on Learning Representations, 2025.
[37] Cunxin Fan, Xiaosong Jia, Yihang Sun, Yixiao Wang, Jianglan Wei, Ziyang Gong, Xiangyu Zhao, Masayoshi Tomizuka, Xue Yang, Junchi Yan, et al. Interleave-vla: Enhancing robot manipulation with interleaved image-text instructions. arXiv preprint arXiv:2505.02152, 2025.
[38] Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kwan-Yee K Wong, Zhenguo Li, and Hengshuang Zhao. Drivegpt4: Interpretable end-to-end autonomous driving via large language model. IEEE Robotics and Automation Letters, 2024.
[39] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024.
[40] Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, page 02783649241273668, 2023.
[41] Damai Dai, Chengqi Deng, Chenggang Zhao, RX Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Yu Wu, et al. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. arXiv preprint arXiv:2401.06066, 2024.
[42] Suning Huang, Zheyu Zhang, Tianhai Liang, Yihan Xu, Zhehao Kou, Chenhao Lu, Guowei Xu, Zhengrong Xue, and Huazhe Xu. Mentor: Mixture-of-experts network with task-oriented perturbation for visual reinforcement learning. arXiv preprint arXiv:2410.14972, 2024.
[43] Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer. arXiv preprint arXiv:2407.07726, 2024.
[44] Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024.
[45] DeepSeek-AI. Deepseek-v3 technical report, 2024.
[46] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. Carla: An open urban driving simulator. In Conference on robot learning, pages 1–16. PMLR, 2017.
[47] Xiaosong Jia, Zhenjie Yang, Qifeng Li, Zhiyuan Zhang, and Junchi Yan. Bench2drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving. In NeurIPS 2024 Datasets and Benchmarks Track, 2024.
[48] Jiang-Tian Zhai, Ze Feng, Jinhao Du, Yongqiang Mao, Jiang-Jiang Liu, Zichang Tan, Yifu Zhang, Xiaoqing Ye, and Jingdong Wang. Rethinking the open-loop evaluation of end-to-end autonomous driving in nuScenes. arXiv preprint arXiv:2305.10430, 2023.
[49] Xiaosong Jia, Penghao Wu, Li Chen, Jiangwei Xie, Conghui He, Junchi Yan, and Hongyang Li. Think twice before driving: Towards scalable decoders for end-to-end autonomous driving. In CVPR, 2023.
[50] Xiaosong Jia, Yulu Gao, Li Chen, Junchi Yan, Patrick Langechuan Liu, and Hongyang Li. Driveadapter: Breaking the coupling barrier of perception and planning in end-to-end autonomous driving. In ICCV, 2023.
[51] Xiaosong Jia, Junqi You, Zhiyuan Zhang, and Junchi Yan. Drivetransformer: Unified transformer for scalable end-to-end autonomous driving. In The Thirteenth International Conference on Learning Representations, 2025.
[52] Yingyan Li, Yuqi Wang, Yang Liu, Jiawei He, Lue Fan, and Zhaoxiang Zhang. End-to-end driving with online trajectory evaluation via bev world model. arXiv preprint arXiv:2504.01941, 2025.

A Annotation for Router

Vision Router: We developed a set of heuristic rules based on annotation information from the Bench2Drive dataset to identify special driving sce...

[1] Kashyap Chitta, Aditya Prakash, Bernhard Jaeger, Zehao Yu, Katrin Renz, and Andreas Geiger. Transfuser: Imitation with transformer-based sensor fusion for autonomous driving. TPAMI, 2023.

[2] Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. In CVPR, pages 17853–17862, 2023.

[3] Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. Vad: Vectorized scene representation for efficient autonomous driving. ICCV, 2023.

[4] Penghao Wu, Li Chen, Hongyang Li, Xiaosong Jia, Junchi Yan, and Yu Qiao. Policy pre-training for autonomous driving via self-supervised geometric modeling, 2023.

[5] Yutao Zhu, Xiaosong Jia, Xinyu Yang, and Junchi Yan. Flatfusion: Delving into details of sparse transformer-based camera-lidar fusion for autonomous driving. arXiv preprint arXiv:2408.06832, 2024.

[6] Hongyang Li, Chonghao Sima, Jifeng Dai, Wenhai Wang, Lewei Lu, Huijie Wang, Jia Zeng, Zhiqi Li, Jiazhi Yang, Hanming Deng, et al. Delving into the devils of bird’s-eye-view perception: A review, evaluation and recipe. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(4):2151–2170, 2023.

[7] Penghao Wu, Xiaosong Jia, Li Chen, Junchi Yan, Hongyang Li, and Yu Qiao. Trajectory-guided control prediction for end-to-end autonomous driving: A simple yet strong baseline. In NeurIPS, 2022.

[8] Bernhard Jaeger, Kashyap Chitta, and Andreas Geiger. Hidden biases of end-to-end driving models. In Proc. of the IEEE International Conf. on Computer Vision (ICCV), 2023.

[9] Ziying Song, Caiyan Jia, Lin Liu, Hongyu Pan, Yongchang Zhang, Junming Wang, Xingyu Zhang, Shaoqing Xu, Lei Yang, and Yadan Luo. Don’t shake the wheel: Momentum-aware planning in end-to-end autonomous driving. arXiv preprint arXiv:2503.03125, 2025.

[10] Tao Wang, Cong Zhang, Xingguang Qu, Kun Li, Weiwei Liu, and Chang Huang. Diffad: A unified diffusion modeling approach for autonomous driving. arXiv preprint arXiv:2503.12170, 2025.

[11] Xiaosong Jia, Shaoshuai Shi, Zijun Chen, Li Jiang, Wenlong Liao, Tao He, and Junchi Yan. Amp: Autoregressive motion prediction revisited with next token prediction for autonomous driving. arXiv preprint arXiv:2403.13331, 2024.

[12] Junqi You, Xiaosong Jia, Zhiyuan Zhang, Yutao Zhu, and Junchi Yan. Bench2drive-r: Turning real world data into reactive closed-loop autonomous driving benchmark by generative model. arXiv preprint arXiv:2412.09647, 2024.

[13] Hao Shao, Yuxuan Hu, Letian Wang, Steven L. Waslander, Yu Liu, and Hongsheng Li. Lmdrive: Closed-loop end-to-end driving with large language models, 2023.

[14] Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Ping Luo, Andreas Geiger, and Hongyang Li. Drivelm: Driving with graph visual question answering. arXiv preprint arXiv:2312.14150, 2023.

[15] Yuan Chen, Zi han Ding, Ziqin Wang, Yan Wang, Lijun Zhang, and Si Liu. Asynchronous large language model enhanced planner for autonomous driving, 2024.

[17] Wenzhao Zheng, Ruiqi Song, Xianda Guo, Chenming Zhang, and Long Chen. Genad: Generative end-to-end autonomous driving. arXiv preprint arXiv:2402.11502, 2024.

[18] Xiaosong Jia, Liting Sun, Masayoshi Tomizuka, and Wei Zhan. Ide-net: Interactive driving event and pattern extraction from human data. IEEE robotics and automation letters, 6(2):3065–3072, 2021.

[19] Han Lu, Xiaosong Jia, Yichen Xie, Wenlong Liao, Xiaokang Yang, and Junchi Yan. Activead: Planning-oriented active learning for end-to-end autonomous driving, 2024.
