arxiv: 2604.19710 · v1 · submitted 2026-04-21 · 💻 cs.CV

Recognition: unknown

SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model

Zewei Zhou , Ruining Yang , Xuewei (Tony) Qi , Yiluan Guo , Sherry X. Chen , Tao Feng , Kateryna Pistunova , Yishan Shen , Lili Su , Jiaqi Ma


Pith reviewed 2026-05-10 02:31 UTC · model grok-4.3

classification 💻 cs.CV

keywords Vision-Language-Action · Autonomous Driving · Flow Matching · GRPO Post-training · Negative-Recovery Samples · Trajectory Planning · Reasoning Dataset


The pith

SpanVLA bridges vision-language reasoning to a flow-matching policy conditioned on historical trajectories to generate driving actions faster while learning recoveries from negative examples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SpanVLA as a framework that pairs autoregressive vision-language reasoning with a flow-matching action expert for end-to-end autonomous driving. It creates an efficient bridge so that VLM guidance and past trajectory data condition the flow-matching policy, cutting the time needed to plan future paths. The work also adds GRPO-based post-training on a new dataset called mReasoning that includes negative-recovery samples, letting the model learn both good behaviors and how to avoid or fix typical mistakes. Experiments on NAVSIM benchmarks show competitive planning results and better handling of complex scenarios. The central goal is to overcome the latency and fragility that limit current vision-language-action models in long-tail driving situations.

Core claim

SpanVLA integrates an autoregressive VLM for reasoning with a flow-matching policy for action generation. The efficient bridge conditions the flow-matching policy on historical trajectory initialization to leverage VLM vision and reasoning guidance, which reduces inference time. GRPO post-training on the mReasoning dataset enables learning from both positive samples and negative-recovery behaviors, improving robustness in reasoning-demanding scenarios. This yields competitive performance on NAVSIM v1 and v2.

What carries the argument

The efficient bridge that conditions a flow-matching policy on historical trajectory initialization and VLM guidance to plan future trajectories.
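The sampling side of such a bridge can be sketched as a few Euler steps of a flow-matching ODE that start from a history-based guess rather than from noise. This is a minimal sketch under stated assumptions, not the paper's implementation: `init_from_history`, `toy_velocity`, and `sample_trajectory` are hypothetical names, the constant-velocity extrapolation is an assumed form of "historical trajectory initialization", and the velocity field is a closed-form stand-in that is handed the target directly, whereas a real model would predict the velocity from VLM features.

```python
import numpy as np

def init_from_history(history, horizon):
    """Seed the future trajectory by extrapolating the last observed step.

    history: (T, 2) array of past (x, y) positions, T >= 2.
    Assumption: constant-velocity extrapolation stands in for the paper's
    historical initialization, whose exact form is not given in the source.
    """
    step = history[-1] - history[-2]
    offsets = np.arange(1, horizon + 1)[:, None] * step
    return history[-1] + offsets

def toy_velocity(x, t, target):
    """Stand-in for a learned velocity field (hypothetical).

    For the straight-line path x_t = (1 - t) * x_0 + t * x_1, the conditional
    velocity is (x_1 - x_t) / (1 - t). A trained network would estimate this
    from conditioning features instead of being given the target.
    """
    return (target - x) / (1.0 - t)

def sample_trajectory(history, target, horizon=8, num_steps=5):
    """Integrate the flow ODE from t = 0 to t = 1 with forward Euler steps."""
    x = init_from_history(history, horizon)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so the division is safe
        x = x + dt * toy_velocity(x, t, target)
    return x
```

The cost of sampling here is `num_steps` evaluations of the velocity field regardless of how many waypoints the trajectory contains, which is the property the latency argument relies on.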

Load-bearing premise

Conditioning the flow-matching policy on historical trajectory initialization plus VLM guidance will produce safe trajectories across real-world driving distributions without extra constraints.

What would settle it

A recorded driving sequence in which the generated trajectory collides or fails to recover in a scenario covered by the negative-recovery training data.

Figures

Figures reproduced from arXiv: 2604.19710 by Jiaqi Ma, Kateryna Pistunova, Lili Su, Ruining Yang, Sherry X. Chen, Tao Feng, Xuewei (Tony) Qi, Yiluan Guo, Yishan Shen, Zewei Zhou.

**Figure 1.** SpanVLA is a novel end-to-end autonomous driving framework, integrating the autoregressive reasoning and flow-matching action expert. It leverages a vision-language model (VLM) with chain-of-thought reasoning as the backbone, and introduces an efficient bridge to extract the multi-granular features from the VLM. Moreover, a flow-matching action expert is introduced to efficiently generate a continuous tr…

**Figure 2.** Overview of the efficient action bridging of the SpanVLA model. The VLM backbone leverages the autoregressive decoding to generate the reasoning results, and we introduce an action bridging to utilize the sparse KV cache to efficiently generate the continuous trajectory with historical initialization based on flow-matching, avoiding the linearly increasing latency of the autoregressive decoding with the lo…

**Figure 3.** mReasoning data distribution and typical negative-recovery samples.

During training, to enable the model to learn how to reason for planning, we introduce an additional discrete action generation task following reasoning in the VLM, which unifies reasoning and planning within the SFT, as follows: $[\mathcal{T}_{\text{Reason}}, (A_{\text{token}})] = \mathrm{VLM}(\mathcal{V}^t, \mathcal{T}^t)$; A_{\tex…

**Figure 4.** RFT results of SpanVLA in the nuPlan dataset. (a) Comparison of PDMS among different settings of RFT training samples; (b) Qualitative comparison of planning and reasoning performance in positive samples before and after RFT.

**Figure 5.** RFT data-recipe comparison. Blue: fixed 2K positive warm-up (w), varying positive (p)/negative (n) ratios in the remaining samples. Red: fixed warm-up and positives, replacing negatives with recovery (r, partial to full). Green: no warm-up (replaced by positives), adding either negatives or recovery.

Effect of Negative Penalty and Recovery Reward. We study the effect of the proposed negative-behavior penal…

**Figure 6.** Comparison of planning and reasoning performance of SpanVLA in negative samples of mReasoning before and after RFT with negative-recovery samples. Panel labels: Left, Front, Right, Back; Ground Truth Planning. Slow Thinking: <think> This is a complex scenario requiring additional reasoning. During the right turn, cones and a keep right sig…

**Figure 7.** Comparison of planning and reasoning performance of SpanVLA in recovery samples of mReasoning before and after RFT with negative-recovery samples.

5 Conclusion. We proposed SpanVLA, a VLA framework equipped with an efficient action bridge and learned from real-world negative-recovery samples for autonomous driving. To overcome the linearly increasing latency of autoregressive decoding, SpanVLA integrates a…

Original abstract

Vision-Language-Action (VLA) models offer a promising autonomous driving paradigm for leveraging world knowledge and reasoning capabilities, especially in long-tail scenarios. However, existing VLA models often struggle with the high latency in action generation using an autoregressive generation framework and exhibit limited robustness. In this paper, we propose SpanVLA, a novel end-to-end autonomous driving framework, integrating an autoregressive reasoning and a flow-matching action expert. First, SpanVLA introduces an efficient bridge to leverage the vision and reasoning guidance of VLM to efficiently plan future trajectories using a flow-matching policy conditioned on historical trajectory initialization, which significantly reduces inference time. Second, to further improve the performance and robustness of the SpanVLA model, we propose a GRPO-based post-training method to enable the VLA model not only to learn from positive driving samples but also to learn how to avoid the typical negative behaviors and learn recovery behaviors. We further introduce mReasoning, a new real-world driving reasoning dataset, focusing on complex, reasoning-demanding scenarios and negative-recovery samples. Extensive experiments on the NAVSIM (v1 and v2) demonstrate the competitive performance of the SpanVLA model. Additionally, the qualitative results across diverse scenarios highlight the planning performance and robustness of our model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SpanVLA combines VLM reasoning with flow-matching action generation and GRPO training on negative-recovery samples to address latency and robustness in driving VLAs, but the evaluation details are too thin to judge whether the gains are real or generalizable.


SpanVLA's main contribution is swapping autoregressive action output for a flow-matching policy that takes VLM reasoning plus historical trajectory initialization as conditioning. This is presented as a direct way to cut inference time while keeping the language model's planning guidance. The second piece is GRPO post-training that explicitly uses negative-recovery samples so the model learns both good trajectories and how to avoid or recover from bad ones. They also release mReasoning, a dataset aimed at complex reasoning cases and those negative examples, which is a practical addition since most driving data stays in the nominal regime.
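To make the post-training idea concrete, the sketch below shows the group-relative advantage that GRPO uses (each sampled rollout's reward is normalized by the mean and standard deviation of its own group), combined with a shaped reward that penalizes negative behavior and rewards recovery. The normalization is the standard GRPO form; everything else is an assumption for illustration: `shaped_reward`, its arguments, and the values `negative_penalty=1.0` and `recovery_bonus=0.5` are hypothetical and are not taken from the paper.

```python
import numpy as np

def shaped_reward(planning_score, is_negative_behavior, is_recovery,
                  negative_penalty=1.0, recovery_bonus=0.5):
    """Hypothetical reward shaping for one sampled rollout.

    planning_score: base score for the rollout (for example a planning metric).
    The penalty and bonus magnitudes are illustrative assumptions.
    """
    reward = planning_score
    if is_negative_behavior:
        reward -= negative_penalty  # discourage reproducing a known bad behavior
    if is_recovery:
        reward += recovery_bonus    # encourage steering back to a safe plan
    return reward

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize rewards within one sampled group."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```

Because the advantage is relative to the group, a rollout that repeats a penalized behavior is pushed down only when other rollouts for the same scene did better, which is why the mix of positive, negative, and recovery samples in each batch matters.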

Referee Report

2 major / 2 minor

Summary. The paper proposes SpanVLA, an end-to-end Vision-Language-Action framework for autonomous driving that combines autoregressive VLM reasoning with a flow-matching action expert. It introduces an efficient bridge to condition a flow-matching policy on VLM guidance and historical trajectory initialization for reduced inference latency, a GRPO-based post-training procedure to learn from negative-recovery samples in addition to positive ones, and the mReasoning dataset focused on complex reasoning and recovery scenarios. Experiments are reported to show competitive performance on NAVSIM v1 and v2 with qualitative robustness gains.

Significance. If the quantitative claims hold with proper validation, the work could advance efficient VLA models for driving by demonstrating a practical bridge between VLMs and flow-matching policies plus negative-sample post-training, potentially aiding long-tail scenario handling. The mReasoning dataset may provide a useful resource for reasoning-focused driving research.

major comments (2)

Abstract: The claim of 'competitive performance' on NAVSIM v1/v2 and 'improved robustness' is asserted without any quantitative metrics, baselines, error bars, ablation results, or statistical details, preventing assessment of whether the efficient bridge or GRPO components deliver measurable gains over prior VLA methods.
The central construction (efficient bridge + flow-matching policy conditioned on historical trajectories and VLM output): No analysis is provided of out-of-distribution failure modes, mode collapse risks, or uncertainty quantification for the learned conditional distribution, which is load-bearing for the robustness and safety claims across real-world driving distributions.
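One simple probe for the mode-collapse concern raised above is to draw several trajectories for the same scene and measure how far they spread at each waypoint. The following is a minimal sketch of such a check, not an analysis from the paper: `waypoint_spread`, `looks_collapsed`, and the `min_spread` threshold of 0.1 are hypothetical choices made for illustration.

```python
import numpy as np

def waypoint_spread(samples):
    """Per-waypoint spread across sampled trajectories.

    samples: array of shape (num_samples, horizon, 2) holding (x, y) waypoints.
    Returns an array of shape (horizon,) with the root-mean-square distance of
    the samples from their mean position at each waypoint.
    """
    samples = np.asarray(samples, dtype=float)
    mean_traj = samples.mean(axis=0)                      # (horizon, 2)
    dists = np.linalg.norm(samples - mean_traj, axis=-1)  # (num_samples, horizon)
    return np.sqrt((dists ** 2).mean(axis=0))

def looks_collapsed(samples, min_spread=0.1):
    """Flag a sample set whose final waypoints are nearly identical.

    min_spread is an illustrative threshold in the same units as the waypoints.
    """
    return bool(waypoint_spread(samples)[-1] < min_spread)
```

A low spread is not a failure by itself, since many scenes have a single sensible plan; the check is more informative on scenes where several distinct maneuvers are known to be valid.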

minor comments (2)

Abstract: The description of the 'efficient bridge' and 'GRPO-based post-training' would benefit from a high-level diagram or pseudocode to clarify the integration of autoregressive reasoning with the flow-matching expert.
The introduction of mReasoning is noted as a contribution, but the abstract does not specify its size, collection protocol, or how negative-recovery samples are annotated, which would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve the presentation of results and analysis.


Referee: Abstract: The claim of 'competitive performance' on NAVSIM v1/v2 and 'improved robustness' is asserted without any quantitative metrics, baselines, error bars, ablation results, or statistical details, preventing assessment of whether the efficient bridge or GRPO components deliver measurable gains over prior VLA methods.

Authors: We agree that the abstract is high-level and does not include specific metrics. The full manuscript reports quantitative results on NAVSIM v1 and v2 with baseline comparisons, ablations for the bridge and GRPO components, and qualitative robustness examples in Section 4. We will revise the abstract to include key performance numbers and references to the supporting experiments and ablations. revision: yes

Referee: The central construction (efficient bridge + flow-matching policy conditioned on historical trajectories and VLM output): No analysis is provided of out-of-distribution failure modes, mode collapse risks, or uncertainty quantification for the learned conditional distribution, which is load-bearing for the robustness and safety claims across real-world driving distributions.

Authors: The manuscript supports robustness claims primarily through the GRPO post-training on negative-recovery samples and the mReasoning dataset, with qualitative results across diverse scenarios. No dedicated quantitative analysis of OOD failure modes, mode collapse, or uncertainty quantification is currently included. We will add a new discussion subsection addressing these aspects, including limitations and future work, to better substantiate the safety-related claims. revision: yes

Circularity Check

0 steps flagged

No circularity: SpanVLA framework and training are additive proposals validated on external benchmarks.


The paper introduces SpanVLA as a composite architecture (autoregressive VLM reasoning bridged to a flow-matching policy conditioned on historical trajectories, plus GRPO post-training on negative-recovery samples and a new mReasoning dataset). No equations, derivations, or self-referential definitions appear in the provided abstract or description that reduce claimed performance gains to quantities fitted from the same data by construction. Results are reported as empirical outcomes on NAVSIM v1/v2 rather than as logical consequences of prior fitted parameters or self-citations. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities beyond the high-level model components can be extracted or verified.

pith-pipeline@v0.9.0 · 5567 in / 1049 out tokens · 36982 ms · 2026-05-10T02:31:11.858900+00:00 · methodology


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MDrive: Benchmarking Closed-Loop Cooperative Driving for End-to-End Multi-agent Systems
cs.RO 2026-05 unverdicted novelty 7.0

MDrive benchmark shows multi-agent cooperative driving systems generally outperform single-agent ones in closed-loop settings but perception sharing does not always improve planning and negotiation can harm performanc...

DriveFuture: Future-Aware Latent World Models for Autonomous Driving
cs.CV 2026-05 unverdicted novelty 6.0

DriveFuture achieves SOTA results on NAVSIM by conditioning latent world model states on future predictions to directly inform trajectory planning.

Reference graph

Works this paper leans on

86 extracted references · 58 canonical work pages · cited by 2 Pith papers · 14 internal anchors

[1]

In: 2025 IEEE/CVF Winter Conference on Applications of Computer Vi- sion (WACV)

Arai, H., Miwa, K., Sasaki, K., Watanabe, K., Yamaguchi, Y., Aoki, S., Ya- mamoto, I.: Covla: Comprehensive vision-language-action dataset for autonomous driving. In: 2025 IEEE/CVF Winter Conference on Applications of Computer Vi- sion (WACV). pp. 1933–1943. IEEE (2025)

2025
[2]

Qwen2.5-VL Technical Report

Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Driving with regulation: Interpretable decision-making for autonomous vehicles with retrieval-augmented reasoning via llm.arXiv preprint arXiv:2410.04759, 2024

Cai, T., Liu, Y., Zhou, Z., Ma, H., Zhao, S.Z., Wu, Z., Ma, J.: Driving with regulation: Interpretable decision-making for autonomous vehicles with retrieval- augmented reasoning via llm. arXiv preprint arXiv:2410.04759 (2024)

work page arXiv 2024
[4]

Cao, W., Hallgarten, M., Li, T., Dauner, D., Gu, X., Wang, C., Miron, Y., Aiello, M.,Li,H.,Gilitschenski,I.,etal.:Pseudo-simulationforautonomousdriving.arXiv preprint arXiv:2506.04218 (2025)

work page arXiv 2025
[5]

IEEE Transactions on Pattern Analysis and Ma- chine Intelligence (2024)

Chen, L., Wu, P., Chitta, K., Jaeger, B., Geiger, A., Li, H.: End-to-end autonomous driving: Challenges and frontiers. IEEE Transactions on Pattern Analysis and Ma- chine Intelligence (2024)

2024
[6]

IEEE trans- actions on pattern analysis and machine intelligence45(11), 12878–12895 (2022)

Chitta, K., Prakash, A., Jaeger, B., Yu, Z., Renz, K., Geiger, A.: Transfuser: Imi- tation with transformer-based sensor fusion for autonomous driving. IEEE trans- actions on pattern analysis and machine intelligence45(11), 12878–12895 (2022)

2022
[7]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

Contributors, O.: Openscene: The largest up-to-date 3d occupancy predic- tion benchmark in autonomous driving.https://github.com/OpenDriveLab/ OpenScene(2023)

2023
[9]

DriveFine : Refining-augmented masked diffusion VLA for precise and robust driving

Dang, C., Ang, S., Li, Y., Tian, H., Wang, J., Li, G., Ye, H., Ma, J., Chen, L., Wang, Y.: Drivefine: Refining-augmented masked diffusion vla for precise and ro- bust driving. arXiv preprint arXiv:2602.14577 (2026)

work page arXiv 2026
[10]

Advances in Neural Information Processing Systems37, 28706–28719 (2024)

Dauner, D., Hallgarten, M., Li, T., Weng, X., Huang, Z., Yang, Z., Li, H., Gilitschenski, I., Ivanovic, B., Pavone, M., et al.: Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking. Advances in Neural Information Processing Systems37, 28706–28719 (2024)

2024
[11]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Ettinger, S., Cheng, S., Caine, B., Liu, C., Zhao, H., Pradhan, S., Chai, Y., Sapp, B., Qi, C.R., Zhou, Y., et al.: Large scale interactive motion forecasting for autonomous driving: The waymo open motion dataset. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9710–9719 (2021)

2021
[12]

arXiv preprint arXiv:2509.15968 (2025)

Fang, S., Cui, Y., Liang, H., Lv, C., Hang, P., Sun, J.: Corevla: A dual-stage end- to-end autonomous driving framework for long-tail scenarios via collect-and-refine. arXiv preprint arXiv:2509.15968 (2025)

work page arXiv 2025
[13]

Rap: 3d rasterization augmented end-to-end planning.arXiv preprint arXiv:2510.04333, 2025

Feng, L., Gao, Y., Zablocki, E., Li, Q., Li, W., Liu, S., Cord, M., Alahi, A.: Rap: 3d rasterization augmented end-to-end planning. arXiv preprint arXiv:2510.04333 (2025)

work page arXiv 2025
[14]

IEEE Robotics and Automation Letters11(1), 226–233 (2025) 16 Z

Feng, R., Xi, N., Chu, D., Wang, R., Deng, Z., Wang, A., Lu, L., Wang, J., Huang, Y.: Artemis: Autoregressive end-to-end trajectory planning with mixture of experts for autonomous driving. IEEE Robotics and Automation Letters11(1), 226–233 (2025) 16 Z. Zhou, R. Yang et al

2025
[15]

Orion: A holistic end-to-end autonomous driving framework by vision-language instructed action generation.arXiv preprint arXiv:2503.19755, 2025

Fu, H., Zhang, D., Zhao, Z., Cui, J., Liang, D., Zhang, C., Zhang, D., Xie, H., Wang, B., Bai, X.: Orion: A holistic end-to-end autonomous driving framework by vision-language instructed action generation. arXiv preprint arXiv:2503.19755 (2025)

work page arXiv 2025
[16]

Minddrive: A vision-language-action model for autonomous driving via online reinforcement learning,

Fu, H., Zhang, D., Zhao, Z., Cui, J., Xie, H., Wang, B., Chen, G., Liang, D., Bai, X.: Minddrive: A vision-language-action model for autonomous driving via online reinforcement learning. arXiv Preprint arXiv:2512.13636 (2025)

work page arXiv 2025
[17]

In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)

Gao, H., Chen, S., Jiang, B., Liao, B., Shi, Y., Guo, X., Pu, Y., haoran yin, Li, X., xinbang zhang, ying zhang, Liu, W., Zhang, Q., Wang, X.: RAD: Training an end-to-end driving policy via large-scale 3DGS-based reinforcement learning. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)

2025
[18]

Gemini Team, Google DeepMind: Gemini 3: A new era of intelligence with gemini
[19]

Technical Report (2025),https://deepmind.google/models/gemini/

2025
[20]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al.: Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[21]

ICLR1(2), 3 (2022)

Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: Lora: Low-rank adaptation of large language models. ICLR1(2), 3 (2022)

2022
[22]

Vision-language-action models for autonomous driving: Past, present, and future.arXiv preprint arXiv:2512.16760, 2025

Hu, T., Liu, X., Wang, S., Zhu, Y., Liang, A., Kong, L., Zhao, G., Gong, Z., Cen, J., Huang, Z., et al.: Vision-language-action models for autonomous driving: Past, present, and future. arXiv preprint arXiv:2512.16760 (2025)

work page arXiv 2025
[23]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Hu, Y., Yang, J., Chen, L., Li, K., Sima, C., Zhu, X., Chai, S., Du, S., Lin, T., Wang, W., et al.: Planning-oriented autonomous driving. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 17853– 17862 (2023)

2023
[24]

Emma: End-to-end multimodal model for autonomous driving.arXiv preprint arXiv:2410.23262, 2024

Hwang, J.J., Xu, R., Lin, H., Hung, W.C., Ji, J., Choi, K., Huang, D., He, T., Cov- ington, P., Sapp, B., et al.: Emma: End-to-end multimodal model for autonomous driving. arXiv preprint arXiv:2410.23262 (2024)

work page arXiv 2024
[25]

$\pi^{*}_{0.6}$: a VLA That Learns From Experience

Intelligence, P., Amin, A., Aniceto, R., Balakrishna, A., Black, K., Conley, K., Connors, G., Darpinian, J., Dhabalia, K., DiCarlo, J., et al.: pi0.6: a vla that learns from experience. arXiv preprint arXiv:2511.14759 (2025)

work page Pith review arXiv 2025
[26]

OpenAI o1 System Card

Jaech, A., Kalai, A., Lerer, A., Richardson, A., El-Kishky, A., Low, A., Helyar, A., Madry, A., Beutel, A., Carney, A., et al.: Openai o1 system card. arXiv preprint arXiv:2412.16720 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Jia, X., Wu, P., Chen, L., Xie, J., He, C., Yan, J., Li, H.: Think twice before driving: Towards scalable decoders for end-to-end autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21983–21994 (2023)

2023
[28]

Jia, X., Yang, Z., Li, Q., Zhang, Z., Yan, J.: Bench2drive: Towards multi- abilitybenchmarkingofclosed-loopend-to-endautonomousdriving.arXivpreprint arXiv:2406.03877 (2024)

work page arXiv 2024
[29]

IRL-VLA: Training an vision-language-action policy via reward world model,

Jiang, A., Gao, Y., Wang, Y., Sun, Z., Wang, S., Heng, Y., Sun, H., Tang, S., Zhu, L., Chai, J., et al.: Irl-vla: Training an vision-language-action policy via reward world model. arXiv preprint arXiv:2508.06571 (2025)

work page arXiv 2025
[30]

Senna: Bridging large vision-language models and end-to- end autonomous driving.arXiv preprint arXiv:2410.22313, 2024

Jiang, B., Chen, S., Liao, B., Zhang, X., Yin, W., Zhang, Q., Huang, C., Liu, W., Wang, X.: Senna: Bridging large vision-language models and end-to-end au- tonomous driving. arXiv preprint arXiv:2410.22313 (2024)

work page arXiv 2024
[31]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Jiang, B., Chen, S., Xu, Q., Liao, B., Chen, J., Zhou, H., Zhang, Q., Liu, W., Huang,C.,Wang,X.:Vad:Vectorizedscenerepresentationforefficientautonomous SpanVLA 17 driving. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 8340–8350 (2023)

2023
[32]

Alphadrive: Unleashing the power of vlms in autonomous driving via reinforcement learning and reasoning.arXiv preprint arXiv:2503.07608, 2025

Jiang, B., Chen, S., Zhang, Q., Liu, W., Wang, X.: Alphadrive: Unleashing the power of vlms in autonomous driving via reinforcement learning and reasoning. arXiv preprint arXiv:2503.07608 (2025)

work page arXiv 2025
[33]

In: 2024 IEEE International Conference on Robotics and Automation (ICRA)

Karnchanachari, N., Geromichalos, D., Tan, K.S., Li, N., Eriksen, C., Yaghoubi, S., Mehdipour, N., Bernasconi, G., Fong, W.K., Guo, Y., et al.: Towards learning- based planning: The nuplan benchmark for real-world autonomous driving. In: 2024 IEEE International Conference on Robotics and Automation (ICRA). pp. 629–636. IEEE (2024)

2024
[34]

Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

Kim, M.J., Finn, C., Liang, P.: Fine-tuning vision-language-action models: Opti- mizing speed and success, 2025. URL https://arxiv. org/abs/2502.19645 (2016)

work page internal anchor Pith review arXiv 2025
[35]

OpenVLA: An Open-Source Vision-Language-Action Model

Kim, M.J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., Sanketi, P., et al.: Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246 (2024)

work page internal anchor Pith review arXiv 2024
[36]

Driving on registers.arXiv preprint arXiv:2601.05083, 2026

Kirby, E., Boulch, A., Xu, Y., Yin, Y., Puy, G., Zablocki, É., Bursuc, A., Gi- daris, S., Marlet, R., Bartoccioni, F., et al.: Driving on registers. arXiv preprint arXiv:2601.05083 (2026)

work page arXiv 2026
[37]

IEEE Robotics and Automation Letters11(1), 818–825 (2025)

Lei, M., Zhou, Z., Li, H., Ma, J., Hu, J.: Risk map as middleware: Toward inter- pretable cooperative end-to-end autonomous driving for risk-aware planning. IEEE Robotics and Automation Letters11(1), 818–825 (2025)

2025
[38]

Finetuning generative trajectory model with reinforcement learning from human feedback.arXiv preprint arXiv:2503.10434, 2025

Li, D., Ren, J., Wang, Y., Wen, X., Li, P., Xu, L., Zhan, K., Xia, Z., Jia, P., Lang, X., et al.: Finetuning generative trajectory model with reinforcement learning from human feedback. arXiv preprint arXiv:2503.10434 (2025)

work page arXiv 2025
[39]

Drivevla-w0: World models amplify data scaling law in autonomous driving.arXiv preprint arXiv:2510.12796, 2025

Li, Y., Shang, S., Liu, W., Zhan, B., Wang, H., Wang, Y., Chen, Y., Wang, X., An, Y., Tang, C., et al.: Drivevla-w0: World models amplify data scaling law in autonomous driving. arXiv preprint arXiv:2510.12796 (2025)

work page arXiv 2025
[40]

Li, Y., Wang, Y., Liu, Y., He, J., Fan, L., Zhang, Z.: End-to-end driving with onlinetrajectoryevaluationviabevworldmodel.In:ProceedingsoftheIEEE/CVF International Conference on Computer Vision. pp. 27137–27146 (2025)

2025
[41]

Recogdrive: A reinforced cognitive framework for end-to-end autonomous driving.arXiv preprint arXiv:2506.08052, 2025

Li, Y., Xiong, K., Guo, X., Li, F., Yan, S., Xu, G., Zhou, L., Chen, L., Sun, H., Wang, B., et al.: Recogdrive: A reinforced cognitive framework for end-to-end autonomous driving. arXiv preprint arXiv:2506.08052 (2025)

work page arXiv 2025
[42]

Hydra-MDP: End-to-end Multimodal Planning with Multi-target Hydra-Distillation

Li, Z., Li, K., Wang, S., Lan, S., Yu, Z., Ji, Y., Li, Z., Zhu, Z., Kautz, J., Wu, Z., et al.: Hydra-mdp: End-to-end multimodal planning with multi-target hydra- distillation. arXiv preprint arXiv:2406.06978 (2024)

work page internal anchor Pith review arXiv 2024
[43]

Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving.arXiv preprint arXiv:2411.15139, 2024

Liao, B., Chen, S., Yin, H., Jiang, B., Wang, C., Yan, S., Zhang, X., Li, X., Zhang, Y., Zhang, Q., et al.: Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving. arXiv preprint arXiv:2411.15139 (2024)

work page arXiv 2024
[44]

Cot-drive: Efficient motion forecasting for autonomous driving with llms and chain-of-thought prompting.arXiv preprint arXiv:2503.07234, 2025

Liao, H., Kong, H., Wang, B., Wang, C., Ye, W., He, Z., Xu, C., Li, Z.: Cot-drive: Efficient motion forecasting for autonomous driving with llms and chain-of-thought prompting. arXiv preprint arXiv:2503.07234 (2025)

work page arXiv 2025
[45]

Flow Matching for Generative Modeling

Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. arXiv preprint arXiv:2210.02747 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022
[46]

IEEE Robotics and Automation Letters11(2), 1738–1745 (2025)

Liu, D., Gao, Y., Qian, D., Zhang, Q., Ye, X., Han, J., Zheng, Y., Liu, X., Xia, Z., Ding, D., et al.: Takead: Preference-based post-optimization for end-to-end autonomous driving with expert takeover data. IEEE Robotics and Automation Letters11(2), 1738–1745 (2025)

2025
[47]

DriveWorld-VLA: Unified latent- space world modeling with vision-language-action for autonomous driving.arXiv preprint arXiv:2602.06521, 2026

Liu, L., Song, Z., Jia, C., Ye, H., Hao, X., Chen, L., et al.: Driveworld-vla: Unified latent-space world modeling with vision-language-action for autonomous driving. arXiv preprint arXiv:2602.06521 (2026) 18 Z. Zhou, R. Yang et al

work page arXiv 2026
[48]

CogDriver: Integrating Cognitive Inertia for Temporally Coherent Planning in Autonomous Driving

Liu, P., Ning, Q., Lu, X., Liu, H., Ma, W., She, D., Jia, P., Lang, X., Ma, J.: Omnireason: A temporal-guided vision-language-action framework for autonomous driving. arXiv preprint arXiv:2509.00789 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[49]

arXiv preprint arXiv:2510.00154 (2025)

Liu, X., Sani, M.F., Zhou, Z., Wirbel, J., Zarrin, B., Galeazzi, R.: Robopilot: Gen- eralizable dynamic robotic manipulation with dual-thinking modes. arXiv preprint arXiv:2510.00154 (2025)

work page arXiv 2025
[50]

Ma, Y., Cao, Y., Ding, W., Zhang, S., Wang, Y., Ivanovic, B., Jiang, M., Pavone, M., Xiao, C.: dvlm-ad: Enhance diffusion vision-language-model for driving via controllable reasoning (2025)

2025
[51]

Gpt-driver: Learning to drive with gpt.arXiv preprint arXiv:2310.01415, 2023

Mao, J., Qian, Y., Ye, J., Zhao, H., Wang, Y.: Gpt-driver: Learning to drive with gpt. arXiv preprint arXiv:2310.01415 (2023)

work page arXiv 2023
[52]

o Research: 2025 waymo open dataset challenge: Vision-based end-to-end driv- ing.https://waymo.com/open/challenges/2025/e2e-driving/(2025), accessed: 2025-04-25

2025
[53]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Pan, C., Yaman, B., Nesti, T., Mallik, A., Allievi, A.G., Velipasalar, S., Ren, L.: Vlp: Vision language planning for autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14760– 14769 (2024)

2024
[54]

Nuplanqa: A large-scale dataset and benchmark for multi-view driving scene understanding in multi-modal large language models.arXiv preprint arXiv:2503.12772, 2025

Park, S.Y., Cui, C., Ma, Y., Moradipari, A., Gupta, R., Han, K., Wang, Z.: Nu- planqa: A large-scale dataset and benchmark for multi-view driving scene under- standing in multi-modal large language models. arXiv preprint arXiv:2503.12772 (2025)

[55]

Peng, Z., Ding, W., You, Y., Chen, Y., Luo, W., Tian, T., Cao, Y., Sharma, A., Xu, D., Ivanovic, B., et al.: Counterfactual vla: Self-reflective vision-language-action model with adaptive reasoning. arXiv preprint arXiv:2512.24426 (2025)

[56]

Qian, T., Chen, J., Zhuo, L., Jiao, Y., Jiang, Y.G.: Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 4542–4550 (2024)

[57]

Rafailov, R., Sharma, A., Mitchell, E., Manning, C.D., Ermon, S., Finn, C.: Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems 36, 53728–53741 (2023)

[58]

Rawal, I., Gupta, S., Hu, Y., Zhan, W.: Nord: A data-efficient vision-language-action model that drives without reasoning. arXiv preprint arXiv:2602.21172 (2026)

[59]

Rowe, L., de Schaetzen, R., Girgis, R., Pal, C., Paull, L.: Poutine: Vision-language-trajectory pre-training and reinforcement learning post-training enable robust end-to-end autonomous driving. arXiv preprint arXiv:2506.11234 (2025)

[60]

Shang, S., Chen, Y., Wang, Y., Li, Y., Zhang, Z.: Drivedpo: Policy learning via safety dpo for end-to-end autonomous driving. arXiv preprint arXiv:2509.17940 (2025)

[61]

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al.: Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 (2024)

[62]

Sima, C., Renz, K., Chitta, K., Chen, L., Zhang, H., Xie, C., Beißwenger, J., Luo, P., Geiger, A., Li, H.: Drivelm: Driving with graph visual question answering. In: European Conference on Computer Vision. pp. 256–274. Springer (2024)

[63]

Song, R., Liang, C., Cao, H., Yan, Z., Zimmer, W., Gross, M., Festag, A., Knoll, A.: Collaborative semantic occupancy prediction with hybrid feature fusion in connected automated vehicles. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 17996–18006 (2024)

[64]

Tan, S., Chitta, K., Chen, Y., Tian, R., You, Y., Wang, Y., Luo, W., Cao, Y., Krahenbuhl, P., Pavone, M., et al.: Latent chain-of-thought world modeling for end-to-end driving. arXiv preprint arXiv:2512.10226 (2025)

[65]

Tan, T., Zheng, Y., Liang, R., Wang, Z., Zheng, K., Zheng, J., Li, J., Zhan, X., Liu, J.: Flow matching-based autonomous driving planning with advanced interactive behavior modeling. arXiv preprint arXiv:2510.11083 (2025)

[66]

Tian, X., Gu, J., Li, B., Liu, Y., Wang, Y., Zhao, Z., Zhan, K., Jia, P., Lang, X., Zhao, H.: Drivevlm: The convergence of autonomous driving and large vision-language models. arXiv preprint arXiv:2402.12289 (2024)

[67]

Wang, R., Li, H., Han, X., Zhang, Y., Baldwin, T.: Learning from failure: Integrating negative examples when fine-tuning large language models as agents. arXiv preprint arXiv:2402.11651 (2024)

[68]

Wang, S., Yu, Z., Jiang, X., Lan, S., Shi, M., Chang, N., Kautz, J., Li, Y., Alvarez, J.M.: Omnidrive: A holistic llm-agent framework for autonomous driving with 3d perception, reasoning and planning. arXiv preprint arXiv:2405.01533 (2024)

[69]

Wang, Y., Luo, W., Bai, J., Cao, Y., Che, T., Chen, K., Chen, Y., Diamond, J., Ding, Y., Ding, W., et al.: Alpamayo-r1: Bridging reasoning and action prediction for generalizable autonomous driving in the long tail. arXiv preprint arXiv:2511.00088 (2025)

[70]

Wang, Y., Ding, P., Li, L., Cui, C., Ge, Z., Tong, X., Song, W., Zhao, H., Zhao, W., Hou, P., et al.: Vla-adapter: An effective paradigm for tiny-scale vision-language- action model. arXiv preprint arXiv:2509.09372 (2025)

[71]

Xie, C., Sun, B., Li, T., Wu, J., Hao, Z., Lang, X., Li, H.: Latentvla: Efficient vision-language models for autonomous driving via latent action prediction. arXiv preprint arXiv:2601.05611 (2026)

[72]

Xing, S., Qian, C., Wang, Y., Hua, H., Tian, K., Zhou, Y., Tu, Z.: Openemma: Open-source multimodal model for end-to-end autonomous driving. In: Proceedings of the Winter Conference on Applications of Computer Vision. pp. 1001–1009 (2025)

[73]

Xu, R., Lin, H., Jeon, W., Feng, H., Zou, Y., Sun, L., Gorman, J., Tolstaya, E., Tang, S., White, B., et al.: Wod-e2e: Waymo open dataset for end-to-end driving in challenging long-tail scenarios. arXiv preprint arXiv:2510.26125 (2025)

[74]

Xu, Y., Hu, Y., Zhang, Z., Meyer, G.P., Mustikovela, S.K., Srinivasa, S., Wolff, E.M., Huang, X.: Vlm-ad: End-to-end autonomous driving through vision-language model supervision. arXiv preprint arXiv:2412.14446 (2024)

[75]

Yao, W., Li, Z., Lan, S., Wang, Z., Sun, X., Alvarez, J.M., Wu, Z.: Drivesuprim: Towards precise trajectory selection for end-to-end planning. arXiv preprint arXiv:2506.06659 (2025)

[76]

Yu, Z., Wang, B., Zeng, P., Zhang, H., Zhang, J., Gao, L., Song, J., Sebe, N., Shen, H.T.: A survey on efficient vision-language-action models. arXiv preprint arXiv:2510.24795 (2025)

[77]

Yuan, C., Zhang, Z., Sun, J., Sun, S., Huang, Z., Lee, C.D.W., Li, D., Han, Y., Wong, A., Tee, K.P., et al.: Drama: An efficient end-to-end motion planner for autonomous driving with mamba. arXiv preprint arXiv:2408.03601 (2024)

[78]

Zeng, S., Chang, X., Xie, M., Liu, X., Bai, Y., Pan, Z., Xu, M., Wei, X., Guo, N.: Futuresightdrive: Thinking visually with spatio-temporal cot for autonomous driving. arXiv preprint arXiv:2505.17685 (2025)

[79]

Zhang, L., Yuan, Y., Wu, C., Chang, X., Cai, X., Zeng, S., Shi, L., Wang, S., Zhang, H., Xu, M.: Minddriver: Introducing progressive multimodal reasoning for autonomous driving. arXiv preprint arXiv:2602.21952 (2026)

[80]

Zhao, S.Z., Wang, L., Ruan, H., Bao, Y., Chen, Y., Leng, Z., Ravichandran, A., He, H., Zhou, Z., Han, X., et al.: Bridgesim: Unveiling the ol-cl gap in end-to-end autonomous driving. arXiv preprint arXiv:2604.10856 (2026)
