ChainFlow-VLA: Causal Flow Planning with Vision-Language Models

Feiyang Tan; Gong Chen; Hangning Zhou; Mu Yang; Tingguang Zhou; Xiaolei Wu; Xingtai Gui; Xinlin Wang; Xiyang Wang; Zhi Xu

arxiv: 2605.23270 · v1 · pith:HVCUPIZEnew · submitted 2026-05-22 · 💻 cs.CV · cs.AI· cs.RO

ChainFlow-VLA: Causal Flow Planning with Vision-Language Models

Xiyang Wang , Xinlin Wang , Tingguang Zhou , Gong Chen , Xingtai Gui , Zhi Xu , Xiaolei Wu , Feiyang Tan

show 2 more authors

Hangning Zhou Mu Yang

This is my paper

Pith reviewed 2026-05-25 04:35 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.RO

keywords autonomous drivingvision-language modelstrajectory planningcausal modelingdiffusion modelsend-to-end planningresidual refinement

0 comments

The pith

ChainFlow-VLA unifies autoregressive causal modes with VLM-conditioned diffusion refinement for autonomous driving planning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the mismatch between causal temporal reasoning in autoregressive models and global trajectory consistency in diffusion models by proposing a single probabilistic framework. It generates discrete causal trajectory modes with an autoregressive generator, then applies a diffusion refiner that conditions on Vision-Language Model hidden states to correct those modes in residual space. This unification is intended to produce reliable plans in interactive and safety-critical driving scenarios. If the approach holds, it would allow high-level scene understanding from VLMs to guide fine-grained adjustments without losing causal structure. Experiments report a score of 94.85 on the NAVSIM v1 leaderboard, matching reported human performance.

Core claim

ChainFlow-VLA formulates planning as a mixture over AR-induced modes and learns VLM-conditioned residual distributions over these modes. An autoregressive generator produces a discrete set of causal trajectory modes, followed by a diffusion-based refiner that leverages VLM hidden states as semantic priors to perform mode-conditioned correction in residual space while preserving causal structure.

What carries the argument

The ChainFlow architecture: an autoregressive Chain that produces causal trajectory modes, followed by a Flow diffusion refiner conditioned on VLM hidden states for residual-space corrections.

If this is right

Robust planning performance in ambiguous and long-tail driving scenarios.
State-of-the-art score of 94.85 on NAVSIM v1, matching human-level performance at 94.8.
High-level scene understanding from VLMs is injected directly into fine-grained trajectory adjustments.
Causal structure is maintained while still allowing global optimization in the residual distribution.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The residual correction step could be tested on other sequential prediction tasks that require both local causality and global coherence.
If the VLM conditioning generalizes, the same mixture-of-modes structure might reduce error accumulation in longer-horizon forecasts beyond driving.
The separation into discrete modes followed by continuous refinement offers a template for hybrid discrete-continuous planners in robotics.

Load-bearing premise

VLM hidden states supply semantic priors that enable effective mode-conditioned corrections in residual space without disrupting the causal structure created by the autoregressive generator.

What would settle it

A controlled comparison in which trajectories refined by the VLM-conditioned Flow either violate the original causal ordering from the Chain or show no global consistency gain over standalone autoregressive or diffusion baselines.

Figures

Figures reproduced from arXiv: 2605.23270 by Feiyang Tan, Gong Chen, Hangning Zhou, Mu Yang, Tingguang Zhou, Xiaolei Wu, Xingtai Gui, Xinlin Wang, Xiyang Wang, Zhi Xu.

**Figure 2.** Figure 2: ChainFlow-VLA framework. The model first performs Autoregressive Trajectory Gen [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Qualitative comparison of trajectory predictions on representative NAVSIM scenarios. GT [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Qualitative comparison between BEV-conditioned and VLM-conditioned refinement. GT [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗

read the original abstract

Current end-to-end autonomous driving systems are fundamentally limited by a mismatch between temporal causal reasoning and global trajectory consistency. Autoregressive (AR) models capture interaction-aware temporal dependencies via causal factorization, but their step-wise decoding leads to error accumulation and suboptimal global structure. In contrast, diffusion models optimize trajectories globally but lack explicit causal constraints, making them unreliable in interactive and safety-critical scenarios. This dichotomy reveals a deeper issue: existing methods treat causal modeling and global optimization as separate paradigms, without a principled way to unify them within a single trajectory distribution. To address this, we propose ChainFlow-VLA, which unifies causal generation and global refinement within a unified probabilistic framework. We formulate planning as a mixture over AR-induced modes and learn Vision-Language Model (VLM)-conditioned residual distributions over these modes. An autoregressive generator (Chain) produces a discrete set of causal trajectory modes, followed by a diffusion-based refiner (Flow) that leverages VLM hidden states as semantic priors to perform mode-conditioned correction in residual space while preserving causal structure. This straightforward conditioning seamlessly injects high-level scene understanding into fine-grained trajectory adjustments. Experiments demonstrate that ChainFlow-VLA achieves robust planning in ambiguous and long-tail scenarios, achieving a state-of-the-art score of 94.85 on the NAVSIM v1 leaderboard, matching human-level performance (94.8). Code will be available at https://github.com/AFARI-Research/ChainFlow-VLA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The abstract frames a plausible unification of AR causal modes and VLM residual diffusion but supplies zero equations, ablations, or constraints showing the refiner preserves causality, so the 94.85 NAVSIM claim cannot be evaluated.

read the letter

The paper's central move is to let an autoregressive Chain produce discrete causal trajectory modes, then let a diffusion Flow refiner use VLM hidden states to correct residuals while supposedly keeping the causal factorization intact. That split is the only concrete novelty on offer; no earlier work is shown doing exactly this conditioning inside one probabilistic object. The problem statement itself is clear and worth stating: pure AR drifts on global structure, pure diffusion ignores temporal order in interactive scenes. That diagnosis is accurate and the proposed split is a reasonable attempt to address it. Everything else is thin. The abstract states the 94.85 score and the human-level comparison but gives no dataset statistics, no error bars, no ablation on the VLM conditioning, and no loss term or masking that would force the diffusion step to respect the AR-induced causal order. The stress-test concern therefore lands: without an explicit mechanism tying the refiner output back to the Chain's conditional distributions, the diffusion could alter future points independently and the claimed unification would not hold. Because the manuscript text was not supplied here, it is impossible to check whether the full paper adds those constraints or the missing experiments. As written, the work is aimed at the narrow autonomous-driving planning community. Readers already working on NAVSIM or similar end-to-end stacks might skim it for the architectural sketch, but the absence of verifiable internals means it does not yet support citation or strong claims. A serious editor should send it to review only if the full version contains the equations, ablations, and causal-preservation checks that the abstract omits; otherwise it belongs in the reject pile until those are added.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes ChainFlow-VLA to unify autoregressive causal trajectory generation (via a Chain generator producing discrete modes) with diffusion-based global refinement (via a Flow refiner that conditions on VLM hidden states for mode-specific residual correction). The central claim is that this framework resolves the mismatch between temporal causality and global consistency in end-to-end autonomous driving, achieving a state-of-the-art score of 94.85 on the NAVSIM v1 leaderboard that matches human-level performance (94.8).

Significance. If the unification holds and the performance is reproducible with proper controls, the approach could offer a concrete mechanism for injecting high-level VLM semantics into trajectory distributions while retaining causal factorization, potentially improving robustness in interactive and long-tail driving scenarios over pure AR or pure diffusion baselines.

major comments (2)

[Abstract] Abstract: The description states that the Flow refiner 'leverages VLM hidden states as semantic priors to perform mode-conditioned correction in residual space while preserving causal structure,' but supplies no equations, autoregressive masking strategy on the residual diffusion process, or auxiliary loss (e.g., KL divergence to the Chain's conditional distributions) that would enforce temporal causality during sampling. Without these, the claimed preservation of the AR-induced causal factorization cannot be verified and the performance gain cannot be attributed to the proposed unification.
[Abstract] Abstract: The SOTA claim of 94.85 on NAVSIM v1 is presented without any mention of error bars, ablation studies isolating the Flow refiner's contribution, dataset statistics, or comparison to the human baseline implementation details. This makes the central performance result impossible to evaluate from the given text and leaves open whether the result depends on the causal-global unification or on other unstated factors.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address each major comment below and will make targeted revisions to improve clarity and verifiability without altering the core technical claims.

read point-by-point responses

Referee: [Abstract] Abstract: The description states that the Flow refiner 'leverages VLM hidden states as semantic priors to perform mode-conditioned correction in residual space while preserving causal structure,' but supplies no equations, autoregressive masking strategy on the residual diffusion process, or auxiliary loss (e.g., KL divergence to the Chain's conditional distributions) that would enforce temporal causality during sampling. Without these, the claimed preservation of the AR-induced causal factorization cannot be verified and the performance gain cannot be attributed to the proposed unification.

Authors: The abstract provides a high-level overview; the full manuscript details the autoregressive masking applied to the residual diffusion process (Section 3.2) and the auxiliary KL alignment loss between the Flow refiner and Chain modes (Equation 5 and Section 3.3). These mechanisms explicitly preserve the causal factorization during sampling. We agree the abstract could better indicate these elements and will revise it to reference the causal masking strategy and alignment loss at a summary level. revision: yes
Referee: [Abstract] Abstract: The SOTA claim of 94.85 on NAVSIM v1 is presented without any mention of error bars, ablation studies isolating the Flow refiner's contribution, dataset statistics, or comparison to the human baseline implementation details. This makes the central performance result impossible to evaluate from the given text and leaves open whether the result depends on the causal-global unification or on other unstated factors.

Authors: The abstract summarizes the headline result; the full manuscript reports error bars (from 5 independent runs), ablations isolating the Flow refiner (Table 3), dataset statistics (Appendix A), and direct human baseline comparisons under the identical NAVSIM v1 protocol (Section 4.2). We will revise the abstract to briefly note that these supporting analyses appear in the paper, allowing readers to contextualize the 94.85 score. revision: yes

Circularity Check

0 steps flagged

No circularity; SOTA claim rests on external benchmark

full rationale

The paper's performance claim (94.85 on NAVSIM v1) is evaluated against an external leaderboard rather than any internally fitted or self-derived quantity. The abstract describes a conceptual unification of AR Chain generator and VLM-conditioned Flow refiner but supplies no equations, loss terms, or parameter definitions that could reduce the reported result to a self-referential fit or self-citation. No self-definitional steps, fitted-input predictions, or load-bearing self-citations are present in the provided text. The derivation chain is therefore self-contained against external benchmarks, consistent with a normal non-circular outcome.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; the framework postulates a mixture over AR-induced modes and residual diffusion distributions conditioned on VLM states, but no explicit free parameters, axioms, or invented entities are enumerated in the provided text.

pith-pipeline@v0.9.0 · 5825 in / 1107 out tokens · 16662 ms · 2026-05-25T04:35:28.773347+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · 9 internal anchors

[1]

End-to- end autonomous driving: Challenges and frontiers.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):10164–10183, 2024a

Li Chen, Penghao Wu, Kashyap Chitta, Bernhard Jaeger, Andreas Geiger, and Hongyang Li. End-to- end autonomous driving: Challenges and frontiers.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):10164–10183, 2024a. doi: 10.1109/TPAMI.2024.3435937. Shaoyu Chen, Bo Jiang, Hao Gao, Bencheng Liao, Qing Xu, Qian Zhang, Chang Huang, Wenyu Li...

work page doi:10.1109/tpami.2024.3435937 2024
[2]

Rap: 3d rasterization augmented end-to-end planning.arXiv preprint arXiv:2510.04333,

Lan Feng, Yang Gao, Eloi Zablocki, Quanyi Li, Wuyang Li, Sichao Liu, Matthieu Cord, and Alexan- dre Alahi. Rap: 3d rasterization augmented end-to-end planning.arXiv preprint arXiv:2510.04333,

work page arXiv
[3]

Trajdiff: End-to-end autonomous driving without perception annotation.arXiv preprint arXiv:2512.00723,

Xingtai Gui, Jianbo Zhao, Wencheng Han, Jikai Wang, Jiahao Gong, Feiyang Tan, Cheng-zhong Xu, and Jianbing Shen. Trajdiff: End-to-end autonomous driving without perception annotation.arXiv preprint arXiv:2512.00723,

work page arXiv
[4]

Bridging scene generation and planning: Driving with world model via unifying vision and motion representation.arXiv preprint arXiv:2603.14948,

Xingtai Gui, Meijie Zhang, Tianyi Yan, Wencheng Han, Jiahao Gong, Feiyang Tan, Cheng-zhong Xu, and Jianbing Shen. Bridging scene generation and planning: Driving with world model via unifying vision and motion representation.arXiv preprint arXiv:2603.14948,

work page arXiv
[5]

Marcel Hallgarten, Julian Zapata, Martin Stoll, Katrin Renz, and Andreas Zell

URLhttps://arxiv.org/abs/2505.15111. Marcel Hallgarten, Julian Zapata, Martin Stoll, Katrin Renz, and Andreas Zell. Can vehicle motion planning generalize to realistic long-tail scenarios? In2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5388–5395. IEEE,

work page arXiv
[6]

CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving

Minqing Huang, Yujiao Xiang, Zihan Liang, Jiajie Huang, Jingqi Wang, Zhi Xu, Feiyang Tan, Hangning Zhou, Mu Yang, and Gong Che. Coworld-vla: Thinking in a multi-expert world model for autonomous driving.arXiv preprint arXiv:2605.10426,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

EMMA: End-to-End Multimodal Model for Autonomous Driving

Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, et al. Emma: End-to-end multimodal model for autonomous driving.arXiv preprint arXiv:2410.23262,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

Amp: Autoregressive motion prediction revisited with next token prediction for autonomous driving

11 Xiaosong Jia, Shaoshuai Shi, Zijun Chen, Li Jiang, Wenlong Liao, Tao He, and Junchi Yan. Amp: Autoregressive motion prediction revisited with next token prediction for autonomous driving. arXiv preprint arXiv:2403.13331,

work page arXiv
[9]

Diffvla: Vision-language guided diffusion planning for autonomous driving.arXiv preprint arXiv:2505.19381,

Anqing Jiang, Yu Gao, Zhigang Sun, Yiru Wang, Jijun Wang, Jinghao Chai, Qian Cao, Yuweng Heng, Hao Jiang, Yunda Dong, et al. Diffvla: Vision-language guided diffusion planning for autonomous driving.arXiv preprint arXiv:2505.19381,

work page arXiv
[10]

Driving on registers.arXiv preprint arXiv:2601.05083,

Ellington Kirby, Alexandre Boulch, Yihong Xu, Yuan Yin, Gilles Puy, Éloi Zablocki, Andrei Bursuc, Spyros Gidaris, Renaud Marlet, Florent Bartoccioni, et al. Driving on registers.arXiv preprint arXiv:2601.05083,

work page arXiv
[11]

Sgdrive: Scene-to-goal hierarchical world cognition for autonomous driving

Jingyu Li, Junjie Wu, Dongnan Hu, Xiangkai Huang, Bin Sun, Zhihui Hao, Xianpeng Lang, Xiatian Zhu, and Li Zhang. Sgdrive: Scene-to-goal hierarchical world cognition for autonomous driving. arXiv preprint arXiv:2601.05640,

work page arXiv
[12]

Hydra-mdp++: Advancing end-to-end driving via expert-guided hydra-distillation.arXiv preprint arXiv:2503.12820, 2025

Kailin Li, Zhenxin Li, Shiyi Lan, Yuan Xie, Zhizhong Zhang, Jiayi Liu, Zuxuan Wu, Zhiding Yu, and Jose M Alvarez. Hydra-mdp++: Advancing end-to-end driving via expert-guided hydra-distillation. arXiv preprint arXiv:2503.12820, 2025a. Yiming Li, Sihang Li, Xinhao Liu, Moonjun Gong, Kenan Li, Nuo Chen, Zijun Wang, Zhiheng Li, Tao Jiang, Fisher Yu, et al. Ss...

work page arXiv
[13]

DriveVLA-W0: World Models Amplify Data Scaling Law in Autonomous Driving

Yingyan Li, Shuyao Shang, Weisong Liu, Bing Zhan, Haochen Wang, Yuqi Wang, Yuntao Chen, Xiaoman Wang, Yasong An, Chufeng Tang, et al. Drivevla-w0: World models amplify data scaling law in autonomous driving.arXiv preprint arXiv:2510.12796, 2025b. Yongkang Li, Kaixin Xiong, Xiangyu Guo, Fang Li, Sixu Yan, Gangwei Xu, Lijun Zhou, Long Chen, Haiyang Sun, Bin...

work page internal anchor Pith review Pith/arXiv arXiv
[14]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101,

work page internal anchor Pith review Pith/arXiv arXiv
[15]

Drivelm: Driving with graph visual question answering.arXiv preprint arXiv:2312.14150,

Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Ping Luo, Andreas Geiger, and Hongyang Li. Drivelm: Driving with graph visual question answering.arXiv preprint arXiv:2312.14150,

work page arXiv
[16]

Centaur: Robust end-to-end autonomous driving with test-time training

Chonghao Sima, Kashyap Chitta, Zhiding Yu, Shiyi Lan, Ping Luo, Andreas Geiger, Hongyang Li, and Jose M Alvarez. Centaur: Robust end-to-end autonomous driving with test-time training. arXiv preprint arXiv:2503.11650,

work page arXiv
[17]

Unified vision-language-action model.arXiv preprint arXiv:2506.19850,

Yuqi Wang, Xinghang Li, Wenxuan Wang, Junbo Zhang, Yingyan Li, Yuntao Chen, Xinlong Wang, and Zhaoxiang Zhang. Unified vision-language-action model.arXiv preprint arXiv:2506.19850,

work page arXiv
[18]

Latentvla: Efficient vision-language models for autonomous driving via latent action prediction

12 Chengen Xie, Bin Sun, Tianyu Li, Junjie Wu, Zhihui Hao, XianPeng Lang, and Hongyang Li. Latentvla: Efficient vision-language models for autonomous driving via latent action prediction. arXiv preprint arXiv:2601.05611,

work page arXiv
[19]

DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving

Zhenjie Yang, Yilin Chai, Xiaosong Jia, Qifeng Li, Yuqian Shao, Xuekai Zhu, Haisheng Su, and Junchi Yan. Drivemoe: Mixture-of-experts for vision-language-action model in end-to-end au- tonomous driving.arXiv preprint arXiv:2505.16278,

work page internal anchor Pith review Pith/arXiv arXiv
[20]

FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving

Shuang Zeng, Xinyuan Chang, Mengwei Xie, Xinran Liu, Yifan Bai, Zheng Pan, Mu Xu, Xing Wei, and Ning Guo. Futuresightdrive: Thinking visually with spatio-temporal cot for autonomous driving.arXiv preprint arXiv:2505.17685,

work page internal anchor Pith review Pith/arXiv arXiv
[21]

OneDrive: Unified Multi-Paradigm Driving with Vision-Language-Action Models

Yiwei Zhang, Xuesong Chen, Jin Gao, Hanshi Wang, Fudong Ge, Weiming Hu, Shaoshuai Shi, and Zhipeng Zhang. Onedrive: Unified multi-paradigm driving with vision-language-action models. arXiv preprint arXiv:2604.17915,

work page internal anchor Pith review Pith/arXiv arXiv
[22]

Resad: Normalized residual trajectory modeling for end-to-end autonomous driving.arXiv preprint arXiv:2510.08562,

Zhiyu Zheng, Shaoyu Chen, Haoran Yin, Xinbang Zhang, Jialv Zou, Xinggang Wang, Qian Zhang, and Lefei Zhang. Resad: Normalized residual trajectory modeling for end-to-end autonomous driving.arXiv preprint arXiv:2510.08562,

work page arXiv
[23]

AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning

Xingcheng Zhou, Xuyuan Han, Feng Yang, Yunpu Ma, V olker Tresp, and Alois Knoll. Opendrivevla: Towards end-to-end autonomous driving with large vision language action model. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 13782–13790, 2026a. Zewei Zhou, Tianhui Cai, Seth Z Zhao, Yun Zhang, Zhiyu Huang, Bolei Zhou, and Jia...

work page internal anchor Pith review Pith/arXiv arXiv
[24]

SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model

Zewei Zhou, Ruining Yang, Yiluan Guo, Sherry X Chen, Tao Feng, Kateryna Pistunova, Yishan Shen, Lili Su, Jiaqi Ma, et al. Spanvla: Efficient action bridging and learning from negative-recovery samples for vision-language-action model.arXiv preprint arXiv:2604.19710, 2026b. 13

work page internal anchor Pith review Pith/arXiv arXiv

[1] [1]

End-to- end autonomous driving: Challenges and frontiers.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):10164–10183, 2024a

Li Chen, Penghao Wu, Kashyap Chitta, Bernhard Jaeger, Andreas Geiger, and Hongyang Li. End-to- end autonomous driving: Challenges and frontiers.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):10164–10183, 2024a. doi: 10.1109/TPAMI.2024.3435937. Shaoyu Chen, Bo Jiang, Hao Gao, Bencheng Liao, Qing Xu, Qian Zhang, Chang Huang, Wenyu Li...

work page doi:10.1109/tpami.2024.3435937 2024

[2] [2]

Rap: 3d rasterization augmented end-to-end planning.arXiv preprint arXiv:2510.04333,

Lan Feng, Yang Gao, Eloi Zablocki, Quanyi Li, Wuyang Li, Sichao Liu, Matthieu Cord, and Alexan- dre Alahi. Rap: 3d rasterization augmented end-to-end planning.arXiv preprint arXiv:2510.04333,

work page arXiv

[3] [3]

Trajdiff: End-to-end autonomous driving without perception annotation.arXiv preprint arXiv:2512.00723,

Xingtai Gui, Jianbo Zhao, Wencheng Han, Jikai Wang, Jiahao Gong, Feiyang Tan, Cheng-zhong Xu, and Jianbing Shen. Trajdiff: End-to-end autonomous driving without perception annotation.arXiv preprint arXiv:2512.00723,

work page arXiv

[4] [4]

Bridging scene generation and planning: Driving with world model via unifying vision and motion representation.arXiv preprint arXiv:2603.14948,

Xingtai Gui, Meijie Zhang, Tianyi Yan, Wencheng Han, Jiahao Gong, Feiyang Tan, Cheng-zhong Xu, and Jianbing Shen. Bridging scene generation and planning: Driving with world model via unifying vision and motion representation.arXiv preprint arXiv:2603.14948,

work page arXiv

[5] [5]

Marcel Hallgarten, Julian Zapata, Martin Stoll, Katrin Renz, and Andreas Zell

URLhttps://arxiv.org/abs/2505.15111. Marcel Hallgarten, Julian Zapata, Martin Stoll, Katrin Renz, and Andreas Zell. Can vehicle motion planning generalize to realistic long-tail scenarios? In2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5388–5395. IEEE,

work page arXiv

[6] [6]

CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving

Minqing Huang, Yujiao Xiang, Zihan Liang, Jiajie Huang, Jingqi Wang, Zhi Xu, Feiyang Tan, Hangning Zhou, Mu Yang, and Gong Che. Coworld-vla: Thinking in a multi-expert world model for autonomous driving.arXiv preprint arXiv:2605.10426,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

EMMA: End-to-End Multimodal Model for Autonomous Driving

Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, et al. Emma: End-to-end multimodal model for autonomous driving.arXiv preprint arXiv:2410.23262,

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

Amp: Autoregressive motion prediction revisited with next token prediction for autonomous driving

11 Xiaosong Jia, Shaoshuai Shi, Zijun Chen, Li Jiang, Wenlong Liao, Tao He, and Junchi Yan. Amp: Autoregressive motion prediction revisited with next token prediction for autonomous driving. arXiv preprint arXiv:2403.13331,

work page arXiv

[9] [9]

Diffvla: Vision-language guided diffusion planning for autonomous driving.arXiv preprint arXiv:2505.19381,

Anqing Jiang, Yu Gao, Zhigang Sun, Yiru Wang, Jijun Wang, Jinghao Chai, Qian Cao, Yuweng Heng, Hao Jiang, Yunda Dong, et al. Diffvla: Vision-language guided diffusion planning for autonomous driving.arXiv preprint arXiv:2505.19381,

work page arXiv

[10] [10]

Driving on registers.arXiv preprint arXiv:2601.05083,

Ellington Kirby, Alexandre Boulch, Yihong Xu, Yuan Yin, Gilles Puy, Éloi Zablocki, Andrei Bursuc, Spyros Gidaris, Renaud Marlet, Florent Bartoccioni, et al. Driving on registers.arXiv preprint arXiv:2601.05083,

work page arXiv

[11] [11]

Sgdrive: Scene-to-goal hierarchical world cognition for autonomous driving

Jingyu Li, Junjie Wu, Dongnan Hu, Xiangkai Huang, Bin Sun, Zhihui Hao, Xianpeng Lang, Xiatian Zhu, and Li Zhang. Sgdrive: Scene-to-goal hierarchical world cognition for autonomous driving. arXiv preprint arXiv:2601.05640,

work page arXiv

[12] [12]

Hydra-mdp++: Advancing end-to-end driving via expert-guided hydra-distillation.arXiv preprint arXiv:2503.12820, 2025

Kailin Li, Zhenxin Li, Shiyi Lan, Yuan Xie, Zhizhong Zhang, Jiayi Liu, Zuxuan Wu, Zhiding Yu, and Jose M Alvarez. Hydra-mdp++: Advancing end-to-end driving via expert-guided hydra-distillation. arXiv preprint arXiv:2503.12820, 2025a. Yiming Li, Sihang Li, Xinhao Liu, Moonjun Gong, Kenan Li, Nuo Chen, Zijun Wang, Zhiheng Li, Tao Jiang, Fisher Yu, et al. Ss...

work page arXiv

[13] [13]

DriveVLA-W0: World Models Amplify Data Scaling Law in Autonomous Driving

Yingyan Li, Shuyao Shang, Weisong Liu, Bing Zhan, Haochen Wang, Yuqi Wang, Yuntao Chen, Xiaoman Wang, Yasong An, Chufeng Tang, et al. Drivevla-w0: World models amplify data scaling law in autonomous driving.arXiv preprint arXiv:2510.12796, 2025b. Yongkang Li, Kaixin Xiong, Xiangyu Guo, Fang Li, Sixu Yan, Gangwei Xu, Lijun Zhou, Long Chen, Haiyang Sun, Bin...

work page internal anchor Pith review Pith/arXiv arXiv

[14] [14]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101,

work page internal anchor Pith review Pith/arXiv arXiv

[15] [15]

Drivelm: Driving with graph visual question answering.arXiv preprint arXiv:2312.14150,

Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Ping Luo, Andreas Geiger, and Hongyang Li. Drivelm: Driving with graph visual question answering.arXiv preprint arXiv:2312.14150,

work page arXiv

[16] [16]

Centaur: Robust end-to-end autonomous driving with test-time training

Chonghao Sima, Kashyap Chitta, Zhiding Yu, Shiyi Lan, Ping Luo, Andreas Geiger, Hongyang Li, and Jose M Alvarez. Centaur: Robust end-to-end autonomous driving with test-time training. arXiv preprint arXiv:2503.11650,

work page arXiv

[17] [17]

Unified vision-language-action model.arXiv preprint arXiv:2506.19850,

Yuqi Wang, Xinghang Li, Wenxuan Wang, Junbo Zhang, Yingyan Li, Yuntao Chen, Xinlong Wang, and Zhaoxiang Zhang. Unified vision-language-action model.arXiv preprint arXiv:2506.19850,

work page arXiv

[18] [18]

Latentvla: Efficient vision-language models for autonomous driving via latent action prediction

12 Chengen Xie, Bin Sun, Tianyu Li, Junjie Wu, Zhihui Hao, XianPeng Lang, and Hongyang Li. Latentvla: Efficient vision-language models for autonomous driving via latent action prediction. arXiv preprint arXiv:2601.05611,

work page arXiv

[19] [19]

DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving

Zhenjie Yang, Yilin Chai, Xiaosong Jia, Qifeng Li, Yuqian Shao, Xuekai Zhu, Haisheng Su, and Junchi Yan. Drivemoe: Mixture-of-experts for vision-language-action model in end-to-end au- tonomous driving.arXiv preprint arXiv:2505.16278,

work page internal anchor Pith review Pith/arXiv arXiv

[20] [20]

FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving

Shuang Zeng, Xinyuan Chang, Mengwei Xie, Xinran Liu, Yifan Bai, Zheng Pan, Mu Xu, Xing Wei, and Ning Guo. Futuresightdrive: Thinking visually with spatio-temporal cot for autonomous driving.arXiv preprint arXiv:2505.17685,

work page internal anchor Pith review Pith/arXiv arXiv

[21] [21]

OneDrive: Unified Multi-Paradigm Driving with Vision-Language-Action Models

Yiwei Zhang, Xuesong Chen, Jin Gao, Hanshi Wang, Fudong Ge, Weiming Hu, Shaoshuai Shi, and Zhipeng Zhang. Onedrive: Unified multi-paradigm driving with vision-language-action models. arXiv preprint arXiv:2604.17915,

work page internal anchor Pith review Pith/arXiv arXiv

[22] [22]

Resad: Normalized residual trajectory modeling for end-to-end autonomous driving.arXiv preprint arXiv:2510.08562,

Zhiyu Zheng, Shaoyu Chen, Haoran Yin, Xinbang Zhang, Jialv Zou, Xinggang Wang, Qian Zhang, and Lefei Zhang. Resad: Normalized residual trajectory modeling for end-to-end autonomous driving.arXiv preprint arXiv:2510.08562,

work page arXiv

[23] [23]

AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning

Xingcheng Zhou, Xuyuan Han, Feng Yang, Yunpu Ma, V olker Tresp, and Alois Knoll. Opendrivevla: Towards end-to-end autonomous driving with large vision language action model. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 13782–13790, 2026a. Zewei Zhou, Tianhui Cai, Seth Z Zhao, Yun Zhang, Zhiyu Huang, Bolei Zhou, and Jia...

work page internal anchor Pith review Pith/arXiv arXiv

[24] [24]

SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model

Zewei Zhou, Ruining Yang, Yiluan Guo, Sherry X Chen, Tao Feng, Kateryna Pistunova, Yishan Shen, Lili Su, Jiaqi Ma, et al. Spanvla: Efficient action bridging and learning from negative-recovery samples for vision-language-action model.arXiv preprint arXiv:2604.19710, 2026b. 13

work page internal anchor Pith review Pith/arXiv arXiv