arxiv: 2605.10426 · v2 · submitted 2026-05-11 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links

· Lean Theorem

CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving

Minqing Huang , Yujiao Xiang , Zihan Liang , Jiajie Huang , Jingqi Wang , Zhi Xu , Feiyang Tan , Hangning Zhou

show 2 more authors

Mu Yang Gong Che

Authors on Pith no claims yet

Pith reviewed 2026-05-14 21:33 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords autonomous drivingvision-language-actionworld modeldiffusion plannerexpert tokenstrajectory planningmulti-expert fusionscene generation

0 comments

The pith

Expert tokens condition diffusion planner for driving trajectories

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes CoWorld-VLA to improve vision-language-action models for autonomous driving by creating planning-oriented intermediate representations. Text-based reasoning loses continuous spatial details while latent world models resist direct use for action output. The framework extracts four expert tokens via multi-source supervision to capture semantic interactions, geometric structure, dynamic evolution, and ego trajectory goals. These tokens feed into a diffusion-based hierarchical planner that jointly denoises trajectories with scene context. If the tokens remain complementary and usable as conditions, the approach should yield more accurate paths with fewer collisions on benchmarks like NAVSIM v1.

Core claim

CoWorld-VLA extracts complementary world information through multi-source supervision and encodes it into expert tokens within the VLA, thereby providing planner-accessible conditioning signals. Specifically, it constructs four types of tokens: semantic interaction, geometric structure, dynamic evolution, and ego trajectory tokens, which respectively model interaction intent, spatial structure, future temporal dynamics, and behavioral goals. During action generation, CoWorld-VLA employs a diffusion-based hierarchical multi-expert fusion planner, which is coupled with scene context throughout the joint denoising process to generate continuous ego trajectories. Experiments show that CoWorld-VL

What carries the argument

Four expert tokens (semantic interaction, geometric structure, dynamic evolution, ego trajectory) extracted through multi-source supervision, which encode distinct world aspects and serve as explicit conditioning signals for the diffusion-based hierarchical multi-expert fusion planner.

If this is right

The four tokens supply distinct world aspects that directly inform trajectory choices during denoising.
Joint denoising with scene context maintains consistency between generated paths and evolving surroundings.
Planner-accessible tokens reduce the gap between world understanding and action output compared to latent or text-only methods.
Ablation results indicate each token type contributes uniquely to overall planning performance.
Competitive benchmark scores in collision avoidance follow from better capture of interactions and dynamics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The token structure could transfer to other sequential control domains where separate world aspects need explicit conditioning.
Additional expert tokens for factors such as traffic rules or sensor noise might extend the framework without redesigning the planner.
Because tokens remain human-interpretable, they could support debugging of specific planning failures in deployed systems.
Real-vehicle testing would reveal whether the multi-source extraction generalizes beyond the benchmark's simulated data.

Load-bearing premise

Multi-source supervision can reliably extract four complementary expert representations that remain effective direct conditioning signals for the diffusion planner without losing critical spatiotemporal information.

What would settle it

A controlled removal of one expert token type on the NAVSIM v1 benchmark that produces no drop in collision avoidance rates or trajectory accuracy would show the multi-expert conditioning adds no unique value.

Figures

Figures reproduced from arXiv: 2605.10426 by Feiyang Tan, Gong Che, Hangning Zhou, Jiajie Huang, Jingqi Wang, Minqing Huang, Mu Yang, Yujiao Xiang, Zhi Xu, Zihan Liang.

**Figure 1.** Figure 1: Comparison of reasoning paradigms for VLA-based autonomous driving. (a) Direct action prediction maps multimodal inputs to actions without intermediate reasoning. (b) Textual CoT introduces language-based reasoning but may lose continuous spatio-temporal details. (c) Singleworld latent reasoning relies on one implicit world representation, which may be incomplete or weakly coupled with actions. (d) CoWorl… view at source ↗

**Figure 2.** Figure 2: Overview of CoWorld-VLA. CoWorld-VLA follows a three-stage training pipeline: videogenerator pre-training, multi-expert world-representation learning, and diffusion-based trajectory planning. It first learns future scene evolution from visual and textual conditions, then aligns VLM hidden states with semantic, geometric, visual-dynamic, and trajectory experts, and finally fuses these expert representation… view at source ↗

**Figure 3.** Figure 3: Qualitative comparison of future scene generation. Compared with Stage 1, Stage 2 better [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Qualitative comparison of trajectory planning across different training stages. Stage 2 [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Additional qualitative results of future video generation under different driving scenar [PITH_FULL_IMAGE:figures/full_fig_p019_5.png] view at source ↗

**Figure 6.** Figure 6: Local fidelity comparison in future video generation. The red boxes highlight roadside [PITH_FULL_IMAGE:figures/full_fig_p019_6.png] view at source ↗

**Figure 7.** Figure 7: Qualitative comparison of trajectory planning across three representative driving scenarios. [PITH_FULL_IMAGE:figures/full_fig_p020_7.png] view at source ↗

read the original abstract

Vision-Language-Action (VLA) models have emerged as a promising paradigm for end-to-end autonomous driving. However, existing reasoning mechanisms still struggle to provide planning-oriented intermediate representations: textual Chain-of-Thought (CoT) fails to preserve continuous spatiotemporal structure, while latent world reasoning remains difficult to use as a direct condition for action generation. In this paper, we propose CoWorld-VLA, a multi-expert world reasoning framework for autonomous driving, where world representations serve as explicit conditions to guide action planning. CoWorld-VLA extracts complementary world information through multi-source supervision and encodes it into expert tokens within the VLA, thereby providing planner-accessible conditioning signals. Specifically, we construct four types of tokens: semantic interaction, geometric structure, dynamic evolution, and ego trajectory tokens, which respectively model interaction intent, spatial structure, future temporal dynamics, and behavioral goals. During action generation, CoWorld-VLA employs a diffusion-based hierarchical multi-expert fusion planner, which is coupled with scene context throughout the joint denoising process to generate continuous ego trajectories. Experiments show that CoWorld-VLA achieves competitive results in both future scene generation and planning on the NAVSIM v1 benchmark, demonstrating strong performance in collision avoidance and trajectory accuracy. Ablation studies further validate the complementarity of expert tokens and their effectiveness as planning conditions for action generation. Code will be available at https://github.com/AFARI-Research/CoWorld-VLA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper's main move is a concrete set of four expert tokens (semantic, geometric, dynamic, ego-trajectory) that condition a diffusion planner inside a VLA driving model, but the NAVSIM results are presented without enough detail to judge their reliability.

read the letter

The paper puts forward CoWorld-VLA as a way to give VLA planners explicit, planner-friendly world representations instead of relying on textual chain-of-thought or hard-to-use latents. It extracts four complementary tokens through multi-source supervision and fuses them hierarchically inside a diffusion process that stays coupled to scene context while denoising trajectories. That design choice is the clearest new element: it turns world reasoning into direct conditioning signals rather than an afterthought. The abstract frames the motivation cleanly and the ablations are positioned to test whether the tokens actually add value to each other. If the full experiments back that up, the approach gives a practical handle on injecting structure into end-to-end driving without forcing everything through language or opaque embeddings. The main soft spot is the evidence base. The claims of competitive performance on NAVSIM v1, especially for collision avoidance and trajectory accuracy, rest on benchmarks and ablations that are only summarized here. Without the actual numbers, baseline comparisons, or details on how the supervision signals are generated and aligned, it is difficult to know whether the gains are stable or depend on particular post-processing choices. The assumption that the four tokens remain informative after hierarchical fusion also needs stronger checks than the abstract supplies. This work is aimed at people already building VLA or diffusion planners for driving who want a more structured intermediate representation. A reader in that group could extract the token design and fusion scheme even if they end up re-running the experiments themselves. It is worth sending to peer review because the core idea addresses a real gap and the pipeline is described at a level that referees can evaluate and improve.

Referee Report

2 major / 2 minor

Summary. The paper introduces CoWorld-VLA, a multi-expert world reasoning framework for Vision-Language-Action (VLA) models in autonomous driving. It extracts four complementary expert tokens—semantic interaction, geometric structure, dynamic evolution, and ego trajectory—via multi-source supervision to serve as explicit conditioning signals. These tokens guide a diffusion-based hierarchical multi-expert fusion planner that generates continuous ego trajectories while remaining coupled to scene context during denoising. The approach is evaluated on the NAVSIM v1 benchmark, claiming competitive performance in future scene generation and planning with strengths in collision avoidance and trajectory accuracy; ablations are reported to support token complementarity.

Significance. If the empirical results hold, the work provides a concrete advance in bridging world reasoning and action generation for end-to-end driving by replacing textual CoT or opaque latents with planner-accessible expert tokens that preserve spatiotemporal structure. The explicit multi-expert design and diffusion planner could improve safety-critical metrics. The commitment to release code at https://github.com/AFARI-Research/CoWorld-VLA is a clear strength for reproducibility and follow-up work.

major comments (2)

[Experiments] Experiments section: the claim of competitive results on NAVSIM v1 for collision avoidance and trajectory accuracy is central, yet the manuscript provides no quantitative tables, baseline comparisons, or variance statistics; without these, it is impossible to verify whether the reported gains are robust or sensitive to post-hoc choices in token supervision.
[§3.2] §3.2 (expert token construction): the four tokens are derived from external multi-source supervision signals and then used as direct conditioning for the diffusion planner; the paper does not specify the exact alignment or information-loss mechanism between these signals and the final planning loss, which is load-bearing for the complementarity claim.

minor comments (2)

[Abstract] The abstract states that code will be available but does not include a permanent DOI or commit hash; adding this would strengthen reproducibility.
Figure captions and the pipeline diagram would benefit from explicit labels for each expert token path and the precise fusion operation inside the diffusion steps.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments, which help strengthen the clarity and rigor of our work on CoWorld-VLA. We appreciate the recognition of the framework's potential to bridge explicit world reasoning with action generation in autonomous driving. We address each major comment point by point below, outlining the revisions we will incorporate to improve the manuscript.

read point-by-point responses

Referee: [Experiments] Experiments section: the claim of competitive results on NAVSIM v1 for collision avoidance and trajectory accuracy is central, yet the manuscript provides no quantitative tables, baseline comparisons, or variance statistics; without these, it is impossible to verify whether the reported gains are robust or sensitive to post-hoc choices in token supervision.

Authors: We acknowledge that the current version of the manuscript does not include explicit quantitative tables with baseline comparisons and variance statistics, which limits the ability to fully assess robustness. In the revised manuscript, we will add a dedicated results table in the Experiments section reporting key NAVSIM v1 metrics (e.g., collision rate, trajectory accuracy via ADE/FDE, and planning success), direct comparisons against relevant baselines (including end-to-end VLA and diffusion-based planners), and standard deviations computed over multiple random seeds to demonstrate statistical reliability of the gains. revision: yes
Referee: [§3.2] §3.2 (expert token construction): the four tokens are derived from external multi-source supervision signals and then used as direct conditioning for the diffusion planner; the paper does not specify the exact alignment or information-loss mechanism between these signals and the final planning loss, which is load-bearing for the complementarity claim.

Authors: We agree that the precise alignment between the multi-source supervision signals and the planning loss, along with any potential information loss, needs explicit clarification to support the complementarity claim. The expert tokens are constructed to retain their original spatiotemporal structure and are injected directly via cross-attention conditioning into the diffusion planner at every denoising timestep, with the joint training loss ensuring end-to-end alignment without intermediate compression. In the revised §3.2, we will include the mathematical formulation of token injection, the conditioning mechanism, and how the multi-source losses are balanced with the trajectory denoising objective to minimize information loss. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper constructs four expert token types (semantic interaction, geometric structure, dynamic evolution, ego trajectory) via multi-source external supervision signals that are independent of the final planning loss or diffusion outputs. These tokens then condition a standard diffusion-based hierarchical planner through joint denoising, with performance validated on the external NAVSIM v1 benchmark and ablations that test complementarity without reducing to self-definition or fitted-input renaming. No load-bearing step equates a claimed prediction to its own inputs by construction, and no self-citation chain is invoked to justify uniqueness or ansatz choices.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated. The four expert token types are constructed from supervision signals whose precise extraction rules are not detailed here.

pith-pipeline@v0.9.0 · 5586 in / 1172 out tokens · 63120 ms · 2026-05-14T21:33:05.502765+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

we construct four types of tokens: semantic interaction, geometric structure, dynamic evolution, and ego trajectory tokens... diffusion-based hierarchical multi-expert fusion planner
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean LogicNat recovery unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Ablation studies further validate the complementarity of expert tokens

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

86 extracted references · 86 canonical work pages · 18 internal anchors

[1]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26286–26296, 2024

work page 2024
[3]

Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185–24198, 2024

work page 2024
[4]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[5]

Open- drivevla: Towards end-to-end autonomous driving with large vision language action model

Xingcheng Zhou, Xuyuan Han, Feng Yang, Yunpu Ma, V olker Tresp, and Alois Knoll. Open- drivevla: Towards end-to-end autonomous driving with large vision language action model. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 13782–13790, 2026

work page 2026
[6]

Unified vision-language-action model

Yuqi Wang, Xinghang Li, Wenxuan Wang, Junbo Zhang, Yingyan Li, Yuntao Chen, Xinlong Wang, and Zhaoxiang Zhang. Unified vision-language-action model. InThe Fourteenth International Conference on Learning Representations, 2026

work page 2026
[7]

Drivemlm: aligning multi-modal large language models with behavioral planning states for autonomous driving.Visual Intelligence, 3(1):22, 2025

Erfei Cui, Wenhai Wang, Zhiqi Li, Jiangwei Xie, Haoming Zou, Hanming Deng, Gen Luo, Lewei Lu, Xizhou Zhu, and Jifeng Dai. Drivemlm: aligning multi-modal large language models with behavioral planning states for autonomous driving.Visual Intelligence, 3(1):22, 2025

work page 2025
[8]

DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Yang Wang, Zhiyong Zhao, Kun Zhan, Peng Jia, Xianpeng Lang, and Hang Zhao. Drivevlm: The convergence of autonomous driving and large vision-language models.arXiv preprint arXiv:2402.12289, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

Senna: Bridging large vision-language models and end-to-end autonomous driving.arXiv preprint arXiv:2410.22313, 2024a

Bo Jiang, Shaoyu Chen, Bencheng Liao, Xingyu Zhang, Wei Yin, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. Senna: Bridging large vision-language models and end-to-end autonomous driving.arXiv preprint arXiv:2410.22313, 2024

work page arXiv 2024
[10]

Diffvla: Vision-language guided diffusion planning for autonomous driving.arXiv preprint arXiv:2505.19381, 2025

Anqing Jiang, Yu Gao, Zhigang Sun, Yiru Wang, Jijun Wang, Jinghao Chai, Qian Cao, Yuweng Heng, Hao Jiang, Yunda Dong, et al. Diffvla: Vision-language guided diffusion planning for autonomous driving.arXiv preprint arXiv:2505.19381, 2025

work page arXiv 2025
[11]

Vlp: Vision language planning for autonomous driving

Chenbin Pan, Burhaneddin Yaman, Tommaso Nesti, Abhirup Mallik, Alessandro G Allievi, Senem Velipasalar, and Liu Ren. Vlp: Vision language planning for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14760–14769, 2024

work page 2024
[12]

Orion: A holistic end-to-end autonomous driving framework by vision-language instructed action generation

Haoyu Fu, Diankun Zhang, Zongchuang Zhao, Jianfeng Cui, Dingkang Liang, Chong Zhang, Dingyuan Zhang, Hongwei Xie, Bing Wang, and Xiang Bai. Orion: A holistic end-to-end autonomous driving framework by vision-language instructed action generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 24823–24834, 2025

work page 2025
[13]

nuplan: A closed-loop ml-based planning benchmark for autonomous vehicles.arXiv preprint arXiv:2106.11810,

Holger Caesar, Juraj Kabzan, Kok Seang Tan, Whye Kit Fong, Eric Wolff, Alex Lang, Luke Fletcher, Oscar Beijbom, and Sammy Omari. nuplan: A closed-loop ml-based planning benchmark for autonomous vehicles.arXiv preprint arXiv:2106.11810, 2021

work page arXiv 2021
[14]

Navsim: Data-driven non- reactive autonomous vehicle simulation and benchmarking.Advances in Neural Information Processing Systems, 37:28706–28719, 2024

Daniel Dauner, Marcel Hallgarten, Tianyu Li, Xinshuo Weng, Zhiyu Huang, Zetong Yang, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, et al. Navsim: Data-driven non- reactive autonomous vehicle simulation and benchmarking.Advances in Neural Information Processing Systems, 37:28706–28719, 2024. 10

work page 2024
[15]

Vadv2: End-to-end vectorized autonomous driving via probabilistic planning

Bo Jiang, Shaoyu Chen, Hao Gao, Bencheng Liao, Qian Zhang, Wenyu Liu, and Xinggang Wang. Vadv2: End-to-end vectorized autonomous driving via probabilistic planning. InThe Fourteenth International Conference on Learning Representations, 2024

work page 2024
[16]

Planning-oriented autonomous driving

Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17853–17862, 2023

work page 2023
[17]

Transfuser: Imitation with transformer-based sensor fusion for autonomous driving.IEEE transactions on pattern analysis and machine intelligence, 45(11):12878–12895, 2022

Kashyap Chitta, Aditya Prakash, Bernhard Jaeger, Zehao Yu, Katrin Renz, and Andreas Geiger. Transfuser: Imitation with transformer-based sensor fusion for autonomous driving.IEEE transactions on pattern analysis and machine intelligence, 45(11):12878–12895, 2022

work page 2022
[18]

Drivinggpt: Unifying driving world modeling and planning with multi-modal autoregressive transformers

Yuntao Chen, Yuqi Wang, and Zhaoxiang Zhang. Drivinggpt: Unifying driving world modeling and planning with multi-modal autoregressive transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 26890–26900, 2025

work page 2025
[19]

DriveLaW:Unifying Planning and Video Generation in a Latent Driving World

Tianze Xia, Yongkang Li, Lijun Zhou, Jingfeng Yao, Kaixin Xiong, Haiyang Sun, Bing Wang, Kun Ma, Guang Chen, Hangjun Ye, et al. Drivelaw: Unifying planning and video generation in a latent driving world.arXiv preprint arXiv:2512.23421, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[20]

Imagidrive: A unified imagination-and-planning framework for autonomous driving.arXiv preprint arXiv:2508.11428, 2025

Jingyu Li, Bozhou Zhang, Xin Jin, Jiankang Deng, Xiatian Zhu, and Li Zhang. Imagidrive: A unified imagination-and-planning framework for autonomous driving.arXiv preprint arXiv:2508.11428, 2025

work page arXiv 2025
[21]

Drivegpt4: Interpretable end-to-end autonomous driving via large language model.IEEE Robotics and Automation Letters, 9(10):8186–8193, 2024

Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kwan-Yee K Wong, Zhenguo Li, and Hengshuang Zhao. Drivegpt4: Interpretable end-to-end autonomous driving via large language model.IEEE Robotics and Automation Letters, 9(10):8186–8193, 2024

work page 2024
[22]

2509.13769 , archivePrefix =

Yuechen Luo, Fang Li, Shaoqing Xu, Zhiyi Lai, Lei Yang, Qimao Chen, Ziang Luo, Zixun Xie, Shengyin Jiang, Jiaxin Liu, et al. Adathinkdrive: Adaptive thinking via reinforcement learning for autonomous driving.arXiv preprint arXiv:2509.13769, 2025

work page arXiv 2025
[23]

AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning

Zewei Zhou, Tianhui Cai, Seth Z Zhao, Yun Zhang, Zhiyu Huang, Bolei Zhou, and Jiaqi Ma. Autovla: A vision-language-action model for end-to-end autonomous driving with adaptive reasoning and reinforcement fine-tuning.arXiv preprint arXiv:2506.13757, 2025

work page internal anchor Pith review arXiv 2025
[24]

Compressed chain of thought: Efficient reasoning through dense representations.arXiv preprint arXiv:2412.13171, 2024

Jeffrey Cheng and Benjamin Van Durme. Compressed chain of thought: Efficient reasoning through dense representations.arXiv preprint arXiv:2412.13171, 2024

work page arXiv 2024
[25]

Training Large Language Models to Reason in a Continuous Latent Space

Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. Training large language models to reason in a continuous latent space.arXiv preprint arXiv:2412.06769, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[26]

Mull-Tokens: Modality-Agnostic Latent Thinking

Arijit Ray, Ahmed Abdelkader, Chengzhi Mao, Bryan A Plummer, Kate Saenko, Ranjay Krishna, Leonidas Guibas, and Wen-Sheng Chu. Mull-tokens: Modality-agnostic latent thinking.arXiv preprint arXiv:2512.10941, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[27]

Latent visual reasoning.arXiv preprint arXiv:2509.24251, 2025a

Bangzheng Li, Ximeng Sun, Jiang Liu, Ze Wang, Jialian Wu, Xiaodong Yu, Hao Chen, Emad Barsoum, Muhao Chen, and Zicheng Liu. Latent visual reasoning.arXiv preprint arXiv:2509.24251, 2025

work page arXiv 2025
[28]

A comprehensive survey on world models for embodied AI.arXiv preprintarXiv:2510.16732, 2025

Xinqing Li, Xin He, Le Zhang, Min Wu, Xiaoli Li, and Yun Liu. A comprehensive survey on world models for embodied ai.arXiv preprint arXiv:2510.16732, 2025

work page arXiv 2025
[29]

Video generation models as world simulators

Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Leo Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators. OpenAI Blog, 1(8):1, 2024

work page 2024
[30]

Vista: A generalizable driving world model with high fidelity and versatile controllability.Advances in Neural Information Processing Systems, 37:91560–91596, 2024

Shenyuan Gao, Jiazhi Yang, Li Chen, Kashyap Chitta, Yihang Qiu, Andreas Geiger, Jun Zhang, and Hongyang Li. Vista: A generalizable driving world model with high fidelity and versatile controllability.Advances in Neural Information Processing Systems, 37:91560–91596, 2024

work page 2024
[31]

Drive- dreamer: Towards real-world-drive world models for autonomous driving

Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, Jiagang Zhu, and Jiwen Lu. Drive- dreamer: Towards real-world-drive world models for autonomous driving. InEuropean confer- ence on computer vision, pages 55–72. Springer, 2024. 11

work page 2024
[32]

GAIA-1: A Generative World Model for Autonomous Driving

Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. Gaia-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[33]

Drivingworld: Constructing world model for autonomous driving via video gpt

Xiaotao Hu, Wei Yin, Mingkai Jia, Junyuan Deng, Xiaoyang Guo, Qian Zhang, Xiaoxiao Long, and Ping Tan. Drivingworld: Constructing world model for autonomous driving via video gpt. arXiv preprint arXiv:2412.19505, 2024

work page arXiv 2024
[34]

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[35]

Magicdrive: Street view generation with diverse 3d geometry control.arXiv preprint arXiv:2310.02601, 2023

Ruiyuan Gao, Kai Chen, Enze Xie, Lanqing Hong, Zhenguo Li, Dit-Yan Yeung, and Qiang Xu. Magicdrive: Street view generation with diverse 3d geometry control.arXiv preprint arXiv:2310.02601, 2023

work page arXiv 2023
[36]

Geodrive: 3d geometry-informed driving world model with precise action control.arXiv preprint arXiv:2505.22421, 2025

Anthony Chen, Wenzhao Zheng, Yida Wang, Xueyang Zhang, Kun Zhan, Peng Jia, Kurt Keutzer, and Shanghang Zhang. Geodrive: 3d geometry-informed driving world model with precise action control.arXiv preprint arXiv:2505.22421, 2025

work page arXiv 2025
[37]

Vggt: Visual geometry grounded transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025

work page 2025
[38]

Self-supervised learning from images with a joint- embedding predictive architecture

Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint- embedding predictive architecture. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15619–15629, 2023

work page 2023
[39]

Revisiting Feature Prediction for Learning Visual Representations from Video

Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual repre- sentations from video.arXiv preprint arXiv:2404.08471, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[40]

Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving

Yuqi Wang, Jiawei He, Lue Fan, Hongxin Li, Yuntao Chen, and Zhaoxiang Zhang. Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14749–14759, 2024

work page 2024
[41]

Driveworld: 4d pre-trained scene understanding via world models for autonomous driving

Chen Min, Dawei Zhao, Liang Xiao, Jian Zhao, Xinli Xu, Zheng Zhu, Lei Jin, Jianshu Li, Yulan Guo, Junliang Xing, et al. Driveworld: 4d pre-trained scene understanding via world models for autonomous driving. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15522–15533, 2024

work page 2024
[42]

Enhancing end-to-end autonomous driving with latent world model.arXiv preprint arXiv:2406.08481, 2024

Yingyan Li, Lue Fan, Jiawei He, Yuqi Wang, Yuntao Chen, Zhaoxiang Zhang, and Tieniu Tan. Enhancing end-to-end autonomous driving with latent world model.arXiv preprint arXiv:2406.08481, 2024

work page arXiv 2024
[43]

End-to-end driving with online trajectory evaluation via bev world model

Yingyan Li, Yuqi Wang, Yang Liu, Jiawei He, Lue Fan, and Zhaoxiang Zhang. End-to-end driving with online trajectory evaluation via bev world model. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 27137–27146, 2025

work page 2025
[44]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[45]

Drivelm: Driving with graph visual question answering

Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. Drivelm: Driving with graph visual question answering. InEuropean conference on computer vision, pages 256–274. Springer, 2024

work page 2024
[46]

Ts-vlm: Text-guided softsort pooling for vision-language models in multi-view driving reasoning.arXiv preprint arXiv:2505.12670, 2025

Lihong Chen, Hossein Hassani, and Soodeh Nikan. Ts-vlm: Text-guided softsort pooling for vision-language models in multi-view driving reasoning.arXiv preprint arXiv:2505.12670, 2025. 12

work page arXiv 2025
[47]

Gpt-driver: Learning to drive with gpt.arXiv preprint arXiv:2310.01415, 2023a

Jiageng Mao, Yuxi Qian, Junjie Ye, Hang Zhao, and Yue Wang. Gpt-driver: Learning to drive with gpt.arXiv preprint arXiv:2310.01415, 2023

work page arXiv 2023
[48]

EMMA: End-to-End Multimodal Model for Autonomous Driving

Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, et al. Emma: End-to-end multimodal model for autonomous driving.arXiv preprint arXiv:2410.23262, 2024

work page internal anchor Pith review arXiv 2024
[49]

Omnidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning

Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, and Jose M Alvarez. Omnidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning. InProceedings of the computer vision and pattern recognition conference, pages 22442–22452, 2025

work page 2025
[50]

ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving

Yongkang Li, Kaixin Xiong, Xiangyu Guo, Fang Li, Sixu Yan, Gangwei Xu, Lijun Zhou, Long Chen, Haiyang Sun, Bing Wang, et al. Recogdrive: A reinforced cognitive framework for end-to-end autonomous driving.arXiv preprint arXiv:2506.08052, 2025

work page internal anchor Pith review arXiv 2025
[51]

Analyzing reasoning consistency in large multimodal models under cross-modal conflicts.arXiv preprint arXiv:2601.04073, 2026

Zhihao Zhu, Jiafeng Liang, Shixin Jiang, Jinlan Fu, Ming Liu, Guanglu Sun, See-Kiong Ng, and Bing Qin. Analyzing reasoning consistency in large multimodal models under cross-modal conflicts.arXiv preprint arXiv:2601.04073, 2026

work page arXiv 2026
[52]

Adacot: Pareto-optimal adaptive chain-of-thought triggering via reinforcement learning.arXiv preprint arXiv:2505.11896, 2025

Chenwei Lou, Zewei Sun, Xinnian Liang, Meng Qu, Wei Shen, Wenqi Wang, Yuntao Li, Qing- ping Yang, and Shuangzhi Wu. Adacot: Pareto-optimal adaptive chain-of-thought triggering via reinforcement learning.arXiv preprint arXiv:2505.11896, 2025

work page arXiv 2025
[53]

DriveMoE: Mixture-of-experts for vision-language-action model in end-to-end autonomous driving

Zhenjie Yang, Yilin Chai, Xiaosong Jia, Qifeng Li, Yuqian Shao, Xuekai Zhu, Haisheng Su, and Junchi Yan. Drivemoe: Mixture-of-experts for vision-language-action model in end-to-end autonomous driving.arXiv preprint arXiv:2505.16278, 2025

work page arXiv 2025
[54]

Last-vla: Thinking in latent spatio-temporal space for vision-language-action in autonomous driving.arXiv preprint arXiv:2603.01928,

Yuechen Luo, Fang Li, Shaoqing Xu, Yang Ji, Zehan Zhang, Bing Wang, Yuannan Shen, Jianwei Cui, Long Chen, Guang Chen, et al. Last-vla: Thinking in latent spatio-temporal space for vision-language-action in autonomous driving.arXiv preprint arXiv:2603.01928, 2026

work page arXiv 2026
[55]

Wovogen: World volume-aware diffusion for controllable multi-camera driving scene generation

Jiachen Lu, Ze Huang, Zeyu Yang, Jiahui Zhang, and Li Zhang. Wovogen: World volume-aware diffusion for controllable multi-camera driving scene generation. InEuropean conference on computer vision, pages 329–345. Springer, 2024

work page 2024
[56]

Generalized predictive model for autonomous driving

Jiazhi Yang, Shenyuan Gao, Yihang Qiu, Li Chen, Tianyu Li, Bo Dai, Kashyap Chitta, Penghao Wu, Jia Zeng, Ping Luo, et al. Generalized predictive model for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14662–14672, 2024

work page 2024
[57]

arXiv preprint arXiv:2311.01017 , year=

Lunjun Zhang, Yuwen Xiong, Ze Yang, Sergio Casas, Rui Hu, and Raquel Urtasun. Copilot4d: Learning unsupervised world models for autonomous driving via discrete diffusion.arXiv preprint arXiv:2311.01017, 2023

work page arXiv 2023
[58]

Lidardm: Generative lidar simulation in a generated world

Vlas Zyrianov, Henry Che, Zhijian Liu, and Shenlong Wang. Lidardm: Generative lidar simulation in a generated world. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 6055–6062. IEEE, 2025

work page 2025
[59]

Uniscene: Unified occupancy-centric driving scene generation

Bohan Li, Jiazhe Guo, Hongsi Liu, Yingshuang Zou, Yikang Ding, Xiwu Chen, Hu Zhu, Feiyang Tan, Chi Zhang, Tiancai Wang, et al. Uniscene: Unified occupancy-centric driving scene generation. InProceedings of the computer vision and pattern recognition conference, pages 11971–11981, 2025

work page 2025
[60]

Epona: Autoregressive diffusion world model for autonomous driving

Kaiwen Zhang, Zhenyu Tang, Xiaotao Hu, Xingang Pan, Xiaoyang Guo, Yuan Liu, Jingwei Huang, Li Yuan, Qian Zhang, Xiao-Xiao Long, et al. Epona: Autoregressive diffusion world model for autonomous driving. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 27220–27230, 2025

work page 2025
[61]

Genie: Generative interactive environments

Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. InForty-first International Conference on Machine Learning, 2024. 13

work page 2024
[62]

Magicdrive- v2: High-resolution long video generation for autonomous driving with adaptive control

Ruiyuan Gao, Kai Chen, Bo Xiao, Lanqing Hong, Zhenguo Li, and Qiang Xu. Magicdrive- v2: High-resolution long video generation for autonomous driving with adaptive control. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 28135– 28144, 2025

work page 2025
[63]

Dist-4d: Disentangled spatiotemporal diffusion with metric depth for 4d driving scene generation

Jiazhe Guo, Yikang Ding, Xiwu Chen, Shuo Chen, Bohan Li, Yingshuang Zou, Xiaoyang Lyu, Feiyang Tan, Xiaojuan Qi, Zhiheng Li, et al. Dist-4d: Disentangled spatiotemporal diffusion with metric depth for 4d driving scene generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 27231–27241, 2025

work page 2025
[64]

Genesis: Multimodal driving scene generation with spatio-temporal and cross-modal consistency.arXiv preprint arXiv:2506.07497, 2025

Xiangyu Guo, Zhanqian Wu, Kaixin Xiong, Ziyang Xu, Lijun Zhou, Gangwei Xu, Shaoqing Xu, Haiyang Sun, Bing Wang, Guang Chen, et al. Genesis: Multimodal driving scene generation with spatio-temporal and cross-modal consistency.arXiv preprint arXiv:2506.07497, 2025

work page arXiv 2025
[65]

DriveVLA-W0: World models amplify data scaling law in autonomous driving.arXiv preprint arXiv:2510.12796, 2025

Yingyan Li, Shuyao Shang, Weisong Liu, Bing Zhan, Haochen Wang, Yuqi Wang, Yuntao Chen, Xiaoman Wang, Yasong An, Chufeng Tang, et al. Drivevla-w0: World models amplify data scaling law in autonomous driving.arXiv preprint arXiv:2510.12796, 2025

work page arXiv 2025
[66]

Uni-world vla: Interleaved world modeling and planning for autonomous driving.arXiv preprint arXiv:2603.27287, 2026

Qiqi Liu, Huan Xu, Jingyu Li, Bin Sun, Zhihui Hao, Dangen She, Xiatian Zhu, and Li Zhang. Uni-world vla: Interleaved world modeling and planning for autonomous driving.arXiv preprint arXiv:2603.27287, 2026

work page arXiv 2026
[67]

Rad: Training an end-to-end driving policy via large-scale 3dgs-based reinforcement learning.arXiv preprint arXiv:2502.13144,

Hao Gao, Shaoyu Chen, Bo Jiang, Bencheng Liao, Yiang Shi, Xiaoyang Guo, Yuechuan Pu, Haoran Yin, Xiangyu Li, Xinbang Zhang, et al. Rad: Training an end-to-end driving policy via large-scale 3dgs-based reinforcement learning.arXiv preprint arXiv:2502.13144, 2025

work page arXiv 2025
[68]

Presight: Enhancing autonomous vehicle perception with city-scale nerf priors

Tianyuan Yuan, Yucheng Mao, Jiawei Yang, Yicheng Liu, Yue Wang, and Hang Zhao. Presight: Enhancing autonomous vehicle perception with city-scale nerf priors. InEuropean Conference on Computer Vision, pages 323–339. Springer, 2024

work page 2024
[69]

Depth Anything 3: Recovering the Visual Space from Any Views

Haotong Lin, Sili Chen, Junhao Liew, Donny Y Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views.arXiv preprint arXiv:2511.10647, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[70]

S3 Gaussian: Self-supervised street gaussians for autonomous driving.arXiv preprint arXiv:2405.20323, 2024

Nan Huang, Xiaobao Wei, Wenzhao Zheng, Pengju An, Ming Lu, Wei Zhan, Masayoshi Tomizuka, Kurt Keutzer, and Shanghang Zhang. S3 Gaussian: Self-supervised street gaussians for autonomous driving.arXiv preprint arXiv:2405.20323, 2024

work page arXiv 2024
[71]

3d gaussian splatting for real-time radiance field rendering.ACM Trans

Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, George Drettakis, et al. 3d gaussian splatting for real-time radiance field rendering.ACM Trans. Graph., 42(4):139–1, 2023

work page 2023
[72]

Perceiver: General perception with iterative attention, 2021

Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, and Joao Carreira. Perceiver: General perception with iterative attention, 2021

work page 2021
[73]

Futuresightdrive: Thinking visually with spatio-temporal cot for autonomous driving,

Shuang Zeng, Xinyuan Chang, Mengwei Xie, Xinran Liu, Yifan Bai, Zheng Pan, Mu Xu, Xing Wei, and Ning Guo. Futuresightdrive: Thinking visually with spatio-temporal cot for autonomous driving.arXiv preprint arXiv:2505.17685, 2025

work page arXiv 2025
[74]

World4drive: End-to-end autonomous driving via intention-aware physical latent world model

Yupeng Zheng, Pengxuan Yang, Zebin Xing, Qichao Zhang, Yuhang Zheng, Yinfeng Gao, Pengfei Li, Teng Zhang, Zhongpu Xia, Peng Jia, et al. World4drive: End-to-end autonomous driving via intention-aware physical latent world model. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 28632–28642, 2025

work page 2025
[75]

Openscene: 3d scene understanding with open vocabularies

Songyou Peng, Kyle Genova, Chiyu Jiang, Andrea Tagliasacchi, Marc Pollefeys, Thomas Funkhouser, et al. Openscene: 3d scene understanding with open vocabularies. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 815–824, 2023

work page 2023
[76]

Towards Accurate Generative Models of Video: A New Metric & Challenges

Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[77]

Hydra-MDP: End-to-end Multimodal Planning with Multi-target Hydra-Distillation

Zhenxin Li, Kailin Li, Shihao Wang, Shiyi Lan, Zhiding Yu, Yishen Ji, Zhiqi Li, Ziyue Zhu, Jan Kautz, Zuxuan Wu, et al. Hydra-mdp: End-to-end multimodal planning with multi-target hydra-distillation.arXiv preprint arXiv:2406.06978, 2024. 14

work page internal anchor Pith review Pith/arXiv arXiv 2024
[78]

Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving

Bencheng Liao, Shaoyu Chen, Haoran Yin, Bo Jiang, Cheng Wang, Sixu Yan, Xinbang Zhang, Xiangyu Li, Ying Zhang, Qian Zhang, et al. Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 12037–12047, 2025

work page 2025
[79]

Trajdiff: End-to-end autonomous driving without perception annotation

Xingtai Gui, Jianbo Zhao, Wencheng Han, Jikai Wang, Jiahao Gong, Feiyang Tan, Cheng-zhong Xu, and Jianbing Shen. Trajdiff: End-to-end autonomous driving without perception annotation. arXiv preprint arXiv:2512.00723, 2025

work page arXiv 2025
[80]

ReSim: Reliable World Simulation for Autonomous Driving

Jiazhi Yang, Kashyap Chitta, Shenyuan Gao, Long Chen, Yuqian Shao, Xiaosong Jia, Hongyang Li, Andreas Geiger, Xiangyu Yue, and Li Chen. Resim: Reliable world simulation for au- tonomous driving.arXiv preprint arXiv:2506.09981, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

Showing first 80 references.