pith. machine review for the scientific record.

arxiv: 2605.11550 · v1 · submitted 2026-05-12 · 💻 cs.CV

Recognition: 2 Lean theorem links

The DAWN of World-Action Interactive Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 02:00 UTC · model grok-4.3

classification 💻 cs.CV
keywords: world-action interactive models · autonomous driving · latent generative models · mutual conditioning · action denoising · trajectory planning · safety evaluation

The pith

Coupling world prediction with action denoising through mutual feedback in latent space produces safer and more effective autonomous driving plans.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that a plausible future scene depends on the action chosen while a good action depends on how the scene will evolve, so world prediction and action generation must condition each other rather than run in isolation or in one fixed order. It formalizes this reciprocity as World-Action Interactive Models and shows an implementation called DAWN that keeps both components in a compact semantic latent space. During inference the predicted world hypothesis guides action denoising while the emerging action hypothesis updates the world prediction, with only a short explicit rollout needed to support long-horizon trajectories. If this mutual refinement works, world models become directly usable for planning without separate pipelines or expensive pixel-level simulation. Readers should care because the approach targets exactly the interactive traffic scenes where current decoupled methods most often fail.
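
To make the loop concrete, here is a minimal sketch of what such mutual refinement at inference time could look like. Everything below (module designs, dimensions, round count) is an illustrative assumption for exposition, not DAWN's actual architecture.

    # Toy sketch of a WAIM-style inference loop (illustrative assumptions, not DAWN's code).
    import torch
    import torch.nn as nn

    class WorldPredictor(nn.Module):
        """Stand-in world predictor: maps scene latents + current action hypothesis to a latent world hypothesis."""
        def __init__(self, d=64):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(2 * d, d), nn.GELU(), nn.Linear(d, d))

        def forward(self, z, action):
            return self.net(torch.cat([z, action], dim=-1))

    class ActionDenoiser(nn.Module):
        """Stand-in action denoiser: refines the action hypothesis conditioned on the predicted world."""
        def __init__(self, d=64):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(2 * d, d), nn.GELU(), nn.Linear(d, d))

        def forward(self, z, world):
            return self.net(torch.cat([z, world], dim=-1))

    def interactive_inference(z, predictor, denoiser, action, rounds=3):
        """Mutual conditioning: the world hypothesis guides action denoising, and the refined action updates the world prediction."""
        for _ in range(rounds):
            world = predictor(z, action)   # short latent rollout, conditioned on the current action
            action = denoiser(z, world)    # denoise the action under the world hypothesis
        return action

    if __name__ == "__main__":
        d = 64
        z = torch.randn(1, d)    # compact semantic scene latents
        a0 = torch.randn(1, d)   # initial action proposal from the latents
        plan = interactive_inference(z, WorldPredictor(d), ActionDenoiser(d), a0)
        print(plan.shape)        # torch.Size([1, 64])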

Core claim

The authors claim that formalizing World-Action Interactive Models, and instantiating them as DAWN, which couples a World Predictor with a World-Conditioned Action Denoiser so that each recursively refines the other in latent space via short explicit rollouts, delivers strong planning performance and favorable safety results across multiple autonomous driving benchmarks. On their account, this demonstrates that interactive world-action generation is a principled route to actionable world models.

What carries the argument

The reciprocal conditioning loop between the World Predictor and World-Conditioned Action Denoiser during inference, which allows each to update the other inside a compact semantic latent space.

If this is right

  • The interactive approach achieves strong planning performance on multiple autonomous driving benchmarks.
  • It produces favorable safety-related results in complex interactive scenes.
  • Only a short latent rollout suffices to generate long-horizon trajectories.
  • The method avoids the computational cost of full future rollout in pixel space.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar mutual conditioning could be applied to other embodied tasks where actions and environmental responses are tightly interdependent.
  • The success of brief latent refinement suggests that exhaustive future simulation may be unnecessary for many planning problems.
  • Multi-agent interaction settings could adopt the same reciprocal generation mechanism to model mutual dependencies more directly.

Load-bearing premise

That a short explicit latent rollout with mutual conditioning between world predictor and action denoiser captures the necessary reciprocity without requiring full future rollout or pixel-space simulation.

What would settle it

Running the model with the feedback loop disabled and observing no drop in planning accuracy or safety metrics on standard interactive driving test sets would show that the mutual conditioning is not required.
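
The sketch below illustrates that control condition, reusing the toy modules from the earlier sketch; it is an assumption-laden illustration, not the authors' protocol. The feedback path is disabled by predicting the world once from the initial action proposal and never refreshing it, so any metric gap against the interactive loop isolates the contribution of mutual conditioning.

    def ablated_inference(z, predictor, denoiser, action, rounds=3):
        """Feedback disabled: the world hypothesis is computed once and never updated by the refined action."""
        world = predictor(z, action)       # one-shot prediction from the initial proposal
        for _ in range(rounds):
            action = denoiser(z, world)    # world stays frozen; no mutual conditioning
        return action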

Figures

Figures reproduced from arXiv: 2605.11550 by Chenghao He, Haoyu Wang, Hongbo Lu, Liang Yao, Pai Peng, Tao He, Wenlong Liao, Xianfei Li, Xiang Gu.

Figure 1. From WAMs to WAIM. Existing WAMs typically predict world and action in parallel, sequentially, or without explicit test-time rollout. In contrast, WAIM keeps a short latent world rollout and recursively couples world prediction with action generation during inference.
Figure 2. Overview of DAWN. During training, DAWN learns compact latent world tokens with a Student/Teacher Vision-Encoder pair and an Auto-Encoder Resampler, supervises short latent rollout with a World Predictor, and trains a World-Conditioned Action Denoiser for trajectory generation. During inference, the Action Denoiser initializes actions from resampler latents and then recursively refines them with predictor …
Figure 3. Effect of interactive rounds. We further study how iterative refinement affects planning performance. This ablation directly tests whether DAWN benefits from repeated world-action interaction, or whether a single proposal is already sufficient.
Figure 4. Qualitative planning results. We compare human trajectories, Drive-JEPA, and DAWN in five representative driving scenarios. The top row shows front-view observations, and the bottom row shows the corresponding BEV visualization. DAWN produces trajectories that better follow road geometry and remain visually consistent with human driving behavior in complex intersections, narrow streets, and curved junctio…
Figure 5. Illustration of the latent world rollout design space. Zero-rollout methods such as Fast-WAM occupy the left endpoint, full predict-then-plan methods occupy the right endpoint, and DAWN targets a short-rollout regime in between, where compact future evolution provides useful foresight without full-horizon rollout.
Figure 6. More qualitative results of planning.
Figure 7. More qualitative results of prediction.
Figure 8. More qualitative results of prediction.
Figure 9. More qualitative results of prediction.
Figure 10. More qualitative results of feature.
Figure 11. More qualitative results of feature.
original abstract

A plausible scene evolution depends on the maneuver being considered, while a good maneuver depends on how the scene may evolve. Existing World Action Models (WAMs) largely miss this reciprocity, treating world prediction and action generation as either isolated parallel branches or rigid predict-then-plan pipelines. We formalize this perspective as World-Action Interactive Models (WAIMs), and instantiate it in autonomous driving with DAWN (Denoising Actions and World iNteractive model), a simple yet strong latent generative baseline. DAWN operates in a compact semantic latent space and couples a World Predictor with a World-Conditioned Action Denoiser: the predicted world hypothesis conditions action denoising, while the denoised action hypothesis is fed back to update the world prediction, so that both are recursively refined during inference. Rather than eliminating test-time world evolution altogether or rolling out the full future in pixel space, DAWN performs a short explicit latent rollout that is sufficient to support long-horizon trajectory generation in complex interactive scenes. Experiments show that DAWN achieves strong planning performance and favorable safety-related results across multiple autonomous driving benchmarks. More broadly, our results suggest that interactive world-action generation is a principled path toward truly actionable world models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper argues that existing World Action Models (WAMs) fail to capture the reciprocity between scene evolution and action selection, treating them as isolated or sequential. It formalizes World-Action Interactive Models (WAIMs) and instantiates them in autonomous driving via DAWN, a latent-space model coupling a World Predictor with a World-Conditioned Action Denoiser. During inference, the two components mutually condition each other in a short explicit latent rollout (world hypothesis conditions action denoising; denoised action updates world prediction). The authors claim this yields strong planning performance and favorable safety results on multiple autonomous driving benchmarks, positioning interactive world-action generation as a principled route to actionable world models.

Significance. If the experimental claims are substantiated with rigorous baselines, ablations, and quantitative metrics, the work would offer a clear conceptual advance in world models for robotics and autonomous driving by making bidirectional dependence explicit and efficient via latent-space mutual conditioning rather than full pixel rollouts. The emphasis on short explicit rollouts and the avoidance of rigid predict-then-plan pipelines is a strength, as is the focus on safety-critical interactive scenes. However, the current presentation provides no numbers, so the significance cannot yet be assessed.

major comments (2)
  1. [Experiments] Experiments section: the abstract (and any corresponding results tables) asserts 'strong planning performance and favorable safety-related results across multiple autonomous driving benchmarks' with no quantitative metrics, baseline comparisons (e.g., non-interactive WAMs or full-rollout variants), ablation studies on rollout length, or error analysis. This directly undermines evaluation of the central claim that the mutual-conditioning mechanism produces superior actionable models.
  2. [Method] Method section (DAWN description): the claim that 'a short explicit latent rollout ... is sufficient to support long-horizon trajectory generation in complex interactive scenes' rests on the untested premise that limited inference-time feedback captures the necessary reciprocity. No ablations compare short vs. longer latent rollouts or latent vs. pixel-space simulation, nor is there analysis of whether the semantic latent space preserves fine-grained interactions required for safety-critical cases.
minor comments (2)
  1. [Abstract] The acronym expansion for DAWN is given but the paper should consistently use the full name on first mention in every section for clarity.
  2. [Method] Notation for the mutual conditioning loop (world hypothesis → action denoiser → updated world prediction) should be formalized with an equation or pseudocode to make the recursive refinement explicit.
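
For illustration of the second minor comment, one hedged way to write the recursion (our notation, not the paper's): let z be the encoded scene context, f_θ the World Predictor, g_φ the World-Conditioned Action Denoiser, and K the number of interaction rounds. Then, in LaTeX,

    \hat{w}^{(k)} = f_\theta\!\left(z,\, a^{(k-1)}\right), \qquad
    a^{(k)} = g_\phi\!\left(z,\, \hat{w}^{(k)}\right), \qquad k = 1, \dots, K,

with a^{(0)} initialized from the resampler latents and a^{(K)} returned as the planned trajectory.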

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments rightly emphasize the importance of rigorous quantitative validation to support the central claims regarding World-Action Interactive Models. We respond to each major comment below and will revise the manuscript to address the identified gaps.

point-by-point responses
  1. Referee: [Experiments] Experiments section: the abstract (and any corresponding results tables) asserts 'strong planning performance and favorable safety-related results across multiple autonomous driving benchmarks' with no quantitative metrics, baseline comparisons (e.g., non-interactive WAMs or full-rollout variants), ablation studies on rollout length, or error analysis. This directly undermines evaluation of the central claim that the mutual-conditioning mechanism produces superior actionable models.

    Authors: We agree that the current manuscript version presents the experimental claims at a high level without the supporting quantitative details, baselines, ablations, or error analysis. This limits assessment of the mutual-conditioning benefits. In the revised manuscript we will expand the Experiments section with full results tables, direct comparisons to non-interactive WAMs and full-rollout variants, ablations on rollout length, and targeted error analysis on safety-critical cases to substantiate the performance claims. revision: yes

  2. Referee: [Method] Method section (DAWN description): the claim that 'a short explicit latent rollout ... is sufficient to support long-horizon trajectory generation in complex interactive scenes' rests on the untested premise that limited inference-time feedback captures the necessary reciprocity. No ablations compare short vs. longer latent rollouts or latent vs. pixel-space simulation, nor is there analysis of whether the semantic latent space preserves fine-grained interactions required for safety-critical cases.

    Authors: The referee correctly notes that the sufficiency of short latent rollouts and the adequacy of the semantic latent space for fine-grained interactions require explicit empirical support. We will add ablations comparing short versus longer latent rollouts, latent-space versus pixel-space simulation, and a focused analysis of interaction preservation in safety-critical scenarios, including qualitative and quantitative evidence that the latent representation retains the necessary details. revision: yes
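
A compact sketch of the promised rollout-length sweep, again as an assumption-laden illustration (round counts and the evaluation hook are placeholders):

    # Hypothetical rollout-length ablation, assuming the names from the earlier
    # sketch (z, a0, WorldPredictor, ActionDenoiser, interactive_inference) are in scope.
    predictor, denoiser = WorldPredictor(64), ActionDenoiser(64)
    for rounds in [0, 1, 2, 4, 8]:          # 0 rounds = the zero-rollout endpoint of Figure 5
        plan = interactive_inference(z, predictor, denoiser, a0, rounds=rounds)
        # score(plan) would be a benchmark-specific planning/safety metric (placeholder)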

Circularity Check

0 steps flagged

No circularity in derivation chain; claims rest on empirical validation

full rationale

The paper formalizes WAIMs and instantiates DAWN via mutual conditioning between world predictor and action denoiser in latent space during short rollout, with results on benchmarks. No equations, derivations, or self-citations appear in the provided text that reduce any prediction or result to fitted inputs or prior self-work by construction. The reciprocity and sufficiency claims are presented as architectural choices supported by experimental outcomes rather than tautological definitions or load-bearing self-references. The chain is grounded in external benchmarks rather than closing on itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The abstract introduces new model components and a new modeling perspective without specifying numerical parameters or external benchmarks.

axioms (1)
  • domain assumption: Latent semantic space is sufficient to capture interactive scene dynamics for planning
    Invoked when the paper states that short explicit latent rollout supports long-horizon trajectory generation
invented entities (2)
  • World-Action Interactive Models (WAIMs) · no independent evidence
    purpose: Formal framework that enforces reciprocity between world prediction and action generation
    Newly defined in the paper to address limitations of existing WAMs
  • DAWN (Denoising Actions and World iNteractive model) · no independent evidence
    purpose: Concrete instantiation of WAIMs for autonomous driving using coupled world predictor and action denoiser
    New model architecture introduced in the paper

pith-pipeline@v0.9.0 · 5554 in / 1354 out tokens · 30544 ms · 2026-05-13T02:00:34.889179+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · 8 internal anchors

  1. [1]

    Covla: Comprehensive vision-language-action dataset for autonomous driving

    Hidehisa Arai, Keita Miwa, Kento Sasaki, Kohei Watanabe, Yu Yamaguchi, Shunsuke Aoki, and Issei Yamamoto. Covla: Comprehensive vision-language-action dataset for autonomous driving. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1933–1943. IEEE, 2025

  2. [2]

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985, 2025

  3. [3]

    Vavim and vavam: Autonomous driving through video generative modeling

    Florent Bartoccioni, Elias Ramzi, Victor Besnier, Shashanka Venkataramanan, Tuan-Hung Vu, Yihong Xu, Loick Chambon, Spyros Gidaris, Serkan Odabas, David Hurych, et al. Vavim and vavam: Autonomous driving through video generative modeling. arXiv preprint arXiv:2502.15672, 2025

  4. [4]

    nuscenes: A multimodal dataset for autonomous driving

    Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020

  5. [5]

    WorldVLA: Towards Autoregressive Action World Model

    Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, et al. Worldvla: Towards autoregressive action world model. arXiv preprint arXiv:2506.21539, 2025

  6. [6]

    VADv2: End-to-End Vectorized Autonomous Driving via Probabilistic Planning

    Shaoyu Chen, Bo Jiang, Hao Gao, Bencheng Liao, Qing Xu, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. Vadv2: End-to-end vectorized autonomous driving via probabilistic planning. arXiv preprint arXiv:2402.13243, 2024

  7. [7]

    Int2planner: An intention-based multi-modal motion planner for integrated prediction and planning

    Xiaolei Chen, Junchi Yan, Wenlong Liao, Tao He, and Pai Peng. Int2planner: An intention-based multi-modal motion planner for integrated prediction and planning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 14558–14566, 2025

  8. [8]

    Ppad: Iterative interactions of prediction and planning for end-to-end autonomous driving

    Zhili Chen, Maosheng Ye, Shuangjie Xu, Tongyi Cao, and Qifeng Chen. Ppad: Iterative interactions of prediction and planning for end-to-end autonomous driving. In European Conference on Computer Vision, pages 239–256. Springer, 2024

  9. [9]

    Transfuser: Imitation with transformer-based sensor fusion for autonomous driving

    Kashyap Chitta, Aditya Prakash, Bernhard Jaeger, Zehao Yu, Katrin Renz, and Andreas Geiger. Transfuser: Imitation with transformer-based sensor fusion for autonomous driving. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(11):12878–12895, 2022

  10. [10]

    Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking

    Daniel Dauner, Marcel Hallgarten, Tianyu Li, Xinshuo Weng, Zhiyu Huang, Zetong Yang, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, et al. Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking. Advances in Neural Information Processing Systems, 37:28706–28719, 2024

  11. [11]

    Understanding world or predicting future? a comprehensive survey of world models

    Jingtao Ding, Yunke Zhang, Yu Shang, Yuheng Zhang, Zefang Zong, Jie Feng, Yuan Yuan, Hongyuan Su, Nian Li, Nicholas Sukiennik, et al. Understanding world or predicting future? a comprehensive survey of world models. ACM Computing Surveys, 58(3):1–38, 2025

  12. [12]

    A survey of world models for autonomous driving

    Tuo Feng, Wenguan Wang, and Yi Yang. A survey of world models for autonomous driving. arXiv preprint arXiv:2501.11260, 2025

  13. [13]

    Orion: A holistic end-to-end autonomous driving framework by vision-language instructed action generation

    Haoyu Fu, Diankun Zhang, Zongchuang Zhao, Jianfeng Cui, Dingkang Liang, Chong Zhang, Dingyuan Zhang, Hongwei Xie, Bing Wang, and Xiang Bai. Orion: A holistic end-to-end autonomous driving framework by vision-language instructed action generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 24823–24834, 2025

  14. [14]

    Vista: A generalizable driving world model with high fidelity and versatile controllability

    Shenyuan Gao, Jiazhi Yang, Li Chen, Kashyap Chitta, Yihang Qiu, Andreas Geiger, Jun Zhang, and Hongyang Li. Vista: A generalizable driving world model with high fidelity and versatile controllability. Advances in Neural Information Processing Systems, 37:91560–91596, 2024

  15. [15]

    ipad: Iterative proposal-centric end-to-end autonomous driving

    Ke Guo, Haochen Liu, Xiaojun Wu, Jia Pan, and Chen Lv. ipad: Iterative proposal-centric end-to-end autonomous driving. arXiv preprint arXiv:2505.15111, 2025

  16. [16]

    Percept-WAM: Perception-enhanced world-awareness-action model for robust end-to-end autonomous driving

    Jianhua Han, Meng Tian, Jiangtong Zhu, Fan He, Huixin Zhang, Sitong Guo, Dechang Zhu, Hao Tang, Pei Xu, Yuze Guo, et al. Percept-wam: Perception-enhanced world-awareness-action model for robust end-to-end autonomous driving. arXiv preprint arXiv:2511.19221, 2025

  17. [17]

    Dme-driver: Integrating human decision logic and 3d scene perception in autonomous driving

    Wencheng Han, Dongqian Guo, Cheng-Zhong Xu, and Jianbing Shen. Dme-driver: Integrating human decision logic and 3d scene perception in autonomous driving. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 3347–3355, 2025

  18. [18]

    GAIA-1: A Generative World Model for Autonomous Driving

    Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. Gaia-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080, 2023

  19. [19]

    St-p3: End-to-end vision-based autonomous driving via spatial-temporal feature learning

    Shengchao Hu, Li Chen, Penghao Wu, Hongyang Li, Junchi Yan, and Dacheng Tao. St-p3: End-to-end vision-based autonomous driving via spatial-temporal feature learning. In European Conference on Computer Vision, pages 533–549. Springer, 2022

  20. [20]

    Planning-oriented autonomous driving

    Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17853–17862, 2023

  21. [21]

    Adriver-i: A general world model for autonomous driving

    Fan Jia, Weixin Mao, Yingfei Liu, Yucheng Zhao, Yuqing Wen, Chi Zhang, Xiangyu Zhang, and Tiancai Wang. Adriver-i: A general world model for autonomous driving. arXiv preprint arXiv:2311.13549, 2023

  22. [22]

    Vad: Vectorized scene representation for efficient autonomous driving

    Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. Vad: Vectorized scene representation for efficient autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340–8350, 2023

  23. [23]

    Imagidrive: A unified imagination-and-planning framework for autonomous driving

    Jingyu Li, Bozhou Zhang, Xin Jin, Jiankang Deng, Xiatian Zhu, and Li Zhang. Imagidrive: A unified imagination-and-planning framework for autonomous driving. arXiv preprint arXiv:2508.11428, 2025

  24. [24]

    Sgdrive: Scene-to-goal hierarchical world cognition for autonomous driving

    Jingyu Li, Junjie Wu, Dongnan Hu, Xiangkai Huang, Bin Sun, Zhihui Hao, Xianpeng Lang, Xiatian Zhu, and Li Zhang. Sgdrive: Scene-to-goal hierarchical world cognition for autonomous driving. arXiv preprint arXiv:2601.05640, 2026

  25. [25]

    Hydra-mdp++: Advancing end-to-end driving via expert-guided hydra-distillation

    Kailin Li, Zhenxin Li, Shiyi Lan, Yuan Xie, Zhizhong Zhang, Jiayi Liu, Zuxuan Wu, Zhiding Yu, and Jose M Alvarez. Hydra-mdp++: Advancing end-to-end driving via expert-guided hydra-distillation. arXiv preprint arXiv:2503.12820, 2025

  26. [26]

    Generative planning with 3d-vision language pre-training for end-to-end autonomous driving

    Tengpeng Li, Hanli Wang, Xianfei Li, Wenlong Liao, Tao He, and Pai Peng. Generative planning with 3d-vision language pre-training for end-to-end autonomous driving. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 4950–4958, 2025

  27. [27]

    Driverse: Navigation world model for driving simulation via multimodal trajectory prompting and motion alignment

    Xiaofan Li, Chenming Wu, Zhao Yang, Zhihao Xu, Yumeng Zhang, Dingkang Liang, Ji Wan, and Jun Wang. Driverse: Navigation world model for driving simulation via multimodal trajectory prompting and motion alignment. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 9753–9762, 2025

  28. [28]

    A comprehensive survey on world models for embodied AI

    Xinqing Li, Xin He, Le Zhang, Min Wu, Xiaoli Li, and Yun Liu. A comprehensive survey on world models for embodied AI. arXiv preprint arXiv:2510.16732, 2025

  29. [29]

    Enhancing end-to-end autonomous driving with latent world model

    Yingyan Li, Lue Fan, Jiawei He, Yuqi Wang, Yuntao Chen, Zhaoxiang Zhang, and Tieniu Tan. Enhancing end-to-end autonomous driving with latent world model. arXiv preprint arXiv:2406.08481, 2024

  30. [30]

    Unidrivevla: Unifying understanding, perception, and action planning for autonomous driving

    Yongkang Li, Lijun Zhou, Sixu Yan, Bencheng Liao, Tianyi Yan, Kaixin Xiong, Long Chen, Hongwei Xie, Bing Wang, Guang Chen, et al. Unidrivevla: Unifying understanding, perception, and action planning for autonomous driving. arXiv preprint arXiv:2604.02190, 2026

  31. [31]

    Wildworld: A large-scale dataset for dynamic world modeling with actions and explicit state toward generative arpg

    Zhen Li, Zian Meng, Shuwei Shi, Wenshuo Peng, Yuwei Wu, Bo Zheng, Chuanhao Li, and Kaipeng Zhang. Wildworld: A large-scale dataset for dynamic world modeling with actions and explicit state toward generative arpg. arXiv preprint arXiv:2603.23497, 2026

  32. [32]

    Hydra-MDP: End-to-end Multimodal Planning with Multi-target Hydra-Distillation

    Zhenxin Li, Kailin Li, Shihao Wang, Shiyi Lan, Zhiding Yu, Yishen Ji, Zhiqi Li, Ziyue Zhu, Jan Kautz, Zuxuan Wu, et al. Hydra-mdp: End-to-end multimodal planning with multi-target hydra-distillation. arXiv preprint arXiv:2406.06978, 2024

  33. [33]

    Is ego status all you need for open-loop end-to-end autonomous driving?

    Zhiqi Li, Zhiding Yu, Shiyi Lan, Jiahan Li, Jan Kautz, Tong Lu, and Jose M Alvarez. Is ego status all you need for open-loop end-to-end autonomous driving? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14864–14873, 2024

  34. [34]

    Seeing the future, perceiving the future: A unified driving world model for future generation and perception

    Dingkang Liang, Dingyuan Zhang, Xin Zhou, Sifan Tu, Tianrui Feng, Xiaofan Li, Yumeng Zhang, Mingyang Du, Xiao Tan, and Xiang Bai. Seeing the future, perceiving the future: A unified driving world model for future generation and perception. arXiv preprint arXiv:2503.13587, 2025

  35. [35]

    Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving

    Bencheng Liao, Shaoyu Chen, Haoran Yin, Bo Jiang, Cheng Wang, Sixu Yan, Xinbang Zhang, Xiangyu Li, Ying Zhang, Qian Zhang, et al. Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12037–12047, 2025

  36. [36]

    Uni-world vla: Interleaved world modeling and planning for autonomous driving

    Qiqi Liu, Huan Xu, Jingyu Li, Bin Sun, Zhihui Hao, Dangen She, Xiatian Zhu, and Li Zhang. Uni-world vla: Interleaved world modeling and planning for autonomous driving. arXiv preprint arXiv:2603.27287, 2026

  37. [37]

    Unidwm: Towards a unified driving world model via multifaceted representation learning

    Shuai Liu, Siheng Ren, Xiaoyao Zhu, Quanmin Liang, Zefeng Li, Qiang Li, Xin Hu, and Kai Huang. Unidwm: Towards a unified driving world model via multifaceted representation learning. arXiv preprint arXiv:2602.01536, 2026

  38. [38]

    Real-ad: Towards human-like reasoning in end-to-end autonomous driving

    Yuhang Lu, Jiadong Tu, Yuexin Ma, and Xinge Zhu. Real-ad: Towards human-like reasoning in end-to-end autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 27783–27793, 2025

  39. [39]

    Openscene: 3d scene understanding with open vocabularies

    Songyou Peng, Kyle Genova, Chiyu Jiang, Andrea Tagliasacchi, Marc Pollefeys, Thomas Funkhouser, et al. Openscene: 3d scene understanding with open vocabularies. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 815–824, 2023

  40. [40]

    Vla-r: Vision-language action retrieval toward open-world end-to-end autonomous driving

    Hyunki Seong, Seongwoo Moon, Hojin Ahn, Jehun Kang, and David Hyunchul Shim. Vla-r: Vision-language action retrieval toward open-world end-to-end autonomous driving. arXiv preprint arXiv:2511.12405, 2025

  41. [41]

    Drivedpo: Policy learning via safety dpo for end-to-end autonomous driving

    Shuyao Shang, Yuntao Chen, Yuqi Wang, Yingyan Li, and Zhaoxiang Zhang. Drivedpo: Policy learning via safety dpo for end-to-end autonomous driving. arXiv preprint arXiv:2509.17940, 2025

  42. [42]

    Sparsedrive: End-to-end autonomous driving via sparse scene representation

    Wenchao Sun, Xuewu Lin, Yining Shi, Chuang Zhang, Haoran Wu, and Sifa Zheng. Sparsedrive: End-to-end autonomous driving via sparse scene representation. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 8795–8801. IEEE, 2025

  43. [43]

    CausalVAD: De-confounding End-to-End Autonomous Driving via Causal Intervention

    Jiacheng Tang, Zhiyuan Zhou, Zhuolin He, Jia Zhang, Kai Zhang, and Jian Pu. Causalvad: De-confounding end-to-end autonomous driving via causal intervention. arXiv preprint arXiv:2603.18561, 2026

  44. [44]

    Scene as occupancy

    Wenwen Tong, Chonghao Sima, Tai Wang, Li Chen, Silei Wu, Hanming Deng, Yi Gu, Lewei Lu, Ping Luo, Dahua Lin, et al. Scene as occupancy. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8406–8415, 2023

  45. [45]

    Comdrive: Comfort-oriented end-to-end autonomous driving

    Junming Wang, Xingyu Zhang, Zebin Xing, Songen Gu, Xiaoyang Guo, Yang Hu, Ziying Song, Qian Zhang, Xiaoxiao Long, and Wei Yin. Comdrive: Comfort-oriented end-to-end autonomous driving. In 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2682–2689. IEEE, 2025

  46. [46]

    Latent-wam: Latent world action modeling for end-to-end autonomous driving

    Linbo Wang, Yupeng Zheng, Qiang Chen, Shiwei Li, Yichen Zhang, Zebin Xing, Qichao Zhang, Xiang Li, Deheng Qian, Pengxuan Yang, et al. Latent-wam: Latent world action modeling for end-to-end autonomous driving. arXiv preprint arXiv:2603.24581, 2026

  47. [47]

    Drive-jepa: Video jepa meets multimodal trajectory distillation for end-to-end driving

    Linhan Wang, Zichong Yang, Chen Bai, Guoxiang Zhang, Xiaotong Liu, Xiaoyin Zheng, Xiao-Xiao Long, Chang-Tien Lu, and Cheng Lu. Drive-jepa: Video jepa meets multimodal trajectory distillation for end-to-end driving. arXiv preprint arXiv:2601.22032, 2026

  48. [48]

    Drivingdojo dataset: Advancing interactive and knowledge-enriched driving world model

    Yuqi Wang, Ke Cheng, Jiawei He, Qitai Wang, Hengchen Dai, Yuntao Chen, Fei Xia, and Zhaoxiang Zhang. Drivingdojo dataset: Advancing interactive and knowledge-enriched driving world model. Advances in Neural Information Processing Systems, 37:13020–13034, 2024

  49. [49]

    Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving

    Yuqi Wang, Jiawei He, Lue Fan, Hongxin Li, Yuntao Chen, and Zhaoxiang Zhang. Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14749–14759, 2024

  50. [50]

    Infinite-world: Scaling interactive world models to 1000-frame horizons via pose-free hierarchical memory

    Ruiqi Wu, Xuanhua He, Meng Cheng, Tianyu Yang, Yong Zhang, Zhuoliang Kang, Xunliang Cai, Xiaoming Wei, Chunle Guo, Chongyi Li, et al. Infinite-world: Scaling interactive world models to 1000-frame horizons via pose-free hierarchical memory. arXiv preprint arXiv:2602.02393, 2026

  51. [51]

    Goalflow: Goal-driven flow matching for multimodal trajectories generation in end-to-end autonomous driving

    Zebin Xing, Xingyu Zhang, Yang Hu, Bo Jiang, Tong He, Qian Zhang, Xiaoxiao Long, and Wei Yin. Goalflow: Goal-driven flow matching for multimodal trajectories generation in end-to-end autonomous driving. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1602–1611, 2025

  52. [52]

    Wam-flow: Parallel coarse-to-fine motion planning via discrete flow matching for autonomous driving

    Yifang Xu, Jiahao Cui, Feipeng Cai, Zhihao Zhu, Hanlin Shang, Shan Luan, Mingwang Xu, Neng Zhang, Yaoyi Li, Jia Cai, et al. Wam-flow: Parallel coarse-to-fine motion planning via discrete flow matching for autonomous driving. arXiv preprint arXiv:2512.06112, 2025

  53. [53]

    Uncad: Towards safe end-to-end autonomous driving via online map uncertainty

    Pengxuan Yang, Yupeng Zheng, Qichao Zhang, Kefei Zhu, Zebin Xing, Qiao Lin, Yun-Fu Liu, Zhiguo Su, and Dongbin Zhao. Uncad: Towards safe end-to-end autonomous driving via online map uncertainty. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 6408–6415. IEEE, 2025

  54. [54]

    Worldrft: Latent world model planning with reinforcement fine-tuning for autonomous driving

    Pengxuan Yang, Ben Lu, Zhongpu Xia, Chao Han, Yinfeng Gao, Teng Zhang, Kun Zhan, XianPeng Lang, Yupeng Zheng, and Qichao Zhang. Worldrft: Latent world model planning with reinforcement fine-tuning for autonomous driving. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 11649–11657, 2026

  55. [55]

    Drivesuprim: Towards precise trajectory selection for end-to-end planning

    Wenhao Yao, Zhenxin Li, Shiyi Lan, Zi Wang, Xinglong Sun, Jose M Alvarez, and Zuxuan Wu. Drivesuprim: Towards precise trajectory selection for end-to-end planning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 11910–11918, 2026

  56. [56]

    World Action Models are Zero-shot Policies

    Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, et al. World action models are zero-shot policies. arXiv preprint arXiv:2602.15922, 2026

  57. [57]

    Fast-WAM: Do World Action Models Need Test-time Future Imagination?

    Tianyuan Yuan, Zibin Dong, Yicheng Liu, and Hang Zhao. Fast-wam: Do world action models need test-time future imagination? arXiv preprint arXiv:2603.16666, 2026

  58. [58]

    Future-aware end-to-end driving: Bidirectional modeling of trajectory planning and scene evolution

    Bozhou Zhang, Nan Song, Jingyu Li, Xiatian Zhu, Jiankang Deng, and Li Zhang. Future-aware end-to-end driving: Bidirectional modeling of trajectory planning and scene evolution. arXiv preprint arXiv:2510.11092, 2025

  59. [59]

    Epona: Autoregressive diffusion world model for autonomous driving

    Kaiwen Zhang, Zhenyu Tang, Xiaotao Hu, Xingang Pan, Xiaoyang Guo, Yuan Liu, Jingwei Huang, Li Yuan, Qian Zhang, Xiao-Xiao Long, et al. Epona: Autoregressive diffusion world model for autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 27220–27230, 2025

  60. [60]

    Drivedreamer4d: World models are effective data machines for 4d driving scene representation

    Guosheng Zhao, Chaojun Ni, Xiaofeng Wang, Zheng Zhu, Xueyang Zhang, Yida Wang, Guan Huang, Xinze Chen, Boyuan Wang, Youyi Zhang, et al. Drivedreamer4d: World models are effective data machines for 4d driving scene representation. In Proceedings of the computer vision and pattern recognition conference, pages 12015–12026, 2025

  61. [61]

    From forecasting to planning: Policy world model for collaborative state-action prediction.arXiv preprint arXiv:2510.19654, 2025

    Zhida Zhao, Talas Fu, Yifan Wang, Lijun Wang, and Huchuan Lu. From forecasting to planning: Policy world model for collaborative state-action prediction. arXiv preprint arXiv:2510.19654, 2025

  62. [62]

    Genad: Generative end-to-end autonomous driving

    Wenzhao Zheng, Ruiqi Song, Xianda Guo, Chenming Zhang, and Long Chen. Genad: Generative end-to-end autonomous driving. In European Conference on Computer Vision, pages 87–104. Springer, 2024

  63. [63]

    World4drive: End-to-end autonomous driving via intention-aware physical latent world model

    Yupeng Zheng, Pengxuan Yang, Zebin Xing, Qichao Zhang, Yuhang Zheng, Yinfeng Gao, Pengfei Li, Teng Zhang, Zhongpu Xia, Peng Jia, et al. World4drive: End-to-end autonomous driving via intention-aware physical latent world model. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 28632–28642, 2025

  64. [64]

    Hermes: A unified self-driving world model for simultaneous 3d scene understanding and generation

    Xin Zhou, Dingkang Liang, Sifan Tu, Xiwu Chen, Yikang Ding, Dingyuan Zhang, Feiyang Tan, Hengshuang Zhao, and Xiang Bai. Hermes: A unified self-driving world model for simultaneous 3d scene understanding and generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 27817–27827, 2025

  65. [65]

    Opendrivevla: Towards end-to-end autonomous driving with large vision language action model

    Xingcheng Zhou, Xuyuan Han, Feng Yang, Yunpu Ma, Volker Tresp, and Alois Knoll. Opendrivevla: Towards end-to-end autonomous driving with large vision language action model. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 13782–13790, 2026