DriveFuture: Future-Aware Latent World Models for Autonomous Driving
Recognition: 3 theorem links
Pith reviewed 2026-05-12 02:55 UTC · model grok-4.3
The pith
Conditioning current latent states on future world states improves trajectory planning in autonomous driving.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DriveFuture predicts future latent world states from the current latent state and ego action, then refines the prediction against the ground-truth future latent state via cross-attention. The resulting future-aware latent serves as an explicit condition for a diffusion-based trajectory planner. During inference the model substitutes its own predicted future latent for the ground-truth version.
What carries the argument
Cross-attention refinement of predicted future latents against ground-truth futures, which produces a planning-oriented future-aware latent used to condition the trajectory planner.
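The train/inference split described above can be made concrete with a toy sketch. Everything here is illustrative and hypothetical, not the paper's architecture: the linear one-step predictor, the single-query attention, and the latent dimension are all placeholder assumptions. The point mirrored from the paper is that ground-truth future tokens are touched only when `training=True`; at inference the planner is conditioned on the raw prediction.

```python
import math

def predict_future(z_t, action, W=0.9, U=0.1):
    # Hypothetical linear dynamics: z_hat = W * z_t + U * a (placeholder).
    return [W * z + U * a for z, a in zip(z_t, action)]

def cross_attention_refine(z_hat, z_gt_tokens):
    # Single-query attention: the predicted latent attends to
    # ground-truth future tokens, then adds the attended value back
    # through a residual connection (toy stand-in for the paper's step).
    scores = [sum(q * k for q, k in zip(z_hat, tok)) for tok in z_gt_tokens]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    attended = [sum(w * tok[i] for w, tok in zip(weights, z_gt_tokens))
                for i in range(len(z_hat))]
    return [p + a for p, a in zip(z_hat, attended)]

def planner_condition(z_t, action, z_gt_tokens=None, training=False):
    z_hat = predict_future(z_t, action)
    if training and z_gt_tokens is not None:
        return cross_attention_refine(z_hat, z_gt_tokens)  # refined latent
    return z_hat  # inference: raw predicted future latent only

z_t, a = [0.2, -0.1, 0.5], [1.0, 0.0, -1.0]
gt_tokens = [[0.3, -0.1, 0.4], [0.1, 0.0, 0.6]]
train_cond = planner_condition(z_t, a, gt_tokens, training=True)
infer_cond = planner_condition(z_t, a)
```

The gap between `train_cond` and `infer_cond` is exactly the distribution shift the referee report below flags: the planner is trained on refined latents but deployed on raw ones.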
If this is right
- Current and future features become less entangled because the refinement step forces the model to treat futures as an explicit conditioning signal.
- The diffusion planner receives a latent that already encodes planning-relevant foresight rather than raw scene dynamics.
- Performance remains high when ground-truth futures are replaced by model predictions, showing the training procedure transfers to deployment.
- The same conditioning pattern can be applied to other latent world models that currently treat future states only as auxiliary targets.
Where Pith is reading between the lines
- The same future-conditioning pattern might reduce the need for very long prediction horizons by letting short-term futures already shape immediate actions.
- Extending the refinement step to multiple future time steps could allow planners to balance short-term safety with longer-term goals.
- The approach may generalize to non-driving sequential tasks where decisions must anticipate downstream states without explicit supervision on those states.
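The multi-step extension in the second bullet can be sketched as weighted pooling over rolled-out future latents. This is purely speculative illustration: the one-step predictor, the horizon weights, and the pooling scheme are assumptions, not anything from the paper.

```python
def rollout_futures(z_t, actions, step):
    # step(z, a) -> next latent; hypothetical one-step predictor.
    futures, z = [], z_t
    for a in actions:
        z = step(z, a)
        futures.append(z)
    return futures

def multi_horizon_condition(futures, weights):
    # Weighted pooling over horizons: heavier near-term weights favor
    # immediate safety, heavier far-term weights favor goal progress.
    dim = len(futures[0])
    return [sum(w * f[i] for w, f in zip(weights, futures))
            for i in range(dim)]

step = lambda z, a: [0.9 * zi + 0.1 * ai for zi, ai in zip(z, a)]
futs = rollout_futures([0.2, -0.1], [[1.0, 0.0], [0.0, 1.0]], step)
cond = multi_horizon_condition(futs, [0.7, 0.3])
```

A planner conditioned on `cond` would see a blend of short- and long-horizon futures rather than a single time step, which is one concrete way the short/long-term trade-off in the bullet could be realized.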
Load-bearing premise
The cross-attention step during training creates a latent encoding that still extracts useful future information when the model must rely on its own imperfect predictions at inference time.
What would settle it
An ablation that removes the cross-attention refinement step and shows no drop in planning performance on the same driving benchmarks would falsify the claim that future conditioning is the key mechanism.
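A harness for that ablation could look like the following sketch. The evaluator here is a stub returning placeholder numbers (not reported results); only the structure matters: everything is held fixed except the refinement flag, and a near-zero delta would falsify the future-conditioning claim.

```python
# Hypothetical ablation grid: identical except for the refinement step.
ablation_configs = [
    {"name": "full",      "cross_attn_refine": True},
    {"name": "no-refine", "cross_attn_refine": False},
]

def run_ablation(train_and_eval, configs):
    # train_and_eval(config) -> EPDMS score on a fixed benchmark split.
    return {c["name"]: train_and_eval(c) for c in configs}

# Stub evaluator standing in for real NAVSIM training runs;
# the returned scores are made-up placeholders, not measurements.
scores = run_ablation(lambda c: 55.5 if c["cross_attn_refine"] else 51.0,
                      ablation_configs)
delta = scores["full"] - scores["no-refine"]
# delta ~ 0 would falsify the claim; a large delta would support it.
```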
Original abstract
Existing latent world models for autonomous driving have opened a promising path toward future-aware driving intelligence. However, they typically treat future latent states as prediction targets or auxiliary signals, rather than directly conditioning trajectory planning. This can entangle current and future features in latent space. In this work, we propose DriveFuture, a future-aware latent world modeling framework for autonomous driving that explicitly learns planning-oriented foresight by conditioning the current latent state modeling process on future world states. Specifically, during training, the model first predicts future latent world states from the current latent state and ego action, and then refines the prediction against the ground-truth future latent state via cross-attention. The resulting future-aware latent serves as an explicit condition for a diffusion-based trajectory planner. During inference, DriveFuture conditions on the predicted future latent state instead of the ground-truth future state. DriveFuture achieves SOTA performance on the public NAVSIM benchmarks, reaching 55.5 EPDMS on NAVSIM-v2 navhard, 89.9 EPDMS on NAVSIM-v2 navtest, and 90.7 PDMS on NAVSIM-v1 navtest, respectively. These results suggest that the key to latent world modeling lies not merely in simulating future states, but more importantly in conditioning current decision-making on future states. Notably, as of April 2026, DriveFuture ranks 1st on the NAVSIM-v2 navhard leaderboard (https://huggingface.co/spaces/AGC2025/e2e-driving-navhard) and achieves SOTA performance on NAVSIM-v1 navtest (https://huggingface.co/spaces/AGC2024-P/e2e-driving-navtest).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DriveFuture, a latent world model for autonomous driving that predicts future latent states from the current latent and ego action, then refines the prediction via cross-attention against ground-truth future latents exclusively during training. The resulting future-aware latent explicitly conditions a diffusion-based trajectory planner. At inference the planner receives only the raw predicted future latent. The method reports SOTA EPDMS scores of 55.5 on NAVSIM-v2 navhard, 89.9 on NAVSIM-v2 navtest, and 90.7 PDMS on NAVSIM-v1 navtest, claiming first place on the navhard leaderboard and arguing that the key advance is conditioning current decisions on future states rather than treating futures only as targets.
Significance. If the reported gains prove robust to the train-inference mismatch and are attributable to the explicit future-conditioning mechanism, the work would offer a concrete demonstration that foresight should directly shape current planning in latent world models. The SOTA numbers on public NAVSIM benchmarks would then indicate a practical step toward more anticipatory end-to-end driving policies.
Major comments (2)
- Abstract and §3 (method description): The cross-attention refinement is performed only against ground-truth future latents during training, yet inference conditions the planner on unrefined predicted latents. This introduces an unquantified distribution shift. No measurements of latent prediction error, cosine similarity between refined and raw latents, or error propagation to the planner are supplied, leaving the central claim that future-state conditioning is the key driver unverified.
- Experiments section and results tables: The SOTA EPDMS figures (55.5 navhard, etc.) are presented without an ablation that removes the GT-refinement step while keeping the future prediction and diffusion planner fixed. Without this control, it remains possible that the gains arise from the latent encoder, diffusion architecture, or training data rather than the future-aware conditioning mechanism.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of the training-inference consistency and the need for stronger isolation of the future-conditioning contribution. We address each major comment below and outline revisions to strengthen the paper.
Point-by-point responses
Referee: Abstract and §3 (method description): The cross-attention refinement is performed only against ground-truth future latents during training, yet inference conditions the planner on unrefined predicted latents. This introduces an unquantified distribution shift. No measurements of latent prediction error, cosine similarity between refined and raw latents, or error propagation to the planner are supplied, leaving the central claim that future-state conditioning is the key driver unverified.
Authors: We acknowledge the train-inference discrepancy introduced by the training-only cross-attention refinement. The refinement step is intended to improve the quality of the learned future latent representations by aligning predictions more closely with ground-truth futures during optimization, thereby enabling the model to produce better raw predictions at inference time. While the original submission did not include quantitative analysis of latent prediction error or cosine similarity, the strong benchmark results suggest the approach is effective. To directly address the concern and verify the central claim, we will add measurements of latent prediction error, cosine similarity between refined and raw predicted latents, and an analysis of error propagation to the planner in the revised manuscript. Revision: yes.
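The diagnostics promised here (latent prediction error, cosine similarity between refined and raw latents) are cheap to compute once both latents are logged. A minimal sketch, with all latent values below being hypothetical stand-ins rather than measurements from the paper:

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def l2_error(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

raw_pred  = [0.28, -0.09, 0.35]   # hypothetical raw predicted future latent
refined   = [0.30, -0.10, 0.40]   # hypothetical refined latent (train-time)
gt_future = [0.31, -0.11, 0.42]   # hypothetical ground-truth future latent

gap_raw     = l2_error(raw_pred, gt_future)   # proxy for train-inference shift
gap_refined = l2_error(refined, gt_future)
sim         = cosine_similarity(raw_pred, refined)
```

Averaging `gap_raw - gap_refined` and `sim` over a validation set would quantify how much of the refinement's benefit survives when the model must rely on its own raw predictions, which is exactly the unverified step the referee flags.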
Referee: Experiments section and results tables: The SOTA EPDMS figures (55.5 navhard, etc.) are presented without an ablation that removes the GT-refinement step while keeping the future prediction and diffusion planner fixed. Without this control, it remains possible that the gains arise from the latent encoder, diffusion architecture, or training data rather than the future-aware conditioning mechanism.
Authors: We agree that an ablation isolating the GT-refinement step is necessary to attribute performance gains specifically to the future-aware conditioning mechanism. In the revised version, we will include a controlled ablation that disables the cross-attention refinement during training while retaining the future prediction module and diffusion-based planner unchanged. This will allow direct comparison of EPDMS scores and clarify whether the explicit future-state conditioning is the primary driver of the reported SOTA results. Revision: yes.
Circularity Check
No circularity: empirical architecture validated on benchmarks
Full rationale
The paper proposes DriveFuture as an architectural framework: it predicts future latent states from current latent + ego action, applies cross-attention refinement against ground-truth future latents exclusively during training to produce a future-aware latent, and conditions a diffusion planner on that latent. At inference the planner uses the raw predicted latent. The central claim is that explicitly conditioning current decision-making on future states (rather than treating futures only as targets) yields better planning, supported by reported SOTA EPDMS scores on public NAVSIM benchmarks. No equations, parameter-fitting steps, uniqueness theorems, or self-citation chains appear in the provided text; the result is presented as an empirical engineering outcome rather than a derivation that reduces to its inputs by construction. The training/inference distinction is explicitly stated, so no load-bearing step collapses into a tautology or fitted input renamed as prediction.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction (relevance: unclear). Matched passage: "the model first predicts future latent world states from the current latent state and ego action, and then refines the prediction against the ground-truth future latent state via cross-attention. The resulting future-aware latent serves as an explicit condition for a diffusion-based trajectory planner."
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (relevance: unclear). Matched passage: "LatentAlign anneals the planning condition from Z^c_{t+T} towards Ẑ_{t+T} over training"
- IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking (relevance: unclear). Matched passage: "8-step trajectory over a 4 second horizon with a 0.5 second interval"