ReSim: Reliable World Simulation for Autonomous Driving

Andreas Geiger; Hongyang Li; Jiazhi Yang; Kashyap Chitta; Li Chen; Long Chen; Shenyuan Gao; Xiangyu Yue; Xiaosong Jia; Yuqian Shao

arXiv: 2506.09981 · v2 · submitted 2025-06-11 · cs.CV · cs.RO

Pith reviewed 2026-05-19 09:24 UTC · model grok-4.3

classification: cs.CV, cs.RO

keywords: autonomous driving, world model, driving simulation, diffusion transformer, policy evaluation, video generation, reward estimation

The pith

ReSim simulates open-world driving scenarios under hazardous non-expert actions by training on mixed real and simulator data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Driving world models built only on safe expert trajectories cannot follow rare hazardous behaviors, which restricts their use for testing new policies or planning in risky situations. The paper mixes real-world human demonstrations with diverse non-expert trajectories collected from a simulator such as CARLA to form a heterogeneous training set. A diffusion transformer is equipped with new strategies for integrating conditioning signals, producing the ReSim model that generates future scenes with higher fidelity and better control over both expert and non-expert actions. An added Video2Reward module extracts reward signals from the simulated futures to support downstream tasks. A sympathetic reader cares because this approach could let developers evaluate autonomous driving systems across a much wider range of behaviors without real-world danger.

Core claim

The authors claim that enriching real-world driving data with simulator-collected non-expert trajectories and training a controllable diffusion transformer world model on the combined corpus produces reliable simulations of diverse open-world scenarios under various ego actions, including hazardous ones. Strategies are introduced to integrate conditioning signals effectively for improved controllability and visual fidelity. The Video2Reward module then derives reward estimates from ReSim outputs to enable planning and policy selection.

What carries the argument

A diffusion transformer world model trained on a heterogeneous corpus of real expert trajectories and simulator non-expert data, with added conditioning integration strategies and a Video2Reward module that estimates rewards from simulated video futures.
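
The way the two components fit together for policy selection can be sketched as a short loop: simulate a future for each candidate action, score each future, and keep the best. This is a minimal sketch under stated assumptions, not the authors' implementation. The names `select_action`, `toy_simulate`, and `toy_reward`, the `(steering, acceleration)` action format, and the list-based frame type are all hypothetical stand-ins for the world model and the Video2Reward module.

```python
from typing import Callable, List, Sequence, Tuple

Frame = List[float]           # placeholder for an image; a real frame would be an array
Video = List[Frame]
Action = Tuple[float, float]  # hypothetical (steering, acceleration) pair


def select_action(
    history: Video,
    candidates: Sequence[Action],
    simulate: Callable[[Video, Action], Video],
    video_to_reward: Callable[[Video], float],
) -> Tuple[Action, float]:
    """Return the candidate whose simulated future receives the highest reward."""
    if not candidates:
        raise ValueError("need at least one candidate action")
    # Roll out one imagined future per candidate and score it.
    scored = [(video_to_reward(simulate(history, a)), a) for a in candidates]
    best_reward, best_action = max(scored, key=lambda pair: pair[0])
    return best_action, best_reward


# Toy stand-ins so the sketch runs end to end; neither resembles the real models.
def toy_simulate(history: Video, action: Action) -> Video:
    steering, accel = action
    # One "frame" per future step, holding only a lateral offset and a speed value.
    return [[steering * t, 1.0 + accel * t] for t in range(1, 5)]


def toy_reward(video: Video) -> float:
    # Penalise the largest lateral offset, a crude proxy for leaving the lane.
    return 1.0 - max(abs(frame[0]) for frame in video)
```

Passing the simulator and the reward model in as callables keeps the selection logic independent of either model, so a learned world model and a learned reward head could replace the toy functions without changing `select_action`.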

If this is right

ReSim achieves up to 44% higher visual fidelity than prior models.
Controllability improves by over 50% for both expert and non-expert actions.
Planning performance on NAVSIM rises by 2% and policy selection by 25%.
Simulated futures now support reward-based judgment of diverse driving actions including hazardous ones.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This mixed-data training strategy could apply to other robotics domains where safe expert data is common but risky exploration is needed.
The Video2Reward module might transfer to reward estimation in other video prediction systems for decision making.
If domain gaps remain small, the method could speed up testing of rare-event robustness in autonomous systems without new real-world collection.

Load-bearing premise

Mixing simulator trajectories with real-world demonstrations does not create unmodeled domain gaps that degrade performance on real hazardous scenarios.

What would settle it

Apply ReSim to a recorded real-world hazardous maneuver such as sudden swerving and measure whether the generated future frames match actual vehicle dynamics and scene elements in held-out footage.
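
One way to quantify such a check is a displacement error between the trajectory recovered from the generated frames and the trajectory recorded in the held-out footage. The sketch below assumes both trajectories are already available as equal-length lists of `(x, y)` positions; how the trajectory is recovered from video (for example by an inverse dynamics model) is outside the sketch, and the function names are illustrative rather than taken from the paper.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]  # (x, y) position in metres, ego frame (assumed convention)


def average_displacement_error(pred: Sequence[Point], ref: Sequence[Point]) -> float:
    """Mean Euclidean distance between matching waypoints."""
    if len(pred) != len(ref) or not pred:
        raise ValueError("trajectories must be non-empty and of equal length")
    return sum(math.dist(p, r) for p, r in zip(pred, ref)) / len(pred)


def final_displacement_error(pred: Sequence[Point], ref: Sequence[Point]) -> float:
    """Euclidean distance between the last waypoints of the two trajectories."""
    if len(pred) != len(ref) or not pred:
        raise ValueError("trajectories must be non-empty and of equal length")
    return math.dist(pred[-1], ref[-1])
```

A low average error with a high final error would suggest that the simulated vehicle follows the maneuver at first and then drifts, which is the kind of failure the proposed test is meant to expose.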

Figures

Figures reproduced from arXiv: 2506.09981 by Andreas Geiger, Hongyang Li, Jiazhi Yang, Kashyap Chitta, Li Chen, Long Chen, Shenyuan Gao, Xiangyu Yue, Xiaosong Jia, Yuqian Shao.

**Figure 1.** Overview of ReSim. (a) Heterogeneous driving data includes (i, ii) experts' safe driving logs and (iii) potentially dangerous (non-expert) driving behaviors from simulations. (b) Prior driving world models are trained solely on expert data, leading to consistently safe yet inaccurate imaginations; in ReSim, we leverage all sources of data to simulate reliable and realistic futures, and build a robust rewar…

**Figure 2.** Video2Reward model (V2R). Top: V2R is supervised by the infraction score of both safe and hazardous data from simulation, deriving the reward from a driving video. Bottom: In real-world inference, the predicted video of ReSim in reaction to a proposed action is fed into V2R to estimate the action's reward. In detail, our Video2Reward model (V2R) is established on a frozen DINOv2 backbone [50] with an additio…

**Figure 3.** Video prediction-based policy. ReSim conditions on the history context (left) to synthesize a plausible visual plan (middle), which is then translated into an ego trajectory via an IDM (right).

**Figure 4.** Human evaluation of non-expert action controllability. ReSim gets the most votes in both realism and trajectory following.

**Figure 5.** Qualitative comparisons of non-expert action controllability. ReSim reliably simulates hazardous outcomes from the non-expert action, while other methods either fail to follow the specified trajectory or compromise the scenario's consistency. ⋆: without simulated data in training.

**Figure 6.** Zero-shot action controllability. ReSim can reliably follow both expert and non-expert actions in various scenarios from zero-shot datasets. ReSim yields significantly better results in a zero-shot manner compared to in-distribution models. We also provide qualitative comparisons for long-term future prediction in Appendix Sec. C, where…

**Figure 7.** Reward correlation. Our method of composing ReSim and the Video2Reward model yields more accurate rewards compared to baselines in both datasets.

**Figure 8.** Closed-loop visual simulation example. A policy with front view only runs within the imaginary world generated by ReSim. The policy is adapted from XVO [60].

**Figure 9.** Effect of unbalanced noise sampling. Training with unbalanced noise sampling yields improved motion and scenario consistency.
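
Figure 7 compares estimated rewards against reference rewards. A minimal sketch of such a comparison follows, assuming Pearson correlation as the measure; the captions above do not say which coefficient the paper uses, so this choice and the function name `pearson` are assumptions made for illustration.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between predicted rewards xs and reference rewards ys."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Unnormalised covariance and variances; the 1/n factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

A value near 1 would mean the estimated rewards rank actions in nearly the same order and proportion as the reference scores; a rank-based coefficient may be preferable if only the ordering of candidate actions matters.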

Original abstract

How can we reliably simulate future driving scenarios under a wide range of ego driving behaviors? Recent driving world models, developed exclusively on real-world driving data composed mainly of safe expert trajectories, struggle to follow hazardous or non-expert behaviors, which are rare in such data. This limitation restricts their applicability to tasks such as policy evaluation. In this work, we address this challenge by enriching real-world human demonstrations with diverse non-expert data collected from a driving simulator (e.g., CARLA), and building a controllable world model trained on this heterogeneous corpus. Starting with a video generator featuring a diffusion transformer architecture, we devise several strategies to effectively integrate conditioning signals and improve prediction controllability and fidelity. The resulting model, ReSim, enables Reliable Simulation of diverse open-world driving scenarios under various actions, including hazardous non-expert ones. To close the gap between high-fidelity simulation and applications that require reward signals to judge different actions, we introduce a Video2Reward module that estimates a reward from ReSim's simulated future. Our ReSim paradigm achieves up to 44% higher visual fidelity, improves controllability for both expert and non-expert actions by over 50%, and boosts planning and policy selection performance on NAVSIM by 2% and 25%, respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ReSim mixes CARLA non-expert trajectories with real data to improve controllability in a diffusion driving world model, but the domain gap remains a live concern that needs clearer checks.

ReSim's core move is to train a diffusion transformer world model on a mix of real driving videos and CARLA simulator trajectories so it can follow non-expert and hazardous actions that are scarce in real data alone. They also add a Video2Reward head that turns the generated future videos into reward signals for planning and policy selection. That combination is what the paper actually contributes beyond prior diffusion-based driving simulators.

The conditioning tweaks they describe for the transformer look like they help the model stick to the input actions more reliably while keeping visual quality up. The reported lifts in fidelity and controllability, plus the modest NAVSIM gains, are the kind of downstream numbers that matter for people who want to use simulation for safety testing. If the evaluation splits and baselines are handled cleanly, this is a practical incremental step.

The soft spot is still the domain gap. CARLA has simpler physics, different rendering, and control distributions that do not match real sensors or vehicle dynamics. Without explicit real-only ablations or adaptation layers shown in the results, it is possible the model is picking up simulator artifacts that inflate controllability on mixed test sets but weaken reward estimates or planning performance when the simulated futures are applied to genuine data. The abstract does not detail how they tested for this transfer, so that is the section I would read first in the full manuscript.

This paper is aimed at researchers building world models for autonomous driving and anyone working on simulation-based policy evaluation. A reader focused on practical AV safety tools would get the most out of the Video2Reward idea and the heterogeneous training approach. The work shows clear thinking on the controllability problem and honest engagement with the video generation literature, so it deserves a serious referee even if the domain-gap handling needs tightening. I would send it out for peer review.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes ReSim, a controllable diffusion-transformer world model for simulating open-world driving scenarios under expert and non-expert (including hazardous) actions. It enriches real-world human demonstrations with non-expert trajectories collected in CARLA, introduces conditioning integration strategies to improve fidelity and controllability, and adds a Video2Reward module that derives reward signals from the simulated futures. Quantitative claims include up to 44% higher visual fidelity, over 50% better controllability for both action types, and 2%/25% gains in planning and policy selection on NAVSIM.

Significance. If the central claims hold after addressing domain-gap concerns, the work would be significant for autonomous-driving world models: it directly tackles the rarity of hazardous trajectories in real data and supplies a practical bridge from simulation to reward-based policy evaluation. The Video2Reward component is a concrete contribution that could be reused beyond this architecture.

major comments (2)

Abstract and §4 (Experiments): the reported 44% visual-fidelity and >50% controllability gains are presented without explicit evaluation protocols, baseline definitions, statistical significance tests, or precise definitions of “non-expert” actions; these omissions make the numbers impossible to interpret or reproduce from the given information.
§3.2 (Data Construction): mixing CARLA non-expert trajectories with real-world data is load-bearing for the generalization claim, yet no domain-adaptation layers, cycle-consistency losses, or real-only ablation studies are described; without them the model may exploit simulator-specific artifacts, undermining transfer to real hazardous scenarios and the downstream NAVSIM reward estimates.

minor comments (2)

Figure 3 and §4.3: axis labels and legend entries for the controllability metrics are difficult to read; enlarge fonts and add a table of exact numerical values.
§2 (Related Work): the discussion of prior driving world models omits recent diffusion-based video generators that also condition on actions; add these references for completeness.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have carefully reviewed the major comments and provide point-by-point responses below, along with our plans for revisions to improve clarity, reproducibility, and robustness.

Referee: Abstract and §4 (Experiments): the reported 44% visual-fidelity and >50% controllability gains are presented without explicit evaluation protocols, baseline definitions, statistical significance tests, or precise definitions of “non-expert” actions; these omissions make the numbers impossible to interpret or reproduce from the given information.

Authors: We appreciate the referee's emphasis on reproducibility. Evaluation protocols are described in §4, including visual fidelity metrics (FVD, FID) and controllability measures (action prediction accuracy and trajectory deviation). Baselines include prior world models such as DriveDreamer and Vista, with non-expert actions defined as CARLA trajectories exhibiting high deviation from expert human demonstrations (e.g., via steering/throttle variance thresholds). To address the concern directly, we will add a dedicated evaluation protocol subsection, report statistical significance via paired t-tests with p-values, and provide explicit formulas for the percentage gains in the revised version.
Revision: yes

Referee: §3.2 (Data Construction): mixing CARLA non-expert trajectories with real-world data is load-bearing for the generalization claim, yet no domain-adaptation layers, cycle-consistency losses, or real-only ablation studies are described; without them the model may exploit simulator-specific artifacts, undermining transfer to real hazardous scenarios and the downstream NAVSIM reward estimates.

Authors: We acknowledge the importance of addressing potential domain gaps. Our conditioning integration strategies (§3.1) and heterogeneous training enable the model to generalize without dedicated adaptation layers or cycle-consistency losses, as supported by strong real-world NAVSIM results. However, to strengthen the claim, we will add a real-only ablation study in the revised experiments section. We disagree that explicit domain-adaptation is required here, as the diffusion-transformer architecture and data preprocessing sufficiently mitigate simulator artifacts for the reported controllability and planning gains.
Revision: partial
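
The statistical reporting promised in the first response could look roughly like the sketch below: a relative-gain formula that respects metric direction (lower is better for FVD and FID) and a paired t statistic over per-sample scores. Both function names are hypothetical, and the sketch stops at the t statistic; turning it into a p-value needs the Student t distribution with n - 1 degrees of freedom, which the Python standard library does not provide.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def relative_gain(baseline: float, ours: float, lower_is_better: bool) -> float:
    """Percentage improvement of `ours` over `baseline`, positive when ours is better."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    change = (baseline - ours) if lower_is_better else (ours - baseline)
    return 100.0 * change / abs(baseline)


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """t statistic for paired samples a[i], b[i] measured on the same test items."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two sequences of equal length with at least 2 items")
    diffs = [x - y for x, y in zip(a, b)]
    spread = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if spread == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    return mean(diffs) / (spread / math.sqrt(len(diffs)))
```

Pairing matters here because both models are evaluated on the same clips: the test looks at per-clip differences, which removes clip-to-clip difficulty from the variance.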

Circularity Check

0 steps flagged

No circularity in empirical training of mixed-data world model

The paper presents an empirical ML approach: a diffusion transformer is trained on a mixed corpus of real expert trajectories and CARLA non-expert data, with additional conditioning strategies and a separate Video2Reward module. Reported gains (fidelity, controllability, NAVSIM planning) are measured outcomes on held-out sets rather than quantities that reduce by construction to the training inputs or fitted parameters. No self-definitional equations, fitted-input predictions, or load-bearing self-citation chains appear in the derivation; the central claims remain falsifiable through external benchmarks and do not rely on renaming or smuggling prior ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claims rest on the assumption that simulator data can be seamlessly integrated with real data and that the Video2Reward module produces scores that correlate with downstream planning utility; no explicit free parameters or invented physical entities are mentioned.

pith-pipeline@v0.9.0 · 5783 in / 1266 out tokens · 29251 ms · 2026-05-19T09:24:39.497119+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: ReSim is built on CogVideoX, a high-capacity diffusion transformer... L_diffusion + λ L_dynamics... unbalanced noise sampling... Video2Reward model... DINOv2 backbone

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: enriching real-world human demonstrations with diverse non-expert data collected from a driving simulator (e.g., CARLA)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

From Articulated Kinematics to Routed Visual Control for Action-Conditioned Surgical Video Generation
cs.CV · 2026-05 · unverdicted · novelty 7.0
A kinematic-to-visual lifting paradigm combined with hierarchically routed control generates action-conditioned surgical videos with better faithfulness, fidelity, and efficiency.

Learning Vision-Language-Action World Models for Autonomous Driving
cs.CV · 2026-04 · unverdicted · novelty 7.0
VLA-World improves autonomous driving by using action-guided future image generation followed by reflective reasoning over the imagined scene to refine trajectories.

CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
cs.CV · 2026-05 · unverdicted · novelty 6.0
CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and t...

CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
cs.CV · 2026-05 · unverdicted · novelty 6.0
CoWorld-VLA encodes world information into four expert tokens that condition a diffusion-based planner, yielding competitive collision avoidance and trajectory accuracy on the NAVSIM benchmark.

DriveFuture: Future-Aware Latent World Models for Autonomous Driving
cs.CV · 2026-05 · unverdicted · novelty 6.0
DriveFuture achieves SOTA results on NAVSIM by conditioning latent world model states on future predictions to directly inform trajectory planning.

Sim2Real-AD: A Modular Sim-to-Real Framework for Deploying VLM-Guided Reinforcement Learning in Real-World Autonomous Driving
cs.RO · 2026-04 · unverdicted · novelty 6.0
Sim2Real-AD enables zero-shot transfer of CARLA-trained VLM-guided RL policies to full-scale vehicles, reporting 75-90% success rates in car-following, obstacle avoidance, and stop-sign scenarios without real-world RL...

ExploreVLA: Dense World Modeling and Exploration for End-to-End Autonomous Driving
cs.CV · 2026-04 · unverdicted · novelty 6.0
ExploreVLA augments VLA driving models with future RGB and depth prediction for dense supervision and uses prediction uncertainty as a safety-gated intrinsic reward for RL-based exploration, reaching SOTA PDMS 93.7 on NAVSIM.

DriveLaW: Unifying Planning and Video Generation in a Latent Driving World
cs.CV · 2025-12 · unverdicted · novelty 6.0
DriveLaW unifies video world modeling and trajectory planning by injecting video-generator latents into a diffusion planner, achieving SOTA video prediction and a new record on the NAVSIM planning benchmark.

Reference graph

Works this paper leans on

127 extracted references · 127 canonical work pages · cited by 7 Pith papers · 14 internal anchors

[1] Yann LeCun. A path towards autonomous machine intelligence. Open Review, 62, 2022.
[2] David Ha and Jürgen Schmidhuber. Recurrent world models facilitate policy evolution. In NeurIPS, 2018.
[3] Sherry Yang, Jacob Walker, Jack Parker-Holder, Yilun Du, Jake Bruce, Andre Barreto, Pieter Abbeel, and Dale Schuurmans. Video as the new language for real-world decision making. In ICML, 2024.
[4] Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. In CVPR, 2023.
[5] Penghao Wu, Li Chen, Hongyang Li, Xiaosong Jia, Junchi Yan, and Yu Qiao. Policy pre-training for autonomous driving via self-supervised geometric modeling. In ICLR, 2023.
[6] Nicklas Hansen, Xiaolong Wang, and Hao Su. Temporal difference learning for model predictive control. In ICML, 2022.
[7] Gaoyue Zhou, Hengkai Pan, Yann LeCun, and Lerrel Pinto. DINO-WM: World models on pre-trained visual features enable zero-shot planning. arXiv preprint arXiv:2411.04983, 2024.
[8] Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, and Yann LeCun. Navigation world models. In CVPR, 2025.
[9] Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in atari. In NeurIPS, 2024.
[10] Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to Control: Learning behaviors by latent imagination. In ICLR, 2020.
[11] Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter. Diffusion models are real-time game engines. In ICLR, 2024.
[12] Jing Yu Koh, Honglak Lee, Yinfei Yang, Jason Baldridge, and Peter Anderson. Pathdreamer: A world model for indoor navigation. In ICCV, 2021.
[13] Mengjiao Yang, Yilun Du, Kamyar Ghasemipour, Jonathan Tompson, Dale Schuurmans, and Pieter Abbeel. Learning interactive real-world simulators. In ICLR, 2024.
[14] Jiazhi Yang, Shenyuan Gao, Yihang Qiu, Li Chen, Tianyu Li, Bo Dai, Kashyap Chitta, Penghao Wu, Jia Zeng, Ping Luo, et al. Generalized predictive model for autonomous driving. In CVPR, 2024.
[15] Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. GAIA-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080, 2023.
[16] Shenyuan Gao, Jiazhi Yang, Li Chen, Kashyap Chitta, Yihang Qiu, Andreas Geiger, Jun Zhang, and Hongyang Li. Vista: A generalizable driving world model with high fidelity and versatile controllability. In NeurIPS, 2024.
[17] Guosheng Zhao, Xiaofeng Wang, Zheng Zhu, Xinze Chen, Guan Huang, Xiaoyi Bao, and Xingang Wang. DriveDreamer-2: LLM-enhanced world models for diverse driving video generation. arXiv preprint arXiv:2403.06845, 2024.
[18] Mariam Hassan, Sebastian Stapf, Ahmad Rahimi, Pedro Rezende, Yasaman Haghighi, David Brüggemann, Isinsu Katircioglu, Lin Zhang, Xiaoran Chen, Suman Saha, et al. GEM: A generalizable ego-vision multimodal world model for fine-grained ego-motion, object dynamics, and scene composition control. In CVPR, 2025.
[19] Stephen Tian, Chelsea Finn, and Jiajun Wu. A control-centric benchmark for video prediction. In ICLR,
[20] Shenyuan Gao, Siyuan Zhou, Yilun Du, Jun Zhang, and Chuang Gan. AdaWorld: Learning adaptable world models with latent actions. In ICML, 2025.
[21] Hidehisa Arai, Keishi Ishihara, Tsubasa Takahashi, and Yu Yamaguchi. ACT-Bench: Towards action controllable world models for autonomous driving. arXiv preprint arXiv:2412.05337, 2024.
[22] Zhiqi Li, Zhiding Yu, Shiyi Lan, Jiahan Li, Jan Kautz, Tong Lu, and Jose M Alvarez. Is ego status all you need for open-loop end-to-end autonomous driving? In CVPR, 2024.
[23] Daniel Dauner, Marcel Hallgarten, Tianyu Li, Xinshuo Weng, Zhiyu Huang, Zetong Yang, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, et al. NAVSIM: Data-driven non-reactive autonomous vehicle simulation and benchmarking. In NeurIPS Datasets and Benchmarks, 2024.
[24] Dian Chen, Vladlen Koltun, and Philipp Krähenbühl. Learning to drive from a world on rails. In ICCV,
[25] Brian Tefft. Rates of motor vehicle crashes, injuries and deaths in relation to driver age, United States, 2014-2015. AAA Foundation for Traffic Safety, 2017.
[26] Han Lu, Xiaosong Jia, Yichen Xie, Wenlong Liao, Xiaokang Yang, and Junchi Yan. ActiveAD: Planning-oriented active learning for end-to-end autonomous driving. arXiv preprint arXiv:2403.02877, 2024.
[27] Bingyi Kang, Yang Yue, Rui Lu, Zhijie Lin, Yang Zhao, Kaixin Wang, Gao Huang, and Jiashi Feng. How far is video generation from world model: A physical law perspective. arXiv preprint arXiv:2411.02385,
[28] Yunsong Zhou, Michael Simon, Zhenghao Mark Peng, Sicheng Mo, Hongzi Zhu, Minyi Guo, and Bolei Zhou. SimGen: Simulator-conditioned driving scene generation. In NeurIPS, 2024.
[29] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. CARLA: An open urban driving simulator. In CoRL, 2017.
[30] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. CogVideoX: Text-to-video diffusion models with an expert transformer. In ICLR, 2025.
[31]

Driving into the Future: Multiview visual forecasting and planning with world model for autonomous driving

Yuqi Wang, Jiawei He, Lue Fan, Hongxin Li, Yuntao Chen, and Zhaoxiang Zhang. Driving into the Future: Multiview visual forecasting and planning with world model for autonomous driving. In CVPR,

work page
[32]

Enhancing end-to-end autonomous driving with latent world model

Yingyan Li, Lue Fan, Jiawei He, Yuqi Wang, Yuntao Chen, Zhaoxiang Zhang, and Tieniu Tan. Enhancing end-to-end autonomous driving with latent world model. In ICLR, 2025. 2, 8

work page 2025
[33]

Trans- Fuser: Imitation with transformer-based sensor fusion for autonomous driving

Kashyap Chitta, Aditya Prakash, Bernhard Jaeger, Zehao Yu, Katrin Renz, and Andreas Geiger. Trans- Fuser: Imitation with transformer-based sensor fusion for autonomous driving. IEEE TPAMI, 2023. 2, 8

work page 2023
[34]

Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yuxin Pan, Giancarlo Baldan, and Oscar Beijbom

Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yuxin Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. In CVPR, 2020. 3, 6, 18, 21, 22

work page 2020
[35]

Scalability in perception for autonomous driving: Waymo open dataset

Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In CVPR, 2020. 3, 6, 18, 20, 21, 22 11

work page 2020
[36]

Mastering atari with discrete world models

Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models. In ICLR, 2021. 3, 17

work page 2021
[37]

Mastering Diverse Domains through World Models

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023. 3, 17

work page internal anchor Pith review Pith/arXiv arXiv 2023
[38]

DriveDreamer: Towards real-world-driven world models for autonomous driving

Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, Jiagang Zhu, and Jiwen Lu. DriveDreamer: Towards real-world-driven world models for autonomous driving. In ECCV, 2024. 3, 4, 7, 17

[39] Xiaosong Jia, Zhenjie Yang, Qifeng Li, Zhiyuan Zhang, and Junchi Yan. Bench2Drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving. In NeurIPS Datasets and Benchmarks, 2024. 3, 18, 22

[40] Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. DriveLM: Driving with graph visual question answering. In ECCV, 2024. 3

[41] Jens Beißwenger. PDM-Lite: A rule-based planner for CARLA leaderboard 2.0. https://github.com/OpenDriveLab/DriveLM/blob/DriveLM-CARLA/pdm_lite/docs/report.pdf, 2024. 3, 18

[42] Hyeonho Jeong, Chun-Hao Paul Huang, Jong Chul Ye, Niloy Mitra, and Duygu Ceylan. Track4Gen: Teaching video diffusion models to track points improves video generation. In CVPR, 2025. 4

[43] Shijie Wang, Samaneh Azadi, Rohit Girdhar, Saketh Rambhatla, Chen Sun, and Xi Yin. MotiF: Making text count in image animation with motion focal loss. In CVPR, 2025. 4

[44] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable Video Diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023. 4, 17, 18, 21

[45] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In ICML, 2024. 4

[46] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020. 4

[47] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In NeurIPS, 2022. 4

[48] Willi Menapace, Aliaksandr Siarohin, Ivan Skorokhodov, Ekaterina Deyneka, Tsai-Shien Chen, Anil Kag, Yuwei Fang, Aleksei Stoliar, Elisa Ricci, Jian Ren, et al. Snap Video: Scaled spatiotemporal transformers for text-to-video synthesis. In CVPR, 2024. 4

[49] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In ICLR, 2022. 4

[50] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. TMLR, 2024. 5, 19, 22

[51] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021. 5

[52] CARLA autonomous driving leaderboard. https://leaderboard.carla.org/, 2022. 5, 19

[53] Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation. In NeurIPS, 2023. 5

[54] Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, et al. GR-2: A generative video-language-action model with web-scale knowledge for robot manipulation. arXiv preprint arXiv:2410.06158, 2024. 5

[55] Haoran Song, Wenchao Ding, Yuxuan Chen, Shaojie Shen, Michael Yu Wang, and Qifeng Chen. PiP: Planning-informed trajectory prediction for autonomous driving. In ECCV, 2020. 5

[56] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, 2017. 6

[57] Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018. 6

[58] Seung Wook Kim, Jonah Philion, Antonio Torralba, and Sanja Fidler. DriveGAN: Towards a controllable high-quality neural simulation. In CVPR, 2021. 7, 17

[59] Jiachen Lu, Ze Huang, Jiahui Zhang, Zeyu Yang, and Li Zhang. WoVoGen: World volume-aware diffusion for controllable multi-camera driving scene generation. In ECCV, 2024. 7

[60] Lei Lai, Zhongkai Shangguan, Jimuyang Zhang, and Eshed Ohn-Bar. XVO: Generalized visual odometry via cross-modal self-training. In ICCV, 2023. 8, 9, 20, 22

[61] Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, et al. Planning-oriented autonomous driving. In CVPR, 2023. 8, 20, 21

[62] Yuntao Chen, Yuqi Wang, and Zhaoxiang Zhang. DrivingGPT: Unifying driving world modeling and planning with multi-modal autoregressive transformers. arXiv preprint arXiv:2412.18607, 2024. 8

[63] Jialong Wu, Haoyu Ma, Chaoyi Deng, and Mingsheng Long. Pre-training contextualized world models with in-the-wild videos for reinforcement learning. In NeurIPS, 2023. 17

[64] Frederik Ebert, Chelsea Finn, Sudeep Dasari, Annie Xie, Alex Lee, and Sergey Levine. Visual Foresight: Model-based deep reinforcement learning for vision-based robotic control. arXiv preprint arXiv:1812.00568, 2018. 17

[65] Chelsea Finn and Sergey Levine. Deep visual foresight for planning robot motion. In ICRA, 2017. 17

[66] Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. In ICML, 2019. 17

[67] Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. In ICML, 2024. 17

[68] Russell Mendonca, Shikhar Bahl, and Deepak Pathak. Structured world models from human videos. In RSS, 2023. 17

[69] Yilun Du, Sherry Yang, Pete Florence, Fei Xia, Ayzaan Wahid, brian ichter, Pierre Sermanet, Tianhe Yu, Pieter Abbeel, Joshua B. Tenenbaum, Leslie Pack Kaelbling, Andy Zeng, and Jonathan Tompson. Video language planning. In ICLR, 2024. 17

[70] Yutao Zhu, Xiaosong Jia, Xinyu Yang, and Junchi Yan. FlatFusion: Delving into details of sparse transformer-based camera-lidar fusion for autonomous driving. arXiv preprint arXiv:2408.06832, 2024. 17

[71] Cunxin Fan, Xiaosong Jia, Yihang Sun, Yixiao Wang, Jianglan Wei, Ziyang Gong, Xiangyu Zhao, Masayoshi Tomizuka, Xue Yang, Junchi Yan, et al. Interleave-VLA: Enhancing robot manipulation with interleaved image-text instructions. arXiv preprint arXiv:2505.02152, 2025. 17

[72] Dian Chen and Philipp Krähenbühl. Learning from all vehicles. In CVPR, 2022. 17

[73] Henry X Liu and Shuo Feng. Curse of rarity for autonomous vehicles. Nature Communications, 2024. 17

[74] Penghao Wu, Xiaosong Jia, Li Chen, Junchi Yan, Hongyang Li, and Yu Qiao. Trajectory-guided control prediction for end-to-end autonomous driving: A simple yet strong baseline. In NeurIPS, 2022. 17

[75] Xiaosong Jia, Penghao Wu, Li Chen, Jiangwei Xie, Conghui He, Junchi Yan, and Hongyang Li. Think Twice before Driving: Towards scalable decoders for end-to-end autonomous driving. In CVPR, 2023. 17

[76] Xiaosong Jia, Yulu Gao, Li Chen, Junchi Yan, Patrick Langechuan Liu, and Hongyang Li. DriveAdapter: Breaking the coupling barrier of perception and planning in end-to-end autonomous driving. In ICCV,

[77] Xiaosong Jia, Junqi You, Zhiyuan Zhang, and Junchi Yan. DriveTransformer: Unified transformer for scalable end-to-end autonomous driving. In ICLR, 2025. 17

[78] Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. JAIR, 2013. 17

[79] Saran Tunyasuvunakool, Alistair Muldal, Yotam Doron, Siqi Liu, Steven Bohez, Josh Merel, Tom Erez, Timothy Lillicrap, Nicolas Heess, and Yuval Tassa. Dm_control: Software and tasks for continuous control. Software Impacts, 2020. 17

[80] Michał Kempka, Marek Wydmuch, Grzegorz Runc, Jakub Toczek, and Wojciech Jaśkowski. ViZDoom: A doom-based ai research platform for visual reinforcement learning. In CIG, 2016. 17


[32] [32]

Enhancing end-to-end autonomous driving with latent world model

Yingyan Li, Lue Fan, Jiawei He, Yuqi Wang, Yuntao Chen, Zhaoxiang Zhang, and Tieniu Tan. Enhancing end-to-end autonomous driving with latent world model. In ICLR, 2025. 2, 8

work page 2025

[33] [33]

Trans- Fuser: Imitation with transformer-based sensor fusion for autonomous driving

Kashyap Chitta, Aditya Prakash, Bernhard Jaeger, Zehao Yu, Katrin Renz, and Andreas Geiger. Trans- Fuser: Imitation with transformer-based sensor fusion for autonomous driving. IEEE TPAMI, 2023. 2, 8

work page 2023

[34] [34]

Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yuxin Pan, Giancarlo Baldan, and Oscar Beijbom

Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yuxin Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. In CVPR, 2020. 3, 6, 18, 21, 22

work page 2020

[35] [35]

Scalability in perception for autonomous driving: Waymo open dataset

Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In CVPR, 2020. 3, 6, 18, 20, 21, 22 11

work page 2020

[36] [36]

Mastering atari with discrete world models

Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models. In ICLR, 2021. 3, 17

work page 2021

[37] [37]

Mastering Diverse Domains through World Models

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023. 3, 17

work page internal anchor Pith review Pith/arXiv arXiv 2023

[38] [38]

DriveDreamer: Towards real-world-driven world models for autonomous driving

Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, Jiagang Zhu, and Jiwen Lu. DriveDreamer: Towards real-world-driven world models for autonomous driving. In ECCV, 2024. 3, 4, 7, 17

work page 2024

[39] [39]

Bench2Drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving

Xiaosong Jia, Zhenjie Yang, Qifeng Li, Zhiyuan Zhang, and Junchi Yan. Bench2Drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving. In NeurIPS Datasets and Benchmarks, 2024. 3, 18, 22

work page 2024

[40] [40]

DriveLM: Driving with graph visual question answering

Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. DriveLM: Driving with graph visual question answering. In ECCV, 2024. 3

work page 2024

[41] [41]

PDM-Lite: A rule-based planner for carla leaderboard 2.0

Jens Beißwenger. PDM-Lite: A rule-based planner for carla leaderboard 2.0. https://github.com/ OpenDriveLab/DriveLM/blob/DriveLM-CARLA/pdm_lite/docs/report.pdf, 2024. 3, 18

work page 2024

[42] [42]

Track4Gen: Teaching video diffusion models to track points improves video generation

Hyeonho Jeong, Chun-Hao Paul Huang, Jong Chul Ye, Niloy Mitra, and Duygu Ceylan. Track4Gen: Teaching video diffusion models to track points improves video generation. In CVPR, 2025. 4

work page 2025

[43] [43]

MotiF: Making text count in image animation with motion focal loss

Shijie Wang, Samaneh Azadi, Rohit Girdhar, Saketh Rambhatla, Chen Sun, and Xi Yin. MotiF: Making text count in image animation with motion focal loss. In CVPR, 2025. 4

work page 2025

[44] [44]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable Video Diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023. 4, 17, 18, 21

work page internal anchor Pith review Pith/arXiv arXiv 2023

[45] [45]

Scaling rectified flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In ICML, 2024. 4

work page 2024

[46] [46]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020. 4

work page 2020

[47] [47]

Elucidating the design space of diffusion-based generative models

Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In NeurIPS, 2022. 4

work page 2022

[48] [48]

Snap Video: Scaled spatiotemporal transformers for text-to-video synthesis

Willi Menapace, Aliaksandr Siarohin, Ivan Skorokhodov, Ekaterina Deyneka, Tsai-Shien Chen, Anil Kag, Yuwei Fang, Aleksei Stoliar, Elisa Ricci, Jian Ren, et al. Snap Video: Scaled spatiotemporal transformers for text-to-video synthesis. In CVPR, 2024. 4

work page 2024

[49] [49]

LoRA: Low-rank adaptation of large language models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In ICLR, 2022. 4

work page 2022

[50] [50]

DINOv2: Learning robust visual features without supervision

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. TMLR, 2024. 5, 19, 22

work page 2024

[51] [51]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021. 5

work page 2021

[52] [52]

https://leaderboard.carla.org/, 2022

CARLA autonomous driving leaderboard. https://leaderboard.carla.org/, 2022. 5, 19

work page 2022

[53] [53]

Learning universal policies via text-guided video generation

Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation. In NeurIPS, 2023. 5

work page 2023

[54] [54]

GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, et al. GR-2: A generative video-language-action model with web-scale knowledge for robot manipulation. arXiv preprint arXiv:2410.06158, 2024. 5

work page internal anchor Pith review Pith/arXiv arXiv 2024

[55] [55]

PiP: Planning-informed trajectory prediction for autonomous driving

Haoran Song, Wenchao Ding, Yuxuan Chen, Shaojie Shen, Michael Yu Wang, and Qifeng Chen. PiP: Planning-informed trajectory prediction for autonomous driving. In ECCV, 2020. 5

work page 2020

[56] [56]

Gans trained by a two time-scale update rule converge to a local nash equilibrium

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NeurIPS, 2017. 6 12

work page 2017

[57] [57]

Towards Accurate Generative Models of Video: A New Metric & Challenges

Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards Accurate Generative Models of Videos: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018. 6

work page internal anchor Pith review Pith/arXiv arXiv 2018

[58] [58]

DriveGAN: Towards a controllable high-quality neural simulation

Seung Wook Kim, Jonah Philion, Antonio Torralba, and Sanja Fidler. DriveGAN: Towards a controllable high-quality neural simulation. In CVPR, 2021. 7, 17

work page 2021

[59] [59]

WoV oGen: World volume-aware diffusion for controllable multi-camera driving scene generation

Jiachen Lu, Ze Huang, Jiahui Zhang, Zeyu Yang, and Li Zhang. WoV oGen: World volume-aware diffusion for controllable multi-camera driving scene generation. In ECCV, 2024. 7

work page 2024

[60] [60]

XVO: Generalized visual odometry via cross-modal self-training

Lei Lai, Zhongkai Shangguan, Jimuyang Zhang, and Eshed Ohn-Bar. XVO: Generalized visual odometry via cross-modal self-training. In ICCV, 2023. 8, 9, 20, 22

work page 2023

[61] [61]

Planning-oriented autonomous driving

Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, et al. Planning-oriented autonomous driving. In CVPR, 2023. 8, 20, 21

work page 2023

[62] [62]

Driving- gpt: Unifying driving world modeling and planning with multi-modal autoregressive transformers

Yuntao Chen, Yuqi Wang, and Zhaoxiang Zhang. DrivingGPT: Unifying driving world modeling and planning with multi-modal autoregressive transformers. arXiv preprint arXiv:2412.18607, 2024. 8

work page arXiv 2024

[63] [63]

Pre-training contextualized world models with in-the-wild videos for reinforcement learning

Jialong Wu, Haoyu Ma, Chaoyi Deng, and Mingsheng Long. Pre-training contextualized world models with in-the-wild videos for reinforcement learning. In NeurIPS, 2023. 17

work page 2023

[64] [64]

Visual Foresight: Model-Based Deep Reinforcement Learning for Vision-Based Robotic Control

Frederik Ebert, Chelsea Finn, Sudeep Dasari, Annie Xie, Alex Lee, and Sergey Levine. Visual Fore- sight: Model-based deep reinforcement learning for vision-based robotic control. arXiv preprint arXiv:1812.00568, 2018. 17

work page internal anchor Pith review Pith/arXiv arXiv 2018

[65] [65]

Deep visual foresight for planning robot motion

Chelsea Finn and Sergey Levine. Deep visual foresight for planning robot motion. In ICRA, 2017. 17

work page 2017

[66] [66]

Learning latent dynamics for planning from pixels

Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. In ICML, 2019. 17

work page 2019

[67] [67]

Genie: Generative interactive environments

Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. In ICML, 2024. 17

work page 2024

[68] [68]

Structured world models from human videos

Russell Mendonca, Shikhar Bahl, and Deepak Pathak. Structured world models from human videos. In RSS, 2023. 17

work page 2023

[69] [69]

Tenenbaum, Leslie Pack Kaelbling, Andy Zeng, and Jonathan Tompson

Yilun Du, Sherry Yang, Pete Florence, Fei Xia, Ayzaan Wahid, brian ichter, Pierre Sermanet, Tianhe Yu, Pieter Abbeel, Joshua B. Tenenbaum, Leslie Pack Kaelbling, Andy Zeng, and Jonathan Tompson. Video language planning. In ICLR, 2024. 17

work page 2024

[70] [70]

Flatfusion: Delving into details of sparse transformer-based camera-lidar fusion for autonomous driving.arXiv preprint arXiv:2408.06832, 2024

Yutao Zhu, Xiaosong Jia, Xinyu Yang, and Junchi Yan. FlatFusion: Delving into details of sparse transformer-based camera-lidar fusion for autonomous driving. arXiv preprint arXiv:2408.06832, 2024. 17

work page arXiv 2024

[71] [71]

Interleave-VLA: Enhancing robot manipulation with interleaved image-text instructions

Cunxin Fan, Xiaosong Jia, Yihang Sun, Yixiao Wang, Jianglan Wei, Ziyang Gong, Xiangyu Zhao, Masayoshi Tomizuka, Xue Yang, Junchi Yan, et al. Interleave-VLA: Enhancing robot manipulation with interleaved image-text instructions. arXiv preprint arXiv:2505.02152, 2025. 17

work page arXiv 2025

[72] [72]

Learning from all vehicles

Dian Chen and Philipp Krähenbühl. Learning from all vehicles. In CVPR, 2022. 17

work page 2022

[73] [73]

Curse of rarity for autonomous vehicles

Henry X Liu and Shuo Feng. Curse of rarity for autonomous vehicles. Nature Communications, 2024. 17

work page 2024

[74] [74]

Trajectory-guided control prediction for end-to-end autonomous driving: A simple yet strong baseline

Penghao Wu, Xiaosong Jia, Li Chen, Junchi Yan, Hongyang Li, and Yu Qiao. Trajectory-guided control prediction for end-to-end autonomous driving: A simple yet strong baseline. In NeurIPS, 2022. 17

work page 2022

[75] [75]

Think Twice before Driving: Towards scalable decoders for end-to-end autonomous driving

Xiaosong Jia, Penghao Wu, Li Chen, Jiangwei Xie, Conghui He, Junchi Yan, and Hongyang Li. Think Twice before Driving: Towards scalable decoders for end-to-end autonomous driving. In CVPR, 2023. 17

work page 2023

[76] [76]

DriveAdapter: Breaking the coupling barrier of perception and planning in end-to-end autonomous driving

Xiaosong Jia, Yulu Gao, Li Chen, Junchi Yan, Patrick Langechuan Liu, and Hongyang Li. DriveAdapter: Breaking the coupling barrier of perception and planning in end-to-end autonomous driving. In ICCV,

work page

[77] [77]

DriveTransformer: Unified transformer for scalable end-to-end autonomous driving

Xiaosong Jia, Junqi You, Zhiyuan Zhang, and Junchi Yan. DriveTransformer: Unified transformer for scalable end-to-end autonomous driving. In ICLR, 2025. 17

work page 2025

[78] [78]

The arcade learning environment: An evaluation platform for general agents

Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. JAIR, 2013. 17

work page 2013

[79] [79]

dm_control: Software and tasks for continuous control

Saran Tunyasuvunakool, Alistair Muldal, Yotam Doron, Siqi Liu, Steven Bohez, Josh Merel, Tom Erez, Timothy Lillicrap, Nicolas Heess, and Yuval Tassa. dm_control: Software and tasks for continuous control. Software Impacts, 2020. 17

work page 2020

[80] [80]

ViZDoom: A doom-based ai research platform for visual reinforcement learning

Michał Kempka, Marek Wydmuch, Grzegorz Runc, Jakub Toczek, and Wojciech Jaśkowski. ViZDoom: A doom-based ai research platform for visual reinforcement learning. In CIG, 2016. 17

work page 2016