arxiv: 2506.09981 · v2 · submitted 2025-06-11 · 💻 cs.CV · cs.RO

ReSim: Reliable World Simulation for Autonomous Driving

Pith reviewed 2026-05-19 09:24 UTC · model grok-4.3

classification 💻 cs.CV cs.RO
keywords autonomous driving, world model, driving simulation, diffusion transformer, policy evaluation, video generation, reward estimation

The pith

ReSim simulates open-world driving scenarios under hazardous non-expert actions by training on mixed real and simulator data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Driving world models built only on safe expert trajectories cannot follow rare hazardous behaviors, which restricts their use for testing new policies or planning in risky situations. The paper mixes real-world human demonstrations with diverse non-expert trajectories collected from a simulator such as CARLA to form a heterogeneous training set. A diffusion transformer is equipped with new strategies for integrating conditioning signals, producing the ReSim model that generates future scenes with higher fidelity and better control over both expert and non-expert actions. An added Video2Reward module extracts reward signals from the simulated futures to support downstream tasks. A sympathetic reader cares because this approach could let developers evaluate autonomous driving systems across a much wider range of behaviors without real-world danger.

Core claim

The authors claim that enriching real-world driving data with simulator-collected non-expert trajectories and training a controllable diffusion transformer world model on the combined corpus produces reliable simulations of diverse open-world scenarios under various ego actions, including hazardous ones. Strategies are introduced to integrate conditioning signals effectively for improved controllability and visual fidelity. The Video2Reward module then derives reward estimates from ReSim outputs to enable planning and policy selection.

What carries the argument

A diffusion transformer world model trained on a heterogeneous corpus of real expert trajectories and simulator non-expert data, with added conditioning integration strategies and a Video2Reward module that estimates rewards from simulated video futures.
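
The pipeline described above (propose an action, simulate its future, score the simulated video) can be sketched as a small selection loop. This is an illustration only, not the authors' implementation: `select_action`, `toy_world_model`, and `toy_reward_model` are hypothetical names, and the stubs stand in for ReSim and Video2Reward, whose real interfaces are not given on this page.

```python
from typing import Callable, List, Sequence, Tuple


def select_action(
    history: object,
    candidates: Sequence[str],
    world_model: Callable[[object, str], List[float]],
    reward_model: Callable[[List[float]], float],
) -> Tuple[str, float]:
    """Simulate each candidate action and return the one with the highest reward."""
    if not candidates:
        raise ValueError("need at least one candidate action")
    # One simulated future per candidate, scored by the reward model.
    rewards = [reward_model(world_model(history, action)) for action in candidates]
    # max() over indices keeps the first candidate on ties.
    best_index = max(range(len(candidates)), key=lambda i: rewards[i])
    return candidates[best_index], rewards[best_index]


# Hypothetical stand-ins: a "video" is reduced to a list of lateral offsets (m).
def toy_world_model(history: object, action: str) -> List[float]:
    futures = {"keep_lane": [0.0, 0.1], "swerve": [0.0, 2.5]}
    return futures[action]


def toy_reward_model(video: List[float]) -> float:
    # Penalise the largest lateral excursion; a real reward model would score frames.
    return -max(abs(offset) for offset in video)


best_action, best_reward = select_action(
    None, ["keep_lane", "swerve"], toy_world_model, toy_reward_model
)
```

Passing the two models as callables keeps the loop independent of any particular simulator or reward head, which is the separation the summary attributes to ReSim and Video2Reward.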

If this is right

  • ReSim achieves up to 44% higher visual fidelity than prior models.
  • Controllability improves by over 50% for both expert and non-expert actions.
  • Planning performance on NAVSIM rises by 2% and policy selection by 25%.
  • Simulated futures now support reward-based judgment of diverse driving actions including hazardous ones.
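
This page does not state how the percentages above are computed. One common convention is relative change against a baseline, sketched below under that assumption; `relative_gain` is a hypothetical helper and the numbers in the usage lines are placeholders, not values from the paper.

```python
def relative_gain(baseline: float, new: float, lower_is_better: bool) -> float:
    """Relative improvement of `new` over `baseline`, in percent.

    Positive means `new` is better. For error-style metrics (lower is better)
    a drop counts as a gain; for score-style metrics a rise does.
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero for a relative change")
    change = (baseline - new) if lower_is_better else (new - baseline)
    return 100.0 * change / abs(baseline)


# Placeholder values chosen only to exercise both branches.
error_gain = relative_gain(200.0, 150.0, lower_is_better=True)
score_gain = relative_gain(80.0, 100.0, lower_is_better=False)
```

A relative gain is not the same as a gain in percentage points, so a figure such as the 2% planning improvement reads differently depending on which convention the paper uses.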

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This mixed-data training strategy could apply to other robotics domains where safe expert data is common but risky exploration is needed.
  • The Video2Reward module might transfer to reward estimation in other video prediction systems for decision making.
  • If domain gaps remain small, the method could speed up testing of rare-event robustness in autonomous systems without new real-world collection.

Load-bearing premise

Mixing simulator trajectories with real-world demonstrations does not create unmodeled domain gaps that degrade performance on real hazardous scenarios.

What would settle it

Apply ReSim to a recorded real-world hazardous maneuver such as sudden swerving and measure whether the generated future frames match actual vehicle dynamics and scene elements in held-out footage.
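
One way to quantify such a match is a displacement error between the recorded ego trajectory and a trajectory recovered from the generated frames. The sketch below assumes both are already available as equal-length lists of (x, y) waypoints in metres; how the trajectory is recovered from video is outside its scope, and `average_displacement_error` and `final_displacement_error` are illustrative names rather than the paper's metrics.

```python
import math
from typing import Sequence, Tuple

Waypoint = Tuple[float, float]


def _check(pred: Sequence[Waypoint], ref: Sequence[Waypoint]) -> None:
    if len(pred) != len(ref) or len(pred) == 0:
        raise ValueError("trajectories must be non-empty and of equal length")


def average_displacement_error(pred: Sequence[Waypoint], ref: Sequence[Waypoint]) -> float:
    """Mean Euclidean distance between matching waypoints."""
    _check(pred, ref)
    total = sum(math.hypot(px - rx, py - ry) for (px, py), (rx, ry) in zip(pred, ref))
    return total / len(pred)


def final_displacement_error(pred: Sequence[Waypoint], ref: Sequence[Waypoint]) -> float:
    """Euclidean distance between the last waypoints only."""
    _check(pred, ref)
    (px, py), (rx, ry) = pred[-1], ref[-1]
    return math.hypot(px - rx, py - ry)


# Placeholder trajectories, not measurements.
recorded = [(0.0, 0.0), (0.0, 0.0)]
recovered = [(0.0, 0.0), (3.0, 4.0)]
ade = average_displacement_error(recovered, recorded)
fde = final_displacement_error(recovered, recorded)
```

A low error on held-out hazardous maneuvers would support the load-bearing premise; a high error concentrated in the swerve segment would point to the domain gap the premise worries about.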

Figures

Figures reproduced from arXiv: 2506.09981 by Andreas Geiger, Hongyang Li, Jiazhi Yang, Kashyap Chitta, Li Chen, Long Chen, Shenyuan Gao, Xiangyu Yue, Xiaosong Jia, Yuqian Shao.

Figure 1: Overview of ReSim. (a) Heterogeneous driving data includes (i,ii) experts’ safe driving logs, and (iii) potentially dangerous (non-expert) driving behaviors from simulations. (b) Prior driving world models are trained on expert data solely, leading to consistently safe yet inaccurate imaginations; in ReSim, we leverage all sources of data to simulate reliable and realistic futures, and build a robust rewar…
Figure 2: Video2Reward model (V2R). Top: V2R is supervised by infraction score of both safe and hazardous data from simulation, deriving the reward from a driving video. Bottom: In real-world inference, the predicted video of ReSim in reaction to a proposed action is fed into V2R to estimate the action’s reward. In detail, our Video2Reward model (V2R) is established on a frozen DINOv2 backbone [50] with an additio…
Figure 3: Video prediction-based policy. ReSim conditions on the history context (left) to synthesize a plausible visual plan (middle), which is then translated into an ego trajectory via an IDM (right).
Figure 4: Human evaluation of non-expert action controllability. ReSim gets the most votes in both realism and trajectory following.
Figure 5: Qualitative comparisons of non-expert action controllability. ReSim reliably simulates hazardous outcomes from the non-expert action, while other methods either fail to follow the specified trajectory or compromise the scenario’s consistency. ⋆: without simulated data in training.
Figure 6: Zero-shot action controllability. ReSim can reliably follow both expert and non-expert actions in various scenarios from zero-shot datasets. ReSim yields significantly better results in a zero-shot manner compared to in-distribution models. We also provide qualitative comparisons for long-term future prediction in Appendix Sec. C, where…
Figure 7: Reward correlation. Our method of composing ReSim and Video2Reward model yields more accurate rewards compared to baselines in both datasets.
Figure 8: Closed-loop visual simulation example. A policy with front view only runs within the imaginary world generated by ReSim. The policy is adapted from XVO [60].
Figure 9: Effect of unbalanced noise sampling. Training with unbalanced noise sampling yields improved motion and scenario consistency.
Original abstract

How can we reliably simulate future driving scenarios under a wide range of ego driving behaviors? Recent driving world models, developed exclusively on real-world driving data composed mainly of safe expert trajectories, struggle to follow hazardous or non-expert behaviors, which are rare in such data. This limitation restricts their applicability to tasks such as policy evaluation. In this work, we address this challenge by enriching real-world human demonstrations with diverse non-expert data collected from a driving simulator (e.g., CARLA), and building a controllable world model trained on this heterogeneous corpus. Starting with a video generator featuring a diffusion transformer architecture, we devise several strategies to effectively integrate conditioning signals and improve prediction controllability and fidelity. The resulting model, ReSim, enables Reliable Simulation of diverse open-world driving scenarios under various actions, including hazardous non-expert ones. To close the gap between high-fidelity simulation and applications that require reward signals to judge different actions, we introduce a Video2Reward module that estimates a reward from ReSim's simulated future. Our ReSim paradigm achieves up to 44% higher visual fidelity, improves controllability for both expert and non-expert actions by over 50%, and boosts planning and policy selection performance on NAVSIM by 2% and 25%, respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes ReSim, a controllable diffusion-transformer world model for simulating open-world driving scenarios under expert and non-expert (including hazardous) actions. It enriches real-world human demonstrations with non-expert trajectories collected in CARLA, introduces conditioning integration strategies to improve fidelity and controllability, and adds a Video2Reward module that derives reward signals from the simulated futures. Quantitative claims include up to 44% higher visual fidelity, over 50% better controllability for both action types, and 2%/25% gains in planning and policy selection on NAVSIM.

Significance. If the central claims hold after addressing domain-gap concerns, the work would be significant for autonomous-driving world models: it directly tackles the rarity of hazardous trajectories in real data and supplies a practical bridge from simulation to reward-based policy evaluation. The Video2Reward component is a concrete contribution that could be reused beyond this architecture.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Experiments): the reported 44% visual-fidelity and >50% controllability gains are presented without explicit evaluation protocols, baseline definitions, statistical significance tests, or precise definitions of “non-expert” actions; these omissions make the numbers impossible to interpret or reproduce from the given information.
  2. [§3.2] §3.2 (Data Construction): mixing CARLA non-expert trajectories with real-world data is load-bearing for the generalization claim, yet no domain-adaptation layers, cycle-consistency losses, or real-only ablation studies are described; without them the model may exploit simulator-specific artifacts, undermining transfer to real hazardous scenarios and the downstream NAVSIM reward estimates.
minor comments (2)
  1. [Figure 3 and §4.3] Figure 3 and §4.3: axis labels and legend entries for the controllability metrics are difficult to read; enlarge fonts and add a table of exact numerical values.
  2. [§2] §2 (Related Work): the discussion of prior driving world models omits recent diffusion-based video generators that also condition on actions; add these references for completeness.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have carefully reviewed the major comments and provide point-by-point responses below, along with our plans for revisions to improve clarity, reproducibility, and robustness.

Point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): the reported 44% visual-fidelity and >50% controllability gains are presented without explicit evaluation protocols, baseline definitions, statistical significance tests, or precise definitions of “non-expert” actions; these omissions make the numbers impossible to interpret or reproduce from the given information.

    Authors: We appreciate the referee's emphasis on reproducibility. Evaluation protocols are described in §4, including visual fidelity metrics (FVD, FID) and controllability measures (action prediction accuracy and trajectory deviation). Baselines include prior world models such as DriveDreamer and Vista, with non-expert actions defined as CARLA trajectories exhibiting high deviation from expert human demonstrations (e.g., via steering/throttle variance thresholds). To address the concern directly, we will add a dedicated evaluation protocol subsection, report statistical significance via paired t-tests with p-values, and provide explicit formulas for the percentage gains in the revised version. revision: yes

  2. Referee: [§3.2] §3.2 (Data Construction): mixing CARLA non-expert trajectories with real-world data is load-bearing for the generalization claim, yet no domain-adaptation layers, cycle-consistency losses, or real-only ablation studies are described; without them the model may exploit simulator-specific artifacts, undermining transfer to real hazardous scenarios and the downstream NAVSIM reward estimates.

    Authors: We acknowledge the importance of addressing potential domain gaps. Our conditioning integration strategies (§3.1) and heterogeneous training enable the model to generalize without dedicated adaptation layers or cycle-consistency losses, as supported by strong real-world NAVSIM results. However, to strengthen the claim, we will add a real-only ablation study in the revised experiments section. We disagree that explicit domain-adaptation is required here, as the diffusion-transformer architecture and data preprocessing sufficiently mitigate simulator artifacts for the reported controllability and planning gains. revision: partial
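
The paired t-test promised in response 1 compares two models on the same evaluation clips, so the statistic is computed on per-clip differences. The sketch below uses only the Python standard library and returns the t statistic and degrees of freedom; `paired_t_statistic` is a hypothetical helper, the scores are placeholders rather than results, and a p-value would additionally need a t-distribution CDF (for example from `scipy.stats.ttest_rel`).

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Paired t statistic and degrees of freedom for matched samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two matched samples with at least two pairs")
    diffs = [x - y for x, y in zip(a, b)]
    spread = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if spread == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = mean(diffs) / (spread / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Placeholder per-clip scores for two models on the same four clips.
model_a = [1.0, 2.0, 3.0, 4.0]
model_b = [0.0, 0.0, 0.0, 0.0]
t_value, dof = paired_t_statistic(model_a, model_b)
```

Pairing matters here because clip difficulty varies widely in driving data; an unpaired test would fold that variation into the noise term and understate a consistent per-clip gain.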

Circularity Check

0 steps flagged

No circularity in empirical training of mixed-data world model

Full rationale

The paper presents an empirical ML approach: a diffusion transformer is trained on a mixed corpus of real expert trajectories and CARLA non-expert data, with additional conditioning strategies and a separate Video2Reward module. Reported gains (fidelity, controllability, NAVSIM planning) are measured outcomes on held-out sets rather than quantities that reduce by construction to the training inputs or fitted parameters. No self-definitional equations, fitted-input predictions, or load-bearing self-citation chains appear in the derivation; the central claims remain falsifiable through external benchmarks and do not rely on renaming or smuggling prior ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claims rest on the assumption that simulator data can be seamlessly integrated with real data and that the Video2Reward module produces scores that correlate with downstream planning utility; no explicit free parameters or invented physical entities are mentioned.

pith-pipeline@v0.9.0 · 5783 in / 1266 out tokens · 29251 ms · 2026-05-19T09:24:39.497119+00:00 · methodology



Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. From Articulated Kinematics to Routed Visual Control for Action-Conditioned Surgical Video Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    A kinematic-to-visual lifting paradigm combined with hierarchically routed control generates action-conditioned surgical videos with better faithfulness, fidelity, and efficiency.

  2. Learning Vision-Language-Action World Models for Autonomous Driving

    cs.CV 2026-04 unverdicted novelty 7.0

    VLA-World improves autonomous driving by using action-guided future image generation followed by reflective reasoning over the imagined scene to refine trajectories.

  3. CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 6.0

    CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and t...

  4. CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 6.0

    CoWorld-VLA encodes world information into four expert tokens that condition a diffusion-based planner, yielding competitive collision avoidance and trajectory accuracy on the NAVSIM benchmark.

  5. DriveFuture: Future-Aware Latent World Models for Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 6.0

    DriveFuture achieves SOTA results on NAVSIM by conditioning latent world model states on future predictions to directly inform trajectory planning.

  6. Sim2Real-AD: A Modular Sim-to-Real Framework for Deploying VLM-Guided Reinforcement Learning in Real-World Autonomous Driving

    cs.RO 2026-04 unverdicted novelty 6.0

    Sim2Real-AD enables zero-shot transfer of CARLA-trained VLM-guided RL policies to full-scale vehicles, reporting 75-90% success rates in car-following, obstacle avoidance, and stop-sign scenarios without real-world RL...

  7. ExploreVLA: Dense World Modeling and Exploration for End-to-End Autonomous Driving

    cs.CV 2026-04 unverdicted novelty 6.0

    ExploreVLA augments VLA driving models with future RGB and depth prediction for dense supervision and uses prediction uncertainty as a safety-gated intrinsic reward for RL-based exploration, reaching SOTA PDMS 93.7 on NAVSIM.

  8. DriveLaW: Unifying Planning and Video Generation in a Latent Driving World

    cs.CV 2025-12 unverdicted novelty 6.0

    DriveLaW unifies video world modeling and trajectory planning by injecting video-generator latents into a diffusion planner, achieving SOTA video prediction and a new record on the NAVSIM planning benchmark.

Reference graph

Works this paper leans on

127 extracted references · 127 canonical work pages · cited by 7 Pith papers · 14 internal anchors

  1. [1]

    A path towards autonomous machine intelligence

    Yann LeCun. A path towards autonomous machine intelligence. Open Review, 62, 2022. 1, 2, 4, 17

  2. [2]

    Recurrent world models facilitate policy evolution

    David Ha and Jürgen Schmidhuber. Recurrent world models facilitate policy evolution. In NeurIPS, 2018. 1, 3, 17, 18, 21

  3. [3]

    Video as the new language for real-world decision making

    Sherry Yang, Jacob Walker, Jack Parker-Holder, Yilun Du, Jake Bruce, Andre Barreto, Pieter Abbeel, and Dale Schuurmans. Video as the new language for real-world decision making. In ICML, 2024. 1, 5, 17

  4. [4]

    Self-supervised learning from images with a joint-embedding predictive architecture

    Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. In CVPR, 2023. 1

  5. [5]

    Policy pre-training for autonomous driving via self-supervised geometric modeling

    Penghao Wu, Li Chen, Hongyang Li, Xiaosong Jia, Junchi Yan, and Yu Qiao. Policy pre-training for autonomous driving via self-supervised geometric modeling. In ICLR, 2023. 1

  6. [6]

    Temporal difference learning for model predictive control

    Nicklas Hansen, Xiaolong Wang, and Hao Su. Temporal difference learning for model predictive control. In ICML, 2022. 1

  7. [7]

    DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning

    Gaoyue Zhou, Hengkai Pan, Yann LeCun, and Lerrel Pinto. DINO-WM: World models on pre-trained visual features enable zero-shot planning. arXiv preprint arXiv:2411.04983, 2024. 1, 17

  8. [8]

    Navigation world models

    Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, and Yann LeCun. Navigation world models. In CVPR, 2025. 1, 2, 17

  9. [9]

    Diffusion for world modeling: Visual details matter in atari

    Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in atari. In NeurIPS, 2024. 1, 17

  10. [10]

    Dream to Control: Learning behaviors by latent imagination

    Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to Control: Learning behaviors by latent imagination. In ICLR, 2020. 1, 2, 3, 4, 17, 21

  11. [11]

    Diffusion models are real-time game engines

    Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter. Diffusion models are real-time game engines. In ICLR, 2024. 1, 17, 19

  12. [12]

    Pathdreamer: A world model for indoor navigation

    Jing Yu Koh, Honglak Lee, Yinfei Yang, Jason Baldridge, and Peter Anderson. Pathdreamer: A world model for indoor navigation. In ICCV, 2021. 1, 17

  13. [13]

    Learning interactive real-world simulators

    Mengjiao Yang, Yilun Du, Kamyar Ghasemipour, Jonathan Tompson, Dale Schuurmans, and Pieter Abbeel. Learning interactive real-world simulators. In ICLR, 2024. 1, 17, 18

  14. [14]

    Generalized predictive model for autonomous driving

    Jiazhi Yang, Shenyuan Gao, Yihang Qiu, Li Chen, Tianyu Li, Bo Dai, Kashyap Chitta, Penghao Wu, Jia Zeng, Ping Luo, et al. Generalized predictive model for autonomous driving. In CVPR, 2024. 1, 3, 4, 7, 17, 18, 20, 21, 22

  15. [15]

    GAIA-1: A Generative World Model for Autonomous Driving

    Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. GAIA-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080, 2023. 1, 3, 4, 17

  16. [16]

    Vista: A generalizable driving world model with high fidelity and versatile controllability

    Shenyuan Gao, Jiazhi Yang, Li Chen, Kashyap Chitta, Yihang Qiu, Andreas Geiger, Jun Zhang, and Hongyang Li. Vista: A generalizable driving world model with high fidelity and versatile controllability. In NeurIPS, 2024. 1, 2, 3, 4, 6, 7, 8, 17, 19, 20, 21, 22

  17. [17]

    DriveDreamer-2: LLM-enhanced world models for diverse driving video generation

    Guosheng Zhao, Xiaofeng Wang, Zheng Zhu, Xinze Chen, Guan Huang, Xiaoyi Bao, and Xingang Wang. DriveDreamer-2: LLM-enhanced world models for diverse driving video generation. arXiv preprint arXiv:2403.06845, 2024. 1, 7, 17

  18. [18]

    GEM: A generalizable ego-vision multimodal world model for fine-grained ego-motion, object dynamics, and scene composition control

    Mariam Hassan, Sebastian Stapf, Ahmad Rahimi, Pedro Rezende, Yasaman Haghighi, David Brüggemann, Isinsu Katircioglu, Lin Zhang, Xiaoran Chen, Suman Saha, et al. GEM: A generalizable ego-vision multimodal world model for fine-grained ego-motion, object dynamics, and scene composition control. In CVPR, 2025. 1, 7, 17

  19. [19]

    A control-centric benchmark for video prediction

    Stephen Tian, Chelsea Finn, and Jiajun Wu. A control-centric benchmark for video prediction. In ICLR,

  20. [20]

    AdaWorld: Learning adaptable world models with latent actions

    Shenyuan Gao, Siyuan Zhou, Yilun Du, Jun Zhang, and Chuang Gan. AdaWorld: Learning adaptable world models with latent actions. In ICML, 2025. 1

  21. [21]

    ACT-Bench: Towards action controllable world models for autonomous driving

    Hidehisa Arai, Keishi Ishihara, Tsubasa Takahashi, and Yu Yamaguchi. ACT-Bench: Towards action controllable world models for autonomous driving. arXiv preprint arXiv:2412.05337, 2024. 1

  22. [22]

    Is ego status all you need for open-loop end-to-end autonomous driving?

    Zhiqi Li, Zhiding Yu, Shiyi Lan, Jiahan Li, Jan Kautz, Tong Lu, and Jose M Alvarez. Is ego status all you need for open-loop end-to-end autonomous driving? In CVPR, 2024. 1

  23. [23]

    NAVSIM: Data-driven non-reactive autonomous vehicle simulation and benchmarking

    Daniel Dauner, Marcel Hallgarten, Tianyu Li, Xinshuo Weng, Zhiyu Huang, Zetong Yang, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, et al. NAVSIM: Data-driven non-reactive autonomous vehicle simulation and benchmarking. In NeurIPS Datasets and Benchmarks, 2024. 1, 2, 3, 4, 6, 7, 8, 18, 20, 21, 22

  24. [24]

    Learning to drive from a world on rails

    Dian Chen, Vladlen Koltun, and Philipp Krähenbühl. Learning to drive from a world on rails. In ICCV,

  25. [25]

    Rates of motor vehicle crashes, injuries and deaths in relation to driver age, United States, 2014-2015

    Brian Tefft. Rates of motor vehicle crashes, injuries and deaths in relation to driver age, United States, 2014-2015. AAA Foundation for Traffic Safety, 2017. 1

  26. [26]

    ActiveAD: Planning-oriented active learning for end-to-end autonomous driving

    Han Lu, Xiaosong Jia, Yichen Xie, Wenlong Liao, Xiaokang Yang, and Junchi Yan. ActiveAD: Planning-oriented active learning for end-to-end autonomous driving. arXiv preprint arXiv:2403.02877, 2024. 1

  27. [27]

    How Far is Video Generation from World Model: A Physical Law Perspective

    Bingyi Kang, Yang Yue, Rui Lu, Zhijie Lin, Yang Zhao, Kaixin Wang, Gao Huang, and Jiashi Feng. How far is video generation from world model: A physical law perspective. arXiv preprint arXiv:2411.02385,

  28. [28]

    SimGen: Simulator-conditioned driving scene generation

    Yunsong Zhou, Michael Simon, Zhenghao Mark Peng, Sicheng Mo, Hongzi Zhu, Minyi Guo, and Bolei Zhou. SimGen: Simulator-conditioned driving scene generation. In NeurIPS, 2024. 2

  29. [29]

    CARLA: An open urban driving simulator

    Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. CARLA: An open urban driving simulator. In CoRL, 2017. 2, 3, 4, 5, 7, 22

  30. [30]

    CogVideoX: Text-to-video diffusion models with an expert transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. CogVideoX: Text-to-video diffusion models with an expert transformer. In ICLR, 2025. 2, 3, 4, 18, 20, 22

  31. [31]

    Driving into the Future: Multiview visual forecasting and planning with world model for autonomous driving

    Yuqi Wang, Jiawei He, Lue Fan, Hongxin Li, Yuntao Chen, and Zhaoxiang Zhang. Driving into the Future: Multiview visual forecasting and planning with world model for autonomous driving. In CVPR,

  32. [32]

    Enhancing end-to-end autonomous driving with latent world model

    Yingyan Li, Lue Fan, Jiawei He, Yuqi Wang, Yuntao Chen, Zhaoxiang Zhang, and Tieniu Tan. Enhancing end-to-end autonomous driving with latent world model. In ICLR, 2025. 2, 8

  33. [33]

    TransFuser: Imitation with transformer-based sensor fusion for autonomous driving

    Kashyap Chitta, Aditya Prakash, Bernhard Jaeger, Zehao Yu, Katrin Renz, and Andreas Geiger. TransFuser: Imitation with transformer-based sensor fusion for autonomous driving. IEEE TPAMI, 2023. 2, 8

  34. [34]

    nuScenes: A multimodal dataset for autonomous driving

    Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yuxin Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. In CVPR, 2020. 3, 6, 18, 21, 22

  35. [35]

    Scalability in perception for autonomous driving: Waymo open dataset

    Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In CVPR, 2020. 3, 6, 18, 20, 21, 22

  36. [36]

    Mastering atari with discrete world models

    Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models. In ICLR, 2021. 3, 17

  37. [37]

    Mastering Diverse Domains through World Models

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023. 3, 17

  38. [38]

    DriveDreamer: Towards real-world-driven world models for autonomous driving

    Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, Jiagang Zhu, and Jiwen Lu. DriveDreamer: Towards real-world-driven world models for autonomous driving. In ECCV, 2024. 3, 4, 7, 17

  39. [39]

    Bench2Drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving

    Xiaosong Jia, Zhenjie Yang, Qifeng Li, Zhiyuan Zhang, and Junchi Yan. Bench2Drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving. In NeurIPS Datasets and Benchmarks, 2024. 3, 18, 22

  40. [40]

    DriveLM: Driving with graph visual question answering

    Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. DriveLM: Driving with graph visual question answering. In ECCV, 2024. 3

  41. [41]

    PDM-Lite: A rule-based planner for CARLA Leaderboard 2.0

    Jens Beißwenger. PDM-Lite: A rule-based planner for CARLA Leaderboard 2.0. https://github.com/OpenDriveLab/DriveLM/blob/DriveLM-CARLA/pdm_lite/docs/report.pdf, 2024. 3, 18

  42. [42]

    Track4Gen: Teaching video diffusion models to track points improves video generation

    Hyeonho Jeong, Chun-Hao Paul Huang, Jong Chul Ye, Niloy Mitra, and Duygu Ceylan. Track4Gen: Teaching video diffusion models to track points improves video generation. In CVPR, 2025. 4

  43. [43]

    MotiF: Making text count in image animation with motion focal loss

    Shijie Wang, Samaneh Azadi, Rohit Girdhar, Saketh Rambhatla, Chen Sun, and Xi Yin. MotiF: Making text count in image animation with motion focal loss. In CVPR, 2025. 4

  44. [44]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable Video Diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023. 4, 17, 18, 21

  45. [45]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In ICML, 2024. 4

  46. [46]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020. 4

  47. [47]

    Elucidating the design space of diffusion-based generative models

    Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In NeurIPS, 2022. 4

  48. [48]

    Snap Video: Scaled spatiotemporal transformers for text-to-video synthesis

    Willi Menapace, Aliaksandr Siarohin, Ivan Skorokhodov, Ekaterina Deyneka, Tsai-Shien Chen, Anil Kag, Yuwei Fang, Aleksei Stoliar, Elisa Ricci, Jian Ren, et al. Snap Video: Scaled spatiotemporal transformers for text-to-video synthesis. In CVPR, 2024. 4

  49. [49]

    LoRA: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In ICLR, 2022. 4

  50. [50]

    DINOv2: Learning robust visual features without supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. TMLR, 2024. 5, 19, 22

  51. [51]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021. 5

  52. [52]

    CARLA autonomous driving leaderboard

    CARLA autonomous driving leaderboard. https://leaderboard.carla.org/, 2022. 5, 19

  53. [53]

    Learning universal policies via text-guided video generation

    Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation. In NeurIPS, 2023. 5

  54. [54]

    GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

    Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, et al. GR-2: A generative video-language-action model with web-scale knowledge for robot manipulation. arXiv preprint arXiv:2410.06158, 2024. 5

  55. [55]

    PiP: Planning-informed trajectory prediction for autonomous driving

    Haoran Song, Wenchao Ding, Yuxuan Chen, Shaojie Shen, Michael Yu Wang, and Qifeng Chen. PiP: Planning-informed trajectory prediction for autonomous driving. In ECCV, 2020. 5

  56. [56]

    GANs trained by a two time-scale update rule converge to a local Nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, 2017. 6

  57. [57]

    Towards Accurate Generative Models of Video: A New Metric & Challenges

    Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards Accurate Generative Models of Video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018. 6

  58. [58]

    DriveGAN: Towards a controllable high-quality neural simulation

    Seung Wook Kim, Jonah Philion, Antonio Torralba, and Sanja Fidler. DriveGAN: Towards a controllable high-quality neural simulation. In CVPR, 2021. 7, 17

  59. [59]

    WoVoGen: World volume-aware diffusion for controllable multi-camera driving scene generation

    Jiachen Lu, Ze Huang, Jiahui Zhang, Zeyu Yang, and Li Zhang. WoVoGen: World volume-aware diffusion for controllable multi-camera driving scene generation. In ECCV, 2024. 7

  60. [60]

    XVO: Generalized visual odometry via cross-modal self-training

    Lei Lai, Zhongkai Shangguan, Jimuyang Zhang, and Eshed Ohn-Bar. XVO: Generalized visual odometry via cross-modal self-training. In ICCV, 2023. 8, 9, 20, 22

  61. [61]

    Planning-oriented autonomous driving

    Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, et al. Planning-oriented autonomous driving. In CVPR, 2023. 8, 20, 21

  62. [62]

    DrivingGPT: Unifying driving world modeling and planning with multi-modal autoregressive transformers

    Yuntao Chen, Yuqi Wang, and Zhaoxiang Zhang. DrivingGPT: Unifying driving world modeling and planning with multi-modal autoregressive transformers. arXiv preprint arXiv:2412.18607, 2024. 8

  63. [63]

    Pre-training contextualized world models with in-the-wild videos for reinforcement learning

    Jialong Wu, Haoyu Ma, Chaoyi Deng, and Mingsheng Long. Pre-training contextualized world models with in-the-wild videos for reinforcement learning. In NeurIPS, 2023. 17

  64. [64]

    Visual Foresight: Model-Based Deep Reinforcement Learning for Vision-Based Robotic Control

    Frederik Ebert, Chelsea Finn, Sudeep Dasari, Annie Xie, Alex Lee, and Sergey Levine. Visual Foresight: Model-based deep reinforcement learning for vision-based robotic control. arXiv preprint arXiv:1812.00568, 2018. 17

  65. [65]

    Deep visual foresight for planning robot motion

    Chelsea Finn and Sergey Levine. Deep visual foresight for planning robot motion. In ICRA, 2017. 17

  66. [66]

    Learning latent dynamics for planning from pixels

    Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. In ICML, 2019. 17

  67. [67]

    Genie: Generative interactive environments

    Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. In ICML, 2024. 17

  68. [68]

    Structured world models from human videos

    Russell Mendonca, Shikhar Bahl, and Deepak Pathak. Structured world models from human videos. In RSS, 2023. 17

  69. [69]

    Video language planning

    Yilun Du, Sherry Yang, Pete Florence, Fei Xia, Ayzaan Wahid, brian ichter, Pierre Sermanet, Tianhe Yu, Pieter Abbeel, Joshua B. Tenenbaum, Leslie Pack Kaelbling, Andy Zeng, and Jonathan Tompson. Video language planning. In ICLR, 2024. 17

  70. [70]

    FlatFusion: Delving into details of sparse transformer-based camera-lidar fusion for autonomous driving

    Yutao Zhu, Xiaosong Jia, Xinyu Yang, and Junchi Yan. FlatFusion: Delving into details of sparse transformer-based camera-lidar fusion for autonomous driving. arXiv preprint arXiv:2408.06832, 2024. 17

  71. [71]

    Interleave-VLA: Enhancing robot manipulation with interleaved image-text instructions

    Cunxin Fan, Xiaosong Jia, Yihang Sun, Yixiao Wang, Jianglan Wei, Ziyang Gong, Xiangyu Zhao, Masayoshi Tomizuka, Xue Yang, Junchi Yan, et al. Interleave-VLA: Enhancing robot manipulation with interleaved image-text instructions. arXiv preprint arXiv:2505.02152, 2025. 17

  72. [72]

    Learning from all vehicles

    Dian Chen and Philipp Krähenbühl. Learning from all vehicles. In CVPR, 2022. 17

  73. [73]

    Curse of rarity for autonomous vehicles

    Henry X Liu and Shuo Feng. Curse of rarity for autonomous vehicles. Nature Communications, 2024. 17

  74. [74]

    Trajectory-guided control prediction for end-to-end autonomous driving: A simple yet strong baseline

    Penghao Wu, Xiaosong Jia, Li Chen, Junchi Yan, Hongyang Li, and Yu Qiao. Trajectory-guided control prediction for end-to-end autonomous driving: A simple yet strong baseline. In NeurIPS, 2022. 17

  75. [75]

    Think Twice before Driving: Towards scalable decoders for end-to-end autonomous driving

    Xiaosong Jia, Penghao Wu, Li Chen, Jiangwei Xie, Conghui He, Junchi Yan, and Hongyang Li. Think Twice before Driving: Towards scalable decoders for end-to-end autonomous driving. In CVPR, 2023. 17

  76. [76]

    DriveAdapter: Breaking the coupling barrier of perception and planning in end-to-end autonomous driving

    Xiaosong Jia, Yulu Gao, Li Chen, Junchi Yan, Patrick Langechuan Liu, and Hongyang Li. DriveAdapter: Breaking the coupling barrier of perception and planning in end-to-end autonomous driving. In ICCV,

  77. [77]

    DriveTransformer: Unified transformer for scalable end-to-end autonomous driving

    Xiaosong Jia, Junqi You, Zhiyuan Zhang, and Junchi Yan. DriveTransformer: Unified transformer for scalable end-to-end autonomous driving. In ICLR, 2025. 17

  78. [78]

    The arcade learning environment: An evaluation platform for general agents

    Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. JAIR, 2013. 17

  79. [79]

    dm_control: Software and tasks for continuous control

    Saran Tunyasuvunakool, Alistair Muldal, Yotam Doron, Siqi Liu, Steven Bohez, Josh Merel, Tom Erez, Timothy Lillicrap, Nicolas Heess, and Yuval Tassa. dm_control: Software and tasks for continuous control. Software Impacts, 2020. 17

  80. [80]

    ViZDoom: A Doom-based AI research platform for visual reinforcement learning

    Michał Kempka, Marek Wydmuch, Grzegorz Runc, Jakub Toczek, and Wojciech Jaśkowski. ViZDoom: A Doom-based AI research platform for visual reinforcement learning. In CIG, 2016. 17

Showing first 80 references.