ReSim: Reliable World Simulation for Autonomous Driving
Pith reviewed 2026-05-19 09:24 UTC · model grok-4.3
The pith
ReSim simulates open-world driving scenarios under hazardous non-expert actions by training on mixed real and simulator data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors claim that enriching real-world driving data with simulator-collected non-expert trajectories and training a controllable diffusion transformer world model on the combined corpus produces reliable simulations of diverse open-world scenarios under various ego actions, including hazardous ones. Strategies are introduced to integrate conditioning signals effectively for improved controllability and visual fidelity. The Video2Reward module then derives reward estimates from ReSim outputs to enable planning and policy selection.
What carries the argument
A diffusion transformer world model trained on a heterogeneous corpus of real expert trajectories and simulator non-expert data, with added conditioning integration strategies and a Video2Reward module that estimates rewards from simulated video futures.
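The planning and policy-selection use described above can be pictured as a loop: simulate a future for each candidate action, score it, keep the best. The following is a minimal sketch under stated assumptions, not the authors' implementation. The names `select_action`, `toy_simulate`, and `toy_reward` are hypothetical, and the toy "video" is a list of lateral offsets standing in for ReSim's generated frames and the Video2Reward estimate.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Action = TypeVar("Action")
Video = TypeVar("Video")


def select_action(
    candidates: Sequence[Action],
    simulate: Callable[[Action], Video],
    reward: Callable[[Video], float],
) -> Tuple[Action, float]:
    """Return the candidate whose simulated future earns the highest reward."""
    if not candidates:
        raise ValueError("need at least one candidate action")
    # Roll out each candidate through the world model, then score the result.
    scored = [(reward(simulate(action)), action) for action in candidates]
    best_reward, best_action = max(scored, key=lambda pair: pair[0])
    return best_action, best_reward


# Hypothetical stand-ins for the world model and the reward module.
def toy_simulate(steering: float) -> List[float]:
    # The "video" is the lateral offset at each of five future steps.
    return [steering * step for step in range(1, 6)]


def toy_reward(video: List[float]) -> float:
    # Reward is highest when the vehicle stays at the lane centre.
    return 1.0 - max(abs(offset) for offset in video)


best_action, best_reward = select_action([-0.5, 0.0, 0.3], toy_simulate, toy_reward)
```

The selection logic is independent of how the future is produced, which is why the world model and the reward estimator are passed in as plain callables here.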
If this is right
- ReSim achieves up to 44% higher visual fidelity than prior models.
- Controllability improves by over 50% for both expert and non-expert actions.
- Planning performance on NAVSIM rises by 2% and policy selection by 25%.
- Simulated futures now support reward-based judgment of diverse driving actions including hazardous ones.
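This summary does not say whether the percentages above are relative changes or absolute points, nor which direction counts as better for each metric. As a hedged illustration of one common convention, the sketch below computes a relative gain and handles lower-is-better scores (such as FVD or FID) separately from higher-is-better ones. The function name `relative_gain` and all numbers are invented for the example and are not taken from the paper.

```python
def relative_gain(baseline: float, ours: float, lower_is_better: bool) -> float:
    """Relative improvement over a baseline, in percent (positive means better)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero for a relative gain")
    # For error-like metrics an improvement is a decrease, so flip the sign.
    delta = (baseline - ours) if lower_is_better else (ours - baseline)
    return 100.0 * delta / abs(baseline)


# Made-up values, chosen only to show the arithmetic.
fidelity_gain = relative_gain(baseline=200.0, ours=150.0, lower_is_better=True)
score_gain = relative_gain(baseline=80.0, ours=84.0, lower_is_better=False)
```

Under this convention a drop from 200 to 150 on an error metric is a 25% gain, while a rise from 80 to 84 on a score is a 5% gain; an absolute-points reading of the same second pair would report 4.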
Where Pith is reading between the lines
- This mixed-data training strategy could apply to other robotics domains where safe expert data is common but risky exploration is needed.
- The Video2Reward module might transfer to reward estimation in other video prediction systems for decision making.
- If domain gaps remain small, the method could speed up testing of rare-event robustness in autonomous systems without new real-world collection.
Load-bearing premise
Mixing simulator trajectories with real-world demonstrations does not create unmodeled domain gaps that degrade performance on real hazardous scenarios.
What would settle it
Apply ReSim to a recorded real-world hazardous maneuver such as sudden swerving and measure whether the generated future frames match actual vehicle dynamics and scene elements in held-out footage.
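One way to quantify "match actual vehicle dynamics" is to compare an ego trajectory estimated from the generated frames against the recorded one. The sketch below uses average displacement error as that comparison. It is an assumption of this example, not a protocol stated on this page, that both trajectories are available as equal-length lists of (x, y) positions in metres; the function name and the sample points are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def average_displacement_error(predicted: Sequence[Point], recorded: Sequence[Point]) -> float:
    """Mean Euclidean distance between time-aligned trajectory points."""
    if len(predicted) != len(recorded) or not predicted:
        raise ValueError("trajectories must be non-empty and of equal length")
    distances = [
        math.hypot(px - rx, py - ry)
        for (px, py), (rx, ry) in zip(predicted, recorded)
    ]
    return sum(distances) / len(distances)


# Invented points: the recorded path goes straight, the estimate drifts sideways.
recorded_path = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
estimated_path = [(0.0, 0.0), (1.0, 0.5), (2.0, 1.0)]
ade = average_displacement_error(estimated_path, recorded_path)
```

A trajectory metric covers only the vehicle-dynamics half of the test; agreement of scene elements would need a separate image-level or object-level comparison.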
Original abstract
How can we reliably simulate future driving scenarios under a wide range of ego driving behaviors? Recent driving world models, developed exclusively on real-world driving data composed mainly of safe expert trajectories, struggle to follow hazardous or non-expert behaviors, which are rare in such data. This limitation restricts their applicability to tasks such as policy evaluation. In this work, we address this challenge by enriching real-world human demonstrations with diverse non-expert data collected from a driving simulator (e.g., CARLA), and building a controllable world model trained on this heterogeneous corpus. Starting with a video generator featuring a diffusion transformer architecture, we devise several strategies to effectively integrate conditioning signals and improve prediction controllability and fidelity. The resulting model, ReSim, enables Reliable Simulation of diverse open-world driving scenarios under various actions, including hazardous non-expert ones. To close the gap between high-fidelity simulation and applications that require reward signals to judge different actions, we introduce a Video2Reward module that estimates a reward from ReSim's simulated future. Our ReSim paradigm achieves up to 44% higher visual fidelity, improves controllability for both expert and non-expert actions by over 50%, and boosts planning and policy selection performance on NAVSIM by 2% and 25%, respectively.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes ReSim, a controllable diffusion-transformer world model for simulating open-world driving scenarios under expert and non-expert (including hazardous) actions. It enriches real-world human demonstrations with non-expert trajectories collected in CARLA, introduces conditioning integration strategies to improve fidelity and controllability, and adds a Video2Reward module that derives reward signals from the simulated futures. Quantitative claims include up to 44% higher visual fidelity, over 50% better controllability for both action types, and 2%/25% gains in planning and policy selection on NAVSIM.
Significance. If the central claims hold after addressing domain-gap concerns, the work would be significant for autonomous-driving world models: it directly tackles the rarity of hazardous trajectories in real data and supplies a practical bridge from simulation to reward-based policy evaluation. The Video2Reward component is a concrete contribution that could be reused beyond this architecture.
Major comments (2)
- Abstract and §4 (Experiments): the reported 44% visual-fidelity and >50% controllability gains are presented without explicit evaluation protocols, baseline definitions, statistical significance tests, or precise definitions of “non-expert” actions; these omissions make the numbers impossible to interpret or reproduce from the given information.
- §3.2 (Data Construction): mixing CARLA non-expert trajectories with real-world data is load-bearing for the generalization claim, yet no domain-adaptation layers, cycle-consistency losses, or real-only ablation studies are described; without them the model may exploit simulator-specific artifacts, undermining transfer to real hazardous scenarios and the downstream NAVSIM reward estimates.
Minor comments (2)
- Figure 3 and §4.3: axis labels and legend entries for the controllability metrics are difficult to read; enlarge fonts and add a table of exact numerical values.
- §2 (Related Work): the discussion of prior driving world models omits recent diffusion-based video generators that also condition on actions; add these references for completeness.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We have carefully reviewed the major comments and provide point-by-point responses below, along with our plans for revisions to improve clarity, reproducibility, and robustness.
- Referee: Abstract and §4 (Experiments): the reported 44% visual-fidelity and >50% controllability gains are presented without explicit evaluation protocols, baseline definitions, statistical significance tests, or precise definitions of “non-expert” actions; these omissions make the numbers impossible to interpret or reproduce from the given information.
  Authors: We appreciate the referee's emphasis on reproducibility. Evaluation protocols are described in §4, including visual fidelity metrics (FVD, FID) and controllability measures (action prediction accuracy and trajectory deviation). Baselines include prior world models such as DriveDreamer and Vista, with non-expert actions defined as CARLA trajectories exhibiting high deviation from expert human demonstrations (e.g., via steering/throttle variance thresholds). To address the concern directly, we will add a dedicated evaluation protocol subsection, report statistical significance via paired t-tests with p-values, and provide explicit formulas for the percentage gains in the revised version.
  Revision: yes
- Referee: §3.2 (Data Construction): mixing CARLA non-expert trajectories with real-world data is load-bearing for the generalization claim, yet no domain-adaptation layers, cycle-consistency losses, or real-only ablation studies are described; without them the model may exploit simulator-specific artifacts, undermining transfer to real hazardous scenarios and the downstream NAVSIM reward estimates.
  Authors: We acknowledge the importance of addressing potential domain gaps. Our conditioning integration strategies (§3.1) and heterogeneous training enable the model to generalize without dedicated adaptation layers or cycle-consistency losses, as supported by strong real-world NAVSIM results. However, to strengthen the claim, we will add a real-only ablation study in the revised experiments section. We disagree that explicit domain-adaptation is required here, as the diffusion-transformer architecture and data preprocessing sufficiently mitigate simulator artifacts for the reported controllability and planning gains.
  Revision: partial
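The first response promises significance tests via paired t-tests. As a rough sketch of what that entails, the block below computes the paired t statistic and its degrees of freedom from per-clip scores of two models evaluated on the same clips. It uses only the standard library and stops short of the p-value, which would come from a t-distribution (for instance through `scipy.stats.ttest_rel`). The function name and the error values are invented for illustration and do not come from the paper.

```python
import math
import statistics
from typing import Sequence, Tuple


def paired_t_statistic(scores_a: Sequence[float], scores_b: Sequence[float]) -> Tuple[float, int]:
    """Return (t, degrees_of_freedom) for paired samples scores_a and scores_b."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length samples with at least two pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t_value = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t_value, len(diffs) - 1


# Made-up per-clip trajectory errors for a baseline and a candidate model.
baseline_errors = [2.0, 3.0, 4.0, 5.0]
candidate_errors = [1.0, 1.0, 3.0, 3.0]
t_value, dof = paired_t_statistic(baseline_errors, candidate_errors)
```

Pairing matters because both models are scored on identical clips: the test looks at per-clip differences, so clip-to-clip difficulty does not inflate the variance.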
Circularity Check
No circularity in empirical training of mixed-data world model
The paper presents an empirical ML approach: a diffusion transformer is trained on a mixed corpus of real expert trajectories and CARLA non-expert data, with additional conditioning strategies and a separate Video2Reward module. Reported gains (fidelity, controllability, NAVSIM planning) are measured outcomes on held-out sets rather than quantities that reduce by construction to the training inputs or fitted parameters. No self-definitional equations, fitted-input predictions, or load-bearing self-citation chains appear in the derivation; the central claims remain falsifiable through external benchmarks and do not rely on renaming or smuggling prior ansatzes.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Paper passage: ReSim is built on CogVideoX, a high-capacity diffusion transformer... L_diffusion + λ L_dynamics... unbalanced noise sampling... Video2Reward model... DINOv2 backbone
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Paper passage: enriching real-world human demonstrations with diverse non-expert data collected from a driving simulator (e.g., CARLA)
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 8 Pith papers
- From Articulated Kinematics to Routed Visual Control for Action-Conditioned Surgical Video Generation
  A kinematic-to-visual lifting paradigm combined with hierarchically routed control generates action-conditioned surgical videos with better faithfulness, fidelity, and efficiency.
- Learning Vision-Language-Action World Models for Autonomous Driving
  VLA-World improves autonomous driving by using action-guided future image generation followed by reflective reasoning over the imagined scene to refine trajectories.
- CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
  CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and t...
- CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
  CoWorld-VLA encodes world information into four expert tokens that condition a diffusion-based planner, yielding competitive collision avoidance and trajectory accuracy on the NAVSIM benchmark.
- DriveFuture: Future-Aware Latent World Models for Autonomous Driving
  DriveFuture achieves SOTA results on NAVSIM by conditioning latent world model states on future predictions to directly inform trajectory planning.
- Sim2Real-AD: A Modular Sim-to-Real Framework for Deploying VLM-Guided Reinforcement Learning in Real-World Autonomous Driving
  Sim2Real-AD enables zero-shot transfer of CARLA-trained VLM-guided RL policies to full-scale vehicles, reporting 75-90% success rates in car-following, obstacle avoidance, and stop-sign scenarios without real-world RL...
- ExploreVLA: Dense World Modeling and Exploration for End-to-End Autonomous Driving
  ExploreVLA augments VLA driving models with future RGB and depth prediction for dense supervision and uses prediction uncertainty as a safety-gated intrinsic reward for RL-based exploration, reaching SOTA PDMS 93.7 on NAVSIM.
- DriveLaW: Unifying Planning and Video Generation in a Latent Driving World
  DriveLaW unifies video world modeling and trajectory planning by injecting video-generator latents into a diffusion planner, achieving SOTA video prediction and a new record on the NAVSIM planning benchmark.