Sensor2Sensor: Cross-Embodiment Sensor Conversion for Autonomous Driving

Bo Sun; Chiyu Max Jiang; Dragomir Anguelov; Jiahao Wang; Kanaad V Parvate; Linn Bieske; Meng-Li Shih; Mingxing Tan; Shih-Yang Su; Songyou Peng

arxiv: 2605.22809 · v1 · pith:BQ7HL2FPnew · submitted 2026-05-21 · 💻 cs.CV

Jiahao Wang, Bo Sun, Yijing Bai, Vincent Casser, Songyou Peng, Zehao Zhu, Meng-Li Shih, Xander Masotto, Shih-Yang Su, Kanaad V Parvate, Tiancheng Ge, Linn Bieske, Dragomir Anguelov, Mingxing Tan, Chiyu Max Jiang

Pith reviewed 2026-05-22 05:52 UTC · model grok-4.3

classification 💻 cs.CV

keywords autonomous driving · sensor conversion · dashcam videos · 4D Gaussian Splatting · diffusion models · multi-modal sensors · cross-embodiment · generative modeling

The pith

Sensor2Sensor converts monocular dashcam videos into multi-view camera images and LiDAR point clouds by training a diffusion model on pairs created from real AV logs via 4D Gaussian Splatting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Autonomous driving systems require large amounts of structured multi-modal sensor data, yet fleets collect limited volumes from narrow geographic and behavioral ranges. Everyday dashcam videos supply far greater scale and capture rare events, but they arrive as single-view footage incompatible with standard AV training pipelines. The paper shows how to bridge this mismatch by first reconstructing real AV logs with 4D Gaussian Splatting to render matching dashcam-style views, thereby creating the paired examples needed to train a diffusion model. Once trained, the model translates arbitrary in-the-wild monocular videos into the multi-view images and LiDAR clouds that downstream AV systems expect. If the translation holds, developers gain access to essentially unlimited public video sources without new fleet instrumentation.

Core claim

Sensor2Sensor is a generative modeling method that translates unstructured monocular dashcam videos into a high-fidelity multi-modal sensor suite consisting of multi-view camera images and LiDAR point clouds. The method first converts existing AV logs into dashcam-style videos through 4D Gaussian Splatting reconstruction and novel-view rendering, thereby producing the paired training data that would otherwise be unavailable. A diffusion architecture is then trained on these pairs to learn the cross-embodiment mapping, after which the model can be applied directly to real internet and dashcam footage.

What carries the argument

4D Gaussian Splatting reconstruction of AV logs to synthesize paired dashcam-style training examples, followed by a diffusion model that learns the generative mapping from monocular video to multi-view images and LiDAR point clouds.
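The paper's Figure 2 text, reproduced further down this page, mentions rendering an optimized scene through virtual cameras with augmented intrinsic and extrinsic parameters. A minimal sketch of that sampling step, assuming a pinhole camera model, might look like the following. The function names, the field-of-view range, and the mounting offsets are hypothetical illustrations rather than values from the paper, and the 4DGS renderer itself is out of scope here.

```python
import math
import random


def intrinsics_from_fov(width, height, hfov_deg):
    """Pinhole intrinsic matrix K for an image size and horizontal field of view."""
    fx = (width / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)
    fy = fx  # assumption: square pixels
    cx, cy = width / 2.0, height / 2.0
    return [[fx, 0.0, cx], [0.0, fy, cy], [0.0, 0.0, 1.0]]


def sample_virtual_dashcam(rng, width=1920, height=1080):
    """Draw one hypothetical dashcam: a random FOV plus a jittered windshield mount."""
    hfov_deg = rng.uniform(90.0, 150.0)  # hypothetical range of dashcam optics
    # Mount position in a vehicle frame (x forward, y left, z up), in metres.
    # These ranges are illustrative assumptions, not measured values.
    position = (
        rng.uniform(1.2, 1.8),
        rng.uniform(-0.3, 0.3),
        rng.uniform(1.1, 1.5),
    )
    return {
        "K": intrinsics_from_fov(width, height, hfov_deg),
        "hfov_deg": hfov_deg,
        "position": position,
        "pitch_deg": rng.uniform(-5.0, 5.0),
        "yaw_deg": rng.uniform(-5.0, 5.0),
    }


rng = random.Random(0)  # fixed seed so the sampled rig set is reproducible
cameras = [sample_virtual_dashcam(rng) for _ in range(4)]
```

Each sampled dictionary would be handed to whatever renderer produces the dashcam-style view; widening the sampled ranges is one plausible way to make the synthetic inputs more diverse.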

If this is right

Large volumes of public dashcam and internet video become directly usable as training and validation data for autonomous driving systems.
AV datasets gain coverage of long-tail scenarios and novel environments without additional fleet collection.
Cross-embodiment sensor translation becomes feasible for any new vehicle configuration once a small set of real logs exists for pair generation.
Quantitative fidelity metrics can be computed on generated multi-view images and LiDAR clouds to verify realism before downstream use.
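As one illustration of the last point, a fidelity check on generated LiDAR could compare a generated point cloud against its real counterpart. The sketch below uses the symmetric Chamfer distance, which is an assumption chosen for illustration; this page does not state which point-cloud metrics the paper reports. The toy point sets are made up.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two non-empty lists of (x, y, z) points."""

    def mean_nearest(src, dst):
        # Average distance from each source point to its nearest destination point.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return mean_nearest(points_a, points_b) + mean_nearest(points_b, points_a)


# Toy example: the "generated" cloud is the real one shifted 0.1 m upward.
real = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
generated = [(0.0, 0.0, 0.1), (1.0, 0.0, 0.1), (0.0, 1.0, 0.1)]
score = chamfer_distance(real, generated)
```

The brute-force search is quadratic in the number of points, so a full LiDAR sweep would need a spatial index such as a k-d tree.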

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same paired-data strategy could be applied to convert between other sensor suites, such as adding radar or different camera intrinsics, without collecting new hardware logs.
Generated sensor data might be mixed with limited real logs to reduce privacy concerns while still improving model robustness.
One could test whether perception models trained exclusively on the converted data reach parity with real-data baselines inside closed-loop simulation environments.

Load-bearing premise

The 4D Gaussian Splatting reconstructions from real AV logs must produce dashcam-style videos that are accurate and diverse enough for the diffusion model to generalize to unstructured real-world footage.

What would settle it

Apply the trained model to a held-out collection of in-the-wild dashcam videos, then train an AV perception model on the generated multi-modal outputs and measure whether its accuracy on real AV validation sets exceeds that of the same model trained only on the original limited proprietary logs.

Figures

Figures reproduced from arXiv: 2605.22809 by Bo Sun, Chiyu Max Jiang, Dragomir Anguelov, Jiahao Wang, Kanaad V Parvate, Linn Bieske, Meng-Li Shih, Mingxing Tan, Shih-Yang Su, Songyou Peng, Tiancheng Ge, Vincent Casser, Xander Masotto, Yijing Bai, Zehao Zhu.

**Figure 1.** Sensor2Sensor is a novel generative paradigm for translating in-the-wild monocular videos from varied sources such as dashcams, internet driving videos, phones, and even other Autonomous Driving Systems (ADS), Advanced Driver-Assistance Systems (ADAS) and vehicle platforms into high-fidelity, multi-modal, multi-sensor Autonomous Vehicle (AV) logs specific to a target vehicle embodiment. This enables cross…

**Figure 2.** Synthetic paired-data curation pipeline. We reconstruct 4DGS from 8-view cameras and render a diverse set of synthetic third-party cameras (e.g. popular dashcam models).

…cal object model to achieve more complete object coverage. Once a scene is optimized, it can be rendered using virtual cameras with augmented intrinsic and extrinsic parameters to mimic the optics and placement of dashcams found in-the-w…

**Figure 3.** Our multi-modal, multi-view sensor generation model architecture. Based on Latent Diffusion, the model simultaneously generates multi-view images (C) and LiDAR point clouds (L) using modality-specific VAEs and U-Net towers. Multi-sensor consistency is enforced via cross-sensor attention, and multi-view consistency is maintained with 3D attention blocks.

3.2.1. Multi-view Image Generation. The image branch b…
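The Figure 3 caption says multi-sensor consistency is enforced via cross-sensor attention. As a rough illustration of the general mechanism only, the sketch below implements plain scaled dot-product cross-attention in which image tokens query LiDAR tokens. The learned projections, multi-head structure, and token shapes of the actual model are not described on this page, so everything here (names, toy tokens, the choice of which modality queries which) is an assumption.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: every query attends over all key/value tokens."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    width = len(values[0])
    outputs = []
    for q in queries:
        logits = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(logits)
        # Output token is the attention-weighted average of the value tokens.
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(width)]
        )
    return outputs


# Toy tokens: 2 image tokens attend over 3 LiDAR tokens (feature width 2).
image_tokens = [[1.0, 0.0], [0.0, 1.0]]
lidar_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = cross_attention(image_tokens, lidar_tokens, lidar_tokens)
```

Running the same function with the roles swapped (LiDAR tokens as queries) would give the other direction of information flow, which is one way a two-tower design could keep both modalities aligned.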

**Figure 4.** Image comparison. Our method Sensor2Sensor produces results largely faithful to the ground truth, while the baselines either fail to preserve the scene and object structures, or cannot create plausible generations of the unobserved areas.

**Figure 5.** Temporal video rollout comparison (only showing front view for compactness). DAgger training significantly improves temporal stability of generated videos through the rollout.

4.3. Video Generation. Beyond static images, we evaluate the temporal consistency of our generated multi-view videos. We report quantitative results on our paired “Fixed-Camera-to-AV” dataset in…

**Figure 6.** Qualitative LiDAR Comparison. Our method correctly renders the truck’s shape and has less noise in the surrounding objects, while the other methods produce distortions and incorrect intensity. All methods use the same LiDAR VAE for a fair comparison.

**Figure 7.** Visualization of joint image and LiDAR generation. Sensor2Sensor achieves cross-modal consistency between image and LiDAR, faithfully generating safety-critical objects, including signage, road markings, and vehicles.

**Figure 8.** Qualitative generalization to in-the-wild internet videos. Sensor2Sensor successfully converts diverse and challenging monocular inputs, including long-tail crashes, night-time scenes with low visibility, and active incidents, into full, coherent AV sensor suites.

**Figure 9.** LiDAR detection. We tested a vehicle detection model using real and generated LiDAR. Comparable results confirm the fidelity of our generation. Panels: Real, Generated.

**Figure 10.** Image segmentation. Panoptic-DeepLab [7] produces consistent predictions on real and generated images.

5. Conclusion. Sensor2Sensor is a novel generative paradigm that bridges the embodiment gap between consumer driving videos and the complex, multi-modal sensor suites required for AV validation. Leveraging a 4DGS-based data pairing pipeline and a conditional diffusion architecture, Sensor2Sensor convert…

**Figure 11.** Additional qualitative results for image generation. Our proposed method demonstrates superior fidelity compared to the input.

**Figure 12.** Additional qualitative results for image generation. Our proposed method demonstrates superior fidelity compared to the input.

**Figure 13.** Additional qualitative results for LiDAR generation. Our method yields more accurate geometry in the synthesized point clouds,

**Figure 14.** Additional qualitative results showcasing the Image-LiDAR alignment and cross-modal consistency achieved by our method.

**Figure 15.** Visualization of synthetic dashcam images rendered

Original abstract

Robust training and validation of Autonomous Driving Systems (ADS) require massive, diverse datasets. Proprietary data collected by Autonomous Vehicle (AV) fleets, while high-fidelity, are limited in scale, diversity of sensor configurations, as well as geographic and long-tail-behavioral coverage. In contrast, in-the-wild data from sources like dashcams offers immense scale and diversity, capturing critical long-tail scenarios and novel environments. However, this unstructured, in-the-wild video data is incompatible with ADS expecting structured, multi-modal sensor inputs for validation and training. To bridge this data gap, we propose Sensor2Sensor, a novel generative modeling paradigm that translates in-the-wild monocular dashcam videos into a high-fidelity, multi-modal sensor suite (AV logs) comprising multi-view camera images and LiDAR point clouds. A core challenge is the lack of paired training data. We address this by converting real AV logs into dashcam-style videos via 4D Gaussian Splatting (4DGS) reconstruction and novel-view rendering. Sensor2Sensor then utilizes a diffusion architecture to perform the generative conversion. We perform comprehensive quantitative evaluations on the fidelity and realism of the generated sensor data. We demonstrate Sensor2Sensor's practical utility by converting challenging in-the-wild internet and dashcam footage into realistic, multi-modal data formats, further unlocking vast external data sources for AV development.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Sensor2Sensor gives a practical pipeline for turning dashcam videos into multi-view and LiDAR data using 4DGS pairs plus diffusion, but the evaluation details are thin and the domain gap risk looks real.

The main thing here is a method that converts monocular dashcam videos into the multi-view camera images and LiDAR point clouds that AV systems use. They fix the missing paired data problem by running 4D Gaussian Splatting on real AV logs and rendering dashcam-style views from those reconstructions. A diffusion model trained on those pairs learns the reverse mapping, from dashcam view to AV sensor suite, and is then applied to real unstructured footage. That lets them process internet and dashcam clips into AV-style logs.

The concrete use of 4DGS to bootstrap the pairs for this cross-embodiment task is the new piece, even if the underlying 4DGS and diffusion tools are not. The paper does a clean job of naming the scale and diversity limits of fleet data and showing a way to tap the much larger pool of in-the-wild video. The demonstrations on challenging footage are a plus and show the authors thought about real-world utility.

The soft spots sit mostly in the evaluation and generalization claims. The abstract says they ran comprehensive quantitative checks on fidelity and realism, but no numbers, baselines, or error breakdowns are visible, so it is hard to tell how close the outputs actually get. The stress-test point about 4DGS artifacts in dynamic scenes, reflections, and transients is worth taking seriously; if those errors are systematic, the diffusion model could learn to exploit a synthetic bias that does not transfer to genuine dashcam input. The paper should show ablations or direct tests on that gap.

This is for AV researchers and computer vision people working on sensor data generation or dataset expansion. Anyone trying to stretch limited fleet logs with external video would find the idea useful. It deserves a serious referee because the problem is practical and the method is a straightforward combination of existing pieces applied to a clear need. I would send it to peer review to get the results and the domain-gap handling properly checked.

Referee Report

3 major / 2 minor

Summary. The paper proposes Sensor2Sensor, a generative approach to translate unstructured monocular dashcam videos into structured multi-modal AV sensor data (multi-view images and LiDAR point clouds). It generates paired training data by reconstructing real AV logs with 4D Gaussian Splatting and novel-view rendering, then trains a diffusion model for the cross-embodiment conversion. The work claims comprehensive quantitative evaluations of fidelity and demonstrates application to real in-the-wild internet and dashcam footage.

Significance. If the generated sensor data proves sufficiently realistic and generalizable, the approach could substantially expand usable training data for autonomous driving systems by leveraging abundant in-the-wild sources, addressing limitations in scale, diversity, and long-tail coverage of proprietary AV fleets. The combination of 4DGS for synthetic pairing and diffusion for conversion is a technically coherent direction with clear practical utility.

major comments (3)

[§3.2] (4DGS data generation): The claim that 4DGS-reconstructed and novel-view-rendered dashcam videos provide sufficiently accurate paired training data for generalization to real unstructured footage is load-bearing but unsupported by explicit domain-gap quantification; common 4DGS artifacts in dynamic scenes, specular surfaces, and transient objects could embed a synthetic bias that the diffusion model exploits during training but fails to overcome on genuine dashcam inputs.
[§4] (Quantitative evaluations): The abstract states that comprehensive quantitative evaluations on fidelity and realism were performed, yet the reported results lack concrete metrics, error bars, baseline comparisons, or ablation on reconstruction quality; without these, it is impossible to verify whether the fidelity claims hold or whether the method outperforms prior sensor-conversion or novel-view synthesis techniques.
[§5] (Generalization experiments): The practical utility demonstration on challenging in-the-wild footage does not include failure-case analysis or quantitative assessment of downstream ADS task performance (e.g., perception accuracy on generated vs. real logs), leaving open whether the converted data is actually usable for training or validation.

minor comments (2)

[§3.3] Notation for the diffusion conditioning (e.g., how dashcam video features are injected) is introduced without a clear diagram or pseudocode, making the architecture harder to reproduce.
[Figure 3] Figure 3 caption should explicitly state the source of the ground-truth LiDAR for visual comparison rather than leaving it implicit.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have addressed each of the major comments point-by-point below. Where appropriate, we will revise the manuscript to incorporate the suggestions and strengthen the presentation of our results and evaluations.

Referee: [§3.2] (4DGS data generation): The claim that 4DGS-reconstructed and novel-view-rendered dashcam videos provide sufficiently accurate paired training data for generalization to real unstructured footage is load-bearing but unsupported by explicit domain-gap quantification; common 4DGS artifacts in dynamic scenes, specular surfaces, and transient objects could embed a synthetic bias that the diffusion model exploits during training but fails to overcome on genuine dashcam inputs.

Authors: We agree that an explicit quantification of the domain gap is important to support the use of 4DGS-generated pairs for training. The original manuscript focuses on the overall pipeline and demonstrates generalization qualitatively on in-the-wild data, but does not include direct metrics between 4DGS-rendered dashcam views and real dashcam footage. We will add this analysis in the revision, including quantitative measures such as PSNR, SSIM, and LPIPS on available paired real data, as well as a discussion of 4DGS limitations in handling dynamic elements and specularities. This will help validate the paired data quality.

Revision: yes

Referee: [§4] (Quantitative evaluations): The abstract states that comprehensive quantitative evaluations on fidelity and realism were performed, yet the reported results lack concrete metrics, error bars, baseline comparisons, or ablation on reconstruction quality; without these, it is impossible to verify whether the fidelity claims hold or whether the method outperforms prior sensor-conversion or novel-view synthesis techniques.

Authors: We thank the referee for this observation. Section 4 of the manuscript does present quantitative results on fidelity, including metrics for image and point cloud quality with comparisons to relevant baselines. However, we acknowledge that the presentation could be improved with the addition of error bars, more comprehensive ablations specifically on the 4DGS reconstruction step, and additional baseline methods from novel-view synthesis literature. We will revise §4 to include these elements, providing a clearer and more rigorous evaluation of the method's performance.

Revision: yes

Referee: [§5] (Generalization experiments): The practical utility demonstration on challenging in-the-wild footage does not include failure-case analysis or quantitative assessment of downstream ADS task performance (e.g., perception accuracy on generated vs. real logs), leaving open whether the converted data is actually usable for training or validation.

Authors: We concur that including failure cases and downstream task evaluations would better demonstrate the practical utility. We will add a failure-case analysis subsection with examples of scenarios where the translation may not perform optimally, such as extreme lighting or complex dynamics. Regarding quantitative downstream ADS task performance, such as training and evaluating a perception model on the generated data versus real logs, this would necessitate substantial additional experimentation and computational resources. We will explicitly discuss this as a limitation in the revised manuscript and outline it as an important direction for future work.

Revision: partial
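The first response above names PSNR, SSIM, and LPIPS as planned domain-gap measures. As a minimal sketch of the simplest of the three, PSNR between a rendered view and a reference view could be computed as below. The nested-list grayscale representation and the 8-bit peak value are assumptions made to keep the example self-contained; SSIM and LPIPS need considerably more machinery and are not shown.

```python
import math


def psnr(reference, rendered, max_value=255.0):
    """PSNR in dB between two equally sized grayscale images (nested lists of pixels)."""
    squared_error = 0.0
    count = 0
    for ref_row, ren_row in zip(reference, rendered):
        for ref_px, ren_px in zip(ref_row, ren_row):
            squared_error += (ref_px - ren_px) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy 2x2 images: every rendered pixel is off by one intensity level.
reference = [[10, 20], [30, 40]]
rendered = [[11, 21], [31, 41]]
quality_db = psnr(reference, rendered)
```

Higher values mean the rendered view is closer to the reference; a comparison of this kind only makes sense when the two images are pixel-aligned, which is why the response restricts it to available paired real data.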

standing simulated objections not resolved

Full quantitative assessment of downstream ADS task performance (e.g., perception accuracy), as this requires new experiments not conducted in the current work.

Circularity Check

0 steps flagged

No circularity detected; method relies on external techniques

full rationale

The paper outlines a standard generative pipeline: 4DGS reconstruction of AV logs produces synthetic paired dashcam-style videos, which train a diffusion model for translating real in-the-wild monocular footage into multi-view images and LiDAR. No equations, fitted parameters renamed as predictions, or self-citations are presented as load-bearing in the provided text. The approach depends on independently established methods (4DGS and diffusion models) rather than any self-definitional loop or reduction of outputs to inputs by construction. The central claim remains falsifiable via external benchmarks on real dashcam inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly assumes that 4DGS can faithfully simulate dashcam views from AV logs and that diffusion models can bridge the resulting domain gap.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · 10 internal anchors

[1] Ying A, Wenzhang Sun, Chang Zeng, Chunfeng Wang, Hao Li, and Jianxun Cui. PAGS: Priority-adaptive gaussian splatting for dynamic driving scenes. arXiv preprint arXiv:2510.12282, 2025.

[2] Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical AI. arXiv preprint arXiv:2501.03575, 2025.

[3] Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-JEPA 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985, 2025.

[4] Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. In ICML, 2024.

[5] Xuan Cai, Xuesong Bai, Zhiyong Cui, Danmu Xie, Daocheng Fu, Haiyang Yu, and Yilong Ren. Text2Scenario: Text-driven scenario generation for autonomous driving test. arXiv preprint arXiv:2503.02911, 2025.

[6] Li Chen, Penghao Wu, Kashyap Chitta, Bernhard Jaeger, Andreas Geiger, and Hongyang Li. End-to-end autonomous driving: Challenges and frontiers. IEEE TPAMI, 2024.

[7] Bowen Cheng, Maxwell D Collins, Yukun Zhu, Ting Liu, Thomas S Huang, Hartwig Adam, and Liang-Chieh Chen. Panoptic-DeepLab: A simple, strong, and fast baseline for bottom-up panoptic segmentation. In CVPR, 2020.

[8] Xin Fei, Wenzhao Zheng, Yueqi Duan, Wei Zhan, Masayoshi Tomizuka, Kurt Keutzer, and Jiwen Lu. Driv3R: Learning dense 4D reconstruction for autonomous driving. arXiv preprint arXiv:2412.06777, 2024.

[9] Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, Kun Zhang, and Dacheng Tao. Geometry-consistent generative adversarial networks for one-sided unsupervised domain mapping. In CVPR, 2019.

[10] Ruiqi Gao, Aleksander Holynski, Philipp Henzler, Arthur Brussee, Ricardo Martin-Brualla, Pratul Srinivasan, Jonathan T Barron, and Ben Poole. CAT3D: Create anything in 3D with multi-view diffusion models. In NeurIPS, 2024.

[11] Yuan Gao, Mattia Piccinini, Yuchen Zhang, Dingrui Wang, Korbinian Moller, Roberto Brusnicki, Baha Zarrouki, Alessio Gambi, Jan Frederik Totz, Kai Storms, et al. Foundation models in autonomous driving: A survey on scenario generation and scenario analysis. IEEE Open Journal of Intelligent Transportation Systems, 2026.

[12] Xiangyu Guo, Zhanqian Wu, Kaixin Xiong, Ziyang Xu, Lijun Zhou, Gangwei Xu, Shaoqing Xu, Haiyang Sun, Bing Wang, Guang Chen, et al. Genesis: Multimodal driving scene generation with spatio-temporal and cross-modal consistency. arXiv preprint arXiv:2506.07497, 2025.

[13] David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018.

[14] Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. In ICML, 2019.

[15] Danijar Hafner, Wilson Yan, and Timothy Lillicrap. Training agents inside of scalable world models. arXiv preprint arXiv:2509.24527, 2025.

[16] Deepti Hegde, Rajeev Yasarla, Hong Cai, Shizhong Han, Apratim Bhattacharyya, Shweta Mahajan, Litian Liu, Risheek Garrepalli, Vishal M. Patel, and Fatih Porikli. Distilling multi-modal large language models for autonomous driving. In CVPR, 2025.

[17] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, 2017.

[18] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020.

[19] Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. GAIA-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080, 2023.

[20] Nan Huang, Xiaobao Wei, Wenzhao Zheng, Pengju An, Ming Lu, Wei Zhan, Masayoshi Tomizuka, Kurt Keutzer, and Shanghang Zhang. S³Gaussian: Self-Supervised Street Gaussians for Autonomous Driving. arXiv preprint arXiv:2405.20323, 2024.

[21] Pin Ji, Yang Feng, Zongtai Li, Xiangchi Zhou, Jia Liu, Jun Sun, and Zhihong Zhao. Txt2Sce: Scenario generation for autonomous driving system testing based on textual reports. arXiv preprint arXiv:2509.02150, 2025.

[22] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D Gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics,

[23] Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, et al. Cosmos Policy: Fine-tuning video models for visuomotor control and planning. arXiv preprint arXiv:2601.16163, 2026.

[24] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2014.

[25] Yann LeCun. A path towards autonomous machine intelligence version 0.9.2, 2022-06-27. Open Review, 2022.

[26] Bohan Li, Jiazhe Guo, Hongsi Liu, Yingshuang Zou, Yikang Ding, Xiwu Chen, Hu Zhu, Feiyang Tan, Chi Zhang, Tiancai Wang, et al. UniScene: Unified occupancy-centric driving scene generation. In CVPR, 2025.

[27] Taiming Lu, Tianmin Shu, Junfei Xiao, Luoxin Ye, Jiahao Wang, Cheng Peng, Chen Wei, Daniel Khashabi, Rama Chellappa, Alan Yuille, et al. GenEx: Generating an explorable world. arXiv preprint arXiv:2412.09624, 2024.

[28] Yan Miao, Georgios Fainekos, Bardh Hoxha, Hideki Okamoto, Danil Prokhorov, and Sayan Mitra. From dashcam videos to driving simulations: Stress testing automated vehicles against rare events. arXiv preprint arXiv:2411.16027,

[29] Chenbin Pan, Burhaneddin Yaman, Tommaso Nesti, Abhirup Mallik, Alessandro G Allievi, Senem Velipasalar, and Liu Ren. VLP: Vision language planning for autonomous driving. In CVPR, 2024.

[30] William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, 2023.

[31] Chensheng Peng, Chengwei Zhang, Yixiao Wang, Chenfeng Xu, Yichen Xie, Wenzhao Zheng, Kurt Keutzer, Masayoshi Tomizuka, and Wei Zhan. DeSiRe-GS: 4D street gaussians for static-dynamic decomposition and surface reconstruction for urban driving scenes. In CVPR, 2025.

[32] Haoxi Ran, Vitor Guizilini, and Yue Wang. Towards realistic scene generation with LiDAR diffusion models. In CVPR,

[33] Xuanchi Ren, Yifan Lu, Hanxue Liang, Zhangjie Wu, Huan Ling, Mike Chen, Sanja Fidler, Francis Williams, and Jiahui Huang. SCube: Instant large-scale scene reconstruction using VoxSplats. In NeurIPS, 2024.

[34] Stéphane Ross, Geoffrey Gordon, and J. Andrew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, 2011.

[35] Chinmay Samak, Tanmay Samak, Bing Li, and Venkat Krovi. Sim2real diffusion: Leveraging foundation vision language models for adaptive automated driving. RA-L,

[36] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[37] Bharat Singh, Viveka Kulharia, Luyu Yang, Avinash Ravichandran, Ambrish Tyagi, and Ashish Shrivastava. Genmm: Geometrically and temporally consistent multi-modal data generation for video and lidar. arXiv preprint arXiv:2406.10722, 2024.
[38] Vincent Sitzmann, Semon Rezchikov, William T. Freeman, Joshua B. Tenenbaum, and Fredo Durand. Light field networks: Neural scene representations with single-evaluation rendering. In NeurIPS, 2021.
[39] Rui Song, Chenwei Liang, Yan Xia, Walter Zimmer, Hu Cao, Holger Caesar, Andreas Festag, and Alois Knoll. Coda-4dgs: Dynamic gaussian splatting with context and deformation awareness for autonomous driving. In ICCV, 2025.
[40] Tao Tang, Enhui Ma, Xia Zhou, Letian Wang, Tianyi Yan, Xueyang Zhang, Kun Zhan, Peng Jia, Xianpeng Lang, Jia-Wang Bian, et al. Omnigen: Unified multimodal sensor generation for autonomous driving. In ACM MM, 2025.
[41] Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and Sylvain Gelly. Fvd: A new metric for video generation. ICLR Workshop,
[42] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.
[43] Jingkang Wang, Henry Che, Yun Chen, Ze Yang, Lily Goli, Sivabalan Manivasagam, and Raquel Urtasun. Flux4d: Flow-based unsupervised 4d reconstruction. arXiv preprint arXiv:2512.03210, 2025.
[44] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In CVPR, 2025.
[45] Jiahao Wang, Zhenpei Yang, Yijing Bai, Yingwei Li, Yuliang Zou, Bo Sun, Abhijit Kundu, Jose Lezama, Luna Yue Huang, Zehao Zhu, et al. Drive&gen: Co-evaluating end-to-end driving and video generation models. In IROS, 2025.
[46] Jiahao Wang, Luoxin Ye, TaiMing Lu, Junfei Xiao, Jiahan Zhang, Yuxiang Guo, Xijun Liu, Rama Chellappa, Cheng Peng, Alan Yuille, et al. Evoworld: Evolving panoramic world generation with explicit 3d memory. arXiv preprint arXiv:2510.01183, 2025.
[47] Linhan Wang, Kai Cheng, Shuo Lei, Shengkun Wang, Wei Yin, Chenyang Lei, Xiaoxiao Long, and Chang-Tien Lu. Dc-gaussian: Improving 3d gaussian splatting for reflective dash cam videos. In NeurIPS, 2024.
[48] Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He. π³: Scalable permutation-equivariant visual geometry learning. arXiv preprint arXiv:2507.13347,
[49] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. TIP, 2004.
[50] Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 4d gaussian splatting for real-time dynamic scene rendering. In CVPR, 2024.
[51] Yichen Xie, Chenfeng Xu, Chensheng Peng, Shuqi Zhao, Nhat Ho, Alexander T. Pham, Mingyu Ding, Masayoshi Tomizuka, and Wei Zhan. X-drive: Cross-modality consistent multi-sensor data synthesis for driving scenarios. In ICLR, 2025.
[52] Zheyuan Zhan, Defang Chen, Jian-Ping Mei, Zhenghe Zhao, Jiawei Chen, Chun Chen, Siwei Lyu, and Can Wang. Conditional image synthesis with diffusion models: A survey. TMLR, 2025.
[53] Jiahan Zhang, Muqing Jiang, Nanru Dai, Taiming Lu, Arda Uzunoglu, Shunchi Zhang, Yana Wei, Jiahao Wang, Vishal M Patel, Paul Pu Liang, et al. World-in-world: World models in a closed-loop world. arXiv preprint arXiv:2510.18135, 2025.
[54] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
[55] Guosheng Zhao, Chaojun Ni, Xiaofeng Wang, Zheng Zhu, Xueyang Zhang, Yida Wang, Guan Huang, Xinze Chen, Boyuan Wang, Youyi Zhang, Wenjun Mei, and Xingang Wang. Drivedreamer4d: World models are effective data machines for 4D driving scene representation. In CVPR, 2025.
[56] Zehao Zhu, Yuliang Zou, Chiyu Max Jiang, Bo Sun, Vincent Casser, Xiukun Huang, Jiahao Wang, Zhenpei Yang, Ruiqi Gao, Leonidas Guibas, et al. Scenecrafter: Controllable multi-view driving scene editing. In CVPR, 2025.

Sensor2Sensor: Cross-Embodiment Sensor Conversion for Autonomous Driving

Supplementary Material

A. Extended Qualitative Results

In this...
is then used to compute the distance between these weighted vectors. Finally, the total $\mathcal{L}_{\mathrm{LPIPS}}$ is the sum of these spatially-averaged distances across all included layers $i$. The LPIPS loss on the signals (normals, elongation, intensity, and validity) is calculated by:

$$\mathcal{L}^{\mathrm{LPIPS}}_{\mathrm{signal}} = \lambda_{\mathrm{signal}} \, \mathcal{L}_{\mathrm{LPIPS}}\big(f^{L}_{\mathrm{signal}}, \hat{f}^{L}_{\mathrm{signal}}\big) \tag{6}$$

Here, $\lambda_{\mathrm{signal}}$ is the corre...
