Dash2Sim: Closed-Loop Driving Simulation from in-the-wild Dashcam Videos

Angela Chen; Anurag Ghosh; Francesco Pittaluga; Juan Alvarez-Padilla; Khiem Vuong; Manmohan Chandraker; Srinivasa Narasimhan

arxiv: 2606.07366 · v1 · pith:XCWDP7FInew · submitted 2026-06-05 · 💻 cs.CV · cs.LG· cs.RO

Dash2Sim: Closed-Loop Driving Simulation from in-the-wild Dashcam Videos

Anurag Ghosh , Francesco Pittaluga , Khiem Vuong , Angela Chen , Juan Alvarez-Padilla , Manmohan Chandraker , Srinivasa Narasimhan This is my paper

Pith reviewed 2026-06-27 22:13 UTC · model grok-4.3

classification 💻 cs.CV cs.LGcs.RO

keywords dashcam videos4D scene reconstructionclosed-loop simulationwork zonesautonomous driving plannersnovel view synthesisdriving logs

0 comments

The pith

Dash2Sim converts monocular dashcam videos into metric geo-referenced 4D driving logs verifiable against independent maps without annotations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a framework that extracts accurate 4D scenes from ordinary dashcam footage captured in the wild. It produces logs that plug directly into existing simulators and checks them against maintained maps. The resulting ROADWork4D dataset covers thousands of work-zone scenes across many cities. Tests on a verified subset show that current planners, even rule-based ones, cannot execute the lane changes these temporary zones demand. Depth estimates from the same process also raise the quality of novel-view synthesis.

Core claim

Dash2Sim turns in-the-wild monocular dashcam videos into metric, geo-referenced 4D driving logs compatible with existing simulators, and verifies each one against an independently maintained map without annotations. Applied to a large video corpus it produces the ROADWork4D benchmark of 4,244 scenes containing 2.7M 3D objects across 17 cities. On the verified closed-loop subset ROADWork4D-CL all tested planners fail to perform the lane changes required by temporary work-zone channels.

What carries the argument

Dash2Sim pipeline that recovers metric 4D scenes from monocular video and performs annotation-free map verification for simulator compatibility.

If this is right

Rule-based, hybrid, and learning-based planners all fail to execute required lane changes in verified work-zone scenarios.
Dense depth recovered by the method raises novel-view synthesis quality by up to 19 percent on perceptual metrics.
The benchmark supplies 2,201 closed-loop scenes that expose generalization gaps in current planning approaches.
Work-zone situations captured from real dashcams become usable for simulation without manual annotation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Dashcam video could become a scalable source for populating simulators with rare driving conditions beyond the 17 cities already processed.
The same reconstruction step might support sensor simulation pipelines that condition on recovered depth rather than synthetic assets.
Extending verification to other long-tailed events such as construction detours or weather changes would test the same monocular-to-metric premise.

Load-bearing premise

Monocular video alone supplies enough information to recover metric-accurate 4D scenes that stay faithful when used for closed-loop simulation and map verification.

What would settle it

Running the generated 4D logs through a simulator and finding that the resulting vehicle paths or object placements deviate measurably from the original dashcam trajectories or the independent map data.

Figures

Figures reproduced from arXiv: 2606.07366 by Angela Chen, Anurag Ghosh, Francesco Pittaluga, Juan Alvarez-Padilla, Khiem Vuong, Manmohan Chandraker, Srinivasa Narasimhan.

**Figure 1.** Figure 1: Top row: Dash2Sim takes an in-the-wild monocular dashcam video (left) and recovers a georeferenced 3D reconstruction at metric scale (center). Long-tailed objects such as cones, barrels, and barricades are detected, tracked, and lifted to 3D (right). Middle row: The resulting 4D driving log supports closed-loop simulation with reactive agents. Bottom row: Applying Dash2Sim to a large dashcam dataset produ… view at source ↗

**Figure 2.** Figure 2: Dense, geo-referenced point clouds recovered by Dash2Sim. ROADWork4D scenes after reconstruction. Roads, lane markings, sidewalks, vegetation, and small long-tailed work-zone objects (cones, tubular markers, signs) are reconstructed at metric scale from in-the-wild monocular videos. Best viewed zoomed in. Supplemental material, website and walkthrough video contain additional media and visualizations. det… view at source ↗

**Figure 3.** Figure 3: Driving logs from ROADWork4D recovered by Dash2Sim. Our logs include long-tailed work zone scenarios with rare objects, layouts and complex lane change behaviors such as crossing the double yellow line in a work zone (top row). We believe our Dash2Sim framework and ROADWork4D dataset would spur research in a variety of self-driving tasks, including closed loop planning and sensor simulation. Supplemental m… view at source ↗

**Figure 4.** Figure 4: Recovered ego trajectories. Red: coarse GPS tracks from the dashcam video. Blue: metric 6-DoF poses recovered by Dash2Sim. Coarse GPS drifts into adjacent blocks, while recovered poses align with the driven lanes. We validate Dash2Sim components against ground truth or downstream proxies. For example, Street View anchoring recovers metric ego trajectories with median translation error within 10 cm and rot… view at source ↗

**Figure 5.** Figure 5: Planner Error Analysis. PlanTF [58] shows low progress, Diffusion Planner [40] is aggressive and has a high crash rate, and Pluto [24] sacrifices comfort for progress. PDM-Closed [25], while being safe, is very conservative. Pluto Hybrid [24] balances all metrics. Results and Insights. Our results show that a counterintuitive finding from prior work [25], shown on regular driving scenarios, extends even … view at source ↗

**Figure 6.** Figure 6: Novel View Synthesis of ROADWork4D Scenes with Our Depth Supervision. Dense depth from Dash2Sim improves novel-view synthesis quality, especially for small long-tailed objects such as vertical panels and temporary traffic control signs that contain textual details relevant for navigating work zones [17]. improves perceptual details of both common and long-tailed objects, including legibility of temporary s… view at source ↗

read the original abstract

Self-driving simulations typically rely on data collected in a small number of cities or on hand-authored synthetic scenarios. Dashcam videos cover a far broader range of locations and situations, including rare or long-tailed scenarios. They are considered less usable for simulation because it is difficult to recover accurate 4D scenes from monocular in-the-wild videos. Work zones are one such class of long-tailed situations that dashcams capture. We present Dash2Sim, a framework that turns in-the-wild monocular dashcam videos into metric, geo-referenced 4D driving logs compatible with existing simulators, and verifies eachone against an independently maintained map without annotations. We apply Dash2Sim to a large video corpus to create the ROADWork4D benchmark dataset, which spans 4,244 scenes with 2.7M 3D objects across 17 cities. On a verified subset ROADWork4D-CL (2,201 scenes), we study privileged closed-loop planners and find that work zone scenarios are difficult: while rule-based and hybrid planners generalize better than learning-based ones, all fall short, failing to make the lane changes that temporary work zone channels require. Beyond planning, dense depth recovered by Dash2Sim improves novel-view synthesis quality by up to 19% on perceptual metrics, suggesting its potential to provide rich conditioning for closed-loop sensor simulation from monocular videos.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Dash2Sim gives a practical pipeline for monocular dashcam to metric 4D logs on work zones and shows planners failing there, but the abstract supplies no scale or trajectory error numbers to back the metric claims.

read the letter

The core contribution is a pipeline that converts ordinary dashcam video into geo-referenced 4D logs usable in existing simulators, plus a new benchmark dataset ROADWork4D with over 4k scenes across 17 cities focused on work zones. They verify the outputs against independent maps without annotations and run closed-loop planner tests on a 2k-scene subset, reporting that rule-based and hybrid planners do better than learned ones but all miss the required lane changes.

The dataset size and the map-verification step without extra labels are the parts that feel useful. Collecting real long-tail scenes at this scale is a step past the usual small-city or synthetic setups, and the 19% perceptual gain on novel-view synthesis from the recovered depth is a concrete side result.

The soft spot is exactly what the stress-test note flags: the central claim is metric accuracy from monocular video alone, yet the abstract gives no scale-error stats, absolute trajectory error against RTK or LiDAR, or cross-city consistency checks. Without those numbers it is hard to know whether the planner failures come from the scenes or from reconstruction drift. The methods section will need to show this explicitly.

This paper is for people building simulators or testing planners on rare driving events. The dataset could be worth using even if the full pipeline needs tighter validation. I would send it to peer review because the data collection and the problem statement are substantial enough to merit referee time, though the authors should expect questions on the quantitative accuracy of the 4D logs.

Referee Report

2 major / 2 minor

Summary. The paper claims to introduce Dash2Sim, a framework that converts in-the-wild monocular dashcam videos into metric, geo-referenced 4D driving logs compatible with simulators by verifying against independently maintained maps without annotations. It creates the ROADWork4D dataset spanning 4,244 scenes with 2.7M 3D objects across 17 cities, and on the ROADWork4D-CL subset, demonstrates that all tested privileged closed-loop planners fail to execute required lane changes in work zone scenarios. It also reports that dense depth from Dash2Sim improves novel-view synthesis by up to 19%.

Significance. If the reconstructions are indeed metric-accurate and faithful, this work would be significant for enabling large-scale, diverse driving simulations from readily available dashcam data, particularly for rare events like work zones, and could provide a valuable benchmark for planner evaluation in closed-loop settings.

major comments (2)

[Abstract and method description] Abstract and method description: The central claim that monocular video yields metric-accurate 4D logs via map verification lacks any reported quantitative validation such as scale-error statistics, absolute trajectory error against ground truth (e.g., RTK or LiDAR), or consistency checks across the 17 cities. This is load-bearing for attributing planner failures to planner limitations rather than reconstruction errors.
[Experiments on ROADWork4D-CL] Experiments on ROADWork4D-CL: The evaluation of planners shows failures in lane changes, but without details on how the 4D logs are integrated into the simulator or any error analysis of the reconstruction, it is unclear if the results are due to the method or data quality.

minor comments (2)

[Abstract] Typo: 'eachone' should be 'each one'.
[Abstract] The abstract states high-level findings but supplies no equations, implementation details, or quantitative planner results, making it difficult to assess the claims.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback on validation and experimental clarity. We respond to each major comment below and indicate where revisions will be made.

read point-by-point responses

Referee: [Abstract and method description] Abstract and method description: The central claim that monocular video yields metric-accurate 4D logs via map verification lacks any reported quantitative validation such as scale-error statistics, absolute trajectory error against ground truth (e.g., RTK or LiDAR), or consistency checks across the 17 cities. This is load-bearing for attributing planner failures to planner limitations rather than reconstruction errors.

Authors: Our method establishes metric scale and geo-referencing through explicit verification of each reconstructed scene against independently maintained maps, without requiring additional sensors or annotations. We do not report scale-error statistics or absolute trajectory error against RTK/LiDAR because no such ground-truth data exists for these in-the-wild monocular dashcam videos. We will revise the method section to describe the map-verification procedure in greater detail, including the specific consistency criteria applied across the 17 cities and any indirect checks performed. revision: partial
Referee: [Experiments on ROADWork4D-CL] Experiments on ROADWork4D-CL: The evaluation of planners shows failures in lane changes, but without details on how the 4D logs are integrated into the simulator or any error analysis of the reconstruction, it is unclear if the results are due to the method or data quality.

Authors: We agree that additional details are warranted. In the revised manuscript we will add an explicit description of how the metric 4D logs are imported and used within the closed-loop simulator, including coordinate-frame alignment and dynamic object handling. For reconstruction error analysis we note the inherent limitation that direct ground-truth comparisons are unavailable; however, the map-verification step serves as the primary fidelity check. We will also include further qualitative examples of reconstruction quality in the supplement. revision: yes

standing simulated objections not resolved

Direct quantitative metrics such as scale-error statistics or absolute trajectory error against RTK/LiDAR ground truth cannot be reported, as no such high-precision ground truth is available for the in-the-wild dashcam videos used to construct ROADWork4D.

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper presents a technical framework (Dash2Sim) for converting monocular dashcam videos into metric, geo-referenced 4D logs that are then verified against independently maintained external maps. No equations, derivations, or self-referential definitions appear in the abstract or described method. The central claims rest on producing outputs that can be checked against external data sources rather than on internal fits, parameter renamings, or self-citation chains that reduce the result to its inputs by construction. This matches the default expectation for most papers and the reader's explicit assessment of score 1.0 with no reducing steps. Concerns about quantitative validation of scale accuracy belong to correctness or evidence strength, not circularity per the analysis rules.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the framework is presented at a high level without implementation details.

pith-pipeline@v0.9.1-grok · 5812 in / 1240 out tokens · 24310 ms · 2026-06-27T22:13:36.583695+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

111 extracted references · 14 linked inside Pith

[1]

Geiger, P

A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? the KITTI vision benchmark suite. InCVPR, 2012

2012
[2]

Caesar, V

H. Caesar, V . Bankiti, A. H. Lang, S. V ora, V . E. Liong, Q. Xu, A. Krishnan, Y . Pan, G. Baldan, and O. Beijbom. nuScenes: A multimodal dataset for autonomous driving. InCVPR, 2020

2020
[3]

Chang, J

M.-F. Chang, J. Lambert, P. Sangkloy, J. Singh, S. Bak, A. Hartnett, D. Wang, P. Carr, S. Lucey, D. Ramanan, and J. Hays. Argoverse: 3D tracking and forecasting with rich maps. InCVPR, 2019

2019
[4]

Wilson, W

B. Wilson, W. Qi, T. Agarwal, J. Lambert, J. Singh, S. Khandelwal, B. Pan, R. Kumar, A. Hartnett, J. K. Pontes, D. Ramanan, P. Carr, and J. Hays. Argoverse 2: Next generation datasets for self-driving perception and forecasting. InNeurIPS Datasets and Benchmarks Track, 2021

2021
[5]

Ettinger, S

S. Ettinger, S. Cheng, B. Caine, C. Liu, H. Zhao, S. Pradhan, Y . Chai, B. Sapp, C. R. Qi, Y . Zhou, Z. Yang, A. Chouard, P. Sun, J. Ngiam, V . Vasudevan, A. McCauley, J. Shlens, and D. Anguelov. Large scale interactive motion forecasting for autonomous driving: The Waymo open motion dataset. InICCV, 2021

2021
[6]

R. Xu, H. Lin, W. Jeon, H. Feng, Y . Zou, L. Sun, J. Gorman, E. Tolstaya, S. Tang, B. White, B. Sapp, M. Tan, J.-J. Hwang, and D. Anguelov. WOD-E2E: Waymo open dataset for end-to-end driving in challenging long-tail scenarios.arXiv preprint arXiv:2510.26125, 2025

arXiv 2025
[7]

Dauner, M

D. Dauner, M. Hallgarten, T. Li, X. Weng, Z. Huang, Z. Yang, H. Li, I. Gilitschenski, B. Ivanovic, M. Pavone, A. Geiger, and K. Chitta. NA VSIM: Data-driven non-reactive autonomous vehicle simulation and benchmarking. InNeurIPS Datasets and Benchmarks Track, 2024

2024
[8]

W. Cao, M. Hallgarten, T. Li, D. Dauner, X. Gu, C. Wang, Y . Miron, M. Aiello, H. Li, I. Gilitschenski, B. Ivanovic, M. Pavone, A. Geiger, and K. Chitta. Pseudo-simulation for autonomous driving. InCoRL, 2025

2025
[9]

Caesar, J

H. Caesar, J. Kabzan, K. S. Tan, W. K. Fong, E. M. Wolff, A. H. Lang, L. Fletcher, O. Beijbom, and S. Omari. nuPlan: A closed-loop ML-based planning benchmark for autonomous vehicles. arXiv preprint arXiv:2106.11810, 2021

Pith/arXiv arXiv 2021
[10]

Dosovitskiy, G

A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V . Koltun. CARLA: An open urban driving simulator. InCoRL, 2017

2017
[11]

X. Jia, Z. Yang, Q. Li, Z. Zhang, and J. Yan. Bench2Drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving. InNeurIPS Datasets and Benchmarks Track, 2024

2024
[12]

Gerstenecker, A

S. Gerstenecker, A. Geiger, and K. Renz. Fail2Drive: Benchmarking closed-loop driving generalization.arXiv preprint arXiv:2604.08535, 2026

Pith/arXiv arXiv 2026
[13]

Hallgarten, J

M. Hallgarten, J. Zapata, M. Stoll, K. Renz, and A. Zell. Can vehicle motion planning generalize to realistic long-tail scenarios? InIROS, 2024

2024
[14]

30 https://www.autoinsurance.com/research/ dash-cam-usage-report/, 2026

autoinsurance. 30 https://www.autoinsurance.com/research/ dash-cam-usage-report/, 2026. Accessed: 2026-04-29

2026
[15]

Allshire, H

A. Allshire, H. Choi, J. Zhang, D. McAllister, A. Zhang, C. M. Kim, T. Darrell, P. Abbeel, J. Malik, and A. Kanazawa. Visual imitation enables contextual humanoid control. InCoRL, 2025

2025
[16]

Hartley and A

R. Hartley and A. Zisserman.Multiple View Geometry in Computer Vision. 2003. 29

2003
[17]

Ghosh, S

A. Ghosh, S. Zheng, R. Tamburo, K. Vuong, J. Alvarez-Padilla, H. Zhu, M. Cardei, N. Dunn, C. Mertz, and S. G. Narasimhan. Roadwork: A dataset and benchmark for learning to recognize, observe, analyze and drive through work zones. InICCV, 2025

2025
[18]

Zendel, K

O. Zendel, K. Honauer, M. Murschitz, D. Steininger, and G. F. Dominguez. Wilddash-creating hazard-aware benchmarks. InECCV, 2018

2018
[19]

Zendel, M

O. Zendel, M. Sch ¨orghuber, B. Rainer, M. Murschitz, and C. Beleznai. Unifying panoptic segmentation for autonomous driving. InCVPR, 2022

2022
[20]

S. K. Bashetty, H. B. Amor, and G. Fainekos. Deepcrashtest: Turning dashcam videos into virtual crash tests for automated driving systems. InICRA, 2020

2020
[21]

Y . Miao, G. Fainekos, B. Hoxha, H. Okamoto, D. Prokhorov, and S. Mitra. From dashcam videos to driving simulations: Stress testing automated vehicles against rare events. InAAAI Workshops, 2025

2025
[22]

J. Bote. Cruise vehicle gets stuck in wet concrete while driving in San Francisco. https:// www.sfgate.com/tech/article/cruise-stuck-wet-concrete-sf-18297946.php , 2023

2023
[23]

Waymo halts freeway rides after robotaxis strug- gle in construction zones

TechCrunch. Waymo halts freeway rides after robotaxis strug- gle in construction zones. https://techcrunch.com/2026/05/21/ waymo-halts-freeway-rides-after-robotaxis-struggle-in-construction-zones/ , 2026

2026
[24]

Cheng, Y

J. Cheng, Y . Chen, and Q. Chen. Pluto: Pushing the limit of imitation learning-based planning for autonomous driving.arXiv preprint arXiv:2404.14327, 2024

arXiv 2024
[25]

Dauner, M

D. Dauner, M. Hallgarten, A. Geiger, and K. Chitta. Parting with misconceptions about learning-based vehicle motion planning. InCoRL, 2023

2023
[26]

O’Neill, A

A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. InICRA, 2024

2024
[27]

Cheang, G

C.-L. Cheang, G. Chen, Y . Jing, T. Kong, H. Li, Y . Li, Y . Liu, H. Wu, J. Xu, Y . Yang, et al. Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation. arXiv preprint arXiv:2410.06158, 2024

Pith/arXiv arXiv 2024
[28]

S. Y . Gadre, G. Ilharco, A. Fang, J. Hayase, G. Smyrnis, T. Nguyen, R. Marten, M. Wortsman, D. Ghosh, J. Zhang, et al. Datacomp: In search of the next generation of multimodal datasets. NeurIPS, 2023

2023
[29]

Hwang, R

J.-J. Hwang, R. Xu, H. Lin, W.-C. Hung, J. Ji, K. Choi, D. Huang, T. He, P. Covington, B. Sapp, et al. Emma: End-to-end multimodal model for autonomous driving.arXiv preprint arXiv:2410.23262, 2024

Pith/arXiv arXiv 2024
[30]

Wagner, O

R. Wagner, O. S. Tas, J. Villa, F. Hauser, Y . Shen, M. Steiner, D. Strutz, C. Fernandez, C. Kinzig, G. S. Guitierrez-Cabello, et al. Longtail driving scenarios with reasoning traces: The kitscenes longtail dataset.arXiv preprint arXiv:2603.23607, 2026

Pith/arXiv arXiv 2026
[32]

Agarwal, A

N. Agarwal, A. Ali, M. Bala, Y . Balaji, et al. Cosmos world foundation model platform for physical AI.arXiv preprint arXiv:2501.03575, 2025. 30

Pith/arXiv arXiv 2025
[33]

X. Ren, Y . Lu, T. Cao, R. Gao, S. Huang, A. Sabour, T. Shen, T. Pfaff, J. Z. Wu, R. Chen, et al. Cosmos-drive-dreams: Scalable synthetic driving data generation with world foundation models.arXiv preprint arXiv:2506.09042, 2025

arXiv 2025
[34]

X. Ren, T. Shen, J. Huang, H. Ling, Y . Lu, M. Nimier-David, T. M¨uller, A. Keller, S. Fidler, and J. Gao. Gen3c: 3d-informed world-consistent video generation with precise camera control. InCVPR, 2025

2025
[35]

J. Wang, B. Sun, Y . Bai, V . Casser, S. Peng, Z. Zhu, M.-L. Shih, X. Masotto, S.-Y . Su, K. V . Parvate, et al. Sensor2sensor: Cross-embodiment sensor conversion for autonomous driving. arXiv preprint arXiv:2605.22809, 2026

Pith/arXiv arXiv 2026
[36]

H. Zhou, L. Lin, J. Wang, Y . Lu, D. Bai, B. Liu, Y . Wang, A. Geiger, and Y . Liao. Hugsim: A real-time, photo-realistic and closed-loop simulator for autonomous driving.TPAMI, 2025

2025
[37]

Z. Chen, J. Yang, J. Huang, R. De Lutio, J. Martinez Esturo, B. Ivanovic, O. Litany, Z. Gojcic, S. Fidler, M. Pavone, et al. Omnire: Omni urban scene reconstruction. InICLR, 2025

2025
[38]

Chitta, D

K. Chitta, D. Dauner, and A. Geiger. Sledge: Synthesizing driving environments with generative models and rule-based traffic. InECCV, 2024

2024
[39]

Z. Li, K. Li, S. Wang, S. Lan, Z. Yu, Y . Ji, Z. Li, Z. Zhu, J. Kautz, Z. Wu, et al. Hydra- mdp: End-to-end multimodal planning with multi-target hydra-distillation.arXiv preprint arXiv:2406.06978, 2024

Pith/arXiv arXiv 2024
[40]

Zheng, R

Y . Zheng, R. Liang, K. Zheng, J. Zheng, L. Mao, J. Li, W. Gu, R. Ai, S. Li, X. Zhan, et al. Diffusion-based planning for autonomous driving with flexible guidance. InICLR, 2025

2025
[41]

Ghosh, S

A. Ghosh, S. Narasimhan, M. Chandraker, and F. Pittaluga. Rad-lad: Rule and language grounded autonomous driving in real-time.arXiv preprint arXiv:2603.28522, 2026

arXiv 2026
[42]

C. Sima, K. Renz, K. Chitta, L. Chen, H. Zhang, C. Xie, J. Beißwenger, P. Luo, A. Geiger, and H. Li. Drivelm: Driving with graph visual question answering. InECCV, 2024

2024
[43]

X. Zhou, X. Han, F. Yang, Y . Ma, V . Tresp, and A. Knoll. Opendrivevla: Towards end-to-end autonomous driving with large vision language action model. InAAAI, 2026

2026
[44]

Z. Xu, Y . Bai, Y . Zhang, Z. Li, F. Xia, K.-Y . K. Wong, J. Wang, and H. Zhao. Drivegpt4-v2: Harnessing large language model capabilities for enhanced closed-loop autonomous driving. InCVPR, 2025

2025
[45]

B. Liao, S. Chen, H. Yin, B. Jiang, C. Wang, S. Yan, X. Zhang, X. Li, Y . Zhang, Q. Zhang, et al. Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving. InCVPR, 2025

2025
[46]

J. Lu, J. Guan, Z. Huang, J. Li, G. Li, L. Kong, Y . Li, H. Wang, S. Xu, Y . Luo, et al. Onevl: One-step latent reasoning and planning with vision-language explanation.arXiv preprint arXiv:2604.18486, 2026

Pith/arXiv arXiv 2026
[47]

Carion, L

N. Carion, L. Gustafson, Y .-T. Hu, S. Debnath, R. Hu, D. Suris, C. Ryali, K. V . Alwala, H. Khedr, A. Huang, et al. Sam 3: Segment anything with concepts.arXiv preprint arXiv:2511.16719, 2025

Pith/arXiv arXiv 2025
[48]

Huang, J

W. Huang, J. Zhang, S. Li, T. Jia, J. Duan, Y . Cheng, J. Cho, M. Wallingford, R. Soraki, C. D. Kim, et al. Wilddet3d: Scaling promptable 3d detection in the wild.arXiv preprint arXiv:2604.08626, 2026

Pith/arXiv arXiv 2026
[49]

A. R. Zamir, T. Wekel, P. Agrawal, C. Wei, J. Malik, and S. Savarese. Generic 3d representation via pose estimation and matching. InECCV, 2016. 31

2016
[50]

Vuong, R

K. Vuong, R. Tamburo, and S. G. Narasimhan. Toward planet-wide traffic camera calibration. InWACV, 2024

2024
[51]

Vuong, A

K. Vuong, A. Ghosh, D. Ramanan, S. Narasimhan, and S. Tulsiani. Aerialmegadepth: Learning aerial-ground reconstruction and view synthesis. InCVPR, 2025

2025
[52]

Berton and C

G. Berton and C. Masone. Megaloc: One retrieval to place them all. InCVPR Workshops, 2025

2025
[53]

OpenStreetMap

OpenStreetMap contributors. OpenStreetMap. https://www.openstreetmap.org, 2017. Data retrieved from https://planet.openstreetmap.org

2017
[54]

Sriram, B

N. Sriram, B. Liu, F. Pittaluga, and M. Chandraker. Smart: Simultaneous multi-agent recurrent trajectory prediction. InECCV, 2020

2020
[55]

P. Cai, Y . Lee, Y . Luo, and D. Hsu. Summit: A simulator for urban driving in massive mixed traffic. InICRA, 2020

2020
[56]

Eskenazi

J. Eskenazi. Waymo rolls toward San Francisco Airport. a showdown is brewing. https://missionlocal.org/2024/12/ waymo-rolls-toward-san-francisco-airport-showdown-brewing/, 2024

2024
[57]

Sural, N

S. Sural, N. Sahu, and R. Rajkumar. Workzone3d: A multimodal dataset for 3d work zone perception in autonomous driving. InWACV, 2026

2026
[58]

Cheng, Y

J. Cheng, Y . Chen, X. Mei, B. Yang, B. Li, and M. Liu. Rethinking imitation-based planner for autonomous driving. InICRA, 2024

2024
[59]

Treiber, A

M. Treiber, A. Hennecke, and D. Helbing. Congested traffic states in empirical observations and microscopic simulations.Physical review E, 2000

2000
[60]

L. Pan, D. Bar´ath, M. Pollefeys, and J. L. Sch¨onberger. Global structure-from-motion revisited. InECCV, 2024

2024
[61]

L. Pan, J. L. Sch¨onberger, and M. Pollefeys. Global structure-from-motion meets feedforward reconstruction. InCVPR, 2026

2026
[62]

Hagedorn, L

S. Hagedorn, L. Donkov, A. Distelzweig, and A. P. Condurache. When planners meet reality: How learned, reactive traffic agents shift nuplan benchmarks.arXiv preprint arXiv:2510.14677, 2025

arXiv 2025
[63]

R. Luo, H. Yang, M. Watson, A. Sharma, S. Veer, E. Schmerling, and M. Pavone. Sim2val: Leveraging correlation across test platforms for variance-reduced metric estimation.arXiv preprint arXiv:2506.20553, 2025

arXiv 2025
[64]

Pinto and A

L. Pinto and A. Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. InICRA, 2016

2016
[65]

Levine, P

S. Levine, P. Pastor, A. Krizhevsky, J. Ibarz, and D. Quillen. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection.IJRR, 2018

2018
[66]

L. Chen, K. Lu, A. Rajeswaran, K. Lee, A. Grover, M. Laskin, P. Abbeel, A. Srinivas, and I. Mordatch. Decision transformer: Reinforcement learning via sequence modeling.NeurIPS, 2021

2021
[67]

Emmons, B

S. Emmons, B. Eysenbach, I. Kostrikov, and S. Levine. Rvs: What is essential for offline rl via supervised learning?arXiv preprint arXiv:2112.10751, 2021

arXiv 2021
[68]

G. L. Smith, S. F. Schmidt, and L. A. McGee.Application of statistical filter theory to the optimal estimation of position and velocity on board a circumlunar vehicle. 1962. 32

1962
[69]

J. Wang, M. Chen, N. Karaev, A. Vedaldi, C. Rupprecht, and D. Novotny. Vggt: Visual geometry grounded transformer. InCVPR, 2025

2025
[70]

N. V . Keetha, N. M¨uller, J. Sch¨onberger, L. Porzi, Y . Zhang, T. Fischer, A. Knapitsch, D. Zauss, E. Weber, N. Antunes, et al. Mapanything: Universal feed-forward metric 3d reconstruction. In3DV, 2025

2025
[71]

A. Hu, L. Russell, H. Yeo, Z. Murez, G. Fedoseev, A. Kendall, J. Shotton, and G. Corrado. Gaia-1: A generative world model for autonomous driving.arXiv preprint arXiv:2309.17080, 2023

Pith/arXiv arXiv 2023
[72]

S. Gao, J. Yang, L. Chen, K. Chitta, Y . Qiu, A. Geiger, J. Zhang, and H. Li. Vista: A generalizable driving world model with high fidelity and versatile controllability.NeurIPS, 2024

2024
[73]

X. Wang, Z. Zhu, G. Huang, X. Chen, J. Zhu, and J. Lu. Drivedreamer: Towards real-world- drive world models for autonomous driving. InECCV, 2024

2024
[74]

Zhang, P

R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. InCVPR, 2018

2018
[75]

Ghildyal and F

A. Ghildyal and F. Liu. Shift-tolerant perceptual similarity metric. InECCV, 2022

2022
[76]

Heusel, H

M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium.NeurIPS, 2017

2017
[77]

Kynk¨a¨anniemi, T

T. Kynk¨a¨anniemi, T. Karras, M. Aittala, T. Aila, and J. Lehtinen. The role of imagenet classes in fr\’echet inception distance.arXiv preprint arXiv:2203.06026, 2022

arXiv 2022
[78]

Tumanyan, O

N. Tumanyan, O. Bar-Tal, S. Bagon, and T. Dekel. Splicing vit features for semantic appearance transfer. InCVPR, 2022

2022
[79]

S. Fu, N. Tamir, S. Sundaram, L. Chai, R. Zhang, T. Dekel, and P. Isola. Dreamsim: Learning new dimensions of human visual similarity using synthetic data.arXiv preprint arXiv:2306.09344, 2023

Pith/arXiv arXiv 2023
[80]

Wickrema, S

C. Wickrema, S. Leary, S. Sarkar, M. Giglio, E. Bianchi, E. Mace, and M. Twardowski. Benchmarking image similarity metrics for novel view synthesis applications.arXiv preprint arXiv:2506.12563, 2025

arXiv 2025
[81]

M. Khan, H. Fazlali, D. Sharma, T. Cao, D. Bai, Y . Ren, and B. Liu. Autosplat: Constrained gaussian splatting for autonomous driving scene reconstruction. InICRA, 2025

2025

Showing first 80 references.

[1] [1]

Geiger, P

A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? the KITTI vision benchmark suite. InCVPR, 2012

2012

[2] [2]

Caesar, V

H. Caesar, V . Bankiti, A. H. Lang, S. V ora, V . E. Liong, Q. Xu, A. Krishnan, Y . Pan, G. Baldan, and O. Beijbom. nuScenes: A multimodal dataset for autonomous driving. InCVPR, 2020

2020

[3] [3]

Chang, J

M.-F. Chang, J. Lambert, P. Sangkloy, J. Singh, S. Bak, A. Hartnett, D. Wang, P. Carr, S. Lucey, D. Ramanan, and J. Hays. Argoverse: 3D tracking and forecasting with rich maps. InCVPR, 2019

2019

[4] [4]

Wilson, W

B. Wilson, W. Qi, T. Agarwal, J. Lambert, J. Singh, S. Khandelwal, B. Pan, R. Kumar, A. Hartnett, J. K. Pontes, D. Ramanan, P. Carr, and J. Hays. Argoverse 2: Next generation datasets for self-driving perception and forecasting. InNeurIPS Datasets and Benchmarks Track, 2021

2021

[5] [5]

Ettinger, S

S. Ettinger, S. Cheng, B. Caine, C. Liu, H. Zhao, S. Pradhan, Y . Chai, B. Sapp, C. R. Qi, Y . Zhou, Z. Yang, A. Chouard, P. Sun, J. Ngiam, V . Vasudevan, A. McCauley, J. Shlens, and D. Anguelov. Large scale interactive motion forecasting for autonomous driving: The Waymo open motion dataset. InICCV, 2021

2021

[6] [6]

R. Xu, H. Lin, W. Jeon, H. Feng, Y . Zou, L. Sun, J. Gorman, E. Tolstaya, S. Tang, B. White, B. Sapp, M. Tan, J.-J. Hwang, and D. Anguelov. WOD-E2E: Waymo open dataset for end-to-end driving in challenging long-tail scenarios.arXiv preprint arXiv:2510.26125, 2025

arXiv 2025

[7] [7]

Dauner, M

D. Dauner, M. Hallgarten, T. Li, X. Weng, Z. Huang, Z. Yang, H. Li, I. Gilitschenski, B. Ivanovic, M. Pavone, A. Geiger, and K. Chitta. NA VSIM: Data-driven non-reactive autonomous vehicle simulation and benchmarking. InNeurIPS Datasets and Benchmarks Track, 2024

2024

[8] [8]

W. Cao, M. Hallgarten, T. Li, D. Dauner, X. Gu, C. Wang, Y . Miron, M. Aiello, H. Li, I. Gilitschenski, B. Ivanovic, M. Pavone, A. Geiger, and K. Chitta. Pseudo-simulation for autonomous driving. InCoRL, 2025

2025

[9] [9]

Caesar, J

H. Caesar, J. Kabzan, K. S. Tan, W. K. Fong, E. M. Wolff, A. H. Lang, L. Fletcher, O. Beijbom, and S. Omari. nuPlan: A closed-loop ML-based planning benchmark for autonomous vehicles. arXiv preprint arXiv:2106.11810, 2021

Pith/arXiv arXiv 2021

[10] [10]

Dosovitskiy, G

A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V . Koltun. CARLA: An open urban driving simulator. InCoRL, 2017

2017

[11] [11]

X. Jia, Z. Yang, Q. Li, Z. Zhang, and J. Yan. Bench2Drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving. InNeurIPS Datasets and Benchmarks Track, 2024

2024

[12] [12]

Gerstenecker, A

S. Gerstenecker, A. Geiger, and K. Renz. Fail2Drive: Benchmarking closed-loop driving generalization.arXiv preprint arXiv:2604.08535, 2026

Pith/arXiv arXiv 2026

[13] [13]

Hallgarten, J

M. Hallgarten, J. Zapata, M. Stoll, K. Renz, and A. Zell. Can vehicle motion planning generalize to realistic long-tail scenarios? InIROS, 2024

2024

[14] [14]

30 https://www.autoinsurance.com/research/ dash-cam-usage-report/, 2026

autoinsurance. 30 https://www.autoinsurance.com/research/ dash-cam-usage-report/, 2026. Accessed: 2026-04-29

2026

[15] [15]

Allshire, H

A. Allshire, H. Choi, J. Zhang, D. McAllister, A. Zhang, C. M. Kim, T. Darrell, P. Abbeel, J. Malik, and A. Kanazawa. Visual imitation enables contextual humanoid control. InCoRL, 2025

2025

[16] [16]

Hartley and A

R. Hartley and A. Zisserman.Multiple View Geometry in Computer Vision. 2003. 29

2003

[17] [17]

Ghosh, S

A. Ghosh, S. Zheng, R. Tamburo, K. Vuong, J. Alvarez-Padilla, H. Zhu, M. Cardei, N. Dunn, C. Mertz, and S. G. Narasimhan. Roadwork: A dataset and benchmark for learning to recognize, observe, analyze and drive through work zones. InICCV, 2025

2025

[18] [18]

Zendel, K

O. Zendel, K. Honauer, M. Murschitz, D. Steininger, and G. F. Dominguez. Wilddash-creating hazard-aware benchmarks. InECCV, 2018

2018

[19] [19]

Zendel, M

O. Zendel, M. Sch ¨orghuber, B. Rainer, M. Murschitz, and C. Beleznai. Unifying panoptic segmentation for autonomous driving. InCVPR, 2022

2022

[20] [20]

S. K. Bashetty, H. B. Amor, and G. Fainekos. Deepcrashtest: Turning dashcam videos into virtual crash tests for automated driving systems. InICRA, 2020

2020

[21] [21]

Y . Miao, G. Fainekos, B. Hoxha, H. Okamoto, D. Prokhorov, and S. Mitra. From dashcam videos to driving simulations: Stress testing automated vehicles against rare events. InAAAI Workshops, 2025

2025

[22] [22]

J. Bote. Cruise vehicle gets stuck in wet concrete while driving in San Francisco. https:// www.sfgate.com/tech/article/cruise-stuck-wet-concrete-sf-18297946.php , 2023

2023

[23] [23]

Waymo halts freeway rides after robotaxis strug- gle in construction zones

TechCrunch. Waymo halts freeway rides after robotaxis strug- gle in construction zones. https://techcrunch.com/2026/05/21/ waymo-halts-freeway-rides-after-robotaxis-struggle-in-construction-zones/ , 2026

2026

[24] [24]

Cheng, Y

J. Cheng, Y . Chen, and Q. Chen. Pluto: Pushing the limit of imitation learning-based planning for autonomous driving.arXiv preprint arXiv:2404.14327, 2024

arXiv 2024

[25] [25]

Dauner, M

D. Dauner, M. Hallgarten, A. Geiger, and K. Chitta. Parting with misconceptions about learning-based vehicle motion planning. InCoRL, 2023

2023

[26] [26]

O’Neill, A

A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. InICRA, 2024

2024

[27] [27]

Cheang, G

C.-L. Cheang, G. Chen, Y . Jing, T. Kong, H. Li, Y . Li, Y . Liu, H. Wu, J. Xu, Y . Yang, et al. Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation. arXiv preprint arXiv:2410.06158, 2024

Pith/arXiv arXiv 2024

[28] [28]

S. Y . Gadre, G. Ilharco, A. Fang, J. Hayase, G. Smyrnis, T. Nguyen, R. Marten, M. Wortsman, D. Ghosh, J. Zhang, et al. Datacomp: In search of the next generation of multimodal datasets. NeurIPS, 2023

2023

[29] [29]

Hwang, R

J.-J. Hwang, R. Xu, H. Lin, W.-C. Hung, J. Ji, K. Choi, D. Huang, T. He, P. Covington, B. Sapp, et al. Emma: End-to-end multimodal model for autonomous driving.arXiv preprint arXiv:2410.23262, 2024

Pith/arXiv arXiv 2024

[30] [30]

Wagner, O

R. Wagner, O. S. Tas, J. Villa, F. Hauser, Y . Shen, M. Steiner, D. Strutz, C. Fernandez, C. Kinzig, G. S. Guitierrez-Cabello, et al. Longtail driving scenarios with reasoning traces: The kitscenes longtail dataset.arXiv preprint arXiv:2603.23607, 2026

Pith/arXiv arXiv 2026

[31] [32]

Agarwal, A

N. Agarwal, A. Ali, M. Bala, Y . Balaji, et al. Cosmos world foundation model platform for physical AI.arXiv preprint arXiv:2501.03575, 2025. 30

Pith/arXiv arXiv 2025

[32] [33]

X. Ren, Y . Lu, T. Cao, R. Gao, S. Huang, A. Sabour, T. Shen, T. Pfaff, J. Z. Wu, R. Chen, et al. Cosmos-drive-dreams: Scalable synthetic driving data generation with world foundation models.arXiv preprint arXiv:2506.09042, 2025

arXiv 2025

[33] [34]

X. Ren, T. Shen, J. Huang, H. Ling, Y . Lu, M. Nimier-David, T. M¨uller, A. Keller, S. Fidler, and J. Gao. Gen3c: 3d-informed world-consistent video generation with precise camera control. InCVPR, 2025

2025

[34] [35]

J. Wang, B. Sun, Y . Bai, V . Casser, S. Peng, Z. Zhu, M.-L. Shih, X. Masotto, S.-Y . Su, K. V . Parvate, et al. Sensor2sensor: Cross-embodiment sensor conversion for autonomous driving. arXiv preprint arXiv:2605.22809, 2026

Pith/arXiv arXiv 2026

[35] [36]

H. Zhou, L. Lin, J. Wang, Y . Lu, D. Bai, B. Liu, Y . Wang, A. Geiger, and Y . Liao. Hugsim: A real-time, photo-realistic and closed-loop simulator for autonomous driving.TPAMI, 2025

2025

[36] [37]

Z. Chen, J. Yang, J. Huang, R. De Lutio, J. Martinez Esturo, B. Ivanovic, O. Litany, Z. Gojcic, S. Fidler, M. Pavone, et al. Omnire: Omni urban scene reconstruction. InICLR, 2025

2025

[37] [38]

Chitta, D

K. Chitta, D. Dauner, and A. Geiger. Sledge: Synthesizing driving environments with generative models and rule-based traffic. InECCV, 2024

2024

[38] [39]

Z. Li, K. Li, S. Wang, S. Lan, Z. Yu, Y . Ji, Z. Li, Z. Zhu, J. Kautz, Z. Wu, et al. Hydra- mdp: End-to-end multimodal planning with multi-target hydra-distillation.arXiv preprint arXiv:2406.06978, 2024

Pith/arXiv arXiv 2024

[39] [40]

Zheng, R

Y . Zheng, R. Liang, K. Zheng, J. Zheng, L. Mao, J. Li, W. Gu, R. Ai, S. Li, X. Zhan, et al. Diffusion-based planning for autonomous driving with flexible guidance. InICLR, 2025

2025

[40] [41]

Ghosh, S

A. Ghosh, S. Narasimhan, M. Chandraker, and F. Pittaluga. Rad-lad: Rule and language grounded autonomous driving in real-time.arXiv preprint arXiv:2603.28522, 2026

arXiv 2026

[41] [42]

C. Sima, K. Renz, K. Chitta, L. Chen, H. Zhang, C. Xie, J. Beißwenger, P. Luo, A. Geiger, and H. Li. Drivelm: Driving with graph visual question answering. InECCV, 2024

2024

[42] [43]

X. Zhou, X. Han, F. Yang, Y . Ma, V . Tresp, and A. Knoll. Opendrivevla: Towards end-to-end autonomous driving with large vision language action model. InAAAI, 2026

2026

[43] [44]

Z. Xu, Y . Bai, Y . Zhang, Z. Li, F. Xia, K.-Y . K. Wong, J. Wang, and H. Zhao. Drivegpt4-v2: Harnessing large language model capabilities for enhanced closed-loop autonomous driving. InCVPR, 2025

2025

[44] [45]

B. Liao, S. Chen, H. Yin, B. Jiang, C. Wang, S. Yan, X. Zhang, X. Li, Y . Zhang, Q. Zhang, et al. Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving. InCVPR, 2025

2025

[45] [46]

J. Lu, J. Guan, Z. Huang, J. Li, G. Li, L. Kong, Y . Li, H. Wang, S. Xu, Y . Luo, et al. Onevl: One-step latent reasoning and planning with vision-language explanation.arXiv preprint arXiv:2604.18486, 2026

Pith/arXiv arXiv 2026

[46] [47]

Carion, L

N. Carion, L. Gustafson, Y .-T. Hu, S. Debnath, R. Hu, D. Suris, C. Ryali, K. V . Alwala, H. Khedr, A. Huang, et al. Sam 3: Segment anything with concepts.arXiv preprint arXiv:2511.16719, 2025

Pith/arXiv arXiv 2025

[47] [48]

Huang, J

W. Huang, J. Zhang, S. Li, T. Jia, J. Duan, Y . Cheng, J. Cho, M. Wallingford, R. Soraki, C. D. Kim, et al. Wilddet3d: Scaling promptable 3d detection in the wild.arXiv preprint arXiv:2604.08626, 2026

Pith/arXiv arXiv 2026

[48] [49]

A. R. Zamir, T. Wekel, P. Agrawal, C. Wei, J. Malik, and S. Savarese. Generic 3d representation via pose estimation and matching. InECCV, 2016. 31

2016

[49] [50]

Vuong, R

K. Vuong, R. Tamburo, and S. G. Narasimhan. Toward planet-wide traffic camera calibration. InWACV, 2024

2024

[50] [51]

Vuong, A

K. Vuong, A. Ghosh, D. Ramanan, S. Narasimhan, and S. Tulsiani. Aerialmegadepth: Learning aerial-ground reconstruction and view synthesis. InCVPR, 2025

2025

[51] [52]

Berton and C

G. Berton and C. Masone. Megaloc: One retrieval to place them all. InCVPR Workshops, 2025

2025

[52] [53]

OpenStreetMap

OpenStreetMap contributors. OpenStreetMap. https://www.openstreetmap.org, 2017. Data retrieved from https://planet.openstreetmap.org

2017

[53] [54]

Sriram, B

N. Sriram, B. Liu, F. Pittaluga, and M. Chandraker. Smart: Simultaneous multi-agent recurrent trajectory prediction. InECCV, 2020

2020

[54] [55]

P. Cai, Y . Lee, Y . Luo, and D. Hsu. Summit: A simulator for urban driving in massive mixed traffic. InICRA, 2020

2020

[55] [56]

Eskenazi

J. Eskenazi. Waymo rolls toward San Francisco Airport. a showdown is brewing. https://missionlocal.org/2024/12/ waymo-rolls-toward-san-francisco-airport-showdown-brewing/, 2024

2024

[56] [57]

Sural, N

S. Sural, N. Sahu, and R. Rajkumar. Workzone3d: A multimodal dataset for 3d work zone perception in autonomous driving. InWACV, 2026

2026

[57] [58]

Cheng, Y

J. Cheng, Y . Chen, X. Mei, B. Yang, B. Li, and M. Liu. Rethinking imitation-based planner for autonomous driving. InICRA, 2024

2024

[58] [59]

Treiber, A

M. Treiber, A. Hennecke, and D. Helbing. Congested traffic states in empirical observations and microscopic simulations.Physical review E, 2000

2000

[59] [60]

L. Pan, D. Bar´ath, M. Pollefeys, and J. L. Sch¨onberger. Global structure-from-motion revisited. InECCV, 2024

2024

[60] [61]

L. Pan, J. L. Sch¨onberger, and M. Pollefeys. Global structure-from-motion meets feedforward reconstruction. InCVPR, 2026

2026

[61] [62]

Hagedorn, L

S. Hagedorn, L. Donkov, A. Distelzweig, and A. P. Condurache. When planners meet reality: How learned, reactive traffic agents shift nuplan benchmarks.arXiv preprint arXiv:2510.14677, 2025

arXiv 2025

[62] [63]

R. Luo, H. Yang, M. Watson, A. Sharma, S. Veer, E. Schmerling, and M. Pavone. Sim2val: Leveraging correlation across test platforms for variance-reduced metric estimation.arXiv preprint arXiv:2506.20553, 2025

arXiv 2025

[63] [64]

Pinto and A

L. Pinto and A. Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. InICRA, 2016

2016

[64] [65]

Levine, P

S. Levine, P. Pastor, A. Krizhevsky, J. Ibarz, and D. Quillen. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection.IJRR, 2018

2018

[65] [66]

L. Chen, K. Lu, A. Rajeswaran, K. Lee, A. Grover, M. Laskin, P. Abbeel, A. Srinivas, and I. Mordatch. Decision transformer: Reinforcement learning via sequence modeling.NeurIPS, 2021

2021

[66] [67]

Emmons, B

S. Emmons, B. Eysenbach, I. Kostrikov, and S. Levine. Rvs: What is essential for offline rl via supervised learning?arXiv preprint arXiv:2112.10751, 2021

arXiv 2021

[67] [68]

G. L. Smith, S. F. Schmidt, and L. A. McGee.Application of statistical filter theory to the optimal estimation of position and velocity on board a circumlunar vehicle. 1962. 32

1962

[68] [69]

J. Wang, M. Chen, N. Karaev, A. Vedaldi, C. Rupprecht, and D. Novotny. Vggt: Visual geometry grounded transformer. InCVPR, 2025

2025

[69] [70]

N. V . Keetha, N. M¨uller, J. Sch¨onberger, L. Porzi, Y . Zhang, T. Fischer, A. Knapitsch, D. Zauss, E. Weber, N. Antunes, et al. Mapanything: Universal feed-forward metric 3d reconstruction. In3DV, 2025

2025

[70] [71]

A. Hu, L. Russell, H. Yeo, Z. Murez, G. Fedoseev, A. Kendall, J. Shotton, and G. Corrado. Gaia-1: A generative world model for autonomous driving.arXiv preprint arXiv:2309.17080, 2023

Pith/arXiv arXiv 2023

[71] [72]

S. Gao, J. Yang, L. Chen, K. Chitta, Y . Qiu, A. Geiger, J. Zhang, and H. Li. Vista: A generalizable driving world model with high fidelity and versatile controllability.NeurIPS, 2024

2024

[72] [73]

X. Wang, Z. Zhu, G. Huang, X. Chen, J. Zhu, and J. Lu. Drivedreamer: Towards real-world- drive world models for autonomous driving. InECCV, 2024

2024

[73] [74]

Zhang, P

R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. InCVPR, 2018

2018

[74] [75]

Ghildyal and F

A. Ghildyal and F. Liu. Shift-tolerant perceptual similarity metric. InECCV, 2022

2022

[75] [76]

Heusel, H

M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium.NeurIPS, 2017

2017

[76] [77]

Kynk¨a¨anniemi, T

T. Kynk¨a¨anniemi, T. Karras, M. Aittala, T. Aila, and J. Lehtinen. The role of imagenet classes in fr\’echet inception distance.arXiv preprint arXiv:2203.06026, 2022

arXiv 2022

[77] [78]

Tumanyan, O

N. Tumanyan, O. Bar-Tal, S. Bagon, and T. Dekel. Splicing vit features for semantic appearance transfer. InCVPR, 2022

2022

[78] [79]

S. Fu, N. Tamir, S. Sundaram, L. Chai, R. Zhang, T. Dekel, and P. Isola. Dreamsim: Learning new dimensions of human visual similarity using synthetic data.arXiv preprint arXiv:2306.09344, 2023

Pith/arXiv arXiv 2023

[79] [80]

Wickrema, S

C. Wickrema, S. Leary, S. Sarkar, M. Giglio, E. Bianchi, E. Mace, and M. Twardowski. Benchmarking image similarity metrics for novel view synthesis applications.arXiv preprint arXiv:2506.12563, 2025

arXiv 2025

[80] [81]

M. Khan, H. Fazlali, D. Sharma, T. Cao, D. Bai, Y . Ren, and B. Liu. Autosplat: Constrained gaussian splatting for autonomous driving scene reconstruction. InICRA, 2025

2025