pith. sign in

arxiv: 2606.07366 · v1 · pith:XCWDP7FInew · submitted 2026-06-05 · 💻 cs.CV · cs.LG· cs.RO

Dash2Sim: Closed-Loop Driving Simulation from in-the-wild Dashcam Videos

Pith reviewed 2026-06-27 22:13 UTC · model grok-4.3

classification 💻 cs.CV cs.LGcs.RO
keywords dashcam videos4D scene reconstructionclosed-loop simulationwork zonesautonomous driving plannersnovel view synthesisdriving logs
0
0 comments X

The pith

Dash2Sim converts monocular dashcam videos into metric geo-referenced 4D driving logs verifiable against independent maps without annotations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a framework that extracts accurate 4D scenes from ordinary dashcam footage captured in the wild. It produces logs that plug directly into existing simulators and checks them against maintained maps. The resulting ROADWork4D dataset covers thousands of work-zone scenes across many cities. Tests on a verified subset show that current planners, even rule-based ones, cannot execute the lane changes these temporary zones demand. Depth estimates from the same process also raise the quality of novel-view synthesis.

Core claim

Dash2Sim turns in-the-wild monocular dashcam videos into metric, geo-referenced 4D driving logs compatible with existing simulators, and verifies each one against an independently maintained map without annotations. Applied to a large video corpus it produces the ROADWork4D benchmark of 4,244 scenes containing 2.7M 3D objects across 17 cities. On the verified closed-loop subset ROADWork4D-CL all tested planners fail to perform the lane changes required by temporary work-zone channels.

What carries the argument

Dash2Sim pipeline that recovers metric 4D scenes from monocular video and performs annotation-free map verification for simulator compatibility.

If this is right

  • Rule-based, hybrid, and learning-based planners all fail to execute required lane changes in verified work-zone scenarios.
  • Dense depth recovered by the method raises novel-view synthesis quality by up to 19 percent on perceptual metrics.
  • The benchmark supplies 2,201 closed-loop scenes that expose generalization gaps in current planning approaches.
  • Work-zone situations captured from real dashcams become usable for simulation without manual annotation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Dashcam video could become a scalable source for populating simulators with rare driving conditions beyond the 17 cities already processed.
  • The same reconstruction step might support sensor simulation pipelines that condition on recovered depth rather than synthetic assets.
  • Extending verification to other long-tailed events such as construction detours or weather changes would test the same monocular-to-metric premise.

Load-bearing premise

Monocular video alone supplies enough information to recover metric-accurate 4D scenes that stay faithful when used for closed-loop simulation and map verification.

What would settle it

Running the generated 4D logs through a simulator and finding that the resulting vehicle paths or object placements deviate measurably from the original dashcam trajectories or the independent map data.

Figures

Figures reproduced from arXiv: 2606.07366 by Angela Chen, Anurag Ghosh, Francesco Pittaluga, Juan Alvarez-Padilla, Khiem Vuong, Manmohan Chandraker, Srinivasa Narasimhan.

Figure 1
Figure 1. Figure 1: Top row: Dash2Sim takes an in-the-wild monocular dashcam video (left) and recovers a geo￾referenced 3D reconstruction at metric scale (center). Long-tailed objects such as cones, barrels, and barricades are detected, tracked, and lifted to 3D (right). Middle row: The resulting 4D driving log supports closed-loop simulation with reactive agents. Bottom row: Applying Dash2Sim to a large dashcam dataset produ… view at source ↗
Figure 2
Figure 2. Figure 2: Dense, geo-referenced point clouds recovered by Dash2Sim. ROADWork4D scenes after recon￾struction. Roads, lane markings, sidewalks, vegetation, and small long-tailed work-zone objects (cones, tubular markers, signs) are reconstructed at metric scale from in-the-wild monocular videos. Best viewed zoomed in. Supplemental material, website and walkthrough video contain additional media and visualizations. det… view at source ↗
Figure 3
Figure 3. Figure 3: Driving logs from ROADWork4D recovered by Dash2Sim. Our logs include long-tailed work zone scenarios with rare objects, layouts and complex lane change behaviors such as crossing the double yellow line in a work zone (top row). We believe our Dash2Sim framework and ROADWork4D dataset would spur research in a variety of self-driving tasks, including closed loop planning and sensor simulation. Supplemental m… view at source ↗
Figure 4
Figure 4. Figure 4: Recovered ego trajectories. Red: coarse GPS tracks from the dashcam video. Blue: metric 6-DoF poses recovered by Dash2Sim. Coarse GPS drifts into adjacent blocks, while recovered poses align with the driven lanes. We validate Dash2Sim components against ground truth or downstream proxies. For exam￾ple, Street View anchoring recovers metric ego trajectories with median translation error within 10 cm and rot… view at source ↗
Figure 5
Figure 5. Figure 5: Planner Error Analysis. PlanTF [58] shows low progress, Diffusion Planner [40] is aggressive and has a high crash rate, and Pluto [24] sacrifices comfort for progress. PDM-Closed [25], while being safe, is very conservative. Pluto Hybrid [24] balances all metrics. Results and Insights. Our results show that a counterin￾tuitive finding from prior work [25], shown on regular driv￾ing scenarios, extends even … view at source ↗
Figure 6
Figure 6. Figure 6: Novel View Synthesis of ROADWork4D Scenes with Our Depth Supervision. Dense depth from Dash2Sim improves novel-view synthesis quality, especially for small long-tailed objects such as vertical panels and temporary traffic control signs that contain textual details relevant for navigating work zones [17]. improves perceptual details of both common and long-tailed objects, including legibility of temporary s… view at source ↗
read the original abstract

Self-driving simulations typically rely on data collected in a small number of cities or on hand-authored synthetic scenarios. Dashcam videos cover a far broader range of locations and situations, including rare or long-tailed scenarios. They are considered less usable for simulation because it is difficult to recover accurate 4D scenes from monocular in-the-wild videos. Work zones are one such class of long-tailed situations that dashcams capture. We present Dash2Sim, a framework that turns in-the-wild monocular dashcam videos into metric, geo-referenced 4D driving logs compatible with existing simulators, and verifies eachone against an independently maintained map without annotations. We apply Dash2Sim to a large video corpus to create the ROADWork4D benchmark dataset, which spans 4,244 scenes with 2.7M 3D objects across 17 cities. On a verified subset ROADWork4D-CL (2,201 scenes), we study privileged closed-loop planners and find that work zone scenarios are difficult: while rule-based and hybrid planners generalize better than learning-based ones, all fall short, failing to make the lane changes that temporary work zone channels require. Beyond planning, dense depth recovered by Dash2Sim improves novel-view synthesis quality by up to 19% on perceptual metrics, suggesting its potential to provide rich conditioning for closed-loop sensor simulation from monocular videos.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims to introduce Dash2Sim, a framework that converts in-the-wild monocular dashcam videos into metric, geo-referenced 4D driving logs compatible with simulators by verifying against independently maintained maps without annotations. It creates the ROADWork4D dataset spanning 4,244 scenes with 2.7M 3D objects across 17 cities, and on the ROADWork4D-CL subset, demonstrates that all tested privileged closed-loop planners fail to execute required lane changes in work zone scenarios. It also reports that dense depth from Dash2Sim improves novel-view synthesis by up to 19%.

Significance. If the reconstructions are indeed metric-accurate and faithful, this work would be significant for enabling large-scale, diverse driving simulations from readily available dashcam data, particularly for rare events like work zones, and could provide a valuable benchmark for planner evaluation in closed-loop settings.

major comments (2)
  1. [Abstract and method description] Abstract and method description: The central claim that monocular video yields metric-accurate 4D logs via map verification lacks any reported quantitative validation such as scale-error statistics, absolute trajectory error against ground truth (e.g., RTK or LiDAR), or consistency checks across the 17 cities. This is load-bearing for attributing planner failures to planner limitations rather than reconstruction errors.
  2. [Experiments on ROADWork4D-CL] Experiments on ROADWork4D-CL: The evaluation of planners shows failures in lane changes, but without details on how the 4D logs are integrated into the simulator or any error analysis of the reconstruction, it is unclear if the results are due to the method or data quality.
minor comments (2)
  1. [Abstract] Typo: 'eachone' should be 'each one'.
  2. [Abstract] The abstract states high-level findings but supplies no equations, implementation details, or quantitative planner results, making it difficult to assess the claims.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback on validation and experimental clarity. We respond to each major comment below and indicate where revisions will be made.

read point-by-point responses
  1. Referee: [Abstract and method description] Abstract and method description: The central claim that monocular video yields metric-accurate 4D logs via map verification lacks any reported quantitative validation such as scale-error statistics, absolute trajectory error against ground truth (e.g., RTK or LiDAR), or consistency checks across the 17 cities. This is load-bearing for attributing planner failures to planner limitations rather than reconstruction errors.

    Authors: Our method establishes metric scale and geo-referencing through explicit verification of each reconstructed scene against independently maintained maps, without requiring additional sensors or annotations. We do not report scale-error statistics or absolute trajectory error against RTK/LiDAR because no such ground-truth data exists for these in-the-wild monocular dashcam videos. We will revise the method section to describe the map-verification procedure in greater detail, including the specific consistency criteria applied across the 17 cities and any indirect checks performed. revision: partial

  2. Referee: [Experiments on ROADWork4D-CL] Experiments on ROADWork4D-CL: The evaluation of planners shows failures in lane changes, but without details on how the 4D logs are integrated into the simulator or any error analysis of the reconstruction, it is unclear if the results are due to the method or data quality.

    Authors: We agree that additional details are warranted. In the revised manuscript we will add an explicit description of how the metric 4D logs are imported and used within the closed-loop simulator, including coordinate-frame alignment and dynamic object handling. For reconstruction error analysis we note the inherent limitation that direct ground-truth comparisons are unavailable; however, the map-verification step serves as the primary fidelity check. We will also include further qualitative examples of reconstruction quality in the supplement. revision: yes

standing simulated objections not resolved
  • Direct quantitative metrics such as scale-error statistics or absolute trajectory error against RTK/LiDAR ground truth cannot be reported, as no such high-precision ground truth is available for the in-the-wild dashcam videos used to construct ROADWork4D.

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper presents a technical framework (Dash2Sim) for converting monocular dashcam videos into metric, geo-referenced 4D logs that are then verified against independently maintained external maps. No equations, derivations, or self-referential definitions appear in the abstract or described method. The central claims rest on producing outputs that can be checked against external data sources rather than on internal fits, parameter renamings, or self-citation chains that reduce the result to its inputs by construction. This matches the default expectation for most papers and the reader's explicit assessment of score 1.0 with no reducing steps. Concerns about quantitative validation of scale accuracy belong to correctness or evidence strength, not circularity per the analysis rules.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the framework is presented at a high level without implementation details.

pith-pipeline@v0.9.1-grok · 5812 in / 1240 out tokens · 24310 ms · 2026-06-27T22:13:36.583695+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

111 extracted references · 14 linked inside Pith

  1. [1]

    Geiger, P

    A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? the KITTI vision benchmark suite. InCVPR, 2012

  2. [2]

    Caesar, V

    H. Caesar, V . Bankiti, A. H. Lang, S. V ora, V . E. Liong, Q. Xu, A. Krishnan, Y . Pan, G. Baldan, and O. Beijbom. nuScenes: A multimodal dataset for autonomous driving. InCVPR, 2020

  3. [3]

    Chang, J

    M.-F. Chang, J. Lambert, P. Sangkloy, J. Singh, S. Bak, A. Hartnett, D. Wang, P. Carr, S. Lucey, D. Ramanan, and J. Hays. Argoverse: 3D tracking and forecasting with rich maps. InCVPR, 2019

  4. [4]

    Wilson, W

    B. Wilson, W. Qi, T. Agarwal, J. Lambert, J. Singh, S. Khandelwal, B. Pan, R. Kumar, A. Hartnett, J. K. Pontes, D. Ramanan, P. Carr, and J. Hays. Argoverse 2: Next generation datasets for self-driving perception and forecasting. InNeurIPS Datasets and Benchmarks Track, 2021

  5. [5]

    Ettinger, S

    S. Ettinger, S. Cheng, B. Caine, C. Liu, H. Zhao, S. Pradhan, Y . Chai, B. Sapp, C. R. Qi, Y . Zhou, Z. Yang, A. Chouard, P. Sun, J. Ngiam, V . Vasudevan, A. McCauley, J. Shlens, and D. Anguelov. Large scale interactive motion forecasting for autonomous driving: The Waymo open motion dataset. InICCV, 2021

  6. [6]

    R. Xu, H. Lin, W. Jeon, H. Feng, Y . Zou, L. Sun, J. Gorman, E. Tolstaya, S. Tang, B. White, B. Sapp, M. Tan, J.-J. Hwang, and D. Anguelov. WOD-E2E: Waymo open dataset for end-to-end driving in challenging long-tail scenarios.arXiv preprint arXiv:2510.26125, 2025

  7. [7]

    Dauner, M

    D. Dauner, M. Hallgarten, T. Li, X. Weng, Z. Huang, Z. Yang, H. Li, I. Gilitschenski, B. Ivanovic, M. Pavone, A. Geiger, and K. Chitta. NA VSIM: Data-driven non-reactive autonomous vehicle simulation and benchmarking. InNeurIPS Datasets and Benchmarks Track, 2024

  8. [8]

    W. Cao, M. Hallgarten, T. Li, D. Dauner, X. Gu, C. Wang, Y . Miron, M. Aiello, H. Li, I. Gilitschenski, B. Ivanovic, M. Pavone, A. Geiger, and K. Chitta. Pseudo-simulation for autonomous driving. InCoRL, 2025

  9. [9]

    Caesar, J

    H. Caesar, J. Kabzan, K. S. Tan, W. K. Fong, E. M. Wolff, A. H. Lang, L. Fletcher, O. Beijbom, and S. Omari. nuPlan: A closed-loop ML-based planning benchmark for autonomous vehicles. arXiv preprint arXiv:2106.11810, 2021

  10. [10]

    Dosovitskiy, G

    A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V . Koltun. CARLA: An open urban driving simulator. InCoRL, 2017

  11. [11]

    X. Jia, Z. Yang, Q. Li, Z. Zhang, and J. Yan. Bench2Drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving. InNeurIPS Datasets and Benchmarks Track, 2024

  12. [12]

    Gerstenecker, A

    S. Gerstenecker, A. Geiger, and K. Renz. Fail2Drive: Benchmarking closed-loop driving generalization.arXiv preprint arXiv:2604.08535, 2026

  13. [13]

    Hallgarten, J

    M. Hallgarten, J. Zapata, M. Stoll, K. Renz, and A. Zell. Can vehicle motion planning generalize to realistic long-tail scenarios? InIROS, 2024

  14. [14]

    30 https://www.autoinsurance.com/research/ dash-cam-usage-report/, 2026

    autoinsurance. 30 https://www.autoinsurance.com/research/ dash-cam-usage-report/, 2026. Accessed: 2026-04-29

  15. [15]

    Allshire, H

    A. Allshire, H. Choi, J. Zhang, D. McAllister, A. Zhang, C. M. Kim, T. Darrell, P. Abbeel, J. Malik, and A. Kanazawa. Visual imitation enables contextual humanoid control. InCoRL, 2025

  16. [16]

    Hartley and A

    R. Hartley and A. Zisserman.Multiple View Geometry in Computer Vision. 2003. 29

  17. [17]

    Ghosh, S

    A. Ghosh, S. Zheng, R. Tamburo, K. Vuong, J. Alvarez-Padilla, H. Zhu, M. Cardei, N. Dunn, C. Mertz, and S. G. Narasimhan. Roadwork: A dataset and benchmark for learning to recognize, observe, analyze and drive through work zones. InICCV, 2025

  18. [18]

    Zendel, K

    O. Zendel, K. Honauer, M. Murschitz, D. Steininger, and G. F. Dominguez. Wilddash-creating hazard-aware benchmarks. InECCV, 2018

  19. [19]

    Zendel, M

    O. Zendel, M. Sch ¨orghuber, B. Rainer, M. Murschitz, and C. Beleznai. Unifying panoptic segmentation for autonomous driving. InCVPR, 2022

  20. [20]

    S. K. Bashetty, H. B. Amor, and G. Fainekos. Deepcrashtest: Turning dashcam videos into virtual crash tests for automated driving systems. InICRA, 2020

  21. [21]

    Y . Miao, G. Fainekos, B. Hoxha, H. Okamoto, D. Prokhorov, and S. Mitra. From dashcam videos to driving simulations: Stress testing automated vehicles against rare events. InAAAI Workshops, 2025

  22. [22]

    J. Bote. Cruise vehicle gets stuck in wet concrete while driving in San Francisco. https:// www.sfgate.com/tech/article/cruise-stuck-wet-concrete-sf-18297946.php , 2023

  23. [23]

    Waymo halts freeway rides after robotaxis strug- gle in construction zones

    TechCrunch. Waymo halts freeway rides after robotaxis strug- gle in construction zones. https://techcrunch.com/2026/05/21/ waymo-halts-freeway-rides-after-robotaxis-struggle-in-construction-zones/ , 2026

  24. [24]

    Cheng, Y

    J. Cheng, Y . Chen, and Q. Chen. Pluto: Pushing the limit of imitation learning-based planning for autonomous driving.arXiv preprint arXiv:2404.14327, 2024

  25. [25]

    Dauner, M

    D. Dauner, M. Hallgarten, A. Geiger, and K. Chitta. Parting with misconceptions about learning-based vehicle motion planning. InCoRL, 2023

  26. [26]

    O’Neill, A

    A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. InICRA, 2024

  27. [27]

    Cheang, G

    C.-L. Cheang, G. Chen, Y . Jing, T. Kong, H. Li, Y . Li, Y . Liu, H. Wu, J. Xu, Y . Yang, et al. Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation. arXiv preprint arXiv:2410.06158, 2024

  28. [28]

    S. Y . Gadre, G. Ilharco, A. Fang, J. Hayase, G. Smyrnis, T. Nguyen, R. Marten, M. Wortsman, D. Ghosh, J. Zhang, et al. Datacomp: In search of the next generation of multimodal datasets. NeurIPS, 2023

  29. [29]

    Hwang, R

    J.-J. Hwang, R. Xu, H. Lin, W.-C. Hung, J. Ji, K. Choi, D. Huang, T. He, P. Covington, B. Sapp, et al. Emma: End-to-end multimodal model for autonomous driving.arXiv preprint arXiv:2410.23262, 2024

  30. [30]

    Wagner, O

    R. Wagner, O. S. Tas, J. Villa, F. Hauser, Y . Shen, M. Steiner, D. Strutz, C. Fernandez, C. Kinzig, G. S. Guitierrez-Cabello, et al. Longtail driving scenarios with reasoning traces: The kitscenes longtail dataset.arXiv preprint arXiv:2603.23607, 2026

  31. [32]

    Agarwal, A

    N. Agarwal, A. Ali, M. Bala, Y . Balaji, et al. Cosmos world foundation model platform for physical AI.arXiv preprint arXiv:2501.03575, 2025. 30

  32. [33]

    X. Ren, Y . Lu, T. Cao, R. Gao, S. Huang, A. Sabour, T. Shen, T. Pfaff, J. Z. Wu, R. Chen, et al. Cosmos-drive-dreams: Scalable synthetic driving data generation with world foundation models.arXiv preprint arXiv:2506.09042, 2025

  33. [34]

    X. Ren, T. Shen, J. Huang, H. Ling, Y . Lu, M. Nimier-David, T. M¨uller, A. Keller, S. Fidler, and J. Gao. Gen3c: 3d-informed world-consistent video generation with precise camera control. InCVPR, 2025

  34. [35]

    J. Wang, B. Sun, Y . Bai, V . Casser, S. Peng, Z. Zhu, M.-L. Shih, X. Masotto, S.-Y . Su, K. V . Parvate, et al. Sensor2sensor: Cross-embodiment sensor conversion for autonomous driving. arXiv preprint arXiv:2605.22809, 2026

  35. [36]

    H. Zhou, L. Lin, J. Wang, Y . Lu, D. Bai, B. Liu, Y . Wang, A. Geiger, and Y . Liao. Hugsim: A real-time, photo-realistic and closed-loop simulator for autonomous driving.TPAMI, 2025

  36. [37]

    Z. Chen, J. Yang, J. Huang, R. De Lutio, J. Martinez Esturo, B. Ivanovic, O. Litany, Z. Gojcic, S. Fidler, M. Pavone, et al. Omnire: Omni urban scene reconstruction. InICLR, 2025

  37. [38]

    Chitta, D

    K. Chitta, D. Dauner, and A. Geiger. Sledge: Synthesizing driving environments with generative models and rule-based traffic. InECCV, 2024

  38. [39]

    Z. Li, K. Li, S. Wang, S. Lan, Z. Yu, Y . Ji, Z. Li, Z. Zhu, J. Kautz, Z. Wu, et al. Hydra- mdp: End-to-end multimodal planning with multi-target hydra-distillation.arXiv preprint arXiv:2406.06978, 2024

  39. [40]

    Zheng, R

    Y . Zheng, R. Liang, K. Zheng, J. Zheng, L. Mao, J. Li, W. Gu, R. Ai, S. Li, X. Zhan, et al. Diffusion-based planning for autonomous driving with flexible guidance. InICLR, 2025

  40. [41]

    Ghosh, S

    A. Ghosh, S. Narasimhan, M. Chandraker, and F. Pittaluga. Rad-lad: Rule and language grounded autonomous driving in real-time.arXiv preprint arXiv:2603.28522, 2026

  41. [42]

    C. Sima, K. Renz, K. Chitta, L. Chen, H. Zhang, C. Xie, J. Beißwenger, P. Luo, A. Geiger, and H. Li. Drivelm: Driving with graph visual question answering. InECCV, 2024

  42. [43]

    X. Zhou, X. Han, F. Yang, Y . Ma, V . Tresp, and A. Knoll. Opendrivevla: Towards end-to-end autonomous driving with large vision language action model. InAAAI, 2026

  43. [44]

    Z. Xu, Y . Bai, Y . Zhang, Z. Li, F. Xia, K.-Y . K. Wong, J. Wang, and H. Zhao. Drivegpt4-v2: Harnessing large language model capabilities for enhanced closed-loop autonomous driving. InCVPR, 2025

  44. [45]

    B. Liao, S. Chen, H. Yin, B. Jiang, C. Wang, S. Yan, X. Zhang, X. Li, Y . Zhang, Q. Zhang, et al. Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving. InCVPR, 2025

  45. [46]

    J. Lu, J. Guan, Z. Huang, J. Li, G. Li, L. Kong, Y . Li, H. Wang, S. Xu, Y . Luo, et al. Onevl: One-step latent reasoning and planning with vision-language explanation.arXiv preprint arXiv:2604.18486, 2026

  46. [47]

    Carion, L

    N. Carion, L. Gustafson, Y .-T. Hu, S. Debnath, R. Hu, D. Suris, C. Ryali, K. V . Alwala, H. Khedr, A. Huang, et al. Sam 3: Segment anything with concepts.arXiv preprint arXiv:2511.16719, 2025

  47. [48]

    Huang, J

    W. Huang, J. Zhang, S. Li, T. Jia, J. Duan, Y . Cheng, J. Cho, M. Wallingford, R. Soraki, C. D. Kim, et al. Wilddet3d: Scaling promptable 3d detection in the wild.arXiv preprint arXiv:2604.08626, 2026

  48. [49]

    A. R. Zamir, T. Wekel, P. Agrawal, C. Wei, J. Malik, and S. Savarese. Generic 3d representation via pose estimation and matching. InECCV, 2016. 31

  49. [50]

    Vuong, R

    K. Vuong, R. Tamburo, and S. G. Narasimhan. Toward planet-wide traffic camera calibration. InWACV, 2024

  50. [51]

    Vuong, A

    K. Vuong, A. Ghosh, D. Ramanan, S. Narasimhan, and S. Tulsiani. Aerialmegadepth: Learning aerial-ground reconstruction and view synthesis. InCVPR, 2025

  51. [52]

    Berton and C

    G. Berton and C. Masone. Megaloc: One retrieval to place them all. InCVPR Workshops, 2025

  52. [53]

    OpenStreetMap

    OpenStreetMap contributors. OpenStreetMap. https://www.openstreetmap.org, 2017. Data retrieved from https://planet.openstreetmap.org

  53. [54]

    Sriram, B

    N. Sriram, B. Liu, F. Pittaluga, and M. Chandraker. Smart: Simultaneous multi-agent recurrent trajectory prediction. InECCV, 2020

  54. [55]

    P. Cai, Y . Lee, Y . Luo, and D. Hsu. Summit: A simulator for urban driving in massive mixed traffic. InICRA, 2020

  55. [56]

    Eskenazi

    J. Eskenazi. Waymo rolls toward San Francisco Airport. a showdown is brewing. https://missionlocal.org/2024/12/ waymo-rolls-toward-san-francisco-airport-showdown-brewing/, 2024

  56. [57]

    Sural, N

    S. Sural, N. Sahu, and R. Rajkumar. Workzone3d: A multimodal dataset for 3d work zone perception in autonomous driving. InWACV, 2026

  57. [58]

    Cheng, Y

    J. Cheng, Y . Chen, X. Mei, B. Yang, B. Li, and M. Liu. Rethinking imitation-based planner for autonomous driving. InICRA, 2024

  58. [59]

    Treiber, A

    M. Treiber, A. Hennecke, and D. Helbing. Congested traffic states in empirical observations and microscopic simulations.Physical review E, 2000

  59. [60]

    L. Pan, D. Bar´ath, M. Pollefeys, and J. L. Sch¨onberger. Global structure-from-motion revisited. InECCV, 2024

  60. [61]

    L. Pan, J. L. Sch¨onberger, and M. Pollefeys. Global structure-from-motion meets feedforward reconstruction. InCVPR, 2026

  61. [62]

    Hagedorn, L

    S. Hagedorn, L. Donkov, A. Distelzweig, and A. P. Condurache. When planners meet reality: How learned, reactive traffic agents shift nuplan benchmarks.arXiv preprint arXiv:2510.14677, 2025

  62. [63]

    R. Luo, H. Yang, M. Watson, A. Sharma, S. Veer, E. Schmerling, and M. Pavone. Sim2val: Leveraging correlation across test platforms for variance-reduced metric estimation.arXiv preprint arXiv:2506.20553, 2025

  63. [64]

    Pinto and A

    L. Pinto and A. Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. InICRA, 2016

  64. [65]

    Levine, P

    S. Levine, P. Pastor, A. Krizhevsky, J. Ibarz, and D. Quillen. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection.IJRR, 2018

  65. [66]

    L. Chen, K. Lu, A. Rajeswaran, K. Lee, A. Grover, M. Laskin, P. Abbeel, A. Srinivas, and I. Mordatch. Decision transformer: Reinforcement learning via sequence modeling.NeurIPS, 2021

  66. [67]

    Emmons, B

    S. Emmons, B. Eysenbach, I. Kostrikov, and S. Levine. Rvs: What is essential for offline rl via supervised learning?arXiv preprint arXiv:2112.10751, 2021

  67. [68]

    G. L. Smith, S. F. Schmidt, and L. A. McGee.Application of statistical filter theory to the optimal estimation of position and velocity on board a circumlunar vehicle. 1962. 32

  68. [69]

    J. Wang, M. Chen, N. Karaev, A. Vedaldi, C. Rupprecht, and D. Novotny. Vggt: Visual geometry grounded transformer. InCVPR, 2025

  69. [70]

    N. V . Keetha, N. M¨uller, J. Sch¨onberger, L. Porzi, Y . Zhang, T. Fischer, A. Knapitsch, D. Zauss, E. Weber, N. Antunes, et al. Mapanything: Universal feed-forward metric 3d reconstruction. In3DV, 2025

  70. [71]

    A. Hu, L. Russell, H. Yeo, Z. Murez, G. Fedoseev, A. Kendall, J. Shotton, and G. Corrado. Gaia-1: A generative world model for autonomous driving.arXiv preprint arXiv:2309.17080, 2023

  71. [72]

    S. Gao, J. Yang, L. Chen, K. Chitta, Y . Qiu, A. Geiger, J. Zhang, and H. Li. Vista: A generalizable driving world model with high fidelity and versatile controllability.NeurIPS, 2024

  72. [73]

    X. Wang, Z. Zhu, G. Huang, X. Chen, J. Zhu, and J. Lu. Drivedreamer: Towards real-world- drive world models for autonomous driving. InECCV, 2024

  73. [74]

    Zhang, P

    R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. InCVPR, 2018

  74. [75]

    Ghildyal and F

    A. Ghildyal and F. Liu. Shift-tolerant perceptual similarity metric. InECCV, 2022

  75. [76]

    Heusel, H

    M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium.NeurIPS, 2017

  76. [77]

    Kynk¨a¨anniemi, T

    T. Kynk¨a¨anniemi, T. Karras, M. Aittala, T. Aila, and J. Lehtinen. The role of imagenet classes in fr\’echet inception distance.arXiv preprint arXiv:2203.06026, 2022

  77. [78]

    Tumanyan, O

    N. Tumanyan, O. Bar-Tal, S. Bagon, and T. Dekel. Splicing vit features for semantic appearance transfer. InCVPR, 2022

  78. [79]

    S. Fu, N. Tamir, S. Sundaram, L. Chai, R. Zhang, T. Dekel, and P. Isola. Dreamsim: Learning new dimensions of human visual similarity using synthetic data.arXiv preprint arXiv:2306.09344, 2023

  79. [80]

    Wickrema, S

    C. Wickrema, S. Leary, S. Sarkar, M. Giglio, E. Bianchi, E. Mace, and M. Twardowski. Benchmarking image similarity metrics for novel view synthesis applications.arXiv preprint arXiv:2506.12563, 2025

  80. [81]

    M. Khan, H. Fazlali, D. Sharma, T. Cao, D. Bai, Y . Ren, and B. Liu. Autosplat: Constrained gaussian splatting for autonomous driving scene reconstruction. InICRA, 2025

Showing first 80 references.