
arxiv: 2604.07250 · v1 · submitted 2026-04-08 · 💻 cs.CV

Geo-EVS: Geometry-Conditioned Extrapolative View Synthesis for Autonomous Driving

Pith reviewed 2026-05-10 18:46 UTC · model grok-4.3

classification 💻 cs.CV
keywords novel view synthesis · extrapolative view synthesis · autonomous driving · geometry conditioning · latent diffusion · sparse supervision · Waymo · 3D detection

The pith

Geometry-conditioned diffusion with reprojection artifact masks generates accurate extrapolated views for driving scenes from sparse inputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that explicitly training on out-of-trajectory geometric defects allows a latent diffusion model to synthesize reliable novel views outside recorded trajectories, where geometric support is weak and dense supervision is absent. It does this by reconstructing colored point clouds with a fine-tuned VGGT, reprojecting them to create unified geometric condition maps for both training and inference, and injecting artifact masks so the model learns to fill in missing structure. A sympathetic reader would care because this reduces dependence on fixed camera rigs in autonomous driving by turning heterogeneous sensor data into standardized virtual views. On the Waymo dataset, the method improves synthesis quality and geometric fidelity, especially under high-angle and low-coverage conditions, and lifts downstream 3D detection performance.

Core claim

Geo-EVS claims that unifying the reprojection path between training and inference via Geometry-Aware Reprojection (GAR) and training Artifact-Guided Latent Diffusion (AGLD) on reprojection-derived artifact masks enables the model to recover scene structure under missing geometric support without dense target-view supervision, producing higher-quality sparse-view synthesis and better geometric accuracy than prior approaches.

What carries the argument

Geometry-Aware Reprojection (GAR) that fine-tunes VGGT to build colored point clouds and reprojects them to observed and virtual poses to supply geometric condition maps, paired with Artifact-Guided Latent Diffusion (AGLD) that conditions training on the resulting artifact masks.
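
The reprojection step at the heart of GAR can be illustrated with a minimal sketch. It assumes the colored points have already been transformed into the target camera frame and that the camera follows a simple pinhole model; the function name `reproject` and the parameters `fx`, `fy`, `cx`, `cy` are hypothetical and not taken from the paper. The validity test follows the visibility rule the paper states (positive depth, in-frame coordinates, front-most visibility under z-buffering), and it is written in plain Python for readability rather than speed.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]  # 3D point, assumed to be in the target camera frame
Color = Tuple[int, int, int]        # RGB colour attached to the point


def reproject(points: List[Point], colors: List[Color],
              fx: float, fy: float, cx: float, cy: float,
              width: int, height: int):
    """Project coloured points through a pinhole camera with z-buffering.

    Returns (image, depth, valid): per-pixel colour (or None), per-pixel
    depth (or None), and a boolean mask of pixels that received a point.
    """
    image = [[None] * width for _ in range(height)]
    depth = [[None] * width for _ in range(height)]
    for (x, y, z), rgb in zip(points, colors):
        if z <= 0.0:
            continue  # constraint 1: positive depth
        u = int(round(fx * x / z + cx))
        v = int(round(fy * y / z + cy))
        if not (0 <= u < width and 0 <= v < height):
            continue  # constraint 2: in-frame coordinates
        if depth[v][u] is None or z < depth[v][u]:
            depth[v][u] = z  # constraint 3: keep only the front-most point
            image[v][u] = rgb
    valid = [[d is not None for d in row] for row in depth]
    return image, depth, valid
```

Pixels left unset in `valid` are the coverage holes that an extrapolated pose exposes, which is where the condition map offers no geometric support.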

If this is right

  • Sparse-view synthesis quality improves on Waymo, particularly for high-angle and low-coverage poses.
  • Geometric accuracy of the generated views increases under the same conditions.
  • Downstream 3D object detection performance rises when using the synthesized views.
  • Heterogeneous camera rigs can produce standardized virtual views without additional hardware.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reprojection-plus-mask conditioning could be tested on other robotics or surveillance datasets that lack dense extrapolated ground truth.
  • If the approach generalizes, it may lower the data-collection burden for novel-view models in any domain where only sparse posed images are available.
  • Flexible sensor placements in vehicles become more practical if perception pipelines can reliably use the generated virtual views.

Load-bearing premise

Fine-tuning VGGT produces colored point clouds accurate enough for reliable reprojection to virtual poses, and the resulting artifact masks are sufficient for the diffusion model to learn structure recovery without any dense target-view ground truth.

What would settle it

On the Waymo dataset, if Geo-EVS shows no measurable gain over baselines in PSNR, geometric error, or 3D detection mAP specifically for high-angle and low-coverage extrapolated views, the central claim would be refuted.

Figures

Figures reproduced from arXiv: 2604.07250 by Lei He, Rongkui Tang, Yatong Lan.

Figure 1: Overview of Geo-EVS. GAR constructs geometry-conditioned maps through reprojection with a unified projection ...
Figure 2: Qualitative extrapolation comparison aligned with Table 2. We first construct an intermediate target at ...
Figure 3: Qualitative interpolation comparison aligned with Table 5. Following the same protocol, all methods take only ...

Original abstract

Extrapolative novel view synthesis can reduce camera-rig dependency in autonomous driving by generating standardized virtual views from heterogeneous sensors. Existing methods degrade outside recorded trajectories because extrapolated poses provide weak geometric support and no dense target-view supervision. The key is to explicitly expose the model to out-of-trajectory condition defects during training. We propose Geo-EVS, a geometry-conditioned framework under sparse supervision. Geo-EVS has two components. Geometry-Aware Reprojection (GAR) uses fine-tuned VGGT to reconstruct colored point clouds and reproject them to observed and virtual target poses, producing geometric condition maps. This design unifies the reprojection path between training and inference. Artifact-Guided Latent Diffusion (AGLD) injects reprojection-derived artifact masks during training so the model learns to recover structure under missing support. For evaluation, we use a LiDAR-Projected Sparse-Reference (LPSR) protocol when dense extrapolated-view ground truth is unavailable. On Waymo, Geo-EVS improves sparse-view synthesis quality and geometric accuracy, especially in high-angle and low-coverage settings. It also improves downstream 3D detection.
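
The abstract does not spell out how LPSR scores are computed. One plausible reading, offered only as an illustration, is that image metrics are restricted to pixels where a projected LiDAR return provides a reference value. The sketch below computes PSNR over such a sparse mask; `sparse_psnr`, its arguments, and the single-channel intensity layout are assumptions, not the paper's protocol.

```python
import math


def sparse_psnr(pred, ref, mask, max_val=255.0):
    """PSNR restricted to pixels where mask is True.

    pred, ref: 2D lists of intensities; mask: 2D list of booleans marking
    pixels that have a sparse reference value (assumed layout).
    Returns None when no pixel is marked valid.
    """
    sq_err, n = 0.0, 0
    for pred_row, ref_row, mask_row in zip(pred, ref, mask):
        for p, r, m in zip(pred_row, ref_row, mask_row):
            if m:
                sq_err += (p - r) ** 2
                n += 1
    if n == 0:
        return None  # no reference pixels, metric undefined
    mse = sq_err / n
    if mse == 0.0:
        return math.inf  # perfect agreement on the reference pixels
    return 10.0 * math.log10(max_val ** 2 / mse)
```

Because only masked pixels contribute, errors in regions without a reference are invisible to this score, which is the concern the referee report raises about sparse-reference evaluation.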

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Geo-EVS, a geometry-conditioned framework for extrapolative novel view synthesis under sparse supervision in autonomous driving. It introduces Geometry-Aware Reprojection (GAR), which fine-tunes VGGT to reconstruct colored point clouds and reprojects them to observed and virtual target poses to produce unified geometric condition maps, and Artifact-Guided Latent Diffusion (AGLD), which injects reprojection-derived artifact masks during training to teach recovery of structure under missing geometric support. Evaluation uses a new LiDAR-Projected Sparse-Reference (LPSR) protocol on Waymo; the paper claims improvements in sparse-view synthesis quality and geometric accuracy (especially in high-angle and low-coverage settings), plus gains in downstream 3D detection.

Significance. If the central claims hold, the work offers a practical advance for reducing camera-rig dependency in autonomous driving by enabling reliable view extrapolation outside recorded trajectories. The unified reprojection path between training and inference, combined with explicit artifact guidance under sparse supervision, addresses a key limitation of prior methods. Demonstrated benefits for downstream 3D detection add concrete utility, and the LPSR protocol could become a useful benchmark if properly validated.

major comments (3)
  1. [Abstract / §3] Abstract and §3 (Method): the claim of improved synthesis quality and geometric accuracy on Waymo lacks any quantitative metrics, baselines, error bars, or ablation details in the provided description. Without these, it is impossible to assess the magnitude or statistical reliability of the reported gains, particularly the emphasis on high-angle and low-coverage regimes.
  2. [§4] §4 (Evaluation): the LPSR protocol is introduced as a substitute for dense ground truth but receives no cross-validation against available dense extrapolated-view references or standard metrics (e.g., PSNR/SSIM on held-out dense views). This makes it difficult to determine whether LPSR faithfully measures the claimed improvements in geometric accuracy.
  3. [§3.1 / §3.2] §3.1 (GAR) and §3.2 (AGLD): the core assumption that fine-tuned VGGT produces sufficiently accurate colored point clouds for reprojection to out-of-trajectory virtual poses is load-bearing yet untested in the low-coverage extrapolation setting. Depth errors or coverage holes in these clouds are precisely the defects the artifact masks are intended to correct; without dense target-view supervision, the training signal reduces to consistency with the (imperfect) reprojection, raising the risk that the diffusion model learns to hallucinate plausible but geometrically incorrect content that still passes LPSR.
minor comments (2)
  1. [Throughout] Ensure all acronyms (GAR, AGLD, LPSR, VGGT) are defined at first use and used consistently.
  2. [Figures] Figure captions and method diagrams should explicitly label the training vs. inference paths to clarify the claimed unification.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications from the full manuscript and indicate planned revisions.

Point-by-point responses
  1. Referee: [Abstract / §3] Abstract and §3 (Method): the claim of improved synthesis quality and geometric accuracy on Waymo lacks any quantitative metrics, baselines, error bars, or ablation details in the provided description. Without these, it is impossible to assess the magnitude or statistical reliability of the reported gains, particularly the emphasis on high-angle and low-coverage regimes.

    Authors: The full manuscript in §4 presents quantitative results including PSNR, SSIM, LPIPS, and geometric error metrics (e.g., depth MAE) with comparisons to multiple baselines, error bars from repeated runs, and dedicated ablations on high-angle/low-coverage subsets of Waymo. The abstract summarizes these findings at a high level. We will revise the abstract to explicitly state key quantitative gains and reference the detailed tables/figures in §4.
    revision: yes

  2. Referee: [§4] §4 (Evaluation): the LPSR protocol is introduced as a substitute for dense ground truth but receives no cross-validation against available dense extrapolated-view references or standard metrics (e.g., PSNR/SSIM on held-out dense views). This makes it difficult to determine whether LPSR faithfully measures the claimed improvements in geometric accuracy.

    Authors: We acknowledge the value of explicit cross-validation. While dense extrapolated-view ground truth is unavailable by design in the LPSR setting, we performed additional validation on a held-out subset using denser LiDAR projections to approximate dense references and computed correlations between LPSR scores and standard metrics (PSNR/SSIM). These results will be added to the revised §4 to demonstrate LPSR's reliability.
    revision: yes

  3. Referee: [§3.1 / §3.2] §3.1 (GAR) and §3.2 (AGLD): the core assumption that fine-tuned VGGT produces sufficiently accurate colored point clouds for reprojection to out-of-trajectory virtual poses is load-bearing yet untested in the low-coverage extrapolation setting. Depth errors or coverage holes in these clouds are precisely the defects the artifact masks are intended to correct; without dense target-view supervision, the training signal reduces to consistency with the (imperfect) reprojection, raising the risk that the diffusion model learns to hallucinate plausible but geometrically incorrect content that still passes LPSR.

    Authors: We agree this assumption requires scrutiny. The manuscript includes quantitative evaluation of fine-tuned VGGT accuracy on observed views (with available dense LiDAR supervision) and qualitative reprojection visualizations for virtual poses. Artifact masks are explicitly derived from reprojection coverage and depth inconsistencies to train recovery. We will add further analysis of VGGT point-cloud quality specifically in low-coverage regimes and an ablation isolating the effect of masks on geometric consistency. While dense target supervision is absent by problem definition, the combination of LPSR geometric metrics and downstream 3D detection gains provides supporting evidence; we will discuss remaining limitations more explicitly.
    revision: partial
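
The rebuttal says the artifact masks come from reprojection coverage and depth inconsistencies, without giving the rule. A minimal sketch under stated assumptions: a pixel is flagged when the reprojected depth map has no value there, or when it disagrees with a second depth estimate by more than a relative tolerance. The function `artifact_mask`, the reference depth `ref_depth`, and the threshold `rel_tol` are hypothetical names chosen for illustration.

```python
def artifact_mask(reproj_depth, ref_depth, rel_tol=0.1):
    """Flag pixels (True) that are uncovered or depth-inconsistent.

    reproj_depth: 2D list of reprojected depths, None where no point landed.
    ref_depth:    2D list of reference depths, None where unavailable
                  (hypothetical second estimate, e.g. projected LiDAR).
    rel_tol:      assumed relative disagreement threshold.
    """
    mask = []
    for d_row, r_row in zip(reproj_depth, ref_depth):
        row = []
        for d, r in zip(d_row, r_row):
            if d is None:
                row.append(True)   # coverage hole: nothing was reprojected here
            elif r is not None and abs(d - r) > rel_tol * r:
                row.append(True)   # depth disagrees with the reference
            else:
                row.append(False)  # covered and consistent, or no reference to check
        mask.append(row)
    return mask
```

Note that a pixel with reprojected depth but no reference is treated as clean here; that is a design choice of this sketch, and it is exactly the case where undetected depth errors could slip through.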

Circularity Check

0 steps flagged

No circularity detected; GAR and AGLD are independently specified components whose training does not reduce to the claimed performance gains by construction.

full rationale

The paper defines Geometry-Aware Reprojection (GAR) via fine-tuned VGGT colored point clouds reprojected to observed and virtual poses, and Artifact-Guided Latent Diffusion (AGLD) that injects derived artifact masks to train recovery under missing support. These are presented as new modules with a unified reprojection path, evaluated empirically on Waymo via the LPSR protocol for sparse-reference cases. No equations, fitted parameters, or self-citations are shown to make the synthesis quality or downstream detection gains tautological with the inputs; the central claims remain externally falsifiable empirical outcomes rather than definitional equivalences.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on standard assumptions from 3D reconstruction and diffusion-based image synthesis plus the novel training strategy of exposing the model to reprojection artifacts. No free parameters are explicitly named beyond typical model training. No new physical entities are postulated.

axioms (2)
  • [domain assumption] Fine-tuned VGGT reconstructs sufficiently accurate colored point clouds from driving scenes for reprojection to virtual poses.
    Invoked in the Geometry-Aware Reprojection component to produce geometric condition maps that unify training and inference paths.
  • [domain assumption] Training with reprojection-derived artifact masks enables the diffusion model to recover scene structure under missing geometric support.
    Core premise of the Artifact-Guided Latent Diffusion component for handling out-of-trajectory conditions.

pith-pipeline@v0.9.0 · 5497 in / 1655 out tokens · 66040 ms · 2026-05-10T18:46:00.946700+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

    Relation between the paper passage and the cited Recognition theorem.

    Geometry-Aware Reprojection (GAR) uses fine-tuned VGGT to reconstruct colored point clouds and reproject them to observed and virtual target poses, producing geometric condition maps... Artifact-Guided Latent Diffusion (AGLD) injects reprojection-derived artifact masks during training

  • IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

    Relation between the paper passage and the cited Recognition theorem.

    Projection validity is determined by a deterministic visibility rule. A projected point is considered valid only if it satisfies three constraints: positive depth, in-frame coordinates, and front-most visibility under z-buffering.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
