Geo-EVS: Geometry-Conditioned Extrapolative View Synthesis for Autonomous Driving

Lei He; Rongkui Tang; Yatong Lan

arxiv: 2604.07250 · v1 · submitted 2026-04-08 · 💻 cs.CV

Geo-EVS: Geometry-Conditioned Extrapolative View Synthesis for Autonomous Driving

Yatong Lan , Rongkui Tang , Lei He This is my paper

Pith reviewed 2026-05-10 18:46 UTC · model grok-4.3

classification 💻 cs.CV

keywords novel view synthesisextrapolative view synthesisautonomous drivinggeometry conditioninglatent diffusionsparse supervisionWaymo3D detection

0 comments

The pith

Geometry-conditioned diffusion with reprojection artifact masks generates accurate extrapolated views for driving scenes from sparse inputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that explicitly training on out-of-trajectory geometric defects allows a latent diffusion model to synthesize reliable novel views outside recorded trajectories where support is weak and dense supervision is absent. It does this by reconstructing colored point clouds with a fine-tuned VGGT, reprojecting them to create unified geometric condition maps for both training and inference, and injecting artifact masks so the model learns to fill missing structure. A sympathetic reader would care because this reduces dependence on fixed camera rigs in autonomous driving by turning heterogeneous sensor data into standardized virtual views. On the Waymo dataset the method improves synthesis quality and geometric fidelity especially under high-angle and low-coverage conditions and lifts downstream 3D detection performance.

Core claim

Geo-EVS claims that unifying the reprojection path between training and inference via Geometry-Aware Reprojection (GAR) and training Artifact-Guided Latent Diffusion (AGLD) on reprojection-derived artifact masks enables the model to recover scene structure under missing geometric support without dense target-view supervision, producing higher-quality sparse-view synthesis and better geometric accuracy than prior approaches.

What carries the argument

Geometry-Aware Reprojection (GAR) that fine-tunes VGGT to build colored point clouds and reprojects them to observed and virtual poses to supply geometric condition maps, paired with Artifact-Guided Latent Diffusion (AGLD) that conditions training on the resulting artifact masks.

If this is right

Sparse-view synthesis quality improves on Waymo, particularly for high-angle and low-coverage poses.
Geometric accuracy of the generated views increases under the same conditions.
Downstream 3D object detection performance rises when using the synthesized views.
Heterogeneous camera rigs can produce standardized virtual views without additional hardware.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same reprojection-plus-mask conditioning could be tested on other robotics or surveillance datasets that lack dense extrapolated ground truth.
If the approach generalizes, it may lower the data-collection burden for novel-view models in any domain where only sparse posed images are available.
Flexible sensor placements in vehicles become more practical if perception pipelines can reliably use the generated virtual views.

Load-bearing premise

Fine-tuning VGGT produces colored point clouds accurate enough for reliable reprojection to virtual poses, and the resulting artifact masks are sufficient for the diffusion model to learn structure recovery without any dense target-view ground truth.

What would settle it

On the Waymo dataset, if Geo-EVS shows no measurable gain over baselines in PSNR, geometric error, or 3D detection mAP specifically for high-angle and low-coverage extrapolated views, the central claim would be refuted.

Figures

Figures reproduced from arXiv: 2604.07250 by Lei He, Rongkui Tang, Yatong Lan.

**Figure 1.** Figure 1: Overview of Geo-EVS. GAR constructs geometry-conditioned maps through reprojection with a unified projection [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗

**Figure 2.** Figure 2: Qualitative extrapolation comparison aligned with Table 2. We first construct an intermediate target at [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: Qualitative interpolation comparison aligned with Table 5. Following the same protocol, all methods take only [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

read the original abstract

Extrapolative novel view synthesis can reduce camera-rig dependency in autonomous driving by generating standardized virtual views from heterogeneous sensors. Existing methods degrade outside recorded trajectories because extrapolated poses provide weak geometric support and no dense target-view supervision. The key is to explicitly expose the model to out-of-trajectory condition defects during training. We propose Geo-EVS, a geometry-conditioned framework under sparse supervision. Geo-EVS has two components. Geometry-Aware Reprojection (GAR) uses fine-tuned VGGT to reconstruct colored point clouds and reproject them to observed and virtual target poses, producing geometric condition maps. This design unifies the reprojection path between training and inference. Artifact-Guided Latent Diffusion (AGLD) injects reprojection-derived artifact masks during training so the model learns to recover structure under missing support. For evaluation, we use a LiDAR-Projected Sparse-Reference (LPSR) protocol when dense extrapolated-view ground truth is unavailable. On Waymo, Geo-EVS improves sparse-view synthesis quality and geometric accuracy, especially in high-angle and low-coverage settings. It also improves downstream 3D detection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Geo-EVS introduces GAR reprojection from fine-tuned VGGT and AGLD with artifact masks to handle out-of-trajectory synthesis, but the abstract gives no metrics so the gains stay unproven.

read the letter

The paper's core move is to make extrapolative view synthesis work outside recorded trajectories by feeding the model explicit geometry defects. GAR takes a fine-tuned VGGT to build colored point clouds, reprojects them to both observed and virtual poses, and produces condition maps that stay consistent between train and test. AGLD then trains a latent diffusion model on the artifact masks that come out of those reprojections, so the network learns to fill missing support under sparse supervision. They evaluate with a new LPSR protocol that uses LiDAR-projected sparse references when dense target-view ground truth is absent, and they report better synthesis quality plus downstream 3D detection on Waymo, especially in high-angle and low-coverage cases.

Referee Report

3 major / 2 minor

Summary. The paper proposes Geo-EVS, a geometry-conditioned framework for extrapolative novel view synthesis under sparse supervision in autonomous driving. It introduces Geometry-Aware Reprojection (GAR) that fine-tunes VGGT to reconstruct colored point clouds, reprojects them to observed and virtual target poses to produce unified geometric condition maps, and Artifact-Guided Latent Diffusion (AGLD) that injects reprojection-derived artifact masks during training to teach recovery of structure under missing geometric support. Evaluation employs a new LiDAR-Projected Sparse-Reference (LPSR) protocol on Waymo, claiming improvements in sparse-view synthesis quality and geometric accuracy (especially high-angle/low-coverage settings) plus gains in downstream 3D detection.

Significance. If the central claims hold, the work offers a practical advance for reducing camera-rig dependency in autonomous driving by enabling reliable view extrapolation outside recorded trajectories. The unified reprojection path between training and inference, combined with explicit artifact guidance under sparse supervision, addresses a key limitation of prior methods. Demonstrated benefits for downstream 3D detection add concrete utility, and the LPSR protocol could become a useful benchmark if properly validated.

major comments (3)

[Abstract / §3] Abstract and §3 (Method): the claim of improved synthesis quality and geometric accuracy on Waymo lacks any quantitative metrics, baselines, error bars, or ablation details in the provided description. Without these, it is impossible to assess the magnitude or statistical reliability of the reported gains, particularly the emphasis on high-angle and low-coverage regimes.
[§4] §4 (Evaluation): the LPSR protocol is introduced as a substitute for dense ground truth but receives no cross-validation against available dense extrapolated-view references or standard metrics (e.g., PSNR/SSIM on held-out dense views). This makes it difficult to determine whether LPSR faithfully measures the claimed improvements in geometric accuracy.
[§3.1 / §3.2] §3.1 (GAR) and §3.2 (AGLD): the core assumption that fine-tuned VGGT produces sufficiently accurate colored point clouds for reprojection to out-of-trajectory virtual poses is load-bearing yet untested in the low-coverage extrapolation setting. Depth errors or coverage holes in these clouds are precisely the defects the artifact masks are intended to correct; without dense target-view supervision, the training signal reduces to consistency with the (imperfect) reprojection, raising the risk that the diffusion model learns to hallucinate plausible but geometrically incorrect content that still passes LPSR.

minor comments (2)

[Throughout] Ensure all acronyms (GAR, AGLD, LPSR, VGGT) are defined at first use and used consistently.
[Figures] Figure captions and method diagrams should explicitly label the training vs. inference paths to clarify the claimed unification.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications from the full manuscript and indicate planned revisions.

read point-by-point responses

Referee: [Abstract / §3] Abstract and §3 (Method): the claim of improved synthesis quality and geometric accuracy on Waymo lacks any quantitative metrics, baselines, error bars, or ablation details in the provided description. Without these, it is impossible to assess the magnitude or statistical reliability of the reported gains, particularly the emphasis on high-angle and low-coverage regimes.

Authors: The full manuscript in §4 presents quantitative results including PSNR, SSIM, LPIPS, and geometric error metrics (e.g., depth MAE) with comparisons to multiple baselines, error bars from repeated runs, and dedicated ablations on high-angle/low-coverage subsets of Waymo. The abstract summarizes these findings at a high level. We will revise the abstract to explicitly state key quantitative gains and reference the detailed tables/figures in §4. revision: yes
Referee: [§4] §4 (Evaluation): the LPSR protocol is introduced as a substitute for dense ground truth but receives no cross-validation against available dense extrapolated-view references or standard metrics (e.g., PSNR/SSIM on held-out dense views). This makes it difficult to determine whether LPSR faithfully measures the claimed improvements in geometric accuracy.

Authors: We acknowledge the value of explicit cross-validation. While dense extrapolated-view ground truth is unavailable by design in the LPSR setting, we performed additional validation on a held-out subset using denser LiDAR projections to approximate dense references and computed correlations between LPSR scores and standard metrics (PSNR/SSIM). These results will be added to the revised §4 to demonstrate LPSR's reliability. revision: yes
Referee: [§3.1 / §3.2] §3.1 (GAR) and §3.2 (AGLD): the core assumption that fine-tuned VGGT produces sufficiently accurate colored point clouds for reprojection to out-of-trajectory virtual poses is load-bearing yet untested in the low-coverage extrapolation setting. Depth errors or coverage holes in these clouds are precisely the defects the artifact masks are intended to correct; without dense target-view supervision, the training signal reduces to consistency with the (imperfect) reprojection, raising the risk that the diffusion model learns to hallucinate plausible but geometrically incorrect content that still passes LPSR.

Authors: We agree this assumption requires scrutiny. The manuscript includes quantitative evaluation of fine-tuned VGGT accuracy on observed views (with available dense LiDAR supervision) and qualitative reprojection visualizations for virtual poses. Artifact masks are explicitly derived from reprojection coverage and depth inconsistencies to train recovery. We will add further analysis of VGGT point-cloud quality specifically in low-coverage regimes and an ablation isolating the effect of masks on geometric consistency. While dense target supervision is absent by problem definition, the combination of LPSR geometric metrics and downstream 3D detection gains provides supporting evidence; we will discuss remaining limitations more explicitly. revision: partial

Circularity Check

0 steps flagged

No circularity detected; GAR and AGLD are independently specified components whose training does not reduce to the claimed performance gains by construction.

full rationale

The paper defines Geometry-Aware Reprojection (GAR) via fine-tuned VGGT colored point clouds reprojected to observed and virtual poses, and Artifact-Guided Latent Diffusion (AGLD) that injects derived artifact masks to train recovery under missing support. These are presented as new modules with a unified reprojection path, evaluated empirically on Waymo via the LPSR protocol for sparse-reference cases. No equations, fitted parameters, or self-citations are shown to make the synthesis quality or downstream detection gains tautological with the inputs; the central claims remain externally falsifiable empirical outcomes rather than definitional equivalences.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on standard assumptions from 3D reconstruction and diffusion-based image synthesis plus the novel training strategy of exposing the model to reprojection artifacts. No free parameters are explicitly named beyond typical model training. No new physical entities are postulated.

axioms (2)

domain assumption Fine-tuned VGGT reconstructs sufficiently accurate colored point clouds from driving scenes for reprojection to virtual poses
Invoked in the Geometry-Aware Reprojection component to produce geometric condition maps that unify training and inference paths.
domain assumption Training with reprojection-derived artifact masks enables the diffusion model to recover scene structure under missing geometric support
Core premise of the Artifact-Guided Latent Diffusion component for handling out-of-trajectory conditions.

pith-pipeline@v0.9.0 · 5497 in / 1655 out tokens · 66040 ms · 2026-05-10T18:46:00.946700+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Geometry-Aware Reprojection (GAR) uses fine-tuned VGGT to reconstruct colored point clouds and reproject them to observed and virtual target poses, producing geometric condition maps... Artifact-Guided Latent Diffusion (AGLD) injects reprojection-derived artifact masks during training
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Projection validity is determined by a deterministic visibility rule. A projected point is considered valid only if it satisfies three constraints: positive depth, in-frame coordinates, and front-most visibility under z-buffering.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages

[1]

Scalability in perception for autonomous driving: Waymo open dataset

Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2446–2454, Seattle, WA, USA, 2020. IEEE

work page 2020
[2]

nuscenes: A multimodal dataset for autonomous driving

Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621– 11631, Seattle, WA, USA, 2020. IEEE

work page 2020
[3]

Stereo magnification: learning view synthesis using multiplane images.ACM Transactions on Graphics (TOG), 37(4):1–12, 2018

Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: learning view synthesis using multiplane images.ACM Transactions on Graphics (TOG), 37(4):1–12, 2018

work page 2018
[4]

Deepstereo: Learning to predict new views from the world’s imagery

John Flynn, Ivan Neulander, James Philbin, and Noah Snavely. Deepstereo: Learning to predict new views from the world’s imagery. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 5515–5524, Las Vegas, NV, USA, 2016. IEEE

work page 2016
[5]

Nerf: Representing scenes as neural radiance fields for view synthesis.Communications of the ACM, 65(1):99–106, 2021

Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis.Communications of the ACM, 65(1):99–106, 2021

work page 2021
[6]

Mip-nerf 360: Unbounded anti-aliased neural radiance fields

Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5470–5479, New Orleans, LA, USA, 2022. IEEE

work page 2022
[7]

Block-nerf: Scalable large scene neural view synthesis

Matthew Tancik, Vincent Casser, Xinchen Yan, Sabeek Pradhan, Ben Mildenhall, Pratul P Srinivasan, Jonathan T Barron, and Henrik Kretzschmar. Block-nerf: Scalable large scene neural view synthesis. InProceedings of the IEEE/CVF 8 Geo-EVS: Geometry-Conditioned Extrapolative View Synthesis for Autonomous Driving conference on computer vision and pattern reco...

work page 2022
[8]

Mega-nerf: Scalable construction of large-scale nerfs for virtual fly-throughs

Haithem Turki, Deva Ramanan, and Mahadev Satyanarayanan. Mega-nerf: Scalable construction of large-scale nerfs for virtual fly-throughs. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12922–12931, New Orleans, LA, USA, 2022. IEEE

work page 2022
[9]

Bungeenerf: Progressive neural radiance field for extreme multi-scale scene rendering

Yuanbo Xiangli, Linning Xu, Xingang Pan, Nanxuan Zhao, Anyi Rao, Christian Theobalt, Bo Dai, and Dahua Lin. Bungeenerf: Progressive neural radiance field for extreme multi-scale scene rendering. InEuropean conference on computer vision, pages 106–122, Tel Aviv, Israel, 2022. Springer

work page 2022
[10]

Neural scene flow fields for space-time view synthesis of dynamic scenes

Zhengqi Li, Simon Niklaus, Noah Snavely, and Oliver Wang. Neural scene flow fields for space-time view synthesis of dynamic scenes. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6498–6508, Nashville, TN, USA, 2021. IEEE

work page 2021
[11]

Raft-3d: Scene flow using rigid-motion embed- dings

Zachary Teed and Jia Deng. Raft-3d: Scene flow using rigid-motion embed- dings. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8375–8384, Nashville, TN, USA, 2021. IEEE

work page 2021
[12]

Suds: Scalable urban dynamic scenes

Haithem Turki, Jason Y Zhang, Francesco Ferroni, and Deva Ramanan. Suds: Scalable urban dynamic scenes. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12375–12385, Vancouver, Canada,

work page
[13]

Emernerf: Emergent spatial-temporal scene decomposition via self-supervision

Jiawei Yang, Boris Ivanovic, Or Litany, Xinshuo Weng, Seung Wook Kim, Boyi Li, Tong Che, Danfei Xu, Sanja Fidler, Marco Pavone, et al. Emernerf: Emergent spatial-temporal scene decomposition via self-supervision. InInternational Con- ference on Learning Representations, Vienna, Austria, 2024. http://OpenReview.net

work page 2024
[14]

Drivinggaussian: Composite gaussian splatting for surrounding dynamic autonomous driving scenes

Xiaoyu Zhou, Zhiwei Lin, Xiaojun Shan, Yongtao Wang, Deqing Sun, and Ming- Hsuan Yang. Drivinggaussian: Composite gaussian splatting for surrounding dynamic autonomous driving scenes. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21634–21643, Vancouver, Canada,

work page
[15]

Street gaussians: Modeling dy- namic urban scenes with gaussian splatting

Yunzhi Yan, Haotong Lin, Chenxu Zhou, Weijie Wang, Haiyang Sun, Kun Zhan, Xianpeng Lang, Xiaowei Zhou, and Sida Peng. Street gaussians: Modeling dy- namic urban scenes with gaussian splatting. InEuropean Conference on Computer Vision, pages 156–173, Milan, Italy, 2024. Springer

work page 2024
[16]

Ibrnet: Learning multi-view image-based rendering

Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul P Srinivasan, Howard Zhou, Jonathan T Barron, Ricardo Martin-Brualla, Noah Snavely, and Thomas Funkhouser. Ibrnet: Learning multi-view image-based rendering. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4690–4699, Nashville, TN, USA, 2021. IEEE

work page 2021
[17]

Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields

Jonathan T Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P Srinivasan. Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields. InProceedings of the IEEE/CVF international conference on computer vision, pages 5855–5864, Montreal, QC, Canada, 2021. IEEE

work page 2021
[18]

3d gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics (TOG), 42(4):139–1, 2023

Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, George Drettakis, et al. 3d gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics (TOG), 42(4):139–1, 2023

work page 2023
[19]

Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d

Jonah Philion and Sanja Fidler. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. InEuropean conference on computer vision, pages 194–210, Glasgow, UK, 2020. Springer

work page 2020
[20]

Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

work page 2020
[21]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, New Orleans, LA, USA, 2022. IEEE

work page 2022
[22]

Adding conditional control to text-to-image diffusion models

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. InProceedings of the IEEE/CVF international conference on computer vision, pages 3836–3847, Paris, France, 2023. IEEE

work page 2023
[23]

T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models

Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, and Ying Shan. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. InProceedings of the AAAI conference on artificial intelligence, volume 38, pages 4296–4304, Vancouver, Canada, 2024. AAAI

work page 2024
[24]

Gligen: Open-set grounded text-to-image generation

Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22511–22521, Vancouver, Canada, 2023. IEEE

work page 2023
[25]

Zero-1-to-3: Zero-shot one image to 3d object

Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3d object. InProceedings of the IEEE/CVF international conference on computer vision, pages 9298–9309, Paris, France, 2023. IEEE

work page 2023
[26]

Syncdreamer: Generating multiview-consistent images from a single-view image

Yuan Liu, Cheng Lin, Zijiao Zeng, Xiaoxiao Long, Lingjie Liu, Taku Komura, and Wenping Wang. Syncdreamer: Generating multiview-consistent images from a single-view image. InInternational Conference on Learning Representations, Vienna, Austria, 2024. http://OpenReview.net

work page 2024
[27]

Mvdiffusion: Enabling holistic multi-view image generation with correspondence-aware diffusion.Advances in Neural Information Processing Systems, 36:51202–51233, 2023

Shitao Tang, Fuyang Zhang, Jiacheng Chen, Peng Wang, and Yasutaka Fu- rukawa. Mvdiffusion: Enabling holistic multi-view image generation with correspondence-aware diffusion.Advances in Neural Information Processing Systems, 36:51202–51233, 2023

work page 2023
[28]

Drivedreamer: Towards real-world-driven world models for autonomous driving

Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, Jiagang Zhu, and Jiwen Lu. Drivedreamer: Towards real-world-driven world models for autonomous driving. InEuropean Conference on Computer Vision, pages 55–72, Milan, Italy,

work page
[29]

Gaia-1: A generative world model for autonomous driving, 2023

Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. Gaia-1: A generative world model for autonomous driving, 2023

work page 2023
[30]

Street-view image gen- eration from a bird’s-eye view layout.IEEE Robotics and Automation Letters, 9(4):3578–3585, 2024

Alexander Swerdlow, Runsheng Xu, and Bolei Zhou. Street-view image gen- eration from a bird’s-eye view layout.IEEE Robotics and Automation Letters, 9(4):3578–3585, 2024

work page 2024
[31]

Magicdrive: Street view generation with diverse 3d geometry control

Ruiyuan Gao, Kai Chen, Enze Xie, Lanqing HONG, Zhenguo Li, Dit-Yan Yeung, and Qiang Xu. Magicdrive: Street view generation with diverse 3d geometry control. InInternational Conference on Learning Representations, Vienna, Austria,

work page
[32]

http://OpenReview.net

work page
[33]

Freevs: Generative view synthesis on free driving trajectory

Qitai Wang, Lue Fan, Yuqi Wang, Yuntao Chen, and Zhaoxiang Zhang. Freevs: Generative view synthesis on free driving trajectory. InInternational Conference on Learning Representations, Singapore EXPO, 2025. http://OpenReview.net

work page 2025
[34]

Score-based generative modeling through stochastic differential equations

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. InInternational Conference on Learning Representations, Virtual, 2021. http://OpenReview.net

work page 2021
[35]

Improved denoising diffusion probabilistic models

Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. InInternational conference on machine learning, pages 8162–8171, Virtual, 2021. ACM New York, NY, USA

work page 2021
[36]

Context encoders: Feature learning by inpainting

Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 2536–2544, Las Vegas, NV, USA, 2016. IEEE

work page 2016
[37]

Vggt: Visual geometry grounded transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rup- precht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5294–5306, Music City Center, Nashville TN, 2025. IEEE

work page 2025
[38]

Panacea: Panoramic and controllable video generation for autonomous driving

Yuqing Wen, Yucheng Zhao, Yingfei Liu, Fan Jia, Yanhui Wang, Chong Luo, Chi Zhang, Tiancai Wang, Xiaoyan Sun, and Xiangyu Zhang. Panacea: Panoramic and controllable video generation for autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6902–6912, Seattle,WA, USA, 2024. IEEE

work page 2024
[39]

Drivingdiffusion: Layout-guided multi- view driving scenarios video generation with latent diffusion model

Xiaofan Li, Yifu Zhang, and Xiaoqing Ye. Drivingdiffusion: Layout-guided multi- view driving scenarios video generation with latent diffusion model. InEuropean Conference on Computer Vision, pages 469–485, Milan, Italy, 2024. Springer

work page 2024
[40]

Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving

Yuqi Wang, Jiawei He, Lue Fan, Hongxin Li, Yuntao Chen, and Zhaoxiang Zhang. Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14749–14759, Seattle,WA, USA,

work page

[1] [1]

Scalability in perception for autonomous driving: Waymo open dataset

Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2446–2454, Seattle, WA, USA, 2020. IEEE

work page 2020

[2] [2]

nuscenes: A multimodal dataset for autonomous driving

Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621– 11631, Seattle, WA, USA, 2020. IEEE

work page 2020

[3] [3]

Stereo magnification: learning view synthesis using multiplane images.ACM Transactions on Graphics (TOG), 37(4):1–12, 2018

Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: learning view synthesis using multiplane images.ACM Transactions on Graphics (TOG), 37(4):1–12, 2018

work page 2018

[4] [4]

Deepstereo: Learning to predict new views from the world’s imagery

John Flynn, Ivan Neulander, James Philbin, and Noah Snavely. Deepstereo: Learning to predict new views from the world’s imagery. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 5515–5524, Las Vegas, NV, USA, 2016. IEEE

work page 2016

[5] [5]

Nerf: Representing scenes as neural radiance fields for view synthesis.Communications of the ACM, 65(1):99–106, 2021

Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis.Communications of the ACM, 65(1):99–106, 2021

work page 2021

[6] [6]

Mip-nerf 360: Unbounded anti-aliased neural radiance fields

Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5470–5479, New Orleans, LA, USA, 2022. IEEE

work page 2022

[7] [7]

Block-nerf: Scalable large scene neural view synthesis

Matthew Tancik, Vincent Casser, Xinchen Yan, Sabeek Pradhan, Ben Mildenhall, Pratul P Srinivasan, Jonathan T Barron, and Henrik Kretzschmar. Block-nerf: Scalable large scene neural view synthesis. InProceedings of the IEEE/CVF 8 Geo-EVS: Geometry-Conditioned Extrapolative View Synthesis for Autonomous Driving conference on computer vision and pattern reco...

work page 2022

[8] [8]

Mega-nerf: Scalable construction of large-scale nerfs for virtual fly-throughs

Haithem Turki, Deva Ramanan, and Mahadev Satyanarayanan. Mega-nerf: Scalable construction of large-scale nerfs for virtual fly-throughs. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12922–12931, New Orleans, LA, USA, 2022. IEEE

work page 2022

[9] [9]

Bungeenerf: Progressive neural radiance field for extreme multi-scale scene rendering

Yuanbo Xiangli, Linning Xu, Xingang Pan, Nanxuan Zhao, Anyi Rao, Christian Theobalt, Bo Dai, and Dahua Lin. Bungeenerf: Progressive neural radiance field for extreme multi-scale scene rendering. InEuropean conference on computer vision, pages 106–122, Tel Aviv, Israel, 2022. Springer

work page 2022

[10] [10]

Neural scene flow fields for space-time view synthesis of dynamic scenes

Zhengqi Li, Simon Niklaus, Noah Snavely, and Oliver Wang. Neural scene flow fields for space-time view synthesis of dynamic scenes. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6498–6508, Nashville, TN, USA, 2021. IEEE

work page 2021

[11] [11]

Raft-3d: Scene flow using rigid-motion embed- dings

Zachary Teed and Jia Deng. Raft-3d: Scene flow using rigid-motion embed- dings. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8375–8384, Nashville, TN, USA, 2021. IEEE

work page 2021

[12] [12]

Suds: Scalable urban dynamic scenes

Haithem Turki, Jason Y Zhang, Francesco Ferroni, and Deva Ramanan. Suds: Scalable urban dynamic scenes. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12375–12385, Vancouver, Canada,

work page

[13] [13]

Emernerf: Emergent spatial-temporal scene decomposition via self-supervision

Jiawei Yang, Boris Ivanovic, Or Litany, Xinshuo Weng, Seung Wook Kim, Boyi Li, Tong Che, Danfei Xu, Sanja Fidler, Marco Pavone, et al. Emernerf: Emergent spatial-temporal scene decomposition via self-supervision. InInternational Con- ference on Learning Representations, Vienna, Austria, 2024. http://OpenReview.net

work page 2024

[14] [14]

Drivinggaussian: Composite gaussian splatting for surrounding dynamic autonomous driving scenes

Xiaoyu Zhou, Zhiwei Lin, Xiaojun Shan, Yongtao Wang, Deqing Sun, and Ming- Hsuan Yang. Drivinggaussian: Composite gaussian splatting for surrounding dynamic autonomous driving scenes. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21634–21643, Vancouver, Canada,

work page

[15] [15]

Street gaussians: Modeling dy- namic urban scenes with gaussian splatting

Yunzhi Yan, Haotong Lin, Chenxu Zhou, Weijie Wang, Haiyang Sun, Kun Zhan, Xianpeng Lang, Xiaowei Zhou, and Sida Peng. Street gaussians: Modeling dy- namic urban scenes with gaussian splatting. InEuropean Conference on Computer Vision, pages 156–173, Milan, Italy, 2024. Springer

work page 2024

[16] [16]

Ibrnet: Learning multi-view image-based rendering

Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul P Srinivasan, Howard Zhou, Jonathan T Barron, Ricardo Martin-Brualla, Noah Snavely, and Thomas Funkhouser. Ibrnet: Learning multi-view image-based rendering. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4690–4699, Nashville, TN, USA, 2021. IEEE

work page 2021

[17] [17]

Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields

Jonathan T Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P Srinivasan. Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields. InProceedings of the IEEE/CVF international conference on computer vision, pages 5855–5864, Montreal, QC, Canada, 2021. IEEE

work page 2021

[18] [18]

3d gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics (TOG), 42(4):139–1, 2023

Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, George Drettakis, et al. 3d gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics (TOG), 42(4):139–1, 2023

work page 2023

[19] [19]

Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d

Jonah Philion and Sanja Fidler. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. InEuropean conference on computer vision, pages 194–210, Glasgow, UK, 2020. Springer

work page 2020

[20] [20]

Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

work page 2020

[21] [21]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, New Orleans, LA, USA, 2022. IEEE

work page 2022

[22] [22]

Adding conditional control to text-to-image diffusion models

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. InProceedings of the IEEE/CVF international conference on computer vision, pages 3836–3847, Paris, France, 2023. IEEE

work page 2023

[23] [23]

T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models

Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, and Ying Shan. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. InProceedings of the AAAI conference on artificial intelligence, volume 38, pages 4296–4304, Vancouver, Canada, 2024. AAAI

work page 2024

[24] [24]

Gligen: Open-set grounded text-to-image generation

Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22511–22521, Vancouver, Canada, 2023. IEEE

work page 2023

[25] [25]

Zero-1-to-3: Zero-shot one image to 3d object

Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3d object. InProceedings of the IEEE/CVF international conference on computer vision, pages 9298–9309, Paris, France, 2023. IEEE

work page 2023

[26] [26]

Syncdreamer: Generating multiview-consistent images from a single-view image

Yuan Liu, Cheng Lin, Zijiao Zeng, Xiaoxiao Long, Lingjie Liu, Taku Komura, and Wenping Wang. Syncdreamer: Generating multiview-consistent images from a single-view image. InInternational Conference on Learning Representations, Vienna, Austria, 2024. http://OpenReview.net

work page 2024

[27] [27]

Mvdiffusion: Enabling holistic multi-view image generation with correspondence-aware diffusion.Advances in Neural Information Processing Systems, 36:51202–51233, 2023

Shitao Tang, Fuyang Zhang, Jiacheng Chen, Peng Wang, and Yasutaka Fu- rukawa. Mvdiffusion: Enabling holistic multi-view image generation with correspondence-aware diffusion.Advances in Neural Information Processing Systems, 36:51202–51233, 2023

work page 2023

[28] [28]

Drivedreamer: Towards real-world-driven world models for autonomous driving

Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, Jiagang Zhu, and Jiwen Lu. Drivedreamer: Towards real-world-driven world models for autonomous driving. InEuropean Conference on Computer Vision, pages 55–72, Milan, Italy,

work page

[29] [29]

Gaia-1: A generative world model for autonomous driving, 2023

Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. Gaia-1: A generative world model for autonomous driving, 2023

work page 2023

[30] [30]

Street-view image gen- eration from a bird’s-eye view layout.IEEE Robotics and Automation Letters, 9(4):3578–3585, 2024

Alexander Swerdlow, Runsheng Xu, and Bolei Zhou. Street-view image gen- eration from a bird’s-eye view layout.IEEE Robotics and Automation Letters, 9(4):3578–3585, 2024

work page 2024

[31] [31]

Magicdrive: Street view generation with diverse 3d geometry control

Ruiyuan Gao, Kai Chen, Enze Xie, Lanqing HONG, Zhenguo Li, Dit-Yan Yeung, and Qiang Xu. Magicdrive: Street view generation with diverse 3d geometry control. InInternational Conference on Learning Representations, Vienna, Austria,

work page

[32] [32]

http://OpenReview.net

work page

[33] [33]

Freevs: Generative view synthesis on free driving trajectory

Qitai Wang, Lue Fan, Yuqi Wang, Yuntao Chen, and Zhaoxiang Zhang. Freevs: Generative view synthesis on free driving trajectory. InInternational Conference on Learning Representations, Singapore EXPO, 2025. http://OpenReview.net

work page 2025

[34] [34]

Score-based generative modeling through stochastic differential equations

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. InInternational Conference on Learning Representations, Virtual, 2021. http://OpenReview.net

work page 2021

[35] [35]

Improved denoising diffusion probabilistic models

Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. InInternational conference on machine learning, pages 8162–8171, Virtual, 2021. ACM New York, NY, USA

work page 2021

[36] [36]

Context encoders: Feature learning by inpainting

Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 2536–2544, Las Vegas, NV, USA, 2016. IEEE

work page 2016

[37] [37]

Vggt: Visual geometry grounded transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rup- precht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5294–5306, Music City Center, Nashville TN, 2025. IEEE

work page 2025

[38] [38]

Panacea: Panoramic and controllable video generation for autonomous driving

Yuqing Wen, Yucheng Zhao, Yingfei Liu, Fan Jia, Yanhui Wang, Chong Luo, Chi Zhang, Tiancai Wang, Xiaoyan Sun, and Xiangyu Zhang. Panacea: Panoramic and controllable video generation for autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6902–6912, Seattle,WA, USA, 2024. IEEE

work page 2024

[39] [39]

Drivingdiffusion: Layout-guided multi- view driving scenarios video generation with latent diffusion model

Xiaofan Li, Yifu Zhang, and Xiaoqing Ye. Drivingdiffusion: Layout-guided multi- view driving scenarios video generation with latent diffusion model. InEuropean Conference on Computer Vision, pages 469–485, Milan, Italy, 2024. Springer

work page 2024

[40] [40]

Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving

Yuqi Wang, Jiawei He, Lue Fan, Hongxin Li, Yuntao Chen, and Zhaoxiang Zhang. Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14749–14759, Seattle,WA, USA,

work page