Diffusion-guided Generalizable Enhancer for Urban Scene Reconstruction

Henry Che; Jingkang Wang; Raquel Urtasun; Sivabalan Manivasagam; Yun Chen; Ze Yang

arXiv: 2605.22420 · v1 · pith:7IVIUCQXnew · submitted 2026-05-21 · 💻 cs.CV · cs.AI · cs.RO

Pith reviewed 2026-05-22 07:13 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.RO

keywords urban scene reconstruction · 3D Gaussian representation · diffusion models · viewpoint generalization · neural rendering · autonomous driving simulation · sensor simulation

The pith

GenRe uses a diffusion model to enhance any pretrained 3D Gaussian urban scene so it renders accurately from new viewpoints.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Urban scene reconstruction from real drives works well along the original camera path but breaks when the viewpoint shifts, such as during a lane change. This limits its use for closed-loop self-driving simulation that requires arbitrary camera positions. The paper introduces GenRe, which learns to apply diffusion-based generative priors across many different scenes. It takes any existing 3D Gaussian model and corrects its shortcomings in minutes without per-scene retraining. The result is a representation that stays high-quality and stable even at previously unseen angles.

Core claim

GenRe is a diffusion-guided generalizable enhancer that accepts any pretrained 3D Gaussian representation of an urban scene and repairs its deficiencies within a few minutes. By distilling generative priors learned across diverse scenes, GenRe yields robust, high-fidelity representations that generalize reliably to challenging unseen viewpoints such as lane changes.

What carries the argument

GenRe, which distills generative priors from a diffusion model trained on many scenes to correct and generalize a given 3D Gaussian representation.

If this is right

High-fidelity rendering becomes feasible for large viewpoint changes such as lane shifts without retraining per scene.
Enhancement time drops to minutes rather than the costly optimization required by prior methods.
Downstream tasks including sensor simulation for autonomous driving receive more stable and accurate scene models.
The approach scales to many environments because one trained enhancer works on varied pretrained Gaussians.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same cross-scene distillation idea could be tested on other 3D representations beyond Gaussians to check transferability.
A general enhancer of this form might shorten the overall pipeline by allowing quick initial reconstructions followed by one-shot correction.
Further experiments with extreme viewpoint shifts or novel scene types would help map where the learned priors stop generalizing.

Load-bearing premise

A diffusion model trained to distill priors across diverse scenes can fix deficiencies in any new pretrained 3D Gaussian representation without requiring per-scene optimization or fine-tuning.

What would settle it

On a held-out urban scene, apply GenRe to a pretrained Gaussian and render images from a large lateral viewpoint shift; if the output images show clear artifacts or lower fidelity than ground truth while a per-scene optimized baseline succeeds, the generalization claim fails.
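
The check described above can be sketched in code. This is a minimal illustration, not the paper's evaluation protocol: it assumes a 4x4 camera-to-world pose whose first rotation column is the camera's right axis (a common but not universal convention), and `shift_laterally` and `psnr` are hypothetical helper names. Rendering with GenRe or a baseline is left out; only the viewpoint shift and one fidelity metric are shown.

```python
import numpy as np

def shift_laterally(cam_to_world: np.ndarray, offset_m: float) -> np.ndarray:
    """Return a copy of a 4x4 camera-to-world pose moved along the camera's right axis.

    Assumption: column 0 of the rotation block is the camera's right direction.
    """
    shifted = cam_to_world.copy()
    right_axis = cam_to_world[:3, 0]
    shifted[:3, 3] += offset_m * right_axis
    return shifted

def psnr(rendered: np.ndarray, reference: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for images with values in [0, max_val]."""
    diff = rendered.astype(np.float64) - reference.astype(np.float64)
    mse = float(np.mean(diff ** 2))
    if mse == 0.0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))

# Usage: move an identity pose 3 m sideways, a lane-change-sized offset.
pose = np.eye(4)
shifted_pose = shift_laterally(pose, 3.0)
```

In a real test, images rendered at `shifted_pose` by the enhanced scene and by a per-scene optimized baseline would each be scored against held-out ground truth, where such ground truth exists for the shifted view.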

Figures

Figures reproduced from arXiv: 2605.22420 by Henry Che, Jingkang Wang, Raquel Urtasun, Sivabalan Manivasagam, Yun Chen, Ze Yang.

**Figure 1.** We introduce GenRe, a novel diffusion-guided generalizable enhancer for urban scene reconstruction. GenRe takes as input any pretrained 3D Gaussian representation and fixes the deficiencies within minutes, producing robust, high-fidelity reconstructions that render reliably at novel viewpoints.

**Figure 2.** GenRe pipeline for urban scene reconstruction. GenRe is composed of three steps. First, any 3DGS-based reconstruction method is used to obtain an initial representation. Then, we render at novel viewpoints (e.g., 3 m shifts) and adopt a diffusion-based neural fixer, FNet (Sec. III-B), to fix the degraded artifacts. Finally, we leverage a generalizable enhancer, ENet (Sec. III-C), that predicts per-Gaussian res…
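
The three steps in the Figure 2 caption can be read as a simple data flow. The skeleton below is only a sketch under stated assumptions: `enhance_scene` and all of its parameters are hypothetical names, every component is injected by the caller, and the fixer is simplified to a one-argument function even though the paper's FNet also conditions on a reference image and a LiDAR map.

```python
from typing import Callable, Sequence, TypeVar

Scene = TypeVar("Scene")  # stand-in for a pretrained 3D Gaussian scene
Image = TypeVar("Image")
Pose = TypeVar("Pose")

def enhance_scene(
    scene: Scene,
    novel_poses: Sequence[Pose],
    render: Callable[[Scene, Pose], Image],
    fix_image: Callable[[Image], Image],  # role played by FNet (simplified)
    refine: Callable[[Scene, Sequence[Pose], Sequence[Image]], Scene],  # role played by ENet
) -> Scene:
    """Steps 2 and 3 of the caption; step 1 (initial reconstruction) produces `scene`."""
    degraded = [render(scene, pose) for pose in novel_poses]  # render at shifted views
    targets = [fix_image(image) for image in degraded]        # repair artifacts in 2D
    return refine(scene, novel_poses, targets)                # lift the fixes back into 3D

# Toy usage with numbers standing in for scenes, poses, and images.
result = enhance_scene(
    1.0,
    [1.0, 2.0],
    render=lambda s, p: s + p,
    fix_image=lambda img: img * 2,
    refine=lambda s, poses, targets: s + sum(targets),
)
```

Passing the components in as callables keeps the sketch independent of any particular renderer or network.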

**Figure 3.** 2D neural fixer (FNet) overview. FNet takes a 3DGS-rendered view $\tilde{I}$, conditions on the reference image $I_{\text{ref}}$ and the rendered LiDAR map $I_{\text{lidar}}$, and produces the fixed image $I_{\text{fixed}}$. We fine-tune FNet from the pre-trained single-step diffusion model SD-Turbo [23]. Given the camera projection matrix $\Pi$, the 3D Gaussians are projected onto the image plane and rasterized into per-ray fragments. After depth sortin…
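
The projection and depth-sorting steps mentioned in the Figure 3 caption can be illustrated with a short NumPy sketch. The caption is truncated after the sorting step, so the compositing shown here is an assumption: it is the standard front-to-back alpha blending used in 3D Gaussian splatting [12], not a confirmed detail of this paper. The function names are hypothetical, and the projection assumes a 3x4 matrix whose third output row is the camera-space depth.

```python
import numpy as np

def project_points(P: np.ndarray, centers: np.ndarray):
    """Project N x 3 world points with a 3 x 4 matrix P; return N x 2 pixels and N depths."""
    ones = np.ones((centers.shape[0], 1))
    homogeneous = np.hstack([centers, ones])
    cam = homogeneous @ P.T              # rows are (u * z, v * z, z)
    depth = cam[:, 2]
    pixels = cam[:, :2] / depth[:, None]
    return pixels, depth

def composite_ray(colors: np.ndarray, alphas: np.ndarray, depths: np.ndarray) -> np.ndarray:
    """Front-to-back alpha compositing of the fragments that fall on one ray."""
    order = np.argsort(depths)           # nearest fragment first
    out = np.zeros(colors.shape[1])
    transmittance = 1.0
    for i in order:
        out += transmittance * alphas[i] * colors[i]
        transmittance *= 1.0 - alphas[i]
    return out
```

A full rasterizer would also evaluate each Gaussian's 2D footprint to obtain the per-fragment alpha; here the alphas are taken as given.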

**Figure 4.** Generalizable 3D enhancer (ENet) overview. ENet iteratively refines a 3DGS scene using rendering-guided gradients. At iteration $t$, ENet takes the current 3D Gaussians $G_t$ and per-Gaussian gradients $\nabla G_t$ (from rendering loss) and predicts residuals $\Delta G_t$ to update the scene to $G_{t+1}$. Source and novel views are compared with ground-truth $I$ and fixed targets $I_{\text{fixed}}$ to compute losses $L_{\text{src}}(\tilde{I}_{\text{src}}, I)$ and $L_{\text{novel}}(\tilde{I}_{\text{no}}$…
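
The update described in the Figure 4 caption, where a predicted residual moves the scene from one iterate to the next, can be written as a short unrolled loop. The sketch below is illustrative only: `unrolled_refine` is a hypothetical name, the Gaussians are reduced to a plain parameter array, and the toy "enhancer" is a scaled negative gradient standing in for the learned network.

```python
import numpy as np
from typing import Callable

def unrolled_refine(
    gaussians: np.ndarray,
    grad_fn: Callable[[np.ndarray], np.ndarray],
    enhancer: Callable[[np.ndarray, np.ndarray], np.ndarray],
    num_iters: int,
) -> np.ndarray:
    """Apply G_{t+1} = G_t + enhancer(G_t, grad(G_t)) for num_iters steps."""
    g = gaussians.copy()
    for _ in range(num_iters):
        grad = grad_fn(g)          # per-Gaussian gradient of a rendering loss
        g = g + enhancer(g, grad)  # predicted residual updates the scene
    return g

# Toy stand-ins (assumptions): a quadratic loss toward a target and a fixed step.
target = np.array([[1.0, 2.0, 3.0]])
toy_grad = lambda g: g - target              # gradient of 0.5 * ||g - target||^2
toy_enhancer = lambda g, grad: -0.5 * grad   # a learned network would replace this
refined = unrolled_refine(np.zeros((1, 3)), toy_grad, toy_enhancer, num_iters=4)
```

With this toy step the error halves at every iteration; a learned enhancer would instead be trained so that a few unrolled steps are enough to reach a good scene.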

**Figure 5.** Qualitative comparison to state-of-the-art neural reconstruction methods under large extrapolation. Our method yields higher realism and fewer artifacts.

**Figure 6.** Qualitative comparison to state-of-the-art 2D neural fixers.

**Figure 9.** GenRe+ shows minimal detection and segmentation domain gap.

Table VII. Downstream domain gap evaluation.

| Methods | Detection AP ↑ | Detection Recall ↑ | Detection IoU ↑ | Segmentation AP ↑ | Segmentation Recall ↑ | Segmentation IoU ↑ |
|---|---|---|---|---|---|---|
| 3DGS [12] | 0.560 | 0.376 | 0.505 | 0.558 | 0.375 | 0.501 |
| Difix3D [30] | 0.670 | 0.434 | 0.611 | 0.670 | 0.434 | 0.598 |
| GenRe+ | 0.785 | 0.607 | 0.728 | 0.768 | 0.596 | 0.723 |

Table VIII. Downstream training with data augmentation. Columns: Methods, mAP ↑, AP@1m ↑, AP@2m ↑, AP@4m ↑. First row: Real 0.…
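
The Table VII scores quoted above can be compared directly. The snippet below is a small worked-arithmetic sketch: the numbers are copied from the caption, while the dictionary layout, the metric keys, and the `gain` helper are hypothetical conveniences introduced here.

```python
# Detection and segmentation scores as quoted from Table VII above.
table_vii = {
    "3DGS": {"det_ap": 0.560, "det_recall": 0.376, "det_iou": 0.505,
             "seg_ap": 0.558, "seg_recall": 0.375, "seg_iou": 0.501},
    "Difix3D": {"det_ap": 0.670, "det_recall": 0.434, "det_iou": 0.611,
                "seg_ap": 0.670, "seg_recall": 0.434, "seg_iou": 0.598},
    "GenRe+": {"det_ap": 0.785, "det_recall": 0.607, "det_iou": 0.728,
               "seg_ap": 0.768, "seg_recall": 0.596, "seg_iou": 0.723},
}

def gain(metric: str, method: str = "GenRe+", baseline: str = "Difix3D") -> float:
    """Absolute difference between two methods on one metric, rounded to 3 decimals."""
    return round(table_vii[method][metric] - table_vii[baseline][metric], 3)

# Absolute gains of GenRe+ over each baseline on detection AP.
gain_over_difix = gain("det_ap")
gain_over_3dgs = gain("det_ap", baseline="3DGS")
```

These are plain differences of the reported values and say nothing about statistical significance.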

**Figure 8.** GenRe+ can support diverse variants for reactive log replay, such as dynamic actor removal, actor insertion, and actor manipulation.

… back). Each rollout starts from a lateral offset of 3 m, and all synthetic scenarios are manually vetted for plausibility. We report image quality (FID) against baselines. As shown in Tab. VI and …

Original abstract

Urban scene reconstruction from real-world observations has emerged as a powerful tool for self-driving development and testing. While current neural rendering approaches achieve high-fidelity rendering along the recorded trajectories, their quality degrades significantly under large viewpoint shifts, limiting the applicability for closed-loop simulation. Recent works have shown promising results in using diffusion models to enhance quality at these challenging viewpoints and distill improvements back into 3D representations. However, they often require costly per-scene optimization, and the distilled representations remain fragile and fail to generalize beyond limited synthesized views. To address these limitations, we propose GenRe, a novel diffusion-guided generalizable enhancer for urban scene reconstruction. GenRe takes as input any pretrained 3D Gaussian representation and fixes the deficiencies within a few minutes. By learning to distill generative priors across diverse scenes, GenRe produces robust and high-fidelity representation efficiently that generalizes reliably to challenging unseen viewpoints (e.g., lane change). Experiments show that GenRe outperforms existing methods in both quality and efficiency and benefits various downstream tasks, enabling robust and scalable sensor simulation for autonomous driving.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

GenRe claims a single diffusion enhancer can fix arbitrary pretrained 3D Gaussians across urban scenes in minutes and generalize to large viewpoint shifts, but this rests on priors being complete enough to handle unseen deficiencies without adaptation.

The key takeaway is that GenRe uses a diffusion model trained across scenes to enhance pretrained 3D Gaussian representations quickly and make them generalize better to unseen viewpoints in urban driving scenes. This moves past the common per-scene optimization by distilling generative priors from multiple scenes into a single enhancer. It takes any 3DGS input and improves it in minutes, targeting issues like quality drop under large shifts such as lane changes.

The paper does well in identifying the fragility of prior distilled representations and proposing a more scalable alternative for sensor simulation in autonomous driving. The approach builds on established 3D Gaussian splatting and diffusion techniques without introducing overly complex new machinery. If the experiments demonstrate consistent gains in fidelity and efficiency over baselines, that would be a practical step forward.

A potential soft spot is whether the cross-scene priors can handle deficiencies in arbitrary pretrained models without some form of adaptation. The concern that novel artifacts outside the training distribution might lead to inconsistent fixes or hallucinations is worth checking. The abstract asserts robust generalization, but the lack of visible details on ablations or failure modes makes it hard to assess how well this holds in practice. The soundness feels limited until the quantitative results are examined closely.

This work is for researchers and engineers focused on neural rendering and simulation for self-driving vehicles. A reader dealing with viewpoint generalization in 3D reconstruction would find value in the framing and the efficiency claims.

I would recommend putting it through peer review. The problem it tackles is relevant, the idea has some novelty in its generalizable setup, and referees can evaluate whether the evidence supports the central claims about reliable fixes without per-scene work.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces GenRe, a diffusion-guided generalizable enhancer for urban scene reconstruction. It takes any pretrained 3D Gaussian Splatting (3DGS) representation as input and applies a diffusion model trained once across diverse scenes to distill generative priors, fixing deficiencies in a few minutes without per-scene optimization or fine-tuning. The resulting representations are claimed to be robust and high-fidelity, generalizing reliably to large unseen viewpoint shifts (e.g., lane changes) that degrade standard neural rendering methods. Experiments are reported to show improvements over prior approaches in both quality and efficiency, with benefits for downstream tasks in autonomous driving sensor simulation.

Significance. If the generalization and efficiency claims hold under rigorous testing, this would be a useful contribution to neural rendering for urban environments. Removing the need for costly per-scene optimization while improving robustness to viewpoint changes could support more scalable closed-loop simulation for self-driving development, addressing a practical bottleneck in current 3D reconstruction pipelines.

major comments (2)

[§3] Method, diffusion enhancer training: The central claim that a single cross-scene diffusion model can reliably detect and correct arbitrary scene-specific deficiencies in any pretrained 3DGS (without hidden adaptation or per-scene fine-tuning) is load-bearing for the generalization result. The manuscript should include a concrete analysis or failure-case experiments showing behavior when the input 3DGS contains artifacts outside the training distribution, such as novel lighting, sensor noise, or geometry errors, to substantiate that the distilled output remains consistent.
[§4.3] Generalization experiments, lane-change viewpoint results: The reported gains in rendering quality for large viewpoint shifts rely on the diffusion step producing geometry and texture that can be consistently distilled back into the 3D representation. Additional controls are needed to isolate whether improvements stem from the diffusion priors or from implicit scene-specific cues in the training data; hallucinated inconsistencies, if present, would directly undermine the 'generalizes reliably' assertion.

minor comments (2)

[Abstract] The abstract states that GenRe 'outperforms existing methods in both quality and efficiency,' but the main text should explicitly list the quantitative metrics (e.g., PSNR, SSIM, LPIPS) and the exact baselines used in the primary comparison table for immediate clarity.
[§3] Notation for the diffusion conditioning and the 3DGS-to-image projection step could be made more explicit in the method diagram and equations to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments. We have carefully considered each point and revised the manuscript to strengthen the claims regarding generalization and robustness. Below we provide point-by-point responses.

Referee: [§3] Method, diffusion enhancer training: The central claim that a single cross-scene diffusion model can reliably detect and correct arbitrary scene-specific deficiencies in any pretrained 3DGS (without hidden adaptation or per-scene fine-tuning) is load-bearing for the generalization result. The manuscript should include a concrete analysis or failure-case experiments showing behavior when the input 3DGS contains artifacts outside the training distribution, such as novel lighting, sensor noise, or geometry errors, to substantiate that the distilled output remains consistent.

Authors: We agree that explicit validation on out-of-distribution artifacts is necessary to support the central claim. Although our cross-scene training already exposes the model to varied urban conditions, we will add a new failure-case analysis subsection to §3 in the revised manuscript. This will include controlled experiments injecting novel lighting variations, sensor noise, and geometry errors into input 3DGS representations, with both qualitative renderings and quantitative metrics (PSNR, SSIM, LPIPS) demonstrating the consistency of the distilled outputs and any observed limitations.

Revision: yes

Referee: [§4.3] Generalization experiments, lane-change viewpoint results: The reported gains in rendering quality for large viewpoint shifts rely on the diffusion step producing geometry and texture that can be consistently distilled back into the 3D representation. Additional controls are needed to isolate whether improvements stem from the diffusion priors or from implicit scene-specific cues in the training data; hallucinated inconsistencies, if present, would directly undermine the 'generalizes reliably' assertion.

Authors: We appreciate the need to isolate the source of improvements. Our diffusion model is trained once across multiple diverse scenes with no per-scene adaptation or embeddings, which already minimizes scene-specific cues. To further address this, we will add an ablation study to the revised §4.3 comparing the cross-scene GenRe against a scene-specific fine-tuned variant on the lane-change viewpoint tests. This will quantify the generalization benefit attributable to the shared priors. We will also include additional multi-view consistency checks to directly evaluate potential hallucinated inconsistencies, reporting any cases where artifacts appear.

Revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

The paper proposes GenRe, a method that takes any pretrained 3D Gaussian representation as input and applies a diffusion model trained across diverse scenes to enhance it for better generalization to unseen viewpoints. No equations, derivations, or parameter-fitting procedures are described in the provided text that would reduce a claimed prediction or result to a quantity defined by the paper's own inputs or outputs. The approach relies on external pretrained diffusion models and 3D representations, with claims of efficiency and robustness supported by experimental outcomes rather than tautological constructions or self-referential definitions. This matches the absence of any load-bearing self-citations or ansatz smuggling in the abstract and description.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Only abstract available, so ledger is limited to the core domain assumption stated in the proposal.

axioms (1)

Domain assumption: Diffusion models trained on diverse urban scenes can provide transferable generative priors that fix deficiencies in any input 3D Gaussian representation. This assumption underpins the claim that a single enhancer works across scenes without per-scene optimization.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: GenRe takes as input any pretrained 3D Gaussian representation and fixes the deficiencies within a few minutes... one-step diffusion neural fixer... generalizable enhancer network (ENet) that predicts per-Gaussian residuals

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We unroll the enhancer for T iterations... Sparse UNet as the 3D enhancer network

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages

[1] Yun Chen, Jingkang Wang, Ze Yang, Sivabalan Manivasagam, and Raquel Urtasun. G3r: Gradient guided generalizable reconstruction. In ECCV, 2025.

[2] Yurui Chen, Chun Gu, Junzhe Jiang, Xiatian Zhu, and Li Zhang. Periodic vibration gaussian: Dynamic urban scene reconstruction and real-time rendering. arXiv, 2023.

[3] Yutong Chen, Marko Mihajlovic, Xiyi Chen, Yiming Wang, Sergey Prokudin, and Siyu Tang. Splatformer: Point transformer for robust 3d gaussian splatting. In ICLR, 2025.

[4] Ziyu Chen, Jiawei Yang, Jiahui Huang, Riccardo de Lutio, Janick Martinez Esturo, Boris Ivanovic, Or Litany, Zan Gojcic, Sanja Fidler, Marco Pavone, et al. Omnire: Omni urban scene reconstruction. In ICLR, 2024.

[5] Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, Eli VanderBilt, Aniruddha Kembhavi, Carl Vondrick, Georgia Gkioxari, Kiana Ehsani, Ludwig Schmidt, and Ali Farhadi. Objaverse-xl: A universe of 10m+ 3d objects. arXiv, 2023.

[6] Alexey Dosovitskiy, Germán Ros, Felipe Codevilla, Antonio M. López, and Vladlen Koltun. CARLA: an open urban driving simulator. In CoRL, 2017.

[7] Lue Fan, Hao Zhang, Qitai Wang, Hongsheng Li, and Zhaoxiang Zhang. Freesim: Toward free-viewpoint camera simulation in driving scenes. In CVPR, 2025.

[8] Jiazhe Guo, Yikang Ding, Xiwu Chen, Shuo Chen, Bohan Li, Yingshuang Zou, Xiaoyang Lyu, Feiyang Tan, Xiaojuan Qi, Zhiheng Li, and Hao Zhao. Dist-4d: Disentangled spatiotemporal diffusion with metric depth for 4d driving scene generation. arXiv, 2025.

[9] Georg Hess, Carl Lindström, Maryam Fatemi, Christoffer Petersson, and Lennart Svensson. Splatad: Real-time lidar and camera rendering with 3d gaussian splatting for autonomous driving. In CVPR, 2025.

[10] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In ICLR, 2022.

[11] Sungwon Hwang, Min-Jung Kim, Taewoong Kang, Jayeon Kang, and Jaegul Choo. Vegs: View extrapolation of urban scenes in 3d gaussian splatting using learned priors. In ECCV, 2024.

[12] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D gaussian splatting for real-time radiance field rendering. In TOG, 2023.

[13] Xiaoxiao Long, Yuan-Chen Guo, Cheng Lin, Yuan Liu, Zhiyang Dou, Lingjie Liu, Yuexin Ma, Song-Hai Zhang, Marc Habermann, Christian Theobalt, et al. Wonder3d: Single image to 3d using cross-domain diffusion. In CVPR, 2024.

[14] Sivabalan Manivasagam, Shenlong Wang, Kelvin Wong, Wenyuan Zeng, Mikita Sazanovich, Shuhan Tan, Bin Yang, Wei-Chiu Ma, and Raquel Urtasun. Lidarsim: Realistic lidar simulation by leveraging the real world. In CVPR, 2020.

[15] Jiageng Mao, Boyi Li, Boris Ivanovic, Yuxiao Chen, Yan Wang, Yurong You, Chaowei Xiao, Danfei Xu, Marco Pavone, and Yue Wang. Dreamdrive: Generative 4d scene modeling from street view images. In ICRA, 2025.

[16] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.

[17] Chaojun Ni, Guosheng Zhao, Xiaofeng Wang, Zheng Zhu, Wenkang Qin, Guan Huang, Chen Liu, Yuyin Chen, Yida Wang, Xueyang Zhang, Yifei Zhan, Kun Zhan, Peng Jia, Xianpeng Lang, Xingang Wang, and Wenjun Mei. Recondreamer: Crafting world models for driving scene reconstruction via online restoration. arXiv, 2024.

[18] Gaurav Parmar, Taesung Park, Srinivasa Narasimhan, and Jun-Yan Zhu. One-step image translation with text-to-image models. arXiv, 2024.

[19] Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On aliased resizing and surprising subtleties in gan evaluation. In CVPR, 2022.

[20] Chensheng Peng, Chengwei Zhang, Yixiao Wang, Chenfeng Xu, Yichen Xie, Wenzhao Zheng, Kurt Keutzer, Masayoshi Tomizuka, and Wei Zhan. Desire-gs: 4d street gaussians for static-dynamic decomposition and surface reconstruction for urban driving scenes. In CVPR, 2025.

[21] Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. In ICLR, 2023.

[22] Xuanchi Ren, Yifan Lu, Hanxue Liang, Jay Zhangjie Wu, Huan Ling, Mike Chen, Sanja Fidler, Francis Williams, and Jiahui Huang. Scube: Instant large-scale scene reconstruction using voxsplats. In NeurIPS, 2024.

[23] Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation. arXiv, 2023.

[24] Shital Shah, Debadeepta Dey, Chris Lovett, and Ashish Kapoor. Airsim: High-fidelity visual and physical simulation for autonomous vehicles. In Field and Service Robotics, 2018.

[25] Adam Tonderski, Carl Lindström, Georg Hess, William Ljungbergh, Lennart Svensson, and Christoffer Petersson. NeuRAD: Neural rendering for autonomous driving. In CVPR, 2024.

[26] Jingkang Wang, Henry Che, Yun Chen, Ze Yang, Lily Goli, Sivabalan Manivasagam, and Raquel Urtasun. Flux4d: Flow-based unsupervised 4d reconstruction. In NeurIPS, 2025.

[27] Jingkang Wang, Sivabalan Manivasagam, Yun Chen, Ze Yang, Ioan Andrei Bârsan, Anqi Joyce Yang, Wei-Chiu Ma, and Raquel Urtasun. Cadsim: Robust and scalable in-the-wild 3d reconstruction for controllable sensor simulation. In CoRL, 2022.

[28] Jingkang Wang, Ava Pun, James Tu, Sivabalan Manivasagam, Abbas Sadat, Sergio Casas, Mengye Ren, and Raquel Urtasun. Advsim: Generating safety-critical scenarios for self-driving vehicles. In CVPR, 2021.

[29] Qitai Wang, Lue Fan, Yuqi Wang, Yuntao Chen, and Zhaoxiang Zhang. Freevs: Generative view synthesis on free driving trajectory. In ICLR, 2025.

[30] Jay Zhangjie Wu, Yuxuan Zhang, Haithem Turki, Xuanchi Ren, Jun Gao, Mike Zheng Shou, Sanja Fidler, Zan Gojcic, and Huan Ling. Difix3d+: Improving 3d reconstructions with single-step diffusion models. In CVPR, 2025.

[31] Rundi Wu, Ben Mildenhall, Philipp Henzler, Keunhong Park, Ruiqi Gao, Daniel Watson, Pratul P Srinivasan, Dor Verbin, Jonathan T Barron, Ben Poole, et al. Reconfusion: 3d reconstruction with diffusion priors. In CVPR, 2024.

[32] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.

[33] Pengchuan Xiao, Zhenlei Shao, Steven Hao, Zishuo Zhang, Xiaolin Chai, Judy Jiao, Zesong Li, Jian Wu, Kai Sun, Kun Jiang, et al. Pandaset: Advanced sensor suite dataset for autonomous driving. In ITSC, 2021.

[34] Yunzhi Yan, Haotong Lin, Chenxu Zhou, Weijie Wang, Haiyang Sun, Kun Zhan, Xianpeng Lang, Xiaowei Zhou, and Sida Peng. Street gaussians: Modeling dynamic urban scenes with gaussian splatting. In ECCV, 2024.

[35] Yunzhi Yan, Zhen Xu, Haotong Lin, Haian Jin, Haoyu Guo, Yida Wang, Kun Zhan, Xianpeng Lang, Hujun Bao, Xiaowei Zhou, and Sida Peng. Streetcrafter: Street view synthesis with controllable video diffusion models. In CVPR, 2025.

[36] Chenyu Yang, Yuntao Chen, Haofei Tian, Chenxin Tao, Xizhou Zhu, Zhaoxiang Zhang, Gao Huang, Hongyang Li, Y. Qiao, Lewei Lu, Jie Zhou, and Jifeng Dai. Bevformer v2: Adapting modern image backbones to bird’s-eye-view recognition via perspective supervision. arXiv, 2022.

[37] Ze Yang, Yun Chen, Jingkang Wang, Sivabalan Manivasagam, Wei-Chiu Ma, Anqi Joyce Yang, and Raquel Urtasun. Unisim: A neural closed-loop sensor simulator. In CVPR, 2023.

[38] Ze Yang, Jingkang Wang, Haowei Zhang, Sivabalan Manivasagam, Yun Chen, and Raquel Urtasun. Genassets: Generating in-the-wild 3d assets in latent space. In CVPR, 2025.

[39] Guosheng Zhao, Xiaofeng Wang, Chaojun Ni, Zheng Zhu, Wenkang Qin, Guan Huang, and Xingang Wang. Recondreamer++: Harmonizing generative and reconstructive models for driving scene representation. arXiv, 2025.

[40] Yingshuang Zou, Yikang Ding, Chuanrui Zhang, Jiazhe Guo, Bohan Li, Xiaoyang Lyu, Feiyang Tan, Xiaojuan Qi, and Haoqian Wang. Mudg: Taming multi-modal diffusion with gaussian splatting for urban scene reconstruction. arXiv, 2025.

[1] [1]

G3r: Gradient guided generalizable reconstruction

Yun Chen, Jingkang Wang, Ze Yang, Sivabalan Manivasagam, and Raquel Urtasun. G3r: Gradient guided generalizable reconstruction. In ECCV, 2025

work page 2025

[2] [2]

Periodic vibration gaussian: Dynamic urban scene reconstruction and real-time rendering.arXiv, 2023

Yurui Chen, Chun Gu, Junzhe Jiang, Xiatian Zhu, and Li Zhang. Periodic vibration gaussian: Dynamic urban scene reconstruction and real-time rendering.arXiv, 2023

work page 2023

[3] [3]

Splatformer: Point transformer for robust 3d gaussian splatting

Yutong Chen, Marko Mihajlovic, Xiyi Chen, Yiming Wang, Sergey Prokudin, and Siyu Tang. Splatformer: Point transformer for robust 3d gaussian splatting. InICLR, 2025

work page 2025

[4] [4]

Omnire: Omni urban scene reconstruction

Ziyu Chen, Jiawei Yang, Jiahui Huang, Riccardo de Lutio, Janick Mar- tinez Esturo, Boris Ivanovic, Or Litany, Zan Gojcic, Sanja Fidler, Marco Pavone, et al. Omnire: Omni urban scene reconstruction. InICLR, 2024

work page 2024

[5] [5]

Objaverse-xl: A universe of 10m+ 3d objects.arXiv, 2023

Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram V oleti, Samir Yitzhak Gadre, Eli VanderBilt, Aniruddha Kembhavi, Carl V ondrick, Georgia Gkioxari, Kiana Ehsani, Ludwig Schmidt, and Ali Farhadi. Objaverse-xl: A universe of 10m+ 3d objects.arXiv, 2023

work page 2023

[6] Alexey Dosovitskiy, Germán Ros, Felipe Codevilla, Antonio M. López, and Vladlen Koltun. CARLA: an open urban driving simulator. In CoRL, 2017

[7] Lue Fan, Hao Zhang, Qitai Wang, Hongsheng Li, and Zhaoxiang Zhang. Freesim: Toward free-viewpoint camera simulation in driving scenes. In CVPR, 2025

[8] Jiazhe Guo, Yikang Ding, Xiwu Chen, Shuo Chen, Bohan Li, Yingshuang Zou, Xiaoyang Lyu, Feiyang Tan, Xiaojuan Qi, Zhiheng Li, and Hao Zhao. Dist-4d: Disentangled spatiotemporal diffusion with metric depth for 4d driving scene generation. arXiv, 2025

[9] Georg Hess, Carl Lindström, Maryam Fatemi, Christoffer Petersson, and Lennart Svensson. Splatad: Real-time lidar and camera rendering with 3d gaussian splatting for autonomous driving. In CVPR, 2025

[10] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In ICLR, 2022

[11] Sungwon Hwang, Min-Jung Kim, Taewoong Kang, Jayeon Kang, and Jaegul Choo. Vegs: View extrapolation of urban scenes in 3d gaussian splatting using learned priors. In ECCV, 2024

[12] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D gaussian splatting for real-time radiance field rendering. ACM TOG, 2023

[13] Xiaoxiao Long, Yuan-Chen Guo, Cheng Lin, Yuan Liu, Zhiyang Dou, Lingjie Liu, Yuexin Ma, Song-Hai Zhang, Marc Habermann, Christian Theobalt, et al. Wonder3d: Single image to 3d using cross-domain diffusion. In CVPR, 2024

[14] Sivabalan Manivasagam, Shenlong Wang, Kelvin Wong, Wenyuan Zeng, Mikita Sazanovich, Shuhan Tan, Bin Yang, Wei-Chiu Ma, and Raquel Urtasun. Lidarsim: Realistic lidar simulation by leveraging the real world. In CVPR, 2020

[15] Jiageng Mao, Boyi Li, Boris Ivanovic, Yuxiao Chen, Yan Wang, Yurong You, Chaowei Xiao, Danfei Xu, Marco Pavone, and Yue Wang. Dreamdrive: Generative 4d scene modeling from street view images. In ICRA, 2025

[16] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020

[17] Chaojun Ni, Guosheng Zhao, Xiaofeng Wang, Zheng Zhu, Wenkang Qin, Guan Huang, Chen Liu, Yuyin Chen, Yida Wang, Xueyang Zhang, Yifei Zhan, Kun Zhan, Peng Jia, Xianpeng Lang, Xingang Wang, and Wenjun Mei. Recondreamer: Crafting world models for driving scene reconstruction via online restoration. arXiv, 2024

[18] Gaurav Parmar, Taesung Park, Srinivasa Narasimhan, and Jun-Yan Zhu. One-step image translation with text-to-image models. arXiv, 2024

[19] Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On aliased resizing and surprising subtleties in gan evaluation. In CVPR, 2022

[20] Chensheng Peng, Chengwei Zhang, Yixiao Wang, Chenfeng Xu, Yichen Xie, Wenzhao Zheng, Kurt Keutzer, Masayoshi Tomizuka, and Wei Zhan. Desire-gs: 4d street gaussians for static-dynamic decomposition and surface reconstruction for urban driving scenes. In CVPR, 2025

[21] Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. In ICLR, 2023

[22] Xuanchi Ren, Yifan Lu, Hanxue Liang, Jay Zhangjie Wu, Huan Ling, Mike Chen, Sanja Fidler, Francis Williams, and Jiahui Huang. Scube: Instant large-scale scene reconstruction using voxsplats. In NeurIPS, 2024

[23] Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation. arXiv, 2023

[24] Shital Shah, Debadeepta Dey, Chris Lovett, and Ashish Kapoor. Airsim: High-fidelity visual and physical simulation for autonomous vehicles. In Field and Service Robotics, 2018

[25] Adam Tonderski, Carl Lindström, Georg Hess, William Ljungbergh, Lennart Svensson, and Christoffer Petersson. NeuRAD: Neural rendering for autonomous driving. In CVPR, 2024

[26] Jingkang Wang, Henry Che, Yun Chen, Ze Yang, Lily Goli, Sivabalan Manivasagam, and Raquel Urtasun. Flux4d: Flow-based unsupervised 4d reconstruction. In NeurIPS, 2025

[27] Jingkang Wang, Sivabalan Manivasagam, Yun Chen, Ze Yang, Ioan Andrei Bârsan, Anqi Joyce Yang, Wei-Chiu Ma, and Raquel Urtasun. Cadsim: Robust and scalable in-the-wild 3d reconstruction for controllable sensor simulation. In CoRL, 2022

[28] Jingkang Wang, Ava Pun, James Tu, Sivabalan Manivasagam, Abbas Sadat, Sergio Casas, Mengye Ren, and Raquel Urtasun. Advsim: Generating safety-critical scenarios for self-driving vehicles. In CVPR, 2021

[29] Qitai Wang, Lue Fan, Yuqi Wang, Yuntao Chen, and Zhaoxiang Zhang. Freevs: Generative view synthesis on free driving trajectory. In ICLR, 2025

[30] Jay Zhangjie Wu, Yuxuan Zhang, Haithem Turki, Xuanchi Ren, Jun Gao, Mike Zheng Shou, Sanja Fidler, Zan Gojcic, and Huan Ling. Difix3d+: Improving 3d reconstructions with single-step diffusion models. In CVPR, 2025

[31] Rundi Wu, Ben Mildenhall, Philipp Henzler, Keunhong Park, Ruiqi Gao, Daniel Watson, Pratul P Srinivasan, Dor Verbin, Jonathan T Barron, Ben Poole, et al. Reconfusion: 3d reconstruction with diffusion priors. In CVPR, 2024

[32] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019

[33] Pengchuan Xiao, Zhenlei Shao, Steven Hao, Zishuo Zhang, Xiaolin Chai, Judy Jiao, Zesong Li, Jian Wu, Kai Sun, Kun Jiang, et al. Pandaset: Advanced sensor suite dataset for autonomous driving. In ITSC, 2021

[34] Yunzhi Yan, Haotong Lin, Chenxu Zhou, Weijie Wang, Haiyang Sun, Kun Zhan, Xianpeng Lang, Xiaowei Zhou, and Sida Peng. Street gaussians: Modeling dynamic urban scenes with gaussian splatting. In ECCV, 2024

[35] Yunzhi Yan, Zhen Xu, Haotong Lin, Haian Jin, Haoyu Guo, Yida Wang, Kun Zhan, Xianpeng Lang, Hujun Bao, Xiaowei Zhou, and Sida Peng. Streetcrafter: Street view synthesis with controllable video diffusion models. In CVPR, 2025

[36] Chenyu Yang, Yuntao Chen, Haofei Tian, Chenxin Tao, Xizhou Zhu, Zhaoxiang Zhang, Gao Huang, Hongyang Li, Y. Qiao, Lewei Lu, Jie Zhou, and Jifeng Dai. Bevformer v2: Adapting modern image backbones to bird's-eye-view recognition via perspective supervision. arXiv, 2022

[37] Ze Yang, Yun Chen, Jingkang Wang, Sivabalan Manivasagam, Wei-Chiu Ma, Anqi Joyce Yang, and Raquel Urtasun. Unisim: A neural closed-loop sensor simulator. In CVPR, 2023

[38] Ze Yang, Jingkang Wang, Haowei Zhang, Sivabalan Manivasagam, Yun Chen, and Raquel Urtasun. Genassets: Generating in-the-wild 3d assets in latent space. In CVPR, 2025

[39] Guosheng Zhao, Xiaofeng Wang, Chaojun Ni, Zheng Zhu, Wenkang Qin, Guan Huang, and Xingang Wang. Recondreamer++: Harmonizing generative and reconstructive models for driving scene representation. arXiv, 2025

[40] Yingshuang Zou, Yikang Ding, Chuanrui Zhang, Jiazhe Guo, Bohan Li, Xiaoyang Lyu, Feiyang Tan, Xiaojuan Qi, and Haoqian Wang. Mudg: Taming multi-modal diffusion with gaussian splatting for urban scene reconstruction. arXiv, 2025