arxiv: 2605.22420 · v1 · pith:7IVIUCQXnew · submitted 2026-05-21 · 💻 cs.CV · cs.AI · cs.RO

Diffusion-guided Generalizable Enhancer for Urban Scene Reconstruction

Pith reviewed 2026-05-22 07:13 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.RO
keywords urban scene reconstruction · 3D Gaussian representation · diffusion models · viewpoint generalization · neural rendering · autonomous driving simulation · sensor simulation

The pith

GenRe uses a diffusion model to enhance any pretrained 3D Gaussian urban scene so it renders accurately from new viewpoints.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Urban scene reconstruction from real drives works well along the original camera path but breaks when the viewpoint shifts, such as during a lane change. This limits its use for closed-loop self-driving simulation that requires arbitrary camera positions. The paper introduces GenRe, which learns to apply diffusion-based generative priors across many different scenes. It takes any existing 3D Gaussian model and corrects its shortcomings in minutes without per-scene retraining. The result is a representation that stays high-quality and stable even at previously unseen angles.

Core claim

GenRe is a diffusion-guided generalizable enhancer that accepts any pretrained 3D Gaussian representation of an urban scene and repairs its deficiencies within a few minutes. By distilling generative priors learned across diverse scenes, GenRe yields robust, high-fidelity representations that generalize reliably to challenging unseen viewpoints such as lane changes.

What carries the argument

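GenRe pairs a diffusion-based 2D fixer (FNet), which repairs renders at shifted viewpoints, with a generalizable 3D enhancer (ENet), which predicts per-Gaussian residuals so that the fixes are distilled back into a given 3D Gaussian representation.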
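The Figure 4 caption describes the enhancer as an iterative update: at step t it takes the current Gaussians G_t and their rendering-loss gradients ∇G_t, predicts residuals ΔG_t, and sets G_{t+1} = G_t + ΔG_t. The following is a minimal sketch of that loop shape only. The flat parameter vector, the squared-error loss, and `toy_enhancer` are hypothetical stand-ins chosen so the example runs; the paper's ENet is a learned network and its losses come from a differentiable rasterizer.

```python
from typing import Callable, List

Vector = List[float]


def render_loss(gaussians: Vector, target: Vector) -> float:
    """Toy stand-in for a rendering loss: 0.5 * sum of squared errors."""
    return 0.5 * sum((g - t) ** 2 for g, t in zip(gaussians, target))


def render_loss_grad(gaussians: Vector, target: Vector) -> Vector:
    """Gradient of the toy loss with respect to each parameter."""
    return [g - t for g, t in zip(gaussians, target)]


def toy_enhancer(gaussians: Vector, grads: Vector) -> Vector:
    """Hypothetical stand-in for ENet: maps (G_t, grad G_t) to residuals.

    A learned network would go here; this fixed rule steps against the
    gradient so that the loop is runnable.
    """
    return [-0.5 * d for d in grads]


def refine(
    gaussians: Vector,
    target: Vector,
    enhancer: Callable[[Vector, Vector], Vector],
    steps: int,
) -> Vector:
    """Apply G_{t+1} = G_t + enhancer(G_t, grad G_t) for a fixed number of steps."""
    current = list(gaussians)
    for _ in range(steps):
        grads = render_loss_grad(current, target)
        residual = enhancer(current, grads)
        current = [g + r for g, r in zip(current, residual)]
    return current


initial = [0.0, 2.0, -1.0]
target = [1.0, 1.0, 1.0]
refined = refine(initial, target, toy_enhancer, steps=5)
print(render_loss(initial, target), render_loss(refined, target))
```

The design point the sketch illustrates is that the update rule is a function of both the current state and its gradient, so a single trained predictor can be reused across scenes instead of running a per-scene optimizer.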

If this is right

  • High-fidelity rendering becomes feasible for large viewpoint changes such as lane shifts without retraining per scene.
  • Enhancement time drops to minutes rather than the costly optimization required by prior methods.
  • Downstream tasks including sensor simulation for autonomous driving receive more stable and accurate scene models.
  • The approach scales to many environments because one trained enhancer works on varied pretrained Gaussians.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same cross-scene distillation idea could be tested on other 3D representations beyond Gaussians to check transferability.
  • A general enhancer of this form might shorten the overall pipeline by allowing quick initial reconstructions followed by one-shot correction.
  • Further experiments with extreme viewpoint shifts or novel scene types would help map where the learned priors stop generalizing.

Load-bearing premise

A diffusion model trained to distill priors across diverse scenes can fix deficiencies in any new pretrained 3D Gaussian representation without requiring per-scene optimization or fine-tuning.

What would settle it

On a held-out urban scene, apply GenRe to a pretrained Gaussian representation and render images from a large lateral viewpoint shift. If the outputs show clear artifacts, or score markedly worse against ground truth than a per-scene optimized baseline does, the generalization claim fails.
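One way to make this test concrete is a simple fidelity comparison. The sketch below assumes ground-truth images exist for the shifted viewpoint and uses PSNR as the fidelity measure; both are assumptions of this example, as are the function names and the 1 dB margin. Images are represented as flat lists of pixel values in [0, 1] to keep the example dependency-free.

```python
import math
from typing import Sequence


def psnr(rendered: Sequence[float], reference: Sequence[float], max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel arrays."""
    if len(rendered) != len(reference) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((r - g) ** 2 for r, g in zip(rendered, reference)) / len(reference)
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


def generalization_claim_fails(
    enhanced: Sequence[float],
    baseline: Sequence[float],
    ground_truth: Sequence[float],
    margin_db: float = 1.0,  # hypothetical tolerance, not taken from the paper
) -> bool:
    """True when the per-scene baseline beats the enhanced render by more than margin_db."""
    return psnr(baseline, ground_truth) - psnr(enhanced, ground_truth) > margin_db


ground_truth = [0.5, 0.5, 0.5, 0.5]
enhanced = [0.5, 0.5, 0.5, 0.6]   # small error in one pixel
baseline = [0.5, 0.5, 0.7, 0.3]   # larger errors in two pixels
print(generalization_claim_fails(enhanced, baseline, ground_truth))
```

In practice a perceptual metric would likely be reported alongside PSNR, since pixel-wise error can miss hallucinated but plausible-looking content.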

Figures

Figures reproduced from arXiv: 2605.22420 by Henry Che, Jingkang Wang, Raquel Urtasun, Sivabalan Manivasagam, Yun Chen, Ze Yang.

Figure 1. We introduce GenRe, a novel diffusion-guided generalizable enhancer for urban scene reconstruction. GenRe takes as input any pretrained 3D Gaussian representation and fixes the deficiencies within minutes, producing robust, high-fidelity reconstructions that render reliably at novel viewpoints.
Figure 2. GenRe pipeline for urban scene reconstruction. GenRe is composed of three steps. First, any 3DGS-based reconstruction methods are used to obtain an initial representation. Then, we render at novel viewpoint (e.g., 3m shifts) and adopt a diffusion-based neural fixer FNet (Sec. III-B) to fix the degraded artifacts. Finally, we leverage a generalizable enhancer ENet (Sec. III-C) that predicts per-Gaussian res…
Figure 3. 2D neural fixer (FNet) overview. FNet takes a 3DGS-rendered view Ĩ, conditions on the reference image I_ref and the rendered LiDAR map I_lidar, and produces the fixed image I_fixed. We fine-tune FNet from the pre-trained single-step diffusion model SD-Turbo [23]. Given the camera projection matrix Π, the 3D Gaussians are projected onto the image plane and rasterized into per-ray fragments. After depth sortin…
Figure 4. Generalizable 3D enhancer (ENet) overview. ENet iteratively refines a 3DGS scene using rendering-guided gradients. At iteration t, ENet takes the current 3D Gaussians G_t and per-Gaussian gradients ∇G_t (from rendering loss) and predicts residuals ΔG_t to update the scene to G_{t+1}. Source and novel views are compared with ground-truth I and fixed targets I_fixed to compute losses L_src(Ĩ_src, I) and L_novel(Ĩ_no…
Figure 5. Qualitative comparison to state-of-the-art neural reconstruction methods under large extrapolation. Our method yields higher realism, fewer artifacts.
Figure 6. Qualitative comparison to state-of-the-art 2D neural fixers.
Figure 9. GenRe+ shows minimal detection and segmentation domain gap.

TABLE VII. DOWNSTREAM DOMAIN GAP EVALUATION.

| Methods      | Detection AP ↑ | Detection Recall ↑ | Detection IoU ↑ | Segmentation AP ↑ | Segmentation Recall ↑ | Segmentation IoU ↑ |
| 3DGS [12]    | 0.560          | 0.376              | 0.505           | 0.558             | 0.375                 | 0.501              |
| Difix3D [30] | 0.670          | 0.434              | 0.611           | 0.670             | 0.434                 | 0.598              |
| GenRe+       | 0.785          | 0.607              | 0.728           | 0.768             | 0.596                 | 0.723              |

TABLE VIII. DOWNSTREAM TRAINING WITH DATA AUGMENTATION. Columns: Methods, mAP ↑, AP@1m ↑, AP@2m ↑, AP@4m ↑. First row: Real 0.…
Figure 8. GenRe+ can support diverse variants for reactive log replay, such as dynamic actor removals, actors insertions, and actors manipulation.
Adjacent paper text: … back). Each rollout starts from a lateral offset of 3 m, and all synthetic scenarios are manually vetted for plausibility. We report image quality (FID) against baselines. As shown in Tab. VI and …
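To read the Table VII numbers reproduced with Figure 9, it can help to compute the absolute gaps between rows. The sketch below does this for the detection columns only, using the values as they appear in the caption; the dictionary layout and the `absolute_gain` helper are illustrative names, not anything from the paper.

```python
from typing import Dict

# Detection columns of Table VII, transcribed from the Figure 9 caption.
detection: Dict[str, Dict[str, float]] = {
    "3DGS": {"AP": 0.560, "Recall": 0.376, "IoU": 0.505},
    "Difix3D": {"AP": 0.670, "Recall": 0.434, "IoU": 0.611},
    "GenRe+": {"AP": 0.785, "Recall": 0.607, "IoU": 0.728},
}


def absolute_gain(table: Dict[str, Dict[str, float]], method: str, baseline: str) -> Dict[str, float]:
    """Per-metric difference (method minus baseline), rounded to three decimals."""
    return {
        metric: round(table[method][metric] - table[baseline][metric], 3)
        for metric in table[method]
    }


print(absolute_gain(detection, "GenRe+", "Difix3D"))
print(absolute_gain(detection, "GenRe+", "3DGS"))
```

On these figures the largest gap between GenRe+ and Difix3D is in recall rather than AP or IoU.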
Original abstract

Urban scene reconstruction from real-world observations has emerged as a powerful tool for self-driving development and testing. While current neural rendering approaches achieve high-fidelity rendering along the recorded trajectories, their quality degrades significantly under large viewpoint shifts, limiting the applicability for closed-loop simulation. Recent works have shown promising results in using diffusion models to enhance quality at these challenging viewpoints and distill improvements back into 3D representations. However, they often require costly per-scene optimization, and the distilled representations remain fragile and fail to generalize beyond limited synthesized views. To address these limitations, we propose GenRe, a novel diffusion-guided generalizable enhancer for urban scene reconstruction. GenRe takes as input any pretrained 3D Gaussian representation and fixes the deficiencies within a few minutes. By learning to distill generative priors across diverse scenes, GenRe produces robust and high-fidelity representation efficiently that generalizes reliably to challenging unseen viewpoints (e.g., lane change). Experiments show that GenRe outperforms existing methods in both quality and efficiency and benefits various downstream tasks, enabling robust and scalable sensor simulation for autonomous driving.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces GenRe, a diffusion-guided generalizable enhancer for urban scene reconstruction. It takes any pretrained 3D Gaussian Splatting (3DGS) representation as input and applies a diffusion model trained once across diverse scenes to distill generative priors, fixing deficiencies in a few minutes without per-scene optimization or fine-tuning. The resulting representations are claimed to be robust and high-fidelity, generalizing reliably to large unseen viewpoint shifts (e.g., lane changes) that degrade standard neural rendering methods. Experiments are reported to show improvements over prior approaches in both quality and efficiency, with benefits for downstream tasks in autonomous driving sensor simulation.

Significance. If the generalization and efficiency claims hold under rigorous testing, this would be a useful contribution to neural rendering for urban environments. Removing the need for costly per-scene optimization while improving robustness to viewpoint changes could support more scalable closed-loop simulation for self-driving development, addressing a practical bottleneck in current 3D reconstruction pipelines.

major comments (2)
  1. [§3] §3 (Method, diffusion enhancer training): The central claim that a single cross-scene diffusion model can reliably detect and correct arbitrary scene-specific deficiencies in any pretrained 3DGS (without hidden adaptation or per-scene fine-tuning) is load-bearing for the generalization result. The manuscript should include a concrete analysis or failure-case experiments showing behavior when the input 3DGS contains artifacts outside the training distribution, such as novel lighting, sensor noise, or geometry errors, to substantiate that the distilled output remains consistent.
  2. [§4.3] §4.3 (Generalization experiments, lane-change viewpoint results): The reported gains in rendering quality for large viewpoint shifts rely on the diffusion step producing geometry and texture that can be consistently distilled back into the 3D representation. Additional controls are needed to isolate whether improvements stem from the diffusion priors or from implicit scene-specific cues in the training data, since hallucinated inconsistencies would directly undermine the 'generalizes reliably' assertion.
minor comments (2)
  1. [Abstract] The abstract states that GenRe 'outperforms existing methods in both quality and efficiency,' but the main text should explicitly list the quantitative metrics (e.g., PSNR, SSIM, LPIPS) and the exact baselines used in the primary comparison table for immediate clarity.
  2. [§3] Notation for the diffusion conditioning and the 3DGS-to-image projection step could be made more explicit in the method diagram and equations to aid reproducibility.
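For context on the projection step mentioned in the last comment: the Figure 3 caption says Gaussians are projected with the camera matrix Π, rasterized into per-ray fragments, and depth sorted. In standard 3D Gaussian splatting [12], the sorted fragments are then blended front to back, with each fragment's contribution weighted by its opacity times the transmittance left over from the fragments in front of it. The sketch below shows that generic compositing rule for a single ray and a single color channel. The `Fragment` tuple layout and function name are assumptions of this example, and the paper's exact notation may differ.

```python
from typing import List, Tuple

# (depth, alpha, color) for one Gaussian's contribution to one ray.
Fragment = Tuple[float, float, float]


def composite_ray(fragments: List[Fragment]) -> float:
    """Front-to-back alpha compositing of fragments along one ray.

    Fragments are sorted by depth first, so the caller may pass them in any order.
    """
    color = 0.0
    transmittance = 1.0  # fraction of light not yet absorbed by nearer fragments
    for _, alpha, fragment_color in sorted(fragments, key=lambda f: f[0]):
        color += transmittance * alpha * fragment_color
        transmittance *= 1.0 - alpha
    return color


# A half-transparent dark fragment in front of a half-transparent bright one.
ray = [(2.0, 0.5, 1.0), (1.0, 0.5, 0.0)]
print(composite_ray(ray))
```

Writing the step out this way makes the dependence on depth order explicit, which is the part of the pipeline where geometry errors at shifted viewpoints would show up as visible artifacts.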

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments. We have carefully considered each point and revised the manuscript to strengthen the claims regarding generalization and robustness. Below we provide point-by-point responses.

  1. Referee: [§3] §3 (Method, diffusion enhancer training): The central claim that a single cross-scene diffusion model can reliably detect and correct arbitrary scene-specific deficiencies in any pretrained 3DGS (without hidden adaptation or per-scene fine-tuning) is load-bearing for the generalization result. The manuscript should include a concrete analysis or failure-case experiments showing behavior when the input 3DGS contains artifacts outside the training distribution, such as novel lighting, sensor noise, or geometry errors, to substantiate that the distilled output remains consistent.

    Authors: We agree that explicit validation on out-of-distribution artifacts is necessary to support the central claim. Although our cross-scene training already exposes the model to varied urban conditions, we will add a new failure-case analysis subsection to §3 in the revised manuscript. This will include controlled experiments injecting novel lighting variations, sensor noise, and geometry errors into input 3DGS representations, with both qualitative renderings and quantitative metrics (PSNR, SSIM, LPIPS) demonstrating the consistency of the distilled outputs and any observed limitations. revision: yes

  2. Referee: [§4.3] §4.3 (Generalization experiments, lane-change viewpoint results): The reported gains in rendering quality for large viewpoint shifts rely on the diffusion step producing geometry and texture that can be consistently distilled back into the 3D representation. Additional controls are needed to isolate whether improvements stem from the diffusion priors or from implicit scene-specific cues in the training data, since hallucinated inconsistencies would directly undermine the 'generalizes reliably' assertion.

    Authors: We appreciate the need to isolate the source of improvements. Our diffusion model is trained once across multiple diverse scenes with no per-scene adaptation or embeddings, which already minimizes scene-specific cues. To further address this, we will add an ablation study to the revised §4.3 comparing the cross-scene GenRe against a scene-specific fine-tuned variant on the lane-change viewpoint tests. This will quantify the generalization benefit attributable to the shared priors. We will also include additional multi-view consistency checks to directly evaluate potential hallucinated inconsistencies, reporting any cases where artifacts appear. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

The paper proposes GenRe, a method that takes any pretrained 3D Gaussian representation as input and applies a diffusion model trained across diverse scenes to enhance it for better generalization to unseen viewpoints. No equations, derivations, or parameter-fitting procedures are described in the provided text that would reduce a claimed prediction or result to a quantity defined by the paper's own inputs or outputs. The approach relies on external pretrained diffusion models and 3D representations, with claims of efficiency and robustness supported by experimental outcomes rather than tautological constructions or self-referential definitions. This matches the absence of any load-bearing self-citations or ansatz smuggling in the abstract and description.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only the abstract was available, so the ledger is limited to the core domain assumption stated in the proposal.

axioms (1)
  • domain assumption: Diffusion models trained on diverse urban scenes can provide transferable generative priors that fix deficiencies in any input 3D Gaussian representation.
    This assumption underpins the claim that a single enhancer works across scenes without per-scene optimization.

pith-pipeline@v0.9.0 · 5732 in / 1153 out tokens · 29474 ms · 2026-05-22T07:13:19.910449+00:00 · methodology

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages

  [1] Yun Chen, Jingkang Wang, Ze Yang, Sivabalan Manivasagam, and Raquel Urtasun. G3r: Gradient guided generalizable reconstruction. In ECCV, 2024

  [2] Yurui Chen, Chun Gu, Junzhe Jiang, Xiatian Zhu, and Li Zhang. Periodic vibration gaussian: Dynamic urban scene reconstruction and real-time rendering. arXiv, 2023

  [3] Yutong Chen, Marko Mihajlovic, Xiyi Chen, Yiming Wang, Sergey Prokudin, and Siyu Tang. Splatformer: Point transformer for robust 3d gaussian splatting. In ICLR, 2025

  [4] Ziyu Chen, Jiawei Yang, Jiahui Huang, Riccardo de Lutio, Janick Martinez Esturo, Boris Ivanovic, Or Litany, Zan Gojcic, Sanja Fidler, Marco Pavone, et al. Omnire: Omni urban scene reconstruction. In ICLR, 2024

  [5] Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, Eli VanderBilt, Aniruddha Kembhavi, Carl Vondrick, Georgia Gkioxari, Kiana Ehsani, Ludwig Schmidt, and Ali Farhadi. Objaverse-xl: A universe of 10m+ 3d objects. arXiv, 2023

  [6] Alexey Dosovitskiy, Germán Ros, Felipe Codevilla, Antonio M. López, and Vladlen Koltun. CARLA: an open urban driving simulator. In CoRL, 2017

  [7] Lue Fan, Hao Zhang, Qitai Wang, Hongsheng Li, and Zhaoxiang Zhang. Freesim: Toward free-viewpoint camera simulation in driving scenes. In CVPR, 2025

  [8] Jiazhe Guo, Yikang Ding, Xiwu Chen, Shuo Chen, Bohan Li, Yingshuang Zou, Xiaoyang Lyu, Feiyang Tan, Xiaojuan Qi, Zhiheng Li, and Hao Zhao. Dist-4d: Disentangled spatiotemporal diffusion with metric depth for 4d driving scene generation. arXiv, 2025

  [9] Georg Hess, Carl Lindström, Maryam Fatemi, Christoffer Petersson, and Lennart Svensson. Splatad: Real-time lidar and camera rendering with 3d gaussian splatting for autonomous driving. In CVPR, 2025

  [10] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In ICLR, 2022

  [11] Sungwon Hwang, Min-Jung Kim, Taewoong Kang, Jayeon Kang, and Jaegul Choo. Vegs: View extrapolation of urban scenes in 3d gaussian splatting using learned priors. In ECCV, 2024

  [12] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D gaussian splatting for real-time radiance field rendering. In TOG, 2023

  [13] Xiaoxiao Long, Yuan-Chen Guo, Cheng Lin, Yuan Liu, Zhiyang Dou, Lingjie Liu, Yuexin Ma, Song-Hai Zhang, Marc Habermann, Christian Theobalt, et al. Wonder3d: Single image to 3d using cross-domain diffusion. In CVPR, 2024

  [14] Sivabalan Manivasagam, Shenlong Wang, Kelvin Wong, Wenyuan Zeng, Mikita Sazanovich, Shuhan Tan, Bin Yang, Wei-Chiu Ma, and Raquel Urtasun. Lidarsim: Realistic lidar simulation by leveraging the real world. In CVPR, 2020

  [15] Jiageng Mao, Boyi Li, Boris Ivanovic, Yuxiao Chen, Yan Wang, Yurong You, Chaowei Xiao, Danfei Xu, Marco Pavone, and Yue Wang. Dreamdrive: Generative 4d scene modeling from street view images. In ICRA, 2025

  [16] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020

  [17] Chaojun Ni, Guosheng Zhao, Xiaofeng Wang, Zheng Zhu, Wenkang Qin, Guan Huang, Chen Liu, Yuyin Chen, Yida Wang, Xueyang Zhang, Yifei Zhan, Kun Zhan, Peng Jia, Xianpeng Lang, Xingang Wang, and Wenjun Mei. Recondreamer: Crafting world models for driving scene reconstruction via online restoration. arXiv, 2024

  [18] Gaurav Parmar, Taesung Park, Srinivasa Narasimhan, and Jun-Yan Zhu. One-step image translation with text-to-image models. arXiv, 2024

  [19] Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On aliased resizing and surprising subtleties in gan evaluation. In CVPR, 2022

  [20] Chensheng Peng, Chengwei Zhang, Yixiao Wang, Chenfeng Xu, Yichen Xie, Wenzhao Zheng, Kurt Keutzer, Masayoshi Tomizuka, and Wei Zhan. Desire-gs: 4d street gaussians for static-dynamic decomposition and surface reconstruction for urban driving scenes. In CVPR, 2025

  [21] Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. In ICLR, 2023

  [22] Xuanchi Ren, Yifan Lu, Hanxue Liang, Jay Zhangjie Wu, Huan Ling, Mike Chen, Sanja Fidler, Francis Williams, and Jiahui Huang. Scube: Instant large-scale scene reconstruction using voxsplats. In NeurIPS, 2024

  [23] Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation. arXiv, 2023

  [24] Shital Shah, Debadeepta Dey, Chris Lovett, and Ashish Kapoor. Airsim: High-fidelity visual and physical simulation for autonomous vehicles. In Field and service robotics, 2018

  [25] Adam Tonderski, Carl Lindström, Georg Hess, William Ljungbergh, Lennart Svensson, and Christoffer Petersson. NeuRAD: Neural rendering for autonomous driving. In CVPR, 2024

  [26] Jingkang Wang, Henry Che, Yun Chen, Ze Yang, Lily Goli, Sivabalan Manivasagam, and Raquel Urtasun. Flux4d: Flow-based unsupervised 4d reconstruction. In NeurIPS, 2025

  [27] Jingkang Wang, Sivabalan Manivasagam, Yun Chen, Ze Yang, Ioan Andrei Bârsan, Anqi Joyce Yang, Wei-Chiu Ma, and Raquel Urtasun. Cadsim: Robust and scalable in-the-wild 3d reconstruction for controllable sensor simulation. In CoRL, 2022

  [28] Jingkang Wang, Ava Pun, James Tu, Sivabalan Manivasagam, Abbas Sadat, Sergio Casas, Mengye Ren, and Raquel Urtasun. Advsim: Generating safety-critical scenarios for self-driving vehicles. In CVPR, 2021

  [29] Qitai Wang, Lue Fan, Yuqi Wang, Yuntao Chen, and Zhaoxiang Zhang. Freevs: Generative view synthesis on free driving trajectory. In ICLR, 2025

  [30] Jay Zhangjie Wu, Yuxuan Zhang, Haithem Turki, Xuanchi Ren, Jun Gao, Mike Zheng Shou, Sanja Fidler, Zan Gojcic, and Huan Ling. Difix3d+: Improving 3d reconstructions with single-step diffusion models. In CVPR, 2025

  [31] Rundi Wu, Ben Mildenhall, Philipp Henzler, Keunhong Park, Ruiqi Gao, Daniel Watson, Pratul P Srinivasan, Dor Verbin, Jonathan T Barron, Ben Poole, et al. Reconfusion: 3d reconstruction with diffusion priors. In CVPR, 2024

  [32] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019

  [33] Pengchuan Xiao, Zhenlei Shao, Steven Hao, Zishuo Zhang, Xiaolin Chai, Judy Jiao, Zesong Li, Jian Wu, Kai Sun, Kun Jiang, et al. Pandaset: Advanced sensor suite dataset for autonomous driving. In ITSC, 2021

  [34] Yunzhi Yan, Haotong Lin, Chenxu Zhou, Weijie Wang, Haiyang Sun, Kun Zhan, Xianpeng Lang, Xiaowei Zhou, and Sida Peng. Street gaussians: Modeling dynamic urban scenes with gaussian splatting. In ECCV, 2024

  [35] Yunzhi Yan, Zhen Xu, Haotong Lin, Haian Jin, Haoyu Guo, Yida Wang, Kun Zhan, Xianpeng Lang, Hujun Bao, Xiaowei Zhou, and Sida Peng. Streetcrafter: Street view synthesis with controllable video diffusion models. In CVPR, 2025

  [36] Chenyu Yang, Yuntao Chen, Haofei Tian, Chenxin Tao, Xizhou Zhu, Zhaoxiang Zhang, Gao Huang, Hongyang Li, Y. Qiao, Lewei Lu, Jie Zhou, and Jifeng Dai. Bevformer v2: Adapting modern image backbones to bird's-eye-view recognition via perspective supervision. arXiv, 2022

  [37] Ze Yang, Yun Chen, Jingkang Wang, Sivabalan Manivasagam, Wei-Chiu Ma, Anqi Joyce Yang, and Raquel Urtasun. Unisim: A neural closed-loop sensor simulator. In CVPR, 2023

  [38] Ze Yang, Jingkang Wang, Haowei Zhang, Sivabalan Manivasagam, Yun Chen, and Raquel Urtasun. Genassets: Generating in-the-wild 3d assets in latent space. In CVPR, 2025

  [39] Guosheng Zhao, Xiaofeng Wang, Chaojun Ni, Zheng Zhu, Wenkang Qin, Guan Huang, and Xingang Wang. Recondreamer++: Harmonizing generative and reconstructive models for driving scene representation. arXiv, 2025

  [40] Yingshuang Zou, Yikang Ding, Chuanrui Zhang, Jiazhe Guo, Bohan Li, Xiaoyang Lyu, Feiyang Tan, Xiaojuan Qi, and Haoqian Wang. Mudg: Taming multi-modal diffusion with gaussian splatting for urban scene reconstruction. arXiv, 2025