GenAssets: Generating in-the-wild 3D Assets in Latent Space

Haowei Zhang; Jingkang Wang; Raquel Urtasun; Sivabalan Manivasagam; Yun Chen; Ze Yang

arxiv: 2604.23010 · v1 · submitted 2026-04-24 · 💻 cs.CV · cs.RO

GenAssets: Generating in-the-wild 3D Assets in Latent Space

Ze Yang , Jingkang Wang , Haowei Zhang , Sivabalan Manivasagam , Yun Chen , Raquel Urtasun This is my paper

Pith reviewed 2026-05-08 12:21 UTC · model grok-4.3

classification 💻 cs.CV cs.RO

keywords 3D asset generationlatent diffusionneural renderingin-the-wild dataautonomous driving simulationLiDAR camera fusionocclusion handlingreconstruct then generate

0 comments

The pith

A 3D latent diffusion model generates complete high-quality assets from sparse in-the-wild driving sensor data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to create diverse 3D assets of traffic participants that have full geometry and appearance, even though the input LiDAR and camera observations come from real driving scenes with limited viewpoints and frequent occlusions. Standard neural reconstruction produces incomplete results that only look good near the original camera positions, while direct diffusion models fail to handle the partial and sparse nature of the data. The solution first trains an occlusion-aware neural renderer across many scenes to embed objects into a clean latent space, then runs a diffusion process inside that space to synthesize new complete assets. This matters for autonomy development because simulation requires large numbers of realistic 3D models that can be rendered from any angle without manual authoring.

Core claim

We propose a 3D latent diffusion model that learns on in-the-wild LiDAR and camera data captured by a sensor platform and generates high-quality 3D assets with complete geometry and appearance. Key to our method is a reconstruct-then-generate approach that first leverages occlusion-aware neural rendering trained over multiple scenes to build a high-quality latent space for objects, and then trains a diffusion model that operates on the latent space. We show our method outperforms existing reconstruction and generation based methods, unlocking diverse and scalable content creation for simulation.

What carries the argument

The reconstruct-then-generate pipeline: occlusion-aware neural rendering trained across multiple scenes to produce a latent space for partially observed objects, followed by a diffusion model that samples complete assets inside that latent space.

If this is right

Generated assets render consistently from arbitrary viewpoints rather than only near the original observations.
The method produces complete geometry and appearance for traffic participants even from single-pass driving captures.
It scales content creation for multi-sensor simulation without requiring dense multi-view captures or manual modeling.
Outperforms both pure neural-rendering reconstruction and standard diffusion generation on in-the-wild driving scenes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same latent-space reconstruction step could be applied to other domains that suffer from sparse, occluded observations such as indoor robotics or aerial mapping.
Once the latent space exists, the diffusion stage could be conditioned on additional attributes like vehicle type or weather to further increase scenario variety in simulation.
Integration of these assets into closed-loop simulators would allow testing of perception and planning modules on far more diverse object configurations than real data alone provides.

Load-bearing premise

An occlusion-aware neural rendering model trained over multiple scenes can reliably construct a high-quality latent space for objects observed under sparse viewpoints and partial occlusions in driving data.

What would settle it

If assets generated by the model, when rendered from completely novel viewpoints far from any training observation, fail to match the appearance and geometry statistics of held-out real sensor captures of the same object categories, the claim that the latent space supports faithful completion would not hold.

Figures

Figures reproduced from arXiv: 2604.23010 by Haowei Zhang, Jingkang Wang, Raquel Urtasun, Sivabalan Manivasagam, Yun Chen, Ze Yang.

**Figure 1.** Figure 1: GenAssets takes in-the-wild camera image(s) and point cloud(s), and automatically reconstruct or generate 360° assets. Our 3D assets are diverse and high-quality with complete geometry and appearance, allowing for realistic and scalable sensor simulation. Abstract High-quality 3D assets for traffic participants are critical for multi-sensor simulation, which is essential for the safe end-to-end development… view at source ↗

**Figure 2.** Figure 2: Learning latent asset representation. We learn a low-dimensional object latent space that generates complete assets by training across multiple scenes via occlusion-aware neural rendering. The asset decoder is trained to map low-dimension latent codes into neural assets which are then composed with learnable per-scene background models to match real-world sensor observations. poorly to unseen viewpoints. N… view at source ↗

**Figure 3.** Figure 3: Left: Training asset diffusion model in latent space. Right: Sampling diffusion model for (un)conditional neural asset generation. guiding the latent space towards a standard normal distribution, similar to [35, 65]: LKL = 1 2 ∥µ 2 i +σ 2 i −1−log(σ 2 i )∥1, where µi and σi represent the mean and standard deviation components of latent code ci , i.e., ci = µ 2 i + σi ⊙ ϵ, with ϵ ∼ N (0, I). This regulari… view at source ↗

**Figure 4.** Figure 4: Top: Sparse view synthesis. GenAssets generalizes well on this extreme setting thanks to low-dimensional latent space learned across many scenes, while the SoTA reconstruction methods are less robust and produce noticeable visual artifacts (e.g., missing, blurry or distorted appearance). Middle: Novel camera synthesis. We train on frames from the front camera and evaluate on frames from the front-left came… view at source ↗

**Figure 5.** Figure 5: Qualitative comparison on unconditional generation. Our methods generates more diverse, complete and higher-quality 3D assets compared to SoTA 3D generative models. Ours MeshLRM Ours MeshLRM Ours MeshLRM view at source ↗

**Figure 6.** Figure 6: Qualitative results on single-image to 3D. 4.4. Applications Conditional Generation: The flexibility of our framework enables various conditional generation tasks. Specifically, we freeze the learned latent codes and train a conditional diffusion model fdiff(c (t) , t, y) using classifier-free guidance. We explore conditioning on fine-grained actor classes and time-of-day (day/night), with results prese… view at source ↗

read the original abstract

High-quality 3D assets for traffic participants are critical for multi-sensor simulation, which is essential for the safe end-to-end development of autonomy. Building assets from in-the-wild data is key for diversity and realism, but existing neural-rendering based reconstruction methods are slow and generate assets that render well only from viewpoints close to the original observations, limiting their usefulness in simulation. Recent diffusion-based generative models build complete and diverse assets, but perform poorly on in-the-wild driving scenes, where observed actors are captured under sparse and limited fields of view, and are partially occluded. In this work, we propose a 3D latent diffusion model that learns on in-the-wild LiDAR and camera data captured by a sensor platform and generates high-quality 3D assets with complete geometry and appearance. Key to our method is a "reconstruct-then-generate" approach that first leverages occlusion-aware neural rendering trained over multiple scenes to build a high-quality latent space for objects, and then trains a diffusion model that operates on the latent space. We show our method outperforms existing reconstruction and generation based methods, unlocking diverse and scalable content creation for simulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper offers a reconstruct-then-generate approach to 3D asset creation for driving scenes but the evaluation details are key to judging its effectiveness.

read the letter

The one thing to know is that this paper puts forward a reconstruct-then-generate pipeline for creating complete 3D assets of traffic participants using data from real driving scenes with LiDAR and cameras. They first train an occlusion-aware neural rendering model across multiple scenes to learn a latent space for objects that can handle partial views and occlusions. Then they train a diffusion model in that latent space to generate new, diverse assets with full geometry and appearance. This stands out because it directly tackles the shortcomings of existing approaches in the context of in-the-wild data. Reconstruction techniques tend to produce assets that only look good from angles close to the original capture, which limits their use in simulation. Generative diffusion models, on the other hand, often fail when the input observations are sparse and occluded, as is common with moving vehicles in driving footage. By building the latent space first in a multi-scene setup, the hope is to get better encodings that allow the diffusion to produce high-quality outputs. The paper does well in framing the problem around the needs of multi-sensor simulation for autonomy development. It's a recognized issue that high-quality, diverse 3D assets are hard to come by at scale, and this tries to make use of the large amounts of fleet data available. Where it gets soft is in the reliance on the neural renderer to produce a latent space that truly captures unobserved geometry rather than filling in with priors. In driving scenes, objects are seen from limited angles and often blocked, so disentangling the object from context is tricky. If that step doesn't work as intended, the generated assets might look complete but not reflect real variations. The abstract claims better performance than baselines, but without the actual numbers, ablations, or details on how they measured completeness and quality, it's difficult to gauge the improvement. This kind of work is for people in computer vision and robotics focused on simulation and asset generation. Readers interested in practical applications for end-to-end autonomy testing would find value in the approach, even if they adapt parts of it. Given that it engages with real limitations in the field and proposes a structured solution, it deserves a serious referee to check the experiments and see if the results support the claims. I would recommend putting it through peer review.

Referee Report

2 major / 0 minor

Summary. The paper proposes GenAssets, a 3D latent diffusion model that learns from in-the-wild LiDAR and camera data captured by a sensor platform. It uses a reconstruct-then-generate pipeline: an occlusion-aware neural rendering model is trained jointly over multiple scenes to construct a latent space for objects observed under sparse viewpoints and partial occlusions, after which a diffusion model operates in that latent space to generate 3D assets with complete geometry and appearance. The abstract asserts that this outperforms existing reconstruction and generation baselines for multi-sensor simulation of traffic participants.

Significance. If the central assumption holds, the approach could enable scalable generation of diverse, complete 3D assets from real driving data, addressing the slowness of per-scene neural reconstruction and the failure of standard diffusion models on limited-view, occluded observations. This would support more realistic simulation for autonomy development.

major comments (2)

[Abstract] Abstract: The claim that the method 'outperforms existing reconstruction and generation based methods' is unsupported by any quantitative metrics, ablation studies, tables, or experimental details. This prevents verification of the central claim.
[Abstract] Abstract (reconstruct-then-generate description): The pipeline assumes that the occlusion-aware neural renderer, trained jointly over multiple scenes, produces object latents encoding complete unobserved geometry and appearance rather than imputing from dataset priors. No analysis, ablations, or evidence is supplied to show that the latent space recovers missing parts from sparse, occluded driving views (<90° total viewpoint range, frequent partial occlusions) instead of collapsing to averages. This assumption is load-bearing, as every generated asset is decoded from samples in this latent space.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important points about supporting claims in the abstract and providing evidence for the core assumptions in our reconstruct-then-generate pipeline. We address each major comment below and commit to revisions that will strengthen the paper without altering its central contributions.

read point-by-point responses

Referee: [Abstract] Abstract: The claim that the method 'outperforms existing reconstruction and generation based methods' is unsupported by any quantitative metrics, ablation studies, tables, or experimental details. This prevents verification of the central claim.

Authors: We agree that the abstract would be strengthened by including specific quantitative support for the performance claim. The full manuscript contains detailed experimental results in Sections 4 and 5, including tables with metrics such as PSNR, IoU for geometry, and FID for appearance, demonstrating consistent improvements over reconstruction and diffusion baselines. To address this directly, we will revise the abstract to incorporate key numerical results (e.g., average gains of X% on primary metrics) while maintaining its concise nature. revision: yes
Referee: [Abstract] Abstract (reconstruct-then-generate description): The pipeline assumes that the occlusion-aware neural renderer, trained jointly over multiple scenes, produces object latents encoding complete unobserved geometry and appearance rather than imputing from dataset priors. No analysis, ablations, or evidence is supplied to show that the latent space recovers missing parts from sparse, occluded driving views (<90° total viewpoint range, frequent partial occlusions) instead of collapsing to averages. This assumption is load-bearing, as every generated asset is decoded from samples in this latent space.

Authors: This is a substantive point about the properties of the learned latent space. Our joint multi-scene training is intended to promote completion of unobserved geometry through shared priors across diverse observations, and we provide supporting evidence via qualitative comparisons and quantitative metrics showing that our generated assets are more complete than per-scene baselines. That said, we acknowledge the absence of targeted analysis isolating recovery of missing parts versus dataset averaging. We will add a dedicated ablation subsection (including visualizations of decoded outputs from progressively sparser/occluded inputs against held-out ground truth) to directly demonstrate the latent space behavior. revision: yes

Circularity Check

0 steps flagged

No circularity: descriptive pipeline with no equations or self-referential reductions

full rationale

The paper describes a reconstruct-then-generate method that first trains an occlusion-aware neural renderer across scenes to produce object latents, then trains a diffusion model on those latents. No equations, derivations, or fitted-parameter predictions appear in the provided text. The central claim is an empirical method statement rather than a mathematical reduction; the neural-rendering step is presented as an enabling component whose validity is external to the diffusion stage. No self-citation chains, ansatzes smuggled via prior work, or renamings of known results are load-bearing for the output. The reader's assessment of score 2.0 is consistent with absence of circular structure.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated beyond standard assumptions of neural rendering and latent diffusion models.

pith-pipeline@v0.9.0 · 5520 in / 1075 out tokens · 31501 ms · 2026-05-08T12:21:54.402363+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

115 extracted references · 1 canonical work pages

[1]

Renderdiffusion: Image diffusion for 3d reconstruction, in- painting and generation

Titas Anciukevi ˇcius, Zexiang Xu, Matthew Fisher, Paul Henderson, Hakan Bilen, Niloy J Mitra, and Paul Guerrero. Renderdiffusion: Image diffusion for 3d reconstruction, in- painting and generation. InCVPR, 2023. 3

2023
[2]

Barron, Ben Mildenhall, Dor Verbin, Pratul P

Jonathan T. Barron, Ben Mildenhall, Dor Verbin, Pratul P. Srinivasan, and Peter Hedman. Mip-nerf 360: Unbounded anti-aliased neural radiance fields.CVPR, 2022. 13

2022
[3]

Zip-nerf: Anti-aliased grid- based neural radiance fields

Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Zip-nerf: Anti-aliased grid- based neural radiance fields. InICCV, 2023. 22

2023
[4]

Gaudi: A neural architect for immersive 3d scene generation.NeurIPS, 2022

Miguel Angel Bautista, Pengsheng Guo, Samira Abnar, Walter Talbott, Alexander Toshev, Zhuoyuan Chen, Lau- rent Dinh, Shuangfei Zhai, Hanlin Goh, Daniel Ulbricht, et al. Gaudi: A neural architect for immersive 3d scene generation.NeurIPS, 2022. 5

2022
[5]

Demystifying mmd gans.arXiv, 2018

Mikołaj Bi ´nkowski, Danica J Sutherland, Michael Arbel, and Arthur Gretton. Demystifying mmd gans.arXiv, 2018. 7, 16

2018
[6]

nuscenes: A multi- modal dataset for autonomous driving

Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multi- modal dataset for autonomous driving. InCVPR, 2020. 8, 18

2020
[7]

Lightplane: Highly-scalable components for neu- ral 3d fields.arXiv, 2024

Ang Cao, Justin Johnson, Andrea Vedaldi, and David Novotny. Lightplane: Highly-scalable components for neu- ral 3d fields.arXiv, 2024. 3

2024
[8]

pi-gan: Periodic implicit genera- tive adversarial networks for 3d-aware image synthesis

Eric R Chan, Marco Monteiro, Petr Kellnhofer, Jiajun Wu, and Gordon Wetzstein. pi-gan: Periodic implicit genera- tive adversarial networks for 3d-aware image synthesis. In CVPR, 2021. 2

2021
[9]

Efficient geometry-aware 3d generative adversarial networks

Eric R Chan, Connor Z Lin, Matthew A Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas J Guibas, Jonathan Tremblay, Sameh Khamis, et al. Efficient geometry-aware 3d generative adversarial networks. InCVPR, 2022. 2, 3, 6, 7, 8, 14, 15, 16

2022
[10]

pixelsplat: 3d gaussian splats from im- age pairs for scalable generalizable 3d reconstruction

David Charatan, Sizhe Lester Li, Andrea Tagliasacchi, and Vincent Sitzmann. pixelsplat: 3d gaussian splats from im- age pairs for scalable generalizable 3d reconstruction. In CVPR, 2024. 6, 7, 14, 15

2024
[11]

Single-stage dif- fusion nerf: A unified approach to 3d generation and recon- struction

Hansheng Chen, Jiatao Gu, Anpei Chen, Wei Tian, Zhuowen Tu, Lingjie Liu, and Hao Su. Single-stage dif- fusion nerf: A unified approach to 3d generation and recon- struction. InICCV, 2023. 2, 3, 5, 6, 7, 8, 15, 16

2023
[12]

Geosim: Realistic video simu- lation via geometry-aware composition for self-driving

Yun Chen, Frieda Rong, Shivam Duggal, Shenlong Wang, Xinchen Yan, Sivabalan Manivasagam, Shangjie Xue, Ersin Yumer, and Raquel Urtasun. Geosim: Realistic video simu- lation via geometry-aware composition for self-driving. In CVPR, 2021. 2

2021
[13]

G3r: Gradient guided gen- eralizable reconstruction

Yun Chen, Jingkang Wang, Ze Yang, Sivabalan Mani- vasagam, and Raquel Urtasun. G3r: Gradient guided gen- eralizable reconstruction. InECCV, 2025. 2, 6, 7, 14, 15

2025
[14]

Omnire: Omni urban scene reconstruction.arXiv, 2024

Ziyu Chen, Jiawei Yang, Jiahui Huang, Riccardo de Lu- tio, Janick Martinez Esturo, Boris Ivanovic, Or Litany, Zan Gojcic, Sanja Fidler, Marco Pavone, et al. Omnire: Omni urban scene reconstruction.arXiv, 2024. 1, 2, 6, 22

2024
[15]

Objaverse: A universe of annotated 3d objects

Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. InCVPR, 2023. 2

2023
[16]

Diffusion models beat gans on image synthesis.NeurIPS, 2021

Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis.NeurIPS, 2021. 5

2021
[17]

CARLA: An open urban driving simulator

Alexey Dosovitskiy, German Ros, Felipe Codevilla, Anto- nio Lopez, and Vladlen Koltun. CARLA: An open urban driving simulator. InCoRL, 2017. 1

2017
[18]

Taming transformers for high-resolution image synthesis

Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. InCVPR,
[19]

Dynamic 3d gaussian fields for urban areas.arXiv, 2024

Tobias Fischer, Jonas Kulhanek, Samuel Rota Bul `o, Lorenzo Porzi, Marc Pollefeys, and Peter Kontschieder. Dynamic 3d gaussian fields for urban areas.arXiv, 2024. 1

2024
[20]

De- tail me more: Improving gan’s photo-realism of complex scenes

Raghudeep Gadde, Qianli Feng, and Aleix M Martinez. De- tail me more: Improving gan’s photo-realism of complex scenes. InICCV, 2021. 5

2021
[21]

Get3d: A generative model of high quality 3d tex- tured shapes learned from images.NeurIPS, 2022

Jun Gao, Tianchang Shen, Zian Wang, Wenzheng Chen, Kangxue Yin, Daiqing Li, Or Litany, Zan Gojcic, and Sanja Fidler. Get3d: A generative model of high quality 3d tex- tured shapes learned from images.NeurIPS, 2022. 2, 3

2022
[22]

Cat3d: Create anything in 3d with multi-view diffusion models.arXiv, 2024

Ruiqi Gao, Aleksander Holynski, Philipp Henzler, Arthur Brussee, Ricardo Martin-Brualla, Pratul Srinivasan, Jonathan T Barron, and Ben Poole. Cat3d: Create anything in 3d with multi-view diffusion models.arXiv, 2024. 3

2024
[23]

Generative adversarial nets.NeurIPS,

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets.NeurIPS,
[24]

Nerfdiff: Single-image view synthesis with nerf-guided distillation from 3d-aware diffusion

Jiatao Gu, Alex Trevithick, Kai-En Lin, Joshua M Susskind, Christian Theobalt, Lingjie Liu, and Ravi Ra- mamoorthi. Nerfdiff: Single-image view synthesis with nerf-guided distillation from 3d-aware diffusion. InICML,
[25]

Gans trained by a two time-scale update rule converge to a local nash equi- librium.NeurIPS, 2017

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equi- librium.NeurIPS, 2017. 6, 7, 16

2017
[26]

Denoising dif- fusion probabilistic models.NeurIPS, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising dif- fusion probabilistic models.NeurIPS, 2020. 3, 5, 13

2020
[27]

LRM: Large reconstruction model for single im- age to 3d.arXiv, 2023

Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. LRM: Large reconstruction model for single im- age to 3d.arXiv, 2023. 2

2023
[28]

Rangeldm: Fast realistic lidar point cloud generation

Qianjiang Hu, Zhimin Zhang, and Wei Hu. Rangeldm: Fast realistic lidar point cloud generation. InECCV, 2025. 3

2025
[29]

VEGS: View extrapolation of ur- ban scenes in 3d gaussian splatting using learned priors

Sungwon Hwang, Min-Jung Kim, Taewoong Kang, Jayeon Kang, and Jaegul Choo. VEGS: View extrapolation of ur- ban scenes in 3d gaussian splatting using learned priors. arXiv, 2024. 2, 3

2024
[30]

Image-to-image translation with conditional adver- sarial networks

Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adver- sarial networks. InCVPR, 2017. 4

2017
[31]

Codenerf: Disentan- gled neural radiance fields for object categories

Wonbong Jang and Lourdes Agapito. Codenerf: Disentan- gled neural radiance fields for object categories. InICCV,
[32]

A style-based generator architecture for generative adversarial networks

Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. InCVPR, 2019. 2

2019
[33]

3d gaussian splatting for real-time radiance field rendering.ACM Trans

Bernhard Kerbl, Georgios Kopanas, Thomas Leimk ¨uhler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Trans. Graph., 2023. 22

2023
[34]

Autosplat: Constrained gaussian splatting for autonomous driving scene reconstruction.arXiv, 2024

Mustafa Khan, Hamidreza Fazlali, Dhruv Sharma, Tong- tong Cao, Dongfeng Bai, Yuan Ren, and Bingbing Liu. Autosplat: Constrained gaussian splatting for autonomous driving scene reconstruction.arXiv, 2024. 1, 2

2024
[35]

Auto-encoding variational bayes

Diederik P Kingma. Auto-encoding variational bayes. arXiv, 2013. 3, 5, 13

2013
[36]

Adam: A method for stochastic optimization.ICLR, 2015

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization.ICLR, 2015. 14

2015
[37]

Segment anything

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. InICCV, 2023. 6, 15, 16

2023
[38]

Instant3d: Fast text-to-3d with sparse-view generation and large reconstruction model

Jiahao Li, Hao Tan, Kai Zhang, Zexiang Xu, Fujun Luan, Yinghao Xu, Yicong Hong, Kalyan Sunkavalli, Greg Shakhnarovich, and Sai Bi. Instant3d: Fast text-to-3d with sparse-view generation and large reconstruction model. arXiv, 2023. 3

2023
[39]

Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers

Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chong- hao Sima, Tong Lu, Yu Qiao, and Jifeng Dai. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. InECCV, 2022. 8

2022
[40]

Magic3d: High- resolution text-to-3d content creation

Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3d: High- resolution text-to-3d content creation. InCVPR, 2023. 2, 3

2023
[41]

Neural scene rasterization for large scene rendering in real time

Jeffrey Yunfan Liu, Yun Chen, Ze Yang, Jingkang Wang, Sivabalan Manivasagam, and Raquel Urtasun. Neural scene rasterization for large scene rendering in real time. InICCV,
[42]

One-2-3-45: Any sin- gle image to 3d mesh in 45 seconds without per-shape op- timization.NeurIPS, 2024

Minghua Liu, Chao Xu, Haian Jin, Linghao Chen, Mukund Varma T, Zexiang Xu, and Hao Su. One-2-3-45: Any sin- gle image to 3d mesh in 45 seconds without per-shape op- timization.NeurIPS, 2024. 3

2024
[43]

Meshformer: High- quality mesh generation with 3d-guided reconstruction model.arXiv, 2024

Minghua Liu, Chong Zeng, Xinyue Wei, Ruoxi Shi, Ling- hao Chen, Chao Xu, Mengqi Zhang, Zhaoning Wang, Xi- aoshuai Zhang, Isabella Liu, et al. Meshformer: High- quality mesh generation with 3d-guided reconstruction model.arXiv, 2024. 20

2024
[44]

Zero-1-to-3: Zero-shot one image to 3d object

Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tok- makov, Sergey Zakharov, and Carl V ondrick. Zero-1-to-3: Zero-shot one image to 3d object. InICCV, 2023. 3

2023
[45]

Meshdif- fusion: Score-based generative 3d mesh modeling.arXiv,

Zhen Liu, Yao Feng, Michael J Black, Derek Nowrouzezahrai, Liam Paull, and Weiyang Liu. Meshdif- fusion: Score-based generative 3d mesh modeling.arXiv,
[46]

Wonder3d: Single image to 3d using cross-domain diffusion

Xiaoxiao Long, Yuan-Chen Guo, Cheng Lin, Yuan Liu, Zhiyang Dou, Lingjie Liu, Yuexin Ma, Song-Hai Zhang, Marc Habermann, Christian Theobalt, et al. Wonder3d: Single image to 3d using cross-domain diffusion. InCVPR,
[47]

Diffusion probabilistic models for 3d point cloud generation

Shitong Luo and Wei Hu. Diffusion probabilistic models for 3d point cloud generation. InCVPR, 2021. 3

2021
[48]

Towards zero domain gap: A comprehensive study of realistic lidar simulation for autonomy testing

Sivabalan Manivasagam, Ioan Andrei B ˆarsan, Jingkang Wang, Ze Yang, and Raquel Urtasun. Towards zero domain gap: A comprehensive study of realistic lidar simulation for autonomy testing. InICCV, 2023. 1

2023
[49]

Lt3sd: Latent trees for 3d scene diffusion.arXiv, 2024

Quan Meng, Lei Li, Matthias Nießner, and Angela Dai. Lt3sd: Latent trees for 3d scene diffusion.arXiv, 2024. 3

2024
[50]

Nerf: Representing scenes as neural radiance fields for view syn- thesis.Communications of the ACM, 2021

Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view syn- thesis.Communications of the ACM, 2021. 2

2021
[51]

Autorf: Learning 3d object radiance fields from single view observations

Norman M ¨uller, Andrea Simonelli, Lorenzo Porzi, Samuel Rota Bul `o, Matthias Nießner, and Peter Kontschieder. Autorf: Learning 3d object radiance fields from single view observations. InCVPR, 2022. 2

2022
[52]

Diffrf: Rendering-guided 3d radiance field diffusion

Norman M ¨uller, Yawar Siddiqui, Lorenzo Porzi, Samuel Rota Bulo, Peter Kontschieder, and Matthias Nießner. Diffrf: Rendering-guided 3d radiance field diffusion. InCVPR, 2023. 2, 3, 5, 6

2023
[53]

Extracting triangular 3d models, materials, and lighting from images

Jacob Munkberg, Jon Hasselgren, Tianchang Shen, Jun Gao, Wenzheng Chen, Alex Evans, Thomas M ¨uller, and Sanja Fidler. Extracting triangular 3d models, materials, and lighting from images. InCVPR, 2022. 2

2022
[54]

Giraffe: Repre- senting scenes as compositional generative neural feature fields

Michael Niemeyer and Andreas Geiger. Giraffe: Repre- senting scenes as compositional generative neural feature fields. InCVPR, 2021. 2

2021
[55]

Au- todecoding latent 3d diffusion models.NeurIPS, 2023

Evangelos Ntavelis, Aliaksandr Siarohin, Kyle Olszewski, Chaoyang Wang, Luc V Gool, and Sergey Tulyakov. Au- todecoding latent 3d diffusion models.NeurIPS, 2023. 2, 3

2023
[56]

Unisurf: Unifying neural implicit surfaces and radiance fields for multi-view reconstruction

Michael Oechsle, Songyou Peng, and Andreas Geiger. Unisurf: Unifying neural implicit surfaces and radiance fields for multi-view reconstruction. InICCV, 2021. 2

2021
[57]

Neural scene graphs for dynamic scenes

Julian Ost, Fahim Mannan, Nils Thuerey, Julian Knodt, and Felix Heide. Neural scene graphs for dynamic scenes. In CVPR, 2021. 2

2021
[58]

SDXL: Improving latent diffusion mod- els for high-resolution image synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M ¨uller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion mod- els for high-resolution image synthesis. InICLR, 2024. 3

2024
[59]

Dreamfusion: Text-to-3d using 2d diffusion.arXiv,

Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Milden- hall. Dreamfusion: Text-to-3d using 2d diffusion.arXiv,
[60]

Neural lighting simulation for urban scenes

Ava Pun, Gary Sun, Jingkang Wang, Yun Chen, Ze Yang, Sivabalan Manivasagam, Wei-Chiu Ma, and Raquel Ur- tasun. Neural lighting simulation for urban scenes. In NeurIPS, 2023. 2, 22

2023
[61]

Towards realis- tic scene generation with lidar diffusion models

Haoxi Ran, Vitor Guizilini, and Yue Wang. Towards realis- tic scene generation with lidar diffusion models. InCVPR,
[62]

Xcube: Large-scale 3d generative modeling using sparse voxel hierarchies

Xuanchi Ren, Jiahui Huang, Xiaohui Zeng, Ken Museth, Sanja Fidler, and Francis Williams. Xcube: Large-scale 3d generative modeling using sparse voxel hierarchies. In CVPR, 2024. 3 10

2024
[63]

SCube: Instant large-scale scene reconstruc- tion using voxsplats.arXiv, 2024

Xuanchi Ren, Yifan Lu, Hanxue Liang, Zhangjie Wu, Huan Ling, Mike Chen, Sanja Fidler, Francis Williams, and Ji- ahui Huang. SCube: Instant large-scale scene reconstruc- tion using voxsplats.arXiv, 2024. 2

2024
[64]

L3dg: Latent 3d gaussian diffusion

Barbara Roessle, Norman M ¨uller, Lorenzo Porzi, Samuel Rota Bul `o, Peter Kontschieder, Angela Dai, and Matthias Nießner. L3dg: Latent 3d gaussian diffusion. arXiv, 2024. 2, 3

2024
[65]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. InCVPR, 2022. 2, 3, 5

2022
[66]

Lgsvl simulator: A high fidelity simulator for autonomous driving

Guodong Rong, Byung Hyun Shin, Hadi Tabatabaee, Qiang Lu, Steve Lemke, M ¯artin ¸ˇs Mo ˇzeiko, Eric Boise, Geehoon Uhm, Mark Gerow, and Shalin Mehta. Lgsvl simulator: A high fidelity simulator for autonomous driving. InITSC,
[67]

U- net: Convolutional networks for biomedical image segmen- tation

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U- net: Convolutional networks for biomedical image segmen- tation. InMICCAI, 2015. 3, 5, 13

2015
[68]

Progressive distillation for fast sampling of diffusion models.arXiv, 2022

Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models.arXiv, 2022. 13

2022
[69]

Adv3d: Gener- ating safety-critical 3d objects through closed-loop simula- tion

Jay Sarva, Jingkang Wang, James Tu, Yuwen Xiong, Siva- balan Manivasagam, and Raquel Urtasun. Adv3d: Gener- ating safety-critical 3d objects through closed-loop simula- tion. InCoRL, 2023. 1

2023
[70]

AirSim: High-fidelity visual and physical simula- tion for autonomous vehicles

Shital Shah, Debadeepta Dey, Chris Lovett, and Ashish Kapoor. AirSim: High-fidelity visual and physical simula- tion for autonomous vehicles. InField and service robotics,
[71]

Gina-3d: Learning to generate implicit neural assets in the wild

Bokui Shen, Xinchen Yan, Charles R Qi, Mahyar Najibi, Boyang Deng, Leonidas Guibas, Yin Zhou, and Dragomir Anguelov. Gina-3d: Learning to generate implicit neural assets in the wild. InCVPR, 2023. 2

2023
[72]

Mvdream: Multi-view diffusion for 3d generation.arXiv, 2023

Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, and Xiao Yang. Mvdream: Multi-view diffusion for 3d generation.arXiv, 2023. 3

2023
[73]

3d neural field genera- tion using triplane diffusion

J Ryan Shue, Eric Ryan Chan, Ryan Po, Zachary Ankner, Jiajun Wu, and Gordon Wetzstein. 3d neural field genera- tion using triplane diffusion. InCVPR, 2023. 3, 5

2023
[74]

Deep unsupervised learning using nonequilibrium thermodynamics

Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. InICML, 2015. 5

2015
[75]

Denois- ing diffusion implicit models.arXiv, 2020

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denois- ing diffusion implicit models.arXiv, 2020. 3, 5

2020
[76]

Score- based generative modeling through stochastic differential equations.arXiv, 2020

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score- based generative modeling through stochastic differential equations.arXiv, 2020. 3, 5

2020
[77]

Viewset diffusion:(0-) image-conditioned 3d gen- erative models from 2d data

Stanislaw Szymanowicz, Christian Rupprecht, and Andrea Vedaldi. Viewset diffusion:(0-) image-conditioned 3d gen- erative models from 2d data. InICCV, 2023. 3

2023
[78]

Block-nerf: Scalable large scene neural view synthesis

Matthew Tancik, Vincent Casser, Xinchen Yan, Sabeek Pradhan, Ben Mildenhall, Pratul P Srinivasan, Jonathan T Barron, and Henrik Kretzschmar. Block-nerf: Scalable large scene neural view synthesis. InCVPR, 2022. 1

2022
[79]

Dreamgaussian: Generative gaussian splatting for efficient 3d content creation.arXiv, 2023

Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. Dreamgaussian: Generative gaussian splatting for efficient 3d content creation.arXiv, 2023. 3

2023
[80]

Lgm: Large multi-view gaussian model for high-resolution 3d content creation

Jiaxiang Tang, Zhaoxi Chen, Xiaokang Chen, Tengfei Wang, Gang Zeng, and Ziwei Liu. Lgm: Large multi-view gaussian model for high-resolution 3d content creation. In ECCV, 2025. 3

2025

Showing first 80 references.

[1] [1]

Renderdiffusion: Image diffusion for 3d reconstruction, in- painting and generation

Titas Anciukevi ˇcius, Zexiang Xu, Matthew Fisher, Paul Henderson, Hakan Bilen, Niloy J Mitra, and Paul Guerrero. Renderdiffusion: Image diffusion for 3d reconstruction, in- painting and generation. InCVPR, 2023. 3

2023

[2] [2]

Barron, Ben Mildenhall, Dor Verbin, Pratul P

Jonathan T. Barron, Ben Mildenhall, Dor Verbin, Pratul P. Srinivasan, and Peter Hedman. Mip-nerf 360: Unbounded anti-aliased neural radiance fields.CVPR, 2022. 13

2022

[3] [3]

Zip-nerf: Anti-aliased grid- based neural radiance fields

Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Zip-nerf: Anti-aliased grid- based neural radiance fields. InICCV, 2023. 22

2023

[4] [4]

Gaudi: A neural architect for immersive 3d scene generation.NeurIPS, 2022

Miguel Angel Bautista, Pengsheng Guo, Samira Abnar, Walter Talbott, Alexander Toshev, Zhuoyuan Chen, Lau- rent Dinh, Shuangfei Zhai, Hanlin Goh, Daniel Ulbricht, et al. Gaudi: A neural architect for immersive 3d scene generation.NeurIPS, 2022. 5

2022

[5] [5]

Demystifying mmd gans.arXiv, 2018

Mikołaj Bi ´nkowski, Danica J Sutherland, Michael Arbel, and Arthur Gretton. Demystifying mmd gans.arXiv, 2018. 7, 16

2018

[6] [6]

nuscenes: A multi- modal dataset for autonomous driving

Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multi- modal dataset for autonomous driving. InCVPR, 2020. 8, 18

2020

[7] [7]

Lightplane: Highly-scalable components for neu- ral 3d fields.arXiv, 2024

Ang Cao, Justin Johnson, Andrea Vedaldi, and David Novotny. Lightplane: Highly-scalable components for neu- ral 3d fields.arXiv, 2024. 3

2024

[8] [8]

pi-gan: Periodic implicit genera- tive adversarial networks for 3d-aware image synthesis

Eric R Chan, Marco Monteiro, Petr Kellnhofer, Jiajun Wu, and Gordon Wetzstein. pi-gan: Periodic implicit genera- tive adversarial networks for 3d-aware image synthesis. In CVPR, 2021. 2

2021

[9] [9]

Efficient geometry-aware 3d generative adversarial networks

Eric R Chan, Connor Z Lin, Matthew A Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas J Guibas, Jonathan Tremblay, Sameh Khamis, et al. Efficient geometry-aware 3d generative adversarial networks. InCVPR, 2022. 2, 3, 6, 7, 8, 14, 15, 16

2022

[10] [10]

pixelsplat: 3d gaussian splats from im- age pairs for scalable generalizable 3d reconstruction

David Charatan, Sizhe Lester Li, Andrea Tagliasacchi, and Vincent Sitzmann. pixelsplat: 3d gaussian splats from im- age pairs for scalable generalizable 3d reconstruction. In CVPR, 2024. 6, 7, 14, 15

2024

[11] [11]

Single-stage dif- fusion nerf: A unified approach to 3d generation and recon- struction

Hansheng Chen, Jiatao Gu, Anpei Chen, Wei Tian, Zhuowen Tu, Lingjie Liu, and Hao Su. Single-stage dif- fusion nerf: A unified approach to 3d generation and recon- struction. InICCV, 2023. 2, 3, 5, 6, 7, 8, 15, 16

2023

[12] [12]

Geosim: Realistic video simu- lation via geometry-aware composition for self-driving

Yun Chen, Frieda Rong, Shivam Duggal, Shenlong Wang, Xinchen Yan, Sivabalan Manivasagam, Shangjie Xue, Ersin Yumer, and Raquel Urtasun. Geosim: Realistic video simu- lation via geometry-aware composition for self-driving. In CVPR, 2021. 2

2021

[13] [13]

G3r: Gradient guided gen- eralizable reconstruction

Yun Chen, Jingkang Wang, Ze Yang, Sivabalan Mani- vasagam, and Raquel Urtasun. G3r: Gradient guided gen- eralizable reconstruction. InECCV, 2025. 2, 6, 7, 14, 15

2025

[14] [14]

Omnire: Omni urban scene reconstruction.arXiv, 2024

Ziyu Chen, Jiawei Yang, Jiahui Huang, Riccardo de Lu- tio, Janick Martinez Esturo, Boris Ivanovic, Or Litany, Zan Gojcic, Sanja Fidler, Marco Pavone, et al. Omnire: Omni urban scene reconstruction.arXiv, 2024. 1, 2, 6, 22

2024

[15] [15]

Objaverse: A universe of annotated 3d objects

Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. InCVPR, 2023. 2

2023

[16] [16]

Diffusion models beat gans on image synthesis.NeurIPS, 2021

Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis.NeurIPS, 2021. 5

2021

[17] [17]

CARLA: An open urban driving simulator

Alexey Dosovitskiy, German Ros, Felipe Codevilla, Anto- nio Lopez, and Vladlen Koltun. CARLA: An open urban driving simulator. InCoRL, 2017. 1

2017

[18] [18]

Taming transformers for high-resolution image synthesis

Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. InCVPR,

[19] [19]

Dynamic 3d gaussian fields for urban areas.arXiv, 2024

Tobias Fischer, Jonas Kulhanek, Samuel Rota Bul `o, Lorenzo Porzi, Marc Pollefeys, and Peter Kontschieder. Dynamic 3d gaussian fields for urban areas.arXiv, 2024. 1

2024

[20] [20]

De- tail me more: Improving gan’s photo-realism of complex scenes

Raghudeep Gadde, Qianli Feng, and Aleix M Martinez. De- tail me more: Improving gan’s photo-realism of complex scenes. InICCV, 2021. 5

2021

[21] [21]

Get3d: A generative model of high quality 3d tex- tured shapes learned from images.NeurIPS, 2022

Jun Gao, Tianchang Shen, Zian Wang, Wenzheng Chen, Kangxue Yin, Daiqing Li, Or Litany, Zan Gojcic, and Sanja Fidler. Get3d: A generative model of high quality 3d tex- tured shapes learned from images.NeurIPS, 2022. 2, 3

2022

[22] [22]

Cat3d: Create anything in 3d with multi-view diffusion models.arXiv, 2024

Ruiqi Gao, Aleksander Holynski, Philipp Henzler, Arthur Brussee, Ricardo Martin-Brualla, Pratul Srinivasan, Jonathan T Barron, and Ben Poole. Cat3d: Create anything in 3d with multi-view diffusion models.arXiv, 2024. 3

2024

[23] [23]

Generative adversarial nets.NeurIPS,

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets.NeurIPS,

[24] [24]

Nerfdiff: Single-image view synthesis with nerf-guided distillation from 3d-aware diffusion

Jiatao Gu, Alex Trevithick, Kai-En Lin, Joshua M Susskind, Christian Theobalt, Lingjie Liu, and Ravi Ra- mamoorthi. Nerfdiff: Single-image view synthesis with nerf-guided distillation from 3d-aware diffusion. InICML,

[25] [25]

Gans trained by a two time-scale update rule converge to a local nash equi- librium.NeurIPS, 2017

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equi- librium.NeurIPS, 2017. 6, 7, 16

2017

[26] [26]

Denoising dif- fusion probabilistic models.NeurIPS, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising dif- fusion probabilistic models.NeurIPS, 2020. 3, 5, 13

2020

[27] [27]

LRM: Large reconstruction model for single im- age to 3d.arXiv, 2023

Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. LRM: Large reconstruction model for single im- age to 3d.arXiv, 2023. 2

2023

[28] [28]

Rangeldm: Fast realistic lidar point cloud generation

Qianjiang Hu, Zhimin Zhang, and Wei Hu. Rangeldm: Fast realistic lidar point cloud generation. InECCV, 2025. 3

2025

[29] [29]

VEGS: View extrapolation of ur- ban scenes in 3d gaussian splatting using learned priors

Sungwon Hwang, Min-Jung Kim, Taewoong Kang, Jayeon Kang, and Jaegul Choo. VEGS: View extrapolation of ur- ban scenes in 3d gaussian splatting using learned priors. arXiv, 2024. 2, 3

2024

[30] [30]

Image-to-image translation with conditional adver- sarial networks

Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adver- sarial networks. InCVPR, 2017. 4

2017

[31] [31]

Codenerf: Disentan- gled neural radiance fields for object categories

Wonbong Jang and Lourdes Agapito. Codenerf: Disentan- gled neural radiance fields for object categories. InICCV,

[32] [32]

A style-based generator architecture for generative adversarial networks

Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. InCVPR, 2019. 2

2019

[33] [33]

3d gaussian splatting for real-time radiance field rendering.ACM Trans

Bernhard Kerbl, Georgios Kopanas, Thomas Leimk ¨uhler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Trans. Graph., 2023. 22

2023

[34] [34]

Autosplat: Constrained gaussian splatting for autonomous driving scene reconstruction.arXiv, 2024

Mustafa Khan, Hamidreza Fazlali, Dhruv Sharma, Tong- tong Cao, Dongfeng Bai, Yuan Ren, and Bingbing Liu. Autosplat: Constrained gaussian splatting for autonomous driving scene reconstruction.arXiv, 2024. 1, 2

2024

[35] [35]

Auto-encoding variational bayes

Diederik P Kingma. Auto-encoding variational bayes. arXiv, 2013. 3, 5, 13

2013

[36] [36]

Adam: A method for stochastic optimization.ICLR, 2015

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization.ICLR, 2015. 14

2015

[37] [37]

Segment anything

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. InICCV, 2023. 6, 15, 16

2023

[38] [38]

Instant3d: Fast text-to-3d with sparse-view generation and large reconstruction model

Jiahao Li, Hao Tan, Kai Zhang, Zexiang Xu, Fujun Luan, Yinghao Xu, Yicong Hong, Kalyan Sunkavalli, Greg Shakhnarovich, and Sai Bi. Instant3d: Fast text-to-3d with sparse-view generation and large reconstruction model. arXiv, 2023. 3

2023

[39] [39]

Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers

Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chong- hao Sima, Tong Lu, Yu Qiao, and Jifeng Dai. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. InECCV, 2022. 8

2022

[40] [40]

Magic3d: High- resolution text-to-3d content creation

Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3d: High- resolution text-to-3d content creation. InCVPR, 2023. 2, 3

2023

[41] [41]

Neural scene rasterization for large scene rendering in real time

Jeffrey Yunfan Liu, Yun Chen, Ze Yang, Jingkang Wang, Sivabalan Manivasagam, and Raquel Urtasun. Neural scene rasterization for large scene rendering in real time. InICCV,

[42] [42]

One-2-3-45: Any sin- gle image to 3d mesh in 45 seconds without per-shape op- timization.NeurIPS, 2024

Minghua Liu, Chao Xu, Haian Jin, Linghao Chen, Mukund Varma T, Zexiang Xu, and Hao Su. One-2-3-45: Any sin- gle image to 3d mesh in 45 seconds without per-shape op- timization.NeurIPS, 2024. 3

2024

[43] [43]

Meshformer: High- quality mesh generation with 3d-guided reconstruction model.arXiv, 2024

Minghua Liu, Chong Zeng, Xinyue Wei, Ruoxi Shi, Ling- hao Chen, Chao Xu, Mengqi Zhang, Zhaoning Wang, Xi- aoshuai Zhang, Isabella Liu, et al. Meshformer: High- quality mesh generation with 3d-guided reconstruction model.arXiv, 2024. 20

2024

[44] [44]

Zero-1-to-3: Zero-shot one image to 3d object

Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tok- makov, Sergey Zakharov, and Carl V ondrick. Zero-1-to-3: Zero-shot one image to 3d object. InICCV, 2023. 3

2023

[45] [45]

Meshdif- fusion: Score-based generative 3d mesh modeling.arXiv,

Zhen Liu, Yao Feng, Michael J Black, Derek Nowrouzezahrai, Liam Paull, and Weiyang Liu. Meshdif- fusion: Score-based generative 3d mesh modeling.arXiv,

[46] [46]

Wonder3d: Single image to 3d using cross-domain diffusion

Xiaoxiao Long, Yuan-Chen Guo, Cheng Lin, Yuan Liu, Zhiyang Dou, Lingjie Liu, Yuexin Ma, Song-Hai Zhang, Marc Habermann, Christian Theobalt, et al. Wonder3d: Single image to 3d using cross-domain diffusion. InCVPR,

[47] [47]

Diffusion probabilistic models for 3d point cloud generation

Shitong Luo and Wei Hu. Diffusion probabilistic models for 3d point cloud generation. InCVPR, 2021. 3

2021

[48] [48]

Towards zero domain gap: A comprehensive study of realistic lidar simulation for autonomy testing

Sivabalan Manivasagam, Ioan Andrei B ˆarsan, Jingkang Wang, Ze Yang, and Raquel Urtasun. Towards zero domain gap: A comprehensive study of realistic lidar simulation for autonomy testing. InICCV, 2023. 1

2023

[49] [49]

Lt3sd: Latent trees for 3d scene diffusion.arXiv, 2024

Quan Meng, Lei Li, Matthias Nießner, and Angela Dai. Lt3sd: Latent trees for 3d scene diffusion.arXiv, 2024. 3

2024

[50] [50]

Nerf: Representing scenes as neural radiance fields for view syn- thesis.Communications of the ACM, 2021

Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view syn- thesis.Communications of the ACM, 2021. 2

2021

[51] [51]

Autorf: Learning 3d object radiance fields from single view observations

Norman M ¨uller, Andrea Simonelli, Lorenzo Porzi, Samuel Rota Bul `o, Matthias Nießner, and Peter Kontschieder. Autorf: Learning 3d object radiance fields from single view observations. InCVPR, 2022. 2

2022

[52] [52]

Diffrf: Rendering-guided 3d radiance field diffusion

Norman M ¨uller, Yawar Siddiqui, Lorenzo Porzi, Samuel Rota Bulo, Peter Kontschieder, and Matthias Nießner. Diffrf: Rendering-guided 3d radiance field diffusion. InCVPR, 2023. 2, 3, 5, 6

2023

[53] [53]

Extracting triangular 3d models, materials, and lighting from images

Jacob Munkberg, Jon Hasselgren, Tianchang Shen, Jun Gao, Wenzheng Chen, Alex Evans, Thomas M ¨uller, and Sanja Fidler. Extracting triangular 3d models, materials, and lighting from images. InCVPR, 2022. 2

2022

[54] [54]

Giraffe: Repre- senting scenes as compositional generative neural feature fields

Michael Niemeyer and Andreas Geiger. Giraffe: Repre- senting scenes as compositional generative neural feature fields. InCVPR, 2021. 2

2021

[55] [55]

Au- todecoding latent 3d diffusion models.NeurIPS, 2023

Evangelos Ntavelis, Aliaksandr Siarohin, Kyle Olszewski, Chaoyang Wang, Luc V Gool, and Sergey Tulyakov. Au- todecoding latent 3d diffusion models.NeurIPS, 2023. 2, 3

2023

[56] [56]

Unisurf: Unifying neural implicit surfaces and radiance fields for multi-view reconstruction

Michael Oechsle, Songyou Peng, and Andreas Geiger. Unisurf: Unifying neural implicit surfaces and radiance fields for multi-view reconstruction. InICCV, 2021. 2

2021

[57] [57]

Neural scene graphs for dynamic scenes

Julian Ost, Fahim Mannan, Nils Thuerey, Julian Knodt, and Felix Heide. Neural scene graphs for dynamic scenes. In CVPR, 2021. 2

2021

[58] [58]

SDXL: Improving latent diffusion mod- els for high-resolution image synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M ¨uller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion mod- els for high-resolution image synthesis. InICLR, 2024. 3

2024

[59] [59]

Dreamfusion: Text-to-3d using 2d diffusion.arXiv,

Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Milden- hall. Dreamfusion: Text-to-3d using 2d diffusion.arXiv,

[60] [60]

Neural lighting simulation for urban scenes

Ava Pun, Gary Sun, Jingkang Wang, Yun Chen, Ze Yang, Sivabalan Manivasagam, Wei-Chiu Ma, and Raquel Ur- tasun. Neural lighting simulation for urban scenes. In NeurIPS, 2023. 2, 22

2023

[61] [61]

Towards realis- tic scene generation with lidar diffusion models

Haoxi Ran, Vitor Guizilini, and Yue Wang. Towards realis- tic scene generation with lidar diffusion models. InCVPR,

[62] [62]

Xcube: Large-scale 3d generative modeling using sparse voxel hierarchies

Xuanchi Ren, Jiahui Huang, Xiaohui Zeng, Ken Museth, Sanja Fidler, and Francis Williams. Xcube: Large-scale 3d generative modeling using sparse voxel hierarchies. In CVPR, 2024. 3 10

2024

[63] [63]

SCube: Instant large-scale scene reconstruc- tion using voxsplats.arXiv, 2024

Xuanchi Ren, Yifan Lu, Hanxue Liang, Zhangjie Wu, Huan Ling, Mike Chen, Sanja Fidler, Francis Williams, and Ji- ahui Huang. SCube: Instant large-scale scene reconstruc- tion using voxsplats.arXiv, 2024. 2

2024

[64] [64]

L3dg: Latent 3d gaussian diffusion

Barbara Roessle, Norman M ¨uller, Lorenzo Porzi, Samuel Rota Bul `o, Peter Kontschieder, Angela Dai, and Matthias Nießner. L3dg: Latent 3d gaussian diffusion. arXiv, 2024. 2, 3

2024

[65] [65]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. InCVPR, 2022. 2, 3, 5

2022

[66] [66]

Lgsvl simulator: A high fidelity simulator for autonomous driving

Guodong Rong, Byung Hyun Shin, Hadi Tabatabaee, Qiang Lu, Steve Lemke, M ¯artin ¸ˇs Mo ˇzeiko, Eric Boise, Geehoon Uhm, Mark Gerow, and Shalin Mehta. Lgsvl simulator: A high fidelity simulator for autonomous driving. InITSC,

[67] [67]

U- net: Convolutional networks for biomedical image segmen- tation

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U- net: Convolutional networks for biomedical image segmen- tation. InMICCAI, 2015. 3, 5, 13

2015

[68] [68]

Progressive distillation for fast sampling of diffusion models.arXiv, 2022

Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models.arXiv, 2022. 13

2022

[69] [69]

Adv3d: Gener- ating safety-critical 3d objects through closed-loop simula- tion

Jay Sarva, Jingkang Wang, James Tu, Yuwen Xiong, Siva- balan Manivasagam, and Raquel Urtasun. Adv3d: Gener- ating safety-critical 3d objects through closed-loop simula- tion. InCoRL, 2023. 1

2023

[70] [70]

AirSim: High-fidelity visual and physical simula- tion for autonomous vehicles

Shital Shah, Debadeepta Dey, Chris Lovett, and Ashish Kapoor. AirSim: High-fidelity visual and physical simula- tion for autonomous vehicles. InField and service robotics,

[71] [71]

Gina-3d: Learning to generate implicit neural assets in the wild

Bokui Shen, Xinchen Yan, Charles R Qi, Mahyar Najibi, Boyang Deng, Leonidas Guibas, Yin Zhou, and Dragomir Anguelov. Gina-3d: Learning to generate implicit neural assets in the wild. InCVPR, 2023. 2

2023

[72] [72]

Mvdream: Multi-view diffusion for 3d generation.arXiv, 2023

Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, and Xiao Yang. Mvdream: Multi-view diffusion for 3d generation.arXiv, 2023. 3

2023

[73] [73]

3d neural field genera- tion using triplane diffusion

J Ryan Shue, Eric Ryan Chan, Ryan Po, Zachary Ankner, Jiajun Wu, and Gordon Wetzstein. 3d neural field genera- tion using triplane diffusion. InCVPR, 2023. 3, 5

2023

[74] [74]

Deep unsupervised learning using nonequilibrium thermodynamics

Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. InICML, 2015. 5

2015

[75] [75]

Denois- ing diffusion implicit models.arXiv, 2020

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denois- ing diffusion implicit models.arXiv, 2020. 3, 5

2020

[76] [76]

Score- based generative modeling through stochastic differential equations.arXiv, 2020

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score- based generative modeling through stochastic differential equations.arXiv, 2020. 3, 5

2020

[77] [77]

Viewset diffusion:(0-) image-conditioned 3d gen- erative models from 2d data

Stanislaw Szymanowicz, Christian Rupprecht, and Andrea Vedaldi. Viewset diffusion:(0-) image-conditioned 3d gen- erative models from 2d data. InICCV, 2023. 3

2023

[78] [78]

Block-nerf: Scalable large scene neural view synthesis

Matthew Tancik, Vincent Casser, Xinchen Yan, Sabeek Pradhan, Ben Mildenhall, Pratul P Srinivasan, Jonathan T Barron, and Henrik Kretzschmar. Block-nerf: Scalable large scene neural view synthesis. InCVPR, 2022. 1

2022

[79] [79]

Dreamgaussian: Generative gaussian splatting for efficient 3d content creation.arXiv, 2023

Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. Dreamgaussian: Generative gaussian splatting for efficient 3d content creation.arXiv, 2023. 3

2023

[80] [80]

Lgm: Large multi-view gaussian model for high-resolution 3d content creation

Jiaxiang Tang, Zhaoxi Chen, Xiaokang Chen, Tengfei Wang, Gang Zeng, and Ziwei Liu. Lgm: Large multi-view gaussian model for high-resolution 3d content creation. In ECCV, 2025. 3

2025