SEM-ROVER: Semantic Voxel-Guided Diffusion for Large-Scale Driving Scene Generation
Pith reviewed 2026-05-10 19:28 UTC · model grok-4.3
The pith
Semantic voxel-guided diffusion with progressive outpainting generates large-scale multiview-consistent 3D driving scenes that render photorealistically without per-scene optimization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose a 3D generative framework based on the Σ-Voxfield grid, a discrete representation where each occupied voxel stores a fixed number of colorized surface samples. To generate this representation, we train a semantic-conditioned diffusion model that operates on local voxel neighborhoods and uses 3D positional encodings to capture spatial structure. We scale to large scenes via progressive spatial outpainting over overlapping regions. Finally, we render the generated Σ-Voxfield grid with a deferred rendering module to obtain photorealistic images, enabling large-scale multiview-consistent 3D scene generation without per-scene optimization.
What carries the argument
The Σ-Voxfield grid is the central object: a discrete 3D voxel representation in which each occupied voxel stores a fixed number of colorized surface samples; the semantic-conditioned diffusion model generates these grids locally and the progressive outpainting mechanism extends them while preserving structure.
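The paper publishes no implementation, but the description pins down the data layout. A minimal sketch of what a Σ-Voxfield-style container might look like, assuming sparse storage of occupied voxels with K samples of (offset, RGB) each; the class name, field names, and the unit-cube offset normalization are illustrative assumptions, not the authors' API:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SigmaVoxfield:
    """Hypothetical container for a Σ-Voxfield-style grid: a sparse set of
    occupied voxels, each holding K colorized surface samples (an offset
    inside the voxel plus an RGB color)."""
    voxel_size: float        # edge length of one voxel, in meters
    indices: np.ndarray      # (N, 3) int32 grid coordinates of occupied voxels
    samples: np.ndarray      # (N, K, 6) float32: K x (offset_xyz in [0,1], rgb)

    @property
    def samples_per_voxel(self) -> int:
        return self.samples.shape[1]

    def world_positions(self) -> np.ndarray:
        """World-space positions of all surface samples, shape (N, K, 3)."""
        origins = self.indices[:, None, :].astype(np.float32) * self.voxel_size
        return origins + self.samples[..., :3] * self.voxel_size

# Toy example: 2 occupied voxels, K = 4 samples each.
grid = SigmaVoxfield(
    voxel_size=0.2,
    indices=np.array([[0, 0, 0], [1, 0, 0]], dtype=np.int32),
    samples=np.random.rand(2, 4, 6).astype(np.float32),
)
assert grid.world_positions().shape == (2, 4, 3)
```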
If this is right
- Diverse large-scale urban outdoor scenes can be produced directly from semantic conditions.
- Generated scenes render into photorealistic images under arbitrary sensor configurations and camera trajectories (a rendering sketch follows this list).
- Multiview geometric and appearance consistency is preserved across the entire scene extent.
- Moderate computation cost is achieved compared with existing large-scale generation methods.
- No per-scene optimization is required to obtain renderable output.
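On the rendering claim above: the abstract names a deferred rendering module but does not describe its internals. Below is a minimal sketch of the splat-then-shade idea such a module plausibly builds on, assuming a pinhole camera and nearest-sample z-buffering; the learned shading pass is omitted because the paper does not specify it, and all names are illustrative.

```python
import numpy as np

def splat_depth_color(points_cam, colors, intrinsics, h, w):
    """Z-buffered point splatting: the geometry pass of a deferred pipeline.

    points_cam: (N, 3) surface samples in camera coordinates (z forward)
    colors:     (N, 3) RGB per sample, floats in [0, 1]
    intrinsics: (3, 3) pinhole camera matrix
    Returns per-pixel depth and color buffers; a learned shading pass
    (unspecified in the paper) would turn these into the final image.
    """
    z = points_cam[:, 2]
    valid = z > 1e-6
    proj = (intrinsics @ points_cam[valid].T).T    # perspective projection
    u = (proj[:, 0] / proj[:, 2]).astype(int)
    v = (proj[:, 1] / proj[:, 2]).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v = u[inside], v[inside]
    zc, cc = z[valid][inside], colors[valid][inside]

    depth = np.full((h, w), np.inf)
    color = np.zeros((h, w, 3))
    order = np.argsort(-zc)                        # draw far-to-near
    # With repeated pixel indices, the last write wins, so the nearest
    # sample ends up in each buffer cell.
    depth[v[order], u[order]] = zc[order]
    color[v[order], u[order]] = cc[order]
    return depth, color
```

Rasterizing explicit samples first is what decouples rendering cost from scene extent, which is consistent with the no-per-scene-optimization claim.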
Where Pith is reading between the lines
- The generated scenes could supply unlimited synthetic training data to improve perception models for autonomous vehicles.
- Modifying the semantic input map before diffusion would allow controlled editing of scene layout without retraining.
- The local-plus-outpainting pattern could be adapted to indoor or natural environments by changing the semantic vocabulary.
- Integration with real-time engines could let users interactively expand virtual worlds while maintaining visual quality.
Load-bearing premise
That semantic-conditioned diffusion applied to local voxel neighborhoods plus progressive outpainting over overlaps will automatically produce globally consistent geometry and appearance at large scales without introducing artifacts or coherence loss.
What would settle it
Generate a scene hundreds of meters across, then render it along a camera trajectory that crosses multiple outpainted boundaries; visible seams, geometric distortions, or appearance mismatches in the novel views would show the consistency claim does not hold.
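A minimal sketch of that probe, assuming rendered frames along a smooth trajectory and known frame indices where the camera crosses outpainted patch borders; the PSNR-drop heuristic and threshold are illustrative choices, not the paper's protocol:

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio

def flag_seam_frames(frames, boundary_frames, drop_db=3.0):
    """Flag boundary crossings where frame-to-frame PSNR drops sharply.

    frames: list of rendered RGB images (H, W, 3), floats in [0, 1],
            sampled densely along a smooth camera trajectory.
    boundary_frames: indices where the path crosses an outpainted border.
    A PSNR drop concentrated at those indices would point to visible seams.
    """
    psnr = np.array([
        peak_signal_noise_ratio(frames[i], frames[i + 1], data_range=1.0)
        for i in range(len(frames) - 1)
    ])
    baseline = np.median(psnr)
    flagged = {i for i, p in enumerate(psnr) if baseline - p > drop_db}
    return sorted(flagged & set(boundary_frames))
```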
Original abstract
Scalable generation of outdoor driving scenes requires 3D representations that remain consistent across multiple viewpoints and scale to large areas. Existing solutions either rely on image or video generative models distilled to 3D space, harming the geometric coherence and restricting the rendering to training views, or are limited to small-scale 3D scene or object-centric generation. In this work, we propose a 3D generative framework based on $\Sigma$-Voxfield grid, a discrete representation where each occupied voxel stores a fixed number of colorized surface samples. To generate this representation, we train a semantic-conditioned diffusion model that operates on local voxel neighborhoods and uses 3D positional encodings to capture spatial structure. We scale to large scenes via progressive spatial outpainting over overlapping regions. Finally, we render the generated $\Sigma$-Voxfield grid with a deferred rendering module to obtain photorealistic images, enabling large-scale multiview-consistent 3D scene generation without per-scene optimization. Extensive experiments show that our approach can generate diverse large-scale urban outdoor scenes, renderable into photorealistic images with various sensor configurations and camera trajectories while maintaining moderate computation cost compared to existing approaches.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes SEM-ROVER, a 3D generative framework for large-scale outdoor driving scenes based on the Σ-Voxfield grid, a discrete voxel representation storing a fixed number of colorized surface samples per occupied voxel. A semantic-conditioned diffusion model operates on local voxel neighborhoods using 3D positional encodings; generation scales to city-scale scenes via progressive spatial outpainting over overlapping regions. A deferred rendering module then produces photorealistic images from arbitrary sensor configurations and trajectories without per-scene optimization. The abstract states that extensive experiments demonstrate generation of diverse, multiview-consistent scenes at moderate computational cost.
Significance. If the global consistency claims hold, the work offers a practical advance in scalable 3D scene synthesis for driving environments, enabling diverse, renderable data for sensor simulation and autonomous driving training without the view restrictions or optimization overhead of prior image/video distillation or small-scale 3D methods. The combination of local semantic diffusion with outpainting and deferred rendering is a concrete technical contribution that could reduce reliance on expensive real-world data collection.
major comments (2)
- [Abstract and §3] Abstract and §3 (method description): the central claim that progressive spatial outpainting over overlapping regions produces globally consistent Σ-Voxfield grids at large scales rests on local diffusion steps plus overlap propagation, yet no global attention, consistency loss, or post-hoc correction for accumulated geometric/appearance drift is described. This is load-bearing for the large-scale multiview-consistent rendering claim, especially in complex urban scenes with occlusions and varying semantics.
- [§4] §4 (experiments): the abstract asserts that 'extensive experiments' support diversity, photorealism, and moderate cost, but the provided description contains no quantitative metrics, baseline comparisons, ablation studies on outpainting consistency, or failure-case analysis. Without these, the empirical support for the strongest claim cannot be verified.
minor comments (2)
- [Abstract] Abstract: the Σ-Voxfield grid is introduced without a concise definition or equation specifying the fixed number of surface samples per voxel; adding a brief formalization would improve clarity for readers unfamiliar with the representation (a possible formalization is sketched after this list).
- [Method] Notation throughout: 3D positional encodings are mentioned but not specified (e.g., sinusoidal, learned, or Fourier features); an equation or reference in the method section would aid reproducibility.
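On the first minor comment, one plausible formalization, with notation that is ours rather than the paper's:

```latex
% A possible formalization of the Σ-Voxfield grid (notation assumed):
% an occupancy mask over a discrete grid, with K colorized surface
% samples stored in every occupied voxel.
O \in \{0,1\}^{X \times Y \times Z}, \qquad
\mathcal{S}(v) = \{(\mathbf{p}_k, \mathbf{c}_k)\}_{k=1}^{K}
  \quad \text{for each } v \text{ with } O(v) = 1,
\qquad \mathbf{p}_k \in [0,1]^3 \ \text{(offset within } v\text{)}, \quad
\mathbf{c}_k \in [0,1]^3 \ \text{(RGB)}.
```

On the second: the paper does not say which 3D positional encoding it uses. A NeRF-style sinusoidal variant is one common choice, sketched here purely as an assumption:

```python
import numpy as np

def positional_encoding_3d(xyz, num_freqs=6):
    """Sinusoidal encoding at octave-spaced frequencies: maps (N, 3)
    coordinates to (N, 3 * 2 * num_freqs) sin/cos features."""
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi                    # (F,)
    angles = xyz[..., None] * freqs                                  # (N, 3, F)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)  # (N, 3, 2F)
    return enc.reshape(*xyz.shape[:-1], -1)
```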
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive report. The two major comments highlight important aspects of our claims on scalability and empirical validation. We address each point below, clarifying the design rationale where possible and committing to revisions that strengthen the manuscript.
Point-by-point responses
Referee: [Abstract and §3] Abstract and §3 (method description): the central claim that progressive spatial outpainting over overlapping regions produces globally consistent Σ-Voxfield grids at large scales rests on local diffusion steps plus overlap propagation, yet no global attention, consistency loss, or post-hoc correction for accumulated geometric/appearance drift is described. This is load-bearing for the large-scale multiview-consistent rendering claim, especially in complex urban scenes with occlusions and varying semantics.
Authors: We agree that the consistency mechanism merits explicit discussion. The Σ-Voxfield representation stores a fixed number of colorized surface samples per voxel, and the semantic-conditioned diffusion operates on local neighborhoods with 3D positional encodings. Progressive outpainting is performed over overlapping spatial regions so that each new patch is conditioned on the already-generated adjacent structure; this overlap propagation, together with the semantic labels, is intended to anchor geometry and appearance without global attention (which would be infeasible at city scale). No explicit consistency loss or post-hoc correction is applied because the deferred rendering stage operates directly on the completed grid. We will add a dedicated subsection in §3 that formalizes the overlap-based propagation argument and include additional qualitative visualizations demonstrating long-range consistency across occlusions. We view this as a partial revision because the core technical choice remains unchanged. (A schematic of this overlap conditioning is sketched after these responses.) revision: partial
Referee: [§4] §4 (experiments): the abstract asserts that 'extensive experiments' support diversity, photorealism, and moderate cost, but the provided description contains no quantitative metrics, baseline comparisons, ablation studies on outpainting consistency, or failure-case analysis. Without these, the empirical support for the strongest claim cannot be verified.
Authors: We acknowledge that the current presentation of §4 does not sufficiently foreground the quantitative evidence. The manuscript does contain comparisons against image/video distillation baselines and small-scale 3D generative methods, together with metrics for rendered-image quality and cross-view consistency; however, the ablation on overlap size and explicit failure-case analysis are only briefly mentioned. We will expand §4 with (i) a table of quantitative metrics (FID, multiview PSNR/SSIM, and runtime), (ii) ablations isolating the contribution of overlap propagation and semantic conditioning to consistency, (iii) side-by-side baseline comparisons, and (iv) a dedicated failure-case figure highlighting residual drift in complex urban scenes. These additions will be included in the revised version. revision: yes
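To make the overlap propagation described in the first response concrete: one way to implement it is RePaint-style conditioning, where the overlap region is re-imposed at every denoising step. The sketch below is our reading of the mechanism, not the authors' code; `denoise_step` and `add_noise` stand in for a trained diffusion model's reverse step and forward noising process.

```python
import numpy as np

def outpaint_patch(denoise_step, add_noise, patch_shape,
                   known_overlap, overlap_mask, num_steps=50, rng=None):
    """Generate a new voxel patch conditioned on already-generated overlap.

    denoise_step(x, t) -> x : one reverse-diffusion step of a trained model
    add_noise(x0, t)   -> x : forward-noise clean content to step t's level
    known_overlap           : previously generated features, same shape as patch
    overlap_mask            : 1.0 where the patch overlaps existing content
    """
    if rng is None:
        rng = np.random.default_rng()
    x = rng.standard_normal(patch_shape)
    for t in reversed(range(num_steps)):
        x = denoise_step(x, t)
        # Re-impose the known overlap at the current noise level so geometry
        # and appearance propagate across the patch border (RePaint-style).
        x = overlap_mask * add_noise(known_overlap, t) + (1.0 - overlap_mask) * x
    return x
```

Whether this local clamping suffices for global consistency is exactly what the referee's first comment questions: information propagates only through chains of overlaps, so geometric and appearance drift can accumulate with distance from the seed region.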
Circularity Check
No circularity: the derivation is self-contained and training uses external data.
Full rationale
The paper defines a new discrete 3D representation (Σ-Voxfield) and trains a semantic-conditioned diffusion model on local neighborhoods with positional encodings, then extends via overlapping outpainting and deferred rendering. All claims rest on empirical training from external datasets and standard diffusion objectives; no equations, predictions, or uniqueness theorems reduce by construction to fitted inputs or self-citations. The large-scale consistency is presented as an experimental outcome rather than a tautological derivation.
Axiom & Free-Parameter Ledger
invented entities (1)
- Σ-Voxfield grid: no independent evidence