pith. machine review for the scientific record.

arxiv: 2604.06113 · v1 · submitted 2026-04-07 · 💻 cs.CV

Recognition: no theorem link

SEM-ROVER: Semantic Voxel-Guided Diffusion for Large-Scale Driving Scene Generation

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 19:28 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D scene generation · diffusion models · semantic conditioning · voxel grids · driving scenes · progressive outpainting · photorealistic rendering · multiview consistency

The pith

Semantic voxel-guided diffusion with progressive outpainting generates large-scale multiview-consistent 3D driving scenes that render photorealistically without per-scene optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to produce expansive 3D models of urban driving environments that stay geometrically and visually coherent when viewed from many different angles and distances. Earlier methods either lose 3D structure when converting from 2D generators or cannot grow beyond small bounded areas. The approach builds a discrete voxel structure called the Σ-Voxfield grid by running a semantic-conditioned diffusion model on local neighborhoods, then expands these neighborhoods outward through overlapping regions. A deferred renderer turns the resulting grid into photorealistic images for arbitrary sensors and paths. This matters because it removes the need to optimize a new model for every new scene while keeping computation moderate.
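To make the representation concrete, here is a minimal Python sketch of how a Σ-Voxfield-style grid could be stored, assuming each occupied voxel keeps a fixed number N of colorized surface samples; the class and field names are illustrative, not the paper's implementation.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class SigmaVoxfieldGrid:
    """Sparse voxel grid; each occupied voxel stores N colorized surface samples."""
    voxel_size: float                 # edge length of one voxel, in meters
    samples_per_voxel: int = 20       # fixed N; the supplementary Chamfer study saturates near N = 20
    # maps integer voxel index (i, j, k) -> array of shape (N, 6): xyz position + rgb color
    voxels: dict = field(default_factory=dict)

    def occupy(self, index, surface_points, colors):
        """Store N surface samples (positions and colors) for one occupied voxel."""
        assert surface_points.shape == (self.samples_per_voxel, 3)
        assert colors.shape == (self.samples_per_voxel, 3)
        self.voxels[tuple(index)] = np.concatenate([surface_points, colors], axis=1)

    def occupied_indices(self):
        return np.array(sorted(self.voxels.keys()))

# toy usage: one occupied voxel holding 20 random surface samples
grid = SigmaVoxfieldGrid(voxel_size=0.5)
pts = np.random.rand(20, 3) * grid.voxel_size
rgb = np.random.rand(20, 3)
grid.occupy((4, 2, 0), pts, rgb)
```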

Core claim

We propose a 3D generative framework based on the Σ-Voxfield grid, a discrete representation where each occupied voxel stores a fixed number of colorized surface samples. To generate this representation we train a semantic-conditioned diffusion model that operates on local voxel neighborhoods and uses 3D positional encodings to capture spatial structure. We scale to large scenes via progressive spatial outpainting over overlapping regions. Finally we render the generated Σ-Voxfield grid with a deferred rendering module to obtain photorealistic images, enabling large-scale multiview-consistent 3D scene generation without per-scene optimization.

What carries the argument

The Σ-Voxfield grid is the central object: a discrete 3D voxel representation in which each occupied voxel stores a fixed number of colorized surface samples; the semantic-conditioned diffusion model generates these grids locally and the progressive outpainting mechanism extends them while preserving structure.
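As a rough, assumed illustration of the "local neighborhood" framing, the sketch below gathers nearby occupied voxels and flattens each voxel's sample block into one token for a transformer-style denoiser; the token layout and helper names are not taken from the paper.

```python
import numpy as np

def voxels_to_tokens(voxels, center_index, radius=2):
    """voxels: dict mapping (i, j, k) -> (N, 6) array of surface samples.
    Collect occupied voxels within a cubic neighborhood of the center index and
    flatten each voxel's sample block into one token vector."""
    cx, cy, cz = center_index
    tokens, positions = [], []
    for (i, j, k), samples in voxels.items():
        if max(abs(i - cx), abs(j - cy), abs(k - cz)) <= radius:
            tokens.append(samples.reshape(-1))   # (N * 6,) token features
            positions.append((i, j, k))          # kept for the 3D positional encoding
    return np.stack(tokens), np.array(positions)

# toy neighborhood: three occupied voxels with N = 20 samples each
toy = {(4, 2, 0): np.random.rand(20, 6),
       (5, 2, 0): np.random.rand(20, 6),
       (9, 9, 9): np.random.rand(20, 6)}         # outside the radius, ignored
tokens, positions = voxels_to_tokens(toy, center_index=(4, 2, 0), radius=2)
print(tokens.shape, positions.shape)             # (2, 120) (2, 3)
```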

If this is right

  • Diverse large-scale urban outdoor scenes can be produced directly from semantic conditions.
  • Generated scenes render into photorealistic images under arbitrary sensor configurations and camera trajectories.
  • Multiview geometric and appearance consistency is preserved across the entire scene extent.
  • Moderate computation cost is achieved compared with existing large-scale generation methods.
  • No per-scene optimization is required to obtain renderable output.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The generated scenes could supply unlimited synthetic training data to improve perception models for autonomous vehicles.
  • Modifying the semantic input map before diffusion would allow controlled editing of scene layout without retraining.
  • The local-plus-outpainting pattern could be adapted to indoor or natural environments by changing the semantic vocabulary.
  • Integration with real-time engines could let users interactively expand virtual worlds while maintaining visual quality.

Load-bearing premise

That semantic-conditioned diffusion applied to local voxel neighborhoods plus progressive outpainting over overlaps will automatically produce globally consistent geometry and appearance at large scales without introducing artifacts or coherence loss.
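A minimal sketch of the kind of overlap-conditioned loop this premise describes, loosely in the spirit of the RePaint-style [18] outpainting the supplementary figures mention; the region scheduler and sampler interface here are placeholders, not the paper's algorithm.

```python
import numpy as np

def progressive_outpaint(semantic_grid, region_size, overlap, sample_region):
    """Grow a scene region by region. Each new region is generated conditioned on
    the already-generated voxels in the overlap band (keep_mask = True means 'keep')."""
    H, W = semantic_grid.shape[:2]                  # top-down extent, in voxels
    scene = np.zeros(semantic_grid.shape + (3,))    # generated content (e.g. mean color)
    generated = np.zeros((H, W), dtype=bool)        # which voxels already hold content

    step = region_size - overlap
    for r0 in range(0, H - overlap, step):
        for c0 in range(0, W - overlap, step):
            rs, cs = slice(r0, r0 + region_size), slice(c0, c0 + region_size)
            keep_mask = generated[rs, cs]           # condition on previously generated overlap
            scene[rs, cs] = sample_region(semantic_grid[rs, cs], scene[rs, cs], keep_mask)
            generated[rs, cs] = True
    return scene

# dummy "diffusion sampler": keeps conditioned voxels, fills the rest from the semantics
def dummy_sampler(semantics, prev, keep_mask):
    out = np.random.rand(*prev.shape) * semantics[..., None]
    out[keep_mask] = prev[keep_mask]
    return out

sem = np.ones((64, 64))
scene = progressive_outpaint(sem, region_size=16, overlap=4, sample_region=dummy_sampler)
```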

What would settle it

Generate a scene hundreds of meters across, then render it along a camera trajectory that crosses multiple outpainted boundaries; visible seams, geometric distortions, or appearance mismatches in the novel views would show the consistency claim does not hold.
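One way such a test could be scored, as a sketch: render frames along a trajectory that crosses outpainted boundaries and flag transitions whose photometric change is an outlier. The frame source and threshold here are hypothetical.

```python
import numpy as np

def seam_scores(frames):
    """Mean absolute per-pixel difference between consecutive rendered frames.
    A spike at a frame that crosses an outpainting boundary suggests a visible seam."""
    frames = np.asarray(frames, dtype=np.float64)
    return np.abs(np.diff(frames, axis=0)).mean(axis=(1, 2, 3))

def flag_seams(frames, z_thresh=3.0):
    """Flag transitions whose frame-to-frame change is a statistical outlier."""
    s = seam_scores(frames)
    z = (s - s.mean()) / (s.std() + 1e-8)
    return np.where(z > z_thresh)[0]       # indices of suspicious transitions

# toy example: 50 smoothly varying frames with an artificial jump at transition 30
frames = np.linspace(0, 1, 50)[:, None, None, None] * np.ones((50, 8, 8, 3))
frames[31:] += 0.5                         # simulated appearance mismatch at a boundary
print(flag_seams(frames))                  # -> [30]
```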

Figures

Figures reproduced from arXiv: 2604.06113 by Dzmitry Tsishkou, Hiba Dahmani, Jean-Philippe Tarel, Laurent Caraffa, Luis Roldão, Moussab Bennehar, Nathan Piasco, Roland Brémond.

Figure 1
Figure 1: Our model generates large-scale 3D driving scenes given a coarse semantic voxel grid; the example shown spans ≈100,000 m². view at source ↗
Figure 2
Figure 2: Overview of SEM-ROVER. Generation is performed directly in 3D to ensure consistent geometry and appearance: the scene is represented with a Σ-Voxfield grid, transformer-based diffusion is applied over 3D tokens, an iterative outpainting strategy scales generation to large environments, and a deferred rendering engine converts the generated 3D scene into photorealistic views. view at source ↗
Figure 3
Figure 3: Simplified illustration of a 2D Σ-Voxfield. (a) A 2D voxel containing a continuous colorized surface field. (b) 2D Σ-Voxfield points, uniformly sampled on the surface. (c) Rendering of the Σ-Voxfield, where each point is replaced by a 2D Gaussian aligned with the implicit surface. view at source ↗
Figure 4
Figure 4: Qualitative results on WOD. Three WOD scenes (a–c); for each scene, the top strip visualizes the semantic voxel rendering used for conditioning and the bottom strip shows the corresponding generated scene from 5 camera views. view at source ↗
Figure 5
Figure 5: Qualitative results on PandaSet. Three PandaSet scenes (a–c); for each scene, the top strip visualizes the semantic voxel rendering used for conditioning and the bottom strip shows the corresponding generated scene from 5 camera views. view at source ↗
Figure 6
Figure 6: Qualitative comparison to InfiniCube [17]. view at source ↗
Figure 7
Figure 7: Qualitative results: generative capabilities. view at source ↗
Figure 8
Figure 8: Semantic editing. Existing cars are removed by editing the semantic grid (red boxes), and the content is regenerated conditioned on the updated semantics; the generated results remain coherent and consistent with the surrounding scene. view at source ↗
Figure 9
Figure 9: Scene inpainting. A voxel region is masked (darker voxel colors mark unmasked regions) and regenerated while the rest of the Σ-Voxfield grid is kept fixed; the inpainted results (red box) remain coherent with the background and vary in both structure and appearance. view at source ↗
Figure 1 (supplementary)
Figure 1: Textured meshes after data processing. The top row shows 2 scenes from PandaSet [31], the bottom row 2 scenes from WOD [28]; orange triangles in the colorized mesh denote faces not textured during the procedure. view at source ↗
Figure 2 (supplementary)
Figure 2: Visualization of Σ-Voxfields. (a) Large-scene Σ-Voxfields with their corresponding semantic layouts. (b) Examples of Σ-Voxfield training samples, each containing 50–150 neighboring Σ-Voxfields used to train the model. view at source ↗
Figure 3 (supplementary)
Figure 3: Architecture of the Σ-Voxfield diffusion model; the denoising network is implemented as a transformer over local sets of Σ-Voxfields. view at source ↗
Figure 4 (supplementary)
Figure 4: Deferred rendering architectures. (a) Autoregressive Stable Diffusion: a renderer based on the SD model with additional conditioning for autoregressive generation (previous frame) and consistent 3D generation (3D buffer from Σ-Voxfield rendering and sky mask). (b) Video Stable Diffusion: a renderer based on the VSD model with additional conditioning from 3D buffers and sky masks. view at source ↗
Figure 5 (supplementary)
Figure 5: Top-view visualization of progressive region selection on a semantic voxel grid. Each panel shows the region selected at one iteration of Algorithm 1, with the overlaid number indicating the selection order; extraction expands outward from already covered regions, producing overlapping local sets suitable for conditional outpainting. view at source ↗
Figure 6 (supplementary)
Figure 6: Top-view visualization of progressive outpainting. The first row shows the semantic voxel grid layout conditioning, the second the corresponding generated 3D Σ-Voxfield; the panels correspond to different stages of region expansion and illustrate that the RePaint-based [18] outpainting preserves spatial coherence across neighboring local generations. view at source ↗
Figure 7 (supplementary)
Figure 7: Chamfer Distance between sampled voxel points and the ground-truth scene mesh as a function of the number of surface samples per voxel; the error saturates around N = 20. view at source ↗
Figure 8 (supplementary)
Figure 8: Additional qualitative results on WOD [28]. view at source ↗
Figure 9 (supplementary)
Figure 9: Additional qualitative results on PandaSet [31]. view at source ↗
Figure 10 (supplementary)
Figure 10: Comparison with baselines. Four scenes (a–d), each rendered from the front camera, comparing our method to InfiniCube [17] and GEN3C [14]. view at source ↗
Figure 11 (supplementary)
Figure 11: 3D buffers. Rendering of the generated Σ-Voxfield grids along with the generated scene geometry as a normal map. view at source ↗
Figure 12 (supplementary)
Figure 12: Large scene generation. The semantic voxel grid shows the conditioning used to create a large driving sequence, along with a top-view rendering of the generated Σ-Voxfield grid. view at source ↗
read the original abstract

Scalable generation of outdoor driving scenes requires 3D representations that remain consistent across multiple viewpoints and scale to large areas. Existing solutions either rely on image or video generative models distilled to 3D space, harming the geometric coherence and restricting the rendering to training views, or are limited to small-scale 3D scene or object-centric generation. In this work, we propose a 3D generative framework based on $\Sigma$-Voxfield grid, a discrete representation where each occupied voxel stores a fixed number of colorized surface samples. To generate this representation, we train a semantic-conditioned diffusion model that operates on local voxel neighborhoods and uses 3D positional encodings to capture spatial structure. We scale to large scenes via progressive spatial outpainting over overlapping regions. Finally, we render the generated $\Sigma$-Voxfield grid with a deferred rendering module to obtain photorealistic images, enabling large-scale multiview-consistent 3D scene generation without per-scene optimization. Extensive experiments show that our approach can generate diverse large-scale urban outdoor scenes, renderable into photorealistic images with various sensor configurations and camera trajectories while maintaining moderate computation cost compared to existing approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes SEM-ROVER, a 3D generative framework for large-scale outdoor driving scenes based on the Σ-Voxfield grid, a discrete voxel representation storing a fixed number of colorized surface samples per occupied voxel. A semantic-conditioned diffusion model operates on local voxel neighborhoods using 3D positional encodings; generation scales to city-scale scenes via progressive spatial outpainting over overlapping regions. A deferred rendering module then produces photorealistic images from arbitrary sensor configurations and trajectories without per-scene optimization. The abstract states that extensive experiments demonstrate generation of diverse, multiview-consistent scenes at moderate computational cost.

Significance. If the global consistency claims hold, the work offers a practical advance in scalable 3D scene synthesis for driving environments, enabling diverse, renderable data for sensor simulation and autonomous driving training without the view restrictions or optimization overhead of prior image/video distillation or small-scale 3D methods. The combination of local semantic diffusion with outpainting and deferred rendering is a concrete technical contribution that could reduce reliance on expensive real-world data collection.

major comments (2)
  1. [Abstract and §3, method description] The central claim that progressive spatial outpainting over overlapping regions produces globally consistent Σ-Voxfield grids at large scales rests on local diffusion steps plus overlap propagation, yet no global attention, consistency loss, or post-hoc correction for accumulated geometric or appearance drift is described. This is load-bearing for the large-scale multiview-consistent rendering claim, especially in complex urban scenes with occlusions and varying semantics.
  2. [§4, experiments] The abstract asserts that 'extensive experiments' support diversity, photorealism, and moderate cost, but the provided description contains no quantitative metrics, baseline comparisons, ablation studies on outpainting consistency, or failure-case analysis. Without these, the empirical support for the strongest claim cannot be verified.
minor comments (2)
  1. [Abstract] The Σ-Voxfield grid is introduced without a concise definition or equation specifying the fixed number of surface samples per voxel; adding a brief formalization would improve clarity for readers unfamiliar with the representation.
  2. [Method, notation] The 3D positional encodings are mentioned but not specified (e.g., sinusoidal, learned, or Fourier features); an equation or reference in the method section would aid reproducibility (one common choice is sketched below).
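For illustration only, and not drawn from the paper: a NeRF-style sinusoidal Fourier encoding is one standard way such a 3D positional encoding could be written down.

```python
import numpy as np

def fourier_encode_3d(positions, num_bands=8):
    """Map voxel indices of shape (M, 3) to sinusoidal features of shape (M, 3 * 2 * num_bands).
    Frequencies double at each band, the usual Fourier-feature scheme."""
    positions = np.asarray(positions, dtype=np.float64)      # (M, 3)
    freqs = 2.0 ** np.arange(num_bands) * np.pi              # (num_bands,)
    angles = positions[:, :, None] * freqs                   # (M, 3, num_bands)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(len(positions), -1)

pe = fourier_encode_3d([[4, 2, 0], [5, 2, 0]], num_bands=8)
print(pe.shape)    # (2, 48)
```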

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. The two major comments highlight important aspects of our claims on scalability and empirical validation. We address each point below, clarifying the design rationale where possible and committing to revisions that strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract and §3, method description] The central claim that progressive spatial outpainting over overlapping regions produces globally consistent Σ-Voxfield grids at large scales rests on local diffusion steps plus overlap propagation, yet no global attention, consistency loss, or post-hoc correction for accumulated geometric or appearance drift is described. This is load-bearing for the large-scale multiview-consistent rendering claim, especially in complex urban scenes with occlusions and varying semantics.

    Authors: We agree that the consistency mechanism merits explicit discussion. The Σ-Voxfield representation stores a fixed number of colorized surface samples per voxel, and the semantic-conditioned diffusion operates on local neighborhoods with 3D positional encodings. Progressive outpainting is performed over overlapping spatial regions so that each new patch is conditioned on the already-generated adjacent structure; this overlap propagation, together with the semantic labels, is intended to anchor geometry and appearance without global attention (which would be infeasible at city scale). No explicit consistency loss or post-hoc correction is applied because the deferred rendering stage operates directly on the completed grid. We will add a dedicated subsection in §3 that formalizes the overlap-based propagation argument and include additional qualitative visualizations demonstrating long-range consistency across occlusions. We view this as a partial revision because the core technical choice remains unchanged. revision: partial

  2. Referee: [§4, experiments] The abstract asserts that 'extensive experiments' support diversity, photorealism, and moderate cost, but the provided description contains no quantitative metrics, baseline comparisons, ablation studies on outpainting consistency, or failure-case analysis. Without these, the empirical support for the strongest claim cannot be verified.

    Authors: We acknowledge that the current presentation of §4 does not sufficiently foreground the quantitative evidence. The manuscript does contain comparisons against image/video distillation baselines and small-scale 3D generative methods, together with metrics for rendered-image quality and cross-view consistency; however, the ablation on overlap size and explicit failure-case analysis are only briefly mentioned. We will expand §4 with (i) a table of quantitative metrics (FID, multiview PSNR/SSIM, and runtime), (ii) ablations isolating the contribution of overlap propagation and semantic conditioning to consistency, (iii) side-by-side baseline comparisons, and (iv) a dedicated failure-case figure highlighting residual drift in complex urban scenes. These additions will be included in the revised version. revision: yes
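For reference, the multiview PSNR/SSIM numbers the authors commit to could be computed along these lines, assuming a recent scikit-image (the channel_axis argument) and a dataset-specific pairing of rendered and reference views; this is a sketch, not the paper's evaluation code.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def multiview_quality(rendered, reference):
    """Average PSNR / SSIM over paired lists of (H, W, 3) float images in [0, 1]."""
    psnrs, ssims = [], []
    for ren, ref in zip(rendered, reference):
        psnrs.append(peak_signal_noise_ratio(ref, ren, data_range=1.0))
        ssims.append(structural_similarity(ref, ren, channel_axis=-1, data_range=1.0))
    return float(np.mean(psnrs)), float(np.mean(ssims))

# toy check: a slightly noisy copy of a random view
ref = np.random.rand(64, 64, 3)
ren = np.clip(ref + np.random.normal(0, 0.02, ref.shape), 0, 1)
print(multiview_quality([ren], [ref]))
```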

Circularity Check

0 steps flagged

No circularity: the derivation is self-contained and training relies on external data

full rationale

The paper defines a new discrete 3D representation (Σ-Voxfield) and trains a semantic-conditioned diffusion model on local neighborhoods with positional encodings, then extends via overlapping outpainting and deferred rendering. All claims rest on empirical training from external datasets and standard diffusion objectives; no equations, predictions, or uniqueness theorems reduce by construction to fitted inputs or self-citations. The large-scale consistency is presented as an experimental outcome rather than a tautological derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claim rests on the effectiveness of the newly introduced Σ-Voxfield representation and the ability of the diffusion model to maintain coherence during outpainting; no explicit free parameters are listed in the abstract, though diffusion training inherently involves many hyperparameters.

invented entities (1)
  • Σ-Voxfield grid · no independent evidence
    purpose: Discrete 3D representation in which each occupied voxel stores a fixed number of colorized surface samples
    Introduced as the core data structure enabling the semantic diffusion and deferred rendering pipeline.

pith-pipeline@v0.9.0 · 5545 in / 1336 out tokens · 64677 ms · 2026-05-10T19:28:21.415278+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

34 extracted references · 16 canonical work pages · 7 internal anchors

  1. [2] Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al.: Stable Video Diffusion: Scaling latent video diffusion models to large datasets. arXiv:2311.15127 (2023)

  2. [3] Cai, Y., Zhang, H., Zhang, K., Liang, Y., Ren, M., Luan, F., Liu, Q., Kim, S.Y., Zhang, J., Zhang, Z., Zhou, Y., Zhang, Y., Yang, X., Lin, Z., Yuille, A.: Baking Gaussian splatting into diffusion denoiser for fast and scalable single-stage image-to-3D generation and reconstruction. In: Proc. of the IEEE International Conf. on Computer Vision (ICCV) (2025)

  3. [4] Cernea, D.: OpenMVS: Multi-view stereo reconstruction library (2020), https://cdcseacave.github.io

  4. [5] Chen, Z., Yang, J., Huang, J., de Lutio, R., Esturo, J.M., Ivanovic, B., Litany, O., Gojcic, Z., Fidler, S., Pavone, M., Song, L., Wang, Y.: OmniRe: Omni urban scene reconstruction. In: Proc. of the International Conf. on Learning Representations (ICLR) (2025)

  5. [6] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The Cityscapes dataset for semantic urban scene understanding (2016), https://arxiv.org/abs/1604.01685

  6. [7] Dhariwal, P., Nichol, A.: Scalable Diffusion Models with Transformers. In: Proc. of the IEEE International Conf. on Computer Vision (ICCV) (2023)

  7. [8] Gao, R., Chen, K., Li, Z., Hong, L., Li, Z., Xu, Q.: MagicDrive3D: Controllable 3D generation for any-view rendering in street scenes. In: Proc. of the IEEE Winter Conference on Applications of Computer Vision (WACV) (2026)

  8. [9] Gao, R., Chen, K., Xie, E., Hong, L., Li, Z., Yeung, D.Y., Xu, Q.: MagicDrive: Street view generation with diverse 3D geometry control. In: Proc. of the International Conf. on Learning Representations (ICLR) (2024)

  9. [10] Guo, H., Shao, J., Chen, X., Tan, X., Miao, S., Shen, Y., Liao, Y.: ScenDi: 3D-to-2D scene diffusion cascades for urban generation. arXiv:2601.15221 (2026)

  10. [11] Huang, B., Yu, Z., Chen, A., Geiger, A., Gao, S.: 2D Gaussian Splatting for Geometrically Accurate Radiance Fields. In: SIGGRAPH 2024 Conference Papers (2024)

  11. [12] Kingma, D.P., Welling, M.: Auto-encoding variational Bayes (2022), https://arxiv.org/abs/1312.6114

  12. [13] Lin, H., Chen, S., Liew, J., Chen, D.Y., Li, Z., Shi, G., Feng, J., Kang, B.: Depth Anything 3: Recovering the visual space from any views. arXiv:2511.10647 (2025)

  13. [14] Liu, X., Zhang, Y., Chen, Z., Wang, J., Tang, Y., et al.: GEN3C: Generalizable 3D scene generation from video diffusion with 3D cache. In: Advances in Neural Information Processing Systems (NeurIPS) (2024)

  14. [15] Lorensen, W.E., Cline, H.E.: Marching cubes: A high resolution 3D surface construction algorithm. ACM SIGGRAPH Computer Graphics 21(4), 163–169 (1987). https://doi.org/10.1145/37401.37422

  15. [16] Lu, F., Lin, K.Y., Xu, Y., Li, H., Chen, G., Jiang, C.: Urban Architect: Steerable 3D urban scene generation with layout prior. arXiv:2404.06780 (2024)

  16. [17] Lu, Y., Ren, X., Yang, J., Shen, T., Wu, Z., Gao, J., Wang, Y., Chen, S., Chen, M., Fidler, S., Huang, J.: InfiniCube: Unbounded and controllable dynamic 3D driving scene generation with world-guided video models. In: Proc. of the IEEE International Conf. on Computer Vision (ICCV) (2025)

  17. [18] Lugmayr, A., Danelljan, M., Romero, A., Yu, F., Timofte, R., Gool, L.V.: RePaint: Inpainting using denoising diffusion probabilistic models. In: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) (2022)

  18. [19] Mao, J., Li, B., Ivanovic, B., Chen, Y., Wang, Y., You, Y., Xiao, C., Xu, D., Pavone, M., Wang, Y.: DreamDrive: Generative 4D scene modeling from street view images. In: Proc. IEEE International Conf. on Robotics and Automation (ICRA) (2024)

  19. [20] Nichol, A., Jun, H., Dhariwal, P., Mishkin, P., Chen, M.: Point-E: A system for generating 3D point clouds from complex prompts. arXiv:2212.08751 (2022)

  20. [21] NVIDIA, Agarwal, N., Ali, A., Bala, M., Balaji, Y., Barker, E., Cai, T., Chattopadhyay, P., Chen, Y., Cui, Y., Ding, Y., Dworakowski, D., Fan, J., Fenzi, M., Ferroni, F., Fidler, S., Fox, D., Ge, S., Ge, Y., Gu, J., Gururani, S., He, E., Huang, J., Huffman, J., Jannaty, P., Jin, J., Kim, S.W., Klár, G., Lam, G., Lan, S., Leal-Taixe, L., Li, A., Li, ...

  21. [22] Ost, J., Ramazzina, A., Joshi, A., Bömer, M., Bijelic, M., Heide, F.: LSD-3D: Large-scale 3D driving scene generation with geometry grounding. In: Proc. of the Conf. on Artificial Intelligence (AAAI) (2026)

  22. [23] Poole, B., Jain, A., Barron, J.T., Mildenhall, B.: DreamFusion: Text-to-3D using 2D diffusion (2022), https://arxiv.org/abs/2209.14988

  23. [24] Qiu, Q., Gao, H., Hua, W., Huang, G., He, X.: PriorLane: A prior knowledge enhanced lane detection approach based on transformer (2023), https://arxiv.org/abs/2209.06994

  24. [25] Ren, X., Lu, Y., Liang, H., Wu, Z., Ling, H., Chen, M., Fidler, S., Williams, F., Huang, J.: SCube: Instant large-scale scene reconstruction using VoxSplats. In: Advances in Neural Information Processing Systems (NeurIPS) (2024)

  25. [26] Roessle, B., Gallo, I., Belongie, S., Geiger, A.: L3DG: Latent 3D Gaussian diffusion. In: Proc. of SIGGRAPH Asia (2024)

  26. [27] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2022), https://arxiv.org/abs/2112.10752

  27. [28] Sun, P., Kretzschmar, H., Dotiwalla, X., Chouard, A., Patnaik, V., Tsui, P., Guo, J., Zhou, Y., Chai, Y., Caine, B., Vasudevan, V., Wang, W., Anguelov, D.: Scalability in Perception for Autonomous Driving: Waymo Open Dataset. In: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) (2020)

  28. [29] Valevski, D., Leviathan, Y., Arar, M., Fruchter, S.: Diffusion models are real-time game engines (2025), https://arxiv.org/abs/2408.14837

  29. [30] Xiang, J., Lv, Z., Xu, S., Deng, Y., Wang, R., Zhang, B., Chen, D., Tong, X., Yang, J.: Structured 3D latents for scalable and versatile 3D generation (2025), https://arxiv.org/abs/2412.01506

  30. [31] Xiao, J., Owens, A., Torralba, A.: PandaSet: Advanced sensor suite dataset for autonomous driving. In: IEEE/CVF International Conf. on Computer Vision (ICCV) (2021)

  31. [32] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: SegFormer: Simple and efficient design for semantic segmentation with transformers (2021), https://arxiv.org/abs/2105.15203

  32. [33] Xiong, B., Wei, S.T., Zheng, X.Y., Cao, Y.P., Lian, Z., Wang, P.S.: OctFusion: Octree-based diffusion models for 3D shape generation. In: Proc. of the Conference of the European Association for Computer Graphics (Eurographics) (2025)

  33. [34] Zhang, B., Cheng, Y., Yang, J., Wang, C., Zhao, F., Tang, Y., Chen, D., Guo, B.: GaussianCube: A structured and explicit radiance representation for 3D generative modeling. In: Advances in Neural Information Processing Systems (NeurIPS) (2024)

  34. [35] Zhou, Q.Y., Park, J., Koltun, V.: Open3D: A modern library for 3D data processing. arXiv:1801.09847 (2018)