arxiv: 2605.17682 · v1 · pith:3Y73Y6ETnew · submitted 2026-05-17 · 💻 cs.CV

GEM: Gaussian Evolution Model for Occupancy Forecasting and Motion Planning

Pith reviewed 2026-05-20 13:17 UTC · model grok-4.3

classification 💻 cs.CV
keywords 4D Gaussian primitives · occupancy forecasting · motion planning · autonomous driving · non-autoregressive · semantic occupancy · continuous-time dynamics · world model

The pith

GEM represents driving scenes as continuous 4D Gaussian primitives that can be queried directly at any future time for occupancy forecasting and motion planning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that driving scenes can be modeled as explicit continuous 4D Gaussian primitives whose spatial geometry, temporal support, and motion are decoupled and learned independently. This replaces the common practice of discretizing scenes into latents or tokens and forecasting through fixed-step autoregressive generation, which accumulates errors and lacks timing flexibility. A sympathetic reader would care because the approach promises compact, interpretable representations that support efficient full-horizon forecasts and direct ego-trajectory prediction for planning without sequential rollout. The method splats queried Gaussians into semantic occupancy volumes at arbitrary timestamps.

Core claim

GEM represents driving scenes as explicit continuous 4D Gaussian primitives with learned dynamics. Instead of rolling out future occupancy states step by step, GEM directly queries the Gaussian world representation at arbitrary timestamps and splats the corresponding conditional 3D Gaussians into semantic occupancy volumes. By decoupling spatial geometry, temporal support, and primitive motion, the predicted world becomes easier to inspect as each primitive's evolution can be followed continuously. The same representation supports motion planning by predicting future ego trajectories from the learned Gaussian world.
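A minimal sketch of the query-then-splat idea described above, in Python with NumPy. It is not the paper's implementation: the linear motion model, the Gaussian temporal window, the dense voxel evaluation, and every name (`query_gaussians`, `splat_to_occupancy`, `forecast`, `occ_thresh`, the empty label `-1`) are assumptions chosen for illustration.

```python
import numpy as np


def query_gaussians(mu0, vel, t0, sigma_t, opacity, t):
    """Condition N decoupled 4D primitives on a query time t.

    mu0 (N,3): centers at each primitive's reference time t0 (N,).
    vel (N,3): per-primitive velocity (assumed linear motion).
    sigma_t (N,): temporal extent; opacity (N,): base weight.
    """
    dt = t - t0                                            # (N,)
    centers = mu0 + vel * dt[:, None]                      # motion moves the mean
    weight = opacity * np.exp(-0.5 * (dt / sigma_t) ** 2)  # temporal support fades the primitive
    return centers, weight


def splat_to_occupancy(centers, cov, weight, class_prob,
                       grid_min, voxel_size, grid_shape, occ_thresh=0.5):
    """Splat conditional 3D Gaussians into a semantic voxel grid.

    cov (N,3,3): spatial covariances; class_prob (N,C): per-primitive semantics.
    Returns int labels of shape grid_shape, with -1 marking empty voxels.
    """
    idx = np.stack(np.meshgrid(*[np.arange(n) for n in grid_shape],
                               indexing="ij"), axis=-1)
    vox = np.asarray(grid_min) + (idx + 0.5) * voxel_size  # voxel centers (X,Y,Z,3)
    diff = vox[..., None, :] - centers                     # (X,Y,Z,N,3)
    prec = np.linalg.inv(cov)                              # (N,3,3)
    maha = np.einsum("...ni,nij,...nj->...n", diff, prec, diff)
    dens = weight * np.exp(-0.5 * maha)                    # (X,Y,Z,N)
    scores = dens @ class_prob                             # (X,Y,Z,C)
    return np.where(dens.sum(-1) > occ_thresh, scores.argmax(-1), -1)


def forecast(prims, t, grid):
    """Query at an arbitrary time t, then splat; no rollout over earlier steps."""
    centers, weight = query_gaussians(prims["mu0"], prims["vel"], prims["t0"],
                                      prims["sigma_t"], prims["opacity"], t)
    return splat_to_occupancy(centers, prims["cov"], weight,
                              prims["class_prob"], **grid)


# Toy scene (made-up numbers): one static primitive, one moving along x.
prims = {
    "mu0": np.array([[1.5, 1.5, 0.5], [2.5, 2.5, 0.5]]),
    "vel": np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]),
    "t0": np.zeros(2), "sigma_t": np.full(2, 10.0), "opacity": np.ones(2),
    "cov": np.stack([0.09 * np.eye(3)] * 2),
    "class_prob": np.eye(2),                     # primitive i carries class i
}
grid = {"grid_min": (0.0, 0.0, 0.0), "voxel_size": 1.0, "grid_shape": (8, 4, 2)}
occ_now = forecast(prims, 0.0, grid)
occ_later = forecast(prims, 1.5, grid)           # any real-valued timestamp works
```

Each call to `forecast` depends only on the primitive parameters and the query time, so two timestamps can be evaluated in any order or in parallel. A practical system would restrict each primitive to nearby voxels rather than evaluate the dense product used here.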

What carries the argument

Explicit continuous 4D Gaussian primitives with decoupled spatial geometry, temporal support, and primitive motion, which enable direct querying at arbitrary timestamps followed by splatting into occupancy volumes.
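As background for what "decoupled" could mean here, the block below first states the standard conditioning identity for a 4D Gaussian with a full space-time covariance, then writes one possible decoupled form. The second expression is an assumption for illustration only: the symbols $\alpha_i$, $v_i$, $\mu_{t,i}$, $\sigma_{t,i}$ and the linear motion term are not taken from the paper, whose exact parameterization may differ.

```latex
% Full 4D covariance: conditioning on time t is the usual Gaussian identity.
% \Sigma_{xt} is the 3x1 space-time cross-covariance, \sigma_t^2 the time variance.
\mu_{x\mid t} = \mu_x + \Sigma_{xt}\,\sigma_t^{-2}\,(t-\mu_t), \qquad
\Sigma_{x\mid t} = \Sigma_x - \sigma_t^{-2}\,\Sigma_{xt}\Sigma_{xt}^{\top}

% One possible decoupled primitive (assumed form): geometry \Sigma_{x,i},
% temporal support (\mu_{t,i}, \sigma_{t,i}), and motion v_i are separate parameters.
G_i(x, t) = \alpha_i\,
  \exp\!\Big(-\tfrac{(t-\mu_{t,i})^2}{2\sigma_{t,i}^2}\Big)\,
  \exp\!\Big(-\tfrac12\,\big(x-\mu_{x,i}-v_i\,(t-\mu_{t,i})\big)^{\top}
  \Sigma_{x,i}^{-1}\,\big(x-\mu_{x,i}-v_i\,(t-\mu_{t,i})\big)\Big)
```

In the full-covariance case the cross term $\Sigma_{xt}$ ties motion to shape. Choosing $v_i = \Sigma_{xt}/\sigma_t^2$ reproduces the same conditional mean, while the decoupled form keeps the spatial covariance fixed over time so that each factor can be inspected on its own.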

If this is right

  • Forecasting runs non-autoregressively over the full horizon without stepwise error accumulation.
  • Arbitrary-timestamp querying supplies temporal flexibility absent from fixed-step models.
  • Individual primitive evolution can be inspected continuously for interpretability.
  • The Gaussian world directly yields ego-trajectory predictions for motion planning.
  • The representation stays compact while achieving state-of-the-art occupancy forecasting performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The explicit decoupling of geometry, time support, and motion may simplify adaptation to new sensor configurations or environments not seen in training.
  • Tracking separate primitives could provide a natural way to analyze and correct specific prediction failures such as missed moving objects.
  • The continuous-time formulation might extend naturally to tasks requiring irregular or event-driven timing, such as reactive collision avoidance.

Load-bearing premise

Driving scenes can be compactly and accurately captured by a set of explicit continuous 4D Gaussian primitives whose spatial geometry, temporal support, and primitive motion can be decoupled and learned independently without losing critical scene details.

What would settle it

A head-to-head evaluation on real driving datasets showing that long-horizon semantic occupancy forecasts from GEM contain higher error or more visual artifacts than autoregressive baselines, especially when queried at timestamps outside the training interval, would falsify the claimed advantages.

Figures

Figures reproduced from arXiv: 2605.17682 by Cheng Chen, Hao Huang, Saurabh Bagchi.

Figure 1: Comparison between discretized occupancy world models and our continuous 4D Gaussian world model. Existing methods typically forecast future occupancy through (a) discrete tokens or (b) volumetric features at fixed timestamps, often requiring sequential rollout. In contrast, our GEM shown in (c) represents the scene as evolving 4D semantic Gaussian primitives that can be directly queried at arbitrary times…

Figure 2: Unlike a full 4D covariance shown in (a) …

Figure 3: Overall pipeline of GEM. Historical multi-view images and ego states are encoded to refine …

Figure 4: Qualitative comparison of future occupancy forecasting at …

Figure 5: Arbitrary-timestamp occupancy forecasting of two scenes with GEM. The continuous 4D …
Original abstract

Future 3D semantic occupancy forecasting and motion planning are central to autonomous driving, as they require models to reason about how surrounding scenes evolve and how the ego vehicle should act. Existing occupancy world models commonly discretize scenes into latent embeddings, volumetric features, or quantized tokens, and forecast future states through fixed-step autoregressive generation. This limits temporal flexibility, obscures scene evolution, accumulates errors over long horizons, and poorly matches the continuous-time dynamics of real driving scenes. We propose GEM, a Gaussian Evolution Model for non-autoregressive occupancy world modeling, where driving scenes are represented as explicit continuous 4D Gaussian primitives with learned dynamics. Instead of rolling out future occupancy states step by step, GEM directly queries the Gaussian world representation at arbitrary timestamps and splats the corresponding conditional 3D Gaussians into semantic occupancy volumes. This enables efficient forecasting over the full horizon while retaining a compact and interpretable scene representation. By decoupling spatial geometry, temporal support, and primitive motion, GEM makes the predicted world easier to inspect, as each primitive's evolution can be followed continuously over time. The same representation also supports motion planning by predicting future ego trajectories from the learned Gaussian world. Extensive experiments show that GEM achieves state-of-the-art future semantic occupancy forecasting and strong motion planning performance, while providing flexible temporal querying.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes GEM, a Gaussian Evolution Model for non-autoregressive occupancy world modeling in autonomous driving. Scenes are represented as explicit continuous 4D Gaussian primitives with learned dynamics; future semantic occupancy is obtained by directly querying the representation at arbitrary timestamps and splatting conditional 3D Gaussians, while the same representation supports ego-trajectory prediction for motion planning. The approach is positioned as overcoming error accumulation and temporal rigidity in autoregressive discrete models while providing interpretability through decoupled primitive evolution.

Significance. If the empirical claims hold, GEM would supply a compact, continuous, and queryable alternative to latent or tokenized occupancy world models, with clear advantages for flexible-horizon forecasting and downstream planning. The explicit primitive representation could also aid inspection and debugging of predicted scene evolution.

major comments (1)
  1. [§3] §3 (Method): The central modeling choice decouples spatial geometry, temporal support, and primitive motion into independently learned components. This decoupling is load-bearing for both the non-autoregressive querying claim and the interpretability argument, yet the manuscript provides no explicit interaction terms or joint optimization between primitives. Real driving scenes contain coupled dynamics (e.g., one vehicle’s braking affecting neighboring trajectories); without an ablation demonstrating that independent dynamics suffice, the forecasting accuracy over long horizons remains at risk.
minor comments (2)
  1. [Abstract] Abstract: The SOTA claim is stated without naming the primary dataset, key metrics, or quantitative margins over baselines; a single sentence summarizing the strongest empirical result would improve readability.
  2. [§4] §4 (Experiments): Tables reporting occupancy forecasting should include per-horizon breakdowns and error bars; the current presentation makes it difficult to assess whether gains are consistent or driven by particular time scales.
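The per-horizon breakdown requested in the second minor comment could be computed along the following lines. This is a sketch under stated assumptions, not the paper's evaluation code: the array layout `(scenes, horizons, X, Y, Z)`, the function names, and the choice to average per-scene mIoU (so that a spread across scenes exists) are all illustrative. Benchmark protocols may instead accumulate intersections and unions over the whole dataset, which yields a single number per horizon.

```python
import numpy as np


def miou(pred, gt, num_classes):
    """Mean IoU over classes that appear in the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue                      # class absent from both; skip it
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious)) if ious else float("nan")


def per_horizon_miou(preds, gts, horizons, num_classes):
    """Return {horizon: (mean, std)} of per-scene mIoU.

    preds, gts: int label arrays of shape (S, T, X, Y, Z), with T == len(horizons).
    std is the population standard deviation (ddof=0) across the S scenes.
    """
    report = {}
    for k, h in enumerate(horizons):
        scores = np.array([miou(preds[s, k], gts[s, k], num_classes)
                           for s in range(preds.shape[0])])
        report[h] = (float(np.nanmean(scores)), float(np.nanstd(scores)))
    return report


# Toy check (made-up labels): 2 scenes, 2 horizons, a 2x1x1 grid, 2 classes.
gts = np.tile(np.array([0, 1]).reshape(1, 1, 2, 1, 1), (2, 2, 1, 1, 1))
preds = gts.copy()
preds[1, 1] = 0                           # scene 1 misses class 1 at the later horizon
report = per_horizon_miou(preds, gts, horizons=[0.5, 1.0], num_classes=2)
```

Reporting the pair per horizon makes it visible whether a gain holds at every time scale or is concentrated at short horizons.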

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and insightful comments on our manuscript. We address the major comment regarding the decoupling of components in §3 below, providing clarifications on joint optimization and committing to an added ablation.

Point-by-point responses
  1. Referee: [§3] §3 (Method): The central modeling choice decouples spatial geometry, temporal support, and primitive motion into independently learned components. This decoupling is load-bearing for both the non-autoregressive querying claim and the interpretability argument, yet the manuscript provides no explicit interaction terms or joint optimization between primitives. Real driving scenes contain coupled dynamics (e.g., one vehicle’s braking affecting neighboring trajectories); without an ablation demonstrating that independent dynamics suffice, the forecasting accuracy over long horizons remains at risk.

    Authors: We thank the referee for this important observation. While each primitive's motion parameters are indeed learned independently to support continuous-time querying and per-primitive interpretability, all primitives are optimized jointly in an end-to-end manner. The training objective is a loss on the splatted semantic occupancy volumes at queried timestamps, so gradients flow across the entire set of primitives; this enables the model to capture implicit inter-primitive couplings present in the data (e.g., realistic braking or avoidance behaviors emerge from how neighboring primitives co-evolve to minimize occupancy error). Explicit interaction terms were deliberately omitted to preserve the compact, decoupled representation that underpins both non-autoregressive querying and inspection of individual primitive trajectories. We acknowledge that a dedicated ablation would strengthen the long-horizon claim and will add one in the revised manuscript: a comparison against a variant that injects explicit interactions (e.g., via lightweight cross-primitive attention on motion parameters). Our current SOTA forecasting results already indicate that the independent-dynamics formulation suffices in practice, but the new ablation will quantify any accuracy-complexity trade-off.

    revision: yes
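To make the proposed ablation concrete, here is one way a "lightweight cross-primitive attention on motion parameters" variant might look. Everything in it is hypothetical: the per-primitive feature vectors, the projection matrices, the single attention head, and the `gate` switch are illustrative choices, and the simulated rebuttal does not specify any of them.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # stabilize before exponentiating
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def interact_velocities(feat, vel, Wq, Wk, Wv, Wo, gate=1.0):
    """Single-head attention across primitives that adds a velocity residual.

    feat (N,D): hypothetical per-primitive features; vel (N,3): independent velocities.
    Wq, Wk, Wv (D,H) and Wo (H,3) would be learned projections in a real model.
    gate=0.0 switches the interaction off and returns the independent baseline.
    """
    q, k, v = feat @ Wq, feat @ Wk, feat @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (N,N), each row sums to 1
    delta = (attn @ v) @ Wo                          # (N,3) correction from other primitives
    return vel + gate * delta, attn


# Random stand-ins for learned quantities, only to exercise the shapes.
rng = np.random.default_rng(0)
N, D, H = 5, 8, 4
feat, vel = rng.normal(size=(N, D)), rng.normal(size=(N, 3))
Wq, Wk, Wv = (rng.normal(size=(D, H)) * 0.1 for _ in range(3))
Wo = rng.normal(size=(H, 3)) * 0.1
vel_coupled, attn = interact_velocities(feat, vel, Wq, Wk, Wv, Wo)
vel_indep, _ = interact_velocities(feat, vel, Wq, Wk, Wv, Wo, gate=0.0)
```

Because the residual is computed once from the primitive set rather than at every time step, a variant like this would still allow direct querying at arbitrary timestamps. Comparing `gate=1.0` against `gate=0.0` is the accuracy-complexity comparison the referee asks for.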

Circularity Check

0 steps flagged

No circularity detected in GEM's representation and forecasting approach

full rationale

The paper introduces an explicit continuous 4D Gaussian primitive representation with learned dynamics for non-autoregressive occupancy forecasting and motion planning. Forecasting is performed by direct querying of the representation at arbitrary timestamps followed by splatting, which follows directly from the continuous-time modeling choice rather than reducing to any fitted parameter or self-referential definition by construction. No equations, self-citations, uniqueness theorems, or ansatzes from prior author work are invoked in a load-bearing way that would force the central claims. The performance results are presented as empirical outcomes from experiments, and the decoupling of geometry, temporal support, and motion is an explicit modeling assumption rather than a tautology. This is a standard case of a self-contained modeling paper with independent content.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on representing scenes via Gaussian primitives whose dynamics are learned from data. Judging from the abstract alone, this introduces modeling assumptions and entities that are not derived from prior literature.

free parameters (1)
  • learned dynamics of Gaussian primitives
    The motion and temporal evolution parameters for each primitive are learned, constituting fitted values that enable the forecasting.
axioms (1)
  • domain assumption: Driving scenes can be represented as explicit continuous 4D Gaussian primitives with decoupled spatial geometry, temporal support, and primitive motion
    This is the foundational modeling choice invoked throughout the abstract to enable direct timestamp querying.
invented entities (1)
  • 4D Gaussian primitives (no independent evidence)
    purpose: To serve as the explicit continuous representation of scene elements that evolve over time
    New postulated representation introduced to replace discrete latent or token-based scene encodings

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages

  1. [1]

    VAD: Vectorized scene representation for efficient autonomous driving

    Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. VAD: Vectorized scene representation for efficient autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340–8350, 2023

  2. [2]

    Trajectory-guided control prediction for end-to-end autonomous driving: A simple yet strong baseline

    Penghao Wu, Xiaosong Jia, Li Chen, Junchi Yan, Hongyang Li, and Yu Qiao. Trajectory-guided control prediction for end-to-end autonomous driving: A simple yet strong baseline. Advances in Neural Information Processing Systems, 35:6119–6132, 2022

  3. [3]

    Planning-oriented autonomous driving

    Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, Lewei Lu, Xiaosong Jia, Qiang Liu, Jifeng Dai, Yu Qiao, and Hongyang Li. Planning-oriented autonomous driving. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 17853–17862, June 2023

  4. [4]

    ST-P3: End-to-end vision-based autonomous driving via spatial-temporal feature learning

    Shengchao Hu, Li Chen, Penghao Wu, Hongyang Li, Junchi Yan, and Dacheng Tao. ST-P3: End-to-end vision-based autonomous driving via spatial-temporal feature learning. In European Conference on Computer Vision, pages 533–549. Springer, 2022

  5. [5]

    OccWorld: Learning a 3d occupancy world model for autonomous driving

    Wenzhao Zheng, Weiliang Chen, Yuanhui Huang, Borui Zhang, Yueqi Duan, and Jiwen Lu. OccWorld: Learning a 3d occupancy world model for autonomous driving. In European Conference on Computer Vision, pages 55–72. Springer, 2024

  6. [6]

    Occ-LLM: Enhancing autonomous driving with occupancy-based large language models

    Tianshuo Xu, Hao Lu, Xu Yan, Yingjie Cai, Bingbing Liu, and Yingcong Chen. Occ-LLM: Enhancing autonomous driving with occupancy-based large language models. In IEEE International Conference on Robotics and Automation, pages 8434–8441. IEEE, 2025

  7. [7]

    RenderWorld: World model with self-supervised 3d label

    Ziyang Yan, Wenzhen Dong, Yihua Shao, Yuhang Lu, Haiyang Liu, Jingwen Liu, Haozhe Wang, Zhe Wang, Yan Wang, Fabio Remondino, et al. RenderWorld: World model with self-supervised 3d label. In IEEE International Conference on Robotics and Automation, pages 6063–6070. IEEE, 2025

  8. [8]

    Semi-supervised vision-centric 3d occupancy world model for autonomous driving

    Xiang Li, Pengfei Li, Yupeng Zheng, Wei Sun, Yan Wang, and Yilun Chen. Semi-supervised vision-centric 3d occupancy world model for autonomous driving. In International Conference on Learning Representations, 2025

  9. [9]

    SparseWorld: A flexible, adaptive, and efficient 4d occupancy world model powered by sparse and dynamic queries

    Chenxu Dang, Haiyan Liu, Jason Bao, Pei An, Xinyue Tang, An Pan, Jie Ma, Bingchuan Sun, and Yan Wang. SparseWorld: A flexible, adaptive, and efficient 4d occupancy world model powered by sparse and dynamic queries. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 3497–3505, 2026

  10. [10]

    Occ3D: A large-scale 3d occupancy prediction benchmark for autonomous driving

    Xiaoyu Tian, Tao Jiang, Longfei Yun, Yucheng Mao, Huitong Yang, Yue Wang, Yilun Wang, and Hang Zhao. Occ3D: A large-scale 3d occupancy prediction benchmark for autonomous driving. In Advances in Neural Information Processing Systems, volume 36, pages 64318–64330, 2023

  11. [11]

    Driving in the Occupancy World: Vision-centric 4d occupancy forecasting and planning via world models for autonomous driving

    Yu Yang, Jianbiao Mei, Yukai Ma, Siliang Du, Wenqing Chen, Yijie Qian, Yuxiang Feng, and Yong Liu. Driving in the Occupancy World: Vision-centric 4d occupancy forecasting and planning via world models for autonomous driving. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 9327–9335, 2025

  12. [12]

    I2-World: Intra-inter tokenization for efficient dynamic 4d scene forecasting

    Zhimin Liao, Ping Wei, Ruijie Zhang, Shuaijia Chen, Haoxuan Wang, and Ziyang Ren. I2-World: Intra-inter tokenization for efficient dynamic 4d scene forecasting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 25810–25819, 2025

  13. [13]

    OccLLaMA: An occupancy-language-action generative world model for autonomous driving

    Julong Wei, Shanshuai Yuan, Pengfei Li, Qingda Hu, Zhongxue Gan, and Wenchao Ding. OccLLaMA: An occupancy-language-action generative world model for autonomous driving. arXiv preprint arXiv:2409.03272, 2024

  14. [14]

    Real-time photorealistic dynamic scene representation and rendering with 4d gaussian splatting

    Zeyu Yang, Hongye Yang, Zijie Pan, and Li Zhang. Real-time photorealistic dynamic scene representation and rendering with 4d gaussian splatting. In International Conference on Learning Representations, 2024

  15. [15]

    4D-Rotor Gaussian Splatting: Towards efficient novel view synthesis for dynamic scenes

    Yuanxing Duan, Fangyin Wei, Qiyu Dai, Yuhang He, Wenzheng Chen, and Baoquan Chen. 4D-Rotor Gaussian Splatting: Towards efficient novel view synthesis for dynamic scenes. In ACM SIGGRAPH, pages 1–11, 2024

  16. [16]

    4d gaussian splatting as a learned dynamical system

    Arnold Caleb Asiimwe and Carl Vondrick. 4d gaussian splatting as a learned dynamical system. arXiv preprint arXiv:2512.19648, 2025

  17. [17]

    3d gaussian splatting for real-time radiance field rendering

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4):139–1, 2023

  18. [18]

    GaussianFormer: Scene as gaussians for vision-based 3d semantic occupancy prediction

    Yuanhui Huang, Wenzhao Zheng, Yunpeng Zhang, Jie Zhou, and Jiwen Lu. GaussianFormer: Scene as gaussians for vision-based 3d semantic occupancy prediction. In European Conference on Computer Vision, pages 376–393. Springer, 2024

  19. [19]

    SparseBEV: High-performance sparse 3d object detection from multi-camera videos

    Haisong Liu, Yao Teng, Tao Lu, Haiguang Wang, and Limin Wang. SparseBEV: High-performance sparse 3d object detection from multi-camera videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 18580–18590, 2023

  20. [20]

    GaussianFormer-2: Probabilistic gaussian superposition for efficient 3d occupancy prediction

    Yuanhui Huang, Amonnut Thammatadatrakoon, Wenzhao Zheng, Yunpeng Zhang, Dalong Du, and Jiwen Lu. GaussianFormer-2: Probabilistic gaussian superposition for efficient 3d occupancy prediction. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 27477–27486, 2025

  21. [21]

    The lovász-softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks

    Maxim Berman, Amal Rannen Triki, and Matthew B Blaschko. The lovász-softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 4413–4421, 2018

  22. [22]

    nuScenes: A multimodal dataset for autonomous driving

    Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 11621–11631, 2020

  23. [23]

    Gaussian-based World Model: Gaussian priors for voxel-based occupancy prediction and future motion prediction

    Tuo Feng, Wenguan Wang, and Yi Yang. Gaussian-based World Model: Gaussian priors for voxel-based occupancy prediction and future motion prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 25239–25249, October 2025

  24. [24]

    Tri-perspective view for vision-based 3d semantic occupancy prediction

    Yuanhui Huang, Wenzhao Zheng, Yunpeng Zhang, Jie Zhou, and Jiwen Lu. Tri-perspective view for vision-based 3d semantic occupancy prediction. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 9223–9232, 2023

  25. [25]

    SurroundOcc: Multi-camera 3d occupancy prediction for autonomous driving

    Yi Wei, Linqing Zhao, Wenzhao Zheng, Zheng Zhu, Jie Zhou, and Jiwen Lu. SurroundOcc: Multi-camera 3d occupancy prediction for autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21729–21740, 2023

  26. [26]

    OccFormer: Dual-path transformer for vision-based 3d semantic occupancy prediction

    Yunpeng Zhang, Zheng Zhu, and Dalong Du. OccFormer: Dual-path transformer for vision-based 3d semantic occupancy prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9433–9443, 2023

  27. [27]

    CVT-Occ: Cost volume temporal fusion for 3d occupancy prediction

    Zhangchen Ye, Tao Jiang, Chenfeng Xu, Yiming Li, and Hang Zhao. CVT-Occ: Cost volume temporal fusion for 3d occupancy prediction. In European Conference on Computer Vision, pages 381–397. Springer, 2024

  28. [28]

    STCOcc: Sparse spatial-temporal cascade renovation for 3d occupancy and scene flow prediction

    Zhimin Liao, Ping Wei, Shuaijia Chen, Haoxuan Wang, and Ziyang Ren. STCOcc: Sparse spatial-temporal cascade renovation for 3d occupancy and scene flow prediction. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1516–1526, June 2025

  29. [29]

    ALOcc: Adaptive lifting-based 3d semantic occupancy and cost volume-based flow predictions

    Dubing Chen, Jin Fang, Wencheng Han, Xinjing Cheng, Junbo Yin, Chengzhong Xu, Fahad Shahbaz Khan, and Jianbing Shen. ALOcc: Adaptive lifting-based 3d semantic occupancy and cost volume-based flow predictions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4156–4166, 2025

  30. [30]

    GaussianWorld: Gaussian world model for streaming 3d occupancy prediction

    Sicheng Zuo, Wenzhao Zheng, Yuanhui Huang, Jie Zhou, and Jiwen Lu. GaussianWorld: Gaussian world model for streaming 3d occupancy prediction. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 6772–6781, 2025

  31. [31]

    GaussianOcc: Fully self-supervised and efficient 3d occupancy estimation with gaussian splatting

    Wanshui Gan, Fang Liu, Hongbin Xu, Ningkai Mo, and Naoto Yokoya. GaussianOcc: Fully self-supervised and efficient 3d occupancy estimation with gaussian splatting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 28980–28990, 2025

  32. [32]

    GaussianFlowOcc: Sparse and weakly supervised occupancy estimation using gaussian splatting and temporal flow

    Simon Boeder, Fabian Gigengack, and Benjamin Risse. GaussianFlowOcc: Sparse and weakly supervised occupancy estimation using gaussian splatting and temporal flow. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 24943–24954, 2025

  33. [33]

    Vision-only gaussian splatting for collaborative semantic occupancy prediction

    Cheng Chen, Hao Huang, and Saurabh Bagchi. Vision-only gaussian splatting for collaborative semantic occupancy prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 2796–2804, 2026

  34. [34]

    Dynamic 3D Gaussians: Tracking by persistent dynamic view synthesis

    Jonathon Luiten, Georgios Kopanas, Bastian Leibe, and Deva Ramanan. Dynamic 3D Gaussians: Tracking by persistent dynamic view synthesis. In International Conference on 3D Vision, pages 800–809. IEEE, 2024

  35. [35]

    Street Gaussians: Modeling dynamic urban scenes with gaussian splatting

    Yunzhi Yan, Haotong Lin, Chenxu Zhou, Weijie Wang, Haiyang Sun, Kun Zhan, Xianpeng Lang, Xiaowei Zhou, and Sida Peng. Street Gaussians: Modeling dynamic urban scenes with gaussian splatting. In European Conference on Computer Vision, pages 156–173. Springer, 2024

  36. [36]

    DOME: Taming diffusion model into high-fidelity controllable occupancy world model

    Songen Gu, Wei Yin, Bu Jin, Xiaoyang Guo, Junming Wang, Haodong Li, Qian Zhang, and Xiaoxiao Long. DOME: Taming diffusion model into high-fidelity controllable occupancy world model. arXiv preprint arXiv:2410.10429, 2024

  37. [37]

    COME: Adding scene-centric forecasting control to occupancy world model

    Yining Shi, Kun Jiang, Qiang Meng, Ke Wang, Jiabao Wang, Wenchao Sun, Tuopu Wen, Mengmeng Yang, and Diange Yang. COME: Adding scene-centric forecasting control to occupancy world model. In Advances in Neural Information Processing Systems, 2026

  38. [38]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019

A Related Work

3D occupancy prediction. 3D occupancy prediction [24] has become an important scene representation for autonomous driving because it provides dense geometric and semantic understanding beyond object-level ...