GEM: Gaussian Evolution Model for Occupancy Forecasting and Motion Planning

Cheng Chen; Hao Huang; Saurabh Bagchi

arxiv: 2605.17682 · v1 · pith:3Y73Y6ETnew · submitted 2026-05-17 · 💻 cs.CV

Pith reviewed 2026-05-20 13:17 UTC · model grok-4.3

classification 💻 cs.CV

keywords 4D Gaussian primitives · occupancy forecasting · motion planning · autonomous driving · non-autoregressive · semantic occupancy · continuous-time dynamics · world model

The pith

GEM represents driving scenes as continuous 4D Gaussian primitives that can be queried directly at any future time for occupancy forecasting and motion planning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that driving scenes can be modeled as explicit continuous 4D Gaussian primitives whose spatial geometry, temporal support, and motion are decoupled and learned independently. This replaces the common practice of discretizing scenes into latents or tokens and forecasting through fixed-step autoregressive generation, which accumulates errors and lacks timing flexibility. A sympathetic reader would care because the approach promises compact, interpretable representations that support efficient full-horizon forecasts and direct ego-trajectory prediction for planning without sequential rollout. The method splats queried Gaussians into semantic occupancy volumes at arbitrary timestamps.

Core claim

GEM represents driving scenes as explicit continuous 4D Gaussian primitives with learned dynamics. Instead of rolling out future occupancy states step by step, GEM directly queries the Gaussian world representation at arbitrary timestamps and splats the corresponding conditional 3D Gaussians into semantic occupancy volumes. By decoupling spatial geometry, temporal support, and primitive motion, the predicted world becomes easier to inspect as each primitive's evolution can be followed continuously. The same representation supports motion planning by predicting future ego trajectories from the learned Gaussian world.
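The query-then-splat idea can be illustrated with a toy NumPy sketch. It is not the authors' implementation: the linear motion model, the Gaussian temporal weight, the isotropic spatial scale, the density threshold, and every name below (`query_primitives`, `splat_semantic_occupancy`, `forecast`) are assumptions introduced here for illustration.

```python
import numpy as np

def query_primitives(mu0, velocity, t_center, t_sigma, t):
    """Evaluate every primitive at query time t (seconds).

    mu0:      (N, 3) spatial mean at each primitive's temporal center
    velocity: (N, 3) constant velocity (linear motion is an assumption)
    t_center: (N,)   center of each primitive's temporal support
    t_sigma:  (N,)   width of each primitive's temporal support
    Returns conditional 3D means (N, 3) and temporal weights (N,).
    """
    dt = t - t_center
    means_t = mu0 + velocity * dt[:, None]
    weights_t = np.exp(-0.5 * (dt / t_sigma) ** 2)
    return means_t, weights_t

def splat_semantic_occupancy(means_t, weights_t, scales, class_scores,
                             grid_min, voxel_size, grid_shape,
                             density_thresh=0.05, empty_class=0):
    """Accumulate isotropic Gaussians into a semantic voxel grid."""
    # Voxel-center coordinates, shape (V, 3).
    idx = np.stack(np.meshgrid(*[np.arange(n) for n in grid_shape],
                               indexing="ij"), axis=-1)
    centers = (np.asarray(grid_min) + (idx + 0.5) * voxel_size).reshape(-1, 3)
    # Squared distance from every voxel to every primitive, shape (V, N).
    d2 = ((centers[:, None, :] - means_t[None, :, :]) ** 2).sum(axis=-1)
    density = weights_t[None, :] * np.exp(-0.5 * d2 / scales[None, :] ** 2)
    # Density-weighted class scores per voxel, shape (V, C).
    semantics = density @ class_scores
    labels = semantics.argmax(axis=-1)
    # Voxels with too little accumulated density are marked empty.
    labels = np.where(density.sum(axis=-1) > density_thresh, labels, empty_class)
    return labels.reshape(grid_shape)

def forecast(primitives, t, **grid):
    """One-shot forecast at an arbitrary timestamp, with no rollout loop."""
    means_t, weights_t = query_primitives(
        primitives["mu0"], primitives["velocity"],
        primitives["t_center"], primitives["t_sigma"], t)
    return splat_semantic_occupancy(
        means_t, weights_t, primitives["scales"],
        primitives["class_scores"], **grid)
```

In this sketch a forecast at `t = 1.5` costs the same as one at `t = 2.0`, and neither depends on an earlier predicted frame; that independence is the property the core claim relies on.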

What carries the argument

Explicit continuous 4D Gaussian primitives with decoupled spatial geometry, temporal support, and primitive motion, which enable direct querying at arbitrary timestamps followed by splatting into occupancy volumes.

If this is right

Forecasting runs non-autoregressively over the full horizon without stepwise error accumulation.
Arbitrary-timestamp querying supplies temporal flexibility absent from fixed-step models.
Individual primitive evolution can be inspected continuously for interpretability.
The Gaussian world directly yields ego-trajectory predictions for motion planning.
The representation stays compact while achieving state-of-the-art occupancy forecasting performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The explicit decoupling of geometry, time support, and motion may simplify adaptation to new sensor configurations or environments not seen in training.
Tracking separate primitives could provide a natural way to analyze and correct specific prediction failures such as missed moving objects.
The continuous-time formulation might extend naturally to tasks requiring irregular or event-driven timing, such as reactive collision avoidance.

Load-bearing premise

Driving scenes can be compactly and accurately captured by a set of explicit continuous 4D Gaussian primitives whose spatial geometry, temporal support, and primitive motion can be decoupled and learned independently without losing critical scene details.
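For orientation, the generic identity behind "conditional 3D Gaussians" is the standard conditioning rule for a multivariate Gaussian. The block below states it for a single space-time primitive and then shows one possible decoupled parameterization. The symbols are notation introduced here, and the decoupled form (constant velocity $v$, time-independent spatial covariance) is an assumption; this summary does not state GEM's exact parameterization.

```latex
% Full 4D Gaussian over (x, t), with x in R^3 and t in R:
\mu = \begin{pmatrix} \mu_x \\ \mu_t \end{pmatrix}, \qquad
\Sigma = \begin{pmatrix} \Sigma_{xx} & \Sigma_{xt} \\ \Sigma_{xt}^{\top} & \sigma_t^{2} \end{pmatrix}

% Conditioning on a query time t gives a 3D Gaussian and a temporal weight:
\mu_{x \mid t} = \mu_x + \Sigma_{xt}\,\sigma_t^{-2}\,(t - \mu_t), \qquad
\Sigma_{x \mid t} = \Sigma_{xx} - \sigma_t^{-2}\,\Sigma_{xt}\Sigma_{xt}^{\top}, \qquad
w(t) = \exp\!\left(-\frac{(t - \mu_t)^{2}}{2\sigma_t^{2}}\right)

% One possible decoupled form (assumption): motion, shape and temporal
% support are separate parameters instead of blocks of a single covariance.
\mu(t) = \mu_x + v\,(t - \mu_t), \qquad
\Sigma(t) = \Sigma_{x}, \qquad
w(t) = \exp\!\left(-\frac{(t - \mu_t)^{2}}{2\sigma_t^{2}}\right)
```

In the full form the implied velocity $\Sigma_{xt}\,\sigma_t^{-2}$ and the conditional shape both depend on the same cross-covariance block, so motion and geometry cannot be adjusted independently. A decoupled form removes that coupling, at the cost of assuming that the three factors can be learned separately without losing scene detail.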

What would settle it

A head-to-head evaluation on real driving datasets showing that long-horizon semantic occupancy forecasts from GEM contain higher error or more visual artifacts than autoregressive baselines, especially when queried at timestamps outside the training interval, would falsify the claimed advantages.

Figures

Figures reproduced from arXiv: 2605.17682 by Cheng Chen, Hao Huang, Saurabh Bagchi.

**Figure 1.** Comparison between discretized occupancy world models and our continuous 4D Gaussian world model. Existing methods typically forecast future occupancy through (a) discrete tokens or (b) volumetric features at fixed timestamps, often requiring sequential rollout. In contrast, our GEM shown in (c) represents the scene as evolving 4D semantic Gaussian primitives that can be directly queried at arbitrary times…

**Figure 2.** Unlike a full 4D covariance shown in (a)…

**Figure 3.** Overall pipeline of GEM. Historical multi-view images and ego states are encoded to refine…

**Figure 4.** Qualitative comparison of future occupancy forecasting at…

**Figure 5.** Arbitrary-timestamp occupancy forecasting of two scenes with GEM. The continuous 4D…

Original abstract

Future 3D semantic occupancy forecasting and motion planning are central to autonomous driving, as they require models to reason about how surrounding scenes evolve and how the ego vehicle should act. Existing occupancy world models commonly discretize scenes into latent embeddings, volumetric features, or quantized tokens, and forecast future states through fixed-step autoregressive generation. This limits temporal flexibility, obscures scene evolution, accumulates errors over long horizons, and poorly matches the continuous-time dynamics of real driving scenes. We propose GEM, a Gaussian Evolution Model for non-autoregressive occupancy world modeling, where driving scenes are represented as explicit continuous 4D Gaussian primitives with learned dynamics. Instead of rolling out future occupancy states step by step, GEM directly queries the Gaussian world representation at arbitrary timestamps and splats the corresponding conditional 3D Gaussians into semantic occupancy volumes. This enables efficient forecasting over the full horizon while retaining a compact and interpretable scene representation. By decoupling spatial geometry, temporal support, and primitive motion, GEM makes the predicted world easier to inspect, as each primitive's evolution can be followed continuously over time. The same representation also supports motion planning by predicting future ego trajectories from the learned Gaussian world. Extensive experiments show that GEM achieves state-of-the-art future semantic occupancy forecasting and strong motion planning performance, while providing flexible temporal querying.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GEM proposes continuous 4D Gaussian primitives for direct-time querying in occupancy forecasting, but the decoupling of geometry, support, and motion leaves open questions about capturing real object interactions.

The core idea here is a non-autoregressive world model that represents driving scenes as explicit 4D Gaussians you can query at any timestamp instead of stepping through discrete future frames. That avoids the usual error accumulation in autoregressive rollouts and gives a more compact, inspectable representation than latent embeddings or tokens. The paper does a reasonable job laying out why this matters for planning, since the same primitives can feed into ego trajectory prediction without separate modules. Decoupling spatial geometry from temporal support and primitive motion is the main technical move, and it looks like a clean way to make evolution traceable per element.

On the experiments side, the abstract claims SOTA forecasting and solid planning results, but the lack of visible baselines, metrics, or ablation details in the summary makes it hard to judge how much the representation actually delivers versus standard methods. The bigger soft spot is whether independent per-primitive dynamics can handle coupled behaviors without extra interaction terms. Real scenes have vehicles and pedestrians whose motions depend on each other, and if the model learns each Gaussian's trajectory in isolation, it could miss those dependencies even if the math for individual primitives is sound. This is not a fatal flaw on paper, but it is the assumption that needs the strongest empirical check.

The work is aimed at researchers building world models for autonomous driving who want alternatives to discrete or autoregressive setups. A reader already working on Gaussian splatting or continuous representations would find the formulation useful to discuss. I would bring it to a reading group to walk through the querying mechanism and see the actual numbers. It deserves peer review because the representational shift is distinct enough to warrant feedback on both the interaction modeling and the forecasting results.

Referee Report

1 major / 2 minor

Summary. The paper proposes GEM, a Gaussian Evolution Model for non-autoregressive occupancy world modeling in autonomous driving. Scenes are represented as explicit continuous 4D Gaussian primitives with learned dynamics; future semantic occupancy is obtained by directly querying the representation at arbitrary timestamps and splatting conditional 3D Gaussians, while the same representation supports ego-trajectory prediction for motion planning. The approach is positioned as overcoming error accumulation and temporal rigidity in autoregressive discrete models while providing interpretability through decoupled primitive evolution.

Significance. If the empirical claims hold, GEM would supply a compact, continuous, and queryable alternative to latent or tokenized occupancy world models, with clear advantages for flexible-horizon forecasting and downstream planning. The explicit primitive representation could also aid inspection and debugging of predicted scene evolution.

major comments (1)

[§3] §3 (Method): The central modeling choice decouples spatial geometry, temporal support, and primitive motion into independently learned components. This decoupling is load-bearing for both the non-autoregressive querying claim and the interpretability argument, yet the manuscript provides no explicit interaction terms or joint optimization between primitives. Real driving scenes contain coupled dynamics (e.g., one vehicle’s braking affecting neighboring trajectories); without an ablation demonstrating that independent dynamics suffice, the forecasting accuracy over long horizons remains at risk.

minor comments (2)

[Abstract] Abstract: The SOTA claim is stated without naming the primary dataset, key metrics, or quantitative margins over baselines; a single sentence summarizing the strongest empirical result would improve readability.
[§4] §4 (Experiments): Tables reporting occupancy forecasting should include per-horizon breakdowns and error bars; the current presentation makes it difficult to assess whether gains are consistent or driven by particular time scales.
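The per-horizon breakdown requested in the second minor comment could be produced along the following lines. This is a hedged sketch, not the paper's evaluation code: the array layout `(scenes, horizons, X, Y, Z)`, the convention that class 0 means empty, and the function names `miou` and `per_horizon_report` are assumptions made for illustration.

```python
import numpy as np

def miou(pred, gt, num_classes, empty_class=0):
    """Mean IoU over non-empty classes that appear in pred or gt."""
    ious = []
    for c in range(num_classes):
        if c == empty_class:
            continue
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue  # class absent from both volumes, skip it
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious)) if ious else float("nan")

def per_horizon_report(preds, gts, horizons, num_classes):
    """Return {horizon: (mean mIoU, std mIoU)} across scenes.

    preds, gts: integer label volumes of shape (S, H, X, Y, Z)
    horizons:   H forecast horizons, for example in seconds
    """
    report = {}
    for h, t in enumerate(horizons):
        scores = np.array([miou(preds[s, h], gts[s, h], num_classes)
                           for s in range(preds.shape[0])])
        report[t] = (float(np.nanmean(scores)), float(np.nanstd(scores)))
    return report
```

Reporting the spread across scenes next to the mean makes it visible whether a gain at one horizon is consistent or driven by a few easy scenes.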

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and insightful comments on our manuscript. We address the major comment regarding the decoupling of components in §3 below, providing clarifications on joint optimization and committing to an added ablation.

Referee: [§3] §3 (Method): The central modeling choice decouples spatial geometry, temporal support, and primitive motion into independently learned components. This decoupling is load-bearing for both the non-autoregressive querying claim and the interpretability argument, yet the manuscript provides no explicit interaction terms or joint optimization between primitives. Real driving scenes contain coupled dynamics (e.g., one vehicle’s braking affecting neighboring trajectories); without an ablation demonstrating that independent dynamics suffice, the forecasting accuracy over long horizons remains at risk.

Authors: We thank the referee for this important observation. While each primitive's motion parameters are indeed learned independently to support continuous-time querying and per-primitive interpretability, all primitives are optimized jointly in an end-to-end manner. The training objective is a loss on the splatted semantic occupancy volumes at queried timestamps, so gradients flow across the entire set of primitives; this enables the model to capture implicit inter-primitive couplings present in the data (e.g., realistic braking or avoidance behaviors emerge from how neighboring primitives co-evolve to minimize occupancy error). Explicit interaction terms were deliberately omitted to preserve the compact, decoupled representation that underpins both non-autoregressive querying and inspection of individual primitive trajectories. We acknowledge that a dedicated ablation would strengthen the long-horizon claim and will add one in the revised manuscript: a comparison against a variant that injects explicit interactions (e.g., via lightweight cross-primitive attention on motion parameters). Our current SOTA forecasting results already indicate that the independent-dynamics formulation suffices in practice, but the new ablation will quantify any accuracy-complexity trade-off.

Revision: yes
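To make the proposed ablation variant concrete, here is one way "lightweight cross-primitive attention on motion parameters" might look. It is purely illustrative: the single-head design, the per-primitive feature matrix `feats`, the projection matrices `Wq`, `Wk`, `Wv`, and the `gate` scalar are hypothetical and do not come from the paper or the rebuttal.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def interaction_velocity(feats, velocity, Wq, Wk, Wv, gate=1.0):
    """Add an attention-based residual to per-primitive velocities.

    feats:    (N, D) per-primitive features (hypothetical input)
    velocity: (N, 3) independently predicted velocities
    Wq, Wk:   (D, Dk) query and key projections
    Wv:       (D, 3)  value projection into velocity space
    gate:     scalar; gate=0 recovers the independent-dynamics model
    """
    q, k = feats @ Wq, feats @ Wk
    attn = softmax(q @ k.T / np.sqrt(Wq.shape[1]), axis=-1)  # (N, N)
    delta = attn @ (feats @ Wv)                              # (N, 3)
    return velocity + gate * delta
```

Because `gate=0` reproduces the independent model exactly, a variant of this kind would let an ablation isolate the contribution of explicit interactions while leaving the rest of the pipeline unchanged.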

Circularity Check

0 steps flagged

No circularity detected in GEM's representation and forecasting approach

The paper introduces an explicit continuous 4D Gaussian primitive representation with learned dynamics for non-autoregressive occupancy forecasting and motion planning. Forecasting is performed by direct querying of the representation at arbitrary timestamps followed by splatting, which follows directly from the continuous-time modeling choice rather than reducing to any fitted parameter or self-referential definition by construction. No equations, self-citations, uniqueness theorems, or ansatzes from prior author work are invoked in a load-bearing way that would force the central claims. The performance results are presented as empirical outcomes from experiments, and the decoupling of geometry, temporal support, and motion is an explicit modeling assumption rather than a tautology. This is a standard case of a self-contained modeling paper with independent content.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on representing scenes via Gaussian primitives whose dynamics are learned from data. Judging from the abstract alone, this introduces modeling assumptions and entities that are not derived from prior literature.

free parameters (1)

learned dynamics of Gaussian primitives
The motion and temporal evolution parameters for each primitive are learned, constituting fitted values that enable the forecasting.

axioms (1)

domain assumption: Driving scenes can be represented as explicit continuous 4D Gaussian primitives with decoupled spatial geometry, temporal support, and primitive motion.
This is the foundational modeling choice invoked throughout the abstract to enable direct timestamp querying.

invented entities (1)

4D Gaussian primitives (no independent evidence)
purpose: To serve as the explicit continuous representation of scene elements that evolve over time.
New postulated representation introduced to replace discrete latent or token-based scene encodings.

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages

[1] Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. VAD: Vectorized scene representation for efficient autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340–8350, 2023.

[2] Penghao Wu, Xiaosong Jia, Li Chen, Junchi Yan, Hongyang Li, and Yu Qiao. Trajectory-guided control prediction for end-to-end autonomous driving: A simple yet strong baseline. Advances in Neural Information Processing Systems, 35:6119–6132, 2022.

[3] Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, Lewei Lu, Xiaosong Jia, Qiang Liu, Jifeng Dai, Yu Qiao, and Hongyang Li. Planning-oriented autonomous driving. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 17853–17862, June 2023.

[4] Shengchao Hu, Li Chen, Penghao Wu, Hongyang Li, Junchi Yan, and Dacheng Tao. ST-P3: End-to-end vision-based autonomous driving via spatial-temporal feature learning. In European Conference on Computer Vision, pages 533–549. Springer, 2022.

[5] Wenzhao Zheng, Weiliang Chen, Yuanhui Huang, Borui Zhang, Yueqi Duan, and Jiwen Lu. OccWorld: Learning a 3d occupancy world model for autonomous driving. In European Conference on Computer Vision, pages 55–72. Springer, 2024.

[6] Tianshuo Xu, Hao Lu, Xu Yan, Yingjie Cai, Bingbing Liu, and Yingcong Chen. Occ-LLM: Enhancing autonomous driving with occupancy-based large language models. In IEEE International Conference on Robotics and Automation, pages 8434–8441. IEEE, 2025.

[7] Ziyang Yan, Wenzhen Dong, Yihua Shao, Yuhang Lu, Haiyang Liu, Jingwen Liu, Haozhe Wang, Zhe Wang, Yan Wang, Fabio Remondino, et al. RenderWorld: World model with self-supervised 3d label. In IEEE International Conference on Robotics and Automation, pages 6063–6070. IEEE, 2025.

[8] Xiang Li, Pengfei Li, Yupeng Zheng, Wei Sun, Yan Wang, and Yilun Chen. Semi-supervised vision-centric 3d occupancy world model for autonomous driving. In International Conference on Learning Representations, 2025.

[9] Chenxu Dang, Haiyan Liu, Jason Bao, Pei An, Xinyue Tang, An Pan, Jie Ma, Bingchuan Sun, and Yan Wang. SparseWorld: A flexible, adaptive, and efficient 4d occupancy world model powered by sparse and dynamic queries. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 3497–3505, 2026.

[10] Xiaoyu Tian, Tao Jiang, Longfei Yun, Yucheng Mao, Huitong Yang, Yue Wang, Yilun Wang, and Hang Zhao. Occ3D: A large-scale 3d occupancy prediction benchmark for autonomous driving. In Advances in Neural Information Processing Systems, volume 36, pages 64318–64330, 2023.

[11] Yu Yang, Jianbiao Mei, Yukai Ma, Siliang Du, Wenqing Chen, Yijie Qian, Yuxiang Feng, and Yong Liu. Driving in the Occupancy World: Vision-centric 4d occupancy forecasting and planning via world models for autonomous driving. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 9327–9335, 2025.

[12] Zhimin Liao, Ping Wei, Ruijie Zhang, Shuaijia Chen, Haoxuan Wang, and Ziyang Ren. I2-World: Intra-inter tokenization for efficient dynamic 4d scene forecasting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 25810–25819, 2025.

[13] Julong Wei, Shanshuai Yuan, Pengfei Li, Qingda Hu, Zhongxue Gan, and Wenchao Ding. OccLLaMA: An occupancy-language-action generative world model for autonomous driving. arXiv preprint arXiv:2409.03272, 2024.

[14] Zeyu Yang, Hongye Yang, Zijie Pan, and Li Zhang. Real-time photorealistic dynamic scene representation and rendering with 4d gaussian splatting. In International Conference on Learning Representations, 2024.

[15] Yuanxing Duan, Fangyin Wei, Qiyu Dai, Yuhang He, Wenzheng Chen, and Baoquan Chen. 4D-Rotor Gaussian Splatting: Towards efficient novel view synthesis for dynamic scenes. In ACM SIGGRAPH, pages 1–11, 2024.

[16] Arnold Caleb Asiimwe and Carl Vondrick. 4d gaussian splatting as a learned dynamical system. arXiv preprint arXiv:2512.19648, 2025.

[17] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4):139–1, 2023.

[18] Yuanhui Huang, Wenzhao Zheng, Yunpeng Zhang, Jie Zhou, and Jiwen Lu. GaussianFormer: Scene as gaussians for vision-based 3d semantic occupancy prediction. In European Conference on Computer Vision, pages 376–393. Springer, 2024.

[19] Haisong Liu, Yao Teng, Tao Lu, Haiguang Wang, and Limin Wang. SparseBEV: High-performance sparse 3d object detection from multi-camera videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 18580–18590, 2023.

[20] Yuanhui Huang, Amonnut Thammatadatrakoon, Wenzhao Zheng, Yunpeng Zhang, Dalong Du, and Jiwen Lu. GaussianFormer-2: Probabilistic gaussian superposition for efficient 3d occupancy prediction. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 27477–27486, 2025.

[21] Maxim Berman, Amal Rannen Triki, and Matthew B Blaschko. The lovász-softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 4413–4421, 2018.

[22] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 11621–11631, 2020.

[23] Tuo Feng, Wenguan Wang, and Yi Yang. Gaussian-based World Model: Gaussian priors for voxel-based occupancy prediction and future motion prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 25239–25249, October 2025.

[24] Yuanhui Huang, Wenzhao Zheng, Yunpeng Zhang, Jie Zhou, and Jiwen Lu. Tri-perspective view for vision-based 3d semantic occupancy prediction. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 9223–9232, 2023.

[25] Yi Wei, Linqing Zhao, Wenzhao Zheng, Zheng Zhu, Jie Zhou, and Jiwen Lu. SurroundOcc: Multi-camera 3d occupancy prediction for autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21729–21740, 2023.

[26] Yunpeng Zhang, Zheng Zhu, and Dalong Du. OccFormer: Dual-path transformer for vision-based 3d semantic occupancy prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9433–9443, 2023.

[27] Zhangchen Ye, Tao Jiang, Chenfeng Xu, Yiming Li, and Hang Zhao. CVT-Occ: Cost volume temporal fusion for 3d occupancy prediction. In European Conference on Computer Vision, pages 381–397. Springer, 2024.

[28] Zhimin Liao, Ping Wei, Shuaijia Chen, Haoxuan Wang, and Ziyang Ren. STCOcc: Sparse spatial-temporal cascade renovation for 3d occupancy and scene flow prediction. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1516–1526, June 2025.

[29] Dubing Chen, Jin Fang, Wencheng Han, Xinjing Cheng, Junbo Yin, Chengzhong Xu, Fahad Shahbaz Khan, and Jianbing Shen. ALOcc: Adaptive lifting-based 3d semantic occupancy and cost volume-based flow predictions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4156–4166, 2025.

[30] Sicheng Zuo, Wenzhao Zheng, Yuanhui Huang, Jie Zhou, and Jiwen Lu. GaussianWorld: Gaussian world model for streaming 3d occupancy prediction. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 6772–6781, 2025.

[31] Wanshui Gan, Fang Liu, Hongbin Xu, Ningkai Mo, and Naoto Yokoya. GaussianOcc: Fully self-supervised and efficient 3d occupancy estimation with gaussian splatting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 28980–28990, 2025.

[32] Simon Boeder, Fabian Gigengack, and Benjamin Risse. GaussianFlowOcc: Sparse and weakly supervised occupancy estimation using gaussian splatting and temporal flow. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 24943–24954, 2025.

[33] Cheng Chen, Hao Huang, and Saurabh Bagchi. Vision-only gaussian splatting for collaborative semantic occupancy prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 2796–2804, 2026.

[34] Jonathon Luiten, Georgios Kopanas, Bastian Leibe, and Deva Ramanan. Dynamic 3D Gaussians: Tracking by persistent dynamic view synthesis. In International Conference on 3D Vision, pages 800–809. IEEE, 2024.

[35] Yunzhi Yan, Haotong Lin, Chenxu Zhou, Weijie Wang, Haiyang Sun, Kun Zhan, Xianpeng Lang, Xiaowei Zhou, and Sida Peng. Street Gaussians: Modeling dynamic urban scenes with gaussian splatting. In European Conference on Computer Vision, pages 156–173. Springer, 2024.

[36] Songen Gu, Wei Yin, Bu Jin, Xiaoyang Guo, Junming Wang, Haodong Li, Qian Zhang, and Xiaoxiao Long. DOME: Taming diffusion model into high-fidelity controllable occupancy world model. arXiv preprint arXiv:2410.10429, 2024.

[37] Yining Shi, Kun Jiang, Qiang Meng, Ke Wang, Jiabao Wang, Wenchao Sun, Tuopu Wen, Mengmeng Yang, and Diange Yang. COME: Adding scene-centric forecasting control to occupancy world model. In Advances in Neural Information Processing Systems, 2026.

[38] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.

A Related Work

3D occupancy prediction. 3D occupancy prediction [24] has become an important scene representation for autonomous driving because it provides dense geometric and semantic understanding beyond object-level ...

[1] [1]

V AD: Vectorized scene representation for efficient autonomous driving

Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. V AD: Vectorized scene representation for efficient autonomous driving. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340–8350, 2023

work page 2023

[2] [2]

Trajectory-guided control prediction for end-to-end autonomous driving: A simple yet strong baseline.Advances in Neural Information Processing Systems, 35:6119–6132, 2022

Penghao Wu, Xiaosong Jia, Li Chen, Junchi Yan, Hongyang Li, and Yu Qiao. Trajectory-guided control prediction for end-to-end autonomous driving: A simple yet strong baseline.Advances in Neural Information Processing Systems, 35:6119–6132, 2022

work page 2022

[3] [3]

Planning-oriented autonomous driving

Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, Lewei Lu, Xiaosong Jia, Qiang Liu, Jifeng Dai, Yu Qiao, and Hongyang Li. Planning-oriented autonomous driving. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 17853–17862, June 2023

work page 2023

[4] [4]

ST-P3: End-to-end vision-based autonomous driving via spatial-temporal feature learning

Shengchao Hu, Li Chen, Penghao Wu, Hongyang Li, Junchi Yan, and Dacheng Tao. ST-P3: End-to-end vision-based autonomous driving via spatial-temporal feature learning. InEuropean Conference on Computer Vision, pages 533–549. Springer, 2022

work page 2022

[5] [5]

OccWorld: Learning a 3d occupancy world model for autonomous driving

Wenzhao Zheng, Weiliang Chen, Yuanhui Huang, Borui Zhang, Yueqi Duan, and Jiwen Lu. OccWorld: Learning a 3d occupancy world model for autonomous driving. InEuropean Conference on Computer Vision, pages 55–72. Springer, 2024

work page 2024

[6] [6]

Occ-LLM: Enhancing autonomous driving with occupancy-based large language models

Tianshuo Xu, Hao Lu, Xu Yan, Yingjie Cai, Bingbing Liu, and Yingcong Chen. Occ-LLM: Enhancing autonomous driving with occupancy-based large language models. InIEEE Interna- tional Conference on Robotics and Automation, pages 8434–8441. IEEE, 2025

work page 2025

[7] Ziyang Yan, Wenzhen Dong, Yihua Shao, Yuhang Lu, Haiyang Liu, Jingwen Liu, Haozhe Wang, Zhe Wang, Yan Wang, Fabio Remondino, et al. RenderWorld: World model with self-supervised 3d label. In IEEE International Conference on Robotics and Automation, pages 6063–6070. IEEE, 2025

[8] Xiang Li, Pengfei Li, Yupeng Zheng, Wei Sun, Yan Wang, and Yilun Chen. Semi-supervised vision-centric 3d occupancy world model for autonomous driving. In International Conference on Learning Representations, 2025

[9] Chenxu Dang, Haiyan Liu, Jason Bao, Pei An, Xinyue Tang, An Pan, Jie Ma, Bingchuan Sun, and Yan Wang. SparseWorld: A flexible, adaptive, and efficient 4d occupancy world model powered by sparse and dynamic queries. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 3497–3505, 2026

[10] Xiaoyu Tian, Tao Jiang, Longfei Yun, Yucheng Mao, Huitong Yang, Yue Wang, Yilun Wang, and Hang Zhao. Occ3D: A large-scale 3d occupancy prediction benchmark for autonomous driving. In Advances in Neural Information Processing Systems, volume 36, pages 64318–64330, 2023

[11] Yu Yang, Jianbiao Mei, Yukai Ma, Siliang Du, Wenqing Chen, Yijie Qian, Yuxiang Feng, and Yong Liu. Driving in the Occupancy World: Vision-centric 4d occupancy forecasting and planning via world models for autonomous driving. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 9327–9335, 2025

[12] Zhimin Liao, Ping Wei, Ruijie Zhang, Shuaijia Chen, Haoxuan Wang, and Ziyang Ren. I2-World: Intra-inter tokenization for efficient dynamic 4d scene forecasting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 25810–25819, 2025

[13] Julong Wei, Shanshuai Yuan, Pengfei Li, Qingda Hu, Zhongxue Gan, and Wenchao Ding. OccLLaMA: An occupancy-language-action generative world model for autonomous driving. arXiv preprint arXiv:2409.03272, 2024

[14] Zeyu Yang, Hongye Yang, Zijie Pan, and Li Zhang. Real-time photorealistic dynamic scene representation and rendering with 4d gaussian splatting. In International Conference on Learning Representations, 2024

[15] Yuanxing Duan, Fangyin Wei, Qiyu Dai, Yuhang He, Wenzheng Chen, and Baoquan Chen. 4D-Rotor Gaussian Splatting: Towards efficient novel view synthesis for dynamic scenes. In ACM SIGGRAPH, pages 1–11, 2024

[16] Arnold Caleb Asiimwe and Carl Vondrick. 4d gaussian splatting as a learned dynamical system. arXiv preprint arXiv:2512.19648, 2025

[17] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4):139–1, 2023

[18] Yuanhui Huang, Wenzhao Zheng, Yunpeng Zhang, Jie Zhou, and Jiwen Lu. GaussianFormer: Scene as gaussians for vision-based 3d semantic occupancy prediction. In European Conference on Computer Vision, pages 376–393. Springer, 2024

[19] Haisong Liu, Yao Teng, Tao Lu, Haiguang Wang, and Limin Wang. SparseBEV: High-performance sparse 3d object detection from multi-camera videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 18580–18590, 2023

[20] Yuanhui Huang, Amonnut Thammatadatrakoon, Wenzhao Zheng, Yunpeng Zhang, Dalong Du, and Jiwen Lu. GaussianFormer-2: Probabilistic gaussian superposition for efficient 3d occupancy prediction. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 27477–27486, 2025

[21] Maxim Berman, Amal Rannen Triki, and Matthew B Blaschko. The lovász-softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 4413–4421, 2018

[22] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 11621–11631, 2020

[23] Tuo Feng, Wenguan Wang, and Yi Yang. Gaussian-based World Model: Gaussian priors for voxel-based occupancy prediction and future motion prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 25239–25249, October 2025

[24] Yuanhui Huang, Wenzhao Zheng, Yunpeng Zhang, Jie Zhou, and Jiwen Lu. Tri-perspective view for vision-based 3d semantic occupancy prediction. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 9223–9232, 2023

[25] Yi Wei, Linqing Zhao, Wenzhao Zheng, Zheng Zhu, Jie Zhou, and Jiwen Lu. SurroundOcc: Multi-camera 3d occupancy prediction for autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21729–21740, 2023

[26] Yunpeng Zhang, Zheng Zhu, and Dalong Du. OccFormer: Dual-path transformer for vision-based 3d semantic occupancy prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9433–9443, 2023

[27] Zhangchen Ye, Tao Jiang, Chenfeng Xu, Yiming Li, and Hang Zhao. CVT-Occ: Cost volume temporal fusion for 3d occupancy prediction. In European Conference on Computer Vision, pages 381–397. Springer, 2024

[28] Zhimin Liao, Ping Wei, Shuaijia Chen, Haoxuan Wang, and Ziyang Ren. STCOcc: Sparse spatial-temporal cascade renovation for 3d occupancy and scene flow prediction. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1516–1526, June 2025

[29] Dubing Chen, Jin Fang, Wencheng Han, Xinjing Cheng, Junbo Yin, Chengzhong Xu, Fahad Shahbaz Khan, and Jianbing Shen. ALOcc: Adaptive lifting-based 3d semantic occupancy and cost volume-based flow predictions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4156–4166, 2025

[30] Sicheng Zuo, Wenzhao Zheng, Yuanhui Huang, Jie Zhou, and Jiwen Lu. GaussianWorld: Gaussian world model for streaming 3d occupancy prediction. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 6772–6781, 2025

[31] Wanshui Gan, Fang Liu, Hongbin Xu, Ningkai Mo, and Naoto Yokoya. GaussianOcc: Fully self-supervised and efficient 3d occupancy estimation with gaussian splatting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 28980–28990, 2025

[32] Simon Boeder, Fabian Gigengack, and Benjamin Risse. GaussianFlowOcc: Sparse and weakly supervised occupancy estimation using gaussian splatting and temporal flow. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 24943–24954, 2025

[33] Cheng Chen, Hao Huang, and Saurabh Bagchi. Vision-only gaussian splatting for collaborative semantic occupancy prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 2796–2804, 2026

[34] Jonathon Luiten, Georgios Kopanas, Bastian Leibe, and Deva Ramanan. Dynamic 3D Gaussians: Tracking by persistent dynamic view synthesis. In International Conference on 3D Vision, pages 800–809. IEEE, 2024

[35] Yunzhi Yan, Haotong Lin, Chenxu Zhou, Weijie Wang, Haiyang Sun, Kun Zhan, Xianpeng Lang, Xiaowei Zhou, and Sida Peng. Street Gaussians: Modeling dynamic urban scenes with gaussian splatting. In European Conference on Computer Vision, pages 156–173. Springer, 2024

[36] Songen Gu, Wei Yin, Bu Jin, Xiaoyang Guo, Junming Wang, Haodong Li, Qian Zhang, and Xiaoxiao Long. DOME: Taming diffusion model into high-fidelity controllable occupancy world model. arXiv preprint arXiv:2410.10429, 2024

[37] Yining Shi, Kun Jiang, Qiang Meng, Ke Wang, Jiabao Wang, Wenchao Sun, Tuopu Wen, Mengmeng Yang, and Diange Yang. COME: Adding scene-centric forecasting control to occupancy world model. In Advances in Neural Information Processing Systems, 2026

[38] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019

A Related Work

3D occupancy prediction. 3D occupancy prediction [24] has become an important scene representation for autonomous driving because it provides dense geometric and semantic understanding beyond object-level ...