pith. sign in

arxiv: 2511.22039 · v3 · submitted 2025-11-27 · 💻 cs.CV

SparseWorld-TC: Trajectory-Conditioned Sparse Occupancy World Model

Pith reviewed 2026-05-17 05:22 UTC · model grok-4.3

classification 💻 cs.CV
keywords occupancy forecastingtrajectory conditioningsparse representationtransformernuScenes3D scene predictionworld modelspatiotemporal modeling
0
0 comments X

The pith

A transformer with sparse occupancy predicts future 3D scenes directly from images, bypassing BEV and discrete tokens to reach state-of-the-art results on nuScenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents a method for forecasting how 3D environments will occupy space in the next one to three seconds, conditioned on a chosen vehicle trajectory. It generates predictions end-to-end from raw camera images by maintaining a sparse occupancy structure inside a transformer, rather than first converting data into bird's-eye-view maps or compressing it into a fixed set of discrete tokens. The design aims to let attention mechanisms directly model how objects move and interact over time without losing detail to those intermediate steps. If the approach holds, it suggests that simpler, less constrained representations can support more reliable long-horizon scene understanding for tasks such as motion planning.

Core claim

The central claim is that predicting multi-frame future occupancy in an end-to-end manner from raw image features, using a sparse representation inside a transformer that avoids both bird's-eye-view projection and discrete tokenization from VAEs, allows more effective capture of spatiotemporal dependencies and delivers higher accuracy under arbitrary future trajectory conditioning than prior methods.

What carries the argument

Sparse occupancy representation inside a transformer that ingests raw image features directly and conditions on future trajectories.

If this is right

  • Delivers higher accuracy than existing approaches for 1-3 second occupancy forecasts on nuScenes.
  • Maintains performance when the future trajectory is chosen arbitrarily rather than following the ground-truth path.
  • Avoids information loss associated with discrete tokenization and fixed BEV grids.
  • Supports direct use of raw image features without hand-designed geometric transformations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same sparse-transformer pattern could be tested on longer forecast horizons or additional sensor modalities such as LiDAR.
  • If successful, this style of model might reduce reliance on separate perception modules that produce BEV or object lists before planning.
  • The approach invites experiments that measure whether the transformer implicitly learns the geometric relationships it no longer receives explicitly.

Load-bearing premise

That removing explicit bird's-eye-view geometric priors and discrete token capacity limits will not introduce new representational bottlenecks that the model cannot resolve on its own.

What would settle it

A controlled evaluation on the nuScenes validation set in which the model shows no significant improvement, or outright lower accuracy, than strong VAE-based or BEV-based baselines for 1-3 second occupancy forecasts under the same trajectory inputs.

Figures

Figures reproduced from arXiv: 2511.22039 by Jiayuan Du, Kun Zhan, Qijun Chen, Wenbo Hou, Yiming Zhao, Yong Pan, Zhenglong Guo, Zhihui Hao.

Figure 1
Figure 1. Figure 1: Without VAE codebook and BEV representations, [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Occupancy is modeled as a collection of anchors formed [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: We embed the sensor observations, occupancy priors, and trajectories into feature vectors. These embeddings pass through [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative results of our proposed SparseWorld-TC are presented here. Our method effectively captures both dynamic and static [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Long-term 4D occupancy world model forecasting. [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Forecasting with conditioned trajectories. [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Gaussian splatting reconstruction during training. [PITH_FULL_IMAGE:figures/full_fig_p011_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Future observation forecasting on validation set. [PITH_FULL_IMAGE:figures/full_fig_p011_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Additional qualitative results in the different trajectory conditions. [PITH_FULL_IMAGE:figures/full_fig_p012_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Additional qualitative results of our proposed SparseWorld-TC are presented here. [PITH_FULL_IMAGE:figures/full_fig_p013_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Additional qualitative results of our proposed SparseWorld-TC are presented here. [PITH_FULL_IMAGE:figures/full_fig_p014_11.png] view at source ↗
read the original abstract

This paper introduces a novel architecture for trajectory-conditioned forecasting of future 3D scene occupancy. In contrast to methods that rely on variational autoencoders (VAEs) to generate discrete occupancy tokens, which inherently limit representational capacity, our approach predicts multi-frame future occupancy in an end-to-end manner directly from raw image features. Inspired by the success of attention-based transformer architectures in foundational vision and language models such as GPT and VGGT, we employ a sparse occupancy representation that bypasses the intermediate bird's eye view (BEV) projection and its explicit geometric priors. This design allows the transformer to capture spatiotemporal dependencies more effectively. By avoiding both the finite-capacity constraint of discrete tokenization and the structural limitations of BEV representations, our method achieves state-of-the-art performance on the nuScenes benchmark for 1-3 second occupancy forecasting, outperforming existing approaches by a significant margin. Furthermore, it demonstrates robust scene dynamics understanding, consistently delivering high accuracy under arbitrary future trajectory conditioning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces SparseWorld-TC, a trajectory-conditioned sparse occupancy world model for 3D scene occupancy forecasting. It predicts multi-frame future occupancy end-to-end from raw image features using a transformer on a sparse representation, explicitly avoiding VAE-based discrete tokenization and BEV projection with its geometric priors. The central claims are state-of-the-art performance on nuScenes for 1-3 second forecasting with a significant margin over prior methods, plus robust accuracy under arbitrary future trajectory conditioning.

Significance. If the quantitative results and robustness claims hold after proper validation, the work could meaningfully advance occupancy-based world models for autonomous driving by removing capacity limits from discrete tokens and structural constraints from BEV. The focus on arbitrary trajectory conditioning addresses a practical need, though its value depends on whether the sparse transformer truly learns the required 3D geometry implicitly without new bottlenecks.

major comments (3)
  1. [Abstract] Abstract: The claim of achieving SOTA performance 'by a significant margin' is presented without any numerical results, specific metrics (e.g., mIoU or IoU at 1s/2s/3s), listed baselines, or error bars, which is load-bearing for the empirical contribution and prevents assessment of whether the sparse design actually delivers the asserted gains.
  2. [§3] §3 (Method): The assertion that bypassing BEV projection and discrete tokenization lets the transformer capture spatiotemporal dependencies more effectively rests on the unverified assumption that the model will learn camera geometry, depth, and 3D lifting implicitly from sparse tokens and raw features alone; no ablation or analysis is provided to rule out relocated representational bottlenecks, directly engaging the stress-test concern.
  3. [§4] §4 (Experiments): The robustness claim under 'arbitrary future trajectory conditioning' lacks detail on how trajectories are sampled or conditioned during training and evaluation, and no results are shown for out-of-distribution trajectories that would test whether the sparse representation generalizes beyond the training distribution.
minor comments (2)
  1. [§4] Ensure all compared baselines include their original publication references and implementation details for reproducibility.
  2. [§3] Clarify the exact sparsity mechanism (e.g., token selection criteria or masking strategy) with a diagram or pseudocode in the method section.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment point by point below, indicating where revisions have been made to the manuscript to improve clarity and strengthen the empirical support.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The claim of achieving SOTA performance 'by a significant margin' is presented without any numerical results, specific metrics (e.g., mIoU or IoU at 1s/2s/3s), listed baselines, or error bars, which is load-bearing for the empirical contribution and prevents assessment of whether the sparse design actually delivers the asserted gains.

    Authors: We agree that the abstract would be strengthened by including concrete quantitative results. In the revised manuscript, we have updated the abstract to report the specific mIoU values achieved at the 1s, 2s, and 3s horizons, along with the primary baselines and the observed margins of improvement. This change directly addresses the concern while preserving the abstract's conciseness. revision: yes

  2. Referee: [§3] §3 (Method): The assertion that bypassing BEV projection and discrete tokenization lets the transformer capture spatiotemporal dependencies more effectively rests on the unverified assumption that the model will learn camera geometry, depth, and 3D lifting implicitly from sparse tokens and raw features alone; no ablation or analysis is provided to rule out relocated representational bottlenecks, directly engaging the stress-test concern.

    Authors: This comment correctly identifies a gap in supporting analysis. While the end-to-end performance gains on nuScenes provide indirect evidence that the sparse transformer learns the necessary 3D structure, we acknowledge the absence of targeted ablations. We have added a new subsection with feature visualization and an ablation comparing models with and without auxiliary depth supervision to demonstrate that geometric information is captured implicitly without introducing new bottlenecks. revision: yes

  3. Referee: [§4] §4 (Experiments): The robustness claim under 'arbitrary future trajectory conditioning' lacks detail on how trajectories are sampled or conditioned during training and evaluation, and no results are shown for out-of-distribution trajectories that would test whether the sparse representation generalizes beyond the training distribution.

    Authors: We thank the referee for this observation. Section 3.2 describes the conditioning mechanism, but additional implementation details were indeed warranted. In the revised experiments section, we have expanded the description of trajectory sampling (including the use of ground-truth trajectories mixed with controlled perturbations during training) and added quantitative results on out-of-distribution trajectories featuring higher velocities and sharper turns, confirming that performance remains robust. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper presents an architectural proposal for a sparse occupancy transformer that predicts future 3D occupancy end-to-end from raw image features while bypassing BEV projection and discrete VAE tokenization. No equations, parameter-fitting procedures, or self-referential definitions appear in the provided text. Performance claims on nuScenes are framed as empirical outcomes of the design choice rather than quantities derived by construction from fitted inputs or prior self-citations. The central motivation (attention-based spatiotemporal modeling) draws on external precedents such as GPT and VGGT without load-bearing self-citation chains or uniqueness theorems imported from the authors' own prior work. The derivation therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The implicit assumption that a sparse point representation plus transformer attention suffices for 3D scene dynamics is treated as a domain assumption rather than a derived result.

axioms (1)
  • domain assumption Transformer attention can capture spatiotemporal dependencies in sparse 3D occupancy without explicit geometric priors
    Invoked when the paper states that bypassing BEV allows the transformer to capture dependencies more effectively.

pith-pipeline@v0.9.0 · 5487 in / 1258 out tokens · 23009 ms · 2026-05-17T05:22:08.795160+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Height-Guided Projection Reparameterization for Camera-LiDAR Occupancy

    cs.CV 2026-05 conditional novelty 6.0

    HiPR improves 3D occupancy prediction by adaptively reparameterizing projection sampling ranges using LiDAR height priors instead of fixed uniform pillars.

  2. Height-Guided Projection Reparameterization for Camera-LiDAR Occupancy

    cs.CV 2026-05 unverdicted novelty 6.0

    HiPR improves 3D occupancy prediction by reparameterizing image-to-voxel projections using LiDAR-derived height priors to adapt sampling ranges to scene sparsity and height variations.

Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · cited by 1 Pith paper · 6 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ah- mad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774,

  2. [2]

    Dynamiccity: Large-scale 4d oc- cupancy generation from dynamic scenes.arXiv preprint arXiv:2410.18084, 2024

    Hengwei Bian, Lingdong Kong, Haozhe Xie, Liang Pan, Yu Qiao, and Ziwei Liu. Dynamiccity: Large-scale 4d oc- cupancy generation from dynamic scenes.arXiv preprint arXiv:2410.18084, 2024. 1, 2

  3. [3]

    nuscenes: A mul- timodal dataset for autonomous driving

    Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A mul- timodal dataset for autonomous driving. InCVPR, pages 11621–11631, 2020. 2

  4. [4]

    Occprophet: Pushing efficiency frontier of camera-only 4d occupancy forecasting with observer-forecaster-refiner framework.arXiv preprint arXiv:2502.15180, 2025

    Junliang Chen, Huaiyuan Xu, Yi Wang, and Lap-Pui Chau. Occprophet: Pushing efficiency frontier of camera-only 4d occupancy forecasting with observer-forecaster-refiner framework.arXiv preprint arXiv:2502.15180, 2025. 2

  5. [5]

    Sparseworld: A flexible, adaptive, and efficient 4d occupancy world model powered by sparse and dynamic queries.arXiv preprint arXiv:2510.17482, 2025

    Chenxu Dang, Haiyan Liu, Guangjun Bao, Pei An, Xinyue Tang, Jie Ma, Bingchuan Sun, and Yan Wang. Sparseworld: A flexible, adaptive, and efficient 4d occupancy world model powered by sparse and dynamic queries.arXiv preprint arXiv:2510.17482, 2025. 6

  6. [6]

    Understanding world or predict- ing future? a comprehensive survey of world models.ACM Computing Surveys, 58(3):1–38, 2025

    Jingtao Ding, Yunke Zhang, Yu Shang, Yuheng Zhang, Ze- fang Zong, Jie Feng, Yuan Yuan, Hongyuan Su, Nian Li, Nicholas Sukiennik, et al. Understanding world or predict- ing future? a comprehensive survey of world models.ACM Computing Surveys, 58(3):1–38, 2025. 1

  7. [7]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020. 4

  8. [8]

    Dome: Tam- ing diffusion model into high-fidelity controllable occupancy world model

    Songen Gu, Wei Yin, Bu Jin, Xiaoyang Guo, Junming Wang, Haodong Li, Qian Zhang, and Xiaoxiao Long. Dome: Tam- ing diffusion model into high-fidelity controllable occupancy world model.arXiv preprint arXiv:2410.10429, 2024. 1, 2, 6, 7

  9. [9]

    Fsf-net: Enhance 4d occupancy forecasting with coarse bev scene flow for autonomous driving.arXiv preprint arXiv:2409.15841, 2024

    Erxin Guo, Pei An, You Yang, Qiong Liu, and An-An Liu. Fsf-net: Enhance 4d occupancy forecasting with coarse bev scene flow for autonomous driving.arXiv preprint arXiv:2409.15841, 2024. 2

  10. [10]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InCVPR, pages 770–778, 2016. 4, 6

  11. [11]

    Tri-perspective view for vision-based 3d se- mantic occupancy prediction

    Yuanhui Huang, Wenzhao Zheng, Yunpeng Zhang, Jie Zhou, and Jiwen Lu. Tri-perspective view for vision-based 3d se- mantic occupancy prediction. InCVPR, pages 9223–9232,

  12. [12]

    Available: https://arxiv.org/abs/2408.14197

    Bu Jin, Xiaotao Hu, Yupeng Zheng, Xiaoyang Guo, Qian Zhang, Yao Yao, Diming Zhang, Xiaoxiao Long, Wei Yin, et al. Occvar: Scalable 4d occupancy prediction via next- scale prediction.arXiv preprint arXiv:2408.14197, 2024. 2, 6

  13. [13]

    3d gaussian splatting for real-time radiance field rendering.ACM Trans

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimk ¨uhler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Trans. Graph., 42(4):139–1,

  14. [14]

    Point cloud forecasting as a proxy for 4d occupancy forecasting

    Tarasha Khurana, Peiyun Hu, David Held, and Deva Ra- manan. Point cloud forecasting as a proxy for 4d occupancy forecasting. InCVPR, pages 1116–1124, 2023. 6

  15. [15]

    Auto-Encoding Variational Bayes

    Diederik P Kingma and Max Welling. Auto-encoding varia- tional bayes.arXiv preprint arXiv:1312.6114, 2013. 1, 2

  16. [16]

    3d and 4d world modeling: A survey

    Lingdong Kong, Wesley Yang, Jianbiao Mei, Youquan Liu, Ao Liang, Dekai Zhu, Dongyue Lu, Wei Yin, Xiaotao Hu, Mingkai Jia, et al. 3d and 4d world modeling: A survey. arXiv preprint arXiv:2509.07996, 2025. 1

  17. [17]

    Uniscene: Unified occupancy-centric driving scene generation.arXiv preprint arXiv:2412.05435, 2024

    Bohan Li, Jiazhe Guo, Hongsi Liu, Yingshuang Zou, Yikang Ding, Xiwu Chen, Hu Zhu, Feiyang Tan, Chi Zhang, Tiancai Wang, et al. Uniscene: Unified occupancy-centric driving scene generation.arXiv preprint arXiv:2412.05435, 2024. 6

  18. [18]

    Semi-supervised vision-centric 3d occu- pancy world model for autonomous driving.arXiv preprint arXiv:2502.07309, 2025

    Xiang Li, Pengfei Li, Yupeng Zheng, Wei Sun, Yan Wang, and Yilun Chen. Semi-supervised vision-centric 3d occu- pancy world model for autonomous driving.arXiv preprint arXiv:2502.07309, 2025. 2, 6

  19. [19]

    Fb-occ: 3d occupancy prediction based on forward-backward view transformation,

    Zhiqi Li, Zhiding Yu, David Austin, Mingsheng Fang, Shiyi Lan, Jan Kautz, and Jose M Alvarez. Fb-occ: 3d occupancy prediction based on forward-backward view transformation. arXiv preprint arXiv:2307.01492, 2023. 2, 7

  20. [20]

    Bevformer: learning bird’s-eye-view representation from lidar-camera via spatiotemporal transformers.PAMI, 2024

    Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chong- hao Sima, Tong Lu, Qiao Yu, and Jifeng Dai. Bevformer: learning bird’s-eye-view representation from lidar-camera via spatiotemporal transformers.PAMI, 2024. 4

  21. [21]

    Zhiqi Li, Zhiding Yu, Shiyi Lan, Jiahan Li, Jan Kautz, Tong Lu, and Jose M. Alvarez. Is ego status all you need for open-loop end-to-end autonomous driving? InCVPR, pages 14864–14873, 2024. 7

  22. [22]

    Stcocc: Sparse spatial-temporal cascade reno- vation for 3d occupancy and scene flow prediction

    Zhimin Liao, Ping Wei, Shuaijia Chen, Haoxuan Wang, and Ziyang Ren. Stcocc: Sparse spatial-temporal cascade reno- vation for 3d occupancy and scene flow prediction. InCVPR, pages 1516–1526, 2025. 7

  23. [23]

    I 2-world: Intra-inter tok- enization for efficient dynamic 4d scene forecasting.arXiv preprint arXiv:2507.09144, 2025

    Zhimin Liao, Ping Wei, Ruijie Zhang, Shuaijia Chen, Haox- uan Wang, and Ziyang Ren. I 2-world: Intra-inter tok- enization for efficient dynamic 4d scene forecasting.arXiv preprint arXiv:2507.09144, 2025. 1, 2, 6, 7

  24. [24]

    Focal loss for dense object detection

    Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Doll´ar. Focal loss for dense object detection. InICCV, pages 2980–2988, 2017. 6

  25. [25]

    Sparse4D: Multi-view 3D object detection with sparse spatial-temporal fusion.arXiv preprint arXiv:2211.10581, 2022

    Xuewu Lin, Tianwei Lin, Zixiang Pei, Lichao Huang, and Zhizhong Su. Sparse4d: Multi-view 3d object detec- tion with sparse spatial-temporal fusion.arXiv preprint arXiv:2211.10581, 2022. 3, 4

  26. [26]

    Sparsebev: High-performance sparse 3d object de- tection from multi-camera videos

    Haisong Liu, Yao Teng, Tao Lu, Haiguang Wang, and Limin Wang. Sparsebev: High-performance sparse 3d object de- tection from multi-camera videos. InICCV, pages 18580– 18590, 2023. 2, 4

  27. [27]

    Petrv2: A unified framework for 3d perception from multi-camera images

    Yingfei Liu, Junjie Yan, Fan Jia, Shuailin Li, Aqi Gao, Tian- cai Wang, and Xiangyu Zhang. Petrv2: A unified framework for 3d perception from multi-camera images. InICCV, pages 3262–3272, 2023. 3

  28. [28]

    SGDR: Stochastic Gradient Descent with Warm Restarts

    Ilya Loshchilov and Frank Hutter. Sgdr: Stochas- tic gradient descent with warm restarts.arXiv preprint arXiv:1608.03983, 2016. 7

  29. [29]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017. 7 9

  30. [30]

    Infinicube: Unbounded and controllable dynamic 3d driving scene generation with world-guided video models.arXiv preprint arXiv:2412.03934, 2024

    Yifan Lu, Xuanchi Ren, Jiawei Yang, Tianchang Shen, Zhangjie Wu, Jun Gao, Yue Wang, Siheng Chen, Mike Chen, Sanja Fidler, et al. Infinicube: Unbounded and con- trollable dynamic 3d driving scene generation with world- guided video models.arXiv preprint arXiv:2412.03934,

  31. [31]

    Scube: Instant large-scale scene reconstruction using voxsplats.NIPS, 37:97670–97698, 2024

    Xuanchi Ren, Yifan Lu, Hanxue Liang, Zhangjie Wu, Huan Ling, Mike Chen, Sanja Fidler, Francis Williams, and Jiahui Huang. Scube: Instant large-scale scene reconstruction using voxsplats.NIPS, 37:97670–97698, 2024. 8, 1

  32. [32]

    Come: Adding scene-centric forecasting control to occupancy world model.arXiv preprint arXiv:2506.13260,

    Yining Shi, Kun Jiang, Qiang Meng, Ke Wang, Jiabao Wang, Wenchao Sun, Tuopu Wen, Mengmeng Yang, and Diange Yang. Come: Adding scene-centric forecasting control to occupancy world model.arXiv preprint arXiv:2506.13260,

  33. [33]

    DINOv3

    Oriane Sim ´eoni, Huy V V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Micha ¨el Ramamonjisoa, et al. Dinov3.arXiv preprint arXiv:2508.10104, 2025. 7

  34. [34]

    Sparsedrive: End-to-end au- tonomous driving via sparse scene representation

    Wenchao Sun, Xuewu Lin, Yining Shi, Chuang Zhang, Hao- ran Wu, and Sifa Zheng. Sparsedrive: End-to-end au- tonomous driving via sparse scene representation. InICRA, pages 8795–8801. IEEE, 2025. 2, 4

  35. [35]

    Sparseocc: Re- thinking sparse latent representation for vision-based seman- tic occupancy prediction

    Pin Tang, Zhongdao Wang, Guoqing Wang, Jilai Zheng, Xi- angxuan Ren, Bailan Feng, and Chao Ma. Sparseocc: Re- thinking sparse latent representation for vision-based seman- tic occupancy prediction. InCVPR, pages 15035–15044,

  36. [36]

    Driv- ingforward: Feed-forward 3d gaussian splatting for driving scene reconstruction from flexible surround-view input

    Qijian Tian, Xin Tan, Yuan Xie, and Lizhuang Ma. Driv- ingforward: Feed-forward 3d gaussian splatting for driving scene reconstruction from flexible surround-view input. In AAAI, pages 7374–7382, 2025. 8, 1

  37. [37]

    Occ3d: A large-scale 3d occupancy prediction benchmark for autonomous driving.NIPS, 36:64318–64330, 2023

    Xiaoyu Tian, Tao Jiang, Longfei Yun, Yucheng Mao, Huitong Yang, Yue Wang, Yilun Wang, and Hang Zhao. Occ3d: A large-scale 3d occupancy prediction benchmark for autonomous driving.NIPS, 36:64318–64330, 2023. 1, 2, 6

  38. [38]

    Neural discrete representation learning

    Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. InNIPS, page 6309–6318, Red Hook, NY , USA, 2017. Curran Associates Inc. 1, 2

  39. [39]

    Opus: occupancy prediction using a sparse set.NIPS, 37:119861–119885, 2024

    Jiabao Wang, Zhaojiang Liu, Qiang Meng, Liujiang Yan, Ke Wang, Jie Yang, Wei Liu, Qibin Hou, and Ming-Ming Cheng. Opus: occupancy prediction using a sparse set.NIPS, 37:119861–119885, 2024. 3, 6

  40. [40]

    Vggt: Vi- sual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Vi- sual geometry grounded transformer. InCVPR, pages 5294– 5306, 2025. 1, 2

  41. [41]

    Occsora: 4d occupancy generation models as world simulators for au- tonomous driving

    Lening Wang, Wenzhao Zheng, Yilong Ren, Han Jiang, Zhiyong Cui, Haiyang Yu, and Jiwen Lu. Occsora: 4d occupancy generation models as world simulators for au- tonomous driving.arXiv preprint arXiv:2405.20337, 2024. 1, 2

  42. [42]

    Exploring object-centric temporal modeling for efficient multi-view 3d object detection

    Shihao Wang, Yingfei Liu, Tiancai Wang, Ying Li, and Xi- angyu Zhang. Exploring object-centric temporal modeling for efficient multi-view 3d object detection. InICCV, pages 3621–3631, 2023. 4

  43. [43]

    Detr3d: 3d object detection from multi-view images via 3d-to-2d queries

    Yue Wang, Vitor Campagnolo Guizilini, Tianyuan Zhang, Yilun Wang, Hang Zhao, and Justin Solomon. Detr3d: 3d object detection from multi-view images via 3d-to-2d queries. InCoRL, pages 180–191. PMLR, 2022. 4

  44. [44]

    Uniocc: A unified benchmark for occupancy forecasting and prediction in autonomous driving.arXiv preprint arXiv:2503.24381,

    Yuping Wang, Xiangyu Huang, Xiaokang Sun, Mingxuan Yan, Shuo Xing, Zhengzhong Tu, and Jiachen Li. Uniocc: A unified benchmark for occupancy forecasting and prediction in autonomous driving.arXiv preprint arXiv:2503.24381,

  45. [45]

    Occllama: An occupancy-language-action generative world model for au- tonomous driving.arXiv preprint arXiv:2409.03272, 2024

    Julong Wei, Shanshuai Yuan, Pengfei Li, Qingda Hu, Zhongxue Gan, and Wenchao Ding. Occllama: An occupancy-language-action generative world model for au- tonomous driving.arXiv preprint arXiv:2409.03272, 2024. 2, 6

  46. [46]

    Delta-triplane transformers as occupancy world models.arXiv preprint arXiv:2503.07338,

    Haoran Xu, Peixi Peng, Guang Tan, Yiqian Chang, Yisen Zhao, and Yonghong Tian. Delta-triplane transformers as occupancy world models.arXiv preprint arXiv:2503.07338,

  47. [47]

    Spatiotemporal decoupling for efficient vision-based occupancy forecasting

    Jingyi Xu, Xieyuanli Chen, Junyi Ma, Jiawei Huang, Jintao Xu, Yue Wang, and Ling Pei. Spatiotemporal decoupling for efficient vision-based occupancy forecasting. InCVPR, pages 22338–22347, 2025. 2

  48. [48]

    Occ-llm: Enhancing autonomous driv- ing with occupancy-based large language models.arXiv preprint arXiv:2502.06419, 2025

    Tianshuo Xu, Hao Lu, Xu Yan, Yingjie Cai, Bingbing Liu, and Yingcong Chen. Occ-llm: Enhancing autonomous driv- ing with occupancy-based large language models.arXiv preprint arXiv:2502.06419, 2025. 1, 2, 6

  49. [49]

    Renderworld: World model with self-supervised 3d label.arXiv preprint arXiv:2409.11356,

    Ziyang Yan, Wenzhen Dong, Yihua Shao, Yuhang Lu, Liu Haiyang, Jingwen Liu, Haozhe Wang, Zhe Wang, Yan Wang, Fabio Remondino, et al. Renderworld: World model with self-supervised 3d label.arXiv preprint arXiv:2409.11356,

  50. [50]

    Driving in the occupancy world: Vision-centric 4d occupancy forecast- ing and planning via world models for autonomous driving

    Yu Yang, Jianbiao Mei, Yukai Ma, Siliang Du, Wenqing Chen, Yijie Qian, Yuxiang Feng, and Yong Liu. Driving in the occupancy world: Vision-centric 4d occupancy forecast- ing and planning via world models for autonomous driving. InAAAI, pages 9327–9335, 2025. 2

  51. [51]

    Visual point cloud forecasting enables scalable autonomous driving

    Zetong Yang, Li Chen, Yanan Sun, and Hongyang Li. Visual point cloud forecasting enables scalable autonomous driving. InCVPR, pages 14673–14684, 2024. 6

  52. [52]

    An efficient occupancy world model via decoupled dynamic flow and image-assisted training.arXiv preprint arXiv:2412.13772, 2024

    Haiming Zhang, Ying Xue, Xu Yan, Jiacheng Zhang, We- ichao Qiu, Dongfeng Bai, Bingbing Liu, Shuguang Cui, and Zhen Li. An efficient occupancy world model via decoupled dynamic flow and image-assisted training.arXiv preprint arXiv:2412.13772, 2024. 1, 2, 6, 7

  53. [53]

    Adding conditional control to text-to-image diffusion models

    Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In ICCV, pages 3836–3847, 2023. 2

  54. [54]

    Occworld: Learning a 3d occupancy world model for autonomous driving

    Wenzhao Zheng, Weiliang Chen, Yuanhui Huang, Borui Zhang, Yueqi Duan, and Jiwen Lu. Occworld: Learning a 3d occupancy world model for autonomous driving. InECCV, pages 55–72. Springer, 2024. 1, 2, 6, 7

  55. [55]

    Gaussianad: Gaussian-centric end-to- end autonomous driving

    Wenzhao Zheng, Junjie Wu, Yao Zheng, Sicheng Zuo, Zixun Xie, Longchao Yang, Yong Pan, Zhihui Hao, Peng Jia, Xi- anpeng Lang, et al. Gaussianad: Gaussian-centric end-to- end autonomous driving.arXiv preprint arXiv:2412.10371,

  56. [56]

    TE”, “PE

    2 10 SparseWorld-TC: Trajectory-Conditioned Sparse Occupancy World Model Supplementary Material A. Additional Quantitative Experiments A.1. Ray-level mIoU SparseOcc [35] proposes RayIoU (Ray-level mIoU) to solve the inconsistency penalty along the depth axis raised in traditional voxel-level mIoU criteria. We evaluate our SparseWorld-TC-Large* model with ...