SparseWorld-TC: Trajectory-Conditioned Sparse Occupancy World Model
Pith reviewed 2026-05-17 05:22 UTC · model grok-4.3
The pith
A transformer with sparse occupancy predicts future 3D scenes directly from images, bypassing BEV and discrete tokens to reach state-of-the-art results on nuScenes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that predicting multi-frame future occupancy in an end-to-end manner from raw image features, using a sparse representation inside a transformer that avoids both bird's-eye-view projection and discrete tokenization from VAEs, allows more effective capture of spatiotemporal dependencies and delivers higher accuracy under arbitrary future trajectory conditioning than prior methods.
What carries the argument
Sparse occupancy representation inside a transformer that ingests raw image features directly and conditions on future trajectories.
If this is right
- Delivers higher accuracy than existing approaches for 1-3 second occupancy forecasts on nuScenes.
- Maintains performance when the future trajectory is chosen arbitrarily rather than following the ground-truth path.
- Avoids information loss associated with discrete tokenization and fixed BEV grids.
- Supports direct use of raw image features without hand-designed geometric transformations.
Where Pith is reading between the lines
- The same sparse-transformer pattern could be tested on longer forecast horizons or additional sensor modalities such as LiDAR.
- If successful, this style of model might reduce reliance on separate perception modules that produce BEV or object lists before planning.
- The approach invites experiments that measure whether the transformer implicitly learns the geometric relationships it no longer receives explicitly.
Load-bearing premise
That removing explicit bird's-eye-view geometric priors and discrete token capacity limits will not introduce new representational bottlenecks that the model cannot resolve on its own.
What would settle it
A controlled evaluation on the nuScenes validation set in which the model shows no significant improvement, or outright lower accuracy, than strong VAE-based or BEV-based baselines for 1-3 second occupancy forecasts under the same trajectory inputs.
Figures
read the original abstract
This paper introduces a novel architecture for trajectory-conditioned forecasting of future 3D scene occupancy. In contrast to methods that rely on variational autoencoders (VAEs) to generate discrete occupancy tokens, which inherently limit representational capacity, our approach predicts multi-frame future occupancy in an end-to-end manner directly from raw image features. Inspired by the success of attention-based transformer architectures in foundational vision and language models such as GPT and VGGT, we employ a sparse occupancy representation that bypasses the intermediate bird's eye view (BEV) projection and its explicit geometric priors. This design allows the transformer to capture spatiotemporal dependencies more effectively. By avoiding both the finite-capacity constraint of discrete tokenization and the structural limitations of BEV representations, our method achieves state-of-the-art performance on the nuScenes benchmark for 1-3 second occupancy forecasting, outperforming existing approaches by a significant margin. Furthermore, it demonstrates robust scene dynamics understanding, consistently delivering high accuracy under arbitrary future trajectory conditioning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SparseWorld-TC, a trajectory-conditioned sparse occupancy world model for 3D scene occupancy forecasting. It predicts multi-frame future occupancy end-to-end from raw image features using a transformer on a sparse representation, explicitly avoiding VAE-based discrete tokenization and BEV projection with its geometric priors. The central claims are state-of-the-art performance on nuScenes for 1-3 second forecasting with a significant margin over prior methods, plus robust accuracy under arbitrary future trajectory conditioning.
Significance. If the quantitative results and robustness claims hold after proper validation, the work could meaningfully advance occupancy-based world models for autonomous driving by removing capacity limits from discrete tokens and structural constraints from BEV. The focus on arbitrary trajectory conditioning addresses a practical need, though its value depends on whether the sparse transformer truly learns the required 3D geometry implicitly without new bottlenecks.
major comments (3)
- [Abstract] Abstract: The claim of achieving SOTA performance 'by a significant margin' is presented without any numerical results, specific metrics (e.g., mIoU or IoU at 1s/2s/3s), listed baselines, or error bars, which is load-bearing for the empirical contribution and prevents assessment of whether the sparse design actually delivers the asserted gains.
- [§3] §3 (Method): The assertion that bypassing BEV projection and discrete tokenization lets the transformer capture spatiotemporal dependencies more effectively rests on the unverified assumption that the model will learn camera geometry, depth, and 3D lifting implicitly from sparse tokens and raw features alone; no ablation or analysis is provided to rule out relocated representational bottlenecks, directly engaging the stress-test concern.
- [§4] §4 (Experiments): The robustness claim under 'arbitrary future trajectory conditioning' lacks detail on how trajectories are sampled or conditioned during training and evaluation, and no results are shown for out-of-distribution trajectories that would test whether the sparse representation generalizes beyond the training distribution.
minor comments (2)
- [§4] Ensure all compared baselines include their original publication references and implementation details for reproducibility.
- [§3] Clarify the exact sparsity mechanism (e.g., token selection criteria or masking strategy) with a diagram or pseudocode in the method section.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. We address each major comment point by point below, indicating where revisions have been made to the manuscript to improve clarity and strengthen the empirical support.
read point-by-point responses
-
Referee: [Abstract] Abstract: The claim of achieving SOTA performance 'by a significant margin' is presented without any numerical results, specific metrics (e.g., mIoU or IoU at 1s/2s/3s), listed baselines, or error bars, which is load-bearing for the empirical contribution and prevents assessment of whether the sparse design actually delivers the asserted gains.
Authors: We agree that the abstract would be strengthened by including concrete quantitative results. In the revised manuscript, we have updated the abstract to report the specific mIoU values achieved at the 1s, 2s, and 3s horizons, along with the primary baselines and the observed margins of improvement. This change directly addresses the concern while preserving the abstract's conciseness. revision: yes
-
Referee: [§3] §3 (Method): The assertion that bypassing BEV projection and discrete tokenization lets the transformer capture spatiotemporal dependencies more effectively rests on the unverified assumption that the model will learn camera geometry, depth, and 3D lifting implicitly from sparse tokens and raw features alone; no ablation or analysis is provided to rule out relocated representational bottlenecks, directly engaging the stress-test concern.
Authors: This comment correctly identifies a gap in supporting analysis. While the end-to-end performance gains on nuScenes provide indirect evidence that the sparse transformer learns the necessary 3D structure, we acknowledge the absence of targeted ablations. We have added a new subsection with feature visualization and an ablation comparing models with and without auxiliary depth supervision to demonstrate that geometric information is captured implicitly without introducing new bottlenecks. revision: yes
-
Referee: [§4] §4 (Experiments): The robustness claim under 'arbitrary future trajectory conditioning' lacks detail on how trajectories are sampled or conditioned during training and evaluation, and no results are shown for out-of-distribution trajectories that would test whether the sparse representation generalizes beyond the training distribution.
Authors: We thank the referee for this observation. Section 3.2 describes the conditioning mechanism, but additional implementation details were indeed warranted. In the revised experiments section, we have expanded the description of trajectory sampling (including the use of ground-truth trajectories mixed with controlled perturbations during training) and added quantitative results on out-of-distribution trajectories featuring higher velocities and sharper turns, confirming that performance remains robust. revision: yes
Circularity Check
No significant circularity in derivation chain
full rationale
The paper presents an architectural proposal for a sparse occupancy transformer that predicts future 3D occupancy end-to-end from raw image features while bypassing BEV projection and discrete VAE tokenization. No equations, parameter-fitting procedures, or self-referential definitions appear in the provided text. Performance claims on nuScenes are framed as empirical outcomes of the design choice rather than quantities derived by construction from fitted inputs or prior self-citations. The central motivation (attention-based spatiotemporal modeling) draws on external precedents such as GPT and VGGT without load-bearing self-citation chains or uniqueness theorems imported from the authors' own prior work. The derivation therefore remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Transformer attention can capture spatiotemporal dependencies in sparse 3D occupancy without explicit geometric priors
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AlexanderDuality.leanalexander_duality_circle_linking unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
we employ a sparse occupancy representation that bypasses the intermediate bird's eye view (BEV) projection and its explicit geometric priors... pure attention-based transformer architecture
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
-
Height-Guided Projection Reparameterization for Camera-LiDAR Occupancy
HiPR improves 3D occupancy prediction by adaptively reparameterizing projection sampling ranges using LiDAR height priors instead of fixed uniform pillars.
-
Height-Guided Projection Reparameterization for Camera-LiDAR Occupancy
HiPR improves 3D occupancy prediction by reparameterizing image-to-voxel projections using LiDAR-derived height priors to adapt sampling ranges to scene sparsity and height variations.
Reference graph
Works this paper leans on
-
[1]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ah- mad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774,
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
Hengwei Bian, Lingdong Kong, Haozhe Xie, Liang Pan, Yu Qiao, and Ziwei Liu. Dynamiccity: Large-scale 4d oc- cupancy generation from dynamic scenes.arXiv preprint arXiv:2410.18084, 2024. 1, 2
-
[3]
nuscenes: A mul- timodal dataset for autonomous driving
Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A mul- timodal dataset for autonomous driving. InCVPR, pages 11621–11631, 2020. 2
work page 2020
-
[4]
Junliang Chen, Huaiyuan Xu, Yi Wang, and Lap-Pui Chau. Occprophet: Pushing efficiency frontier of camera-only 4d occupancy forecasting with observer-forecaster-refiner framework.arXiv preprint arXiv:2502.15180, 2025. 2
-
[5]
Chenxu Dang, Haiyan Liu, Guangjun Bao, Pei An, Xinyue Tang, Jie Ma, Bingchuan Sun, and Yan Wang. Sparseworld: A flexible, adaptive, and efficient 4d occupancy world model powered by sparse and dynamic queries.arXiv preprint arXiv:2510.17482, 2025. 6
-
[6]
Jingtao Ding, Yunke Zhang, Yu Shang, Yuheng Zhang, Ze- fang Zong, Jie Feng, Yuan Yuan, Hongyuan Su, Nian Li, Nicholas Sukiennik, et al. Understanding world or predict- ing future? a comprehensive survey of world models.ACM Computing Surveys, 58(3):1–38, 2025. 1
work page 2025
-
[7]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020. 4
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[8]
Dome: Tam- ing diffusion model into high-fidelity controllable occupancy world model
Songen Gu, Wei Yin, Bu Jin, Xiaoyang Guo, Junming Wang, Haodong Li, Qian Zhang, and Xiaoxiao Long. Dome: Tam- ing diffusion model into high-fidelity controllable occupancy world model.arXiv preprint arXiv:2410.10429, 2024. 1, 2, 6, 7
-
[9]
Erxin Guo, Pei An, You Yang, Qiong Liu, and An-An Liu. Fsf-net: Enhance 4d occupancy forecasting with coarse bev scene flow for autonomous driving.arXiv preprint arXiv:2409.15841, 2024. 2
-
[10]
Deep residual learning for image recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InCVPR, pages 770–778, 2016. 4, 6
work page 2016
-
[11]
Tri-perspective view for vision-based 3d se- mantic occupancy prediction
Yuanhui Huang, Wenzhao Zheng, Yunpeng Zhang, Jie Zhou, and Jiwen Lu. Tri-perspective view for vision-based 3d se- mantic occupancy prediction. InCVPR, pages 9223–9232,
-
[12]
Available: https://arxiv.org/abs/2408.14197
Bu Jin, Xiaotao Hu, Yupeng Zheng, Xiaoyang Guo, Qian Zhang, Yao Yao, Diming Zhang, Xiaoxiao Long, Wei Yin, et al. Occvar: Scalable 4d occupancy prediction via next- scale prediction.arXiv preprint arXiv:2408.14197, 2024. 2, 6
-
[13]
3d gaussian splatting for real-time radiance field rendering.ACM Trans
Bernhard Kerbl, Georgios Kopanas, Thomas Leimk ¨uhler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Trans. Graph., 42(4):139–1,
-
[14]
Point cloud forecasting as a proxy for 4d occupancy forecasting
Tarasha Khurana, Peiyun Hu, David Held, and Deva Ra- manan. Point cloud forecasting as a proxy for 4d occupancy forecasting. InCVPR, pages 1116–1124, 2023. 6
work page 2023
-
[15]
Auto-Encoding Variational Bayes
Diederik P Kingma and Max Welling. Auto-encoding varia- tional bayes.arXiv preprint arXiv:1312.6114, 2013. 1, 2
work page internal anchor Pith review Pith/arXiv arXiv 2013
-
[16]
3d and 4d world modeling: A survey
Lingdong Kong, Wesley Yang, Jianbiao Mei, Youquan Liu, Ao Liang, Dekai Zhu, Dongyue Lu, Wei Yin, Xiaotao Hu, Mingkai Jia, et al. 3d and 4d world modeling: A survey. arXiv preprint arXiv:2509.07996, 2025. 1
-
[17]
Uniscene: Unified occupancy-centric driving scene generation.arXiv preprint arXiv:2412.05435, 2024
Bohan Li, Jiazhe Guo, Hongsi Liu, Yingshuang Zou, Yikang Ding, Xiwu Chen, Hu Zhu, Feiyang Tan, Chi Zhang, Tiancai Wang, et al. Uniscene: Unified occupancy-centric driving scene generation.arXiv preprint arXiv:2412.05435, 2024. 6
-
[18]
Xiang Li, Pengfei Li, Yupeng Zheng, Wei Sun, Yan Wang, and Yilun Chen. Semi-supervised vision-centric 3d occu- pancy world model for autonomous driving.arXiv preprint arXiv:2502.07309, 2025. 2, 6
-
[19]
Fb-occ: 3d occupancy prediction based on forward-backward view transformation,
Zhiqi Li, Zhiding Yu, David Austin, Mingsheng Fang, Shiyi Lan, Jan Kautz, and Jose M Alvarez. Fb-occ: 3d occupancy prediction based on forward-backward view transformation. arXiv preprint arXiv:2307.01492, 2023. 2, 7
-
[20]
Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chong- hao Sima, Tong Lu, Qiao Yu, and Jifeng Dai. Bevformer: learning bird’s-eye-view representation from lidar-camera via spatiotemporal transformers.PAMI, 2024. 4
work page 2024
-
[21]
Zhiqi Li, Zhiding Yu, Shiyi Lan, Jiahan Li, Jan Kautz, Tong Lu, and Jose M. Alvarez. Is ego status all you need for open-loop end-to-end autonomous driving? InCVPR, pages 14864–14873, 2024. 7
work page 2024
-
[22]
Stcocc: Sparse spatial-temporal cascade reno- vation for 3d occupancy and scene flow prediction
Zhimin Liao, Ping Wei, Shuaijia Chen, Haoxuan Wang, and Ziyang Ren. Stcocc: Sparse spatial-temporal cascade reno- vation for 3d occupancy and scene flow prediction. InCVPR, pages 1516–1526, 2025. 7
work page 2025
-
[23]
Zhimin Liao, Ping Wei, Ruijie Zhang, Shuaijia Chen, Haox- uan Wang, and Ziyang Ren. I 2-world: Intra-inter tok- enization for efficient dynamic 4d scene forecasting.arXiv preprint arXiv:2507.09144, 2025. 1, 2, 6, 7
-
[24]
Focal loss for dense object detection
Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Doll´ar. Focal loss for dense object detection. InICCV, pages 2980–2988, 2017. 6
work page 2017
-
[25]
Xuewu Lin, Tianwei Lin, Zixiang Pei, Lichao Huang, and Zhizhong Su. Sparse4d: Multi-view 3d object detec- tion with sparse spatial-temporal fusion.arXiv preprint arXiv:2211.10581, 2022. 3, 4
-
[26]
Sparsebev: High-performance sparse 3d object de- tection from multi-camera videos
Haisong Liu, Yao Teng, Tao Lu, Haiguang Wang, and Limin Wang. Sparsebev: High-performance sparse 3d object de- tection from multi-camera videos. InICCV, pages 18580– 18590, 2023. 2, 4
work page 2023
-
[27]
Petrv2: A unified framework for 3d perception from multi-camera images
Yingfei Liu, Junjie Yan, Fan Jia, Shuailin Li, Aqi Gao, Tian- cai Wang, and Xiangyu Zhang. Petrv2: A unified framework for 3d perception from multi-camera images. InICCV, pages 3262–3272, 2023. 3
work page 2023
-
[28]
SGDR: Stochastic Gradient Descent with Warm Restarts
Ilya Loshchilov and Frank Hutter. Sgdr: Stochas- tic gradient descent with warm restarts.arXiv preprint arXiv:1608.03983, 2016. 7
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[29]
Decoupled Weight Decay Regularization
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017. 7 9
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[30]
Yifan Lu, Xuanchi Ren, Jiawei Yang, Tianchang Shen, Zhangjie Wu, Jun Gao, Yue Wang, Siheng Chen, Mike Chen, Sanja Fidler, et al. Infinicube: Unbounded and con- trollable dynamic 3d driving scene generation with world- guided video models.arXiv preprint arXiv:2412.03934,
-
[31]
Scube: Instant large-scale scene reconstruction using voxsplats.NIPS, 37:97670–97698, 2024
Xuanchi Ren, Yifan Lu, Hanxue Liang, Zhangjie Wu, Huan Ling, Mike Chen, Sanja Fidler, Francis Williams, and Jiahui Huang. Scube: Instant large-scale scene reconstruction using voxsplats.NIPS, 37:97670–97698, 2024. 8, 1
work page 2024
-
[32]
Yining Shi, Kun Jiang, Qiang Meng, Ke Wang, Jiabao Wang, Wenchao Sun, Tuopu Wen, Mengmeng Yang, and Diange Yang. Come: Adding scene-centric forecasting control to occupancy world model.arXiv preprint arXiv:2506.13260,
-
[33]
Oriane Sim ´eoni, Huy V V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Micha ¨el Ramamonjisoa, et al. Dinov3.arXiv preprint arXiv:2508.10104, 2025. 7
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[34]
Sparsedrive: End-to-end au- tonomous driving via sparse scene representation
Wenchao Sun, Xuewu Lin, Yining Shi, Chuang Zhang, Hao- ran Wu, and Sifa Zheng. Sparsedrive: End-to-end au- tonomous driving via sparse scene representation. InICRA, pages 8795–8801. IEEE, 2025. 2, 4
work page 2025
-
[35]
Pin Tang, Zhongdao Wang, Guoqing Wang, Jilai Zheng, Xi- angxuan Ren, Bailan Feng, and Chao Ma. Sparseocc: Re- thinking sparse latent representation for vision-based seman- tic occupancy prediction. InCVPR, pages 15035–15044,
-
[36]
Qijian Tian, Xin Tan, Yuan Xie, and Lizhuang Ma. Driv- ingforward: Feed-forward 3d gaussian splatting for driving scene reconstruction from flexible surround-view input. In AAAI, pages 7374–7382, 2025. 8, 1
work page 2025
-
[37]
Xiaoyu Tian, Tao Jiang, Longfei Yun, Yucheng Mao, Huitong Yang, Yue Wang, Yilun Wang, and Hang Zhao. Occ3d: A large-scale 3d occupancy prediction benchmark for autonomous driving.NIPS, 36:64318–64330, 2023. 1, 2, 6
work page 2023
-
[38]
Neural discrete representation learning
Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. InNIPS, page 6309–6318, Red Hook, NY , USA, 2017. Curran Associates Inc. 1, 2
work page 2017
-
[39]
Opus: occupancy prediction using a sparse set.NIPS, 37:119861–119885, 2024
Jiabao Wang, Zhaojiang Liu, Qiang Meng, Liujiang Yan, Ke Wang, Jie Yang, Wei Liu, Qibin Hou, and Ming-Ming Cheng. Opus: occupancy prediction using a sparse set.NIPS, 37:119861–119885, 2024. 3, 6
work page 2024
-
[40]
Vggt: Vi- sual geometry grounded transformer
Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Vi- sual geometry grounded transformer. InCVPR, pages 5294– 5306, 2025. 1, 2
work page 2025
-
[41]
Occsora: 4d occupancy generation models as world simulators for au- tonomous driving
Lening Wang, Wenzhao Zheng, Yilong Ren, Han Jiang, Zhiyong Cui, Haiyang Yu, and Jiwen Lu. Occsora: 4d occupancy generation models as world simulators for au- tonomous driving.arXiv preprint arXiv:2405.20337, 2024. 1, 2
-
[42]
Exploring object-centric temporal modeling for efficient multi-view 3d object detection
Shihao Wang, Yingfei Liu, Tiancai Wang, Ying Li, and Xi- angyu Zhang. Exploring object-centric temporal modeling for efficient multi-view 3d object detection. InICCV, pages 3621–3631, 2023. 4
work page 2023
-
[43]
Detr3d: 3d object detection from multi-view images via 3d-to-2d queries
Yue Wang, Vitor Campagnolo Guizilini, Tianyuan Zhang, Yilun Wang, Hang Zhao, and Justin Solomon. Detr3d: 3d object detection from multi-view images via 3d-to-2d queries. InCoRL, pages 180–191. PMLR, 2022. 4
work page 2022
-
[44]
Yuping Wang, Xiangyu Huang, Xiaokang Sun, Mingxuan Yan, Shuo Xing, Zhengzhong Tu, and Jiachen Li. Uniocc: A unified benchmark for occupancy forecasting and prediction in autonomous driving.arXiv preprint arXiv:2503.24381,
-
[45]
Julong Wei, Shanshuai Yuan, Pengfei Li, Qingda Hu, Zhongxue Gan, and Wenchao Ding. Occllama: An occupancy-language-action generative world model for au- tonomous driving.arXiv preprint arXiv:2409.03272, 2024. 2, 6
-
[46]
Delta-triplane transformers as occupancy world models.arXiv preprint arXiv:2503.07338,
Haoran Xu, Peixi Peng, Guang Tan, Yiqian Chang, Yisen Zhao, and Yonghong Tian. Delta-triplane transformers as occupancy world models.arXiv preprint arXiv:2503.07338,
-
[47]
Spatiotemporal decoupling for efficient vision-based occupancy forecasting
Jingyi Xu, Xieyuanli Chen, Junyi Ma, Jiawei Huang, Jintao Xu, Yue Wang, and Ling Pei. Spatiotemporal decoupling for efficient vision-based occupancy forecasting. InCVPR, pages 22338–22347, 2025. 2
work page 2025
-
[48]
Tianshuo Xu, Hao Lu, Xu Yan, Yingjie Cai, Bingbing Liu, and Yingcong Chen. Occ-llm: Enhancing autonomous driv- ing with occupancy-based large language models.arXiv preprint arXiv:2502.06419, 2025. 1, 2, 6
-
[49]
Renderworld: World model with self-supervised 3d label.arXiv preprint arXiv:2409.11356,
Ziyang Yan, Wenzhen Dong, Yihua Shao, Yuhang Lu, Liu Haiyang, Jingwen Liu, Haozhe Wang, Zhe Wang, Yan Wang, Fabio Remondino, et al. Renderworld: World model with self-supervised 3d label.arXiv preprint arXiv:2409.11356,
-
[50]
Yu Yang, Jianbiao Mei, Yukai Ma, Siliang Du, Wenqing Chen, Yijie Qian, Yuxiang Feng, and Yong Liu. Driving in the occupancy world: Vision-centric 4d occupancy forecast- ing and planning via world models for autonomous driving. InAAAI, pages 9327–9335, 2025. 2
work page 2025
-
[51]
Visual point cloud forecasting enables scalable autonomous driving
Zetong Yang, Li Chen, Yanan Sun, and Hongyang Li. Visual point cloud forecasting enables scalable autonomous driving. InCVPR, pages 14673–14684, 2024. 6
work page 2024
-
[52]
Haiming Zhang, Ying Xue, Xu Yan, Jiacheng Zhang, We- ichao Qiu, Dongfeng Bai, Bingbing Liu, Shuguang Cui, and Zhen Li. An efficient occupancy world model via decoupled dynamic flow and image-assisted training.arXiv preprint arXiv:2412.13772, 2024. 1, 2, 6, 7
-
[53]
Adding conditional control to text-to-image diffusion models
Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In ICCV, pages 3836–3847, 2023. 2
work page 2023
-
[54]
Occworld: Learning a 3d occupancy world model for autonomous driving
Wenzhao Zheng, Weiliang Chen, Yuanhui Huang, Borui Zhang, Yueqi Duan, and Jiwen Lu. Occworld: Learning a 3d occupancy world model for autonomous driving. InECCV, pages 55–72. Springer, 2024. 1, 2, 6, 7
work page 2024
-
[55]
Gaussianad: Gaussian-centric end-to- end autonomous driving
Wenzhao Zheng, Junjie Wu, Yao Zheng, Sicheng Zuo, Zixun Xie, Longchao Yang, Yong Pan, Zhihui Hao, Peng Jia, Xi- anpeng Lang, et al. Gaussianad: Gaussian-centric end-to- end autonomous driving.arXiv preprint arXiv:2412.10371,
-
[56]
2 10 SparseWorld-TC: Trajectory-Conditioned Sparse Occupancy World Model Supplementary Material A. Additional Quantitative Experiments A.1. Ray-level mIoU SparseOcc [35] proposes RayIoU (Ray-level mIoU) to solve the inconsistency penalty along the depth axis raised in traditional voxel-level mIoU criteria. We evaluate our SparseWorld-TC-Large* model with ...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.