ShelfGaussian: Shelf-Supervised Open-Vocabulary Gaussian-based 3D Scene Understanding

James Hays; Lingjun Zhao; Lu Gan; Yandong Luo

arxiv: 2512.03370 · v3 · submitted 2025-12-03 · 💻 cs.CV


Pith reviewed 2026-05-17 03:19 UTC · model grok-4.3

classification 💻 cs.CV

keywords: Gaussian representation · open-vocabulary 3D understanding · zero-shot semantic occupancy · multi-modal supervision · vision foundation models · 3D scene understanding · shelf-supervised learning


The pith

ShelfGaussian achieves open-vocabulary 3D scene understanding by supervising Gaussian representations with off-the-shelf 2D vision foundation models at both image and scene levels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ShelfGaussian to model 3D scenes with Gaussians that can handle open-vocabulary semantics without requiring 3D annotations. It combines a Multi-Modal Gaussian Transformer for querying features across sensor types with a Shelf-Supervised Learning Paradigm that aligns representations at both 2D image and 3D scene scales using existing vision models. This setup targets the shortcomings of closed-set labeled Gaussians and purely 2D self-supervised approaches by preserving geometry while enabling flexible semantic understanding. A sympathetic reader would care because it could reduce dependence on costly 3D labels for tasks like occupancy prediction in robotics and autonomous systems. Experiments claim state-of-the-art zero-shot results on Occ3D-nuScenes along with real-world tests on an unmanned ground vehicle.

Core claim

ShelfGaussian is an open-vocabulary multi-modal Gaussian-based 3D scene understanding framework supervised by off-the-shelf vision foundation models. It introduces a Multi-Modal Gaussian Transformer that allows Gaussians to query diverse sensor features and a Shelf-Supervised Learning Paradigm that jointly optimizes the Gaussians at 2D image and 3D scene levels, resulting in superior geometry and semantics compared to prior closed-set or camera-only methods.

What carries the argument

The Multi-Modal Gaussian Transformer combined with the Shelf-Supervised Learning Paradigm, which together let Gaussians draw multi-modal features and receive joint 2D-3D supervision from vision foundation models.

Load-bearing premise

Features extracted from off-the-shelf 2D vision foundation models transfer reliably enough to produce accurate 3D geometry and open-vocabulary semantics when the Gaussians are optimized jointly at image and scene levels.

What would settle it

A benchmark run on Occ3D-nuScenes showing that ShelfGaussian does not exceed prior zero-shot semantic occupancy methods in accuracy or geometry quality would undermine the central performance claim.
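As a rough illustration of how such a comparison is typically scored, the sketch below computes a geometric IoU (occupied vs. free) and a semantic mIoU over flattened voxel labels. The label convention (a single `free_label` id) and the function names are assumptions made for illustration; this is not the Occ3D evaluation code.

```python
from collections import defaultdict


def occupancy_iou(pred, gt, free_label=0):
    """Binary IoU of occupied voxels (any label other than free_label)."""
    inter = sum(1 for p, g in zip(pred, gt) if p != free_label and g != free_label)
    union = sum(1 for p, g in zip(pred, gt) if p != free_label or g != free_label)
    return inter / union if union else float("nan")


def semantic_miou(pred, gt, free_label=0):
    """Mean per-class IoU over the semantic classes present in the ground truth."""
    inter = defaultdict(int)
    union = defaultdict(int)
    for p, g in zip(pred, gt):
        if p == g:
            inter[g] += 1
            union[g] += 1
        else:
            # A mismatch adds one voxel to the union of both classes involved.
            union[p] += 1
            union[g] += 1
    classes = sorted({g for g in gt if g != free_label})
    if not classes:
        return float("nan")
    return sum(inter[c] / union[c] for c in classes) / len(classes)


# Toy example with labels 0 = free, 1 and 2 = semantic classes (hypothetical).
pred = [0, 1, 1, 2, 2, 0]
gt = [0, 1, 2, 2, 2, 1]
iou = occupancy_iou(pred, gt)
miou = semantic_miou(pred, gt)
```

A zero-shot method would produce `pred` from open-vocabulary queries rather than from a trained classifier head, but the scoring step is the same.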

Figures

Figures reproduced from arXiv: 2512.03370 by James Hays, Lingjun Zhao, Lu Gan, Yandong Luo.

**Figure 1.** We propose ShelfGaussian for Gaussian-based 3D scene understanding under open-vocabulary, multi-modal and multi-task scenario. (a) Our model is able to assist a robot in predicting open-set occupancy from any sensor modalities with the help of VFMs. (b) Compared to existing Gaussian-based methods, ours provides a generalizable solution for 3D scene understanding.

**Figure 2.** Overview of ShelfGaussian. ShelfGaussian employs off-the-shelf VFMs to extract depth and DINO feature maps from multiview images, and trains LiDAR and radar backbones to extract related features. These are then fed into our multi-modal Gaussian transformer to predict sparse sets of 3D Gaussians to represent the scene. During training, Gaussians are rendered into camera views for VFM-based 2D supervision,…

**Figure 3.** Overview of DINO-Driven Pseudo Labeling Engine. We teleoperate our UGV through urban scenarios to collect paired image and point cloud sequences along with trajectories from onboard camera and LiDAR. LiDAR points are then projected to image and decorated with pixel-wise DINO features. These points are aggregated and voxelized at a customized resolution to be 3D pseudo labels. The final predicted Gaussians …
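The projection and voxelization steps in the Figure 3 caption can be sketched for a single frame as follows. This is a minimal illustration under stated assumptions: points are already expressed in the camera frame (x right, y down, z forward), the camera is an ideal pinhole, and features inside a voxel are averaged. The function names, the averaging rule, and the toy feature map are hypothetical; the caption does not specify how the authors aggregate features or handle multi-frame accumulation along the trajectory.

```python
import math
from collections import defaultdict


def project_point(point, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame point (x right, y down, z forward)."""
    x, y, z = point
    if z <= 0.0:
        return None  # behind the camera plane
    return math.floor(fx * x / z + cx), math.floor(fy * y / z + cy)


def build_pseudo_labels(points, feature_map, intrinsics, voxel_size):
    """Average the pixel features of all projected points that fall in each voxel."""
    fx, fy, cx, cy = intrinsics
    height, width = len(feature_map), len(feature_map[0])
    sums = {}
    counts = defaultdict(int)
    for point in points:
        pixel = project_point(point, fx, fy, cx, cy)
        if pixel is None:
            continue
        u, v = pixel
        if not (0 <= u < width and 0 <= v < height):
            continue  # projects outside the image
        feat = feature_map[v][u]
        key = tuple(math.floor(c / voxel_size) for c in point)
        acc = sums.setdefault(key, [0.0] * len(feat))
        for i, f in enumerate(feat):
            acc[i] += f
        counts[key] += 1
    return {key: [s / counts[key] for s in acc] for key, acc in sums.items()}


# Toy example: a 1x2 "feature map" with 2-dim features standing in for DINO features.
feature_map = [[[1.0, 0.0], [0.0, 1.0]]]
points = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 0.0, -1.0), (9.0, 0.0, 1.0)]
labels = build_pseudo_labels(points, feature_map, (1.0, 1.0, 0.0, 0.0), voxel_size=2.0)
```

In the toy run, the first two points land in the same voxel and their features are averaged; the third is behind the camera and the fourth projects outside the image, so both are skipped.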

**Figure 4.** Dual-CSR Structure for CUDA-Accelerated Gaussian2Voxel. Gaussian→Tile CSR: index pointers store tile offsets per Gaussian, indices record tile IDs, and values store Gaussian IDs. Tile→Gaussian CSR: index pointers store Gaussian offsets per tile, and indices record Gaussian IDs obtained by sorting and run-length encoding (RLE) tile-Gaussian pairs.
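The two CSR layouts named in the Figure 4 caption can be illustrated on the CPU with a short sketch. It assumes each Gaussian already comes with the list of tile IDs it overlaps; the function name `build_dual_csr` and the input format are hypothetical, and the paper's actual implementation is a CUDA kernel that this sequential version does not reproduce.

```python
from itertools import groupby


def build_dual_csr(tiles_per_gaussian, num_tiles):
    """Build Gaussian->Tile and Tile->Gaussian CSR arrays from per-Gaussian tile lists."""
    # Gaussian -> Tile: indptr[g]..indptr[g + 1] spans the tiles touched by Gaussian g.
    g2t_indptr, g2t_indices, g2t_values = [0], [], []
    for gid, tiles in enumerate(tiles_per_gaussian):
        g2t_indices.extend(tiles)
        g2t_values.extend([gid] * len(tiles))
        g2t_indptr.append(len(g2t_indices))

    # Tile -> Gaussian: sort (tile, gaussian) pairs by tile, then run-length
    # encode the tile column to get the number of Gaussians per tile.
    pairs = sorted(zip(g2t_indices, g2t_values))
    counts = [0] * num_tiles
    for tile, run in groupby(pairs, key=lambda pair: pair[0]):
        counts[tile] = sum(1 for _ in run)
    t2g_indptr = [0]
    for count in counts:
        t2g_indptr.append(t2g_indptr[-1] + count)
    t2g_indices = [gid for _, gid in pairs]
    return (g2t_indptr, g2t_indices, g2t_values), (t2g_indptr, t2g_indices)


# Three Gaussians over four tiles; tile 2 is touched by no Gaussian.
g2t, t2g = build_dual_csr([[0, 1], [1], [1, 3]], num_tiles=4)
tile1_gaussians = t2g[1][t2g[0][1]:t2g[0][2]]
```

With both structures in hand, a per-tile pass can look up its Gaussians through the Tile→Gaussian arrays, while a per-Gaussian pass can look up its tiles through the Gaussian→Tile arrays, without rescanning the pair list.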

**Figure 5.** Qualitative results of ShelfGaussian on nuScenes dataset. The figure demonstrates the predicted semantic occupancy queried by semantic classes in Tab. 1, ground-truth labels from Occ3D [68] and occupancy of open-set queries from ShelfGaussian-LCR model. Best viewed on screen and color bar is given in Tab. 1.
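Figure 5 refers to occupancy "queried by semantic classes" and by open-set queries. One common way to realize such a query, assumed here rather than taken from the paper, is to compare each voxel feature against text embeddings by cosine similarity and keep the best match above a threshold. The embeddings, the threshold, and the function names below are all hypothetical placeholders.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def query_voxels(voxel_feats, text_embeds, threshold=0.5):
    """Return, per voxel, the best-matching query name, or None below the threshold."""
    labels = []
    for feat in voxel_feats:
        best_name, best_sim = None, threshold
        for name, emb in text_embeds.items():
            sim = cosine(feat, emb)
            if sim > best_sim:
                best_name, best_sim = name, sim
        labels.append(best_name)
    return labels


# Toy 2-dim embeddings standing in for text-aligned features (hypothetical values).
text_embeds = {"car": [1.0, 0.0], "road": [0.0, 1.0]}
voxel_feats = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [-1.0, 0.0]]
labels = query_voxels(voxel_feats, text_embeds)
```

Because the query set is just a dictionary of embeddings, adding a novel category at test time means adding one entry rather than retraining a classifier, which is the practical appeal of open-vocabulary querying.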

**Figure 6.** Qualitative results of ShelfGaussian on custom dataset collected by a UGV. The figure shows the rendered depth map, DINO feature map, and occupancy of novel categories from ShelfGaussian-CO model. Best viewed on screen and in color.

Panel labels extracted after this caption: Gaussian-Planner, BEV-Planner, Scene-0557, Scene-0914.

**Figure 7.** Qualitative comparison of BEV-Planner [45] and Gaussian-Planner on nuScenes [7] dataset. Red and cyan lines denote the ground-truth and predicted trajectories, respectively.

Source text extracted after this caption (truncated): Benchmark of G2V Splatting Module. We benchmark our G2V splatting module against other open-source methods [30, 35] in two settings: 18k Gaussians with 1024-dim features and 9k Gaussians with 7…

**Figure 8.** Visualization of the coordinate frames of different sensors and the ego vehicle. Red, green and blue arrows denote the x, y and z axes, respectively.

Source text extracted after this caption (truncated): 6.2. Custom Dataset Collection. By teleoperating our UGV, we collect a custom dataset in common urban scenarios. We choose four scenes: street, park, grassland and garden. We split our dataset into a 90% subset for training and a 10% subset for testing, result…

**Figure 9.** Scene reconstruction results of four urban scenes. The top row shows the completed scenes decorated with DINO features, visualized by mapping PCA components to RGB colors. The bottom row shows the robot trajectories within four urban scenes.

Table extracted after this caption:

| Mod. | 2D Loss | 3D Loss (BCE Loss) | 3D Loss (Feat. Loss) | IoU | mIoU |
| --- | --- | --- | --- | --- | --- |
| C | 1.0 | 1.0 | 1.0 | 58.66 | 17.52 |
|  | 1.0 | 4.0 | 8.0 | 61.38 | 18.56 |
|  | 1.0 | 8.0 | 16.0 | 63.25 | 19.07 |

**Figure 10.** Qualitative results of ShelfGaussian on nuScenes [7] dataset validation split.

Panel labels extracted after this caption: RGB Image, Rendered Depth, Pseudo Depth GT, Rendered Feat., Pseudo Feat. GT, Open-Set Occ. Open-set queries shown: "pedestrian", "road", "sidewalk", "stop sign", "vegetation", "car", "bench".

**Figure 11.** Qualitative results of ShelfGaussian on our custom dataset testing split.

Table extracted after this caption (7.6. Training Efficiency):

| Method | Mod. | Train. Time (h) | Memory (GB) | IoU | mIoU |
| --- | --- | --- | --- | --- | --- |
| GaussTR [35] | C | 22 | 20 | 44.54 | 12.27 |
| ShelfGaussian | C | 25 | 15 | 63.25 | 19.07 |
|  | L | 31 | 28 | 66.10 | 19.34 |
|  | C+R | 31 | 28 | 62.84 | 19.42 |
|  | L+C | 32 | 29 | 69.24 | 21.52 |
|  | L+C+R | 32 | 29 | 69.45 | 21.78 |

Original abstract

We introduce ShelfGaussian, an open-vocabulary multi-modal Gaussian-based 3D scene understanding framework supervised by off-the-shelf vision foundation models (VFMs). Gaussian-based methods have demonstrated superior performance and computational efficiency across a wide range of scene understanding tasks. However, existing methods either model objects as closed-set semantic Gaussians supervised by annotated 3D labels, neglecting their rendering ability, or learn open-set Gaussian representations via purely 2D self-supervision, leading to degraded geometry and limited to camera-only settings. To fully exploit the potential of Gaussians, we propose a Multi-Modal Gaussian Transformer that enables Gaussians to query features from diverse sensor modalities, and a Shelf-Supervised Learning Paradigm that efficiently optimizes Gaussians with VFM features jointly at 2D image and 3D scene levels. We evaluate ShelfGaussian on various perception and planning tasks. Experiments on Occ3D-nuScenes demonstrate its state-of-the-art zero-shot semantic occupancy prediction performance. ShelfGaussian is further evaluated on an unmanned ground vehicle (UGV) to assess its in the-wild performance across diverse urban scenarios. Project website: https://lunarlab-gatech.github.io/ShelfGaussian/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ShelfGaussian uses 2D vision foundation models to supervise open-vocabulary 3D Gaussians via a new transformer and joint 2D/3D optimization, but the geometry accuracy from that transfer still lacks direct checks.


ShelfGaussian takes Gaussian scene representations and adds supervision from off-the-shelf 2D vision foundation models for open-vocabulary 3D understanding. The new elements are the Multi-Modal Gaussian Transformer that lets individual Gaussians query features across sensor types and the Shelf-Supervised Learning Paradigm that optimizes the representation at both the rendered 2D image level and the full 3D scene level at once. This moves past earlier Gaussian work that either required closed-set 3D labels or relied on pure 2D self-supervision that left geometry weak. The approach is straightforward and makes sense for anyone trying to scale 3D perception without fresh 3D annotations. The reported zero-shot results on Occ3D-nuScenes for semantic occupancy and the real UGV runs in urban settings give the method some practical grounding.

The soft spot is the core assumption that 2D VFM features, when used in this joint setup, will reliably produce accurate 3D structure and semantics. The stress-test note is right to flag the lack of direct evidence such as depth consistency metrics or 3D-only ablations; without those it is not yet clear how well the method resolves projection ambiguities or multi-view inconsistencies in complex nuScenes scenes. The abstract states SOTA performance, but the strength of that claim rests on how thoroughly the paper shows the 2D-to-3D transfer actually works.

This paper is for people working on 3D scene understanding in robotics and autonomous systems who already know Gaussian splatting and want to incorporate 2D foundation models. A reader focused on practical open-vocabulary methods would get concrete implementation ideas and benchmark numbers from it. I would bring it to a reading group to talk through the supervision design. It deserves peer review because the combination is new enough and the evaluation uses standard benchmarks, even though more targeted analysis on the geometry transfer would make the central claim more secure.

Referee Report

2 major / 3 minor

Summary. The paper introduces ShelfGaussian, an open-vocabulary multi-modal Gaussian-based 3D scene understanding framework supervised by off-the-shelf vision foundation models (VFMs). It proposes a Multi-Modal Gaussian Transformer enabling Gaussians to query features from diverse sensor modalities and a Shelf-Supervised Learning Paradigm that optimizes Gaussians jointly at 2D image and 3D scene levels. The central claim is state-of-the-art zero-shot semantic occupancy prediction on Occ3D-nuScenes, with additional evaluation on real-world UGV scenarios for in-the-wild performance.

Significance. If the results hold, this work could meaningfully advance open-vocabulary 3D perception by demonstrating effective transfer from 2D VFMs to 3D Gaussian representations without 3D labels, offering efficiency gains over closed-set or purely 2D-supervised methods for tasks like semantic occupancy in robotics and autonomous driving.

major comments (2)

[§4.2] §4.2 and Table 2 (Occ3D-nuScenes results): The SOTA zero-shot semantic occupancy claim rests on the transfer of 2D VFM features via joint image/scene optimization, yet the paper provides no 3D-only ablations or depth consistency metrics to verify that this corrects projection ambiguities and multi-view inconsistencies in nuScenes urban scenes; this is load-bearing for the central claim.
[§3.2] §3.2 (Multi-Modal Gaussian Transformer): The fusion and querying mechanism across modalities is described at a high level, but it is unclear how the transformer handles absent modalities (e.g., no LiDAR in camera-only evaluations), which directly affects the claimed multi-modal advantage and zero-shot generalization.

minor comments (3)

[Abstract] Abstract: 'in the-wild' appears inconsistently as 'in-the-wild' in the main text; standardize terminology.
[Figure 3] Figure 3 and §4.3: The qualitative UGV results would benefit from quantitative metrics (e.g., mIoU or depth error) alongside visuals to strengthen the in-the-wild evaluation.
[Related Work] Related work section: Several recent Gaussian-based open-vocabulary methods are cited, but cross-references to specific ablation baselines (e.g., vs. pure 2D self-supervision) could be more explicit.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the manuscript without altering its core claims.


Referee: [§4.2] §4.2 and Table 2 (Occ3D-nuScenes results): The SOTA zero-shot semantic occupancy claim rests on the transfer of 2D VFM features via joint image/scene optimization, yet the paper provides no 3D-only ablations or depth consistency metrics to verify that this corrects projection ambiguities and multi-view inconsistencies in nuScenes urban scenes; this is load-bearing for the central claim.

Authors: We acknowledge the value of isolating the contribution of the 3D scene-level supervision. The joint optimization is designed to enforce multi-view consistency by back-propagating 3D scene features into the Gaussian parameters, which in principle mitigates projection ambiguities that arise from independent 2D image supervision. In the revised manuscript we will add 3D-only ablations (removing the scene-level term) to Table 2 and report a depth consistency metric (e.g., average reprojection error across views) on the Occ3D-nuScenes validation set to quantify the improvement. These additions will directly address the load-bearing aspect of the central claim.
Revision: yes
Referee: [§3.2] §3.2 (Multi-Modal Gaussian Transformer): The fusion and querying mechanism across modalities is described at a high level, but it is unclear how the transformer handles absent modalities (e.g., no LiDAR in camera-only evaluations), which directly affects the claimed multi-modal advantage and zero-shot generalization.

Authors: The Multi-Modal Gaussian Transformer employs modality-specific encoders followed by a shared cross-attention layer. When a modality is unavailable, its corresponding key/value projections are masked out and the attention is computed only over the remaining modalities; a learnable modality embedding is still provided so that the Gaussian queries remain well-conditioned. This design permits both multi-modal training and camera-only inference without retraining. We will expand §3.2 with explicit pseudocode and a short ablation on modality dropout to make the mechanism unambiguous and to reinforce the zero-shot generalization argument.
Revision: yes
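
The masking behavior described in this simulated response can be sketched for a single Gaussian query as below. This is only an illustration of generic masked cross-attention: the function name, the one-token-per-modality setup, and the toy vectors are assumptions, the learnable modality embedding is omitted, and nothing here is confirmed by the paper itself.

```python
import math


def masked_cross_attention(query, keys, values, available):
    """Single-query scaled dot-product attention that ignores masked-out tokens."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [
        scale * sum(q * k for q, k in zip(query, key)) if ok else -math.inf
        for key, ok in zip(keys, available)
    ]
    top = max(scores)
    if top == -math.inf:
        raise ValueError("at least one modality token must be available")
    # Masked tokens have score -inf, so their softmax weight is exactly 0.0.
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * v[d] for w, v in zip(weights, values)) / total
        for d in range(dim)
    ]


# One token per modality (camera, LiDAR, radar); radar is absent in this example.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
fused = masked_cross_attention(query, keys, values, [True, True, False])
lidar_only = masked_cross_attention(query, keys[:2], [[2.0, 3.0], [7.0, 8.0]], [False, True])
```

Since masked tokens receive zero weight, the output is a convex combination of the available modalities only, which is what lets the same weights serve multi-modal and camera-only inference in this kind of design.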

Circularity Check

0 steps flagged

No circularity: method builds on external VFM supervision and reports empirical results on standard benchmarks


The paper's core contributions—the Multi-Modal Gaussian Transformer and Shelf-Supervised Learning Paradigm—are defined as novel mechanisms that query features from independent off-the-shelf vision foundation models and optimize Gaussians jointly at image and scene levels. The state-of-the-art zero-shot semantic occupancy claim is presented as an experimental outcome on the external Occ3D-nuScenes benchmark rather than a mathematical derivation that reduces to its own inputs. No equations, fitted parameters, or self-citations are shown to create self-definitional loops or rename known results as predictions. The supervision source and evaluation data remain external to the paper's own constructs, rendering the derivation chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Limited information available from abstract only; no explicit free parameters, axioms, or invented entities described.



Reference graph

Works this paper leans on

103 extracted references · 103 canonical work pages · 3 internal anchors

[1] Hervé Abdi and Lynne J Williams. Principal component analysis. Wiley interdisciplinary reviews: computational statistics, 2(4):433–459, 2010.
[2] Xuyang Bai, Zeyu Hu, Xinge Zhu, Qingqiu Huang, Yilun Chen, Hongbo Fu, and Chiew-Lan Tai. Transfusion: Robust lidar-camera fusion for 3d object detection with transformers. In CVPR, pages 1090–1099, 2022.
[3] Luca Barsellotti, Lorenzo Bianchi, Nicola Messina, Fabio Carrara, Marcella Cornia, Lorenzo Baraldi, Fabrizio Falchi, and Rita Cucchiara. Talking to dino: Bridging self-supervised vision backbones with language for open-vocabulary segmentation. In ICCV, pages 22025–22035,
[4] Simon Boeder, Fabian Gigengack, and Benjamin Risse. Gaussianflowocc: Sparse and weakly supervised occupancy estimation using gaussian splatting and temporal flow. arXiv preprint arXiv:2502.17288, 2025.
[5] Simon Boeder, Fabian Gigengack, and Benjamin Risse. Langocc: Open vocabulary occupancy estimation via volume rendering. In 3DV, pages 200–210. IEEE, 2025.
[6] Aydin Buluç, Jeremy T Fineman, Matteo Frigo, John R Gilbert, and Charles E Leiserson. Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks. In Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures, pages 233–244, 2009.
[7] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In CVPR, pages 11621–11631, 2020.
[8] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, 2021.
[9] Clemente Donoso (CDonosoK). ros2 camera lidar fusion: Ros2 package to calculate the intrinsic and extrinsic camera calibration and fuse camera & lidar. https://github.com/CDonosoK/ros2_camera_lidar_fusion, 2025.
[10] Florian Chabot, Nicolas Granger, and Guillaume Lapouge. Gaussianbev: 3d gaussian representation meets perception models for bev segmentation. In WACV, pages 2250–2259. IEEE, 2025.
[11] Loick Chambon, Eloi Zablocki, Mickaël Chen, Florent Bartoccioni, Patrick Pérez, and Matthieu Cord. Pointbev: A sparse approach for bev predictions. In CVPR, pages 15195–15204, 2024.
[12] Loick Chambon, Eloi Zablocki, Alexandre Boulch, Mickael Chen, and Matthieu Cord. Gaussrender: Learning 3d occupancy with gaussian rendering. In ICCV, pages 27010–27020, 2025.
[13] David Charatan, Sizhe Lester Li, Andrea Tagliasacchi, and Vincent Sitzmann. pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction. In CVPR, pages 19457–19467, 2024.
[14] Runnan Chen, Youquan Liu, Lingdong Kong, Xinge Zhu, Yuexin Ma, Yikang Li, Yuenan Hou, Yu Qiao, and Wenping Wang. Clip2scene: Towards label-efficient 3d scene understanding by clip. In CVPR, pages 7020–7030, 2023.
[15] Xuanyao Chen, Tianyuan Zhang, Yue Wang, Yilun Wang, and Hang Zhao. Futr3d: A unified sensor fusion framework for 3d detection. In CVPR, pages 172–181, 2023.
[16] Runyu Ding, Jihan Yang, Chuhui Xue, Wenqing Zhang, Song Bai, and Xiaojuan Qi. Pla: Language-driven open-vocabulary 3d scene understanding. In CVPR, pages 7010–7019, 2023.
[17] Runyu Ding, Jihan Yang, Chuhui Xue, Wenqing Zhang, Song Bai, and Xiaojuan Qi. Lowis3d: Language-driven open-world instance-level 3d scene understanding. IEEE TPAMI, 46(12):8517–8533, 2024.
[18] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. NeurIPS, 27, 2014.
[19] Wanshui Gan, Ningkai Mo, Hongbin Xu, and Naoto Yokoya. A simple attempt for 3d occupancy estimation in autonomous driving. CoRR, 2023.
[20] Wanshui Gan, Fang Liu, Hongbin Xu, Ningkai Mo, and Naoto Yokoya. Gaussianocc: Fully self-supervised and efficient 3d occupancy estimation with gaussian splatting. In ICCV, pages 28980–28990, 2025.
[21] Qingdong He, Jinlong Peng, Zhengkai Jiang, Kai Wu, Xiaozhong Ji, Jiangning Zhang, Yabiao Wang, Chengjie Wang, Mingang Chen, and Yunsheng Wu. Unimov3d: Uni-modality open-vocabulary 3d scene understanding with fine-grained feature representation. arXiv preprint arXiv:2401.11395, 2024.
[22] Anthony Hu, Zak Murez, Nikhil Mohan, Sofía Dudas, Jeffrey Hawke, Vijay Badrinarayanan, Roberto Cipolla, and Alex Kendall. Fiery: Future instance prediction in bird’s-eye view from surround monocular cameras. In ICCV, pages 15273–15282, 2021.
[23] Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, and Shaojie Shen. Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation. IEEE TPAMI, 2024.
[24] Shengchao Hu, Li Chen, Penghao Wu, Hongyang Li, Junchi Yan, and Dacheng Tao. St-p3: End-to-end vision-based autonomous driving via spatial-temporal feature learning. In ECCV, pages 533–549. Springer, 2022.
[25] Yuanming Hu, Jiafeng Liu, Xuanda Yang, Mingkuan Xu, Ye Kuang, Weiwei Xu, Qiang Dai, William T Freeman, and Frédo Durand. Quantaichi: a compiler for quantized simulations. ACM Transactions on Graphics (TOG), 40(4):1–16,
[26] Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. In CVPR, pages 17853–17862, 2023.
[27] Tianyu Huang, Bowen Dong, Yunhan Yang, Xiaoshui Huang, Rynson WH Lau, Wanli Ouyang, and Wangmeng Zuo. Clip2point: Transfer clip to point cloud classification with image-depth pre-training. In ICCV, pages 22157–22167, 2023.
[28] Yuanhui Huang, Wenzhao Zheng, Yunpeng Zhang, Jie Zhou, and Jiwen Lu. Tri-perspective view for vision-based 3d semantic occupancy prediction. In CVPR, pages 9223–9232,
[29] Yuanhui Huang, Wenzhao Zheng, Borui Zhang, Jie Zhou, and Jiwen Lu. Selfocc: Self-supervised vision-based 3d occupancy prediction. In CVPR, pages 19946–19956, 2024.
[30] Yuanhui Huang, Wenzhao Zheng, Yunpeng Zhang, Jie Zhou, and Jiwen Lu. Gaussianformer: Scene as gaussians for vision-based 3d semantic occupancy prediction. In ECCV, pages 376–393. Springer, 2024.
[31] Yuanhui Huang, Amonnut Thammatadatrakoon, Wenzhao Zheng, Yunpeng Zhang, Dalong Du, and Jiwen Lu. Gaussianformer-2: Probabilistic gaussian superposition for efficient 3d occupancy prediction. In CVPR, pages 27477–27486, 2025.
[32] Zhening Huang, Xiaoyang Wu, Xi Chen, Hengshuang Zhao, Lei Zhu, and Joan Lasenby. Openins3d: Snap and lookup for 3d open-vocabulary instance segmentation. In ECCV, pages 169–185. Springer, 2024.
[33] Xiaosong Jia, Zhenjie Yang, Qifeng Li, Zhiyuan Zhang, and Junchi Yan. Bench2drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving. In NeurIPS 2024 Datasets and Benchmarks Track, 2024.
[34] Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. Vad: Vectorized scene representation for efficient autonomous driving. In ICCV, pages 8340–8350, 2023.
[35] Haoyi Jiang, Liu Liu, Tianheng Cheng, Xinjie Wang, Tianwei Lin, Zhizhong Su, Wenyu Liu, and Xinggang Wang. Gausstr: Foundation model-aligned gaussian transformer for self-supervised 3d spatial understanding. In CVPR, pages 11960–11970, 2025.
[36] Cijo Jose, Théo Moutakanni, Dahyun Kang, Federico Baldassarre, Timothée Darcet, Hu Xu, Daniel Li, Marc Szafraniec, Michaël Ramamonjisoa, Maxime Oquab, et al. Dinov2 meets text: A unified framework for image-and pixel-level vision-language alignment. In CVPR, pages 24905–24916, 2025.
[37] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1,
[38] Mehar Khurana, Neehar Peri, James Hays, and Deva Ramanan. Shelf-supervised cross-modal pre-training for 3d object detection. In CoRL, 2024.
[39] Sebastian Koch, Narunas Vaskevicius, Mirco Colosi, Pedro Hermosilla, and Timo Ropinski. Open3dsg: Open-vocabulary 3d scene graphs from point clouds with queryable objects and open-set relationships. In CVPR, pages 14183–14193, 2024.
[40] Alex H Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. Pointpillars: Fast encoders for object detection from point clouds. In CVPR, pages 12697–12705, 2019.
[41] Ruihuang Li, Zhengqiang Zhang, Chenhang He, Zhiyuan Ma, Vishal M Patel, and Lei Zhang. Dense multimodal alignment for open-vocabulary 3d scene understanding. In ECCV, pages 416–434. Springer, 2024.
[42] Yanwei Li, Yilun Chen, Xiaojuan Qi, Zeming Li, Jian Sun, and Jiaya Jia. Unifying voxel-based representation with transformer for 3d object detection. NeurIPS, 35:18442–18455, 2022.
[43] Yiming Li, Zhiding Yu, Christopher Choy, Chaowei Xiao, Jose M Alvarez, Sanja Fidler, Chen Feng, and Anima Anandkumar. Voxformer: Sparse voxel transformer for camera-based 3d semantic scene completion. In CVPR, pages 9087–9098, 2023.
[44] Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Qiao Yu, and Jifeng Dai. Bevformer: learning bird’s-eye-view representation from lidar-camera via spatiotemporal transformers. IEEE TPAMI, 2024.
[45] Zhiqi Li, Zhiding Yu, Shiyi Lan, Jiahan Li, Jan Kautz, Tong Lu, and Jose M Alvarez. Is ego status all you need for open-loop end-to-end autonomous driving? In CVPR, pages 14864–14873, 2024.
[46] Tingting Liang, Hongwei Xie, Kaicheng Yu, Zhongyu Xia, Zhiwei Lin, Yongtao Wang, Tao Tang, Bing Wang, and Zhi Tang. Bevfusion: A simple and robust lidar-camera fusion framework. NeurIPS, 35:10421–10434, 2022.
[47] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, pages 2117–2125,
[48] Ruixun Liu, Lingyu Kong, Derun Li, and Hang Zhao. Occvla: Vision-language-action model with implicit 3d occupancy supervision. arXiv preprint arXiv:2509.05578, 2025.
[49] Shuai Liu, Quanmin Liang, Zefeng Li, Boyang Li, and Kai Huang. Gaussianfusion: Gaussian-based multi-sensor fusion for end-to-end autonomous driving. arXiv preprint arXiv:2506.00034, 2025.
[50] Zhijian Liu, Haotian Tang, Alexander Amini, Xinyu Yang, Huizi Mao, Daniela Rus, and Song Han. Bevfusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation. In ICRA, 2023.
[51] Shiyang Lu, Haonan Chang, Eric Pu Jing, Abdeslam Boularias, and Kostas Bekris. Ovir-3d: Open-vocabulary 3d instance retrieval without training on 3d data. In CoRL, pages 1610–1620. PMLR, 2023.
[52] Yuheng Lu, Chenfeng Xu, Xiaobao Wei, Xiaodong Xie, Masayoshi Tomizuka, Kurt Keutzer, and Shanghang Zhang. Open-vocabulary point-cloud object detection without 3d annotation. In CVPR, pages 1190–1199, 2023.
[53] Steven Macenski, Tully Foote, Brian Gerkey, Chris Lalancette, and William Woodall. Robot operating system 2: Design, architecture, and uses in the wild. Science robotics, 7(66):eabm6074, 2022.
[54] Rafay Mohiuddin, Sai Manoj Prakhya, Fiona Collins, Ziyuan Liu, and André Borrmann. Opensu3d: Open world 3d scene understanding using foundation models. In ICRA, pages 13560–13566. IEEE, 2025.
[55] Phuc Nguyen, Tuan Duc Ngo, Evangelos Kalogerakis, Chuang Gan, Anh Tran, Cuong Pham, and Khoi Nguyen. Open3dis: Open-vocabulary 3d instance segmentation with 2d mask guidance. In CVPR, pages 4018–4028, 2024.
[56] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
[57] Aljoša Ošep, Tim Meinhardt, Francesco Ferroni, Neehar Peri, Deva Ramanan, and Laura Leal-Taixé. Better call sal: Towards learning to segment anything in lidar. In ECCV, pages 71–90. Springer, 2024.
[58] Mingjie Pan, Jiaming Liu, Renrui Zhang, Peixiang Huang, Xiaoqi Li, Hongwei Xie, Bing Wang, Li Liu, and Shanghang Zhang. Renderocc: Vision-centric 3d occupancy prediction with 2d rendering supervision. In ICRA, pages 12404–12411. IEEE, 2024.
[59] Songyou Peng, Kyle Genova, Chiyu Jiang, Andrea Tagliasacchi, Marc Pollefeys, Thomas Funkhouser, et al. Openscene: 3d scene understanding with open vocabularies. In CVPR, pages 815–824, 2023.
[60] Jonah Philion and Sanja Fidler. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. In ECCV, pages 194–210. Springer, 2020.
[61] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763, 2021.
[62] Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3. arXiv preprint arXiv:2508.10104, 2025.
[63] Ke Song, Yunhe Wu, Chunchit Siu, and Huiyuan Xiong. Graphgsocc: Semantic and geometric graph transformer for 3d gaussian splating-based occupancy prediction. arXiv preprint arXiv:2506.14825, 2025.
[64] Ayça Takmaz, Elisabetta Fedele, Robert W Sumner, Marc Pollefeys, Federico Tombari, and Francis Engelmann. Openmask3d: Open-vocabulary 3d instance segmentation. arXiv preprint arXiv:2306.13631, 2023.
[65] Ayca Takmaz, Alexandros Delitzas, Robert W Sumner, Francis Engelmann, Johanna Wald, and Federico Tombari. Search3d: Hierarchical open-vocabulary 3d segmentation. IEEE Robotics and Automation Letters, 2025.
[66] Ayça Takmaz, Cristiano Saltori, Neehar Peri, Tim Meinhardt, Riccardo de Lutio, Laura Leal-Taixé, and Aljoša Ošep. Towards learning to complete anything in lidar. In ICML,
[67] Zhiyu Tan, Zichao Dong, Cheng Zhang, Weikun Zhang, Hang Ji, and Hao Li. Ovo: Open-vocabulary occupancy. arXiv preprint arXiv:2305.16133, 2023.
[68] Xiaoyu Tian, Tao Jiang, Longfei Yun, Yucheng Mao, Huitong Yang, Yue Wang, Yilun Wang, and Hang Zhao. Occ3d: A large-scale 3d occupancy prediction benchmark for autonomous driving. In NeurIPS, pages 64318–64330,
[69] Wenwen Tong, Chonghao Sima, Tai Wang, Li Chen, Silei Wu, Hanming Deng, Yi Gu, Lewei Lu, Ping Luo, Dahua Lin, et al. Scene as occupancy. In ICCV, pages 8406–8415, 2023.
[70] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. NeurIPS, 30, 2017.
[71] Antonin Vobecky, Oriane Siméoni, David Hurych, Spyridon Gidaris, Andrei Bursuc, Patrick Pérez, and Josef Sivic. Pop-3d: Open-vocabulary 3d occupancy prediction from images. NeurIPS, 36:50545–50557, 2023.
[72] Haiyang Wang, Hao Tang, Shaoshuai Shi, Aoxue Li, Zhenguo Li, Bernt Schiele, and Liwei Wang. Unitr: A unified and efficient multi-modal transformer for bird’s-eye-view representation. In ICCV, pages 6792–6802, 2023.
[73] Letian Wang, Seung Wook Kim, Jiawei Yang, Cunjun Yu, Boris Ivanovic, Steven Waslander, Yue Wang, Sanja Fidler, Marco Pavone, and Peter Karkus. Distillnerf: Perceiving 3d scenes from single-glance images by distilling neural fields and foundation model features. NeurIPS, 37:62334–62361,
[74] Xiaofeng Wang, Zheng Zhu, Wenbo Xu, Yunpeng Zhang, Yi Wei, Xu Chi, Yun Ye, Dalong Du, Jiwen Lu, and Xingang Wang. Openoccupancy: A large scale benchmark for surrounding semantic occupancy perception. In ICCV, pages 17850–17859, 2023.
[75] Zhigang Wang, Yifei Su, Chenhui Li, Dong Wang, Yan Huang, Xuelong Li, and Bin Zhao. Open-vocabulary octree-graph for 3d scene understanding. In ICCV, pages 7037–7047, 2025.
[76] Yi Wei, Linqing Zhao, Wenzhao Zheng, Zheng Zhu, Jie Zhou, and Jiwen Lu. Surroundocc: Multi-camera 3d occupancy prediction for autonomous driving. In ICCV, pages 21729–21740, 2023.
[77] Jianyun Xu, Song Wang, Ziqian Ni, Chunyong Hu, Sheng Yang, Jianke Zhu, and Qiang Li. Sam4d: Segment anything in camera and lidar streams. In ICCV, 2025.
[78] Mutian Xu, Xingyilang Yin, Lingteng Qiu, Yang Liu, Xin Tong, and Xiaoguang Han. Sampro3d: Locating sam prompts in 3d for zero-shot scene segmentation. arXiv preprint arXiv:2311.17707, 2023.
[79] Runsheng Xu, Hubert Lin, Wonseok Jeon, Hao Feng, Yuliang Zou, Liting Sun, John Gorman, Kate Tolstaya, Sarah Tang, Brandyn White, et al. Wod-e2e: Waymo open dataset for end-to-end driving in challenging long-tail scenarios. arXiv preprint arXiv:2510.26125, 2025.
[80] Shaoqing Xu, Fang Li, Shengyin Jiang, Ziying Song, Li Liu, and Zhi-xin Yang. Gaussianpretrain: A simple unified 3d gaussian representation for visual pre-training in autonomous driving. arXiv preprint arXiv:2411.12452, 2024.



[39] [39]

Open3dsg: Open- vocabulary 3d scene graphs from point clouds with queryable objects and open-set relationships

Sebastian Koch, Narunas Vaskevicius, Mirco Colosi, Pe- dro Hermosilla, and Timo Ropinski. Open3dsg: Open- vocabulary 3d scene graphs from point clouds with queryable objects and open-set relationships. InCVPR, pages 14183–14193, 2024. 2

work page 2024

[40] [40]

Pointpillars: Fast encoders for object detection from point clouds

Alex H Lang, Sourabh V ora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. Pointpillars: Fast encoders for object detection from point clouds. InCVPR, pages 12697–12705, 2019. 3, 6

work page 2019

[41] [41]

Dense multimodal align- ment for open-vocabulary 3d scene understanding

Ruihuang Li, Zhengqiang Zhang, Chenhang He, Zhiyuan Ma, Vishal M Patel, and Lei Zhang. Dense multimodal align- ment for open-vocabulary 3d scene understanding. InECCV, pages 416–434. Springer, 2024. 2

work page 2024

[42] [42]

Unifying voxel-based representation with transformer for 3d object detection.NeurIPS, 35:18442– 18455, 2022

Yanwei Li, Yilun Chen, Xiaojuan Qi, Zeming Li, Jian Sun, and Jiaya Jia. Unifying voxel-based representation with transformer for 3d object detection.NeurIPS, 35:18442– 18455, 2022. 3

work page 2022

[43] [43]

V oxformer: Sparse voxel transformer for camera- based 3d semantic scene completion

Yiming Li, Zhiding Yu, Christopher Choy, Chaowei Xiao, Jose M Alvarez, Sanja Fidler, Chen Feng, and Anima Anand- kumar. V oxformer: Sparse voxel transformer for camera- based 3d semantic scene completion. InCVPR, pages 9087– 9098, 2023. 1

work page 2023

[44] [44]

Bevformer: learning bird’s-eye-view representation from lidar-camera via spatiotemporal transformers.IEEE TPAMI, 2024

Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chong- hao Sima, Tong Lu, Qiao Yu, and Jifeng Dai. Bevformer: learning bird’s-eye-view representation from lidar-camera via spatiotemporal transformers.IEEE TPAMI, 2024. 6

work page 2024

[45] [45]

Is ego status all you need for open-loop end-to-end autonomous driving? InCVPR, pages 14864–14873, 2024

Zhiqi Li, Zhiding Yu, Shiyi Lan, Jiahan Li, Jan Kautz, Tong Lu, and Jose M Alvarez. Is ego status all you need for open-loop end-to-end autonomous driving? InCVPR, pages 14864–14873, 2024. 5, 6, 8

work page 2024

[46] [46]

Bevfusion: A simple and robust lidar-camera fusion framework.NeurIPS, 35:10421–10434, 2022

Tingting Liang, Hongwei Xie, Kaicheng Yu, Zhongyu Xia, Zhiwei Lin, Yongtao Wang, Tao Tang, Bing Wang, and Zhi Tang. Bevfusion: A simple and robust lidar-camera fusion framework.NeurIPS, 35:10421–10434, 2022. 3

work page 2022

[47] [47]

Feature pyramid networks for object detection

Tsung-Yi Lin, Piotr Doll ´ar, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. InCVPR, pages 2117–2125,

work page

[48] [48]

Oc- cvla: Vision-language-action model with implicit 3d occu- pancy supervision.arXiv preprint arXiv:2509.05578, 2025

Ruixun Liu, Lingyu Kong, Derun Li, and Hang Zhao. Oc- cvla: Vision-language-action model with implicit 3d occu- pancy supervision.arXiv preprint arXiv:2509.05578, 2025. 5

work page arXiv 2025

[49] [49]

Gaussianfusion: Gaussian-based multi-sensor fu- sion for end-to-end autonomous driving.arXiv preprint arXiv:2506.00034, 2025

Shuai Liu, Quanmin Liang, Zefeng Li, Boyang Li, and Kai Huang. Gaussianfusion: Gaussian-based multi-sensor fu- sion for end-to-end autonomous driving.arXiv preprint arXiv:2506.00034, 2025. 5

work page arXiv 2025

[50] [50]

Bevfusion: Multi- task multi-sensor fusion with unified bird’s-eye view repre- sentation

Zhijian Liu, Haotian Tang, Alexander Amini, Xinyu Yang, Huizi Mao, Daniela Rus, and Song Han. Bevfusion: Multi- task multi-sensor fusion with unified bird’s-eye view repre- sentation. InICRA, 2023. 3

work page 2023

[51] [51]

Ovir-3d: Open-vocabulary 3d in- stance retrieval without training on 3d data

Shiyang Lu, Haonan Chang, Eric Pu Jing, Abdeslam Boular- ias, and Kostas Bekris. Ovir-3d: Open-vocabulary 3d in- stance retrieval without training on 3d data. InCoRL, pages 1610–1620. PMLR, 2023. 2

work page 2023

[52] [52]

Open-vocabulary point-cloud object detection without 3d an- notation

Yuheng Lu, Chenfeng Xu, Xiaobao Wei, Xiaodong Xie, Masayoshi Tomizuka, Kurt Keutzer, and Shanghang Zhang. Open-vocabulary point-cloud object detection without 3d an- notation. InCVPR, pages 1190–1199, 2023. 2

work page 2023

[53] [53]

Robot operating system 2: Design, architecture, and uses in the wild.Science robotics, 7(66):eabm6074, 2022

Steven Macenski, Tully Foote, Brian Gerkey, Chris Lalancette, and William Woodall. Robot operating system 2: Design, architecture, and uses in the wild.Science robotics, 7(66):eabm6074, 2022. 1 10

work page 2022

[54] [54]

Opensu3d: Open world 3d scene understanding using foundation models

Rafay Mohiuddin, Sai Manoj Prakhya, Fiona Collins, Ziyuan Liu, and Andr´e Borrmann. Opensu3d: Open world 3d scene understanding using foundation models. InICRA, pages 13560–13566. IEEE, 2025. 2

work page 2025

[55] [55]

Open3dis: Open-vocabulary 3d instance segmentation with 2d mask guidance

Phuc Nguyen, Tuan Duc Ngo, Evangelos Kalogerakis, Chuang Gan, Anh Tran, Cuong Pham, and Khoi Nguyen. Open3dis: Open-vocabulary 3d instance segmentation with 2d mask guidance. InCVPR, pages 4018–4028, 2024. 2

work page 2024

[56] [56]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 1, 3, 4, 6, 7

work page internal anchor Pith review Pith/arXiv arXiv 2023

[57] [57]

Better call sal: Towards learning to segment anything in lidar

Aljo ˇsa O ˇsep, Tim Meinhardt, Francesco Ferroni, Neehar Peri, Deva Ramanan, and Laura Leal-Taix ´e. Better call sal: Towards learning to segment anything in lidar. InECCV, pages 71–90. Springer, 2024. 2

work page 2024

[58] [58]

Renderocc: Vision-centric 3d occupancy predic- tion with 2d rendering supervision

Mingjie Pan, Jiaming Liu, Renrui Zhang, Peixiang Huang, Xiaoqi Li, Hongwei Xie, Bing Wang, Li Liu, and Shanghang Zhang. Renderocc: Vision-centric 3d occupancy predic- tion with 2d rendering supervision. InICRA, pages 12404– 12411. IEEE, 2024. 2

work page 2024

[59] [59]

Openscene: 3d scene understanding with open vocabularies

Songyou Peng, Kyle Genova, Chiyu Jiang, Andrea Tagliasacchi, Marc Pollefeys, Thomas Funkhouser, et al. Openscene: 3d scene understanding with open vocabularies. InCVPR, pages 815–824, 2023. 2

work page 2023

[60] [60]

Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d

Jonah Philion and Sanja Fidler. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. InECCV, pages 194–210. Springer, 2020. 6

work page 2020

[61] [61]

Learn- ing transferable visual models from natural language super- vision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language super- vision. InICML, pages 8748–8763, 2021. 1, 2, 5

work page 2021

[62] [62]

DINOv3

Oriane Sim ´eoni, Huy V V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Micha ¨el Ramamonjisoa, et al. Dinov3.arXiv preprint arXiv:2508.10104, 2025. 1, 3, 4, 6, 7, 2

work page internal anchor Pith review Pith/arXiv arXiv 2025

[63] [63]

Graphgsocc: Semantic and geometric graph transformer for 3d gaussian splating-based occupancy prediction.arXiv preprint arXiv:2506.14825, 2025

Ke Song, Yunhe Wu, Chunchit Siu, and Huiyuan Xiong. Graphgsocc: Semantic and geometric graph transformer for 3d gaussian splating-based occupancy prediction.arXiv preprint arXiv:2506.14825, 2025. 1

work page arXiv 2025

[64] [64]

Openmask3d: Open-vocabulary 3d instance segmenta- tion,

Ayc ¸a Takmaz, Elisabetta Fedele, Robert W Sumner, Marc Pollefeys, Federico Tombari, and Francis Engelmann. Open- mask3d: Open-vocabulary 3d instance segmentation.arXiv preprint arXiv:2306.13631, 2023. 2

work page arXiv 2023

[65] [65]

Search3d: Hierarchical open-vocabulary 3d segmentation

Ayca Takmaz, Alexandros Delitzas, Robert W Sumner, Francis Engelmann, Johanna Wald, and Federico Tombari. Search3d: Hierarchical open-vocabulary 3d segmentation. IEEE Robotics and Automation Letters, 2025. 2

work page 2025

[66] [66]

Towards learning to complete anything in lidar

Ayc ¸a Takmaz, Cristiano Saltori, Neehar Peri, Tim Mein- hardt, Riccardo de Lutio, Laura Leal-Taix´e, and Aljoˇsa Oˇsep. Towards learning to complete anything in lidar. InICML,

[67]

Ovo: Open-vocabulary occupancy

Zhiyu Tan, Zichao Dong, Cheng Zhang, Weikun Zhang, Hang Ji, and Hao Li. Ovo: Open-vocabulary occupancy. arXiv preprint arXiv:2305.16133, 2023. 2

[68]

Occ3d: A large-scale 3d occupancy prediction benchmark for autonomous driving

Xiaoyu Tian, Tao Jiang, Longfei Yun, Yucheng Mao, Huitong Yang, Yue Wang, Yilun Wang, and Hang Zhao. Occ3d: A large-scale 3d occupancy prediction benchmark for autonomous driving. In NeurIPS, pages 64318–64330,

[69]

Scene as occupancy

Wenwen Tong, Chonghao Sima, Tai Wang, Li Chen, Silei Wu, Hanming Deng, Yi Gu, Lewei Lu, Ping Luo, Dahua Lin, et al. Scene as occupancy. In ICCV, pages 8406–8415, 2023. 1, 5, 6

[70]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. NeurIPS, 30, 2017. 3, 5

[71]

Pop-3d: Open-vocabulary 3d occupancy prediction from images

Antonin Vobecky, Oriane Siméoni, David Hurych, Spyridon Gidaris, Andrei Bursuc, Patrick Pérez, and Josef Sivic. Pop-3d: Open-vocabulary 3d occupancy prediction from images. NeurIPS, 36:50545–50557, 2023. 2

[72]

Unitr: A unified and efficient multi-modal transformer for bird’s-eye-view representation

Haiyang Wang, Hao Tang, Shaoshuai Shi, Aoxue Li, Zhenguo Li, Bernt Schiele, and Liwei Wang. Unitr: A unified and efficient multi-modal transformer for bird’s-eye-view representation. In ICCV, pages 6792–6802, 2023. 3

[73]

Distillnerf: Perceiving 3d scenes from single-glance images by distilling neural fields and foundation model features

Letian Wang, Seung Wook Kim, Jiawei Yang, Cunjun Yu, Boris Ivanovic, Steven Waslander, Yue Wang, Sanja Fidler, Marco Pavone, and Peter Karkus. Distillnerf: Perceiving 3d scenes from single-glance images by distilling neural fields and foundation model features. NeurIPS, 37:62334–62361,

[74]

Openoccupancy: A large scale benchmark for surrounding semantic occupancy perception

Xiaofeng Wang, Zheng Zhu, Wenbo Xu, Yunpeng Zhang, Yi Wei, Xu Chi, Yun Ye, Dalong Du, Jiwen Lu, and Xingang Wang. Openoccupancy: A large scale benchmark for surrounding semantic occupancy perception. In ICCV, pages 17850–17859, 2023. 1

[75]

Open-vocabulary octree-graph for 3d scene understanding

Zhigang Wang, Yifei Su, Chenhui Li, Dong Wang, Yan Huang, Xuelong Li, and Bin Zhao. Open-vocabulary octree-graph for 3d scene understanding. In ICCV, pages 7037–7047, 2025. 2

[76]

Surroundocc: Multi-camera 3d occupancy prediction for autonomous driving

Yi Wei, Linqing Zhao, Wenzhao Zheng, Zheng Zhu, Jie Zhou, and Jiwen Lu. Surroundocc: Multi-camera 3d occupancy prediction for autonomous driving. In ICCV, pages 21729–21740, 2023. 1

[77]

Sam4d: Segment anything in camera and lidar streams

Jianyun Xu, Song Wang, Ziqian Ni, Chunyong Hu, Sheng Yang, Jianke Zhu, and Qiang Li. Sam4d: Segment anything in camera and lidar streams. In ICCV, 2025. 2

[78]

Sampro3d: Locating sam prompts in 3d for zero-shot scene segmentation

Mutian Xu, Xingyilang Yin, Lingteng Qiu, Yang Liu, Xin Tong, and Xiaoguang Han. Sampro3d: Locating sam prompts in 3d for zero-shot scene segmentation. arXiv preprint arXiv:2311.17707, 2023. 2

[79]

Wod-e2e: Waymo open dataset for end-to-end driving in challenging long-tail scenarios

Runsheng Xu, Hubert Lin, Wonseok Jeon, Hao Feng, Yuliang Zou, Liting Sun, John Gorman, Kate Tolstaya, Sarah Tang, Brandyn White, et al. Wod-e2e: Waymo open dataset for end-to-end driving in challenging long-tail scenarios. arXiv preprint arXiv:2510.26125, 2025. 5

[80]

Gaussianpretrain: A simple unified 3d gaussian representation for visual pre-training in autonomous driving

Shaoqing Xu, Fang Li, Shengyin Jiang, Ziying Song, Li Liu, and Zhi-xin Yang. Gaussianpretrain: A simple unified 3d gaussian representation for visual pre-training in autonomous driving. arXiv preprint arXiv:2411.12452, 2024. 1, 2
