pith. sign in

arxiv: 2606.24353 · v1 · pith:5KTKCDH4new · submitted 2026-06-23 · 💻 cs.CV · cs.LG

Open-Vocabulary BEV Segmentation with 3D-Aware Geometric Constraints

Pith reviewed 2026-06-26 00:38 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords open-vocabulary segmentationbird's-eye viewBEV perceptionvision-language modelsgeometric constraintsautonomous drivingGaussian splattingnuScenes
0
0 comments X

The pith

A three-stage geometric constraint system lifts 2D vision-language model outputs into consistent open-vocabulary bird's-eye-view maps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a framework called OVBEVSeg that performs bird's-eye-view segmentation on categories never seen in training by starting from vision-language model outputs in 2D images. It tackles the geometric inconsistency that arises when these 2D labels are projected into a unified top-down view by adding reliable 3D projection for pseudo-labels, per-scene structural constraints, and distillation for speed. A sympathetic reader would care because this removes the closed-set limitation that makes current BEV systems brittle in real driving environments with unexpected objects. The method reports 15.3 mIoU gains on unseen classes on nuScenes while matching semi-supervised baselines that use up to 40 percent ground-truth labels and running at 2.5 times the speed with far less memory.

Core claim

OVBEVSeg enhances efficient Gaussian splatting unprojection through three progressive stages of 3D geometric constraints: 2D-to-BEV pseudo-labeling via reliable 3D projection for open-vocabulary generalization, joint 2D-BEV per-scene optimization with BEV structural constraints for geometric consistency, and 3D geometric distillation for online efficiency. On the nuScenes dataset this produces state-of-the-art open-vocabulary BEV segmentation that outperforms closed-set methods by 15.3 mIoU on unseen categories, remains competitive with self- and semi-supervised baselines trained on up to 40 percent ground-truth annotations, and delivers 2.5 times faster inference at 0.22 times the memory of

What carries the argument

Three-stage progressive application of 3D geometric constraints to Gaussian splatting unprojection that enforces consistency when lifting 2D VLM semantics into BEV space.

If this is right

  • Open-vocabulary recognition becomes possible in BEV perception without any ground-truth labels for new classes.
  • Performance on unseen categories exceeds that of closed-set trained models by a large margin.
  • Accuracy stays competitive with methods that require up to 40 percent labeled data while using none for novel classes.
  • Inference runs 2.5 times faster and uses 0.22 times the memory of prior projection-based approaches.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same constraint stages could be applied to other multi-camera tasks such as 3D object detection or occupancy prediction where 2D semantics must be lifted consistently.
  • Reduced reliance on dense novel-class annotations could lower the data-collection cost for training perception stacks in new geographic regions.
  • If the constraints generalize, they might allow a single model to handle both known and unknown road users without retraining.

Load-bearing premise

Reliable 3D projection for pseudo-labeling together with BEV structural constraints and distillation can remove the geometric inconsistency of lifting 2D VLM semantics into BEV without introducing new systematic errors or needing novel-class ground truth.

What would settle it

A test set of scenes containing novel objects at multiple depths where the produced BEV maps show persistent misalignment or label leakage between objects would show that the geometric constraints do not resolve the inconsistency.

Figures

Figures reproduced from arXiv: 2606.24353 by Dae Jung Kim, Hojun Choi, Hyunjung Shim, Jinhan Lee, Kisung Kim, Seulbin Hwang.

Figure 1
Figure 1. Figure 1: (c,d) Ground-truth or pseudo BEV annotations. (a,e) 3D and BEV renderings with plain 3D Gaussian splatting, showing dispersed splats beyond object regions. (b,f) 3D and BEV renderings with our BAGS, with splats concentrated along object bound￾aries. (g,h) Online BEV predictions: the baseline misses novel classes (e.g., “truck”) and exhibits boundary distortions, which are successfully resolved by our metho… view at source ↗
Figure 2
Figure 2. Figure 2: Our method consists of three modules: (1) PBL generates BEV pseudo-labels by leveraging unsupervised 3D object discovery [47], foundational 2D segmentation [43], and VLM grounding [42] to construct a 2D–BEV correspondence set S˜. (2) BAGS performs 3D geometry reconstruction using 2D, depth, and BEV occupancy supervision from S˜. (3) BAGD distills this optimized 3D geometry into a feed-forward 3DGS￾based st… view at source ↗
Figure 3
Figure 3. Figure 3: Open-world 2D–BEV auto-labeling. “truck” “car” 3D Language Gaussians “car” “car” SAM CLIP Image Encoder Encoder Latent Space Supervise Render Teacher Distribution Student Distribution Decoder Reconstruction Loss KL Loss Vehicle Subclass Feature Disentangled! Multi-View 2D Images [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 5
Figure 5. Figure 5: (a–b) Our BEV predictions vs. the baseline: orange regions denote both base and novel vehicle classes. (c–d) Our BEV renderings (BAGS) vs. vanilla 3DGS. Distillation ratio. We evaluate various BAGD ratios to assess their efficacy in 3D geometric knowledge transfer. A typical nuScenes [2] scene consists of ap￾proximately 40 frames captured at 2 Hz. Given the high temporal redundancy between consecutive fram… view at source ↗
Figure 6
Figure 6. Figure 6: Language-embedded OV 3D scene understanding via the Lang–BAGS protocol. BEV renderings, whereas BAGS (d) effectively resolves these inconsistencies, ensuring robustly aligned 3D geometries as reliable OVBS supervision. Open-world 2D–BEV auto-labeling. Given the increasing scale of datasets, au￾tomatic multimodal annotation is essential for real-world AI applications like AD. In contrast to closed-set appro… view at source ↗
read the original abstract

Bird's-eye view (BEV) perception fuses multi-camera images into a unified top-down representation for autonomous driving. Despite recent progress, state-of-the-art methods remain confined to closed-set scenarios, making them vulnerable to unpredictable real-world environments. In this work, we introduce open-vocabulary BEV segmentation (OVBS), which leverages vision-language models (VLMs) to recognize categories beyond the training set while maintaining precise BEV perception and real-time efficiency. A key challenge in OVBS lies in the 3D geometric inconsistency inherent in the ill-posed lifting of 2D VLM semantics into BEV. To address this, we propose OVBEVSeg, a geometry-aware OVBS framework that enhances efficient Gaussian splatting (GS)-based unprojection by leveraging robust 3D geometric constraints across three progressive stages: (1) 2D-to-BEV pseudo-labeling via reliable 3D projection for OV generalization; (2) joint 2D-BEV per-scene optimization with BEV structural constraints for 3D geometric consistency; and (3) 3D geometric distillation for online efficiency. On the nuScenes dataset, OVBEVSeg achieves state-of-the-art performance, outperforming closed-set methods by 15.3 mIoU on unseen categories. Remarkably, even with no novel-class ground-truth labels, it remains competitive with self- and semi-supervised baselines trained with up to 40% of ground-truth annotations. Furthermore, it achieves 2.5x faster inference with only 0.22x the memory consumption of projection-based methods. Project page: https://hchoi256.github.io/projects/ovbevseg/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces OVBEVSeg, a three-stage geometry-aware framework for open-vocabulary BEV segmentation that lifts 2D VLM semantics into BEV using efficient Gaussian splatting unprojection enhanced by reliable 3D projection for pseudo-labeling, BEV structural constraints in joint per-scene optimization, and 3D geometric distillation for efficiency. On nuScenes it claims SOTA open-vocabulary performance with a 15.3 mIoU gain on unseen categories over closed-set methods, competitiveness with self-/semi-supervised baselines using up to 40% ground-truth annotations, and 2.5x faster inference at 0.22x memory of projection-based approaches.

Significance. If the geometric consistency claims hold with the reported gains, the work would meaningfully advance open-vocabulary 3D perception for autonomous driving by enabling generalization to novel classes without novel-class labels while preserving real-time efficiency; the combination of pseudo-labeling, structural constraints, and distillation is a plausible route to mitigating lifting inconsistencies.

major comments (2)
  1. [Abstract, §3] Abstract and §3 (method overview): the central claim that the three-stage pipeline resolves 3D geometric inconsistency without systematic errors or novel-class GT rests on the reliability of the 3D projection step and the effectiveness of the BEV constraints, yet no quantitative ablation isolating each stage's contribution to mIoU on unseen classes is described, which is load-bearing for attributing the 15.3 mIoU gain.
  2. [Abstract] Abstract: the efficiency claims (2.5x faster inference, 0.22x memory) and competitiveness with 40%-annotation baselines are presented without reported variance, exact baseline definitions, or per-scene optimization details, preventing verification that the distillation stage preserves the consistency achieved in stage 2.
minor comments (1)
  1. [Abstract] The project page URL is given but no supplementary material or code link is referenced in the abstract, which would aid reproducibility of the GS-based unprojection and constraint implementation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major comment below with point-by-point responses.

read point-by-point responses
  1. Referee: [Abstract, §3] Abstract and §3 (method overview): the central claim that the three-stage pipeline resolves 3D geometric inconsistency without systematic errors or novel-class GT rests on the reliability of the 3D projection step and the effectiveness of the BEV constraints, yet no quantitative ablation isolating each stage's contribution to mIoU on unseen classes is described, which is load-bearing for attributing the 15.3 mIoU gain.

    Authors: We agree that isolating the per-stage contributions specifically to mIoU on unseen classes is important for rigorously attributing the reported 15.3 mIoU gain. The manuscript includes component ablations in Section 4.3, but these do not provide a dedicated breakdown of unseen-class mIoU gains per stage. We will add a quantitative ablation table focused on unseen categories (with and without each stage) to the revised manuscript or supplementary material. revision: yes

  2. Referee: [Abstract] Abstract: the efficiency claims (2.5x faster inference, 0.22x memory) and competitiveness with 40%-annotation baselines are presented without reported variance, exact baseline definitions, or per-scene optimization details, preventing verification that the distillation stage preserves the consistency achieved in stage 2.

    Authors: We will revise the abstract and relevant sections to include exact baseline definitions (specifying the self-/semi-supervised methods and annotation percentages used) and report standard deviations for the efficiency metrics where multiple runs are available. Per-scene optimization details are already described in Section 3.2; we will add a cross-reference and a brief analysis showing that the distilled model in stage 3 retains the geometric consistency metrics from stage 2. If variance was not computed in the original experiments, we will note this limitation explicitly. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The abstract and available description outline a three-stage pipeline relying on external geometric projection, VLM features, and optimization constraints drawn from standard computer vision techniques. No equations, self-citations, or fitted quantities are shown that reduce predictions or uniqueness claims to the paper's own inputs by construction. The performance claims are benchmarked against independent datasets and baselines, keeping the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on abstract; no explicit free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.1-grok · 5855 in / 1190 out tokens · 18549 ms · 2026-06-26T00:38:56.296176+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

67 extracted references · 6 canonical work pages · 3 internal anchors

  1. [1]

    ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth

    Bhat,S.F.,Birkl,R.,Wofk,D.,Wonka,P.,Müller,M.:Zoedepth:Zero-shottransfer by combining relative and metric depth. CoRRabs/2302.12288(2023)

  2. [2]

    In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recog- nition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020

    Caesar, H., Bankiti, V., Lang, A.H., Vora, S., Liong, V.E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., Beijbom, O.: nuscenes: A multimodal dataset for autonomous driving. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recog- nition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020. pp. 11618–11628 (2020)

  3. [3]

    Cao, Y., Jv, Y., Xu, D.: 3dgs-det: Empower 3d gaussian splatting with boundary guidance and box-focused sampling for 3d object detection (2024)

  4. [4]

    In: IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2025, Tucson, AZ, USA, February 26 - March 6, 2025

    Chabot, F., Granger, N., Lapouge, G.: Gaussianbev: 3d gaussian representation meets perception models for bev segmentation. In: IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2025, Tucson, AZ, USA, February 26 - March 6, 2025. pp. 2250–2259 (2025)

  5. [5]

    Chambon, L., Zablocki, É., Chen, M., Bartoccioni, F., Pérez, P., Cord, M.: Point- bev:Asparseapproachtobevpredictions.In:IEEE/CVFConferenceonComputer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024. pp. 15195–15204 (2024)

  6. [6]

    In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024

    Charatan, D., Li, S.L., Tagliasacchi, A., Sitzmann, V.: Pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024. pp. 19457–19467 (2024)

  7. [7]

    CoRRabs/2410.05111(2024)

    Chen, Q., Yang, S., Du, S., Tang, T., Chen, P., Huo, Y.: Lidar-gs:real-time lidar re-simulation using gaussian splatting. CoRRabs/2410.05111(2024)

  8. [8]

    In: Computer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, Septem- ber 29-October 4, 2024, Proceedings, Part XXI

    Chen, Y., Xu, H., Zheng, C., Zhuang, B., Pollefeys, M., Geiger, A., Cham, T., Cai, J.: Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images. In: Computer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, Septem- ber 29-October 4, 2024, Proceedings, Part XXI. vol. 15079, pp. 370–386 (2024)

  9. [9]

    Choi, H., Lim, Y., Shin, J., Shim, H.: Cot-pl: Visual chain-of-thought reasoning meets pseudo-labeling for open-vocabulary object detection (2025)

  10. [10]

    In: IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, CVPR 2024 - Workshops, Seattle, WA, USA, June 17-18, 2024

    Chung, J., Oh, J., Lee, K.M.: Depth-regularized optimization for 3d gaussian splat- ting in few-shot images. In: IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, CVPR 2024 - Workshops, Seattle, WA, USA, June 17-18, 2024. pp. 811–820 (2024)

  11. [11]

    In: Computer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part XL

    Etchegaray, D., Huang, Z., Harada, T., Luo, Y.: Find n’ propagate: Open- vocabulary 3d object detection in urban environments. In: Computer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part XL. vol. 15098, pp. 133–151 (2024)

  12. [12]

    Fan, Z., Cong, W., Wen, K., Wang, K., Zhang, J., Ding, X., Xu, D., Ivanovic, B., Pavone, M., Pavlakos, G., Wang, Z., Wang, Y.: Instantsplat: Sparse-view gaussian splatting in seconds (2025)

  13. [13]

    In: Computer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part LXXVII

    Fang,G.,Wang,B.:Mini-splatting:Representingsceneswithaconstrainednumber of gaussians. In: Computer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part LXXVII. vol. 15135, pp. 165–181. Springer (2024)

  14. [14]

    In: Computer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part LXXVII

    Fang,G.,Wang,B.:Mini-splatting:Representingsceneswithaconstrainednumber of gaussians. In: Computer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part LXXVII. vol. 15135, pp. 165–181. Springer (2024) 16 H. Choi et al

  15. [15]

    In: ECCV (2024)

    Gosala, N., Petek, K., Kiran, B.R., Yogamani, S.K., Drews-Jr, P.L.J., Burgard, W., Valada, A.: Letsmap: Unsupervised representation learning for label-efficient semantic BEV mapping. In: ECCV (2024)

  16. [16]

    In: The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022 (2022)

    Gu, X., Lin, T., Kuo, W., Cui, Y.: Open-vocabulary object detection via vision and language knowledge distillation. In: The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022 (2022)

  17. [17]

    Harley, A.W., Fang, Z., Li, J., Ambrus, R., Fragkiadaki, K.: Simple-bev: What really matters for multi-sensor BEV perception? In: IEEE International Conference on Robotics and Automation, ICRA 2023, London, UK, May 29 - June 2, 2023. pp. 2759–2765 (2023)

  18. [18]

    In: IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2024, Waikoloa, HI, USA, January 3-8, 2024

    Hess,G.,Tonderski,A.,Petersson,C.,Åström,K.,Svensson,L.:Lidarclipor:HowI learned to talk to point clouds. In: IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2024, Waikoloa, HI, USA, January 3-8, 2024. pp. 7423–7432 (2024)

  19. [19]

    In: 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021

    Hu, A., Murez, Z., Mohan, N., Dudas, S., Hawke, J., Badrinarayanan, V., Cipolla, R.,Kendall,A.:FIERY:futureinstancepredictioninbird’s-eyeviewfromsurround monocular cameras. In: 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021. pp. 15253–15262 (2021)

  20. [20]

    In: Computer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XXXVIII

    Hu, S., Chen, L., Wu, P., Li, H., Yan, J., Tao, D.: ST-P3: end-to-end vision-based autonomous driving via spatial-temporal feature learning. In: Computer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XXXVIII. vol. 13698, pp. 533–549 (2022)

  21. [21]

    In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023

    Hu, Y., Yang, J., Chen, L., Li, K., Sima, C., Zhu, X., Chai, S., Du, S., Lin, T., Wang, W., Lu, L., Jia, X., Liu, Q., Dai, J., Qiao, Y., Li, H.: Planning-oriented autonomous driving. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023. pp. 17853– 17862 (2023)

  22. [22]

    BEVDet: High-performance Multi-camera 3D Object Detection in Bird-Eye-View

    Huang, J., Huang, G., Zhu, Z., Du, D.: Bevdet: High-performance multi-camera 3d object detection in bird-eye-view. CoRRabs/2112.11790(2021)

  23. [23]

    In: IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023

    Jia, X., Gao, Y., Chen, L., Yan, J., Liu, P.L., Li, H.: Driveadapter: Breaking the coupling barrier of perception and planning in end-to-end autonomous driving. In: IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023. pp. 7919–7929 (2023)

  24. [24]

    ACM Trans

    Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph.42, 139:1–139:14 (2023)

  25. [25]

    In: IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023

    Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W., Dollár, P., Girshick, R.B.: Segment anything. In: IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023. pp. 3992–4003 (2023)

  26. [26]

    In: IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24,

    Li, S., Fischer, T., Ke, L., Ding, H., Danelljan, M., Yu, F.: Ovtrack: Open- vocabulary multiple object tracking. In: IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24,

  27. [27]

    5567–5577 (2023)

    pp. 5567–5577 (2023)

  28. [28]

    Li, Y., Ge, Z., Yu, G., Yang, J., Wang, Z., Shi, Y., Sun, J., Li, Z.: Bevdepth: Ac- quisition of reliable depth for multi-view 3d object detection. In: Thirty-Seventh AAAI Conference on Artificial Intelligence, AAAI 2023, Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence, IAAI 2023, Thirteenth Sympo- sium on Educational Advance...

  29. [29]

    Li, Z., Wang, W., Li, H., Xie, E., Sima, C., Lu, T., Qiao, Y., Dai, J.: Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotem- poraltransformers.In:ComputerVision-ECCV2022-17thEuropeanConference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part IX. pp. 1–18 (2022)

  30. [30]

    Lin, Y., Wang, C., Chang, C., Sun, H.: An efficient framework for counting pedes- trians crossing a line using low-cost devices: the benefits of distilling the knowledge in a neural network. Multim. Tools Appl.80(3), 4037–4051 (2021)

  31. [31]

    In: IEEE/CVF International Con- ference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023

    Liu, H., Teng, Y., Lu, T., Wang, H., Wang, L.: Sparsebev: High-performance sparse 3d object detection from multi-camera videos. In: IEEE/CVF International Con- ference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023. pp. 18534–18544 (2023)

  32. [32]

    In: Computer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part XLVII

    Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Jiang, Q., Li, C., Yang, J., Su, H., Zhu, J., Zhang, L.: Grounding DINO: marrying DINO with grounded pre- training for open-set object detection. In: Computer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part XLVII. vol. 15105, pp. 38–55 (2024)

  33. [33]

    In: IEEE/CVF Interna- tional Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6,

    Liu, Y., Yan, J., Jia, F., Li, S., Gao, A., Wang, T., Zhang, X.: Petrv2: A unified framework for 3d perception from multi-camera images. In: IEEE/CVF Interna- tional Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6,

  34. [34]

    3239–3249 (2023)

    pp. 3239–3249 (2023)

  35. [35]

    In: 7th Inter- national Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019 (2019)

    Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: 7th Inter- national Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019 (2019)

  36. [36]

    IEEE Robotics Autom

    Lu, C., van de Molengraft, M.J.G., Dubbelman, G.: Monocular semantic oc- cupancy grid mapping with convolutional variational encoder-decoder networks. IEEE Robotics Autom. Lett.4(2), 445–452 (2019)

  37. [37]

    In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025

    Lu, S., Tsai, Y., Chen, Y.: Toward real-world BEV perception: Depth uncertainty estimation via gaussian splatting. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025. pp. 17124–17133 (2025)

  38. [38]

    In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024

    Lu, T., Yu, M., Xu, L., Xiangli, Y., Wang, L., Lin, D., Dai, B.: Scaffold-gs: Struc- tured 3d gaussians for view-adaptive rendering. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024. IEEE (2024)

  39. [39]

    CoRRabs/2207.01987(2022)

    Lu, Y., Xu, C., Wei, X., Xie, X., Tomizuka, M., Keutzer, K., Zhang, S.: Open- vocabulary 3d detection via image-level class and debiased cross-modal contrastive learning. CoRRabs/2207.01987(2022)

  40. [40]

    IEEE Robotics Autom

    Pan, B., Sun, J., Leung, H.Y.T., Andonian, A., Zhou, B.: Cross-view semantic segmentation for sensing surroundings. IEEE Robotics Autom. Lett.5(3), 4867– 4873 (2020)

  41. [41]

    In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023

    Peng, S., Genova, K., Jiang, C.M., Tagliasacchi, A., Pollefeys, M., Funkhouser, T.A.: Openscene: 3d scene understanding with open vocabularies. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023. pp. 815–824 (2023)

  42. [42]

    In: Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XIV

    Philion, J., Fidler, S.: Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. In: Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XIV. vol. 12359, pp. 194–210 (2020)

  43. [43]

    In: IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, CVPR 2024, Seattle, WA, USA, June 16-22, 2024

    Qin, M., Li, W., Zhou, J., Wang, H., Pfister, H.: Langsplat: 3d language gaussian splatting. In: IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, CVPR 2024, Seattle, WA, USA, June 16-22, 2024. pp. 20051–20060 (2024) 18 H. Choi et al

  44. [44]

    In: Proceedings of the 38th In- ternational Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: Proceedings of the 38th In- ternational Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event. vol. 139, pp. 8748–8...

  45. [45]

    Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks

    Ren, T., Liu, S., Zeng, A., Lin, J., Li, K., Cao, H., Chen, J., Huang, X., Chen, Y., Yan, F., Zeng, Z., Zhang, H., Li, F., Yang, J., Li, H., Jiang, Q., Zhang, L.: Grounded SAM: assembling open-world models for diverse visual tasks. CoRR abs/2401.14159(2024)

  46. [46]

    In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020

    Roddick, T., Cipolla, R.: Predicting semantic map representations from images using pyramid occupancy networks. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020. pp. 11135–11144 (2020)

  47. [47]

    Tan,K.,Shen,Y.,Tu,M.,Zhu,H.,Wang,B.,Chen,G.,Ye,H.,Sun,H.:Ufo:Unify- ing feed-forward and optimization-based methods for large driving scene modeling (2026)

  48. [48]

    In: Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA

    Tan, M., Le, Q.V.: Efficientnet: Rethinking model scaling for convolutional neu- ral networks. In: Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA. vol. 97, pp. 6105–6114 (2019)

  49. [49]

    de Vries Lentsch, T., Caesar, H., Gavrila, D.: UNION: unsupervised 3d object detection using object appearance-based pseudo-classes. In: Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024 (2024)

  50. [50]

    In: IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023

    Wang, S., Liu, Y., Wang, T., Li, Y., Zhang, X.: Exploring object-centric temporal modeling for efficient multi-view 3d object detection. In: IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023. pp. 3598–3608 (2023)

  51. [51]

    In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024

    Wu, G., Yi, T., Fang, J., Xie, L., Zhang, X., Wei, W., Liu, W., Tian, Q., Wang, X.: 4d gaussian splatting for real-time dynamic scene rendering. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024. pp. 20310–20320 (2024)

  52. [52]

    IEEE Trans

    Wu, J., Li, X., Xu, S., Yuan, H., Ding, H., Yang, Y., Li, X., Zhang, J., Tong, Y., Jiang, X., Ghanem, B., Tao, D.: Towards open vocabulary learning: A survey. IEEE Trans. Pattern Anal. Mach. Intell.46(7), 5092–5113 (2024)

  53. [53]

    IEEE Trans

    Wu, J., Li, X., Xu, S., Yuan, H., Ding, H., Yang, Y., Li, X., Zhang, J., Tong, Y., Jiang, X., Ghanem, B., Tao, D.: Towards open vocabulary learning: A survey. IEEE Trans. Pattern Anal. Mach. Intell.46, 5092–5113 (2024)

  54. [54]

    Wu, Y., Meng, J., Li, H., Wu, C., Shi, Y., Cheng, X., Zhao, C., Feng, H., Ding, E., Wang, J., Zhang, J.: Opengaussian: Towards point-level 3d gaussian-based open vocabulary understanding. In: Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December ...

  55. [55]

    CoRRabs/2204.05088(2022)

    Xie, E., Yu, Z., Zhou, D., Philion, J., Anandkumar, A., Fidler, S., Luo, P., Álvarez, J.M.: M2bev: Multi-camera joint 3d detection and segmentation with unified birds- eye view representation. CoRRabs/2204.05088(2022)

  56. [56]

    In: IEEE/CVF Conference on OVBEVSeg 19 Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025

    Xu, H., Peng, S., Wang, F., Blum, H., Barath, D., Geiger, A., Pollefeys, M.: Depth- splat: Connecting gaussian splatting and depth. In: IEEE/CVF Conference on OVBEVSeg 19 Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025. pp. 16453–16463 (2025)

  57. [57]

    In: CVPR (2023)

    Yang,L.,Qi,L.,Feng,L.,Zhang,W.,Shi,Y.:Revisitingweak-to-strongconsistency in semi-supervised semantic segmentation. In: CVPR (2023)

  58. [58]

    In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020

    Yu, F., Chen, H., Wang, X., Xian, W., Chen, Y., Liu, F., Madhavan, V., Darrell, T.: BDD100K: A diverse driving dataset for heterogeneous multitask learning. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020. pp. 2633–2642 (2020)

  59. [59]

    In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021

    Zareian, A., Rosa, K.D., Hu, D.H., Chang, S.: Open-vocabulary object detection using captions. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021. pp. 14393–14402 (2021)

  60. [60]

    ISPRS Int

    Zeng, Z., Boehm, J.: Exploration of an open vocabulary model on semantic seg- mentation for street scene imagery. ISPRS Int. J. Geo Inf.13(5), 153 (2024)

  61. [61]

    In: Computer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part IX

    Zhao,S.,Zhang,Z.,Schulter,S.,Zhao,L.,Kumar,B.G.V.,Stathopoulos,A.,Chan- draker, M., Metaxas, D.N.: Exploiting unlabeled data with vision and language models for object detection. In: Computer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part IX. vol. 13669, pp. 159–175 (2022)

  62. [62]

    In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024

    Zhao, T., Chen, Y., Wu, Y., Liu, T., Du, B., Xiao, P., Qiu, S., Yang, H., Li, G., Yang, Y., Lin, Y.: Improving bird’s eye view semantic segmentation by task decom- position. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024. pp. 15512–15521 (2024)

  63. [63]

    In: IEEE/CVF Conference on Computer Vision and Pattern Recog- nition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022

    Zhou,B.,Krähenbühl,P.:Cross-viewtransformersforreal-timemap-viewsemantic segmentation. In: IEEE/CVF Conference on Computer Vision and Pattern Recog- nition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022. pp. 13750–13759 (2022)

  64. [64]

    In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024

    Zhou, H., Shao, J., Xu, L., Bai, D., Qiu, W., Liu, B., Wang, Y., Geiger, A., Liao, Y.: HUGS: holistic urban 3d scene understanding via gaussian splatting. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024. pp. 21336–21345 (2024)

  65. [65]

    In: IEEE/CVFConferenceonComputerVisionandPatternRecognition,CVPR2024, Seattle, WA, USA, June 16-22, 2024

    Zhou, X., Lin, Z., Shan, X., Wang, Y., Sun, D., Yang, M.: Drivinggaussian: Com- posite gaussian splatting for surrounding dynamic autonomous driving scenes. In: IEEE/CVFConferenceonComputerVisionandPatternRecognition,CVPR2024, Seattle, WA, USA, June 16-22, 2024. pp. 21634–21643 (2024)

  66. [66]

    Zhu, J., Liu, L., Tang, Y., Wen, F., Li, W., Liu, Y.: Semi-supervised learning for visual bird’s eye view semantic segmentation (2024)

  67. [67]

    truck,” “bus,

    Zou, J., Tian, K., Zhu, Z., Ye, Y., Wang, X.: Diffbev: Conditional diffusion model for bird’s eye view perception. In: Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence, IAAI 2024, Fourteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2...