arxiv: 2604.12551 · v1 · submitted 2026-04-14 · 💻 cs.CV

Recognition: unknown

Cross-Attentive Multiview Fusion of Vision-Language Embeddings

Tomas Berriel Martins , Martin R. Oswald , Javier Civera

Authors on Pith no claims yet

Pith reviewed 2026-05-10 14:51 UTC · model grok-4.3

classification 💻 cs.CV

keywords multiview fusionvision-language modelscross-attention3D semantic segmentationself-supervised learningzero-shot learning3D instance classification

0 comments

The pith

A multiview transformer cross-attends vision-language descriptors from multiple views and fuses them with consistency self-supervision into unified 3D instance embeddings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a transformer-based method to combine 2D vision-language features captured from different camera angles into one strong 3D representation per object or scene element. Rather than averaging features or choosing one view at random, the architecture lets each view's descriptor attend to the others before fusion. A self-supervised loss that penalizes inconsistency across views is added to the usual classification training signal. This produces embeddings that improve 3D semantic and instance classification, including zero-shot transfer to new datasets.

Core claim

Cross-attending across vision-language descriptors from multiple viewpoints and fusing them into a unified per-3D-instance embedding, while adding multiview consistency as a self-supervision signal, yields embeddings that outperform naive averaging or single-view selection and reach state-of-the-art results on 3D semantic and instance classification benchmarks, including zero-shot evaluations on out-of-domain data.

What carries the argument

The Cross-Attentive Multiview Fusion (CAMFusion) transformer that cross-attends vision-language descriptors across views, fuses them into one embedding per 3D instance, and trains with both a supervised target-class loss and a multiview consistency self-supervision term.

If this is right

3D semantic and instance classification accuracy rises compared with averaging or single-view baselines on standard benchmarks.
Zero-shot performance on out-of-domain 3D data improves when the same fused embeddings are used.
Open-vocabulary 2D vision-language models can be lifted to 3D scenes without heuristic view selection.
Multiview consistency self-supervision adds measurable gains on top of supervised losses alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same cross-attention-plus-consistency pattern could be tested on other multiview tasks such as 3D object detection or scene reconstruction.
If the fused embeddings prove more robust to viewpoint changes, they may reduce the need for dense view sampling in practical 3D mapping systems.
The method suggests a general route for turning any collection of 2D multimodal features into coherent 3D representations without heavy supervision.

Load-bearing premise

That letting descriptors from different views attend to one another plus enforcing cross-view consistency will create better unified 3D embeddings without adding view-specific biases or needing per-dataset retuning.

What would settle it

A held-out 3D dataset on which CAMFusion embeddings produce lower accuracy on semantic or instance classification than simple averaging of the same per-view descriptors.

Figures

Figures reproduced from arXiv: 2604.12551 by Javier Civera, Martin R. Oswald, Tomas Berriel Martins.

**Figure 1.** Figure 1: CAMFusion overview. We propose a method to fuse vision-language descriptors from multiple views. Given masks of an object in n images (red cameras), we extract per-view visionlanguage features (F1 . . . Fn) and aggregate them using our CAMFusion to produce a unified descriptor Fmv. We also introduce a multiview contrastive loss that enforces consistency between the fused descriptor Fmv and those from unse… view at source ↗

**Figure 2.** Figure 2: CAMFusion architecture. Given a set of vision-language descriptors F1, ..., Fn of a 3D instance at n views, these are processed by a multi-view transformer. At each block d, each embedding E d i alternates between attending to it self and attending to its memory Md i made of embeddings from other views. Finally, a learned latent pooling computes the final multi-view vision-language feature Fmv. From a coll… view at source ↗

**Figure 3.** Figure 3: Multiview contrastive loss w/o (a) and w/ (b) the class mask. [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Qualitative results for open-vocabulary semantic instance classification on Replica (top) and ScanNet200 (bottom), using GT instance masks for all methods. We visualize results of our CAMFusion against the ground-truth and the baselines OV-3DIS and Open YOLO 3D. Observe how CAMFusion produces sharper and more coherent object boundaries filtering out the segmentation noise observed in the baselines, resulti… view at source ↗

**Figure 5.** Figure 5: 3D Instance segmentation vs number of views 1 [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗

read the original abstract

Vision-language models have been key to the development of open-vocabulary 2D semantic segmentation. Lifting these models from 2D images to 3D scenes, however, remains a challenging problem. Existing approaches typically back-project and average 2D descriptors across views, or heuristically select a single representative one, often resulting in suboptimal 3D representations. In this work, we introduce a novel multiview transformer architecture that cross-attends across vision-language descriptors from multiple viewpoints and fuses them into a unified per-3D-instance embedding. As a second contribution, we leverage multiview consistency as a self-supervision signal for this fusion, which significantly improves performance when added to a standard supervised target-class loss. Our Cross-Attentive Multiview Fusion, which we denote with its acronym CAMFusion, not only consistently outperforms naive averaging or single-view descriptor selection, but also achieves state-of-the-art results on 3D semantic and instance classification benchmarks, including zero-shot evaluations on out-of-domain datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CAMFusion adds cross-attention across views and a consistency loss to fuse 2D VL descriptors into 3D instance embeddings, beating averaging baselines and hitting reported SOTA on classification tasks.

read the letter

The main point is a multiview transformer that cross-attends vision-language descriptors from several views to produce one embedding per 3D instance, with an added consistency term used as self-supervision alongside the usual supervised loss. This combination is presented as the new piece beyond simple averaging or single-view selection. The paper shows consistent improvements on 3D semantic and instance classification benchmarks and includes zero-shot tests on out-of-domain data. The architecture is straightforward to describe and the consistency signal is a low-cost addition that appears to help without new labels. Those are the concrete advances. The experiments support the claim that the method outperforms the obvious baselines, and the zero-shot numbers are the part that would matter most for downstream use in robotics or scene understanding. Soft spots are limited but real. The gains depend on how strong the baselines really are and whether the consistency loss introduces any hidden view bias that only shows up on certain data distributions. The abstract and reported results do not spell out detailed ablations that separate the cross-attention from the loss, so it is not yet clear how much each contributes or how sensitive the numbers are to dataset-specific choices. The out-of-domain results are encouraging but would benefit from more error analysis to show where the method still fails. This work is for people already working on lifting 2D open-vocabulary models to 3D or on multiview embedding fusion. A reader who needs a practical fusion step for instance-level 3D tasks will find usable ideas here. It deserves a serious referee because the problem is well-defined, the method is reproducible from the description, and the empirical claims are testable with standard benchmarks. I would send it to review and ask for expanded ablations plus any code or training details that would let others confirm the consistency term's contribution.

Referee Report

0 major / 3 minor

Summary. The paper introduces CAMFusion, a multiview transformer that cross-attends vision-language descriptors across multiple views to produce unified per-3D-instance embeddings. It adds a multiview consistency self-supervision term to the standard supervised classification loss. The method is shown to outperform naive averaging and single-view selection baselines, achieving state-of-the-art results on 3D semantic and instance classification benchmarks as well as zero-shot evaluations on out-of-domain datasets.

Significance. If the reported gains hold under scrutiny, the work would meaningfully advance open-vocabulary 3D scene understanding by replacing heuristic fusion with a learned cross-attentive mechanism plus consistency regularization. The zero-shot out-of-domain results, if robust, would be a notable strength for generalization claims in vision-language lifting.

minor comments (3)

§3.2: the cross-attention formulation would benefit from an explicit equation showing how query/key/value projections are shared or view-specific, as the current prose description leaves the exact parameter sharing ambiguous.
Table 2: the zero-shot column reports absolute accuracy but omits the corresponding numbers for the averaging and single-view baselines; adding these would strengthen the direct comparison.
§4.3: the consistency loss weight λ is stated to be fixed at 0.1 across all experiments; a brief sensitivity plot or table would help readers assess whether this choice is dataset-dependent.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and accurate summary of our work on CAMFusion, as well as the recommendation for minor revision. We are encouraged by the recognition of the potential impact on open-vocabulary 3D scene understanding through learned cross-attentive fusion and consistency regularization, including the noted strength of the zero-shot out-of-domain results.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents an empirical architectural contribution consisting of a multiview transformer that cross-attends vision-language descriptors from multiple views and adds a multiview consistency self-supervision term to the supervised loss. No derivation chain, first-principles equations, or mathematical predictions are present that could reduce the claimed performance gains to fitted parameters or self-referential definitions. The central results rest on benchmark comparisons (including zero-shot out-of-domain) rather than any self-citation load-bearing uniqueness theorem or ansatz smuggled via prior work. The approach is self-contained and externally falsifiable through standard training and evaluation protocols.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review based solely on abstract; no equations, hyperparameters, or architectural details are provided to identify free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5477 in / 1186 out tokens · 40994 ms · 2026-05-10T14:51:00.586162+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

50 extracted references · 10 canonical work pages · 6 internal anchors

[1]

Perception Encoder: The best visual embeddings are not at the output of the network

Bolya, D., Huang, P.Y ., Sun, P., Cho, J.H., Madotto, A., Wei, C., Ma, T., Zhi, J., Rajasegaran, J., Rasheed, H., et al.: Perception encoder: The best visual embeddings are not at the output of the network. arXiv preprint arXiv:2504.13181 (2025) 3, 7

work page internal anchor Pith review arXiv 2025
[2]

In: The Thirteenth International Conference on Learning Representations (2025) 4, 8, 12, 13

Boudjoghra, M.E.A., Dai, A., Lahoud, J., Cholakkal, H., Anwer, R.M., Khan, S., Khan, F.S.: Open-YOLO 3d: Towards fast and accurate open-vocabulary 3d instance segmentation. In: The Thirteenth International Conference on Learning Representations (2025) 4, 8, 12, 13

2025
[3]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Cabon, Y ., Stoffl, L., Antsfeld, L., Csurka, G., Chidlovskii, B., Revaud, J., Leroy, V .: Must3r: Multi-view network for stereo 3d reconstruction. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 1050–1060 (2025) 2, 6

2025
[4]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022) 3

Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask trans- former for universal image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022) 3

2022
[5]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Cherti, M., Beaumont, R., Wightman, R., Wortsman, M., Ilharco, G., Gordon, C., Schuhmann, C., Schmidt, L., Jitsev, J.: Reproducible scaling laws for contrastive language-image learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 2818–2829 (2023) 1

2023
[6]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Cho, S., Shin, H., Hong, S., Arnab, A., Seo, P.H., Kim, S.: Cat-seg: Cost aggregation for open-vocabulary semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4113–4123 (2024) 3, 5

2024
[7]

In: Proceedings of the IEEE Con- ference on Computer Vision and Pattern Recognition (CVPR) (2017) 7

Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Nießner, M.: Scannet: Richly-annotated 3d reconstructions of indoor scenes. In: Proceedings of the IEEE Con- ference on Computer Vision and Pattern Recognition (CVPR) (2017) 7

2017
[8]

In: International Conference on Learning Representations (2024) 9

Engelmann, F., Manhardt, F., Niemeyer, M., Tateno, K., Pollefeys, M., Tombari, F.: Open- NeRF: Open Set 3D Neural Scene Segmentation with Pixel-Wise Features and Rendered Novel Views. In: International Conference on Learning Representations (2024) 9

2024
[9]

Robotics: Science and Systems (2024) 1

Ewen, P., Chen, H., Chen, Y ., Li, A., Bagali, A., Gunjal, G., Vasudevan, R.: You’ve got to feel it to believe it: Multi-modal bayesian inference for semantic and property prediction. Robotics: Science and Systems (2024) 1

2024
[10]

arXiv preprint arXiv:2507.22052 (2025) 4, 11

Gong, Z., Li, X., Tosi, F., Han, J., Mattoccia, S., Cai, J., Poggi, M.: Ov3r: Open-vocabulary semantic 3d reconstruction from rgb videos. arXiv preprint arXiv:2507.22052 (2025) 4, 11

work page arXiv 2025
[11]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Hou, J., Dai, X., He, Z., Dai, A., Nießner, M.: Mask3d: Pre-training 2d vision transformers by learning masked 3d priors. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13510–13519 (2023) 4, 8, 13, 14

2023
[12]

In: European Conference on Computer Vision

Jiao, S., Zhu, H., Huang, J., Zhao, Y ., Wei, Y ., Shi, H.: Collaborative vision-text representation optimizing for open-vocabulary segmentation. In: European Conference on Computer Vision. pp. 399–416. Springer (2024) 3

2024
[13]

arXiv preprint arXiv:2507.23134 (2025) 9, 12, 13

Jung, S., Zheng, J., Zhang, K., Qiao, N., Chen, A.Y ., Xia, L., Liu, C., Sun, Y ., Zeng, X., Huang, H.W., et al.: Details matter for indoor open-vocabulary 3d instance segmentation. arXiv preprint arXiv:2507.23134 (2025) 9, 12, 13

work page arXiv 2025
[14]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y ., Dollar, P., Girshick, R.: Segment anything. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 4015–4026 (October
[15]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Koch, S., Vaskevicius, N., Colosi, M., Hermosilla, P., Ropinski, T.: Open3dsg: Open- vocabulary 3d scene graphs from point clouds with queryable objects and open-set rela- tionships. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14183–14193 (2024) 1 16 T. Berriel Martins et al

2024
[16]

In: European Conference on Computer Vision

Lan, M., Chen, C., Ke, Y ., Wang, X., Feng, L., Zhang, W.: Proxyclip: Proxy attention improves clip for open-vocabulary segmentation. In: European Conference on Computer Vision. pp. 70–88. Springer (2024) 3

2024
[17]

Journal of Field Robotics42(1), 287–301 (2025) 1

Lee, K., Lee, K.: Terrain-aware path planning via semantic segmentation and uncertainty rejection filter with adversarial noise for mobile robots. Journal of Field Robotics42(1), 287–301 (2025) 1

2025
[18]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Liu, Y ., Dong, S., Wang, S., Yin, Y ., Yang, Y ., Fan, Q., Chen, B.: Slam3r: Real-time dense scene reconstruction from monocular rgb videos. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 16651–16662 (2025) 6

2025
[19]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

Liu, Z., Lin, Y ., Cao, Y ., Hu, H., Wei, Y ., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 10012–10022 (2021) 3

2021
[20]

IEEE Robotics and Automation Letters pp

Martins, T.B., Oswald, M.R., Civera, J.: Open-vocabulary online semantic mapping for slam. IEEE Robotics and Automation Letters pp. 1–8 (2025) 2, 4, 6, 7, 8, 9, 11, 12, 19

2025
[21]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Nguyen, P., Luu, M., Tran, A., Pham, C., Nguyen, K.: Any3dis: Class-agnostic 3d instance segmentation by 2d mask tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3636–3645 (June 2025) 1, 4, 13

2025
[22]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Nguyen, P., Ngo, T.D., Kalogerakis, E., Gan, C., Tran, A., Pham, C., Nguyen, K.: Open3dis: Open-vocabulary 3d instance segmentation with 2d mask guidance. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4018–4028 (2024) 1, 4, 8, 11, 12, 13

2024
[23]

DINOv2: Learning Robust Visual Features without Supervision

Oquab, M., Darcet, T., Moutakanni, T., Ramé, M., Cord, Q., Szafraniec, M., Bojanowski, P., Joulin, A., Laptev, M.: Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023) 3, 4

work page internal anchor Pith review Pith/arXiv arXiv 2023
[24]

In: European Confer- ence on Computer Vision

Pan, T., Tang, L., Wang, X., Shan, S.: Tokenize anything via prompting. In: European Confer- ence on Computer Vision. pp. 330–348. Springer (2024) 3

2024
[25]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Peng, S., Genova, K., Jiang, C., Tagliasacchi, A., Pollefeys, M., Funkhouser, T., et al.: Openscene: 3d scene understanding with open vocabularies. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 815–824 (2023) 1, 4, 8, 9, 11, 12

2023
[26]

In: International conference on machine learning

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PmLR (2021) 1, 3

2021
[27]

SAM 2: Segment Anything in Images and Videos

Ravi, N., Gabeur, V ., Hu, Y .T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., et al.: Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714 (2024) 3, 5

work page internal anchor Pith review Pith/arXiv arXiv 2024
[28]

In: European conference on computer vision

Rozenberszki, D., Litany, O., Dai, A.: Language-grounded indoor 3d semantic segmentation in the wild. In: European conference on computer vision. pp. 125–141. Springer (2022) 8

2022
[29]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Sajjadi, M.S.M., Meyer, H., Pot, E., Bergmann, U., Greff, K., Radwan, N., V ora, S., Luˇci´c, M., Duckworth, D., Dosovitskiy, A., Uszkoreit, J., Funkhouser, T., Tagliasacchi, A.: Scene representation transformer: Geometry-free novel view synthesis through set-latent scene representations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and P...

2022
[30]

In: 2022 International Conference on Robotics and Automation (ICRA)

Schmid, L., Delmerico, J., Schönberger, J.L., Nieto, J., Pollefeys, M., Siegwart, R., Cadena, C.: Panoptic multi-tsdfs: a flexible representation for online multi-resolution volumetric mapping and long-term dynamic scene consistency. In: 2022 International Conference on Robotics and Automation (ICRA). pp. 8018–8024. IEEE (2022) 1

2022
[31]

Harnessing vision foundation models for high-performance, training- free open vocabulary segmentation.arXiv preprint arXiv:2411.09219, 2024

Shi, Y ., Dong, M., Xu, C.: Harnessing vision foundation models for high-performance, training-free open vocabulary segmentation. arXiv preprint arXiv:2411.09219 (2024) 3, 5 CAMFusion 17

work page arXiv 2024
[32]

In: 2023 IEEE interna- tional symposium on mixed and augmented reality (ISMAR)

Stanescu, A., Mohr, P., Kozinski, M., Mori, S., Schmalstieg, D., Kalkofen, D.: State-aware configuration detection for augmented reality step-by-step tutorials. In: 2023 IEEE interna- tional symposium on mixed and augmented reality (ISMAR). pp. 157–166. IEEE (2023) 1

2023
[33]

The Replica Dataset: A Digital Replica of Indoor Spaces

Straub, J., Whelan, T., Ma, L., Chen, Y ., Wijmans, E., Green, S., Engel, J.J., Mur-Artal, R., Ren, C., Verma, S., Clarkson, A., Yan, M., Budge, B., Yan, Y ., Pan, X., Yon, J., Zou, Y ., Leon, K., Carter, N., Briales, J., Gillingham, T., Mueggler, E., Pesqueira, L., Savva, M., Batra, D., Strasdat, H.M., Nardi, R.D., Goesele, M., Lovegrove, S., Newcombe, R...

work page internal anchor Pith review arXiv 1906
[34]

EVA-CLIP: Improved Training Techniques for CLIP at Scale

Sun, Q., Fang, Y ., Wu, L., Wang, X., Cao, Y .: Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389 (2023) 3

work page internal anchor Pith review arXiv 2023
[35]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Sun, S., Li, R., Torr, P., Gu, X., Li, S.: Clip as rnn: Segment countless visual concepts without training endeavor. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13171–13182 (2024) 3

2024
[36]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Sun, Z., Fang, Y ., Wu, T., Zhang, P., Zang, Y ., Kong, S., Xiong, Y ., Lin, D., Wang, J.: Alpha-clip: A clip model focusing on wherever you want. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 13019–13029 (2024) 1

2024
[37]

IEEE Robotics and Automation Letters (2025) 13

Takmaz, A., Delitzas, A., Sumner, R.W., Engelmann, F., Wald, J., Tombari, F.: Search3d: Hierarchical open-vocabulary 3d segmentation. IEEE Robotics and Automation Letters (2025) 13

2025
[38]

Advances in Neural Information Processing Systems36, 68367–68390 (2023) 3

Takmaz, A., Fedele, E., Sumner, R., Pollefeys, M., Tombari, F., Engelmann, F.: Openmask3d: Open-vocabulary 3d instance segmentation. Advances in Neural Information Processing Systems36, 68367–68390 (2023) 3

2023
[39]

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Tschannen, M., Gritsenko, A., Wang, X., Naeem, M.F., Alabdulmohsin, I., Parthasarathy, N., Evans, T., Beyer, L., Xia, Y ., Mustafa, B., et al.: Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786 (2025) 3

work page internal anchor Pith review arXiv 2025
[40]

In: Proceedings of the IEEE/CVF Interna- tional Conference on Computer Vision (ICCV) (2019) 7

Wald, J., Avetisyan, A., Navab, N., Tombari, F., Nießner, M.: Rio: 3d object instance re-localization in changing indoor environments. In: Proceedings of the IEEE/CVF Interna- tional Conference on Computer Vision (ICCV) (2019) 7

2019
[41]

In: First Workshop on Vision-Language Models for Navigation and Manipulation at ICRA 2024 (2024) 2, 4, 9, 19

Werby, A., Huang, C., Büchner, M., Valada, A., Burgard, W.: Hierarchical open-vocabulary 3d scene graphs for language-grounded robot navigation. In: First Workshop on Vision-Language Models for Navigation and Manipulation at ICRA 2024 (2024) 2, 4, 9, 19

2024
[42]

In: Proceedings of Robotics: Science and Systems

Werby, A., Huang, C., Büchner, M., Valada, A., Burgard, W.: Hierarchical Open-V ocabulary 3D Scene Graphs for Language-Grounded Robot Navigation. In: Proceedings of Robotics: Science and Systems. Delft, Netherlands (July 2024). https://doi.org/10.15607/ RSS.2024.XX.0771, 11

2024
[43]

Transactions on Machine Learning Research (2025) 3, 5, 7, 19

Xiao, Y ., Fu, Q., Tao, H., Wu, Y ., Zhu, Z., Hoiem, D.: Textregion: Text-aligned region tokens from frozen image-text models. Transactions on Machine Learning Research (2025) 3, 5, 7, 19

2025
[44]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

Yeshwanth, C., Liu, Y ., Nießner, M., Dai, A.: Scannet++: A high-fidelity dataset of 3d indoor scenes. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 12–22 (2023) 7

2023
[45]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L.: Sigmoid loss for language image pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 11975–11986 (October 2023) 1, 3

2023
[46]

In: European conference on computer vision

Zhang, B., Zhang, P., Dong, X., Zang, Y ., Wang, J.: Long-clip: Unlocking the long-text capability of clip. In: European conference on computer vision. pp. 310–325. Springer (2024) 1 18 T. Berriel Martins et al

2024
[47]

arXiv preprint arXiv:2411.10086 (2024) 3, 5

Zhang, D., Liu, F., Tang, Q.: Corrclip: Reconstructing correlations in clip with off- the-shelf foundation models for open-vocabulary semantic segmentation. arXiv preprint arXiv:2411.10086 (2024) 3, 5

work page arXiv 2024
[48]

In: European conference on computer vision

Zhou, C., Loy, C.C., Dai, B.: Extract free dense labels from clip. In: European conference on computer vision. pp. 696–712. Springer (2022) 1

2022
[49]

IEEE Transactions on Pattern Analysis and Machine Intelligence46(12), 8954–8975 (2024) 1

Zhu, C., Chen, L.: A survey on open-vocabulary detection and segmentation: Past, present, and future. IEEE Transactions on Pattern Analysis and Machine Intelligence46(12), 8954–8975 (2024) 1

2024
[50]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Zhu, Z., Peng, S., Larsson, V ., Xu, W., Bao, H., Cui, Z., Oswald, M.R., Pollefeys, M.: Nice- slam: Neural implicit scalable encoding for slam. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 12786–12796 (2022) 9 CAMFusion 19 A Additional Ablations A.1 Single-View Descriptor Analysis To isolate the impact of the ...

2022