SENSE: Stereo OpEN Vocabulary SEmantic Segmentation

Ezio Malis (ACENTAURI); Ga\'etan Bahl; Philippe Martinet (ACENTAURI); Thomas Campagnolo (ACENTAURI)

arxiv: 2604.15946 · v1 · submitted 2026-04-17 · 💻 cs.CV · cs.RO

SENSE: Stereo OpEN Vocabulary SEmantic Segmentation

Thomas Campagnolo (ACENTAURI) , Ezio Malis (ACENTAURI) , Philippe Martinet (ACENTAURI) , Ga\'etan Bahl This is my paper

Pith reviewed 2026-05-10 08:13 UTC · model grok-4.3

classification 💻 cs.CV cs.RO

keywords stereo visionopen-vocabulary semantic segmentationvision-language modelsgeometric cuesPhraseStereo datasetspatial reasoningzero-shot segmentation

0 comments

The pith

Stereo image pairs supply geometric cues that improve accuracy in open-vocabulary semantic segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that stereo vision can be combined with vision-language models to add geometric cues to open-vocabulary semantic segmentation. This matters for a sympathetic reader because single-view approaches often lose spatial precision near boundaries and under occlusions in dynamic settings. SENSE is trained on the PhraseStereo dataset and reports concrete gains in phrase-grounded tasks plus zero-shot transfer. It shows a 2.9 percent Average Precision lift over baseline on PhraseStereo and relative mIoU gains of 3.5 percent on Cityscapes and 18 percent on KITTI. Joint semantic and geometric reasoning is presented as the route to more reliable language-driven scene understanding for robots.

Core claim

SENSE is the first stereo open-vocabulary semantic segmentation method. It uses stereo image pairs to introduce geometric cues that improve spatial reasoning inside vision-language models. Trained on PhraseStereo, the approach raises Average Precision by 2.9 percent over the baseline and 0.76 percent over the strongest competing method while delivering relative mIoU gains of 3.5 percent on Cityscapes and 18 percent on KITTI.

What carries the argument

SENSE, the method that fuses stereo-derived geometric cues with vision-language model features to produce open-vocabulary segmentations.

If this is right

More precise handling of object boundaries and occlusions in open-vocabulary tasks.
Stronger zero-shot generalization across datasets when geometry augments semantics.
Improved support for phrase-grounded segmentation in stereo-equipped robotic systems.
Joint semantic-geometry reasoning enables more accurate natural-language scene understanding for autonomous vehicles.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The stereo cue mechanism could transfer to other multi-view camera rigs common in vehicles.
Robustness testing under varying stereo quality would clarify when the geometric benefit holds.

Load-bearing premise

Stereo image pairs reliably supply geometric cues that improve segmentation accuracy without errors from stereo matching, calibration, or occlusions outweighing the gains.

What would settle it

A controlled test on scenes with known stereo matching failures, such as low-texture regions or deliberately miscalibrated pairs, where SENSE shows no improvement or worse performance than the single-image baseline.

Figures

Figures reproduced from arXiv: 2604.15946 by Ezio Malis (ACENTAURI), Ga\'etan Bahl, Philippe Martinet (ACENTAURI), Thomas Campagnolo (ACENTAURI).

**Figure 1.** Figure 1: SENSE-512. Zero-shot open-vocabulary semantic segmentation from stereo image pairs (Cityscapes) using natural language prompts. By combining semantics, geometry, and language, SENSE improves scene understanding in Intelligent Transportation Systems (ITS). align textual and visual features to enable matching between language prompts and image content. However, these models are primarily designed for ima… view at source ↗

**Figure 2.** Figure 2: SENSE architecture. A dual-branch, weight-shared CLIP encoder [27] processes stereo images and outputs three intermediate features per branch. These are fused via the SIEF module (Sec. 3.2) after projection to P = 64. The decoder comprises three transformer blocks conditioned on the textual prompt through FiLM [12], and incorporates the SDAF module (Sec. 3.4) for disparity-aware refinement. Disparity maps … view at source ↗

**Figure 3.** Figure 3: Stereo Intermediate-level Embedding Fusion (SIEF) Module. SIEF learns adaptive weights to combine left and right intermediate stereo activation features from CLIP vision transformer and projects the fused embedding to a 64-dimensional representation. For the legend symbols, refer to [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Architecture of the Semantic Disparity Attention Fusion (SDAF) Module. SDAF combines disparity-normalized geometric cues with semantic decoder features to refine the final segmentation output. For the legend symbols, refer to [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 5.** Figure 5: Qualitative comparison of referring expression semantic segmentation. A comparison of SENSE-352 in PhraseStereo dataset with CLIPSeg (PC+) [23] method. Predictions from SENSE-352 and CLIPSeg are visualized as sigmoid probability maps, highlighting confidence for the queried text prompt (blue box). OpenSeg [13] OpenWorldSAM [38] CLIPSeg (PC+) [23] MDETR [16] SENSE-352 (Ours) SENSE-512 (Ours) Time (ms) 123.3… view at source ↗

**Figure 6.** Figure 6: Qualitative comparison of zero-shot semantic segmentation. Comparison of SENSE-512 in Cityscapes and SENSE-352 in KITTI dataset, with the second best performing method, OpenSeg [13]. In GT and SENSE-512, pixels shown in black are pixels that are unlabeled. In OpenSeg, black pixels correspond to unknown label. The ablations show that both stereo fusion modules are essential. Removing SDAF lowers performance… view at source ↗

**Figure 7.** Figure 7: Overview of the inference pipeline for large image resolutions. Left: overlapping patch extraction from stereo input images (IL, IR) using a sliding-window strategy. Middle: patch-wise predictions for user-defined text queries (NCLS) computed by SENSE after normalization. Right: post-processing with cosine blending mask, reconstruction of full-resolution probability maps, softmax normalization, and CRF-ba… view at source ↗

**Figure 8.** Figure 8: Additional qualitative comparison of referring expression semantic segmentation of SENSE-352 in PhraseStereo dataset with CLIPSeg (PC+) [24] method. We show results for diverse expressions, including object-specific (”puddle on platform”), relational (”trees behind van”), and attribute-based (”red and white advertisement”) prompts. Predictions are visualized as sigmoid probability maps, highlighting confid… view at source ↗

**Figure 9.** Figure 9: Qualitative comparison of zero-shot semantic segmentation on Cityscapes. We show predictions from our models (SENSE512 and SENSE-352) with baselines OpenSeg [13] and CLIPSeg (PC+) [23]. quantitative metrics reported in the main paper and further emphasize the improved results achieved by our approach. 10.4. Descriptions as text queries As shown in Tab. 5, the Cityscapes [9] dataset provides detailed desc… view at source ↗

**Figure 10.** Figure 10: Qualitative comparison of zero-shot semantic segmentation on KITTI 2015. We show predictions from our method (SENSE-352) and baselines OpenSeg [13] and CLIPSeg (PC+) [23]. Method Disparity Estimation Model Time (ms) mIoU IoUFG AP SENSE-352 (Ours) Selective-IGEV† [34] 194.74† 47.2† 57.0† 78.8† HITNet [31] 104.09 47.1 56.8 78.3 MobileStereoNet [28] 76.91 47.1 57.1 78.9 SENSE-512 (Ours) Selective-IGEV† [34] … view at source ↗

**Figure 11.** Figure 11: From class labels to class descriptions on Cityscapes. We compare SENSE-512 predictions when prompted with standard class names and the corresponding natural language descriptions shown in Tab. 5. Despite using descriptive text instead of fixed labels, SENSE-512 achieves segmentation quality comparable to ground truth, demonstrating its ability to interpret scenes through concise language prompts while m… view at source ↗

read the original abstract

Open-vocabulary semantic segmentation enables models to segment objects or image regions beyond fixed class sets, offering flexibility in dynamic environments. However, existing methods often rely on single-view images and struggle with spatial precision, especially under occlusions and near object boundaries. We propose SENSE, the first work on Stereo OpEN Vocabulary SEmantic Segmentation, which leverages stereo vision and vision-language models to enhance open-vocabulary semantic segmentation. By incorporating stereo image pairs, we introduce geometric cues that improve spatial reasoning and segmentation accuracy. Trained on the PhraseStereo dataset, our approach achieves strong performance in phrase-grounded tasks and demonstrates generalization in zero-shot settings. On PhraseStereo, we show a +2.9% improvement in Average Precision over the baseline method and +0.76% over the best competing method. SENSE also provides a relative improvement of +3.5% mIoU on Cityscapes and +18% on KITTI compared to the baseline work. By jointly reasoning over semantics and geometry, SENSE supports accurate scene understanding from natural language, essential for autonomous robots and Intelligent Transportation Systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces SENSE, the first stereo-based approach to open-vocabulary semantic segmentation. It leverages stereo image pairs to supply geometric cues for improved spatial reasoning and boundary precision, trains on a newly introduced PhraseStereo dataset, and reports gains of +2.9% AP on PhraseStereo over baseline (+0.76% over best competitor), +3.5% mIoU on Cityscapes, and +18% mIoU on KITTI relative to prior single-view baselines. The work also claims zero-shot generalization.

Significance. If the performance gains can be causally attributed to stereo geometric cues rather than dataset or architecture changes, the work would be a meaningful first step in combining stereo vision with vision-language models for open-vocabulary tasks. This is relevant for robotics and ITS applications where stereo cameras are common. The new PhraseStereo dataset is a concrete positive contribution that could support future research.

major comments (2)

[§4 (Experiments)] §4 (Experiments) and associated tables: the central claim that stereo pairs supply geometric cues improving open-vocabulary segmentation is not supported by controlled ablations. No experiments fix the VLM backbone, training data, hyperparameters, and loss while toggling only between stereo pairs and monocular input (or between full stereo and stereo with depth/disparity disabled). The reported deltas (+2.9% AP, +3.5% mIoU, +18% mIoU) therefore cannot be attributed specifically to geometric reasoning; they could arise from the new PhraseStereo training set or the custom stereo+VLM architecture.
[§3 (Method)] §3 (Method): the description of how stereo features are fused with the vision-language model (e.g., disparity estimation, feature concatenation, or attention over depth) is insufficient to assess whether geometric cues are actually used or whether stereo matching errors in occluded/textureless regions are mitigated. No equations or diagrams detail this fusion step, which is load-bearing for the geometric-cue hypothesis.

minor comments (3)

[Abstract] Abstract and §1: the phrase 'strong performance in phrase-grounded tasks' is vague; quantitative metrics and comparison to the exact baseline implementation should be stated explicitly.
[§4 (Experiments)] §4: no error bars, standard deviations, or statistical significance tests are mentioned for the reported percentage improvements, making it hard to judge whether the gains are reliable.
[§2 (Related Work)] Missing reference to prior stereo semantic segmentation works (even if closed-vocabulary) and to recent open-vocabulary methods that already incorporate depth or 3D cues.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments highlight important areas for strengthening the attribution of gains to stereo cues and improving methodological clarity. We address each major comment below and will incorporate revisions to enhance the manuscript.

read point-by-point responses

Referee: [§4 (Experiments)] §4 (Experiments) and associated tables: the central claim that stereo pairs supply geometric cues improving open-vocabulary segmentation is not supported by controlled ablations. No experiments fix the VLM backbone, training data, hyperparameters, and loss while toggling only between stereo pairs and monocular input (or between full stereo and stereo with depth/disparity disabled). The reported deltas (+2.9% AP, +3.5% mIoU, +18% mIoU) therefore cannot be attributed specifically to geometric reasoning; they could arise from the new PhraseStereo training set or the custom stereo+VLM architecture.

Authors: We agree that the current set of experiments does not include a fully isolated ablation that holds the VLM backbone, training data, hyperparameters, and loss fixed while toggling only the stereo input. Our reported comparisons are against single-view baselines that use comparable VLMs, and the larger gains on stereo-specific datasets (PhraseStereo, KITTI) are consistent with the value of geometric cues. To directly address the concern, we will add a controlled ablation in the revised §4 that disables the stereo fusion module (or replaces stereo pairs with monocular input) while keeping all other factors identical. This will allow clearer attribution of performance deltas to the geometric reasoning component. revision: yes
Referee: [§3 (Method)] §3 (Method): the description of how stereo features are fused with the vision-language model (e.g., disparity estimation, feature concatenation, or attention over depth) is insufficient to assess whether geometric cues are actually used or whether stereo matching errors in occluded/textureless regions are mitigated. No equations or diagrams detail this fusion step, which is load-bearing for the geometric-cue hypothesis.

Authors: We acknowledge that the fusion mechanism in §3 requires additional detail to allow readers to evaluate how geometric cues are integrated and how stereo errors are handled. In the revised manuscript we will expand the method section with explicit equations for disparity estimation and the stereo-VLM feature fusion (including the specific attention or concatenation operations), add a dedicated diagram of the fusion pipeline, and include a short discussion of error mitigation strategies such as confidence-weighted fusion and multi-scale processing for occluded or textureless regions. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method validated on external datasets without self-referential derivations

full rationale

The paper introduces SENSE as an empirical architecture combining stereo pairs with vision-language models for open-vocabulary segmentation. It reports performance gains on the newly introduced PhraseStereo dataset and on standard external benchmarks (Cityscapes, KITTI) without any equations, derivations, fitted-parameter predictions, or load-bearing self-citations. All claims rest on measured improvements against baselines and competitors rather than any quantity that reduces to its own inputs by construction. The evaluation is therefore self-contained against external data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that stereo pairs add useful geometric cues for semantic segmentation; no free parameters or invented entities are explicitly introduced in the abstract, though the underlying vision-language model and stereo matching implicitly carry many hyperparameters from prior work.

axioms (1)

domain assumption Stereo image pairs provide geometric cues that improve spatial reasoning and segmentation accuracy in open-vocabulary tasks
Directly invoked to justify the method's advantage over single-view baselines.

pith-pipeline@v0.9.0 · 5509 in / 1271 out tokens · 33967 ms · 2026-05-10T08:13:38.509588+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages

[1]

Im- age segmentation with large language models: A survey with perspectives for intelligent transportation systems

Sanjeda Akter, Ibne Farabi Shihab, and Anuj Sharma. Im- age segmentation with large language models: A survey with perspectives for intelligent transportation systems. arXiv preprint arXiv:2506.14096, 2025. 1, 2, 5, 6

work page arXiv 2025
[2]

Augmented re- ality meets computer vision: Efficient data generation for urban driving scenes

Mescheder Lars Geiger Andreas Alhaija Hassan, Mustikovela Siva and Rother Carsten. Augmented re- ality meets computer vision: Efficient data generation for urban driving scenes. IJCV, 2018. 3, 5, 6, 7, 9, 10, 11

work page 2018
[3]

Zero-shot semantic segmentation

Maxime Bucher, Tuan-Hung Vu, Matthieu Cord, and Patrick P´erez. Zero-shot semantic segmentation. NeurIPS, 32, 2019. 5

work page 2019
[4]

Phrasestereo: The first open-vocabulary stereo image segmentation dataset

Thomas Campagnolo, Ezio Malis, Philippe Martinet, and Gaetan Bahl. Phrasestereo: The first open-vocabulary stereo image segmentation dataset. arXiv preprint arXiv:2510.00818, 2025. 2, 3, 5, 10

work page arXiv 2025
[5]

Emerg- ing properties in self-supervised vision transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Herv ´e J´egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerg- ing properties in self-supervised vision transformers. In ICCV, pages 9650–9660, 2021. 2

work page 2021
[6]

Encoder-decoder with atrous separable convolution for semantic image segmentation

Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, pages 801–818, 2018. 1

work page 2018
[7]

Mocha-stereo: Motif chan- nel attention network for stereo matching

Ziyang Chen, Wei Long, He Yao, Yongjun Zhang, Bingshu Wang, Yongbin Qin, and Jia Wu. Mocha-stereo: Motif chan- nel attention network for stereo matching. In CVPR, pages 27768–27777, 2024. 4

work page 2024
[8]

Masked-attention mask transformer for universal image segmentation

Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexan- der Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. In CVPR, pages 1290–1299, 2022. 1

work page 2022
[9]

The cityscapes dataset for semantic urban scene understanding

Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR,

work page
[10]

2, 3, 5, 6, 7, 9, 10, 11, 12

work page
[11]

Vision transformers need registers

Timoth ´ee Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. In ICLR,

work page
[12]

Bert: Pre-training of deep bidirectional trans- formers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional trans- formers for language understanding. InNAACL, pages 4171– 4186, 2019. 2

work page 2019
[13]

Feature-wise transformations

Vincent Dumoulin, Ethan Perez, Nathan Schucher, Florian Strub, Harm de Vries, Aaron Courville, and Yoshua Bengio. Feature-wise transformations. Distill, 3(7):e11, 2018. 3, 4

work page 2018
[14]

Scal- ing open-vocabulary image segmentation with image-level labels

Golnaz Ghiasi, Xiuye Gu, Yin Cui, and Tsung-Yi Lin. Scal- ing open-vocabulary image segmentation with image-level labels. In ECCV, pages 540–557. Springer, 2022. 2, 5, 6, 7, 8, 11, 12, 13, 14

work page 2022
[15]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016. 2

work page 2016
[16]

Defom-stereo: Depth foundation model based stereo matching

Hualie Jiang, Zhiqiang Lou, Laiyan Ding, Rui Xu, Minglang Tan, Wenjie Jiang, and Rui Huang. Defom-stereo: Depth foundation model based stereo matching. In CVPR, pages 21857–21867, 2025. 4

work page 2025
[17]

Mdetr- modulated detection for end-to-end multi-modal understand- ing

Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra, and Nicolas Carion. Mdetr- modulated detection for end-to-end multi-modal understand- ing. In ICCV, pages 1780–1790, 2021. 5, 6, 7

work page 2021
[18]

Segment any- thing

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C Berg, Wan-Yen Lo, et al. Segment any- thing. In ICCV, pages 4015–4026, 2023. 2

work page 2023
[19]

Lafferty, Andrew McCallum, and Fernando C

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learn- ing, page 282–289, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc. 2

work page 2001
[20]

Proxyclip: Proxy at- tention improves clip for open-vocabulary segmentation

Mengcheng Lan, Chaofeng Chen, Yiping Ke, Xinjiang Wang, Litong Feng, and Wayne Zhang. Proxyclip: Proxy at- tention improves clip for open-vocabulary segmentation. In ECCV, pages 70–88. Springer, 2024. 2

work page 2024
[21]

Recurrent multimodal interaction for referring image segmentation

Chenxi Liu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, and Alan Yuille. Recurrent multimodal interaction for referring image segmentation. In ICCV, pages 1271–1280, 2017. 2

work page 2017
[22]

Grounding dino: Marrying dino with grounded pre-training for open-set object detection

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In ECCV, pages 38–55. Springer, 2024. 2

work page 2024
[23]

One-stage deep stereo network

Ziming Liu, Ezio Malis, and Philippe Martinet. One-stage deep stereo network. In ICASSP, pages 3050–3054. IEEE,

work page
[24]

Image segmentation using text and image prompts

Timo L ¨uddecke and Alexander Ecker. Image segmentation using text and image prompts. In CVPR, pages 7086–7096,

work page
[25]

1, 2, 3, 5, 6, 7, 9, 10, 11, 12, 13

work page
[26]

Maxime Oquab, Timoth ´ee Darcet, Th´eo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel HAZIZA, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Je- gou, Julien Mairal, Patr...

work page 2024
[27]

Know ”no” better: A data- driven approach for enhancing negation awareness in clip,

Junsung Park, Jungbeom Lee, Jongyoon Song, Sangwon Yu, Dahuin Jung, and Sungroh Yoon. Know ”no” better: A data- driven approach for enhancing negation awareness in clip,

work page
[28]

Automatic dif- ferentiation in pytorch

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic dif- ferentiation in pytorch. In NIPS 2017 Workshop on Autodiff,

work page 2017
[29]

Learn- ing transferable visual models from natural language super- vision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language super- vision. In ICML, pages 8748–8763. PmLR, 2021. 1, 2, 3, 5 16

work page 2021
[30]

Mobilestereonet: Towards lightweight deep net- works for stereo matching

Faranak Shamsafar, Samuel Woerz, Rafia Rahim, and An- dreas Zell. Mobilestereonet: Towards lightweight deep net- works for stereo matching. In WACV, pages 2417–2426,

work page
[31]

Lposs: Label propagation over patches and pixels for open-vocabulary semantic segmentation

Vladan Stojni ´c, Yannis Kalantidis, Ji ˇr´ı Matas, and Giorgos Tolias. Lposs: Label propagation over patches and pixels for open-vocabulary semantic segmentation. InCVPR, pages 9794–9803, 2025. 1, 6

work page 2025
[32]

Segmenter: Transformer for semantic segmenta- tion

Robin Strudel, Ricardo Garcia, Ivan Laptev, and Cordelia Schmid. Segmenter: Transformer for semantic segmenta- tion. In ICCV, pages 7262–7272, 2021. 1

work page 2021
[33]

Hitnet: Hierar- chical iterative tile refinement network for real-time stereo matching

Vladimir Tankovich, Christian Hane, Yinda Zhang, Adarsh Kowdle, Sean Fanello, and Sofien Bouaziz. Hitnet: Hierar- chical iterative tile refinement network for real-time stereo matching. In CVPR, pages 14362–14372, 2021. 13

work page 2021
[34]

Sclip: Rethinking self-attention for dense vision-language inference

Feng Wang, Jieru Mei, and Alan Yuille. Sclip: Rethinking self-attention for dense vision-language inference. In ECCV, pages 315–332. Springer, 2024. 2, 5, 6

work page 2024
[35]

Declip: Decoupled learning for open- vocabulary dense perception

Junjie Wang, Bin Chen, Yulin Li, Bin Kang, Yichi Chen, and Zhuotao Tian. Declip: Decoupled learning for open- vocabulary dense perception. In CVPR, pages 14824–14834,

work page
[36]

Selective-stereo: Adaptive frequency information selection for stereo matching

Xianqi Wang, Gangwei Xu, Hao Jia, and Xin Yang. Selective-stereo: Adaptive frequency information selection for stereo matching. In CVPR, pages 19701–19710, 2024. 3, 4, 5, 8, 13

work page 2024
[37]

Phrasecut: Language-based image segmen- tation in the wild

Chenyun Wu, Zhe Lin, Scott Cohen, Trung Bui, and Subhransu Maji. Phrasecut: Language-based image segmen- tation in the wild. In CVPR, pages 10216–10225, 2020. 2, 5, 6

work page 2020
[38]

Clip-dinoiser: Teaching clip a few dino tricks for open- vocabulary semantic segmentation

Monika Wysocza ´nska, Oriane Sim´eoni, Micha¨el Ramamon- jisoa, Andrei Bursuc, Tomasz Trzci ´nski, and Patrick P ´erez. Clip-dinoiser: Teaching clip a few dino tricks for open- vocabulary semantic segmentation. In ECCV, pages 320–

work page
[39]

Semantic projection network for zero-and few-label semantic segmentation

Yongqin Xian, Subhabrata Choudhury, Yang He, Bernt Schiele, and Zeynep Akata. Semantic projection network for zero-and few-label semantic segmentation. In CVPR, pages 8256–8265, 2019. 2

work page 2019
[40]

Openworldsam: Ex- tending sam2 for universal image segmentation with lan- guage prompts

Shiting Xiao, Rishabh Kabra, Yuhang Li, Donghyun Lee, Joao Carreira, and Priyadarshini Panda. Openworldsam: Ex- tending sam2 for universal image segmentation with lan- guage prompts. In NIPS, 2025. 5, 6, 7

work page 2025
[41]

Cross-modal self-attention network for referring image seg- mentation

Linwei Ye, Mrigank Rochan, Zhi Liu, and Yang Wang. Cross-modal self-attention network for referring image seg- mentation. In CVPR, pages 10502–10511, 2019. 2

work page 2019
[42]

Prototypical matching and open set rejection for zero-shot semantic segmentation

Hui Zhang and Henghui Ding. Prototypical matching and open set rejection for zero-shot semantic segmentation. In ICCV, pages 6974–6983, 2021. 2

work page 2021
[43]

Vision-language models for vision tasks: A survey

Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. Vision-language models for vision tasks: A survey. TPAMI, 46(8):5625–5644, 2024. 2

work page 2024
[44]

Extract free dense labels from clip

Chong Zhou, Chen Change Loy, and Bo Dai. Extract free dense labels from clip. In ECCV, pages 696–712. Springer,

work page
[45]

Zegclip: Towards adapting clip for zero-shot se- mantic segmentation

Ziqin Zhou, Yinjie Lei, Bowen Zhang, Lingqiao Liu, and Yifan Liu. Zegclip: Towards adapting clip for zero-shot se- mantic segmentation. In CVPR, pages 11175–11185, 2023. 1, 2

work page 2023
[46]

Segment everything everywhere all at once

Xueyan Zou, Jianwei Yang, Hao Zhang, Feng Li, Linjie Li, Jianfeng Wang, Lijuan Wang, Jianfeng Gao, and Yong Jae Lee. Segment everything everywhere all at once. NeurIPS, 36:19769–19782, 2023. 2 17

work page 2023

[1] [1]

Im- age segmentation with large language models: A survey with perspectives for intelligent transportation systems

Sanjeda Akter, Ibne Farabi Shihab, and Anuj Sharma. Im- age segmentation with large language models: A survey with perspectives for intelligent transportation systems. arXiv preprint arXiv:2506.14096, 2025. 1, 2, 5, 6

work page arXiv 2025

[2] [2]

Augmented re- ality meets computer vision: Efficient data generation for urban driving scenes

Mescheder Lars Geiger Andreas Alhaija Hassan, Mustikovela Siva and Rother Carsten. Augmented re- ality meets computer vision: Efficient data generation for urban driving scenes. IJCV, 2018. 3, 5, 6, 7, 9, 10, 11

work page 2018

[3] [3]

Zero-shot semantic segmentation

Maxime Bucher, Tuan-Hung Vu, Matthieu Cord, and Patrick P´erez. Zero-shot semantic segmentation. NeurIPS, 32, 2019. 5

work page 2019

[4] [4]

Phrasestereo: The first open-vocabulary stereo image segmentation dataset

Thomas Campagnolo, Ezio Malis, Philippe Martinet, and Gaetan Bahl. Phrasestereo: The first open-vocabulary stereo image segmentation dataset. arXiv preprint arXiv:2510.00818, 2025. 2, 3, 5, 10

work page arXiv 2025

[5] [5]

Emerg- ing properties in self-supervised vision transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Herv ´e J´egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerg- ing properties in self-supervised vision transformers. In ICCV, pages 9650–9660, 2021. 2

work page 2021

[6] [6]

Encoder-decoder with atrous separable convolution for semantic image segmentation

Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, pages 801–818, 2018. 1

work page 2018

[7] [7]

Mocha-stereo: Motif chan- nel attention network for stereo matching

Ziyang Chen, Wei Long, He Yao, Yongjun Zhang, Bingshu Wang, Yongbin Qin, and Jia Wu. Mocha-stereo: Motif chan- nel attention network for stereo matching. In CVPR, pages 27768–27777, 2024. 4

work page 2024

[8] [8]

Masked-attention mask transformer for universal image segmentation

Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexan- der Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. In CVPR, pages 1290–1299, 2022. 1

work page 2022

[9] [9]

The cityscapes dataset for semantic urban scene understanding

Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR,

work page

[10] [10]

2, 3, 5, 6, 7, 9, 10, 11, 12

work page

[11] [11]

Vision transformers need registers

Timoth ´ee Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. In ICLR,

work page

[12] [12]

Bert: Pre-training of deep bidirectional trans- formers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional trans- formers for language understanding. InNAACL, pages 4171– 4186, 2019. 2

work page 2019

[13] [13]

Feature-wise transformations

Vincent Dumoulin, Ethan Perez, Nathan Schucher, Florian Strub, Harm de Vries, Aaron Courville, and Yoshua Bengio. Feature-wise transformations. Distill, 3(7):e11, 2018. 3, 4

work page 2018

[14] [14]

Scal- ing open-vocabulary image segmentation with image-level labels

Golnaz Ghiasi, Xiuye Gu, Yin Cui, and Tsung-Yi Lin. Scal- ing open-vocabulary image segmentation with image-level labels. In ECCV, pages 540–557. Springer, 2022. 2, 5, 6, 7, 8, 11, 12, 13, 14

work page 2022

[15] [15]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016. 2

work page 2016

[16] [16]

Defom-stereo: Depth foundation model based stereo matching

Hualie Jiang, Zhiqiang Lou, Laiyan Ding, Rui Xu, Minglang Tan, Wenjie Jiang, and Rui Huang. Defom-stereo: Depth foundation model based stereo matching. In CVPR, pages 21857–21867, 2025. 4

work page 2025

[17] [17]

Mdetr- modulated detection for end-to-end multi-modal understand- ing

Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra, and Nicolas Carion. Mdetr- modulated detection for end-to-end multi-modal understand- ing. In ICCV, pages 1780–1790, 2021. 5, 6, 7

work page 2021

[18] [18]

Segment any- thing

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C Berg, Wan-Yen Lo, et al. Segment any- thing. In ICCV, pages 4015–4026, 2023. 2

work page 2023

[19] [19]

Lafferty, Andrew McCallum, and Fernando C

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learn- ing, page 282–289, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc. 2

work page 2001

[20] [20]

Proxyclip: Proxy at- tention improves clip for open-vocabulary segmentation

Mengcheng Lan, Chaofeng Chen, Yiping Ke, Xinjiang Wang, Litong Feng, and Wayne Zhang. Proxyclip: Proxy at- tention improves clip for open-vocabulary segmentation. In ECCV, pages 70–88. Springer, 2024. 2

work page 2024

[21] [21]

Recurrent multimodal interaction for referring image segmentation

Chenxi Liu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, and Alan Yuille. Recurrent multimodal interaction for referring image segmentation. In ICCV, pages 1271–1280, 2017. 2

work page 2017

[22] [22]

Grounding dino: Marrying dino with grounded pre-training for open-set object detection

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In ECCV, pages 38–55. Springer, 2024. 2

work page 2024

[23] [23]

One-stage deep stereo network

Ziming Liu, Ezio Malis, and Philippe Martinet. One-stage deep stereo network. In ICASSP, pages 3050–3054. IEEE,

work page

[24] [24]

Image segmentation using text and image prompts

Timo L ¨uddecke and Alexander Ecker. Image segmentation using text and image prompts. In CVPR, pages 7086–7096,

work page

[25] [25]

1, 2, 3, 5, 6, 7, 9, 10, 11, 12, 13

work page

[26] [26]

Maxime Oquab, Timoth ´ee Darcet, Th´eo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel HAZIZA, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Je- gou, Julien Mairal, Patr...

work page 2024

[27] [27]

Know ”no” better: A data- driven approach for enhancing negation awareness in clip,

Junsung Park, Jungbeom Lee, Jongyoon Song, Sangwon Yu, Dahuin Jung, and Sungroh Yoon. Know ”no” better: A data- driven approach for enhancing negation awareness in clip,

work page

[28] [28]

Automatic dif- ferentiation in pytorch

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic dif- ferentiation in pytorch. In NIPS 2017 Workshop on Autodiff,

work page 2017

[29] [29]

Learn- ing transferable visual models from natural language super- vision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language super- vision. In ICML, pages 8748–8763. PmLR, 2021. 1, 2, 3, 5 16

work page 2021

[30] [30]

Mobilestereonet: Towards lightweight deep net- works for stereo matching

Faranak Shamsafar, Samuel Woerz, Rafia Rahim, and An- dreas Zell. Mobilestereonet: Towards lightweight deep net- works for stereo matching. In WACV, pages 2417–2426,

work page

[31] [31]

Lposs: Label propagation over patches and pixels for open-vocabulary semantic segmentation

Vladan Stojni ´c, Yannis Kalantidis, Ji ˇr´ı Matas, and Giorgos Tolias. Lposs: Label propagation over patches and pixels for open-vocabulary semantic segmentation. InCVPR, pages 9794–9803, 2025. 1, 6

work page 2025

[32] [32]

Segmenter: Transformer for semantic segmenta- tion

Robin Strudel, Ricardo Garcia, Ivan Laptev, and Cordelia Schmid. Segmenter: Transformer for semantic segmenta- tion. In ICCV, pages 7262–7272, 2021. 1

work page 2021

[33] [33]

Hitnet: Hierar- chical iterative tile refinement network for real-time stereo matching

Vladimir Tankovich, Christian Hane, Yinda Zhang, Adarsh Kowdle, Sean Fanello, and Sofien Bouaziz. Hitnet: Hierar- chical iterative tile refinement network for real-time stereo matching. In CVPR, pages 14362–14372, 2021. 13

work page 2021

[34] [34]

Sclip: Rethinking self-attention for dense vision-language inference

Feng Wang, Jieru Mei, and Alan Yuille. Sclip: Rethinking self-attention for dense vision-language inference. In ECCV, pages 315–332. Springer, 2024. 2, 5, 6

work page 2024

[35] [35]

Declip: Decoupled learning for open- vocabulary dense perception

Junjie Wang, Bin Chen, Yulin Li, Bin Kang, Yichi Chen, and Zhuotao Tian. Declip: Decoupled learning for open- vocabulary dense perception. In CVPR, pages 14824–14834,

work page

[36] [36]

Selective-stereo: Adaptive frequency information selection for stereo matching

Xianqi Wang, Gangwei Xu, Hao Jia, and Xin Yang. Selective-stereo: Adaptive frequency information selection for stereo matching. In CVPR, pages 19701–19710, 2024. 3, 4, 5, 8, 13

work page 2024

[37] [37]

Phrasecut: Language-based image segmen- tation in the wild

Chenyun Wu, Zhe Lin, Scott Cohen, Trung Bui, and Subhransu Maji. Phrasecut: Language-based image segmen- tation in the wild. In CVPR, pages 10216–10225, 2020. 2, 5, 6

work page 2020

[38] [38]

Clip-dinoiser: Teaching clip a few dino tricks for open- vocabulary semantic segmentation

Monika Wysocza ´nska, Oriane Sim´eoni, Micha¨el Ramamon- jisoa, Andrei Bursuc, Tomasz Trzci ´nski, and Patrick P ´erez. Clip-dinoiser: Teaching clip a few dino tricks for open- vocabulary semantic segmentation. In ECCV, pages 320–

work page

[39] [39]

Semantic projection network for zero-and few-label semantic segmentation

Yongqin Xian, Subhabrata Choudhury, Yang He, Bernt Schiele, and Zeynep Akata. Semantic projection network for zero-and few-label semantic segmentation. In CVPR, pages 8256–8265, 2019. 2

work page 2019

[40] [40]

Openworldsam: Ex- tending sam2 for universal image segmentation with lan- guage prompts

Shiting Xiao, Rishabh Kabra, Yuhang Li, Donghyun Lee, Joao Carreira, and Priyadarshini Panda. Openworldsam: Ex- tending sam2 for universal image segmentation with lan- guage prompts. In NIPS, 2025. 5, 6, 7

work page 2025

[41] [41]

Cross-modal self-attention network for referring image seg- mentation

Linwei Ye, Mrigank Rochan, Zhi Liu, and Yang Wang. Cross-modal self-attention network for referring image seg- mentation. In CVPR, pages 10502–10511, 2019. 2

work page 2019

[42] [42]

Prototypical matching and open set rejection for zero-shot semantic segmentation

Hui Zhang and Henghui Ding. Prototypical matching and open set rejection for zero-shot semantic segmentation. In ICCV, pages 6974–6983, 2021. 2

work page 2021

[43] [43]

Vision-language models for vision tasks: A survey

Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. Vision-language models for vision tasks: A survey. TPAMI, 46(8):5625–5644, 2024. 2

work page 2024

[44] [44]

Extract free dense labels from clip

Chong Zhou, Chen Change Loy, and Bo Dai. Extract free dense labels from clip. In ECCV, pages 696–712. Springer,

work page

[45] [45]

Zegclip: Towards adapting clip for zero-shot se- mantic segmentation

Ziqin Zhou, Yinjie Lei, Bowen Zhang, Lingqiao Liu, and Yifan Liu. Zegclip: Towards adapting clip for zero-shot se- mantic segmentation. In CVPR, pages 11175–11185, 2023. 1, 2

work page 2023

[46] [46]

Segment everything everywhere all at once

Xueyan Zou, Jianwei Yang, Hao Zhang, Feng Li, Linjie Li, Jianfeng Wang, Lijuan Wang, Jianfeng Gao, and Yong Jae Lee. Segment everything everywhere all at once. NeurIPS, 36:19769–19782, 2023. 2 17

work page 2023