Decomposed Vision-Language Alignment for Fine-Grained Open-Vocabulary Segmentation

Chenhao Wang; Yao Zhu; Yingrui Ji; Yu Meng

arxiv: 2605.15942 · v1 · pith:DO3JTEXFnew · submitted 2026-05-15 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-20 18:40 UTC · model grok-4.3

keywords: open-vocabulary segmentation · vision-language alignment · decomposed prompts · compositional semantics · fine-grained segmentation · cross-attention · generalization to unseen compositions

The pith

Factorizing text prompts into separate concept and attribute tokens lets segmentation models generalize to unseen category-attribute pairs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard open-vocabulary segmentation models fail on fine-grained descriptions because they encode whole sentences as single units, mixing object categories with their attributes. By explicitly splitting each prompt into one concept token plus multiple attribute tokens, the model can align vision features with each semantic piece on its own. A Feature-Gated Cross-Attention module then creates attribute-specific gating maps that combine information multiplicatively, while similarities are aggregated in log space for stable scoring. This decomposition is claimed to produce better results on benchmarks that test novel combinations never seen together in training data. The method plugs directly into existing transformer segmentation backbones without changing their core architecture.

Core claim

We propose a Decomposed Vision-Language Alignment framework that explicitly factorizes textual prompts into a concept token and multiple attribute tokens, enabling separate cross-modal interactions for each semantic unit. At the feature level, we introduce a Feature-Gated Cross-Attention module that generates attribute-specific gating maps to fuse information in a multiplicative manner, effectively enforcing compositional semantics. At the scoring level, per-token similarities are aggregated in log-space, producing a stable and interpretable compositional matching.

What carries the argument

Decomposed Vision-Language Alignment, which splits prompts into independent concept and attribute tokens and applies Feature-Gated Cross-Attention for multiplicative fusion per attribute.
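A minimal NumPy sketch of multiplicative gating may help make this concrete. It is not the paper's Feature-Gated Cross-Attention module, whose exact form is not given here. The function name `gated_fusion`, the tensor layout, and the sigmoid gates are all assumptions chosen for illustration, and the concept branch is simplified to a single relevance map rather than full cross-attention.

```python
import numpy as np

def sigmoid(x):
    # Logistic function; maps real-valued scores into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(visual, concept, attributes):
    """Hypothetical fusion of visual features with concept and attribute tokens.

    visual:     (N, D) features for N spatial locations (assumed layout)
    concept:    (D,)   embedding of the concept token
    attributes: (A, D) embeddings of the A attribute tokens
    """
    scale = np.sqrt(visual.shape[-1])
    # Relevance of each location to the concept (simplified concept branch).
    concept_map = sigmoid(visual @ concept / scale)      # (N,)
    # One gating map per attribute token.
    gates = sigmoid(visual @ attributes.T / scale)       # (N, A)
    # Multiplying the maps behaves like a soft AND: a location stays active
    # only where the concept and every attribute respond.
    combined = concept_map * gates.prod(axis=1)          # (N,)
    return visual * combined[:, None], gates

rng = np.random.default_rng(0)
visual = rng.normal(size=(6, 8))
concept = rng.normal(size=8)
attributes = rng.normal(size=(2, 8))
fused, gates = gated_fusion(visual, concept, attributes)
print(fused.shape, gates.shape)  # (6, 8) (6, 2)
```

Because every gate lies strictly between 0 and 1, the fused features can only shrink relative to the input, so a weak response to any single attribute suppresses that location.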

If this is right

The framework integrates directly into existing transformer-based open-vocabulary segmentation models without architectural overhaul.
Separate cross-modal interactions per token enforce compositional semantics that holistic sentence encodings lose.
Log-space aggregation of per-token similarities yields more stable and interpretable matching scores than direct averaging.
The approach targets improved performance specifically on fine-grained benchmarks that test generalization to unseen compositions.
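The log-space aggregation mentioned in the list above can be sketched as follows. This assumes each token (concept or attribute) yields a real-valued matching logit that is turned into a probability with a sigmoid; the paper's actual scoring function is not specified here, and the names `log_sigmoid` and `log_and_score` are hypothetical.

```python
import numpy as np

def log_sigmoid(x):
    # log(sigmoid(x)) written as -log(1 + exp(-x)); logaddexp avoids overflow.
    return -np.logaddexp(0.0, -np.asarray(x, dtype=float))

def log_and_score(token_logits):
    """Sum per-token log-probabilities over the last axis.

    Summing logs equals taking the log of the product of probabilities,
    so the score is high only when every token matches.
    """
    return log_sigmoid(token_logits).sum(axis=-1)

moderate = log_and_score([2.0, -1.0, 0.5])
extreme = log_and_score([-800.0, -800.0, -800.0])
print(moderate, extreme)
```

With very negative logits, the product of the three probabilities would underflow to zero, while the log-space sum stays finite. This is one plausible reading of why the aggregation is described as stable.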

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same token decomposition could be tested in related tasks such as attribute-aware image retrieval or referring expression comprehension.
If prompt factorization is learned rather than hand-specified, the method might scale to longer or more complex descriptions.
The multiplicative gating maps might transfer to other multimodal fusion settings where selective attribute emphasis is useful.

Load-bearing premise

Textual prompts can be reliably split into independent concept and attribute tokens that keep their compositional meaning without introducing errors or losing information.
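To illustrate what such a split might look like as a data structure, here is a small sketch. It assumes the concept and attribute words are already available as structured labels, which sidesteps the parsing step the premise is about. The `DecomposedPrompt` class and `decompose` function are hypothetical; the attribute phrasing (attribute followed by the concept word) follows the pattern in the Figure 2 caption.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple

@dataclass(frozen=True)
class DecomposedPrompt:
    concept: str                 # single concept phrase, e.g. "building"
    attributes: Tuple[str, ...]  # one phrase per attribute

def decompose(concept: str, attribute_words: Iterable[str]) -> DecomposedPrompt:
    """Build a decomposed prompt from structured labels (assumed input)."""
    concept = concept.strip().lower()
    phrases = tuple(
        f"{word.strip().lower()} {concept}"
        for word in attribute_words
        if word.strip()  # skip empty attribute strings
    )
    return DecomposedPrompt(concept=concept, attributes=phrases)

prompt = decompose("building", ["hipped roof", "residential"])
print(prompt.concept, prompt.attributes)
```

Free-text prompts would need an actual parser in place of `decompose`, and that is where the omissions and boundary errors raised in the referee report could enter.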

What would settle it

Evaluate the model on a held-out test set containing only attribute-category pairs absent from the training data. If segmentation accuracy on those pairs is no higher than that of a standard holistic-prompt baseline, the central claim fails.
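One way to build such a held-out split is sketched below. The sample format (dicts with `concept` and `attributes` keys), the toy labels, and the function names are assumptions for illustration and are not taken from the paper or its benchmarks.

```python
def composition_pairs(sample):
    # All (concept, attribute) pairs expressed by one annotated sample.
    return {(sample["concept"], attr) for attr in sample["attributes"]}

def split_by_composition(samples, held_out_pairs):
    """Send every sample containing a held-out pair to the test split."""
    held_out = set(held_out_pairs)
    train, test = [], []
    for sample in samples:
        if composition_pairs(sample) & held_out:
            test.append(sample)
        else:
            train.append(sample)
    return train, test

# Toy annotations; concepts and attributes are hypothetical.
samples = [
    {"id": 0, "concept": "building", "attributes": ["red"]},
    {"id": 1, "concept": "car", "attributes": ["large"]},
    {"id": 2, "concept": "car", "attributes": ["red"]},
    {"id": 3, "concept": "building", "attributes": ["large", "red"]},
    {"id": 4, "concept": "car", "attributes": ["red", "large"]},
]
held_out = {("car", "red")}
train, test = split_by_composition(samples, held_out)
print([s["id"] for s in train], [s["id"] for s in test])  # [0, 1, 3] [2, 4]
```

In this toy split, "car" and "red" each appear in training, but never together, so the test set probes the composition rather than an unseen word.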

Figures

Figures reproduced from arXiv: 2605.15942 by Chenhao Wang, Yao Zhu, Yingrui Ji, Yu Meng.

**Figure 1.** Overall architecture of the proposed method. (a) Explicit Prompt Decomposition and Feature-Gated Cross-Attention. The compositional text prompt is explicitly decoupled into independent concept and attribute tokens. Visual queries interact with the concept via standard cross-attention, and with attributes via multiplicative feature-gating to enforce compositional constraints. (b) Log-Space AND Composition…

**Figure 2.** Qualitative visualization of the Feature-Gated Cross-Attention mechanism for two examples. The top row prompt decomposes into concept building, attribute hipped roof building, and attribute residential building. The bottom row prompt decomposes into concept building, attribute flat roof building, and attribute commercial building. Panel a shows the original images. Panel b shows the concept attention maps.…

Original abstract

Open-vocabulary segmentation models often struggle to generalize to unseen combinations of object categories and attributes, because fine-grained descriptions are typically encoded as holistic sentences that entangle multiple semantic units. We propose a Decomposed Vision-Language Alignment framework that explicitly factorizes textual prompts into a concept token and multiple attribute tokens, enabling separate cross-modal interactions for each semantic unit. At the feature level, we introduce a Feature-Gated Cross-Attention module that generates attribute-specific gating maps to fuse information in a multiplicative manner, effectively enforcing compositional semantics. At the scoring level, per-token similarities are aggregated in log-space, producing a stable and interpretable compositional matching. The method can be seamlessly integrated into existing transformer-based segmentation architectures and significantly improves generalization to unseen attribute-category compositions in fine-grained open-vocabulary segmentation benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a Decomposed Vision-Language Alignment framework for fine-grained open-vocabulary segmentation. It explicitly factorizes textual prompts into a single concept token and multiple attribute tokens, enabling separate cross-modal interactions per semantic unit. A Feature-Gated Cross-Attention module generates attribute-specific gating maps for multiplicative fusion at the feature level, while per-token similarities are aggregated in log-space at the scoring level to produce compositional matching. The approach is presented as integrable into existing transformer-based segmentation models to improve generalization on unseen attribute-category compositions.

Significance. If the factorization and modules operate as intended, the framework could offer a more explicit mechanism for enforcing compositional semantics than holistic prompt encoding, potentially benefiting open-vocabulary segmentation on fine-grained benchmarks. The log-space aggregation and gated attention are presented as stable and interpretable additions. However, the manuscript supplies no quantitative results, benchmarks, ablation studies, or error analysis, so the actual significance cannot be assessed from the provided text.

major comments (2)

[Abstract] Abstract: The central claim of improved generalization to unseen attribute-category compositions rests on the explicit factorization of prompts into independent concept and attribute tokens plus the Feature-Gated Cross-Attention module, yet the text provides no empirical results, tables, or figures to support that this factorization preserves semantics or yields measurable gains.
[Abstract] Method description (implicit in Abstract): The assumption that textual prompts can be reliably decomposed into semantically independent tokens without entanglement or boundary errors is load-bearing for the per-token cross-modal interactions and log-space aggregation; if factorization introduces omissions or incorrect boundaries, the claimed compositional enforcement no longer follows.

minor comments (1)

[Abstract] Abstract: Consider specifying the parsing or model used to produce the concept and attribute tokens, as this step is foundational but described only at a high level.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed review and constructive comments on our manuscript describing the Decomposed Vision-Language Alignment framework. We address each of the major comments below.

Referee: [Abstract] Abstract: The central claim of improved generalization to unseen attribute-category compositions rests on the explicit factorization of prompts into independent concept and attribute tokens plus the Feature-Gated Cross-Attention module, yet the text provides no empirical results, tables, or figures to support that this factorization preserves semantics or yields measurable gains.

Authors: The referee correctly notes that the current manuscript text does not present empirical results, tables, or figures. This version emphasizes the proposed framework and its integration into transformer-based models. We agree that to substantiate the claims of improved generalization, quantitative evaluations are necessary. In the revised manuscript, we will include benchmark results, ablation studies, and error analysis demonstrating the benefits of the decomposed alignment approach.

revision: yes

Referee: [Abstract] Method description (implicit in Abstract): The assumption that textual prompts can be reliably decomposed into semantically independent tokens without entanglement or boundary errors is load-bearing for the per-token cross-modal interactions and log-space aggregation; if factorization introduces omissions or incorrect boundaries, the claimed compositional enforcement no longer follows.

Authors: We agree that the reliability of the prompt decomposition is critical to the framework's success. The manuscript outlines an explicit factorization process where prompts are parsed into one concept token and multiple attribute tokens using semantic role labeling or similar NLP techniques. This enables the separate cross-modal interactions. To mitigate risks of entanglement or boundary errors, the log-space aggregation allows for flexible matching. We will expand the method description in the revision to include the exact decomposition algorithm and robustness checks.

revision: yes

Circularity Check

0 steps flagged

Proposed decomposition and gating modules are additive architectural choices with no reduction to inputs by construction

The paper introduces an explicit factorization of prompts into concept and attribute tokens, followed by per-token cross-modal interactions, a Feature-Gated Cross-Attention module, and log-space aggregation as new components integrated into existing transformer architectures. No equations or steps in the abstract or described framework reduce a claimed prediction or result to a fitted parameter or prior self-citation by definition; the generalization improvements are attributed to these design additions rather than any self-referential equivalence. The derivation remains self-contained as an independent modeling proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that prompts admit clean factorization into semantic units and on the introduction of a new attention module whose effectiveness is asserted rather than derived from prior results.

axioms (1)

domain assumption: Textual prompts can be explicitly factorized into a concept token and multiple attribute tokens that enable separate cross-modal interactions.
This factorization is invoked as the starting point for the entire Decomposed Vision-Language Alignment framework.

invented entities (1)

Feature-Gated Cross-Attention module (no independent evidence)
purpose: Generates attribute-specific gating maps to fuse information in a multiplicative manner and enforce compositional semantics.
New module introduced to perform the gated fusion at the feature level.


[1] [1]

IEEE Transactions on Pattern Anal- ysis & Machine Intelligence45(06), 7430–7443 (Jun 2023).https://doi.org/10.1109/TPAMI

Akata, Z., Perronnin, F., Harchaoui, Z., Schmid, C.: Label-embedding for image classification. PAMI38(7), 1425–1438 (2016).https://doi.org/10.1109/TPAMI. 2015.2487979

work page doi:10.1109/tpami 2016

[2] [2]

SAM 3: Segment Anything with Concepts

Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala,K.V.,Khedr,H.,Huang,A.,etal.:Sam3:Segmentanythingwithconcepts. arXiv preprint arXiv:2511.16719 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

IEEE transactions on pattern analysis and machine intelli- gence40(4), 834–848 (2017)

Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: Se- mantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelli- gence40(4), 834–848 (2017)

work page 2017

[4] [4]

Advances in neural information processing systems34, 17864–17875 (2021)

Cheng, B., Schwing, A., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. Advances in neural information processing systems34, 17864–17875 (2021)

work page 2021

[5] [5]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Cho,S.,Shin,H.,Hong,S.,Arnab,A.,Seo,P.H.,Kim,S.:Cat-seg:Costaggregation for open-vocabulary semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4113–4123 (2024)

work page 2024

[6] [6]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Ding, J., Xue, N., Xia, G.S., Dai, D.: Decoupling zero-shot semantic segmenta- tion. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 11583–11592 (2022)

work page 2022

[7] [7]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Doveh, S., Arbelle, A., Harary, S., Schwartz, E., Herzig, R., Giryes, R., Feris, R., Panda, R., Ullman, S., Karlinsky, L.: Teaching structured vision & language con- cepts to vision & language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2657–2668 (2023)

work page 2023

[8] [8]

In: NeurIPS (2013)

Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Mikolov, T.: Devise: A deep visual-semantic embedding model. In: NeurIPS (2013)

work page 2013

[9] [9]

In: ECCV

Ghiasi,G.,Gu,X.,Cui,Y.,Lin,T.Y.:Scalingopen-vocabularyimagesegmentation with image-level labels. In: ECCV. pp. 540–557 (2022).https://doi.org/10. 1007/978-3-031-20059-5_31

work page 2022

[10] [10]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops

Huang, X., Ren, L., Liu, C., Wang, Y., Yu, H., Schmitt, M., Hänsch, R., Sun, X., Huang, H., Mayer, H.: Urban building classification (ubc) - a dataset for individual building detection and classification from satellite imagery. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. pp. 1413–1421 (June 2022)

work page 2022

[11] [11]

IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing pp

Ji, Y., Wang, C., Chen, J., Chen, J., Yue, A., Meng, Y., Sui, C.: Movseg: Efficient adaptation of vision-language models for multispectral open-vocabulary segmenta- tion. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing pp. 1–12 (2026).https://doi.org/10.1109/JSTARS.2026.3658442

work page doi:10.1109/jstars.2026.3658442 2026

[12] [12]

In: International conference on machine learning

Jia, C., Yang, Y., Xia, Y., Chen, Y.T., Parekh, Z., Pham, H., Le, Q., Sung, Y.H., Li, Z., Duerig, T.: Scaling up visual and vision-language representation learning with noisy text supervision. In: International conference on machine learning. pp. 4904–4916. PMLR (2021)

work page 2021

[13] [13]

In: Proceedings of the IEEE/CVF international conference on computer vision

Kamath, A., Singh, M., LeCun, Y., Synnaeve, G., Misra, I., Carion, N.: Mdetr- modulated detection for end-to-end multi-modal understanding. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 1780–1790 (2021)

work page 2021

[14] [14]

In: Proceedings of the IEEE/CVF international conference on computer vision

Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 4015–4026 (2023) Decomposed VL Alignment for Fine-Grained OV Segmentation 17

work page 2023

[15] [15]

Language-driven Semantic Segmentation

Li, B., Weinberger, K.Q., Belongie, S., Koltun, V., Ranftl, R.: Language-driven semantic segmentation. arXiv preprint arXiv:2201.03546 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022

[16] [16]

In: Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition

Li, L.H., Zhang, P., Zhang, H., Yang, J., Li, C., Zhong, Y., Wang, L., Yuan, L., Zhang, L., Hwang, J.N., et al.: Grounded language-image pre-training. In: Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10965–10975 (2022)

work page 2022

[17] [17]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Li, Y.L., Xu, Y., Mao, X., Lu, C.: Symmetry and group in attribute-object com- positions. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 11316–11325 (2020)

work page 2020

[18] Liang, F., Wu, B., Dai, X., Li, K., Zhao, Y., Zhang, H., Zhang, P., Vajda, P., Marculescu, D.: Open-vocabulary semantic segmentation with mask-adapted clip. In: CVPR. pp. 7061–7070 (2023). https://doi.org/10.1109/CVPR52729.2023.00682

[19] Liu, C., Ding, H., Jiang, X.: Gres: Generalized referring expression segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 23592–23601 (2023)

[20] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3431–3440 (2015)

[21] Naeem, M.F., Xian, Y., Tombari, F., Akata, Z.: Compositional zero-shot learning via learning graph embeddings of semantic relations. In: CVPR. pp. 8863–8873 (2021). https://doi.org/10.1109/CVPR46437.2021.00875

[22] Naeem, M.F., Xian, Y., Tombari, F., Akata, Z.: Learning graph embeddings for compositional zero-shot learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 953–962 (2021)

[23] Nagarajan, T., Grauman, K.: Attributes as operators: Factorizing unseen attribute-object compositions. In: ECCV. pp. 169–185 (2018). https://doi.org/10.1007/978-3-030-01261-8_11

[24] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)

[25] Ramanathan, V., Kalia, A., Petrovic, V., Wen, Y., Zheng, B., Guo, B., Wang, R., Marquez, A., Kovvuri, R., Kadian, A., Mousavi, A., Song, Y., Dubey, A., Mahajan, D.: Paco: Parts and attributes of common objects. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 7141–7151 (June 2023)

[26] Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., et al.: Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714 (2024)

[27] Thrush, T., Jiang, R., Bartolo, M., Singh, A., Williams, A., Kiela, D., Ross, C.: Winoground: Probing vision and language models for visio-linguistic compositionality. In: CVPR. pp. 5238–5248 (2022). https://doi.org/10.1109/CVPR52688.2022.00519

[28] Tokmakov, P., Wang, Y.X., Hebert, M.: Learning compositional representations for few-shot recognition. In: ICCV. pp. 6372–6381 (2019)

[29] Wang, C., Ji, Y., Meng, Y., Zhang, Y., Zhu, Y.: Sopseg: Prompt-based small object instance segmentation in remote sensing imagery (2025), https://arxiv.org/abs/2509.03002

[30] Wang, Z., Lu, Y., Li, Q., Tao, X., Guo, Y., Gong, M., Liu, T.: Cris: Clip-driven referring image segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 11686–11695 (2022)

[31] Xian, Y., Schiele, B., Akata, Z.: Zero-shot learning - a comprehensive evaluation of the good, the bad and the ugly. PAMI 41(9), 2251–2265 (2019). https://doi.org/10.1109/TPAMI.2018.2857768

[32] Xu, J., De Mello, S., Liu, S., Byeon, W., Breuel, T., Kautz, J., Wang, X.: Groupvit: Semantic segmentation emerges from text supervision. In: CVPR. pp. 18134–18144 (2022). https://doi.org/10.1109/CVPR52688.2022.01763

[33] Xu, J., Hou, J., Zhang, Y., Feng, R., Wang, Y., Qiao, Y., Xie, W.: Learning open-vocabulary semantic segmentation models from natural language supervision. In: CVPR. pp. 2935–2944 (2023). https://doi.org/10.1109/CVPR52729.2023.00287

[34] Xu, J., Hou, J., Zhang, Y., Feng, R., Wang, Y., Qiao, Y., Xie, W.: A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: ECCV. pp. 736–753 (2022). https://doi.org/10.1007/978-3-031-19842-7_42

[35] Xu, M., Zhang, Z., Wei, F., Hu, H., Bai, X.: Side adapter network for open-vocabulary semantic segmentation. In: CVPR. pp. 2945–2954 (2023). https://doi.org/10.1109/CVPR52729.2023.00288

[36] Xu, M., Zhang, Z., Wei, F., Lin, Y., Cao, Y., Hu, H., Bai, X.: A simple baseline for open vocabulary semantic segmentation with pre-trained vision-language model. Proceedings of the IEEE/CVF European Conference on Computer Vision (ECCV) (2022)

[37] Yang, Z., Wang, J., Tang, Y., Chen, K., Zhao, H., Torr, P.H.: Lavt: Language-aware vision transformer for referring image segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 18155–18165 (2022)

[38] Yu, Q., He, J., Deng, X., Shen, X., Chen, L.C.: Convolutions die hard: Open-vocabulary segmentation with single frozen convolutional clip. vol. 36, pp. 32215–32234 (2023)

[39] Yuksekgonul, M., Bianchi, F., Jurafsky, D., Zou, J.: When and why vision-language models behave like bags-of-words, and what to do about it? In: ICLR (2023)

[40] Yuksekgonul, M., Bianchi, F., Kalluri, P., Jurafsky, D., Zou, J.: When and why vision-language models behave like bags-of-words, and what to do about it? (2022)

[41] Zareian, A., Rosa, K., Hu, D.H., Chang, S.F.: Open-vocabulary object detection using captions. In: CVPR. pp. 14393–14402 (2021). https://doi.org/10.1109/CVPR46437.2021.01417

[42] Zou, X., Dou, Z.Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 15116–15127 (2023)