
arxiv: 2605.15942 · v1 · pith:DO3JTEXFnew · submitted 2026-05-15 · 💻 cs.CV · cs.AI

Decomposed Vision-Language Alignment for Fine-Grained Open-Vocabulary Segmentation

Pith reviewed 2026-05-20 18:40 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords open-vocabulary segmentation · vision-language alignment · decomposed prompts · compositional semantics · fine-grained segmentation · cross-attention · generalization to unseen compositions

The pith

Factorizing text prompts into separate concept and attribute tokens lets segmentation models generalize to unseen category-attribute pairs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard open-vocabulary segmentation models fail on fine-grained descriptions because they encode whole sentences as single units, mixing object categories with their attributes. By explicitly splitting each prompt into one concept token plus multiple attribute tokens, the model can align vision features with each semantic piece on its own. A Feature-Gated Cross-Attention module then creates attribute-specific gating maps that combine information multiplicatively, while similarities are aggregated in log space for stable scoring. This decomposition is claimed to produce better results on benchmarks that test novel combinations never seen together in training data. The method plugs directly into existing transformer segmentation backbones without changing their core architecture.

Core claim

We propose a Decomposed Vision-Language Alignment framework that explicitly factorizes textual prompts into a concept token and multiple attribute tokens, enabling separate cross-modal interactions for each semantic unit. At the feature level, we introduce a Feature-Gated Cross-Attention module that generates attribute-specific gating maps to fuse information in a multiplicative manner, effectively enforcing compositional semantics. At the scoring level, per-token similarities are aggregated in log-space, producing a stable and interpretable compositional matching.

What carries the argument

Decomposed Vision-Language Alignment, which splits prompts into independent concept and attribute tokens and applies Feature-Gated Cross-Attention for multiplicative fusion per attribute.
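The multiplicative fusion described here can be illustrated with a minimal sketch. It assumes each attribute token produces one sigmoid gate per spatial location from a dot product with the visual feature, and that the gates are multiplied together before scaling the feature. The abstract does not give the exact formulation, so `gating_map`, `gated_fusion`, and the toy vectors are hypothetical names and values, and the concept-level cross-attention is omitted.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gating_map(features: List[Vector], attribute: Vector) -> List[float]:
    """Assumed form: one gate in (0, 1) per location for a single attribute token."""
    return [sigmoid(dot(f, attribute)) for f in features]


def gated_fusion(features: List[Vector], attributes: List[Vector]) -> List[Vector]:
    """Scale each location's feature by the product of all its attribute gates."""
    gates = [1.0] * len(features)
    for attr in attributes:
        for i, g in enumerate(gating_map(features, attr)):
            gates[i] *= g
    return [[g * x for x in f] for g, f in zip(gates, features)]


# Toy example: location 0 agrees with both attributes, location 1 with only one.
features = [[2.0, 2.0], [2.0, -2.0]]
attributes = [[1.0, 0.0], [0.0, 1.0]]
fused = gated_fusion(features, attributes)
```

Because the gates multiply, a location has to match every attribute to keep most of its feature magnitude; a single low gate suppresses it. That is the AND-like behavior the page attributes to the module.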

If this is right

  • The framework integrates directly into existing transformer-based open-vocabulary segmentation models without architectural overhaul.
  • Separate cross-modal interactions per token enforce compositional semantics that holistic sentence encodings lose.
  • Log-space aggregation of per-token similarities yields more stable and interpretable matching scores than direct averaging.
  • The approach targets improved performance specifically on fine-grained benchmarks that test generalization to unseen compositions.
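The log-space aggregation point can be made concrete with a small sketch. It assumes each per-token similarity is mapped to a probability with a sigmoid and that the compositional score is the sum of the log-probabilities, which is a product of probabilities computed stably. The abstract does not specify the exact scoring function, so `log_sigmoid`, `compositional_log_score`, and `naive_product_score` are hypothetical names.

```python
import math
from typing import List


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def compositional_log_score(similarities: List[float]) -> float:
    """Sum of per-token log-probabilities: a soft AND over concept and attributes."""
    return sum(log_sigmoid(s) for s in similarities)


def naive_product_score(similarities: List[float]) -> float:
    """Same quantity computed as a direct product of probabilities."""
    p = 1.0
    for s in similarities:
        p *= math.exp(log_sigmoid(s))
    return p


def mean_score(similarities: List[float]) -> float:
    """Direct averaging, shown for contrast."""
    return sum(similarities) / len(similarities)


# One strongly mismatched attribute among otherwise good matches.
sims = [5.0, 5.0, -5.0]
print(compositional_log_score(sims), mean_score(sims))
```

With two very negative similarities such as `[-400.0, -400.0]`, the direct product underflows to zero in double precision while the log-space sum remains finite. With `sims` above, the average stays positive even though one attribute clearly fails, whereas the log-space score is dominated by the failing token.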

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same token decomposition could be tested in related tasks such as attribute-aware image retrieval or referring expression comprehension.
  • If prompt factorization is learned rather than hand-specified, the method might scale to longer or more complex descriptions.
  • The multiplicative gating maps might transfer to other multimodal fusion settings where selective attribute emphasis is useful.

Load-bearing premise

Textual prompts can be reliably split into independent concept and attribute tokens that keep their compositional meaning without introducing errors or losing information.
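One way to probe this premise is to check whether a given decomposition drops words from the original prompt. The sketch below is an illustration under stated assumptions: the decomposition is supplied by hand rather than produced by the paper's (unspecified) parser, the full prompt string is a hypothetical example built around the Figure 2 tokens, and `DecomposedPrompt`, `uncovered_words`, and `STOPWORDS` are invented names.

```python
from dataclasses import dataclass
from typing import Set, Tuple

# Hypothetical stopword list; a real check would need a proper tokenizer.
STOPWORDS = frozenset({"a", "an", "the", "with", "and", "of"})


@dataclass(frozen=True)
class DecomposedPrompt:
    """One concept token plus any number of attribute tokens."""
    concept: str
    attributes: Tuple[str, ...]


def uncovered_words(prompt: str, parts: DecomposedPrompt) -> Set[str]:
    """Content words of the prompt that appear in neither concept nor attributes."""
    covered = set(parts.concept.lower().split())
    for attr in parts.attributes:
        covered.update(attr.lower().split())
    return {
        w for w in prompt.lower().split()
        if w not in STOPWORDS and w not in covered
    }


prompt = "residential building with a hipped roof"  # hypothetical full prompt
parts = DecomposedPrompt("building", ("hipped roof building", "residential building"))
lossy = DecomposedPrompt("building", ("hipped roof building",))

print(uncovered_words(prompt, parts))  # nothing lost
print(uncovered_words(prompt, lossy))  # "residential" was dropped
```

A word-coverage check only catches omissions. It says nothing about whether the attribute tokens are semantically independent, which is the harder part of the premise.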

What would settle it

Run the model on a held-out test set containing only attribute-category pairs absent from the training data. If segmentation accuracy on those pairs is no higher than that of a standard holistic-prompt baseline, the central claim fails.
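A held-out composition split of this kind can be sketched as follows. The sketch assumes the usual compositional zero-shot convention that every test pair is unseen as a pair while its attribute and its category each occur somewhere in training; the paper's actual benchmark protocol is not described on this page. `is_compositional_split` and `mean_iou` are hypothetical helpers, and the pairs and IoU values are placeholders for illustration only, not results from the paper.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (attribute, category)


def is_compositional_split(train: List[Pair], test: List[Pair]) -> bool:
    """True if every test pair is new but built from attributes and categories seen in training."""
    train_set = set(train)
    seen_attributes = {a for a, _ in train}
    seen_categories = {c for _, c in train}
    return all(
        (a, c) not in train_set and a in seen_attributes and c in seen_categories
        for a, c in test
    )


def mean_iou(iou_by_pair: Dict[Pair, float], pairs: List[Pair]) -> float:
    """Average per-pair IoU over the given pairs."""
    return sum(iou_by_pair[p] for p in pairs) / len(pairs)


# Placeholder data: not taken from the paper.
train_pairs = [("red", "car"), ("blue", "bus")]
test_pairs = [("red", "bus"), ("blue", "car")]
assert is_compositional_split(train_pairs, test_pairs)

placeholder_iou = {("red", "bus"): 0.5, ("blue", "car"): 0.7}
print(mean_iou(placeholder_iou, test_pairs))
```

Running the same `mean_iou` over the unseen pairs for the decomposed model and for a holistic-prompt baseline gives the comparison described above.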

Figures

Figures reproduced from arXiv: 2605.15942 by Chenhao Wang, Yao Zhu, Yingrui Ji, Yu Meng.

Figure 1: Overall architecture of the proposed method. (a) Explicit Prompt Decomposition and Feature-Gated Cross-Attention. The compositional text prompt is explicitly decoupled into independent concept and attribute tokens. Visual queries interact with the concept via standard cross-attention, and with attributes via multiplicative feature-gating to enforce compositional constraints. (b) Log-Space AND Composition…
Figure 2: Qualitative visualization of the Feature-Gated Cross-Attention mechanism for two examples. The top row prompt decomposes into concept "building", attribute "hipped roof building", and attribute "residential building". The bottom row prompt decomposes into concept "building", attribute "flat roof building", and attribute "commercial building". Panel (a) shows the original images. Panel (b) shows the concept attention maps.…
Original abstract

Open-vocabulary segmentation models often struggle to generalize to unseen combinations of object categories and attributes, because fine-grained descriptions are typically encoded as holistic sentences that entangle multiple semantic units. We propose a Decomposed Vision-Language Alignment framework that explicitly factorizes textual prompts into a concept token and multiple attribute tokens, enabling separate cross-modal interactions for each semantic unit. At the feature level, we introduce a Feature-Gated Cross-Attention module that generates attribute-specific gating maps to fuse information in a multiplicative manner, effectively enforcing compositional semantics. At the scoring level, per-token similarities are aggregated in log-space, producing a stable and interpretable compositional matching. The method can be seamlessly integrated into existing transformer-based segmentation architectures and significantly improves generalization to unseen attribute-category compositions in fine-grained open-vocabulary segmentation benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a Decomposed Vision-Language Alignment framework for fine-grained open-vocabulary segmentation. It explicitly factorizes textual prompts into a single concept token and multiple attribute tokens, enabling separate cross-modal interactions per semantic unit. A Feature-Gated Cross-Attention module generates attribute-specific gating maps for multiplicative fusion at the feature level, while per-token similarities are aggregated in log-space at the scoring level to produce compositional matching. The approach is presented as integrable into existing transformer-based segmentation models to improve generalization on unseen attribute-category compositions.

Significance. If the factorization and modules operate as intended, the framework could offer a more explicit mechanism for enforcing compositional semantics than holistic prompt encoding, potentially benefiting open-vocabulary segmentation on fine-grained benchmarks. The log-space aggregation and gated attention are presented as stable and interpretable additions. However, the manuscript supplies no quantitative results, benchmarks, ablation studies, or error analysis, so the actual significance cannot be assessed from the provided text.

major comments (2)
  1. [Abstract] The central claim of improved generalization to unseen attribute-category compositions rests on the explicit factorization of prompts into independent concept and attribute tokens plus the Feature-Gated Cross-Attention module, yet the text provides no empirical results, tables, or figures to support that this factorization preserves semantics or yields measurable gains.
  2. [Abstract] Method description (implicit in Abstract): The assumption that textual prompts can be reliably decomposed into semantically independent tokens without entanglement or boundary errors is load-bearing for the per-token cross-modal interactions and log-space aggregation; if factorization introduces omissions or incorrect boundaries, the claimed compositional enforcement no longer follows.
minor comments (1)
  1. [Abstract] Consider specifying the parsing or model used to produce the concept and attribute tokens, as this step is foundational but described only at a high level.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed review and constructive comments on our manuscript describing the Decomposed Vision-Language Alignment framework. We address each of the major comments below.

Point-by-point responses
  1. Referee: [Abstract] The central claim of improved generalization to unseen attribute-category compositions rests on the explicit factorization of prompts into independent concept and attribute tokens plus the Feature-Gated Cross-Attention module, yet the text provides no empirical results, tables, or figures to support that this factorization preserves semantics or yields measurable gains.

    Authors: The referee correctly notes that the current manuscript text does not present empirical results, tables, or figures. This version emphasizes the proposed framework and its integration into transformer-based models. We agree that to substantiate the claims of improved generalization, quantitative evaluations are necessary. In the revised manuscript, we will include benchmark results, ablation studies, and error analysis demonstrating the benefits of the decomposed alignment approach. revision: yes

  2. Referee: [Abstract] Method description (implicit in Abstract): The assumption that textual prompts can be reliably decomposed into semantically independent tokens without entanglement or boundary errors is load-bearing for the per-token cross-modal interactions and log-space aggregation; if factorization introduces omissions or incorrect boundaries, the claimed compositional enforcement no longer follows.

    Authors: We agree that the reliability of the prompt decomposition is critical to the framework's success. The manuscript outlines an explicit factorization process where prompts are parsed into one concept token and multiple attribute tokens using semantic role labeling or similar NLP techniques. This enables the separate cross-modal interactions. To mitigate risks of entanglement or boundary errors, the log-space aggregation allows for flexible matching. We will expand the method description in the revision to include the exact decomposition algorithm and robustness checks. revision: yes

Circularity Check

0 steps flagged

Proposed decomposition and gating modules are additive architectural choices with no reduction to inputs by construction

Full rationale

The paper introduces an explicit factorization of prompts into concept and attribute tokens, followed by per-token cross-modal interactions, a Feature-Gated Cross-Attention module, and log-space aggregation as new components integrated into existing transformer architectures. No equations or steps in the abstract or described framework reduce a claimed prediction or result to a fitted parameter or prior self-citation by definition; the generalization improvements are attributed to these design additions rather than any self-referential equivalence. The derivation remains self-contained as an independent modeling proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that prompts admit clean factorization into semantic units and on the introduction of a new attention module whose effectiveness is asserted rather than derived from prior results.

axioms (1)
  • domain assumption: Textual prompts can be explicitly factorized into a concept token and multiple attribute tokens that enable separate cross-modal interactions.
    This factorization is invoked as the starting point for the entire Decomposed Vision-Language Alignment framework.
invented entities (1)
  • Feature-Gated Cross-Attention module (no independent evidence)
    purpose: Generates attribute-specific gating maps to fuse information in a multiplicative manner and enforce compositional semantics.
    New module introduced to perform the gated fusion at the feature level.

pith-pipeline@v0.9.0 · 5662 in / 1274 out tokens · 68482 ms · 2026-05-20T18:40:32.989962+00:00 · methodology

