arxiv: 2605.20110 · v1 · pith:FTICY4F3new · submitted 2026-05-19 · 💻 cs.CV

SetCon: Towards Open-Ended Referring Segmentation via Set-Level Concept Prediction

Pith reviewed 2026-05-20 05:20 UTC · model grok-4.3

classification 💻 cs.CV
keywords referring segmentation · set-level concept prediction · large vision language models · hierarchical semantic decomposition · open-ended segmentation · multi-target grounding · video referring segmentation

The pith

Reformulating referring segmentation as set-level natural language concept prediction lets models handle multiple and open-ended targets more accurately than special tokens allow.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that previous methods built on large vision-language models (LVLMs) represent each referred target with separate special tokens and therefore miss collective set properties such as completeness and mutual exclusivity. SetCon instead has the LVLM generate natural-language concepts and uses those concepts as conditions for decoding an entire mask set at once. A hierarchical decomposition first produces one shared set-level concept that defines the overall scope and then breaks it into finer concept groups aligned with subsets of targets. This matters because many practical queries involve groups, cross-category collections, or variable numbers of objects, where independent token outputs produce incomplete or redundant masks. The approach is supported by a new annotation pipeline that adds hierarchical concept labels to existing datasets.

Core claim

The central claim is that open-ended referring segmentation is best solved by explicit set-level concept prediction: an LVLM first produces a shared natural-language concept describing the target scope, which is then refined into fine-grained concept groups that serve as semantic conditions for joint mask-set decoding, replacing the earlier practice of emitting sequential special segmentation tokens.
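To make the two-level structure concrete, the sketch below models one response as a shared scope plus a list of concept groups. It is an illustration only: the class names `SetConcept` and `ConceptGroup`, the `scope: phrase` serialisation, and the example phrases are assumptions made here, not the paper's data format.

```python
from dataclasses import dataclass, field


@dataclass
class ConceptGroup:
    """One fine-grained concept and the ids of the masks it is meant to condition."""
    phrase: str
    mask_ids: list[int] = field(default_factory=list)


@dataclass
class SetConcept:
    """A shared set-level concept refined into fine-grained concept groups."""
    scope: str
    groups: list[ConceptGroup] = field(default_factory=list)

    def conditions(self) -> list[str]:
        # Hypothetical serialisation: prefix every group phrase with the shared scope.
        return [f"{self.scope}: {g.phrase}" for g in self.groups]

    def num_targets(self) -> int:
        # Total number of masks across all groups.
        return sum(len(g.mask_ids) for g in self.groups)


# Invented example; the phrases are placeholders, not taken from the dataset.
example = SetConcept(
    scope="tennis gear",
    groups=[
        ConceptGroup("tennis racket", [0]),
        ConceptGroup("wristband", [1, 2]),
    ],
)
print(example.conditions())
print(example.num_targets())
```

The point of the structure is that the number of groups and the number of masks per group are both variable, which is what a fixed sequence of special tokens has trouble expressing.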

What carries the argument

Hierarchical semantic decomposition that predicts a shared set-level concept before refining it into fine-grained concept groups for joint mask-set decoding.

Load-bearing premise

LVLM-generated natural-language concepts reliably capture set-level properties such as completeness and mutual exclusivity better than special segmentation tokens.
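The premise could be checked directly if completeness and mutual exclusivity were measured on predicted mask sets rather than inferred from aggregate scores. The sketch below shows one possible operationalisation. Both definitions are assumptions introduced here (the source does not define these quantities numerically): completeness is the fraction of ground-truth targets matched by some prediction at an IoU threshold, and exclusivity is one minus the fraction of predicted pixels claimed by more than one mask. Masks are represented as sets of `(row, col)` pixels to keep the example dependency-free.

```python
def iou(a, b):
    """Intersection over union of two masks given as sets of (row, col) pixels."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def completeness(pred_masks, gt_masks, iou_thresh=0.5):
    """Fraction of ground-truth targets matched by at least one predicted mask."""
    if not gt_masks:
        return 1.0
    matched = sum(
        any(iou(p, g) >= iou_thresh for p in pred_masks) for g in gt_masks
    )
    return matched / len(gt_masks)


def exclusivity(pred_masks):
    """One minus the share of predicted pixels that belong to more than one mask."""
    counts = {}
    for mask in pred_masks:
        for px in mask:
            counts[px] = counts.get(px, 0) + 1
    if not counts:
        return 1.0
    overlapping = sum(1 for c in counts.values() if c > 1)
    return 1.0 - overlapping / len(counts)


# Toy masks, invented for illustration.
gt = [{(0, 0), (0, 1)}, {(2, 2)}]
pred = [{(0, 0), (0, 1)}, {(0, 1), (0, 2)}]
print(completeness(pred, gt), exclusivity(pred))
```

In the toy case one of the two targets is missed and one of the three predicted pixels is shared, so neither property holds fully. Reporting such quantities for both the token-based and the concept-based variants would test the premise more directly than gIoU alone.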

What would settle it

A controlled test in which a special-token baseline receives identical hierarchical concept supervision yet still shows no performance gain over the original token method on multi-target subsets of gRefCOCO or MUSE.
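The analysis behind such a test can be sketched as a per-target-count comparison of two variants trained on the same data. The function below is a hypothetical helper, not part of the paper: it takes per-sample records of the form `(target_count, concept_score, token_score)` and returns the mean score gap for each target count. The records shown are made-up placeholder values used only to exercise the code.

```python
from collections import defaultdict


def margin_by_target_count(records):
    """Mean (concept_score - token_score) per number of referred targets.

    records: iterable of (target_count, concept_score, token_score) tuples.
    """
    sums = defaultdict(lambda: [0.0, 0])  # target_count -> [sum of gaps, sample count]
    for count, concept_score, token_score in records:
        sums[count][0] += concept_score - token_score
        sums[count][1] += 1
    return {k: total / n for k, (total, n) in sorted(sums.items())}


# Placeholder per-sample scores, NOT results from the paper.
toy_records = [
    (1, 0.8, 0.8),
    (2, 0.7, 0.6),
    (2, 0.9, 0.7),
    (3, 0.6, 0.3),
]
margins = margin_by_target_count(toy_records)
print(margins)
```

If the gaps stay near zero at every target count once the token baseline receives the same hierarchical supervision, the gains would be attributable to the supervision rather than to the concept interface.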

Figures

Figures reproduced from arXiv: 2605.20110 by Jiaqi Wang, Long Xing, Qiaosheng Zhang, Shengyuan Ding, Shuangrui Ding, Yibin Wang, Yizhuo Li, Yuhang Zang, Zhixiong Zhang.

Figure 1: Overview of Set-Concept Segmentation (SETCON). (a) Compared to previous per-token formulation, SETCON produces more distinct and complete mask sets in open-ended scenarios. (b) SETCON predicts interpretable, hierarchical set-level concepts and uses them as semantic conditions for mask-set decoding instead of indistinguishable special tokens. (c) Quantitative results show that SETCON achieves the best perfo…
Figure 2: Pilot study motivating explicit set-level concept prediction. (a) Special-token baselines degrade sharply as the target count grows, while SETCON remains comparatively stable. (b) t-SNE projection of [SEG] representations from the Sa2VA-based baseline, colored by semantic category and 2D spatial position; clusters align more clearly with position than with category.
Figure 3: Architecture of SETCON. Given a text query Q and visual input V, the LVLM produces a response R containing a global set-level concept and its decomposed sub-category concepts. The multimodal decoder is trainable while the image encoder and detector decoder remain frozen, to jointly predict the mask set with per-target labels.
Figure 4: Ablation analysis of SETCON on MUSE. (a) t-SNE projection of segmentation conditions, colored by category: explicit concept conditioning yields tighter, more category-aligned clusters than the token-only variant. (b) F1@0.5 versus the number of referred categories (1–5): hierarchical decomposition consistently outperforms the flat variant, with the gap widening as cardinality grows.
Figure 5: Qualitative comparison on in-the-wild images. We show the user prompt, input image, the LISA-format baseline, and SETCON, with the predicted sub-category concepts listed below. Across reasoning-style queries spanning multi-instance, cross-category, and open-ended scenarios, SETCON tends to produce more complete and semantically coherent mask sets. USER: If a tennis player wanted to adjust his gear during a…
Figure 6: Failure cases on the MUSE benchmark. Open-ended queries expose limitations in target disambiguation and concept granularity, which may lead to incorrect or incomplete predictions.
Figure 7: Additional qualitative results on referring video object segmentation. SETCON maintains stable target identities and temporally consistent masks across challenging video sequences with occlusion, distractors, and appearance changes, by using explicit semantic concepts as persistent anchors shared across frames.
Figure 8: Statistics of the training corpus produced by the proposed two-stage hierarchical annotation pipeline, including the per-sample distribution of sub-categories, and the length distribution of concept phrases.
Original abstract

Referring segmentation grounds natural-language queries to pixel-level masks, but extending it to complex scenarios with multiple instances, cross-category groups, or open-ended target sets remains challenging. Previous Large Vision Language Model (LVLM)-based methods represent referred targets with one or more special tokens sequentially, treating multiple targets as separate outputs rather than a coherent set and offering little incentive to capture set-level properties such as completeness and mutual exclusivity. We reformulate open-ended referring segmentation as explicit set-level concept prediction and propose Set-Concept Segmentation (SetCon), which uses LVLM-generated natural-language concepts, instead of segmentation-specific tokens, as semantic conditions for joint mask-set decoding. A hierarchical semantic decomposition first predicts a shared set-level concept defining the target scope and then refines it into fine-grained concept groups aligned with target subsets. To support this, a two-stage annotation pipeline augments existing reasoning segmentation datasets with hierarchical semantic supervision (236k samples, 784k concept phrases). SetCon achieves state-of-the-art results on image benchmarks (+3.3 gIoU on gRefCOCO, +12.1 gIoU on MUSE), with margins that grow as the number of referred targets increases. The concept interface also transfers to video under a detect-and-track setting, yielding new state-of-the-art results on seven referring video benchmarks, including +10.9 J&F on MeViS and +12.4 J&F on Ref-SeCVOS.
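For readers unfamiliar with the gIoU figures quoted above, the sketch below shows how the metric is commonly computed on gRefCOCO-style data. It rests on assumptions stated here rather than on details from this page: gIoU is taken to be the mean of per-sample IoU, all targets of a sample are merged into one foreground mask, and a no-target sample scores 1 when the prediction is empty and 0 otherwise. Masks are sets of `(row, col)` pixels, and the function names are hypothetical.

```python
def sample_iou(pred, gt):
    """IoU for one sample; pred and gt are merged foreground masks (sets of pixels)."""
    if not gt:
        # Assumed no-target convention: an empty prediction is a perfect score.
        return 1.0 if not pred else 0.0
    union = pred | gt
    return len(pred & gt) / len(union)


def giou(samples):
    """Mean per-sample IoU over an iterable of (pred, gt) pairs."""
    scores = [sample_iou(pred, gt) for pred, gt in samples]
    return sum(scores) / len(scores)


# Toy samples, invented for illustration.
toy_samples = [
    ({(0, 0), (0, 1)}, {(0, 0)}),  # partial overlap
    (set(), set()),                # no-target sample, correctly left empty
    ({(1, 1)}, set()),             # no-target sample with a spurious mask
]
print(giou(toy_samples))
```

Because every sample contributes equally regardless of object size, a missed small target in a multi-target query lowers the per-sample score noticeably, which is one reason the metric is sensitive to set completeness.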

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes SetCon for open-ended referring segmentation, reformulating the task as explicit set-level concept prediction. It replaces special segmentation tokens with LVLM-generated natural-language concepts, using a hierarchical semantic decomposition that first predicts a shared set-level concept and then refines it into fine-grained concept groups. A two-stage annotation pipeline augments existing datasets with 236k samples and 784k concept phrases. SetCon reports state-of-the-art results on image benchmarks (+3.3 gIoU on gRefCOCO, +12.1 gIoU on MUSE) with margins that increase as the number of referred targets grows, and transfers to a detect-and-track setting for new state-of-the-art results on seven referring video benchmarks.

Significance. If the results hold and the set-level concept prediction is isolated as the driver of improved handling of completeness and mutual exclusivity, the work could meaningfully advance referring segmentation for complex multi-target and open-ended scenarios. The creation of augmented hierarchical annotations (236k samples, 784k phrases) and the successful transfer to video benchmarks are concrete strengths that provide new resources and demonstrate broader applicability.

major comments (1)
  1. [Experiments / Method] The central claim that LVLM-generated natural-language concepts (via hierarchical decomposition) enable better capture of set-level properties than special tokens is load-bearing but unisolated. No ablation compares SetCon to a token-based decoder trained on the identical augmented dataset; without this, gains could derive from the new hierarchical supervision rather than the concept interface itself. This directly affects the interpretation of growing margins with increasing target count.
minor comments (2)
  1. [Abstract] The abstract and methods would benefit from explicit verification that set-level properties (completeness, mutual exclusivity) are actually enforced by the concept conditions, beyond reporting aggregate gIoU/J&F gains.
  2. [Method] Provide more detail on the joint mask-set decoding architecture and how the LVLM-generated concepts are injected as semantic conditions.

Simulated Author's Rebuttal

1 response · 0 unresolved

Thank you for the constructive feedback on our manuscript. We address the major comment point by point below, providing clarifications and outlining revisions where appropriate.

Point-by-point responses
  1. Referee: [Experiments / Method] The central claim that LVLM-generated natural-language concepts (via hierarchical decomposition) enable better capture of set-level properties than special tokens is load-bearing but unisolated. No ablation compares SetCon to a token-based decoder trained on the identical augmented dataset; without this, gains could derive from the new hierarchical supervision rather than the concept interface itself. This directly affects the interpretation of growing margins with increasing target count.

    Authors: We agree that isolating the contribution of the natural-language concept interface from the hierarchical supervision is important for strengthening the central claim. The hierarchical decomposition is implemented via LVLM-generated set-level and group-level concepts that serve as semantic conditions for joint decoding, which we argue provides an explicit mechanism for capturing set properties such as completeness and mutual exclusivity. However, to directly address the concern, we will add an ablation in the revised manuscript: a token-based decoder variant trained on the identical augmented dataset (236k samples with hierarchical annotations) but using special segmentation tokens in place of the generated concepts. This comparison will clarify the source of the observed gains, including the increasing margins with higher target counts. We believe the results will support the role of the concept interface, but the new experiment will provide the requested isolation. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper reformulates referring segmentation as set-level concept prediction using LVLM-generated natural-language concepts and a hierarchical decomposition, supported by a new two-stage annotation pipeline that augments datasets with 236k samples and 784k phrases. Performance gains are reported empirically on benchmarks with growing margins for multi-target cases, without any equations or steps that reduce predictions to fitted parameters by construction, self-definitional loops, or load-bearing self-citations. The central claims rest on the introduced interface and annotations rather than tautological redefinitions of inputs, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the ability of LVLMs to produce useful set-level concepts and on the assumption that the new annotation pipeline supplies faithful hierarchical supervision.

axioms (1)
  • domain assumption: LVLMs can generate natural-language concepts that serve as effective semantic conditions for segmentation decoding
    Invoked when replacing segmentation-specific tokens with LVLM-generated concepts for joint mask-set decoding.
invented entities (1)
  • Hierarchical semantic decomposition into set-level and fine-grained concept groups (no independent evidence)
    Purpose: to predict a shared concept defining target scope, then refine it for target subsets.
    New structure introduced to capture set-level properties; no independent evidence provided in abstract.

pith-pipeline@v0.9.0 · 5824 in / 1326 out tokens · 34727 ms · 2026-05-20T05:20:09.465844+00:00 · methodology

Reference graph

Works this paper leans on

77 extracted references · 77 canonical work pages · 5 internal anchors

  1. [1]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

  2. [2]

    One token to seg them all: Language instructed reasoning segmentation in videos.Advances in Neural Information Processing Systems, pages 6833–6859, 2024

    Zechen Bai, Tong He, Haiyang Mei, Pichao Wang, Ziteng Gao, Joya Chen, Lei Liu, Zheng Zhang, and Mike Z Shou. One token to seg them all: Language instructed reasoning segmentation in videos.Advances in Neural Information Processing Systems, pages 6833–6859, 2024

  3. [3]

    Sam 3: Segment anything with concepts

    Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. Sam 3: Segment anything with concepts. InThe Fourteenth International Conference on Learning Representations, 2026

  4. [4]

    End-to-end object detection with transformers

    Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. InEuropean conference on computer vision, pages 213–229, 2020

  5. [5]

    Sam4mllm: Enhance multi-modal large language model for referring expression segmentation

    Yi-Chia Chen, Wei-Hua Li, Cheng Sun, Yu-Chiang Frank Wang, and Chu-Song Chen. Sam4mllm: Enhance multi-modal large language model for referring expression segmentation. InEuropean Conference on Computer Vision, pages 323–340. Springer, 2024

  6. [6]

    Samwise: Infusing wisdom in sam2 for text-driven video segmentation

    Claudia Cuttano, Gabriele Trivigno, Gabriele Rosi, Carlo Masone, and Giuseppe Averta. Samwise: Infusing wisdom in sam2 for text-driven video segmentation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 3395–3405, 2025

  7. [7]

    Mevis: A large-scale benchmark for video segmentation with motion expressions

    Henghui Ding, Chang Liu, Shuting He, Xudong Jiang, and Chen Change Loy. Mevis: A large-scale benchmark for video segmentation with motion expressions. InProceedings of the IEEE/CVF international conference on computer vision, pages 2694–2703, 2023

  8. [8]

    Mevis: A multi-modal dataset for referring motion expression video segmentation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

    Henghui Ding, Chang Liu, Shuting He, Kaining Ying, Xudong Jiang, Chen Change Loy, and Yu-Gang Jiang. Mevis: A multi-modal dataset for referring motion expression video segmentation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

  9. [9]

    Vlt: Vision-language transformer and query generation for referring segmentation.IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 7900–7916, 2022

    Henghui Ding, Chang Liu, Suchen Wang, and Xudong Jiang. Vlt: Vision-language transformer and query generation for referring segmentation.IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 7900–7916, 2022

  10. [10]

    Sam2long: Enhancing sam 2 for long video segmentation with a training-free memory tree

    Shuangrui Ding, Rui Qian, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Yuwei Guo, Dahua Lin, and Jiaqi Wang. Sam2long: Enhancing sam 2 for long video segmentation with a training-free memory tree. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 13614–13624, 2025

  11. [11]

    Language-bridged spatial-temporal interaction for referring video object segmentation

    Zihan Ding, Tianrui Hui, Junshi Huang, Xiaoming Wei, Jizhong Han, and Si Liu. Language-bridged spatial-temporal interaction for referring video object segmentation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4964–4973, 2022

  12. [12]

    Palm-e: an embodied multimodal language model

    Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: an embodied multimodal language model. InProceedings of the 40th International Conference on Machine Learning, pages 8469–8488, 2023

  13. [13]

    The devil is in temporal token: High quality video reasoning segmentation

    Sitong Gong, Yunzhi Zhuge, Lu Zhang, Zongxin Yang, Pingping Zhang, and Huchuan Lu. The devil is in temporal token: High quality video reasoning segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 29183–29192, 2025

  14. [14]

    Anomalygpt: Detecting industrial anomalies using large vision-language models

    Zhaopeng Gu, Bingke Zhu, Guibo Zhu, Yingying Chen, Ming Tang, and Jinqiao Wang. Anomalygpt: Detecting industrial anomalies using large vision-language models. InProceedings of the AAAI conference on artificial intelligence, pages 1932–1940, 2024

  15. [15]

    Html: Hybrid temporal-scale multimodal learning framework for referring video object segmentation

    Mingfei Han, Yali Wang, Zhihui Li, Lina Yao, Xiaojun Chang, and Yu Qiao. Html: Hybrid temporal-scale multimodal learning framework for referring video object segmentation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 13414–13423, 2023

  16. [16]

    Decoupling static and hierarchical motion perception for referring video segmentation

    Shuting He and Henghui Ding. Decoupling static and hierarchical motion perception for referring video segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13332–13341, 2024

  17. [17]

    Segmentation from natural language expressions

    Ronghang Hu, Marcus Rohrbach, and Trevor Darrell. Segmentation from natural language expressions. In European conference on computer vision, pages 108–124, 2016. 13

  18. [18]

    Beyond one-to-one: Rethinking the referring image segmentation

    Yutao Hu, Qixiong Wang, Wenqi Shao, Enze Xie, Zhenguo Li, Jungong Han, and Ping Luo. Beyond one-to-one: Rethinking the referring image segmentation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4067–4077, 2023

  19. [19]

    Densely connected parameter-efficient tuning for referring image segmentation

    Jiaqi Huang, Zunnan Xu, Ting Liu, Yong Liu, Haonan Han, Kehong Yuan, and Xiu Li. Densely connected parameter-efficient tuning for referring image segmentation. InProceedings of the AAAI Conference on Artificial Intelligence, pages 3653–3661, 2025

  20. [20]

    Mmr: A large-scale benchmark dataset for multi-target and multi-granularity reasoning segmentation

    Donggon Jang, Yucheol Cho, Suin Lee, Taehyeon Kim, and Daeshik Kim. Mmr: A large-scale benchmark dataset for multi-target and multi-granularity reasoning segmentation. InThe Thirteenth International Conference on Learning Representations, 2025

  21. [21]

    Mdetr-modulated detection for end-to-end multi-modal understanding

    Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra, and Nicolas Carion. Mdetr-modulated detection for end-to-end multi-modal understanding. InProceedings of the IEEE/CVF international conference on computer vision, pages 1780–1790, 2021

  22. [22]

    Referitgame: Referring to objects in photographs of natural scenes

    Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. InProceedings of the 2014 conference on empirical methods in natural language processing, pages 787–798, 2014

  23. [23]

    Video object segmentation with referring expressions

    Anna Khoreva, Anna Rohrbach, and Brent Schiele. Video object segmentation with referring expressions. InProceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018

  24. [24]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. InProceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023

  25. [25]

    Lisa: Reasoning segmentation via large language model

    Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9579–9589, 2024

  26. [26]

    Text4seg: Reimagining image segmentation as text generation

    Mengcheng Lan, Chaofeng Chen, Yue Zhou, Jiaxing Xu, Yiping Ke, Xinjiang Wang, Litong Feng, and Wayne Zhang. Text4seg: Reimagining image segmentation as text generation. InThe Thirteenth International Conference on Learning Representations, 2025

  27. [27]

    Grounded language-image pre-training

    Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10965–10975, 2022

  28. [28]

    Gligen: Open-set grounded text-to-image generation

    Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22511–22521, 2023

  29. [29]

    Open-vocabulary semantic segmentation with mask-adapted clip

    Feng Liang, Bichen Wu, Xiaoliang Dai, Kunpeng Li, Yinan Zhao, Hang Zhang, Peizhao Zhang, Peter Vajda, and Diana Marculescu. Open-vocabulary semantic segmentation with mask-adapted clip. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7061–7070, 2023

  30. [30]

    Glus: Global-local reasoning unified into a single large language model for video segmentation

    Lang Lin, Xueyang Yu, Ziqi Pang, and Yu-Xiong Wang. Glus: Global-local reasoning unified into a single large language model for video segmentation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 8658–8667, 2025

  31. [31]

    Gres: Generalized referring expression segmentation

    Chang Liu, Henghui Ding, and Xudong Jiang. Gres: Generalized referring expression segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 23592–23601, 2023

  32. [32]

    Recurrent multimodal interaction for referring image segmentation

    Chenxi Liu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, and Alan Yuille. Recurrent multimodal interaction for referring image segmentation. InProceedings of the IEEE international conference on computer vision, pages 1271–1280, 2017

  33. [33]

    Grounding dino: Marrying dino with grounded pre-training for open-set object detection

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. InEuropean conference on computer vision, pages 38–55, 2024

  34. [34]

    Universal segmen- tation at arbitrary granularity with language instruction

    Yong Liu, Cairong Zhang, Yitong Wang, Jiahao Wang, Yujiu Yang, and Yansong Tang. Universal segmen- tation at arbitrary granularity with language instruction. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3459–3469, 2024. 14

  35. [35]

    Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement

    Yuqi Liu, Bohao Peng, Zhisheng Zhong, Zihao Yue, Fanbin Lu, Bei Yu, and Jiaya Jia. Seg-zero: Reasoning- chain guided segmentation via cognitive reinforcement.arXiv preprint arXiv:2503.06520, 2025

  36. [36]

    Visionreasoner: Unified reasoning-integrated visual perception via reinforcement learning

    Yuqi Liu, Tianyuan Qu, Zhisheng Zhong, Bohao Peng, Shu Liu, Bei Yu, and Jiaya Jia. Visionreasoner: Unified reasoning-integrated visual perception via reinforcement learning. InThe Fourteenth International Conference on Learning Representations, 2026

  37. [37]

    Rsvp: Reasoning segmentation via visual prompting and multi-modal chain-of-thought

    Yi Lu, Jiawang Cao, Yongliang Wu, Bozheng Li, Licheng Tang, Yangguang Ji, Chong Wu, Jay Wu, and Wenbo Zhu. Rsvp: Reasoning segmentation via visual prompting and multi-modal chain-of-thought. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, pages 14699–14716, 2025

  38. [38]

    Image segmentation using text and image prompts

    Timo Lüddecke and Alexander Ecker. Image segmentation using text and image prompts. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7086–7096, 2022

  39. [39]

    Soc: Semantic-assisted object cluster for referring video object segmentation.Advances in Neural Information Processing Systems, pages 26425–26437, 2023

    Zhuoyan Luo, Yicheng Xiao, Yong Liu, Shuyan Li, Yitong Wang, Yansong Tang, Xiu Li, and Yujiu Yang. Soc: Semantic-assisted object cluster for referring video object segmentation.Advances in Neural Information Processing Systems, pages 26425–26437, 2023

  40. [40]

    Generation and comprehension of unambiguous object descriptions

    Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 11–20, 2016

  41. [41]

    Spectrum-guided multi-granularity referring video object segmentation

    Bo Miao, Mohammed Bennamoun, Yongsheng Gao, and Ajmal Mian. Spectrum-guided multi-granularity referring video object segmentation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 920–930, 2023

  42. [42]

    Simple open- vocabulary object detection

    Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Doso- vitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, et al. Simple open- vocabulary object detection. InEuropean conference on computer vision, pages 728–755, 2022

  43. [43]

    Videoglamm: A large multimodal model for pixel-level visual grounding in videos

    Shehan Munasinghe, Hanan Gani, Wenqi Zhu, Jiale Cao, Eric Xing, Fahad Shahbaz Khan, and Salman Khan. Videoglamm: A large multimodal model for pixel-level visual grounding in videos. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 19036–19046, 2025

  44. [44]

    Glamm: Pixel grounding large multimodal model

    Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S Khan. Glamm: Pixel grounding large multimodal model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13009–13018, 2024

  45. [45]

    Sam 2: Segment anything in images and videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. InThe Thirteenth International Conference on Learning Representations, 2025

  46. [46]

    Pixellm: Pixel reasoning with large multimodal model

    Zhongwei Ren, Zhicheng Huang, Yunchao Wei, Yao Zhao, Dongmei Fu, Jiashi Feng, and Xiaojie Jin. Pixellm: Pixel reasoning with large multimodal model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26374–26383, 2024

  47. [47]

    Urvos: Unified referring video object segmentation network with a large-scale benchmark

    Seonguk Seo, Joon-Young Lee, and Bohyung Han. Urvos: Unified referring video object segmentation network with a large-scale benchmark. InEuropean conference on computer vision, pages 208–223, 2020

  48. [48]

    OpenAI GPT-5 System Card

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaugh- lin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025

  49. [49]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.arXiv preprint arXiv:2403.05530, 2024

  50. [50]

    Videoanydoor: High-fidelity video object insertion with precise motion control

    Yuanpeng Tu, Hao Luo, Xi Chen, Sihui Ji, Xiang Bai, and Hengshuang Zhao. Videoanydoor: High-fidelity video object insertion with precise motion control. InProceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers, pages 1–11, 2025

  51. [51]

    X-sam: From segment anything to any segmentation

    Hao Wang, Limeng Qiao, Zequn Jie, Zhijian Huang, Chengjian Feng, Qingfang Zheng, Lin Ma, Xiangyuan Lan, and Xiaodan Liang. X-sam: From segment anything to any segmentation. InProceedings of the AAAI Conference on Artificial Intelligence, pages 26187–26196, 2026. 15

  52. [52]

    Unlocking the potential of mllms in referring expression segmentation via a light-weight mask decoder

    Jingchao Wang, Zhijian Wu, Dingjiang Huang, Yefeng Zheng, and Hong Wang. Unlocking the potential of mllms in referring expression segmentation via a light-weight mask decoder. arXiv preprint arXiv:2508.04107, 2025

  53. [53]

    Unveiling parts beyond objects: Towards finer-granularity referring expression segmentation

    Wenxuan Wang, Tongtian Yue, Yisi Zhang, Longteng Guo, Xingjian He, Xinlong Wang, and Jing Liu. Unveiling parts beyond objects: Towards finer-granularity referring expression segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12998–13008, 2024

  54. [54]

    Deforming videos to masks: Flow matching for referring video segmentation

    Zanyi Wang, Dengyang Jiang, Liuzhuozheng Li, Sizhe Dang, Chengzu Li, Harry Yang, Guang Dai, Mengmeng Wang, and Jingdong Wang. Deforming videos to masks: Flow matching for referring video segmentation. arXiv preprint arXiv:2510.06139, 2025

  55. [55]

    Hyperseg: Hybrid segmentation assistant with fine-grained visual perceiver

    Cong Wei, Yujie Zhong, Haoxian Tan, Yong Liu, Jie Hu, Dengjie Li, Zheng Zhao, and Yujiu Yang. Hyperseg: Hybrid segmentation assistant with fine-grained visual perceiver. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pages 8931–8941, June 2025

  56. [56]

    Instructseg: Unifying instructed visual segmentation with multi-modal large language models

    Cong Wei, Yujie Zhong, Haoxian Tan, Yingsen Zeng, Yong Liu, Hongfa Wang, and Yujiu Yang. Instructseg: Unifying instructed visual segmentation with multi-modal large language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20193–20203, 2025

  57. [57]

    Onlinerefer: A simple online baseline for referring video object segmentation

    Dongming Wu, Tiancai Wang, Yuang Zhang, Xiangyu Zhang, and Jianbing Shen. Onlinerefer: A simple online baseline for referring video object segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2761–2770, 2023

  58. [58]

    Language as queries for referring video object segmentation

    Jiannan Wu, Yi Jiang, Peize Sun, Zehuan Yuan, and Ping Luo. Language as queries for referring video object segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4974–4984, 2022

  59. [59]

    General object foundation model for images and videos at scale

    Junfeng Wu, Yi Jiang, Qihao Liu, Zehuan Yuan, Xiang Bai, and Song Bai. General object foundation model for images and videos at scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3783–3795, 2024

  60. [60]

    Gsva: Generalized segmentation via multimodal large language models

    Zhuofan Xia, Dongchen Han, Yizeng Han, Xuran Pan, Shiji Song, and Gao Huang. Gsva: Generalized segmentation via multimodal large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3858–3869, 2024

  61. [61]

    Region-based cluster discrimination for visual representation learning

    Yin Xie, Kaicheng Yang, Xiang An, Kun Wu, Yongle Zhao, Weimo Deng, Zimin Ran, Yumeng Wang, Ziyong Feng, Roy Miles, et al. Region-based cluster discrimination for visual representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1793–1803, 2025

  62. [62]

    Viddar: Vision language model-based task-detrimental content detection for augmented reality

    Yanming Xiu, Tim Scargill, and Maria Gorlatova. Viddar: Vision language model-based task-detrimental content detection for augmented reality. IEEE transactions on visualization and computer graphics, 2025

  63. [63]

    Visa: Reasoning video object segmentation via large language models

    Cilin Yan, Haochen Wang, Shilin Yan, Xiaolong Jiang, Yao Hu, Guoliang Kang, Weidi Xie, and Efstratios Gavves. Visa: Reasoning video object segmentation via large language models. In European Conference on Computer Vision, pages 98–115, 2024

  64. [64]

    Lavt: Language-aware vision transformer for referring image segmentation

    Zhao Yang, Jiaqi Wang, Yansong Tang, Kai Chen, Hengshuang Zhao, and Philip HS Torr. Lavt: Language-aware vision transformer for referring image segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18155–18165, 2022

  65. [65]

    Mattnet: Modular attention network for referring expression comprehension

    Licheng Yu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, and Tamara L Berg. Mattnet: Modular attention network for referring expression comprehension. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1307–1315, 2018

  66. [66]

    Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

    Haobo Yuan, Xiangtai Li, Tao Zhang, Yueyi Sun, Zilong Huang, Shilin Xu, Shunping Ji, Yunhai Tong, Lu Qi, Jiashi Feng, et al. Sa2va: Marrying sam2 with llava for dense grounded understanding of images and videos. arXiv preprint arXiv:2501.04001, 2025

  67. [67]

    Omg-llava: Bridging image-level, object-level, pixel-level reasoning and understanding

    Tao Zhang, Xiangtai Li, Hao Fei, Haobo Yuan, Shengqiong Wu, Shunping Ji, Chen Change Loy, and Shuicheng Yan. Omg-llava: Bridging image-level, object-level, pixel-level reasoning and understanding. Advances in neural information processing systems, 37:71737–71767, 2024

  68. [68]

    Evf-sam: Early vision-language fusion for text-prompted segment anything model

    Yuxuan Zhang, Tianheng Cheng, Lianghui Zhu, Rui Hu, Lei Liu, Heng Liu, Longjin Ran, Xiaoxin Chen, Wenyu Liu, and Xinggang Wang. Evf-sam: Early vision-language fusion for text-prompted segment anything model. arXiv preprint arXiv:2406.20076, 2024

  69. [69]

    Psalm: Pixelwise segmentation with large multi-modal model

    Zheng Zhang, Yeyao Ma, Enming Zhang, and Xiang Bai. Psalm: Pixelwise segmentation with large multi-modal model. In European Conference on Computer Vision, pages 74–91. Springer, 2024

  70. [70]

    Sec: Advancing complex video object segmentation via progressive concept construction

    Zhixiong Zhang, Shuangrui Ding, Xiaoyi Dong, Songxin He, Jianfan Lin, Junsong Tang, Yuhang Zang, Yuhang Cao, Dahua Lin, and Jiaqi Wang. Sec: Advancing complex video object segmentation via progressive concept construction. In The Fourteenth International Conference on Learning Representations, 2026

  71. [71]

    Villa: Video reasoning segmentation with large language model

    Rongkun Zheng, Lu Qi, Xi Chen, Yi Wang, Kun Wang, and Hengshuang Zhao. Villa: Video reasoning segmentation with large language model. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 23667–23677, 2025

  72. [72]

    Regionclip: Region-based language-image pretraining

    Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, et al. Regionclip: Region-based language-image pretraining. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16793–16803, 2022

  73. [73]

    Tracking with human-intent reasoning

    Jiawen Zhu, Zhi-Qi Cheng, Jun-Yan He, Chenyang Li, Bin Luo, Huchuan Lu, Yifeng Geng, and Xuansong Xie. Tracking with human-intent reasoning. arXiv preprint arXiv:2312.17448, 2023

  74. [74]

    Training-free spatio-temporal decoupled reasoning video segmentation with adaptive object memory

    Zhengtong Zhu, Jiaqing Fan, Zhixuan Liu, and Fanzhang Li. Training-free spatio-temporal decoupled reasoning video segmentation with adaptive object memory. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 14022–14030, 2026

  75. [75]

    Rt-2: Vision-language-action models transfer web knowledge to robotic control

    Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183, 2023

  76. [76]

    Generalized decoding for pixel, image, and language

    Xueyan Zou, Zi-Yi Dou, Jianwei Yang, Zhe Gan, Linjie Li, Chunyuan Li, Xiyang Dai, Harkirat Behl, Jianfeng Wang, Lu Yuan, et al. Generalized decoding for pixel, image, and language. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15116–15127, 2023

  77. [77]

    Segment everything everywhere all at once

    Xueyan Zou, Jianwei Yang, Hao Zhang, Feng Li, Linjie Li, Jianfeng Wang, Lijuan Wang, Jianfeng Gao, and Yong Jae Lee. Segment everything everywhere all at once. Advances in neural information processing systems, 36:19769–19782, 2023