SetCon: Towards Open-Ended Referring Segmentation via Set-Level Concept Prediction
Pith reviewed 2026-05-20 05:20 UTC · model grok-4.3
The pith
Reformulating referring segmentation as set-level natural language concept prediction lets models handle multiple and open-ended targets more accurately than special tokens allow.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that open-ended referring segmentation is best solved by explicit set-level concept prediction: an LVLM first produces a shared natural-language concept describing the target scope, which is then refined into fine-grained concept groups that serve as semantic conditions for joint mask-set decoding, replacing the earlier practice of emitting sequential special segmentation tokens.
What carries the argument
hierarchical semantic decomposition that predicts a shared set-level concept before refining it into fine-grained concept groups for joint mask-set decoding
Load-bearing premise
LVLM-generated natural-language concepts reliably capture set-level properties such as completeness and mutual exclusivity better than special segmentation tokens.
What would settle it
A controlled test in which a special-token baseline receives identical hierarchical concept supervision yet still shows no performance gain over the original token method on multi-target subsets of gRefCOCO or MUSE.
Figures
read the original abstract
Referring segmentation grounds natural-language queries to pixel-level masks, but extending it to complex scenarios with multiple instances, cross-category groups, or open-ended target sets remains challenging. Previous Large Vision Language Model (LVLM)-based methods represent referred targets with one or more special tokens sequentially, treating multiple targets as separate outputs rather than a coherent set and offering little incentive to capture set-level properties such as completeness and mutual exclusivity. We reformulate open-ended referring segmentation as explicit set-level concept prediction and propose Set-Concept Segmentation (SetCon), which uses LVLM-generated natural-language concepts, instead of segmentation-specific tokens, as semantic conditions for joint mask-set decoding. A hierarchical semantic decomposition first predicts a shared set-level concept defining the target scope and then refines it into fine-grained concept groups aligned with target subsets. To support this, a two-stage annotation pipeline augments existing reasoning segmentation datasets with hierarchical semantic supervision (236k samples, 784k concept phrases). SetCon achieves state-of-the-art results on image benchmarks (+3.3 gIoU on gRefCOCO, +12.1 gIoU on MUSE), with margins that grow as the number of referred targets increases. The concept interface also transfers to video under a detect-and-track setting, yielding new state-of-the-art results on seven referring video benchmarks, including +10.9 J&F on MeViS and +12.4 J&F on Ref-SeCVOS.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes SetCon for open-ended referring segmentation, reformulating the task as explicit set-level concept prediction. It replaces special segmentation tokens with LVLM-generated natural-language concepts, using a hierarchical semantic decomposition that first predicts a shared set-level concept and then refines it into fine-grained concept groups. A two-stage annotation pipeline augments existing datasets with 236k samples and 784k concept phrases. SetCon reports state-of-the-art results on image benchmarks (+3.3 gIoU on gRefCOCO, +12.1 gIoU on MUSE) with margins that increase as the number of referred targets grows, and transfers to a detect-and-track setting for new state-of-the-art results on seven referring video benchmarks.
Significance. If the results hold and the set-level concept prediction is isolated as the driver of improved handling of completeness and mutual exclusivity, the work could meaningfully advance referring segmentation for complex multi-target and open-ended scenarios. The creation of augmented hierarchical annotations (236k samples, 784k phrases) and the successful transfer to video benchmarks are concrete strengths that provide new resources and demonstrate broader applicability.
major comments (1)
- [Experiments / Method] The central claim that LVLM-generated natural-language concepts (via hierarchical decomposition) enable better capture of set-level properties than special tokens is load-bearing but unisolated. No ablation compares SetCon to a token-based decoder trained on the identical augmented dataset; without this, gains could derive from the new hierarchical supervision rather than the concept interface itself. This directly affects the interpretation of growing margins with increasing target count.
minor comments (2)
- [Abstract] The abstract and methods would benefit from explicit verification that set-level properties (completeness, mutual exclusivity) are actually enforced by the concept conditions, beyond reporting aggregate gIoU/J&F gains.
- [Method] Provide more detail on the joint mask-set decoding architecture and how the LVLM-generated concepts are injected as semantic conditions.
Simulated Author's Rebuttal
Thank you for the constructive feedback on our manuscript. We address the major comment point by point below, providing clarifications and outlining revisions where appropriate.
read point-by-point responses
-
Referee: [Experiments / Method] The central claim that LVLM-generated natural-language concepts (via hierarchical decomposition) enable better capture of set-level properties than special tokens is load-bearing but unisolated. No ablation compares SetCon to a token-based decoder trained on the identical augmented dataset; without this, gains could derive from the new hierarchical supervision rather than the concept interface itself. This directly affects the interpretation of growing margins with increasing target count.
Authors: We agree that isolating the contribution of the natural-language concept interface from the hierarchical supervision is important for strengthening the central claim. The hierarchical decomposition is implemented via LVLM-generated set-level and group-level concepts that serve as semantic conditions for joint decoding, which we argue provides an explicit mechanism for capturing set properties such as completeness and mutual exclusivity. However, to directly address the concern, we will add an ablation in the revised manuscript: a token-based decoder variant trained on the identical augmented dataset (236k samples with hierarchical annotations) but using special segmentation tokens in place of the generated concepts. This comparison will clarify the source of the observed gains, including the increasing margins with higher target counts. We believe the results will support the role of the concept interface, but the new experiment will provide the requested isolation. revision: yes
Circularity Check
No significant circularity in derivation chain
full rationale
The paper reformulates referring segmentation as set-level concept prediction using LVLM-generated natural-language concepts and a hierarchical decomposition, supported by a new two-stage annotation pipeline that augments datasets with 236k samples and 784k phrases. Performance gains are reported empirically on benchmarks with growing margins for multi-target cases, without any equations or steps that reduce predictions to fitted parameters by construction, self-definitional loops, or load-bearing self-citations. The central claims rest on the introduced interface and annotations rather than tautological redefinitions of inputs, making the derivation self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption LVLMs can generate natural-language concepts that serve as effective semantic conditions for segmentation decoding
invented entities (1)
-
Hierarchical semantic decomposition into set-level and fine-grained concept groups
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[2]
Zechen Bai, Tong He, Haiyang Mei, Pichao Wang, Ziteng Gao, Joya Chen, Lei Liu, Zheng Zhang, and Mike Z Shou. One token to seg them all: Language instructed reasoning segmentation in videos.Advances in Neural Information Processing Systems, pages 6833–6859, 2024
work page 2024
-
[3]
Sam 3: Segment anything with concepts
Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. Sam 3: Segment anything with concepts. InThe Fourteenth International Conference on Learning Representations, 2026
work page 2026
-
[4]
End-to-end object detection with transformers
Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. InEuropean conference on computer vision, pages 213–229, 2020
work page 2020
-
[5]
Sam4mllm: Enhance multi-modal large language model for referring expression segmentation
Yi-Chia Chen, Wei-Hua Li, Cheng Sun, Yu-Chiang Frank Wang, and Chu-Song Chen. Sam4mllm: Enhance multi-modal large language model for referring expression segmentation. InEuropean Conference on Computer Vision, pages 323–340. Springer, 2024
work page 2024
-
[6]
Samwise: Infusing wisdom in sam2 for text-driven video segmentation
Claudia Cuttano, Gabriele Trivigno, Gabriele Rosi, Carlo Masone, and Giuseppe Averta. Samwise: Infusing wisdom in sam2 for text-driven video segmentation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 3395–3405, 2025
work page 2025
-
[7]
Mevis: A large-scale benchmark for video segmentation with motion expressions
Henghui Ding, Chang Liu, Shuting He, Xudong Jiang, and Chen Change Loy. Mevis: A large-scale benchmark for video segmentation with motion expressions. InProceedings of the IEEE/CVF international conference on computer vision, pages 2694–2703, 2023
work page 2023
-
[8]
Henghui Ding, Chang Liu, Shuting He, Kaining Ying, Xudong Jiang, Chen Change Loy, and Yu-Gang Jiang. Mevis: A multi-modal dataset for referring motion expression video segmentation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025
work page 2025
-
[9]
Henghui Ding, Chang Liu, Suchen Wang, and Xudong Jiang. Vlt: Vision-language transformer and query generation for referring segmentation.IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 7900–7916, 2022
work page 2022
-
[10]
Sam2long: Enhancing sam 2 for long video segmentation with a training-free memory tree
Shuangrui Ding, Rui Qian, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Yuwei Guo, Dahua Lin, and Jiaqi Wang. Sam2long: Enhancing sam 2 for long video segmentation with a training-free memory tree. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 13614–13624, 2025
work page 2025
-
[11]
Language-bridged spatial-temporal interaction for referring video object segmentation
Zihan Ding, Tianrui Hui, Junshi Huang, Xiaoming Wei, Jizhong Han, and Si Liu. Language-bridged spatial-temporal interaction for referring video object segmentation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4964–4973, 2022
work page 2022
-
[12]
Palm-e: an embodied multimodal language model
Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: an embodied multimodal language model. InProceedings of the 40th International Conference on Machine Learning, pages 8469–8488, 2023
work page 2023
-
[13]
The devil is in temporal token: High quality video reasoning segmentation
Sitong Gong, Yunzhi Zhuge, Lu Zhang, Zongxin Yang, Pingping Zhang, and Huchuan Lu. The devil is in temporal token: High quality video reasoning segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 29183–29192, 2025
work page 2025
-
[14]
Anomalygpt: Detecting industrial anomalies using large vision-language models
Zhaopeng Gu, Bingke Zhu, Guibo Zhu, Yingying Chen, Ming Tang, and Jinqiao Wang. Anomalygpt: Detecting industrial anomalies using large vision-language models. InProceedings of the AAAI conference on artificial intelligence, pages 1932–1940, 2024
work page 1932
-
[15]
Html: Hybrid temporal-scale multimodal learning framework for referring video object segmentation
Mingfei Han, Yali Wang, Zhihui Li, Lina Yao, Xiaojun Chang, and Yu Qiao. Html: Hybrid temporal-scale multimodal learning framework for referring video object segmentation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 13414–13423, 2023
work page 2023
-
[16]
Decoupling static and hierarchical motion perception for referring video segmentation
Shuting He and Henghui Ding. Decoupling static and hierarchical motion perception for referring video segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13332–13341, 2024
work page 2024
-
[17]
Segmentation from natural language expressions
Ronghang Hu, Marcus Rohrbach, and Trevor Darrell. Segmentation from natural language expressions. In European conference on computer vision, pages 108–124, 2016. 13
work page 2016
-
[18]
Beyond one-to-one: Rethinking the referring image segmentation
Yutao Hu, Qixiong Wang, Wenqi Shao, Enze Xie, Zhenguo Li, Jungong Han, and Ping Luo. Beyond one-to-one: Rethinking the referring image segmentation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4067–4077, 2023
work page 2023
-
[19]
Densely connected parameter-efficient tuning for referring image segmentation
Jiaqi Huang, Zunnan Xu, Ting Liu, Yong Liu, Haonan Han, Kehong Yuan, and Xiu Li. Densely connected parameter-efficient tuning for referring image segmentation. InProceedings of the AAAI Conference on Artificial Intelligence, pages 3653–3661, 2025
work page 2025
-
[20]
Mmr: A large-scale benchmark dataset for multi-target and multi-granularity reasoning segmentation
Donggon Jang, Yucheol Cho, Suin Lee, Taehyeon Kim, and Daeshik Kim. Mmr: A large-scale benchmark dataset for multi-target and multi-granularity reasoning segmentation. InThe Thirteenth International Conference on Learning Representations, 2025
work page 2025
-
[21]
Mdetr-modulated detection for end-to-end multi-modal understanding
Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra, and Nicolas Carion. Mdetr-modulated detection for end-to-end multi-modal understanding. InProceedings of the IEEE/CVF international conference on computer vision, pages 1780–1790, 2021
work page 2021
-
[22]
Referitgame: Referring to objects in photographs of natural scenes
Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. InProceedings of the 2014 conference on empirical methods in natural language processing, pages 787–798, 2014
work page 2014
-
[23]
Video object segmentation with referring expressions
Anna Khoreva, Anna Rohrbach, and Brent Schiele. Video object segmentation with referring expressions. InProceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018
work page 2018
-
[24]
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. InProceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023
work page 2023
-
[25]
Lisa: Reasoning segmentation via large language model
Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9579–9589, 2024
work page 2024
-
[26]
Text4seg: Reimagining image segmentation as text generation
Mengcheng Lan, Chaofeng Chen, Yue Zhou, Jiaxing Xu, Yiping Ke, Xinjiang Wang, Litong Feng, and Wayne Zhang. Text4seg: Reimagining image segmentation as text generation. InThe Thirteenth International Conference on Learning Representations, 2025
work page 2025
-
[27]
Grounded language-image pre-training
Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10965–10975, 2022
work page 2022
-
[28]
Gligen: Open-set grounded text-to-image generation
Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22511–22521, 2023
work page 2023
-
[29]
Open-vocabulary semantic segmentation with mask-adapted clip
Feng Liang, Bichen Wu, Xiaoliang Dai, Kunpeng Li, Yinan Zhao, Hang Zhang, Peizhao Zhang, Peter Vajda, and Diana Marculescu. Open-vocabulary semantic segmentation with mask-adapted clip. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7061–7070, 2023
work page 2023
-
[30]
Glus: Global-local reasoning unified into a single large language model for video segmentation
Lang Lin, Xueyang Yu, Ziqi Pang, and Yu-Xiong Wang. Glus: Global-local reasoning unified into a single large language model for video segmentation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 8658–8667, 2025
work page 2025
-
[31]
Gres: Generalized referring expression segmentation
Chang Liu, Henghui Ding, and Xudong Jiang. Gres: Generalized referring expression segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 23592–23601, 2023
work page 2023
-
[32]
Recurrent multimodal interaction for referring image segmentation
Chenxi Liu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, and Alan Yuille. Recurrent multimodal interaction for referring image segmentation. InProceedings of the IEEE international conference on computer vision, pages 1271–1280, 2017
work page 2017
-
[33]
Grounding dino: Marrying dino with grounded pre-training for open-set object detection
Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. InEuropean conference on computer vision, pages 38–55, 2024
work page 2024
-
[34]
Universal segmen- tation at arbitrary granularity with language instruction
Yong Liu, Cairong Zhang, Yitong Wang, Jiahao Wang, Yujiu Yang, and Yansong Tang. Universal segmen- tation at arbitrary granularity with language instruction. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3459–3469, 2024. 14
work page 2024
-
[35]
Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement
Yuqi Liu, Bohao Peng, Zhisheng Zhong, Zihao Yue, Fanbin Lu, Bei Yu, and Jiaya Jia. Seg-zero: Reasoning- chain guided segmentation via cognitive reinforcement.arXiv preprint arXiv:2503.06520, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[36]
Visionreasoner: Unified reasoning-integrated visual perception via reinforcement learning
Yuqi Liu, Tianyuan Qu, Zhisheng Zhong, Bohao Peng, Shu Liu, Bei Yu, and Jiaya Jia. Visionreasoner: Unified reasoning-integrated visual perception via reinforcement learning. InThe Fourteenth International Conference on Learning Representations, 2026
work page 2026
-
[37]
Rsvp: Reasoning segmentation via visual prompting and multi-modal chain-of-thought
Yi Lu, Jiawang Cao, Yongliang Wu, Bozheng Li, Licheng Tang, Yangguang Ji, Chong Wu, Jay Wu, and Wenbo Zhu. Rsvp: Reasoning segmentation via visual prompting and multi-modal chain-of-thought. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, pages 14699–14716, 2025
work page 2025
-
[38]
Image segmentation using text and image prompts
Timo Lüddecke and Alexander Ecker. Image segmentation using text and image prompts. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7086–7096, 2022
work page 2022
-
[39]
Zhuoyan Luo, Yicheng Xiao, Yong Liu, Shuyan Li, Yitong Wang, Yansong Tang, Xiu Li, and Yujiu Yang. Soc: Semantic-assisted object cluster for referring video object segmentation.Advances in Neural Information Processing Systems, pages 26425–26437, 2023
work page 2023
-
[40]
Generation and comprehension of unambiguous object descriptions
Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 11–20, 2016
work page 2016
-
[41]
Spectrum-guided multi-granularity referring video object segmentation
Bo Miao, Mohammed Bennamoun, Yongsheng Gao, and Ajmal Mian. Spectrum-guided multi-granularity referring video object segmentation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 920–930, 2023
work page 2023
-
[42]
Simple open- vocabulary object detection
Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Doso- vitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, et al. Simple open- vocabulary object detection. InEuropean conference on computer vision, pages 728–755, 2022
work page 2022
-
[43]
Videoglamm: A large multimodal model for pixel-level visual grounding in videos
Shehan Munasinghe, Hanan Gani, Wenqi Zhu, Jiale Cao, Eric Xing, Fahad Shahbaz Khan, and Salman Khan. Videoglamm: A large multimodal model for pixel-level visual grounding in videos. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 19036–19046, 2025
work page 2025
-
[44]
Glamm: Pixel grounding large multimodal model
Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S Khan. Glamm: Pixel grounding large multimodal model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13009–13018, 2024
work page 2024
-
[45]
Sam 2: Segment anything in images and videos
Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. InThe Thirteenth International Conference on Learning Representations, 2025
work page 2025
-
[46]
Pixellm: Pixel reasoning with large multimodal model
Zhongwei Ren, Zhicheng Huang, Yunchao Wei, Yao Zhao, Dongmei Fu, Jiashi Feng, and Xiaojie Jin. Pixellm: Pixel reasoning with large multimodal model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26374–26383, 2024
work page 2024
-
[47]
Urvos: Unified referring video object segmentation network with a large-scale benchmark
Seonguk Seo, Joon-Young Lee, and Bohyung Han. Urvos: Unified referring video object segmentation network with a large-scale benchmark. InEuropean conference on computer vision, pages 208–223, 2020
work page 2020
-
[48]
Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaugh- lin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[49]
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.arXiv preprint arXiv:2403.05530, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[50]
Videoanydoor: High-fidelity video object insertion with precise motion control
Yuanpeng Tu, Hao Luo, Xi Chen, Sihui Ji, Xiang Bai, and Hengshuang Zhao. Videoanydoor: High-fidelity video object insertion with precise motion control. InProceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers, pages 1–11, 2025
work page 2025
-
[51]
X-sam: From segment anything to any segmentation
Hao Wang, Limeng Qiao, Zequn Jie, Zhijian Huang, Chengjian Feng, Qingfang Zheng, Lin Ma, Xiangyuan Lan, and Xiaodan Liang. X-sam: From segment anything to any segmentation. InProceedings of the AAAI Conference on Artificial Intelligence, pages 26187–26196, 2026. 15
work page 2026
-
[52]
Jingchao Wang, Zhijian Wu, Dingjiang Huang, Yefeng Zheng, and Hong Wang. Unlocking the po- tential of mllms in referring expression segmentation via a light-weight mask decoder.arXiv preprint arXiv:2508.04107, 2025
-
[53]
Un- veiling parts beyond objects: Towards finer-granularity referring expression segmentation
Wenxuan Wang, Tongtian Yue, Yisi Zhang, Longteng Guo, Xingjian He, Xinlong Wang, and Jing Liu. Un- veiling parts beyond objects: Towards finer-granularity referring expression segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12998–13008, 2024
work page 2024
-
[54]
Zanyi Wang, Dengyang Jiang, Liuzhuozheng Li, Sizhe Dang, Chengzu Li, Harry Yang, Guang Dai, Mengmeng Wang, and Jingdong Wang. Deforming videos to masks: Flow matching for referring video segmentation.arXiv preprint arXiv:2510.06139, 2025
-
[55]
Hyperseg: Hybrid segmentation assistant with fine-grained visual perceiver
Cong Wei, Yujie Zhong, Haoxian Tan, Yong Liu, Jie Hu, Dengjie Li, Zheng Zhao, and Yujiu Yang. Hyperseg: Hybrid segmentation assistant with fine-grained visual perceiver. InProceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pages 8931–8941, June 2025
work page 2025
-
[56]
Instructseg: Unifying instructed visual segmentation with multi-modal large language models
Cong Wei, Yujie Zhong, Haoxian Tan, Yingsen Zeng, Yong Liu, Hongfa Wang, and Yujiu Yang. Instructseg: Unifying instructed visual segmentation with multi-modal large language models. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 20193–20203, 2025
work page 2025
-
[57]
Onlinerefer: A simple online baseline for referring video object segmentation
Dongming Wu, Tiancai Wang, Yuang Zhang, Xiangyu Zhang, and Jianbing Shen. Onlinerefer: A simple online baseline for referring video object segmentation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2761–2770, 2023
work page 2023
-
[58]
Language as queries for referring video object segmentation
Jiannan Wu, Yi Jiang, Peize Sun, Zehuan Yuan, and Ping Luo. Language as queries for referring video object segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4974–4984, 2022
work page 2022
-
[59]
General object foundation model for images and videos at scale
Junfeng Wu, Yi Jiang, Qihao Liu, Zehuan Yuan, Xiang Bai, and Song Bai. General object foundation model for images and videos at scale. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3783–3795, 2024
work page 2024
-
[60]
Gsva: Generalized segmentation via multimodal large language models
Zhuofan Xia, Dongchen Han, Yizeng Han, Xuran Pan, Shiji Song, and Gao Huang. Gsva: Generalized segmentation via multimodal large language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3858–3869, 2024
work page 2024
-
[61]
Region-based cluster discrimination for visual representation learning
Yin Xie, Kaicheng Yang, Xiang An, Kun Wu, Yongle Zhao, Weimo Deng, Zimin Ran, Yumeng Wang, Ziyong Feng, Roy Miles, et al. Region-based cluster discrimination for visual representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1793–1803, 2025
work page 2025
-
[62]
Yanming Xiu, Tim Scargill, and Maria Gorlatova. Viddar: Vision language model-based task-detrimental content detection for augmented reality.IEEE transactions on visualization and computer graphics, 2025
work page 2025
-
[63]
Visa: Reasoning video object segmentation via large language models
Cilin Yan, Haochen Wang, Shilin Yan, Xiaolong Jiang, Yao Hu, Guoliang Kang, Weidi Xie, and Efstratios Gavves. Visa: Reasoning video object segmentation via large language models. InEuropean Conference on Computer Vision, pages 98–115, 2024
work page 2024
-
[64]
Lavt: Language- aware vision transformer for referring image segmentation
Zhao Yang, Jiaqi Wang, Yansong Tang, Kai Chen, Hengshuang Zhao, and Philip HS Torr. Lavt: Language- aware vision transformer for referring image segmentation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18155–18165, 2022
work page 2022
-
[65]
Mattnet: Modular attention network for referring expression comprehension
Licheng Yu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, and Tamara L Berg. Mattnet: Modular attention network for referring expression comprehension. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 1307–1315, 2018
work page 2018
-
[66]
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
Haobo Yuan, Xiangtai Li, Tao Zhang, Yueyi Sun, Zilong Huang, Shilin Xu, Shunping Ji, Yunhai Tong, Lu Qi, Jiashi Feng, et al. Sa2va: Marrying sam2 with llava for dense grounded understanding of images and videos.arXiv preprint arXiv:2501.04001, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[67]
Omg-llava: Bridging image-level, object-level, pixel-level reasoning and understanding
Tao Zhang, Xiangtai Li, Hao Fei, Haobo Yuan, Shengqiong Wu, Shunping Ji, Chen Change Loy, and Shuicheng Yan. Omg-llava: Bridging image-level, object-level, pixel-level reasoning and understanding. Advances in neural information processing systems, 37:71737–71767, 2024
work page 2024
-
[68]
Evf-sam: Early vision-language fusion for text-prompted segment anything model
Yuxuan Zhang, Tianheng Cheng, Lianghui Zhu, Rui Hu, Lei Liu, Heng Liu, Longjin Ran, Xiaoxin Chen, Wenyu Liu, and Xinggang Wang. Evf-sam: Early vision-language fusion for text-prompted segment anything model.arXiv preprint arXiv:2406.20076, 2024
-
[69]
Psalm: Pixelwise segmentation with large multi-modal model
Zheng Zhang, Yeyao Ma, Enming Zhang, and Xiang Bai. Psalm: Pixelwise segmentation with large multi-modal model. InEuropean Conference on Computer Vision, pages 74–91. Springer, 2024. 16
work page 2024
-
[70]
Sec: Advancing complex video object segmentation via progressive concept construction
Zhixiong Zhang, Shuangrui Ding, Xiaoyi Dong, Songxin He, Jianfan Lin, Junsong Tang, Yuhang Zang, Yuhang Cao, Dahua Lin, and Jiaqi Wang. Sec: Advancing complex video object segmentation via progressive concept construction. InThe Fourteenth International Conference on Learning Representations, 2026
work page 2026
-
[71]
Villa: Video reasoning segmentation with large language model
Rongkun Zheng, Lu Qi, Xi Chen, Yi Wang, Kun Wang, and Hengshuang Zhao. Villa: Video reasoning segmentation with large language model. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 23667–23677, 2025
work page 2025
-
[72]
Regionclip: Region-based language-image pretraining
Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, et al. Regionclip: Region-based language-image pretraining. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16793–16803, 2022
work page 2022
-
[73]
Tracking with human-intent reasoning.arXiv preprint arXiv:2312.17448, 2023
Jiawen Zhu, Zhi-Qi Cheng, Jun-Yan He, Chenyang Li, Bin Luo, Huchuan Lu, Yifeng Geng, and Xuansong Xie. Tracking with human-intent reasoning.arXiv preprint arXiv:2312.17448, 2023
-
[74]
Training-free spatio-temporal decoupled reasoning video segmentation with adaptive object memory
Zhengtong Zhu, Jiaqing Fan, Zhixuan Liu, and Fanzhang Li. Training-free spatio-temporal decoupled reasoning video segmentation with adaptive object memory. InProceedings of the AAAI Conference on Artificial Intelligence, pages 14022–14030, 2026
work page 2026
-
[75]
Rt-2: Vision-language-action models transfer web knowledge to robotic control
Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. InConference on Robot Learning, pages 2165–2183, 2023
work page 2023
-
[76]
Generalized decoding for pixel, image, and language
Xueyan Zou, Zi-Yi Dou, Jianwei Yang, Zhe Gan, Linjie Li, Chunyuan Li, Xiyang Dai, Harkirat Behl, Jianfeng Wang, Lu Yuan, et al. Generalized decoding for pixel, image, and language. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15116–15127, 2023
work page 2023
-
[77]
Xueyan Zou, Jianwei Yang, Hao Zhang, Feng Li, Linjie Li, Jianfeng Wang, Lijuan Wang, Jianfeng Gao, and Yong Jae Lee. Segment everything everywhere all at once.Advances in neural information processing systems, 36:19769–19782, 2023. 17
work page 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.