SetCon: Towards Open-Ended Referring Segmentation via Set-Level Concept Prediction

Jiaqi Wang; Long Xing; Qiaosheng Zhang; Shengyuan Ding; Shuangrui Ding; Yibin Wang; Yizhuo Li; Yuhang Zang; Zhixiong Zhang

REVIEW 1 major objections 2 minor 1 cited by

Reformulating referring segmentation as set-level natural language concept prediction lets models handle multiple and open-ended targets more accurately than special tokens allow.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric. the ladder, T0–T4 →

Challenge this review Re-run · record.json Download PDF Read on arXiv ↗

T0 review · grok-4.3

2026-05-20 05:20 UTC pith:FTICY4F3

load-bearing objection SetCon gets measurable gains on multi-target referring segmentation by switching to hierarchical natural-language concepts, but the contribution of that interface versus the new annotations is not cleanly isolated. the 1 major comments →

arxiv 2605.20110 v1 pith:FTICY4F3 submitted 2026-05-19 cs.CV

SetCon: Towards Open-Ended Referring Segmentation via Set-Level Concept Prediction

Zhixiong Zhang , Yizhuo Li , Shuangrui Ding , Yuhang Zang , Shengyuan Ding , Long Xing , Yibin Wang , Qiaosheng Zhang

show 1 more author

Jiaqi Wang

This is my paper

classification cs.CV

keywords referring segmentationset-level concept predictionlarge vision language modelshierarchical semantic decompositionopen-ended segmentationmulti-target groundingvideo referring segmentation

verification ladder T0 review T1 audit T2 compute T3 formal T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that previous LVLM-based methods for referring segmentation treat each target with separate special tokens and therefore miss collective set properties such as completeness and mutual exclusivity. Instead it generates natural-language concepts from the LVLM and uses those concepts as conditions for decoding an entire mask set at once. A hierarchical decomposition first produces one shared set-level concept that defines the overall scope and then breaks it into finer concept groups aligned with subsets of targets. This matters because many practical queries involve groups, cross-category collections, or variable numbers of objects where independent token outputs produce incomplete or redundant masks. The approach is supported by a new annotation pipeline that adds hierarchical concept labels to existing datasets.

Core claim

The central claim is that open-ended referring segmentation is best solved by explicit set-level concept prediction: an LVLM first produces a shared natural-language concept describing the target scope, which is then refined into fine-grained concept groups that serve as semantic conditions for joint mask-set decoding, replacing the earlier practice of emitting sequential special segmentation tokens.

What carries the argument

hierarchical semantic decomposition that predicts a shared set-level concept before refining it into fine-grained concept groups for joint mask-set decoding

Load-bearing premise

LVLM-generated natural-language concepts reliably capture set-level properties such as completeness and mutual exclusivity better than special segmentation tokens.

What would settle it

A controlled test in which a special-token baseline receives identical hierarchical concept supervision yet still shows no performance gain over the original token method on multi-target subsets of gRefCOCO or MUSE.

Watch this falsifier — get emailed when new claim-graph text bears on it.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

SetCon gets measurable gains on multi-target referring segmentation by switching to hierarchical natural-language concepts, but the contribution of that interface versus the new annotations is not cleanly isolated.

read the letter

Here's my read on the SetCon paper. The headline is that they get real improvements on open-ended and multi-target referring segmentation by moving to explicit set-level concept prediction using natural language from an LVLM, instead of special tokens. The hierarchical decomposition into shared set concept and then fine-grained groups is a fresh angle, and the results show the advantage widening as the number of targets goes up. What they do well is demonstrate the practical payoff. On gRefCOCO they add 3.3 gIoU and on MUSE it's 12.1, with the gap increasing for more complex queries. The transfer to video under detect-and-track also lands new SOTA numbers on seven benchmarks. Adding the hierarchical annotations to existing datasets is a solid engineering step that makes the supervision richer. The soft spot is the one the stress test flags. They introduce a two-stage pipeline for 236k samples and 784k concept phrases, but there's no direct comparison to a token-based baseline trained on that same augmented data. Without that, it's hard to know if the set properties like completeness and mutual exclusivity are truly better captured by the concept interface or if it's the extra data doing the work. If the paper has internal ablations on the hierarchy or concept generation, that would address it, but from what's described it feels like a missing control. This paper is aimed at researchers in referring expression comprehension and LVLM applications to segmentation. A reader interested in interactive vision or video understanding would get value from the empirical results and the reformulation idea. It deserves a serious referee because the core claim is testable and the benchmarks show consistent gains, even if some additional experiments would make the argument tighter. I'd recommend sending it out for review with a note to strengthen the isolation of the method's contribution.

Referee Report

1 major / 2 minor

Summary. The paper proposes SetCon for open-ended referring segmentation, reformulating the task as explicit set-level concept prediction. It replaces special segmentation tokens with LVLM-generated natural-language concepts, using a hierarchical semantic decomposition that first predicts a shared set-level concept and then refines it into fine-grained concept groups. A two-stage annotation pipeline augments existing datasets with 236k samples and 784k concept phrases. SetCon reports state-of-the-art results on image benchmarks (+3.3 gIoU on gRefCOCO, +12.1 gIoU on MUSE) with margins that increase as the number of referred targets grows, and transfers to a detect-and-track setting for new state-of-the-art results on seven referring video benchmarks.

Significance. If the results hold and the set-level concept prediction is isolated as the driver of improved handling of completeness and mutual exclusivity, the work could meaningfully advance referring segmentation for complex multi-target and open-ended scenarios. The creation of augmented hierarchical annotations (236k samples, 784k phrases) and the successful transfer to video benchmarks are concrete strengths that provide new resources and demonstrate broader applicability.

major comments (1)

[Experiments / Method] The central claim that LVLM-generated natural-language concepts (via hierarchical decomposition) enable better capture of set-level properties than special tokens is load-bearing but unisolated. No ablation compares SetCon to a token-based decoder trained on the identical augmented dataset; without this, gains could derive from the new hierarchical supervision rather than the concept interface itself. This directly affects the interpretation of growing margins with increasing target count.

minor comments (2)

[Abstract] The abstract and methods would benefit from explicit verification that set-level properties (completeness, mutual exclusivity) are actually enforced by the concept conditions, beyond reporting aggregate gIoU/J&F gains.
[Method] Provide more detail on the joint mask-set decoding architecture and how the LVLM-generated concepts are injected as semantic conditions.

Simulated Author's Rebuttal

1 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript. We address the major comment point by point below, providing clarifications and outlining revisions where appropriate.

read point-by-point responses

Referee: [Experiments / Method] The central claim that LVLM-generated natural-language concepts (via hierarchical decomposition) enable better capture of set-level properties than special tokens is load-bearing but unisolated. No ablation compares SetCon to a token-based decoder trained on the identical augmented dataset; without this, gains could derive from the new hierarchical supervision rather than the concept interface itself. This directly affects the interpretation of growing margins with increasing target count.

Authors: We agree that isolating the contribution of the natural-language concept interface from the hierarchical supervision is important for strengthening the central claim. The hierarchical decomposition is implemented via LVLM-generated set-level and group-level concepts that serve as semantic conditions for joint decoding, which we argue provides an explicit mechanism for capturing set properties such as completeness and mutual exclusivity. However, to directly address the concern, we will add an ablation in the revised manuscript: a token-based decoder variant trained on the identical augmented dataset (236k samples with hierarchical annotations) but using special segmentation tokens in place of the generated concepts. This comparison will clarify the source of the observed gains, including the increasing margins with higher target counts. We believe the results will support the role of the concept interface, but the new experiment will provide the requested isolation. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper reformulates referring segmentation as set-level concept prediction using LVLM-generated natural-language concepts and a hierarchical decomposition, supported by a new two-stage annotation pipeline that augments datasets with 236k samples and 784k phrases. Performance gains are reported empirically on benchmarks with growing margins for multi-target cases, without any equations or steps that reduce predictions to fitted parameters by construction, self-definitional loops, or load-bearing self-citations. The central claims rest on the introduced interface and annotations rather than tautological redefinitions of inputs, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the ability of LVLMs to produce useful set-level concepts and on the assumption that the new annotation pipeline supplies faithful hierarchical supervision.

axioms (1)

domain assumption LVLMs can generate natural-language concepts that serve as effective semantic conditions for segmentation decoding
Invoked when replacing segmentation-specific tokens with LVLM-generated concepts for joint mask-set decoding.

invented entities (1)

Hierarchical semantic decomposition into set-level and fine-grained concept groups no independent evidence
purpose: To predict a shared concept defining target scope then refine it for target subsets
New structure introduced to capture set-level properties; no independent evidence provided in abstract.

pith-pipeline@v0.9.0 · 5824 in / 1326 out tokens · 34727 ms · 2026-05-20T05:20:09.465844+00:00 · methodology

0 comments

read the original abstract

Referring segmentation grounds natural-language queries to pixel-level masks, but extending it to complex scenarios with multiple instances, cross-category groups, or open-ended target sets remains challenging. Previous Large Vision Language Model (LVLM)-based methods represent referred targets with one or more special tokens sequentially, treating multiple targets as separate outputs rather than a coherent set and offering little incentive to capture set-level properties such as completeness and mutual exclusivity. We reformulate open-ended referring segmentation as explicit set-level concept prediction and propose Set-Concept Segmentation (SetCon), which uses LVLM-generated natural-language concepts, instead of segmentation-specific tokens, as semantic conditions for joint mask-set decoding. A hierarchical semantic decomposition first predicts a shared set-level concept defining the target scope and then refines it into fine-grained concept groups aligned with target subsets. To support this, a two-stage annotation pipeline augments existing reasoning segmentation datasets with hierarchical semantic supervision (236k samples, 784k concept phrases). SetCon achieves state-of-the-art results on image benchmarks (+3.3 gIoU on gRefCOCO, +12.1 gIoU on MUSE), with margins that grow as the number of referred targets increases. The concept interface also transfers to video under a detect-and-track setting, yielding new state-of-the-art results on seven referring video benchmarks, including +10.9 J&F on MeViS and +12.4 J&F on Ref-SeCVOS.

Figures

Figures reproduced from arXiv: 2605.20110 by Jiaqi Wang, Long Xing, Qiaosheng Zhang, Shengyuan Ding, Shuangrui Ding, Yibin Wang, Yizhuo Li, Yuhang Zang, Zhixiong Zhang.

**Figure 1.** Figure 1: Overview of Set-Concept Segmentation (SETCON). (a) Compared to previous per-token formulation, SETCON produces more distinct and complete mask sets in open-ended scenarios. (b) SETCON predicts interpretable, hierarchical set-level concepts and uses them as semantic conditions for mask-set decoding instead of indistinguishable special tokens. (c) Quantitative results show that SETCON achieves the best perfo… view at source ↗

**Figure 2.** Figure 2: Pilot study motivating explicit set-level concept prediction. (a) special-token baselines degrade sharply as the target count grows, while SETCON remains comparatively stable. (b) t-SNE projection of [SEG] representations from the Sa2VA-based baseline, colored by semantic category and 2D spatial position; clusters align more clearly with position than with category. semantic supervision via a two-stage ann… view at source ↗

**Figure 3.** Figure 3: Architecture of SETCON. Given a text query Q and visual input V, the LVLM produces a response R containing a global set-level concept and its decomposed sub-category concepts. The multimodal decoder is trainable while the image encoder and detector decoder remain frozen, to jointly predict the mask set with per-target labels. task toward reasoning segmentation with open-ended instructions. In the video dom… view at source ↗

**Figure 4.** Figure 4: Ablation analysis of SETCON on MUSE. (a) t-SNE projection of segmentation conditions, colored by category: explicit concept conditioning yields tighter, more category-aligned clusters than the token-only variant. (b) F1@0.5 versus the number of referred categories (1–5): hierarchical decomposition consistently outperforms the flat variant, with the gap widening as cardinality grows. suggesting that modelin… view at source ↗

**Figure 5.** Figure 5: Qualitative comparison on in-the-wild images. We show the user prompt, input image, the LISA-format baseline, and SETCON, with the predicted sub-category concepts listed below. Across reasoning-style queries spanning multi-instance, cross-category, and open-ended scenarios, SETCON tends to produce more complete and semantically coherent mask sets. USER: If a tennis player wanted to adjust his gear during a… view at source ↗

**Figure 6.** Figure 6: Failure cases on the MUSE benchmark. Open-ended queries expose limitations in target disambiguation and concept granularity, which may lead to incorrect or incomplete predictions. Qualitative Results. To more intuitively demonstrate the segmentation performance of our framework in real-world scenarios, [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗

**Figure 7.** Figure 7: Additional qualitative results on referring video object segmentation. SETCON maintains stable target identities and temporally consistent masks across challenging video sequences with occlusion, distractors, and appearance changes, by using explicit semantic concepts as persistent anchors shared across frames. We provide additional qualitative results on referring video object segmentation in [PITH_FULL_… view at source ↗

**Figure 8.** Figure 8: Statistics of the training corpus produced by the proposed two-stage hierarchical annotation pipeline, including the per-sample distribution of sub-categories, and the length distribution of concept phrases. with the goal of replacing rigid category names (e.g., light) by more contextually accurate phrases (e.g., traffic light, red bicycle). The source label is supplied explicitly as an anchor to suppress… view at source ↗

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games
cs.CV 2026-06 unverdicted novelty 7.0

RNG-Bench evaluates MLLMs on hidden-observation reconstruction in non-Markov games, finds forgetting as the dominant error source, and shows fine-tuning on optimal rollouts improves performance with transfer to other ...

Reference graph

Works this paper leans on

77 extracted references · 77 canonical work pages · cited by 1 Pith paper · 5 internal anchors

[1]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

One token to seg them all: Language instructed reasoning segmentation in videos.Advances in Neural Information Processing Systems, pages 6833–6859, 2024

Zechen Bai, Tong He, Haiyang Mei, Pichao Wang, Ziteng Gao, Joya Chen, Lei Liu, Zheng Zhang, and Mike Z Shou. One token to seg them all: Language instructed reasoning segmentation in videos.Advances in Neural Information Processing Systems, pages 6833–6859, 2024

work page 2024
[3]

Sam 3: Segment anything with concepts

Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. Sam 3: Segment anything with concepts. InThe Fourteenth International Conference on Learning Representations, 2026

work page 2026
[4]

End-to-end object detection with transformers

Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. InEuropean conference on computer vision, pages 213–229, 2020

work page 2020
[5]

Sam4mllm: Enhance multi-modal large language model for referring expression segmentation

Yi-Chia Chen, Wei-Hua Li, Cheng Sun, Yu-Chiang Frank Wang, and Chu-Song Chen. Sam4mllm: Enhance multi-modal large language model for referring expression segmentation. InEuropean Conference on Computer Vision, pages 323–340. Springer, 2024

work page 2024
[6]

Samwise: Infusing wisdom in sam2 for text-driven video segmentation

Claudia Cuttano, Gabriele Trivigno, Gabriele Rosi, Carlo Masone, and Giuseppe Averta. Samwise: Infusing wisdom in sam2 for text-driven video segmentation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 3395–3405, 2025

work page 2025
[7]

Mevis: A large-scale benchmark for video segmentation with motion expressions

Henghui Ding, Chang Liu, Shuting He, Xudong Jiang, and Chen Change Loy. Mevis: A large-scale benchmark for video segmentation with motion expressions. InProceedings of the IEEE/CVF international conference on computer vision, pages 2694–2703, 2023

work page 2023
[8]

Mevis: A multi-modal dataset for referring motion expression video segmentation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

Henghui Ding, Chang Liu, Shuting He, Kaining Ying, Xudong Jiang, Chen Change Loy, and Yu-Gang Jiang. Mevis: A multi-modal dataset for referring motion expression video segmentation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

work page 2025
[9]

Vlt: Vision-language transformer and query generation for referring segmentation.IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 7900–7916, 2022

Henghui Ding, Chang Liu, Suchen Wang, and Xudong Jiang. Vlt: Vision-language transformer and query generation for referring segmentation.IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 7900–7916, 2022

work page 2022
[10]

Sam2long: Enhancing sam 2 for long video segmentation with a training-free memory tree

Shuangrui Ding, Rui Qian, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Yuwei Guo, Dahua Lin, and Jiaqi Wang. Sam2long: Enhancing sam 2 for long video segmentation with a training-free memory tree. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 13614–13624, 2025

work page 2025
[11]

Language-bridged spatial-temporal interaction for referring video object segmentation

Zihan Ding, Tianrui Hui, Junshi Huang, Xiaoming Wei, Jizhong Han, and Si Liu. Language-bridged spatial-temporal interaction for referring video object segmentation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4964–4973, 2022

work page 2022
[12]

Palm-e: an embodied multimodal language model

Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: an embodied multimodal language model. InProceedings of the 40th International Conference on Machine Learning, pages 8469–8488, 2023

work page 2023
[13]

The devil is in temporal token: High quality video reasoning segmentation

Sitong Gong, Yunzhi Zhuge, Lu Zhang, Zongxin Yang, Pingping Zhang, and Huchuan Lu. The devil is in temporal token: High quality video reasoning segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 29183–29192, 2025

work page 2025
[14]

Anomalygpt: Detecting industrial anomalies using large vision-language models

Zhaopeng Gu, Bingke Zhu, Guibo Zhu, Yingying Chen, Ming Tang, and Jinqiao Wang. Anomalygpt: Detecting industrial anomalies using large vision-language models. InProceedings of the AAAI conference on artificial intelligence, pages 1932–1940, 2024

work page 1932
[15]

Html: Hybrid temporal-scale multimodal learning framework for referring video object segmentation

Mingfei Han, Yali Wang, Zhihui Li, Lina Yao, Xiaojun Chang, and Yu Qiao. Html: Hybrid temporal-scale multimodal learning framework for referring video object segmentation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 13414–13423, 2023

work page 2023
[16]

Decoupling static and hierarchical motion perception for referring video segmentation

Shuting He and Henghui Ding. Decoupling static and hierarchical motion perception for referring video segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13332–13341, 2024

work page 2024
[17]

Segmentation from natural language expressions

Ronghang Hu, Marcus Rohrbach, and Trevor Darrell. Segmentation from natural language expressions. In European conference on computer vision, pages 108–124, 2016. 13

work page 2016
[18]

Beyond one-to-one: Rethinking the referring image segmentation

Yutao Hu, Qixiong Wang, Wenqi Shao, Enze Xie, Zhenguo Li, Jungong Han, and Ping Luo. Beyond one-to-one: Rethinking the referring image segmentation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4067–4077, 2023

work page 2023
[19]

Densely connected parameter-efficient tuning for referring image segmentation

Jiaqi Huang, Zunnan Xu, Ting Liu, Yong Liu, Haonan Han, Kehong Yuan, and Xiu Li. Densely connected parameter-efficient tuning for referring image segmentation. InProceedings of the AAAI Conference on Artificial Intelligence, pages 3653–3661, 2025

work page 2025
[20]

Mmr: A large-scale benchmark dataset for multi-target and multi-granularity reasoning segmentation

Donggon Jang, Yucheol Cho, Suin Lee, Taehyeon Kim, and Daeshik Kim. Mmr: A large-scale benchmark dataset for multi-target and multi-granularity reasoning segmentation. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[21]

Mdetr-modulated detection for end-to-end multi-modal understanding

Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra, and Nicolas Carion. Mdetr-modulated detection for end-to-end multi-modal understanding. InProceedings of the IEEE/CVF international conference on computer vision, pages 1780–1790, 2021

work page 2021
[22]

Referitgame: Referring to objects in photographs of natural scenes

Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. InProceedings of the 2014 conference on empirical methods in natural language processing, pages 787–798, 2014

work page 2014
[23]

Video object segmentation with referring expressions

Anna Khoreva, Anna Rohrbach, and Brent Schiele. Video object segmentation with referring expressions. InProceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018

work page 2018
[24]

Segment anything

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. InProceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023

work page 2023
[25]

Lisa: Reasoning segmentation via large language model

Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9579–9589, 2024

work page 2024
[26]

Text4seg: Reimagining image segmentation as text generation

Mengcheng Lan, Chaofeng Chen, Yue Zhou, Jiaxing Xu, Yiping Ke, Xinjiang Wang, Litong Feng, and Wayne Zhang. Text4seg: Reimagining image segmentation as text generation. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[27]

Grounded language-image pre-training

Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10965–10975, 2022

work page 2022
[28]

Gligen: Open-set grounded text-to-image generation

Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22511–22521, 2023

work page 2023
[29]

Open-vocabulary semantic segmentation with mask-adapted clip

Feng Liang, Bichen Wu, Xiaoliang Dai, Kunpeng Li, Yinan Zhao, Hang Zhang, Peizhao Zhang, Peter Vajda, and Diana Marculescu. Open-vocabulary semantic segmentation with mask-adapted clip. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7061–7070, 2023

work page 2023
[30]

Glus: Global-local reasoning unified into a single large language model for video segmentation

Lang Lin, Xueyang Yu, Ziqi Pang, and Yu-Xiong Wang. Glus: Global-local reasoning unified into a single large language model for video segmentation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 8658–8667, 2025

work page 2025
[31]

Gres: Generalized referring expression segmentation

Chang Liu, Henghui Ding, and Xudong Jiang. Gres: Generalized referring expression segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 23592–23601, 2023

work page 2023
[32]

Recurrent multimodal interaction for referring image segmentation

Chenxi Liu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, and Alan Yuille. Recurrent multimodal interaction for referring image segmentation. InProceedings of the IEEE international conference on computer vision, pages 1271–1280, 2017

work page 2017
[33]

Grounding dino: Marrying dino with grounded pre-training for open-set object detection

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. InEuropean conference on computer vision, pages 38–55, 2024

work page 2024
[34]

Universal segmen- tation at arbitrary granularity with language instruction

Yong Liu, Cairong Zhang, Yitong Wang, Jiahao Wang, Yujiu Yang, and Yansong Tang. Universal segmen- tation at arbitrary granularity with language instruction. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3459–3469, 2024. 14

work page 2024
[35]

Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement

Yuqi Liu, Bohao Peng, Zhisheng Zhong, Zihao Yue, Fanbin Lu, Bei Yu, and Jiaya Jia. Seg-zero: Reasoning- chain guided segmentation via cognitive reinforcement.arXiv preprint arXiv:2503.06520, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[36]

Visionreasoner: Unified reasoning-integrated visual perception via reinforcement learning

Yuqi Liu, Tianyuan Qu, Zhisheng Zhong, Bohao Peng, Shu Liu, Bei Yu, and Jiaya Jia. Visionreasoner: Unified reasoning-integrated visual perception via reinforcement learning. InThe Fourteenth International Conference on Learning Representations, 2026

work page 2026
[37]

Rsvp: Reasoning segmentation via visual prompting and multi-modal chain-of-thought

Yi Lu, Jiawang Cao, Yongliang Wu, Bozheng Li, Licheng Tang, Yangguang Ji, Chong Wu, Jay Wu, and Wenbo Zhu. Rsvp: Reasoning segmentation via visual prompting and multi-modal chain-of-thought. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, pages 14699–14716, 2025

work page 2025
[38]

Image segmentation using text and image prompts

Timo Lüddecke and Alexander Ecker. Image segmentation using text and image prompts. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7086–7096, 2022

work page 2022
[39]

Soc: Semantic-assisted object cluster for referring video object segmentation.Advances in Neural Information Processing Systems, pages 26425–26437, 2023

Zhuoyan Luo, Yicheng Xiao, Yong Liu, Shuyan Li, Yitong Wang, Yansong Tang, Xiu Li, and Yujiu Yang. Soc: Semantic-assisted object cluster for referring video object segmentation.Advances in Neural Information Processing Systems, pages 26425–26437, 2023

work page 2023
[40]

Generation and comprehension of unambiguous object descriptions

Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 11–20, 2016

work page 2016
[41]

Spectrum-guided multi-granularity referring video object segmentation

Bo Miao, Mohammed Bennamoun, Yongsheng Gao, and Ajmal Mian. Spectrum-guided multi-granularity referring video object segmentation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 920–930, 2023

work page 2023
[42]

Simple open- vocabulary object detection

Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Doso- vitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, et al. Simple open- vocabulary object detection. InEuropean conference on computer vision, pages 728–755, 2022

work page 2022
[43]

Videoglamm: A large multimodal model for pixel-level visual grounding in videos

Shehan Munasinghe, Hanan Gani, Wenqi Zhu, Jiale Cao, Eric Xing, Fahad Shahbaz Khan, and Salman Khan. Videoglamm: A large multimodal model for pixel-level visual grounding in videos. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 19036–19046, 2025

work page 2025
[44]

Glamm: Pixel grounding large multimodal model

Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S Khan. Glamm: Pixel grounding large multimodal model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13009–13018, 2024

work page 2024
[45]

Sam 2: Segment anything in images and videos

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[46]

Pixellm: Pixel reasoning with large multimodal model

Zhongwei Ren, Zhicheng Huang, Yunchao Wei, Yao Zhao, Dongmei Fu, Jiashi Feng, and Xiaojie Jin. Pixellm: Pixel reasoning with large multimodal model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26374–26383, 2024

work page 2024
[47]

Urvos: Unified referring video object segmentation network with a large-scale benchmark

Seonguk Seo, Joon-Young Lee, and Bohyung Han. Urvos: Unified referring video object segmentation network with a large-scale benchmark. InEuropean conference on computer vision, pages 208–223, 2020

work page 2020
[48]

OpenAI GPT-5 System Card

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaugh- lin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[49]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.arXiv preprint arXiv:2403.05530, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[50]

Videoanydoor: High-fidelity video object insertion with precise motion control

Yuanpeng Tu, Hao Luo, Xi Chen, Sihui Ji, Xiang Bai, and Hengshuang Zhao. Videoanydoor: High-fidelity video object insertion with precise motion control. InProceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers, pages 1–11, 2025

work page 2025
[51]

X-sam: From segment anything to any segmentation

Hao Wang, Limeng Qiao, Zequn Jie, Zhijian Huang, Chengjian Feng, Qingfang Zheng, Lin Ma, Xiangyuan Lan, and Xiaodan Liang. X-sam: From segment anything to any segmentation. InProceedings of the AAAI Conference on Artificial Intelligence, pages 26187–26196, 2026. 15

work page 2026
[52]

Unlocking the po- tential of mllms in referring expression segmentation via a light-weight mask decoder.arXiv preprint arXiv:2508.04107, 2025

Jingchao Wang, Zhijian Wu, Dingjiang Huang, Yefeng Zheng, and Hong Wang. Unlocking the po- tential of mllms in referring expression segmentation via a light-weight mask decoder.arXiv preprint arXiv:2508.04107, 2025

work page arXiv 2025
[53]

Un- veiling parts beyond objects: Towards finer-granularity referring expression segmentation

Wenxuan Wang, Tongtian Yue, Yisi Zhang, Longteng Guo, Xingjian He, Xinlong Wang, and Jing Liu. Un- veiling parts beyond objects: Towards finer-granularity referring expression segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12998–13008, 2024

work page 2024
[54]

arXiv preprint arXiv:2510.06139 (2025) 2

Zanyi Wang, Dengyang Jiang, Liuzhuozheng Li, Sizhe Dang, Chengzu Li, Harry Yang, Guang Dai, Mengmeng Wang, and Jingdong Wang. Deforming videos to masks: Flow matching for referring video segmentation.arXiv preprint arXiv:2510.06139, 2025

work page arXiv 2025
[55]

Hyperseg: Hybrid segmentation assistant with fine-grained visual perceiver

Cong Wei, Yujie Zhong, Haoxian Tan, Yong Liu, Jie Hu, Dengjie Li, Zheng Zhao, and Yujiu Yang. Hyperseg: Hybrid segmentation assistant with fine-grained visual perceiver. InProceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pages 8931–8941, June 2025

work page 2025
[56]

Instructseg: Unifying instructed visual segmentation with multi-modal large language models

Cong Wei, Yujie Zhong, Haoxian Tan, Yingsen Zeng, Yong Liu, Hongfa Wang, and Yujiu Yang. Instructseg: Unifying instructed visual segmentation with multi-modal large language models. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 20193–20203, 2025

work page 2025
[57]

Onlinerefer: A simple online baseline for referring video object segmentation

Dongming Wu, Tiancai Wang, Yuang Zhang, Xiangyu Zhang, and Jianbing Shen. Onlinerefer: A simple online baseline for referring video object segmentation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2761–2770, 2023

work page 2023
[58]

Language as queries for referring video object segmentation

Jiannan Wu, Yi Jiang, Peize Sun, Zehuan Yuan, and Ping Luo. Language as queries for referring video object segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4974–4984, 2022

work page 2022
[59]

General object foundation model for images and videos at scale

Junfeng Wu, Yi Jiang, Qihao Liu, Zehuan Yuan, Xiang Bai, and Song Bai. General object foundation model for images and videos at scale. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3783–3795, 2024

work page 2024
[60]

Gsva: Generalized segmentation via multimodal large language models

Zhuofan Xia, Dongchen Han, Yizeng Han, Xuran Pan, Shiji Song, and Gao Huang. Gsva: Generalized segmentation via multimodal large language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3858–3869, 2024

work page 2024
[61]

Region-based cluster discrimination for visual representation learning

Yin Xie, Kaicheng Yang, Xiang An, Kun Wu, Yongle Zhao, Weimo Deng, Zimin Ran, Yumeng Wang, Ziyong Feng, Roy Miles, et al. Region-based cluster discrimination for visual representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1793–1803, 2025

work page 2025
[62]

Viddar: Vision language model-based task-detrimental content detection for augmented reality.IEEE transactions on visualization and computer graphics, 2025

Yanming Xiu, Tim Scargill, and Maria Gorlatova. Viddar: Vision language model-based task-detrimental content detection for augmented reality.IEEE transactions on visualization and computer graphics, 2025

work page 2025
[63]

Visa: Reasoning video object segmentation via large language models

Cilin Yan, Haochen Wang, Shilin Yan, Xiaolong Jiang, Yao Hu, Guoliang Kang, Weidi Xie, and Efstratios Gavves. Visa: Reasoning video object segmentation via large language models. InEuropean Conference on Computer Vision, pages 98–115, 2024

work page 2024
[64]

Lavt: Language- aware vision transformer for referring image segmentation

Zhao Yang, Jiaqi Wang, Yansong Tang, Kai Chen, Hengshuang Zhao, and Philip HS Torr. Lavt: Language- aware vision transformer for referring image segmentation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18155–18165, 2022

work page 2022
[65]

Mattnet: Modular attention network for referring expression comprehension

Licheng Yu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, and Tamara L Berg. Mattnet: Modular attention network for referring expression comprehension. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 1307–1315, 2018

work page 2018
[66]

Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

Haobo Yuan, Xiangtai Li, Tao Zhang, Yueyi Sun, Zilong Huang, Shilin Xu, Shunping Ji, Yunhai Tong, Lu Qi, Jiashi Feng, et al. Sa2va: Marrying sam2 with llava for dense grounded understanding of images and videos.arXiv preprint arXiv:2501.04001, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[67]

Omg-llava: Bridging image-level, object-level, pixel-level reasoning and understanding

Tao Zhang, Xiangtai Li, Hao Fei, Haobo Yuan, Shengqiong Wu, Shunping Ji, Chen Change Loy, and Shuicheng Yan. Omg-llava: Bridging image-level, object-level, pixel-level reasoning and understanding. Advances in neural information processing systems, 37:71737–71767, 2024

work page 2024
[68]

EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model

Yuxuan Zhang, Tianheng Cheng, Lianghui Zhu, Rui Hu, Lei Liu, Heng Liu, Longjin Ran, Xiaoxin Chen, Wenyu Liu, and Xinggang Wang. Evf-sam: Early vision-language fusion for text-prompted segment anything model.arXiv preprint arXiv:2406.20076, 2024

work page Pith review arXiv 2024
[69]

Psalm: Pixelwise segmentation with large multi-modal model

Zheng Zhang, Yeyao Ma, Enming Zhang, and Xiang Bai. Psalm: Pixelwise segmentation with large multi-modal model. InEuropean Conference on Computer Vision, pages 74–91. Springer, 2024. 16

work page 2024
[70]

Sec: Advancing complex video object segmentation via progressive concept construction

Zhixiong Zhang, Shuangrui Ding, Xiaoyi Dong, Songxin He, Jianfan Lin, Junsong Tang, Yuhang Zang, Yuhang Cao, Dahua Lin, and Jiaqi Wang. Sec: Advancing complex video object segmentation via progressive concept construction. InThe Fourteenth International Conference on Learning Representations, 2026

work page 2026
[71]

Villa: Video reasoning segmentation with large language model

Rongkun Zheng, Lu Qi, Xi Chen, Yi Wang, Kun Wang, and Hengshuang Zhao. Villa: Video reasoning segmentation with large language model. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 23667–23677, 2025

work page 2025
[72]

Regionclip: Region-based language-image pretraining

Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, et al. Regionclip: Region-based language-image pretraining. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16793–16803, 2022

work page 2022
[73]

arXiv preprint arXiv:2312.17448 (2023) 4, 12

Jiawen Zhu, Zhi-Qi Cheng, Jun-Yan He, Chenyang Li, Bin Luo, Huchuan Lu, Yifeng Geng, and Xuansong Xie. Tracking with human-intent reasoning.arXiv preprint arXiv:2312.17448, 2023

work page arXiv 2023
[74]

Training-free spatio-temporal decoupled reasoning video segmentation with adaptive object memory

Zhengtong Zhu, Jiaqing Fan, Zhixuan Liu, and Fanzhang Li. Training-free spatio-temporal decoupled reasoning video segmentation with adaptive object memory. InProceedings of the AAAI Conference on Artificial Intelligence, pages 14022–14030, 2026

work page 2026
[75]

Rt-2: Vision-language-action models transfer web knowledge to robotic control

Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. InConference on Robot Learning, pages 2165–2183, 2023

work page 2023
[76]

Generalized decoding for pixel, image, and language

Xueyan Zou, Zi-Yi Dou, Jianwei Yang, Zhe Gan, Linjie Li, Chunyuan Li, Xiyang Dai, Harkirat Behl, Jianfeng Wang, Lu Yuan, et al. Generalized decoding for pixel, image, and language. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15116–15127, 2023

work page 2023
[77]

Segment everything everywhere all at once.Advances in neural information processing systems, 36:19769–19782, 2023

Xueyan Zou, Jianwei Yang, Hao Zhang, Feng Li, Linjie Li, Jianfeng Wang, Lijuan Wang, Jianfeng Gao, and Yong Jae Lee. Segment everything everywhere all at once.Advances in neural information processing systems, 36:19769–19782, 2023. 17

work page 2023

[1] [1]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

One token to seg them all: Language instructed reasoning segmentation in videos.Advances in Neural Information Processing Systems, pages 6833–6859, 2024

Zechen Bai, Tong He, Haiyang Mei, Pichao Wang, Ziteng Gao, Joya Chen, Lei Liu, Zheng Zhang, and Mike Z Shou. One token to seg them all: Language instructed reasoning segmentation in videos.Advances in Neural Information Processing Systems, pages 6833–6859, 2024

work page 2024

[3] [3]

Sam 3: Segment anything with concepts

Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. Sam 3: Segment anything with concepts. InThe Fourteenth International Conference on Learning Representations, 2026

work page 2026

[4] [4]

End-to-end object detection with transformers

Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. InEuropean conference on computer vision, pages 213–229, 2020

work page 2020

[5] [5]

Sam4mllm: Enhance multi-modal large language model for referring expression segmentation

Yi-Chia Chen, Wei-Hua Li, Cheng Sun, Yu-Chiang Frank Wang, and Chu-Song Chen. Sam4mllm: Enhance multi-modal large language model for referring expression segmentation. InEuropean Conference on Computer Vision, pages 323–340. Springer, 2024

work page 2024

[6] [6]

Samwise: Infusing wisdom in sam2 for text-driven video segmentation

Claudia Cuttano, Gabriele Trivigno, Gabriele Rosi, Carlo Masone, and Giuseppe Averta. Samwise: Infusing wisdom in sam2 for text-driven video segmentation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 3395–3405, 2025

work page 2025

[7] [7]

Mevis: A large-scale benchmark for video segmentation with motion expressions

Henghui Ding, Chang Liu, Shuting He, Xudong Jiang, and Chen Change Loy. Mevis: A large-scale benchmark for video segmentation with motion expressions. InProceedings of the IEEE/CVF international conference on computer vision, pages 2694–2703, 2023

work page 2023

[8] [8]

Mevis: A multi-modal dataset for referring motion expression video segmentation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

Henghui Ding, Chang Liu, Shuting He, Kaining Ying, Xudong Jiang, Chen Change Loy, and Yu-Gang Jiang. Mevis: A multi-modal dataset for referring motion expression video segmentation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

work page 2025

[9] [9]

Vlt: Vision-language transformer and query generation for referring segmentation.IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 7900–7916, 2022

Henghui Ding, Chang Liu, Suchen Wang, and Xudong Jiang. Vlt: Vision-language transformer and query generation for referring segmentation.IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 7900–7916, 2022

work page 2022

[10] [10]

Sam2long: Enhancing sam 2 for long video segmentation with a training-free memory tree

Shuangrui Ding, Rui Qian, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Yuwei Guo, Dahua Lin, and Jiaqi Wang. Sam2long: Enhancing sam 2 for long video segmentation with a training-free memory tree. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 13614–13624, 2025

work page 2025

[11] [11]

Language-bridged spatial-temporal interaction for referring video object segmentation

Zihan Ding, Tianrui Hui, Junshi Huang, Xiaoming Wei, Jizhong Han, and Si Liu. Language-bridged spatial-temporal interaction for referring video object segmentation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4964–4973, 2022

work page 2022

[12] [12]

Palm-e: an embodied multimodal language model

Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: an embodied multimodal language model. InProceedings of the 40th International Conference on Machine Learning, pages 8469–8488, 2023

work page 2023

[13] [13]

The devil is in temporal token: High quality video reasoning segmentation

Sitong Gong, Yunzhi Zhuge, Lu Zhang, Zongxin Yang, Pingping Zhang, and Huchuan Lu. The devil is in temporal token: High quality video reasoning segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 29183–29192, 2025

work page 2025

[14] [14]

Anomalygpt: Detecting industrial anomalies using large vision-language models

Zhaopeng Gu, Bingke Zhu, Guibo Zhu, Yingying Chen, Ming Tang, and Jinqiao Wang. Anomalygpt: Detecting industrial anomalies using large vision-language models. InProceedings of the AAAI conference on artificial intelligence, pages 1932–1940, 2024

work page 1932

[15] [15]

Html: Hybrid temporal-scale multimodal learning framework for referring video object segmentation

Mingfei Han, Yali Wang, Zhihui Li, Lina Yao, Xiaojun Chang, and Yu Qiao. Html: Hybrid temporal-scale multimodal learning framework for referring video object segmentation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 13414–13423, 2023

work page 2023

[16] [16]

Decoupling static and hierarchical motion perception for referring video segmentation

Shuting He and Henghui Ding. Decoupling static and hierarchical motion perception for referring video segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13332–13341, 2024

work page 2024

[17] [17]

Segmentation from natural language expressions

Ronghang Hu, Marcus Rohrbach, and Trevor Darrell. Segmentation from natural language expressions. In European conference on computer vision, pages 108–124, 2016. 13

work page 2016

[18] [18]

Beyond one-to-one: Rethinking the referring image segmentation

Yutao Hu, Qixiong Wang, Wenqi Shao, Enze Xie, Zhenguo Li, Jungong Han, and Ping Luo. Beyond one-to-one: Rethinking the referring image segmentation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4067–4077, 2023

work page 2023

[19] [19]

Densely connected parameter-efficient tuning for referring image segmentation

Jiaqi Huang, Zunnan Xu, Ting Liu, Yong Liu, Haonan Han, Kehong Yuan, and Xiu Li. Densely connected parameter-efficient tuning for referring image segmentation. InProceedings of the AAAI Conference on Artificial Intelligence, pages 3653–3661, 2025

work page 2025

[20] [20]

Mmr: A large-scale benchmark dataset for multi-target and multi-granularity reasoning segmentation

Donggon Jang, Yucheol Cho, Suin Lee, Taehyeon Kim, and Daeshik Kim. Mmr: A large-scale benchmark dataset for multi-target and multi-granularity reasoning segmentation. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025

[21] [21]

Mdetr-modulated detection for end-to-end multi-modal understanding

Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra, and Nicolas Carion. Mdetr-modulated detection for end-to-end multi-modal understanding. InProceedings of the IEEE/CVF international conference on computer vision, pages 1780–1790, 2021

work page 2021

[22] [22]

Referitgame: Referring to objects in photographs of natural scenes

Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. InProceedings of the 2014 conference on empirical methods in natural language processing, pages 787–798, 2014

work page 2014

[23] [23]

Video object segmentation with referring expressions

Anna Khoreva, Anna Rohrbach, and Brent Schiele. Video object segmentation with referring expressions. InProceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018

work page 2018

[24] [24]

Segment anything

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. InProceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023

work page 2023

[25] [25]

Lisa: Reasoning segmentation via large language model

Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9579–9589, 2024

work page 2024

[26] [26]

Text4seg: Reimagining image segmentation as text generation

Mengcheng Lan, Chaofeng Chen, Yue Zhou, Jiaxing Xu, Yiping Ke, Xinjiang Wang, Litong Feng, and Wayne Zhang. Text4seg: Reimagining image segmentation as text generation. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025

[27] [27]

Grounded language-image pre-training

Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10965–10975, 2022

work page 2022

[28] [28]

Gligen: Open-set grounded text-to-image generation

Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22511–22521, 2023

work page 2023

[29] [29]

Open-vocabulary semantic segmentation with mask-adapted clip

Feng Liang, Bichen Wu, Xiaoliang Dai, Kunpeng Li, Yinan Zhao, Hang Zhang, Peizhao Zhang, Peter Vajda, and Diana Marculescu. Open-vocabulary semantic segmentation with mask-adapted clip. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7061–7070, 2023

work page 2023

[30] [30]

Glus: Global-local reasoning unified into a single large language model for video segmentation

Lang Lin, Xueyang Yu, Ziqi Pang, and Yu-Xiong Wang. Glus: Global-local reasoning unified into a single large language model for video segmentation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 8658–8667, 2025

work page 2025

[31] [31]

Gres: Generalized referring expression segmentation

Chang Liu, Henghui Ding, and Xudong Jiang. Gres: Generalized referring expression segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 23592–23601, 2023

work page 2023

[32] [32]

Recurrent multimodal interaction for referring image segmentation

Chenxi Liu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, and Alan Yuille. Recurrent multimodal interaction for referring image segmentation. InProceedings of the IEEE international conference on computer vision, pages 1271–1280, 2017

work page 2017

[33] [33]

Grounding dino: Marrying dino with grounded pre-training for open-set object detection

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. InEuropean conference on computer vision, pages 38–55, 2024

work page 2024

[34] [34]

Universal segmen- tation at arbitrary granularity with language instruction

Yong Liu, Cairong Zhang, Yitong Wang, Jiahao Wang, Yujiu Yang, and Yansong Tang. Universal segmen- tation at arbitrary granularity with language instruction. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3459–3469, 2024. 14

work page 2024

[35] [35]

Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement

Yuqi Liu, Bohao Peng, Zhisheng Zhong, Zihao Yue, Fanbin Lu, Bei Yu, and Jiaya Jia. Seg-zero: Reasoning- chain guided segmentation via cognitive reinforcement.arXiv preprint arXiv:2503.06520, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[36] [36]

Visionreasoner: Unified reasoning-integrated visual perception via reinforcement learning

Yuqi Liu, Tianyuan Qu, Zhisheng Zhong, Bohao Peng, Shu Liu, Bei Yu, and Jiaya Jia. Visionreasoner: Unified reasoning-integrated visual perception via reinforcement learning. InThe Fourteenth International Conference on Learning Representations, 2026

work page 2026

[37] [37]

Rsvp: Reasoning segmentation via visual prompting and multi-modal chain-of-thought

Yi Lu, Jiawang Cao, Yongliang Wu, Bozheng Li, Licheng Tang, Yangguang Ji, Chong Wu, Jay Wu, and Wenbo Zhu. Rsvp: Reasoning segmentation via visual prompting and multi-modal chain-of-thought. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, pages 14699–14716, 2025

work page 2025

[38] [38]

Image segmentation using text and image prompts

Timo Lüddecke and Alexander Ecker. Image segmentation using text and image prompts. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7086–7096, 2022

work page 2022

[39] [39]

Soc: Semantic-assisted object cluster for referring video object segmentation.Advances in Neural Information Processing Systems, pages 26425–26437, 2023

Zhuoyan Luo, Yicheng Xiao, Yong Liu, Shuyan Li, Yitong Wang, Yansong Tang, Xiu Li, and Yujiu Yang. Soc: Semantic-assisted object cluster for referring video object segmentation.Advances in Neural Information Processing Systems, pages 26425–26437, 2023

work page 2023

[40] [40]

Generation and comprehension of unambiguous object descriptions

Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 11–20, 2016

work page 2016

[41] [41]

Spectrum-guided multi-granularity referring video object segmentation

Bo Miao, Mohammed Bennamoun, Yongsheng Gao, and Ajmal Mian. Spectrum-guided multi-granularity referring video object segmentation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 920–930, 2023

work page 2023

[42] [42]

Simple open- vocabulary object detection

Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Doso- vitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, et al. Simple open- vocabulary object detection. InEuropean conference on computer vision, pages 728–755, 2022

work page 2022

[43] [43]

Videoglamm: A large multimodal model for pixel-level visual grounding in videos

Shehan Munasinghe, Hanan Gani, Wenqi Zhu, Jiale Cao, Eric Xing, Fahad Shahbaz Khan, and Salman Khan. Videoglamm: A large multimodal model for pixel-level visual grounding in videos. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 19036–19046, 2025

work page 2025

[44] [44]

Glamm: Pixel grounding large multimodal model

Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S Khan. Glamm: Pixel grounding large multimodal model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13009–13018, 2024

work page 2024

[45] [45]

Sam 2: Segment anything in images and videos

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025

[46] [46]

Pixellm: Pixel reasoning with large multimodal model

Zhongwei Ren, Zhicheng Huang, Yunchao Wei, Yao Zhao, Dongmei Fu, Jiashi Feng, and Xiaojie Jin. Pixellm: Pixel reasoning with large multimodal model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26374–26383, 2024

work page 2024

[47] [47]

Urvos: Unified referring video object segmentation network with a large-scale benchmark

Seonguk Seo, Joon-Young Lee, and Bohyung Han. Urvos: Unified referring video object segmentation network with a large-scale benchmark. InEuropean conference on computer vision, pages 208–223, 2020

work page 2020

[48] [48]

OpenAI GPT-5 System Card

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaugh- lin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[49] [49]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.arXiv preprint arXiv:2403.05530, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[50] [50]

Videoanydoor: High-fidelity video object insertion with precise motion control

Yuanpeng Tu, Hao Luo, Xi Chen, Sihui Ji, Xiang Bai, and Hengshuang Zhao. Videoanydoor: High-fidelity video object insertion with precise motion control. InProceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers, pages 1–11, 2025

work page 2025

[51] [51]

X-sam: From segment anything to any segmentation

Hao Wang, Limeng Qiao, Zequn Jie, Zhijian Huang, Chengjian Feng, Qingfang Zheng, Lin Ma, Xiangyuan Lan, and Xiaodan Liang. X-sam: From segment anything to any segmentation. InProceedings of the AAAI Conference on Artificial Intelligence, pages 26187–26196, 2026. 15

work page 2026

[52] [52]

Unlocking the po- tential of mllms in referring expression segmentation via a light-weight mask decoder.arXiv preprint arXiv:2508.04107, 2025

Jingchao Wang, Zhijian Wu, Dingjiang Huang, Yefeng Zheng, and Hong Wang. Unlocking the po- tential of mllms in referring expression segmentation via a light-weight mask decoder.arXiv preprint arXiv:2508.04107, 2025

work page arXiv 2025

[53] [53]

Un- veiling parts beyond objects: Towards finer-granularity referring expression segmentation

Wenxuan Wang, Tongtian Yue, Yisi Zhang, Longteng Guo, Xingjian He, Xinlong Wang, and Jing Liu. Un- veiling parts beyond objects: Towards finer-granularity referring expression segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12998–13008, 2024

work page 2024

[54] [54]

arXiv preprint arXiv:2510.06139 (2025) 2

Zanyi Wang, Dengyang Jiang, Liuzhuozheng Li, Sizhe Dang, Chengzu Li, Harry Yang, Guang Dai, Mengmeng Wang, and Jingdong Wang. Deforming videos to masks: Flow matching for referring video segmentation.arXiv preprint arXiv:2510.06139, 2025

work page arXiv 2025

[55] [55]

Hyperseg: Hybrid segmentation assistant with fine-grained visual perceiver

Cong Wei, Yujie Zhong, Haoxian Tan, Yong Liu, Jie Hu, Dengjie Li, Zheng Zhao, and Yujiu Yang. Hyperseg: Hybrid segmentation assistant with fine-grained visual perceiver. InProceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pages 8931–8941, June 2025

work page 2025

[56] [56]

Instructseg: Unifying instructed visual segmentation with multi-modal large language models

Cong Wei, Yujie Zhong, Haoxian Tan, Yingsen Zeng, Yong Liu, Hongfa Wang, and Yujiu Yang. Instructseg: Unifying instructed visual segmentation with multi-modal large language models. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 20193–20203, 2025

work page 2025

[57] [57]

Onlinerefer: A simple online baseline for referring video object segmentation

Dongming Wu, Tiancai Wang, Yuang Zhang, Xiangyu Zhang, and Jianbing Shen. Onlinerefer: A simple online baseline for referring video object segmentation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2761–2770, 2023

work page 2023

[58] [58]

Language as queries for referring video object segmentation

Jiannan Wu, Yi Jiang, Peize Sun, Zehuan Yuan, and Ping Luo. Language as queries for referring video object segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4974–4984, 2022

work page 2022

[59] [59]

General object foundation model for images and videos at scale

Junfeng Wu, Yi Jiang, Qihao Liu, Zehuan Yuan, Xiang Bai, and Song Bai. General object foundation model for images and videos at scale. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3783–3795, 2024

work page 2024

[60] [60]

Gsva: Generalized segmentation via multimodal large language models

Zhuofan Xia, Dongchen Han, Yizeng Han, Xuran Pan, Shiji Song, and Gao Huang. Gsva: Generalized segmentation via multimodal large language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3858–3869, 2024

work page 2024

[61] [61]

Region-based cluster discrimination for visual representation learning

Yin Xie, Kaicheng Yang, Xiang An, Kun Wu, Yongle Zhao, Weimo Deng, Zimin Ran, Yumeng Wang, Ziyong Feng, Roy Miles, et al. Region-based cluster discrimination for visual representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1793–1803, 2025

work page 2025

[62] [62]

Viddar: Vision language model-based task-detrimental content detection for augmented reality.IEEE transactions on visualization and computer graphics, 2025

Yanming Xiu, Tim Scargill, and Maria Gorlatova. Viddar: Vision language model-based task-detrimental content detection for augmented reality.IEEE transactions on visualization and computer graphics, 2025

work page 2025

[63] [63]

Visa: Reasoning video object segmentation via large language models

Cilin Yan, Haochen Wang, Shilin Yan, Xiaolong Jiang, Yao Hu, Guoliang Kang, Weidi Xie, and Efstratios Gavves. Visa: Reasoning video object segmentation via large language models. InEuropean Conference on Computer Vision, pages 98–115, 2024

work page 2024

[64] [64]

Lavt: Language- aware vision transformer for referring image segmentation

Zhao Yang, Jiaqi Wang, Yansong Tang, Kai Chen, Hengshuang Zhao, and Philip HS Torr. Lavt: Language- aware vision transformer for referring image segmentation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18155–18165, 2022

work page 2022

[65] [65]

Mattnet: Modular attention network for referring expression comprehension

Licheng Yu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, and Tamara L Berg. Mattnet: Modular attention network for referring expression comprehension. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 1307–1315, 2018

work page 2018

[66] [66]

Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

Haobo Yuan, Xiangtai Li, Tao Zhang, Yueyi Sun, Zilong Huang, Shilin Xu, Shunping Ji, Yunhai Tong, Lu Qi, Jiashi Feng, et al. Sa2va: Marrying sam2 with llava for dense grounded understanding of images and videos.arXiv preprint arXiv:2501.04001, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[67] [67]

Omg-llava: Bridging image-level, object-level, pixel-level reasoning and understanding

Tao Zhang, Xiangtai Li, Hao Fei, Haobo Yuan, Shengqiong Wu, Shunping Ji, Chen Change Loy, and Shuicheng Yan. Omg-llava: Bridging image-level, object-level, pixel-level reasoning and understanding. Advances in neural information processing systems, 37:71737–71767, 2024

work page 2024

[68] [68]

EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model

Yuxuan Zhang, Tianheng Cheng, Lianghui Zhu, Rui Hu, Lei Liu, Heng Liu, Longjin Ran, Xiaoxin Chen, Wenyu Liu, and Xinggang Wang. Evf-sam: Early vision-language fusion for text-prompted segment anything model.arXiv preprint arXiv:2406.20076, 2024

work page Pith review arXiv 2024

[69] [69]

Psalm: Pixelwise segmentation with large multi-modal model

Zheng Zhang, Yeyao Ma, Enming Zhang, and Xiang Bai. Psalm: Pixelwise segmentation with large multi-modal model. InEuropean Conference on Computer Vision, pages 74–91. Springer, 2024. 16

work page 2024

[70] [70]

Sec: Advancing complex video object segmentation via progressive concept construction

Zhixiong Zhang, Shuangrui Ding, Xiaoyi Dong, Songxin He, Jianfan Lin, Junsong Tang, Yuhang Zang, Yuhang Cao, Dahua Lin, and Jiaqi Wang. Sec: Advancing complex video object segmentation via progressive concept construction. InThe Fourteenth International Conference on Learning Representations, 2026

work page 2026

[71] [71]

Villa: Video reasoning segmentation with large language model

Rongkun Zheng, Lu Qi, Xi Chen, Yi Wang, Kun Wang, and Hengshuang Zhao. Villa: Video reasoning segmentation with large language model. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 23667–23677, 2025

work page 2025

[72] [72]

Regionclip: Region-based language-image pretraining

Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, et al. Regionclip: Region-based language-image pretraining. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16793–16803, 2022

work page 2022

[73] [73]

arXiv preprint arXiv:2312.17448 (2023) 4, 12

Jiawen Zhu, Zhi-Qi Cheng, Jun-Yan He, Chenyang Li, Bin Luo, Huchuan Lu, Yifeng Geng, and Xuansong Xie. Tracking with human-intent reasoning.arXiv preprint arXiv:2312.17448, 2023

work page arXiv 2023

[74] [74]

Training-free spatio-temporal decoupled reasoning video segmentation with adaptive object memory

Zhengtong Zhu, Jiaqing Fan, Zhixuan Liu, and Fanzhang Li. Training-free spatio-temporal decoupled reasoning video segmentation with adaptive object memory. InProceedings of the AAAI Conference on Artificial Intelligence, pages 14022–14030, 2026

work page 2026

[75] [75]

Rt-2: Vision-language-action models transfer web knowledge to robotic control

Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. InConference on Robot Learning, pages 2165–2183, 2023

work page 2023

[76] [76]

Generalized decoding for pixel, image, and language

Xueyan Zou, Zi-Yi Dou, Jianwei Yang, Zhe Gan, Linjie Li, Chunyuan Li, Xiyang Dai, Harkirat Behl, Jianfeng Wang, Lu Yuan, et al. Generalized decoding for pixel, image, and language. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15116–15127, 2023

work page 2023

[77] [77]

Segment everything everywhere all at once.Advances in neural information processing systems, 36:19769–19782, 2023

Xueyan Zou, Jianwei Yang, Hao Zhang, Feng Li, Linjie Li, Jianfeng Wang, Lijuan Wang, Jianfeng Gao, and Yong Jae Lee. Segment everything everywhere all at once.Advances in neural information processing systems, 36:19769–19782, 2023. 17

work page 2023